CN104199813B

CN104199813B - Pseudo-feedback-based personalized machine translation system and method

Info

Publication number: CN104199813B
Application number: CN201410491100.9A
Authority: CN
Inventors: 杨沐昀; 朱俊国; 赵铁军; 李生; 徐冰; 曹海龙; 朱聪慧; 郑德权
Original assignee: Harbin Institute of Technology
Current assignee: Harbin University Of Technology High Tech Development Corp
Priority date: 2014-09-24
Filing date: 2014-09-24
Publication date: 2017-05-24
Anticipated expiration: 2034-09-24
Also published as: CN104199813A

Abstract

The invention relates to a pseudo-feedback-based personalized machine translation system and method. Traditional machine translation methods cannot yield a high-quality personalized translation system and cannot meet users' varied translation demands. The system comprises a phrase table filtering module, an input module, a preliminary translation module, a pseudo-feedback retrieval module, a phrase table classification module and a decoder module. The method comprises: an input step, in which a user inputs a translation task S; a preliminary translation step, in which the preliminary translation module produces a preliminary machine translation result T' of the translation task; a pseudo-feedback retrieval step, in which the pseudo-feedback retrieval module retrieves the preliminary translation results and standard translations R of similar translation examples; a phrase table classification step, in which a trained general post-editing model is turned into a personalized post-editing model and then filtered to obtain an optimized personalized post-editing model; and a decoding step, in which the decoder module uses the optimized personalized post-editing model to decode the preliminary machine translation result T' and obtain the final translation result. The system and method are applicable to the field of machine translation.

Description

Personalized machine translation system and method based on pseudo feedback

Technical Field

The invention relates to a personalized machine translation system and a personalized machine translation method, and belongs to the field of machine translation.

Background

With the rapid development of machine translation technology in recent years, translation quality has improved greatly, and some general online translation services can help people overcome the language barrier to read and understand common cross-language texts. However, further improving the quality of machine translation has met significant difficulties. On the one hand, the main disadvantage of existing statistical machine translation technology is that personalized translation requires a large amount of user feedback, on which a statistical model must be trained to obtain a personalized machine translation model. The user feedback required for this training is difficult to obtain, and existing methods cannot use it effectively, so a high-quality personalized translation system cannot be obtained. Although user feedback can be exploited through conventional post-editing, the advantages of a statistical post-editing model are hard to realize because so little user data is available. On the other hand, conventional machine translation methods are typically optimized for the open domain rather than for a specific translation task. Although domain adaptation has been studied, it still targets specialized groups and cannot meet the varied translation requirements of the broad and diverse population of machine translation users, especially online users. Therefore, further improving the quality of machine translation is a technical problem that urgently needs to be solved.

Disclosure of Invention

The invention aims to solve the problem that traditional machine translation methods cannot yield a high-quality personalized translation system and cannot meet users' varied translation requirements, and provides a personalized machine translation system and translation method based on pseudo feedback that improve machine translation quality.

A personalized machine translation system based on pseudo-feedback, the translation system comprising:

a phrase table filtering module for filtering each generic post-editing model phrase table of the development set data;

the input module is used for obtaining a translation task S input by a user;

the preliminary translation module is used for translating the translation task S input by the user to obtain a preliminary machine translation result T' of the translation task, and translating the source language sentences of the translation example base provided by the local system to obtain a preliminary translation sentence T of the translation example;

the pseudo feedback retrieval module is used for retrieving, in the word-aligned translation example library of the local system, the preliminary translation results and standard translations R of similar translation examples;

the phrase table classification module is used for classifying the phrase table of the trained post-editing model to obtain an individualized post-editing model;

and the decoder module is used for decoding the preliminary translation result of the similar translation example retrieved by the pseudo feedback retrieval module to obtain a final translation result.

Before a user inputs a translation task S, a general post-editing model is trained with a statistical method on the preliminary translation sentences T of the translation examples and the standard translation sentences R in the translation memory, which completes the training of the general post-editing model; the personalized machine translation method is realized by the following steps:

step one, phrase table filtering module process: filtering each universal post-editing model phrase table of the development set data by using a phrase table filtering module;

based on the filtered result, each sentence D_i in the development set data is decoded with default weights to generate an n-best translation result; the n-best translation results are then merged; finally, the MERT tool is used to tune the parameters as a whole on the merged n-best translation results, which realizes the feature-parameter optimization process;
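The merging of per-sentence n-best lists described in step one can be pictured with a minimal sketch. It assumes each decoding run yields a sentence id plus a list of (translation, feature vector) pairs; the function name `merge_nbest` and this data layout are hypothetical and are not taken from the patent or from the MERT tool. The tuning itself is not shown.

```python
from collections import defaultdict


def merge_nbest(runs):
    """Merge n-best lists produced by separate decoding runs.

    `runs` is an iterable of (sentence_id, hypotheses) pairs, where
    `hypotheses` is a list of (translation, feature_vector) tuples.
    Returns {sentence_id: [(translation, feature_tuple), ...]} with exact
    duplicates removed and first-seen order preserved.
    """
    merged = defaultdict(list)
    seen = defaultdict(set)
    for sent_id, hyps in runs:
        for text, feats in hyps:
            key = (text, tuple(feats))
            if key not in seen[sent_id]:
                seen[sent_id].add(key)
                merged[sent_id].append(key)
    return dict(merged)
```

The merged pool would then be handed to a tuning tool so that one set of weights is optimized over all development sentences at once, rather than per sentence.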

step two, an input process: inputting a translation task S into an input module by a user;

step three, a primary translation process: the preliminary translation process comprises two parts, namely before the user inputs the translation task S and after the user inputs the translation task S;

before a user inputs a translation task S, a translation platform set up by a machine translation system of a local system is utilized to initially translate a source language sentence of a translation example library provided by the local system to obtain an initial translation sentence T of a translation example;

meanwhile, after a translation task S input by a user is obtained through an input module, a primary machine translation result T' of the translation task is obtained through translation of a primary translation module;

step four, pseudo feedback retrieval process: based on the preliminary translation sentences T of the translation examples obtained in step three, the pseudo feedback retrieval module performs cosine-similarity retrieval with a source-language bag-of-words model over the local word-aligned translation example library to obtain the preliminary translation results and standard translations R of similar translation examples, and the top 900 most similar entries are selected from the retrieved preliminary translation results and standard translations R;

the cosine similarity CS is calculated in a vector space model built on the source-language bag-of-words model, as follows:

$$CS(S_{input}, S_{example}) = \frac{Vec(S_{input}) \cdot Vec(S_{example})}{\|Vec(S_{input})\| \, \|Vec(S_{example})\|}$$

where Vec(S_example) is the source-language sentence vector of a translation example, Vec(S_input) is the translation task vector, Vec(S_input)·Vec(S_example) is the inner product of the two vectors, and ||·|| is the norm of a vector;
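The bag-of-words cosine similarity described above can be sketched in a few lines. This is an illustration under stated assumptions: sentences are taken to be already segmented into whitespace-separated words (Chinese input would need a word segmenter first), raw term counts are used as vector components, and the function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Term-count vector of a whitespace-tokenized sentence (assumed tokenization)."""
    return Counter(sentence.split())


def cosine_similarity(s_input, s_example):
    """CS(S_input, S_example): inner product divided by the product of norms."""
    v_in = bag_of_words(s_input)
    v_ex = bag_of_words(s_example)
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * v_ex[word] for word, count in v_in.items())
    norm_in = math.sqrt(sum(c * c for c in v_in.values()))
    norm_ex = math.sqrt(sum(c * c for c in v_ex.values()))
    norm = norm_in * norm_ex
    return dot / norm if norm else 0.0
```

Returning 0.0 for an empty sentence is a design choice of this sketch that avoids division by zero; the patent text does not discuss that case.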

step five, the phrase table classification process: according to the preliminary translation results and standard translations R of the top 900-1100 similar translation examples selected in step four, the phrase table classification module classifies the phrases in the phrase table of the trained general post-editing model into positive phrases, which help improve translation quality, and negative phrases, which would introduce noise into the final translation result, so that the trained general post-editing model becomes a personalized post-editing model; the positive and negative phrases in the personalized post-editing model are compared with the preliminary translation results and standard translations R of the similar translation examples retrieved in step four, and the negative phrases are filtered out of the personalized post-editing model phrase table to obtain an optimized personalized post-editing model;
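One way to picture the positive/negative split of step five is the sketch below. It assumes a phrase pair counts as positive when its first side occurs in some retrieved preliminary translation and its second side occurs in some retrieved standard translation; this criterion, the whitespace tokenization, and all names are assumptions made for illustration, not the patent's exact procedure.

```python
def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous subsequence of tokens."""
    n = len(phrase_tokens)
    return n > 0 and any(
        tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1)
    )


def classify_phrase_table(phrase_table, retrieved):
    """Split phrase pairs into (positive, negative) lists.

    phrase_table: list of (p_t, p_r) phrase strings from the general model.
    retrieved:    list of (T, R) sentence strings from pseudo-feedback retrieval.
    """
    prelim = [t.split() for t, _ in retrieved]
    refs = [r.split() for _, r in retrieved]
    positive, negative = [], []
    for p_t, p_r in phrase_table:
        matched = any(contains_phrase(s, p_t.split()) for s in prelim) and any(
            contains_phrase(s, p_r.split()) for s in refs
        )
        (positive if matched else negative).append((p_t, p_r))
    return positive, negative
```

Keeping only the `positive` list then corresponds to filtering the negative phrases out of the personalized phrase table.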

step six, the decoding process of the decoder module: the optimized personalized post-editing model from step five is taken as the translation model, and the decoder module decodes the preliminary machine translation result T' of the translation task obtained in step three with a conventional machine translation decoding method to obtain an optimized final translation result.
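The run-time flow of steps two to six can be summarized as a small orchestration sketch. Every callable passed in (`prelim_translate`, `retrieve`, `classify`, `decode`) is a hypothetical placeholder for the corresponding module; the sketch only shows how the data moves between modules and does not implement any of them.

```python
def personalized_translate(task, prelim_translate, retrieve, classify, decode,
                           general_table):
    """Chain the modules for one translation task S.

    task:             source sentence S entered by the user (step two).
    prelim_translate: S -> preliminary translation T' (step three).
    retrieve:         S -> list of (T, R) pairs of similar examples (step four).
    classify:         (phrase table, examples) -> (positive, negative) (step five).
    decode:           (T', filtered phrase table) -> final translation (step six).
    general_table:    phrase table of the pre-trained general post-editing model.
    """
    t_prime = prelim_translate(task)
    similar_examples = retrieve(task)
    positive, _negative = classify(general_table, similar_examples)
    # Only the positive phrases form the optimized personalized model.
    return decode(t_prime, positive)
```

Passing the modules as arguments keeps the sketch independent of any particular translation engine or decoder.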

The invention has the following beneficial effects: the pseudo feedback retrieval module retrieves similar translation examples from the translation example library, the phrase table classification module classifies the general post-editing phrases, the negative post-editing phrases are filtered out, and the post-editing rules are thereby selected to obtain an optimized personalized post-editing model, which improves machine translation quality. In addition, a feature-parameter optimization process is applied when the post-editing model is built on the preliminary translation process: for the given development set data, each input is decoded separately and the parameters are then tuned as a whole, which optimizes the parameters effectively and improves system performance. In particular, when the pseudo feedback retrieval module searches the local translation example library, it obtains parallel sentence pairs similar to the preliminary translation result of the sentence input by the user and uses them in place of feedback information, which addresses the difficulty of obtaining user feedback.

In addition, the method makes good use of the feedback information and builds an effective post-editing model on top of the preliminary translation model. Compared with the translation result of Google, the translation result obtained by the personalized machine translation system and method based on pseudo feedback improves translation quality by 19.5 percent; compared with the translation result of a machine translation system trained with the Moses tool, it improves translation quality by 14.1 percent.

Drawings

FIG. 1 is a schematic diagram of the translation process of the present invention.

Detailed Description

The first embodiment is as follows:

the personalized machine translation system based on the pseudo feedback of the embodiment comprises:

the input module is used for obtaining a translation task S input by a user;

The second embodiment is as follows:

different from the first embodiment, in the personalized machine translation system based on pseudo feedback of this embodiment, the phrase table filtering module is included in the phrase table classification module.

The third embodiment is as follows:

in the translation method of the personalized machine translation system based on pseudo feedback, before a user inputs a translation task S, a general post-editing model is trained with a statistical method on the preliminary translation sentences T of the translation examples and the standard translation sentences R in the translation memory, which completes the training of the general post-editing model; the personalized machine translation method is realized by the following steps:

The fourth embodiment is as follows:

different from the third embodiment, in the translation method of the personalized machine translation system based on pseudo feedback of this embodiment, the decoding process of step six applies a formula to the preliminary machine translation result T' of the translation task to obtain an optimized final translation result. In the formula, P(T'', T') is the translation probability of the general post-editing model, and P(S | T'', T') is the probability that a phrase pair (T'', T') in the general post-editing model is used to post-edit the preliminary machine translation sentence T' of the given input translation task S; its value is defined as 1 or 0 and is obtained by either of the following two methods:

1) when the two phrases of a phrase pair (P_T, P_R) in the optimized personalized post-editing model respectively match at least one phrase in the preliminary machine translation result T' of the translation task and at least one phrase in the standard translation R, P(S | T'', T') takes the value 1, otherwise 0; or,

2) when the phrase P_R of a phrase pair (P_T, P_R) in the optimized personalized post-editing model matches at least one phrase in the standard translation R, P(S | T'', T') takes the value 1, otherwise 0.
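The two ways of assigning the 0/1 value can be sketched as a single function with a switch. The sketch assumes whitespace-tokenized, single-spaced sentences and treats "match" as a contiguous token-level occurrence; both points, along with the names `occurs`, `indicator` and `strict`, are illustrative assumptions.

```python
def occurs(phrase, sentence):
    """Token-boundary-aware containment check on single-spaced strings."""
    return f" {phrase} " in f" {sentence} "


def indicator(p_t, p_r, t_prime, references, strict=True):
    """Return 1 or 0 for a phrase pair (P_T, P_R).

    strict=True  follows method 1: P_T must occur in T' and P_R in some R.
    strict=False follows method 2: only P_R must occur in some R.
    """
    in_refs = any(occurs(p_r, r) for r in references)
    if strict:
        return int(occurs(p_t, t_prime) and in_refs)
    return int(in_refs)
```

Method 2 is the more permissive of the two: it keeps any pair whose output side is attested in the retrieved standard translations, even when its input side does not occur in T'.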

The fifth embodiment is as follows:

different from the third or fourth embodiment, in the translation method of the personalized machine translation system based on pseudo feedback of this embodiment, the pseudo feedback retrieval process of step four selects the top 1000 most similar results from the retrieved preliminary translation results and standard translations R of the similar translation examples.
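Selecting the top 1000 most similar results amounts to a top-k selection over scored examples. A minimal sketch follows, assuming the similarity scores have already been computed and are supplied as (similarity, example) pairs; the function name and this input layout are hypothetical.

```python
import heapq


def top_k_examples(scored_examples, k=1000):
    """Return the k examples with the highest similarity, best first.

    scored_examples: iterable of (similarity, example) pairs.
    """
    best = heapq.nlargest(k, scored_examples, key=lambda pair: pair[0])
    return [example for _, example in best]
```

Using `heapq.nlargest` avoids sorting the whole example library when k is much smaller than the library, which holds for 1000 results out of tens of thousands of sentence pairs.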

The Olympic task of IWSLT 2012 is used as the translation task input by the user, and its data are used to test the personalized machine translation system and method based on pseudo feedback. The training data provided for this task belong to the travel spoken-language domain and cover specific application scenarios such as traffic, catering, stadiums and commerce against the background of the Olympic Games. They contain 52,603 Chinese-English bilingual sentence pairs, with 495,638 Chinese words and 527,599 English words, and serve as the user's personalized local translation example library. A development set of 2,057 Chinese-English bilingual sentence pairs and a test set of 998 Chinese-English bilingual sentence pairs are adopted. The preliminary translation module uses the Google online translation system, from which the translations of the corpus were crawled; BLEU-4 is adopted as the translation quality metric, and the test results are compared directly with the Google translation results. In addition, a machine translation system trained with the open-source Moses tool serves as a second control for comparison.

By taking the BLEU-4 score as an evaluation standard, the translation result obtained by the personalized machine translation system and method based on the pseudo feedback is compared with the translation result of the Google online translation system, and the translation quality is improved by 19.5 percent; compared with the translation result of a machine translation system trained by a Moses tool, the translation quality is improved by 14.1%, and the test result is shown in Table 1:

Table 1: Comparison of translation quality between the pseudo-feedback-based personalized translation results and the translation results of other systems.

Claims

1. A personalized machine translation system based on pseudo feedback, the translation system comprising:

the input module is used for obtaining a translation task S input by a user;

2. The personalized machine translation system based on pseudo feedback of claim 1, wherein the phrase table filtering module is included in the phrase table classification module.

3. The translation method of the personalized machine translation system based on the pseudo feedback as claimed in claim 2, wherein: before a user inputs a translation task S, training a general post-editing model by using a translation example preliminary translation sentence T and a standard translation sentence R in translation memory and adopting a statistical method, and finishing the training process of the general post-editing model; the personalized machine translation method is realized by the following steps:

step one, phrase table filtering process: filtering each universal post-editing model phrase table of the development set data by using a phrase table filtering module;

$$CS(S_{input}, S_{example}) = \frac{Vec(S_{input}) \cdot Vec(S_{example})}{\|Vec(S_{input})\| \, \|Vec(S_{example})\|},$$

4. The translation method of the personalized machine translation system based on the pseudo feedback according to claim 3, wherein: the decoding process of step six applies a formula to the preliminary machine translation result T' of the translation task to obtain an optimized final translation result; in the formula, P(T'', T') is the translation probability of the general post-editing model, and P(S | T'', T') is the probability that a phrase pair (T'', T') in the general post-editing model is used to post-edit the preliminary machine translation sentence T' of the given input translation task S; its value is defined as 1 or 0 and is obtained by the following two methods:

2) when the phrase P_R of a phrase pair (P_T, P_R) in the optimized personalized post-editing model matches at least one phrase in the standard translation R, P(S | T'', T') takes the value 1, otherwise 0.

5. The translation method of the personalized machine translation system based on the pseudo feedback according to claim 3 or 4, wherein: and when the pseudo feedback retrieval process in the step four is carried out, the top 1000 most similar ones are selected from the preliminary translation results of the similar translation examples and the retrieval results of the standard translation R.