CA2971884A1

CA2971884A1 - Method and device for general machine translation engine-oriented individualized translation

Info

Publication number: CA2971884A1
Application number: CA2971884A
Authority: CA
Inventors: Muyun YANG; Junguo ZHU; Sheng Li; Tiejun Zhao
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2013-12-24
Filing date: 2014-09-28
Publication date: 2015-07-02
Anticipated expiration: 2034-09-28
Also published as: WO2015096529A1; CA2971884C; CN104731774A; CN104731774B

Abstract

A universal machine translation engine-oriented individualized translation method and device. The method comprises: acquiring a translation content which is input by a user; acquiring an on-line translation result of the translation content; based on the translation content, conducting similarity retrieval in a bilingual translation example library, so as to acquire at least one translation example; conducting incremental alignment on the translation example using the on-line translation result, so as to generate a confusion network; and decoding the confusion network, so as to acquire a translation candidate result. The method can improve the accuracy of a translation result.

Description

Method and Device for General Machine Translation Engine-Oriented Individualized Translation
Technical Field
The disclosure relates to the field of computer data processing, and in particular to a method and device for general machine translation engine-oriented individualized translation.
Background
Machine translation is the study of technology for translating text or speech from one natural language to another by means of a computer program. Generally speaking, machine translation systems may be divided roughly into two strategies, namely the rule-based translation system and the corpus-based translation system. In the rule-based translation system, a language expert needs to manually establish a complicated rule system with a dictionary. In contrast, the corpus-based translation system is centered around a corpus, with a knowledge base derived from segmented and annotated corpora. Corpus-based translation methods may be further divided into the instance-based translation method and the statistical translation method.
Generally, the basic idea of the instance-based machine translation method mirrors the way a foreign language beginner learns: the learner memorizes the most basic bilingual sentence pairs and then performs substitution drills on parts of those sentence pairs. Instance-based machine translation is thus a learning process that translates only from existing experiential knowledge by analogy, without deep analysis. The translation process is as follows: the source text is decomposed into sentences and then into phrase segments, these phrase segments are translated into target language phrases by analogy, and the phrases are finally combined into a long sentence. Analysis shows that instance-based machine translation is very effective for texts that are the same as or similar to stored instances, and the effect becomes more pronounced as the instance sentence base grows. For texts already in the instance library, a high-quality translation result may be obtained directly. For texts similar to instances in the instance library, an approximate translation result may be constructed by slightly correcting the translation of the instance through analogical inference. The method needs a large corpus as support, so the actual demand for instances is very large. Because corpus scale is limited, however, the matching rate of instance-based machine translation is unlikely to be high, and only within a specific professional domain can the translation quality meet practical requirements.
The core idea of the statistical machine translation method is: statistically analyzing a great number of parallel corpora, constructing a statistical translation model, and then translating by using the model. The classical word-based statistical machine translation is modelled via a noisy channel model. Its basic idea is that translation is regarded as a decoding process of converting an original text into a translation via a model, and the translation result is the sentence having the maximum probability. In the current statistical translation method, translation modelling is generally performed by using a phrase-based log-linear model, so the translation quality is obviously improved. Based on this method, Google, Baidu, Microsoft and other companies provide web-based public on-line translation services for free. Owing to the limitations of large-scale corpus statistical processing technology, the translation service models established in advance by these systems cannot be adjusted according to different user demands. Most of these conventional translation services therefore provide translation oriented to the general domain and cannot provide individualized translation results that satisfy different user preferences.
To address this issue and meet the various translation demands of users, researchers have proposed domain-adaptive solutions for current public SMT engines. The core idea is that either a corresponding domain model is trained by using a corpus with technical domain information, or a general translation model is adjusted according to the technical domain information so that it can follow changes in the technical domains of translation tasks, thereby meeting the translation demands of different technical domains. However, in the traditional art, it is necessary to collect a great number of domain corpora to implement these methods. The type and quantity of domain corpora collectable at present are still limited to a few domains such as news and science and technology. Although the translation quality is slightly improved, the diversified individualized translation demands of users cannot be met from the perspective of application. Meanwhile, most personal and enterprise users expect to obtain individualized translation service without exposing accumulated data containing information such as personal privacy or business secrets, which further increases the difficulty of implementing high-quality individualized machine translation service. As a direct result, current customized translation service is still limited to a few domains and cannot be further popularized and applied.
From the foregoing, the main defects of a conventional statistical machine translation technology are that: if individualized translation needs to be completed, it is necessary to collect a great amount of user translation data in advance and to perform statistical learning on these data, so as to train a new model. It is often difficult to obtain translation data needed for these trainings, and the training process is time-consuming, and does not facilitate protection of the privacy of a translation user.
An effective solution has not been proposed yet currently for the problem of inaccurate translation result caused by imperfect content of a translation database used in a process of completing individualized translation via machine translation in the related art.
This section provides background information related to the present disclosure which is not necessarily prior art.
Summary
An effective solution has not been proposed yet currently for the problem of inaccurate translation result caused by imperfect content of a translation database used in a process of completing individualized translation via machine translation in the related art. Thus, the disclosure is mainly directed to a method and device for general machine translation engine-oriented individualized translation which are intended to solve the above-mentioned problem.
To this end, according to an aspect of the disclosure, a method for general machine translation engine-oriented individualized translation is provided, which includes that: a translation content input by a user is acquired; an on-line translation result of the translation content is acquired; based on the translation content, similarity retrieval is performed in a bilingual translation instance library, so as to acquire at least one translation instance; incremental alignment is performed on the translation instance by using the on-line translation result, so as to generate a confusion network; and the confusion network is decoded, so as to acquire a translation candidate result.
To this end, according to another aspect of the disclosure, a device for general machine translation engine-oriented individualized translation is provided, which includes: a first obtaining component, configured to acquire a translation content input by a user; a second obtaining component, configured to acquire an on-line translation result of the translation content; a retrieving component, configured to perform, based on the translation content, similarity retrieval in a bilingual translation instance library, so as to acquire at least one translation instance; an incremental alignment processing component, configured to perform incremental alignment on the translation instance by using the on-line translation result, so as to generate a confusion network; and a translation generation component, configured to decode the confusion network, so as to acquire a translation candidate result.
By means of the disclosure, the translation content input by the user is acquired;
the on-line translation result of the translation content is acquired; based on the translation content, similarity retrieval is performed in the bilingual translation instance library, so as to acquire at least one translation instance;
incremental alignment is performed on the translation instance by using the on-line translation result, so as to generate the confusion network; and the confusion network is decoded, so as to acquire the translation candidate result. The problem of inaccurate translation result caused by imperfect content of a translation database used in a process of completing individualized translation via machine translation in the related art is solved, thereby improving the accuracy of a translation result.
Brief description of the drawings
The drawings illustrated herein are intended to provide further understanding of the disclosure, and form a part of the application. The schematic embodiments and illustrations of the disclosure are intended to explain the disclosure, and do not form improper limits to the disclosure. In the drawings:
Fig. 1 is a flowchart of a method for general machine translation engine-oriented individualized translation according to an embodiment of the disclosure;
Fig. 2 is a schematic diagram of a directed graph of a confusion network according to an embodiment of the disclosure; and
Fig. 3 is a structural diagram of a device for general machine translation engine-oriented individualized translation according to an embodiment of the disclosure.
Detailed description of the embodiments
It is important to note that the embodiments in the application and the characteristics in the embodiments may be combined under the condition of no conflicts. The disclosure will be illustrated below with reference to the drawings and in conjunction with the embodiments in detail.
Embodiment 1:
In a most basic configuration, Fig. 1 is a flowchart of a method for general machine translation engine-oriented individualized translation according to an embodiment of the disclosure. As shown in Fig. 1, the method includes the following steps.
Step S10: A translation content input by a user is acquired.
Step S30: An on-line translation result of the translation content is acquired.
Specifically, the on-line translation result in the step may be a general translation result returned by an on-line machine translation service such as Google for the translation task provided by the user.
Step S50: Based on the translation content, similarity retrieval is performed in a bilingual translation instance library, so as to acquire at least one translation instance.
Step S70: Incremental alignment is performed on the translation instance by using the on-line translation result, so as to generate a confusion network.
Step S90: The confusion network is decoded, so as to acquire a translation candidate result.
In the above-mentioned embodiment of the application, for the existing general machine translation engine, an individualized translation system oriented to specific demands of the user is implemented by using the bilingual translation instance library specified by the user. That is, the translation candidate result of the current translation content is obtained by combining the on-line translation result and a retrieval result in the bilingual translation instance library, so the problem of inaccurate translation result caused by imperfect content of a translation database used in a process of completing individualized translation via machine translation in the related art is solved, thereby improving the accuracy of a translation result, providing a high-quality translation result and user experience for the user, avoiding pre-collection of user data, and protecting the privacy of a translation user.
Specifically, the above-mentioned technical solution may be independent of a general machine translation engine, and may post-process a result of any machine translation engine to generate an individualized machine translation result.
The bilingual translation instance library utilized in the method may be only implemented locally at a client, so it may be achieved that data of the user is only effective at the client without being uploaded to a server, thereby protecting the privacy of the translation user. Moreover, in the above-mentioned method, time-consuming large-scale statistical learning training is not needed, so the user can quickly obtain an individualized translation result.
Here, it is important to note that the bilingual translation instance library in the above-mentioned embodiment of the application is a bilingual corpus, and may collect two language texts capable of being mutually translated. In addition, bilingual alignment is: establishing a corresponding relation between the same language units of a source language and a target language in the bilingual translation instance library, i.e., determining language units in source language texts and language units in target language texts in a mutual translated relationship. Bilingual texts have a multi-layer multi-granularity corresponding relationship, including alignment between paragraphs, sentences, phrases and words.
In the above-mentioned embodiment of the application, before Step S10 of obtaining the translation content input by the user, the method may further include the following implementation steps.
Step S101: The bilingual translation instance library is acquired, the bilingual translation instance library including multiple groups of sentence pairs.
Specifically, the bilingual translation instance library in the above-mentioned step of the application may be selected locally at the client by the user, according to the language direction of the translation task, as a bilingual translation instance library matching the source language and the target language of the user's own translation. The instance library may be a resource such as historical human translation results of the user or a bilingual dictionary for the domain of the user. Bilingual sentence pairs in the bilingual translation instance library may be sentence pairs subjected to manual word alignment or not subjected to word alignment. It is important to note that the application does not explicitly limit the scale of the above-mentioned instance library.
Step S103: Sentences not subjected to word alignment in the bilingual translation instance library are automatically aligned, so as to acquire a word-aligned bilingual sentence pair, wherein the bilingual sentence pair includes: a source language and a target language corresponding to the source language.
Specifically, the above-mentioned embodiment of the application implements automatic alignment on the sentence pairs not subjected to word alignment in the bilingual translation instance library, and outputs the word-aligned bilingual sentence pairs.
The automatic alignment here estimates an alignment probability between two words by using the co-occurrence frequency of words of the two languages in the same bilingual sentence pair, and then re-estimates the co-occurrence frequency of the two words by using the alignment probability, repeating until convergence is achieved. Finally, the alignment of maximum probability is selected as the final alignment result. An instance library made by a user is generally small, which makes the alignment result inaccurate. To improve the alignment quality, the application may therefore combine a large-scale general corpus with the user's instance library, execute the word alignment process on the combined corpus, and then separate the two again, thereby generating a high-quality alignment result for the instance library.
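The iterative estimation described above resembles expectation-maximization word alignment. The following is a minimal sketch under that assumption, in the style of IBM Model 1; the source does not name a specific model, and the function names `train_ibm_model1` and `best_alignment`, the fixed iteration count used in place of a convergence test, and the toy sentence pairs are all illustrative.

```python
from collections import defaultdict


def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate t[(target_word, source_word)] with EM (IBM Model 1 style).

    A fixed iteration count stands in for a convergence test (assumption).
    """
    # Uniform start: every word pair begins with the same score.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word over the source words.
        for src, tgt in sentence_pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: re-estimate probabilities from the expected counts.
        t = defaultdict(
            float, {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
        )
    return t


def best_alignment(src, tgt, t):
    """For each target word, pick the source position of maximum probability."""
    return [max(range(len(src)), key=lambda j: t[(tw, src[j])]) for tw in tgt]


# Toy corpus (hypothetical); a general corpus could be appended here
# before training and ignored afterwards, as the text suggests.
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
model = train_ibm_model1(pairs)
alignment = best_alignment(["das", "haus"], ["the", "house"], model)
```

Training on the concatenation of a general corpus and the instance library, then reading alignments only for the instance library pairs, would correspond to the combine, align and separate strategy in the text.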
In the above-mentioned embodiment of the application, Step S50 of performing similarity retrieval in a bilingual translation instance library based on the translation content so as to acquire at least one translation instance may include the following implementation steps.
Step S501: A vector value of the translation content is acquired.
Step S502: Source language vector values of all translation instances in the bilingual translation instance library are acquired.
Specifically, in the above two steps, the translation content and the source languages of all the translation instances in the bilingual translation instance library are described by using a vector space model. That is, each distinct word occurring in the sentence of the translation content or in the source language sentences of the translation instances serves as one dimension of a vector. The frequency of occurrence of a word in a sentence is the value of the dimension corresponding to that word. For instance, the vector of a sentence in which every word occurs once may be sparsely expressed by listing each of its words with the value 1.
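As a rough illustration of the sparse vector described above, a sentence can be mapped to word counts. This is only a sketch: whitespace tokenization is an assumption (a Chinese source sentence would need a word segmenter), and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter


def bag_of_words(tokens):
    """Sparse vector: one dimension per distinct word, value = frequency."""
    return Counter(tokens)


# Whitespace tokenization is assumed for this English example.
vec = bag_of_words("from the newspaper to the house".split())
```

Words that do not occur in the sentence are simply absent from the mapping, and a `Counter` reports them with frequency 0.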
Step S503: Similarity calculation is performed according to a vector value of the translation content and the source language vector values of all the translation instances in the bilingual translation instance library, so as to generate a plurality of similarity values of the translation content.
Step S504: N translation instances corresponding to the translation content are selected according to the similarity values, N being a natural number.
Preferably, in the above-mentioned embodiment of the application, Step S503 of performing similarity calculation according to a vector value of the translation content and the source language vector values of all the translation instances in the bilingual translation instance library so as to generate a plurality of similarity values of the translation content may be implemented by means of the following implementation mode:
calculating a similarity value P of the translation content by means of the following formula:
$$P = \mathrm{CosSim}(ex, F) = \frac{ex \cdot F}{\lVert ex \rVert \cdot \lVert F \rVert},$$
wherein $ex$ is a source language vector value of a translation instance, $F$ is a vector value of a translation content, $ex \cdot F$ is an inner product of a source language vector value of a translation instance and a vector value of a translation content, and $\lVert ex \rVert \cdot \lVert F \rVert$ is the product of the norms of a source language vector value of a translation instance and a vector value of a translation content.
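The cosine similarity between an instance vector and the translation content vector can be sketched as follows. The representation of vectors as word-to-count mappings and the function name `cos_sim` are assumptions for illustration, as is returning 0.0 for an empty vector.

```python
import math
from collections import Counter


def cos_sim(ex, f):
    """Cosine similarity of two sparse word-count vectors (dict-like)."""
    # Inner product: only words present in both vectors contribute.
    dot = sum(count * f.get(word, 0) for word, count in ex.items())
    norm_ex = math.sqrt(sum(c * c for c in ex.values()))
    norm_f = math.sqrt(sum(c * c for c in f.values()))
    if norm_ex == 0 or norm_f == 0:
        return 0.0  # assumption: an empty sentence matches nothing
    return dot / (norm_ex * norm_f)


task = Counter("from the newspaper".split())
instance = Counter("from newspaper".split())
p = cos_sim(instance, task)
```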
Preferably, in the above-mentioned embodiment of the application, Step S504 of selecting N translation instances corresponding to the translation content according to the similarity values may include the following implementation steps.
Step S5041: Multiple similarity values of the translation content are sorted according to the value size.
Step S5042: Corresponding translation instances are extracted according to the sorted similarity values to acquire N translation instances.
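Steps S5041 and S5042 amount to sorting by similarity and keeping the first N entries. A minimal sketch, assuming the similarity values and the instances are held in two parallel lists (the name `select_top_n` is illustrative):

```python
def select_top_n(similarities, instances, n):
    """Return the n instances with the highest similarity, best first."""
    # Sort instance indices by descending similarity value.
    order = sorted(
        range(len(instances)), key=lambda i: similarities[i], reverse=True
    )
    return [instances[i] for i in order[:n]]


top = select_top_n([0.2, 0.9, 0.5], ["inst_a", "inst_b", "inst_c"], 2)
```

If fewer than N instances exist, all of them are returned in ranked order.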
Specifically, the above-mentioned implementation solution achieves similarity calculation on the translation task and the translation instance library according to the vector values of the sentences, so as to obtain a translation instance candidate list most similar to the translation task. The most similar first N (N is 15 generally) translation instances may be selected. Specifically, the COS similarity may be calculated according to a bag-of-word vector space model in the above-mentioned solution of the application. COS similarity calculation is performed on the translation content, input by the user and serving as a current translation task, and the translation instance in the translation instance library. The similarity calculation may be performed according to the following formula:
$$\mathrm{CosSim}(ex, F) = \frac{ex \cdot F}{\lVert ex \rVert \cdot \lVert F \rVert},$$
wherein $ex$ is a source language vector of a translation instance, $F$ is a translation task vector, $ex \cdot F$ is an inner product of the two vectors, and $\lVert \cdot \rVert$ is a norm of a vector.
Here, it is important to note that the bag-of-word model is a simple hypothesis in natural language processing and information retrieval. In this model, texts (paragraphs or documents) are regarded as unordered word sets, ignoring grammar or even a word sequence.
Preferably, in the above-mentioned embodiment of the application, Step S70 of performing incremental alignment on the translation instance by using the on-line translation result so as to acquire a confusion network may include the following implementation solutions.
Step S701: The on-line translation result is set as an original translation framework.
Specifically, the translation framework may serve as an initial translation result or a basic translation result, is an alignment datum, and is a sequence composed of one or more sets, wherein the set is composed of one or more words. That is, each position of the sequence contains one or more words. Other translations need to be aligned according to the words on the translation framework.
Step S702: Incremental alignment is performed on the target language of the translation instance and the above-mentioned translation framework obtained currently in sequence, so as to obtain an alignment result.
Step S703: All words of translations in the on-line translation result and the N translation instances are connected according to the alignment result to form the confusion network. Specifically, the step updates the corresponding word information in the original translation framework according to the alignment result, so that a confusion network is obtained.
Specifically, the process may be implemented by using an incremental alignment module in a tool TERp. Four sentences ("from the newspaper", "from newspaper", "newspaper", and "house newspaper") may be illustrated below, wherein "from the newspaper" is an on-line translation result, and "from newspaper", "newspaper", and "house newspaper" are target languages of the above-mentioned translation instance.
The incremental alignment process is: aligning sentence 1 and sentence 0, then aligning sentence 2, and finally aligning sentence 3. An alignment result is as shown in Table 1:
Table 1
0    from     the     newspaper
1    from     NULL    newspaper
2    NULL     NULL    newspaper
3    house    NULL    newspaper
The above-mentioned alignment result is also called a confusion network, a directed graph thereof being as shown in Fig. 2.
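To make the structure of Table 1 concrete, the sketch below turns already-aligned rows into a confusion network in which each column becomes one slot of competing words. It takes the alignment as given and does not reproduce the TERp incremental alignment itself; the list-of-counters representation and the name `build_confusion_network` are assumptions.

```python
from collections import Counter

NULL = "NULL"  # placeholder for an empty word, as in Table 1


def build_confusion_network(aligned_rows):
    """Each column of the alignment becomes one slot holding word counts."""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")
    return [Counter(row[pos] for row in aligned_rows) for pos in range(width)]


# Rows 0 to 3 of Table 1; row 0 is the on-line translation result.
rows = [
    ["from", "the", "newspaper"],
    ["from", NULL, "newspaper"],
    [NULL, NULL, "newspaper"],
    ["house", NULL, "newspaper"],
]
network = build_confusion_network(rows)
```

Each slot corresponds to the set of parallel edges between two adjacent nodes of a directed graph such as the one in Fig. 2.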
Preferably, in the above-mentioned embodiment of the application, Step S90 of decoding the confusion network so as to acquire a translation candidate result may specifically include the following steps. That is, after Step S703, the following steps may be executed.
Step S704: The obtained confusion network is decoded according to sentence characteristics so as to generate at least one decoding result, that is, a new translation result. This process is also called decoding.
Specifically, the decoding process is in effect a search of the directed graph of the confusion network for an optimal translation path. The words along each translation path form a translation. In the search process, the score of a translation path is calculated by using a log-linear model: the model takes the logarithm of each characteristic value attached to each word and then calculates the score of the path as a weighted sum. The translation having the optimal score is selected as the final translation.
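The path search with log-linear scoring can be sketched as below. This is an illustration under several assumptions: every path is enumerated exhaustively (practical only for tiny networks; a real decoder would use dynamic programming or beam search), each word carries a dictionary of positive characteristic values, and the single characteristic named `vote` (the share of rows in Table 1 proposing the word) and its weight are invented for the example.

```python
import itertools
import math

NULL = "NULL"


def decode(network, weights):
    """Return (words, score) of the best path through the confusion network.

    network: list of slots, each mapping word -> {characteristic: value > 0}.
    weights: mapping characteristic name -> weight of the log-linear model.
    """
    best_score, best_words = -math.inf, None
    # Enumerate every path: one (word, characteristics) choice per slot.
    for path in itertools.product(*(slot.items() for slot in network)):
        score = sum(
            weights[name] * math.log(value)
            for _, feats in path
            for name, value in feats.items()
        )
        if score > best_score:
            best_score = score
            # Empty words are dropped from the output translation.
            best_words = [word for word, _ in path if word != NULL]
    return best_words, best_score


network = [
    {"from": {"vote": 0.5}, NULL: {"vote": 0.25}, "house": {"vote": 0.25}},
    {"the": {"vote": 0.25}, NULL: {"vote": 0.75}},
    {"newspaper": {"vote": 1.0}},
]
words, score = decode(network, {"vote": 1.0})
```

Sentence-level characteristics such as a language model probability depend on the whole path rather than on single words, so they would be added to the path score separately in a fuller implementation.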
The sentence characteristics selected in the above-mentioned embodiment of the application include:
language model probability: given by a language model trained on the target language of all bilingual translation instance libraries of a user, specifically the n-gram language model most widely applied at present;
word penalty: number of words in a generated translation;
empty word penalty: number of empty words contained in a translation path;
word consistency: frequency of occurrence, in a selected translation instance, of N successive words of a generated translation (i.e., the ratio of the number of times N successive words of the generated translation occur in the selected translation instance to the total number of successive words in the instance);
translation generation probability: an n-gram language model is calculated by using all the selected instances, and the language model probability of a translation is calculated by using this model, which is in effect a local language model probability obtained within the range of the instance library;
number of repeated words: number of repeated words in a generated translation;
number of result words of general translation engine: number of words in a generated translation that come from the result of the general translation engine; and
word confidence: confidence of words in the confusion network.
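Several of the simpler count-based characteristics in this list can be computed directly from a translation path. The sketch below is illustrative only: the exact definitions are assumptions where the text is not explicit (repeated words are counted as occurrences beyond the first, and engine words are matched by membership regardless of position), and `simple_path_features` is a hypothetical name.

```python
from collections import Counter

NULL = "NULL"


def simple_path_features(path, engine_words):
    """Count-based characteristics of one path through the confusion network."""
    words = [w for w in path if w != NULL]  # the generated translation
    counts = Counter(words)
    return {
        "word_penalty": len(words),
        "empty_word_penalty": sum(1 for w in path if w == NULL),
        # Assumption: each occurrence beyond the first counts as a repeat.
        "repeated_words": sum(c - 1 for c in counts.values()),
        # Assumption: a word counts if it appears anywhere in the engine result.
        "engine_words": sum(1 for w in words if w in engine_words),
    }


features = simple_path_features(
    ["from", NULL, "newspaper"], {"from", "the", "newspaper"}
)
```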
Preferably, when word confidence characteristics are selected as sentence characteristics, the confidence of words in the confusion network is estimated to obtain a confidence estimation result. Here, it is important to note that the method for estimating the confidence of a word $w$ in the confusion network in the above-mentioned solution may be:
$$\mathrm{conf}(w) = \sum_{i=0}^{n} r_i \, \delta_i \, p(w \mid E_i),$$
wherein $n$ is the total number of translation instances; when $i = 0$, $E_i$ is the on-line translation result of a general translation engine, and when $i \geq 1$, $E_i$ is the $i$th translation instance; $r_i$ is the source language similarity value of the $i$th translation instance; $\delta_i$ is a 0-1 characteristic function, with $\delta_i = 1$ if $w$ occurs in $E_i$ at the current position and $\delta_i = 0$ otherwise; and $p(w \mid E_i)$ is the posterior probability of the word $w$ conditioned on the sentence, which may be estimated according to word alignment information:
$$p(w \mid E_i) = \frac{1}{1 + e^{-c}},$$
wherein $e$ is the base of the natural logarithm and $c$ is a defined counter.
A specific instance of the above-mentioned estimation algorithm is as follows.
The initial value of c corresponding to each word w is 0. If a word w comes from the translation result of a general translation engine, c remains unchanged. If a word w comes from an instance and there is no alignment result for w in the instance, c remains unchanged. According to the bilingual alignment result of the instance, if a source language word aligned with w occurs in the translation task, c is incremented by 1.
According to the bilingual alignment result of the instance, if the source language word aligned with w does not occur in the translation task, c is decremented by 1.
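The counter rules above, together with a logistic mapping from the counter to a probability, can be sketched as follows. Two points are assumptions rather than statements of the source: the exponent is taken as negative so that a larger counter gives a higher probability, and when a word is aligned with several source words the rule is applied once per aligned source word. The function names and the placeholder source words are hypothetical.

```python
import math


def alignment_counter(from_engine, aligned_source_words, task_words):
    """Counter c for one target word w, following the stated update rules."""
    # Engine words and unaligned instance words keep the initial value 0.
    if from_engine or not aligned_source_words:
        return 0
    c = 0
    for source_word in aligned_source_words:
        # +1 if the aligned source word occurs in the task, otherwise -1.
        c += 1 if source_word in task_words else -1
    return c


def posterior(c):
    """Logistic mapping of the counter to a probability (sign is assumed)."""
    return 1.0 / (1.0 + math.exp(-c))


task_words = {"src_a", "src_b"}  # placeholder source words of the task
c_hit = alignment_counter(False, ["src_a"], task_words)
c_miss = alignment_counter(False, ["src_x"], task_words)
```

With these rules an engine word or an unaligned word receives probability 0.5, a word whose aligned source word appears in the task receives more, and a word whose aligned source word is absent receives less.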
According to the language direction of the translation task, the user may provide a group of standard bilingual sentence pairs in the source language and the target language to serve as a development set for the translation task. The system uses this data to optimize the characteristic weights of the individualized translation model. If the data cannot be provided, default weights are adopted.
From the foregoing, the above-mentioned embodiment of the application implements a translation optimizing technology that learns from information produced by the user, in particular a method and system for converting a general machine translation result into an individualized translation result.
It is important to note that the steps shown in the flowchart of the drawings may be executed in a computer system including, for instance, a set of computer-executable instructions. Moreover, although a logic sequence is shown in the flowchart, the shown or described steps may be executed in a sequence different from the sequence here under certain conditions.
Embodiment 2:
Fig. 3 is a structural diagram of a device for general machine translation engine-oriented individualized translation according to an embodiment of the disclosure. As shown in Fig. 3, the device for general machine translation engine-oriented individualized translation may include: a first obtaining component 10, a second obtaining component 30, a retrieving component 50, an incremental alignment processing component 70 and a translation generation component 90, wherein the first obtaining component 10 is configured to acquire a translation content input by a user; the second obtaining component 30 is configured to acquire an on-line translation result of the translation content; the retrieving component 50 is configured to perform, based on the translation content, similarity retrieval in a bilingual translation instance library, so as to acquire at least one translation instance;
the incremental alignment processing component 70 is configured to perform incremental alignment on the translation instance by using the on-line translation result, so as to generate a confusion network; and the translation generation component 90 is configured to decode the confusion network, so as to acquire a translation candidate result.
In the above-mentioned embodiment of the application, for the existing general machine translation engine, an individualized translation system oriented to specific demands of the user is implemented by using the bilingual translation instance library specified by the user. That is, the translation candidate result of the current translation content is obtained by combining the on-line translation result and a retrieval result in the bilingual translation instance library, so the problem of inaccurate translation result caused by imperfect content of a translation database used in a process of completing individualized translation via machine translation in the related art is solved, thereby improving the accuracy of a translation result, providing a high-quality translation result and user experience for the user, avoiding pre-collection of user data, and protecting the privacy of a translation user.
Specifically, the above-mentioned technical solution may be independent of a general machine translation engine, and may post-process a result of any machine translation engine to generate an individualized machine translation result.
The bilingual translation instance library utilized in the method may be only implemented locally at a client, so it may be achieved that data of the user is only effective at the client without being uploaded to a server, thereby protecting the privacy of the translation user. Moreover, in the above-mentioned method, time-consuming large-scale statistical learning training is not needed, so the user can quickly obtain an individualized translation result.
Here, it is important to note that the bilingual translation instance library in the above-mentioned embodiment of the application is a bilingual corpus, which collects texts in two languages that are translations of each other. In addition, bilingual alignment means establishing a correspondence between equivalent language units of a source language and a target language in the bilingual translation instance library, i.e., determining which language units in the source language texts and which language units in the target language texts are translations of each other. Bilingual texts have a multi-layer, multi-granularity correspondence, including alignment between paragraphs, sentences, phrases and words.
Preferably, the device in the above-mentioned embodiment of the application may further include: a third obtaining component, configured to acquire the bilingual translation instance library, the bilingual translation instance library including multiple groups of sentence pairs; and an automatic alignment component, configured to automatically align sentences not subjected to word alignment in the bilingual translation instance library, so as to acquire a word-aligned bilingual sentence pair, wherein the bilingual sentence pair includes: a source language and a target language corresponding to the source language.
Specifically, for the above-mentioned third obtaining component of the application, the user may locally select, at the client and according to the language direction of the translation task, a bilingual translation instance library satisfying the source language and the target language of the translation. The instance library may be a resource such as a historical human translation result of the user or a bilingual dictionary about the domain of the user.
Bilingual sentence pairs in the bilingual translation instance library may be sentence pairs subjected to manual word alignment or not subjected to word alignment. It is important to note that the application does not explicitly limit the scale of the above-mentioned instance library.
Besides, the above-mentioned automatic alignment component of the application implements automatic alignment on the sentence pairs not subjected to word alignment in the bilingual translation instance library, and outputs the word-aligned bilingual sentence pairs. The automatic alignment here estimates an alignment probability between two words by using the co-occurrence frequency of words of different languages in the same bilingual sentence pair, and then re-estimates the co-occurrence frequency of the two words by using the alignment probability, repeating until convergence is achieved.
Finally, the alignment of maximum probability is selected to serve as the final alignment result. In order to improve the alignment quality, the application may combine a general corpus with the instance library, align them together, and then separate them again. An instance library made by the user is generally small, which tends to cause inaccurate alignment results. By combining a large-scale corpus, namely the general corpus, with the instance library made by the user when executing the word alignment process, a high-quality alignment result may be generated.
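The alternation between alignment probabilities and expected co-occurrence counts described above resembles expectation-maximization training in the style of IBM Model 1. The following is a minimal sketch under that assumption; the disclosure does not name a specific model, the function names are hypothetical, and a fixed iteration count stands in for a convergence test.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def train_word_alignment(pairs: List[SentencePair],
                         iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t[(target_word, source_word)] = P(target_word | source_word) with EM."""
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: 1.0)  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts (E-step)
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        for (w, s), c in count.items():  # M-step: re-estimate probabilities
            t[(w, s)] = c / total[s]
    return t


def best_alignment(src: List[str], tgt: List[str],
                   t: Dict[Tuple[str, str], float]) -> List[int]:
    """For each target word, return the index of its most probable source word."""
    return [max(range(len(src)), key=lambda j: t[(w, src[j])]) for w in tgt]
```

To follow the combine, align and separate idea, one would call `train_word_alignment(general_pairs + user_pairs)` and then apply `best_alignment` only to the sentence pairs of the user's instance library.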
Preferably, the retrieving component 50 in the above-mentioned embodiment of the application may include: a first sub-acquisition component, configured to acquire a vector value of the translation content; a second sub-acquisition component, configured to acquire source language vector values of all translation instances in the bilingual translation instance library; a processing component, configured to perform similarity calculation according to a vector value of the translation content and the source language vector values of all the translation instances in the bilingual translation instance library, so as to generate a plurality of similarity values of the translation content; and a selection component, configured to select N translation instances corresponding to the translation content according to the similarity values, N being a natural number.
Preferably, the processing component in the above-mentioned embodiment of the application may include: a similarity calculation component, configured to calculate a similarity value P of the translation content by means of the following formula:
$$P = \mathrm{CosSim}(ex\_F, F) = \frac{ex\_F \cdot F}{\lVert ex\_F \rVert \times \lVert F \rVert}$$
wherein $ex\_F$ is a source language vector value of a translation instance, $F$ is a vector value of a translation content, $ex\_F \cdot F$ is an inner product of a source language vector value of a translation instance and a vector value of a translation content, and $\lVert ex\_F \rVert \times \lVert F \rVert$ is a product of the norms of a source language vector value of a translation instance and a vector value of a translation content.
Preferably, the selection component in the above-mentioned embodiment of the application may include: a sorting component, configured to sort multiple similarity values of the translation content according to the value size; and an extraction component, configured to extract corresponding translation instances according to the sorted similarity values to acquire N translation instances.
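The similarity calculation and the selection of N instances can be sketched as follows. The bag-of-words vectorization over whitespace tokens is an assumption made for illustration, since the disclosure does not specify how vector values are built, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Instance = Tuple[str, str]  # (source sentence, target sentence)


def vectorize(sentence: str) -> Counter:
    """Assumed representation: word counts over lowercased whitespace tokens."""
    return Counter(sentence.lower().split())


def cos_sim(a: Counter, b: Counter) -> float:
    """Inner product divided by the product of the two vector norms."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_top_n(content: str, library: Sequence[Instance],
                   n: int) -> List[Tuple[float, Instance]]:
    """Score every instance by its source side, sort by value, keep the first N."""
    query = vectorize(content)
    scored = [(cos_sim(vectorize(src), query), (src, tgt)) for src, tgt in library]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n]
```

Returning the similarity value together with each instance keeps it available for later steps, such as the confidence estimation, without recomputation.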
Preferably, the incremental alignment processing component 70 in the above-mentioned embodiment of the application may include: a setting component, configured to set the on-line translation result as an original translation framework; a third sub-acquisition component, configured to perform incremental alignment on the target language of the translation instance and the original translation framework in sequence, so as to obtain an alignment result; and a generation component, configured to connect all words of translations in the on-line translation result and the N translation instances according to the alignment result to form the confusion network, namely to update corresponding word information in the original translation framework according to the alignment result, so as to obtain the confusion network.
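A simplified sketch of building a confusion network around the original translation framework is given below. It is not the alignment procedure of the disclosure: Python's `difflib.SequenceMatcher` is used as a stand-in monolingual aligner, every instance is aligned against the fixed framework rather than against the growing network, and instance words with no counterpart in the framework are dropped. All names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List

EPS = ""  # empty word: the instance offers nothing for this slot


def build_confusion_network(framework: List[str],
                            hypotheses: List[List[str]]) -> List[List[str]]:
    """Return one slot per framework word, listing the competing alternatives."""
    network: List[List[str]] = [[w] for w in framework]
    for hyp in hypotheses:  # instances are processed one after another
        matcher = SequenceMatcher(a=framework, b=hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "insert":
                continue  # simplification: unanchored instance words are ignored
            for offset, i in enumerate(range(i1, i2)):
                j = j1 + offset
                word = hyp[j] if j < j2 else EPS
                if word not in network[i]:
                    network[i].append(word)
    return network
```

For the framework "he bought a new car" and the instance targets "he purchased a new car" and "he bought a vehicle", the second slot would hold both "bought" and "purchased", and the last slot would hold "car" and the empty word.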
Preferably, the translation generation component 90 in the above-mentioned embodiment of the application may include: a decoding component, configured to decode the confusion network according to sentence characteristics so as to generate at least one decoding result; and a confidence component, configured to estimate, when word confidence characteristics are used as the sentence characteristics, namely when the word confidence characteristics of the confusion network are calculated, the confidence of the confusion network so as to obtain a confidence estimation result.
Wherein, the confidence component may include: a calculation component, configured to calculate the confidence estimation result via the following formula:
$$C(w) = \sum_{i=0}^{n} \lambda_i C_i \, p(w \mid E_i), \qquad p(w \mid E_i) = \frac{1}{1 + e^{-c}}$$
wherein n is a total number of translation instances; when i=0, $E_i$ is an on-line translation result of a general translation engine, and when i ≥ 1, $E_i$ is an i-th translation instance; $C_i$ is a source language similarity value of the i-th translation instance; $\lambda_i$ is a 0-1 characteristic function; e is the base of the natural logarithm; and c is a count value of a counter.
A specific instance of the above-mentioned estimation algorithm is as follows.
An initial value of c corresponding to each word w is 0. If a word w comes from a translation result of a general translation engine, c remains unchanged. If a word w comes from an instance and there is no alignment result for w in the instance, c remains unchanged. According to a bilingual alignment result of the instance, if a source language word aligned with w occurs in the translation task, c is incremented by 1.
According to the bilingual alignment result of the instance, if the source language word aligned with w does not occur in the translation task, c is decremented by 1.
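The counter rules above can be written down directly. In the sketch below, the function names and arguments are hypothetical, and the logistic mapping from c to a probability is an assumption about the form of the confidence formula rather than a confirmed reading of it.

```python
import math
from typing import Optional, Set


def update_counter(c: int, from_engine: bool,
                   aligned_source: Optional[str], task_words: Set[str]) -> int:
    """Apply the counter rule for one occurrence of a target word w.

    aligned_source is the source word aligned with w in the instance,
    or None when the instance has no alignment for w.
    """
    if from_engine or aligned_source is None:
        return c  # engine output or unaligned word: c remains unchanged
    return c + 1 if aligned_source in task_words else c - 1


def word_probability(c: int) -> float:
    """Assumed logistic mapping of the counter: 1 / (1 + e^(-c))."""
    return 1.0 / (1.0 + math.exp(-c))
```

Under this assumed mapping, a word whose counter stays at 0 receives a probability of 0.5, and each supporting instance pushes the value towards 1 while each contradicting instance pushes it towards 0.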
According to the language direction of the translation task, the user may provide a group of standard bilingual sentence pairs translated between the source language and the target language to serve as a development set of the translation task. The system uses this data to optimize the characteristic weights of the individualized translation model. If the data cannot be provided, default weights are adopted.
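How characteristic weights might enter decoding can be illustrated with a greedy sketch. The two characteristics used here (a word confidence and a bonus for emitting a non-empty word), their names and their default values are all hypothetical; a full decoder would typically score whole paths through the confusion network with additional sentence-level characteristics.

```python
from typing import Dict, List, Optional

EPS = ""  # empty word: choosing it means emitting nothing for this slot

# Hypothetical default weights, used when no development set is provided.
DEFAULT_WEIGHTS: Dict[str, float] = {"confidence": 1.0, "word_bonus": 0.0}


def decode(network: List[Dict[str, float]],
           weights: Optional[Dict[str, float]] = None) -> List[str]:
    """Greedy decoding: per slot, keep the word with the highest weighted score."""
    w = DEFAULT_WEIGHTS if weights is None else weights
    output: List[str] = []
    for slot in network:  # slot maps each candidate word to its confidence
        def score(word: str) -> float:
            bonus = 0.0 if word == EPS else 1.0
            return w["confidence"] * slot[word] + w["word_bonus"] * bonus
        best = max(slot, key=score)
        if best != EPS:
            output.append(best)
    return output
```

Weights tuned on a development set would simply be passed as the `weights` argument in place of the defaults.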
All functional modules provided in the embodiments of the application may operate in a mobile terminal, a computer terminal or a similar operation device, and may be stored by serving as a part of a storage medium.
Accordingly, the embodiments of the disclosure may provide a computer terminal.
The computer terminal may be any one computer terminal device in a computer terminal group. Alternatively, in the present embodiment, the above-mentioned computer terminal may be replaced with a terminal device such as a mobile terminal.
Alternatively, in the present embodiment, the above-mentioned computer terminal may be located in at least one network device in multiple network devices of a computer network.
In the present embodiment, the above-mentioned computer terminal may execute program codes of the following steps in a general machine translation engine-oriented individualized translation method: obtaining a translation content input by a user; obtaining an on-line translation result of the translation content;
performing, based on the translation content, similarity retrieval in a bilingual translation instance library, so as to acquire at least one translation instance;
performing incremental alignment on the translation instance by using the on-line translation result, so as to generate a confusion network; and decoding the confusion network, so as to acquire a translation candidate result.
Alternatively, the computer terminal may include: one or more processors, memories, and transmission apparatuses.
Alternatively, the processor of the above-mentioned computer terminal may execute program codes of the following steps: before obtaining the translation content input by the user, obtaining the bilingual translation instance library, the bilingual translation instance library including multiple groups of sentence pairs; and automatically aligning sentence pairs not subjected to word alignment in the bilingual translation instance library, so as to acquire a word-aligned bilingual sentence pair, wherein the bilingual sentence pair includes: a source language and a target language corresponding to the source language.
Alternatively, the processor of the above-mentioned computer terminal may execute program codes of the following steps: obtaining a vector value of the translation content; obtaining source language vector values of all translation instances in the bilingual translation instance library; performing similarity calculation according to a vector value of the translation content and the source language vector values of all the translation instances in the bilingual translation instance library, so as to generate a plurality of similarity values of the translation content; and selecting N translation instances corresponding to the translation content according to the similarity values, N being a natural number.
Alternatively, the processor of the above-mentioned computer terminal may execute program codes of the following steps: calculating a similarity value P of the translation content by means of the following formula:
$$P = \mathrm{CosSim}(ex\_F, F) = \frac{ex\_F \cdot F}{\lVert ex\_F \rVert \times \lVert F \rVert}$$
wherein $ex\_F$ is a source language vector value of a translation instance, $F$ is a vector value of a translation content, $ex\_F \cdot F$ is an inner product of a source language vector value of a translation instance and a vector value of a translation content, and $\lVert ex\_F \rVert \times \lVert F \rVert$ is a product of the norms of a source language vector value of a translation instance and a vector value of a translation content.
Alternatively, the processor of the above-mentioned computer terminal may execute program codes of the following steps: sorting multiple similarity values of the translation content according to the value size; and extracting corresponding translation instances according to the sorted similarity values to acquire N translation instances.
Alternatively, the processor of the above-mentioned computer terminal may execute program codes of the following steps: setting the on-line translation result as an original translation framework; performing incremental alignment on the target language of the translation instance and the original translation framework in sequence, so as to obtain an alignment result; and connecting all words of translations in the on-line translation result and N translation instances according to the alignment result to form the confusion network.

Alternatively, the processor of the above-mentioned computer terminal may execute program codes of the following steps: decoding the confusion network according to sentence characteristics so as to generate at least one decoding result, wherein when the word confidence characteristics of the confusion network are calculated, the confidence of the confusion network is estimated to obtain a confidence estimation result. The above-mentioned step includes: calculating the confidence estimation result via the following formula:
$$C(w) = \sum_{i=0}^{n} \lambda_i C_i \, p(w \mid E_i), \qquad p(w \mid E_i) = \frac{1}{1 + e^{-c}}$$
wherein n is a total number of translation instances; when i=0, $E_i$ is an on-line translation result of a general translation engine, and when i ≥ 1, $E_i$ is an i-th translation instance; $C_i$ is a source language similarity value of the i-th translation instance; $\lambda_i$ is a 0-1 characteristic function; e is the base of the natural logarithm; and c is a count value of a counter.
Wherein, the memory may be configured to store a software program and a module, for instance, store a program instruction/module corresponding to the general machine translation engine-oriented individualized translation method and device in the embodiments of the disclosure, and the processor executes various function applications and data processing by operating the software program and the module stored in the memory. That is, the above-mentioned general machine translation engine-oriented individualized translation method is implemented.
The memory may include a high-speed random access memory, and may further include a nonvolatile memory such as one or more magnetic storage apparatuses, flash memories, or other nonvolatile solid-state memories. In some instances, the memory may further include memories remotely disposed relative to the processor. These remote memories may be connected to a terminal via a network. The instances for the network include, but are not limited to, the internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The above-mentioned transmission device is configured to receive or send data via a network. Specific instances for the network may include a wired network or a wireless network. In an instance, the transmission device includes a Network Interface Controller (NIC), which may be connected to other network devices and routers via network cables so as to be capable of communicating with the internet or local area network. In an instance, the transmission device is a Radio Frequency (RF) module, configured to communicate with the Internet in a wireless mode.
Wherein, specifically, the memory is configured to store a preset action condition, information about a preset authority user, and an application program.
The processor may call the information and application program stored in the memory via the transmission device, so as to execute program codes of the method steps of various alternative or preferred embodiments in the above-mentioned method embodiment.
Those of ordinary skill in the art may understand that the computer terminal may also be a terminal device such as a smart phone (such as an Android phone, and an iOS phone), a tablet computer, a palmtop, Mobile Internet Devices (MID), and a PAD.
Those of ordinary skill in the art may understand that all or some steps in various methods of the above-mentioned embodiment may be completed by instructing relevant hardware of a terminal device via a program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disc.
The embodiments of the disclosure also provide a storage medium.
Alternatively, in the present embodiment, the above-mentioned storage medium may be configured to store program codes executed by the general machine translation engine-oriented individualized translation method provided in the above-mentioned method embodiment and device embodiment.
Alternatively, in the present embodiment, the above-mentioned storage medium may be located in any one computer terminal device in a computer terminal group in a computer network, or located in any one mobile terminal in a mobile terminal group.
Alternatively, in the present embodiment, the storage medium is configured to execute program codes of the following steps: obtaining a translation content input by a user; obtaining an on-line translation result of the translation content;
performing, based on the translation content, similarity retrieval in a bilingual translation instance library, so as to acquire at least one translation instance; performing incremental alignment on the translation instance by using the on-line translation result, so as to generate a confusion network; and decoding the confusion network, so as to acquire a translation candidate result.
Alternatively, in the present embodiment, the storage medium may be further configured to store program codes for executing various preferred or alternative method steps provided by the general machine translation engine-oriented individualized translation method.
Alternatively, the storage medium is further configured to execute program codes of the following steps: before obtaining the translation content input by the user, obtaining the bilingual translation instance library, the bilingual translation instance library including multiple groups of sentence pairs; and automatically aligning sentence pairs not subjected to word alignment in the bilingual translation instance library, so as to acquire a word-aligned bilingual sentence pair, wherein the bilingual sentence pair includes: a source language and a target language corresponding to the source language.
Alternatively, the storage medium is further configured to execute program codes of the following steps: obtaining a vector value of the translation content;
obtaining source language vector values of all translation instances in the bilingual translation instance library; performing similarity calculation according to a vector value of the translation content and the source language vector values of all the translation instances in the bilingual translation instance library, so as to generate a plurality of similarity values of the translation content; and selecting N translation instances corresponding to the translation content according to the similarity values, N being a natural number.
Alternatively, the storage medium is further configured to execute program codes of the following steps: calculating a similarity value P of the translation content by means of the following formula:
$$P = \mathrm{CosSim}(ex\_F, F) = \frac{ex\_F \cdot F}{\lVert ex\_F \rVert \times \lVert F \rVert}$$
wherein $ex\_F$ is a source language vector value of a translation instance, $F$ is a vector value of a translation content, $ex\_F \cdot F$ is an inner product of a source language vector value of a translation instance and a vector value of a translation content, and $\lVert ex\_F \rVert \times \lVert F \rVert$ is a product of the norms of a source language vector value of a translation instance and a vector value of a translation content.

Alternatively, the storage medium is further configured to execute program codes of the following steps: sorting multiple similarity values of the translation content according to the value size; and extracting corresponding translation instances according to the sorted similarity values to acquire N translation instances.
Alternatively, the storage medium is further configured to execute program codes of the following steps: setting the on-line translation result as an original translation framework; performing incremental alignment on the target language of the translation instance and the original translation framework in sequence, so as to obtain an alignment result; and connecting all words of translations in the on-line translation result and N translation instances according to the alignment result to form the confusion network.
Alternatively, the storage medium is further configured to execute program codes of the following steps, wherein the step of decoding the confusion network so as to acquire a translation candidate result includes: decoding the confusion network according to sentence characteristics so as to generate at least one decoding result, wherein when the word confidence characteristics of the confusion network are calculated, the confidence of the confusion network is estimated to obtain a confidence estimation result, the above-mentioned step including: calculating the confidence estimation result via the following formula:
$$C(w) = \sum_{i=0}^{n} \lambda_i C_i \, p(w \mid E_i), \qquad p(w \mid E_i) = \frac{1}{1 + e^{-c}}$$
wherein n is a total number of translation instances; when i=0, $E_i$ is an on-line translation result of a general translation engine, and when i ≥ 1, $E_i$ is an i-th translation instance; $C_i$ is a source language similarity value of the i-th translation instance; $\lambda_i$ is a 0-1 characteristic function; e is the base of the natural logarithm; and c is a count value of a counter.
From the foregoing description, it may be seen that the disclosure achieves the following technical effects. The problem of inaccurate translation result caused by imperfect content of a translation database used in a process of completing individualized translation via machine translation in the related art is solved, thereby improving the accuracy of a translation result, providing a high-quality translation result and user experience for the user, avoiding pre-collection of user data, and protecting the privacy of a translation user. Specifically, the following aspects may be included.
(1) The method is independent of a general machine translation engine, and may post-process a result of any machine translation engine to generate an individualized machine translation result.
(2) The method may achieve that data of a user is only effective at a client without being uploaded to a server, thereby protecting the privacy of the translation user.
(3) In the method, time-consuming large-scale statistical learning training is not needed, so the user can quickly obtain an individualized translation result.
From the foregoing description of the implementation mode, those skilled in the art may clearly know that the application may be implemented by means of software and a necessary general hardware platform. Based on this understanding, the technical solutions of the application, or the parts thereof contributing to the conventional art, may be substantially embodied in the form of a software product, and the computer software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disc, and includes a plurality of instructions enabling a computer device, which may be a personal computer, a server or a network device, to execute the method according to all or some of the embodiments of the application.
All embodiments in the description are described progressively, identical and similar parts between all the embodiments may refer to each other, and parts emphasized in each embodiment are different from other embodiments.
Particularly, a system embodiment basically similar to a method embodiment is simpler in description, and correlated parts may refer to some illustrations in the method embodiment.
The application may be used in multiple general or dedicated calculation system environments or configurations such as a personal computer, a server computer, a handheld device or portable device, a tablet device, a multi-processor system, a microprocessor-based system, a set top box, a programmable consumer electronic device, a network PC, a small computer, a large computer, and a distributed calculation environment including any of the above systems or devices.
Obviously, those skilled in the art shall understand that all of the above-mentioned modules or steps in the disclosure may be implemented by using a general calculation apparatus, may be centralized on a single calculation apparatus or may be distributed on a network composed of a plurality of calculation apparatuses.
Alternatively, they may be implemented by using executable program codes of the calculation apparatuses. Thus, they may be stored in a storage apparatus and executed by the calculation apparatuses, or they are manufactured into each integrated circuit module respectively, or multiple modules or steps therein are manufactured into a single integrated circuit module. Thus, the disclosure is not limited to a combination of any specific hardware and software.
The above are only the preferred embodiments of the disclosure, and are not intended to limit the disclosure. There may be various modifications and variations in the disclosure for those skilled in the art. Any modifications, equivalent replacements, improvements and the like made within the spirit and principle of the disclosure shall fall within the scope of protection of the disclosure.

Claims

What is claimed is:

1. A method for general machine translation engine-oriented individualized translation, comprising:
obtaining a translation content input by a user;
obtaining an on-line translation result of the translation content;
retrieving, at least one translation instance in a bilingual translation instance library by similarity, based on the translation content;
generating, when performing incremental alignment on the translation instance by using the on-line translation result, a confusion network; and decoding the confusion network, so as to acquire a translation candidate result.

2. The method according to claim 1, wherein before obtaining the translation content input by the user, the method further comprises:
obtaining the bilingual translation instance library, wherein the bilingual translation instance library comprises multiple groups of sentence pairs; and automatically aligning sentence pairs that are not subjected to word alignment in the bilingual translation instance library, so as to acquire a word-aligned bilingual sentence pair, wherein the bilingual sentence pair comprises: a source language and a target language corresponding to the source language.

3. The method according to claim 2, wherein the step of retrieving, at least one translation instance in a bilingual translation instance library by similarity, based on the translation content comprises:
obtaining a vector value of the translation content;
obtaining source language vector values of all translation instances in the bilingual translation instance library;
performing similarity calculation according to a vector value of the translation content and the source language vector values of all the translation instances in the bilingual translation instance library, so as to generate a plurality of similarity values of the translation content; and selecting N translation instances corresponding to the translation content according to the similarity values, N being a natural number.

4. The method according to claim 3, wherein the step of performing similarity calculation according to a vector value of the translation content and the source language vector values of all the translation instances in the bilingual translation instance library so as to generate a plurality of similarity values of the translation content, comprises:
calculating a similarity value P of the translation content by means of the following formula:
$$P = \mathrm{CosSim}(ex\_F, F) = \frac{ex\_F \cdot F}{\lVert ex\_F \rVert \times \lVert F \rVert}$$
wherein $ex\_F$ is a source language vector value of the translation instance, $F$ is a vector value of the translation content, $ex\_F \cdot F$ is an inner product of the source language vector value of the translation instance and the vector value of the translation content, and $\lVert ex\_F \rVert \times \lVert F \rVert$ is a product of the norms of the source language vector value of the translation instance and the vector value of the translation content.

5. The method according to claim 4, wherein the step of selecting N translation instances corresponding to the translation content according to the similarity values comprises:
sorting multiple similarity values of the translation content according to the value size; and extracting corresponding translation instances according to the sorted similarity values to acquire N translation instances.

6. The method according to any one of claims 1 to 5, wherein the step of generating, when performing incremental alignment on the translation instance by using the on-line translation result, a confusion network comprises:
setting the on-line translation result as an original translation framework;
obtaining an alignment result by performing incremental alignment on the target language of the translation instance and the original translation framework in sequence; and connecting all words of translations in the on-line translation result and N translation instances according to the alignment result to form the confusion network.

7. The method according to claim 6, wherein the step of decoding the confusion network so as to acquire a translation candidate result comprises:
decoding the confusion network according to sentence characteristics so as to generate at least one decoding result;
wherein when the word confidence characteristics of the confusion network are calculated, the confidence of the confusion network is estimated to obtain a confidence estimation result, the step comprising:
calculating the confidence estimation result via the following formula:
$$C(w) = \sum_{i=0}^{n} \lambda_i C_i \, p(w \mid E_i), \qquad p(w \mid E_i) = \frac{1}{1 + e^{-c}}$$
wherein n is a total number of translation instances; when i=0, $E_i$ is an on-line translation result of a general translation engine, and when i ≥ 1, $E_i$ is an i-th translation instance; $C_i$ is a source language similarity value of the i-th translation instance; $\lambda_i$ is a 0-1 characteristic function; e is the base of the natural logarithm; and c is a count value of a counter.

8. A device for general machine translation engine-oriented individualized translation, comprising:
a first obtaining component, configured to acquire a translation content input by a user;
a second obtaining component, configured to acquire an on-line translation result of the translation content;
a retrieving component, configured to perform, based on the translation content, similarity retrieval in a bilingual translation instance library, so as to acquire at least one translation instance;
an incremental alignment processing component, configured to perform incremental alignment on the translation instance by using the on-line translation result, so as to generate a confusion network; and a translation generation component, configured to decode the confusion network, so as to acquire a translation candidate result.

9. The device according to claim 8, further comprising:
a third obtaining component, configured to acquire the bilingual translation instance library, the bilingual translation instance library comprising multiple groups of sentence pairs; and an automatic alignment component, configured to automatically align sentence pairs not subjected to word alignment in the bilingual translation instance library, so as to acquire a word-aligned bilingual sentence pair, wherein the bilingual sentence pair comprises: a source language and a target language corresponding to the source language.

10. The device according to claim 9, wherein the retrieving component comprises:
a first sub-acquisition component, configured to acquire a vector value of the translation content;
a second sub-acquisition component, configured to acquire source language vector values of all translation instances in the bilingual translation instance library;
a processing component, configured to perform similarity calculation according to a vector value of the translation content and the source language vector values of all the translation instances in the bilingual translation instance library, so as to generate a plurality of similarity values of the translation content; and
a selection component, configured to select N translation instances corresponding to the translation content according to the similarity values, N being a natural number.

11. The device according to claim 10, wherein the processing component comprises:
a similarity calculation component, configured to calculate a similarity value P of the translation content by means of the following formula:
$$P = \frac{\langle ex\_F_i,\ F \rangle}{\lVert ex\_F_i \rVert \cdot \lVert F \rVert}$$
wherein $ex\_F_i$ is the source language vector value of the i-th translation instance, $F$ is the vector value of the translation content, $\langle ex\_F_i, F \rangle$ is the inner product of the source language vector value of the translation instance and the vector value of the translation content, and $\lVert ex\_F_i \rVert \cdot \lVert F \rVert$ is the product of the norms of the source language vector value of the translation instance and the vector value of the translation content.
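
For illustration only, the similarity value P of claim 11 (inner product divided by the product of the norms, i.e. cosine similarity) can be sketched as follows. The function name `cosine_similarity`, the use of plain float sequences as vector values, and the choice to return 0.0 for a zero vector are assumptions of this sketch and are not specified by the claims.

```python
import math
from typing import Sequence


def cosine_similarity(ex_f: Sequence[float], f: Sequence[float]) -> float:
    """Similarity P between an instance's source vector ex_f and the input vector f."""
    if len(ex_f) != len(f):
        raise ValueError("vectors must have the same dimension")
    # Numerator: inner product of the two vector values.
    inner = sum(a * b for a, b in zip(ex_f, f))
    # Denominator: product of the Euclidean norms of the two vector values.
    norms = math.sqrt(sum(a * a for a in ex_f)) * math.sqrt(sum(b * b for b in f))
    # Assumption of this sketch: a zero vector is treated as having similarity 0.
    return inner / norms if norms else 0.0
```

Under these assumptions, identical directions give a value of 1 and vectors with no shared non-zero dimension give 0.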

12. The device according to claim 11, wherein the selection component comprises:
a sorting component, configured to sort multiple similarity values of the translation content according to the value size; and
an extraction component, configured to extract corresponding translation instances according to the sorted similarity values to acquire N translation instances.
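
The sorting and extraction of claim 12 might look like the minimal sketch below. The name `select_top_n`, the descending sort order, and the tie-breaking by library order are assumptions; the claim only requires sorting by value size and extracting N instances.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_top_n(instances: Sequence[T], similarities: Sequence[float], n: int) -> List[T]:
    """Return the n instances with the highest similarity values."""
    if len(instances) != len(similarities):
        raise ValueError("one similarity value is required per instance")
    # Sort instance indices by similarity, largest first (assumed ordering).
    # Python's sort is stable, so ties keep their original library order.
    order = sorted(range(len(instances)), key=lambda i: similarities[i], reverse=True)
    # Extract the instances that correspond to the n best similarity values.
    return [instances[i] for i in order[:n]]
```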

13. The device according to any one of claims 8 to 12, wherein the incremental alignment processing component comprises:
a setting component, configured to set the on-line translation result as an original translation framework;
a third sub-acquisition component, configured to perform incremental alignment on the target language of the translation instance and the original translation framework in sequence, so as to obtain an alignment result; and
a generation component, configured to connect all words of translations in the on-line translation result and N translation instances according to the alignment result to form the confusion network.
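
As a rough sketch of the three components of claim 13, the code below takes the on-line translation result as the framework, aligns each instance translation against it in sequence, and merges the aligned words into columns of a confusion network. The claims do not name an alignment algorithm; this sketch substitutes Python's `difflib.SequenceMatcher` as a stand-in, and the names `build_confusion_network` and `EPS`, as well as the vote-count representation of each column, are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List

EPS = ""  # empty arc: a hypothesis contributes no word in this column


def build_confusion_network(online: List[str], instances: List[List[str]]) -> List[Dict[str, int]]:
    """Merge instance translations into a network seeded by the on-line result."""
    # The on-line translation result is the original framework: one column per word.
    columns = [{w: 1} for w in online]
    backbone = list(online)  # representative word of each column, used for alignment
    seen = 1                 # number of hypotheses merged so far
    for hyp in instances:
        new_cols, new_backbone = [], []
        # Stand-in alignment (assumption): edit operations from difflib.
        matcher = SequenceMatcher(a=backbone, b=hyp, autojunk=False)
        for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
            for k in range(max(i2 - i1, j2 - j1)):
                if k < i2 - i1:
                    # Existing column: copy it and keep its representative word.
                    col, rep = dict(columns[i1 + k]), backbone[i1 + k]
                else:
                    # Inserted word: earlier hypotheses all take the empty arc here.
                    col, rep = {EPS: seen}, hyp[j1 + k]
                word = hyp[j1 + k] if k < j2 - j1 else EPS
                col[word] = col.get(word, 0) + 1
                new_cols.append(col)
                new_backbone.append(rep)
        columns, backbone = new_cols, new_backbone
        seen += 1
    return columns
```

With the framework "the cat sat" and the instance translations "the dog sat" and "the cat sat down", every column of the resulting network holds exactly three votes, one per hypothesis, with the empty arc standing in wherever a hypothesis has no word.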

14. The device according to claim 13, wherein the translation generation component comprises:
a decoding component, configured to decode the confusion network according to sentence characteristics so as to generate at least one decoding result; and
a confidence component, configured to estimate, when the word confidence characteristics of the confusion network are calculated, the confidence of the confusion network to obtain a confidence estimation result;
wherein the confidence component comprises:
a calculation component, configured to calculate the confidence estimation result via the following formula:
wherein n is the total number of translation instances; when $i = 0$, $E_i$ is the on-line translation result of a general translation engine, and when $i \geq 1$, $E_i$ is the i-th translation instance; $C_i$ is the source language similarity value of the i-th translation instance; $\lambda_i$ is a 0-1 characteristic function; e is the base of the natural logarithm; and c is a count value of a counter.

15. A computer terminal, configured to execute program codes of the steps provided by the general machine translation engine-oriented individualized translation method according to any one of claims 1 to 7.

16. A storage medium, configured to store program codes executed by the general machine translation engine-oriented individualized translation method according to any one of claims 1 to 7.