CN111611814B - Neural machine translation method based on similarity perception - Google Patents

Neural machine translation method based on similarity perception

Info

Publication number
CN111611814B
CN111611814B (application CN202010384024.7A)
Authority
CN
China
Prior art keywords
similarity
template
translation
sentences
potential
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010384024.7A
Other languages
Chinese (zh)
Other versions
CN111611814A (en)
Inventor
冯冲
张天夫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202010384024.7A priority Critical patent/CN111611814B/en
Publication of CN111611814A publication Critical patent/CN111611814A/en
Application granted granted Critical
Publication of CN111611814B publication Critical patent/CN111611814B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/47Machine-assisted translation, e.g. using translation memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a neural machine translation method based on similarity perception, belonging to the technical field of natural language processing and machine translation. First, a structure translation memory and a corresponding structure similarity algorithm are constructed. A template translation memory and a corresponding template similarity algorithm are then constructed. Next, high-potential sentences in the string, structure, and template dimensions of the test set are pre-identified. Multi-dimensional similarity prior knowledge is then constructed, and multi-dimensional similarity retrieval is performed over all parallel sentences in the training set. Meanwhile, a posterior-regularized objective function incorporates the discrete similarity prior knowledge into the neural machine translation objective, and the prior-knowledge parameters are updated iteratively to guide the training process. Finally, the trained neural translation model translates the multi-dimensional high-potential sentences to be translated. The invention can retrieve the most similar sentences at a finer granularity and reduces the review time of a human translator.

Description

Neural machine translation method based on similarity perception
Technical Field
The invention relates to techniques for modeling the multi-dimensional similarity of parallel corpora in neural machine translation, identifying high-potential sentences in a test set, and optimizing the translation performance of those sentences; in particular, it relates to a neural machine translation system and method based on similarity perception, and belongs to the technical field of natural language processing and machine translation.
Background
At present, neural machine translation outperforms traditional statistical machine translation across a wide range of natural languages and is widely adopted in multi-scenario computer-aided translation tasks. However, most existing neural machine translation methods focus on improving overall translation performance, and little attention has been paid to the workload of human translators.
In a computer-aided translation scenario, a human translator receives a translation generated by a machine translation model, first reviews it for errors, and then post-edits the erroneous parts to ensure final translation quality. Measuring review and post-editing time is the most direct and effective way to quantify a human translator's workload. With conventional neural machine translation methods, the translator does not know the quality of any given translation in advance, so an equal amount of review effort must be spent on every translation. In this setting, research that only improves translation performance over the entire test set can only reduce post-editing time.
The concept of high-potential sentences is worth exploring as a way to make computer-aided translation save human translator workload. High-potential sentences are source-side sentences with the potential to yield high-confidence translations close to the standard translation, so their translations do not require the human translator to spend much review and post-editing time. However, research on high-potential sentences in neural machine translation is very limited. Existing work uses a dynamic neural machine translation model to process the test set at fine granularity by computing string similarity with the training set, but it neither defines the concept of a high-potential sentence nor performs translation optimization training on that portion of sentences.
In addition, several questions remain open: whether the definition of high-potential sentences can go beyond the string level, so that source sentences of higher potential are identified from their syntactic and semantic similarity to training-set sentences; whether a machine translation system can perform targeted optimization on high-potential sentences to guarantee the output of high-quality translations for that portion; and whether such targeted optimization is friendlier to human translators in a real translation workflow, saving overall review and post-editing time.
In summary, the recognition of high-potential sentences and the optimization of their translation performance remain urgent open problems in machine translation, and no machine translation system or related technical disclosure that treats the high-potential sentence concept comprehensively has yet been published.
Disclosure of Invention
The invention aims to solve the technical problem that, because existing machine translation systems cannot identify and optimize high-potential sentences among the sentences to be translated in practical applications, human translators face redundant review and a large post-editing workload; to this end, it provides a neural machine translation method based on similarity perception.
The invention firstly constructs a structure translation memory base and a corresponding structure similarity algorithm. And then constructing a template translation memory base and a corresponding template similarity algorithm. And then, identifying high-potential sentences of the dimensions of character strings, structures and templates in the test set in advance. And then, constructing multi-dimensional similarity priori knowledge, and performing multi-dimensional similarity retrieval on all parallel sentences in the training set. Meanwhile, a posterior regular target function is utilized, the discrete similarity priori knowledge is merged into a neural machine translation target function (namely a neural translation model), and parameters of the priori knowledge are continuously updated in an iterative manner to guide a training process. And finally, respectively translating the multi-dimensional high-potential sentences to be translated by using the trained neural translation model.
The invention reduces the review time of a human translator by pre-identifying high potential sentences of string, structure and template dimensions in a test set.
The technical scheme adopted by the invention is as follows:
a neural machine translation method based on similarity perception comprises the following steps:
Step 1: construct a structure translation memory and a corresponding structure similarity algorithm for retrieving high-potential structure sentences.
Step 2: construct a template translation memory and a corresponding template similarity algorithm for retrieving high-potential template sentences.
Step 3: pre-identify high-potential sentences in the string, structure, and template dimensions of the test set, based on the coarse-to-fine-grained identification strategies of the multi-dimensional translation memory method.
Step 4: construct multi-dimensional similarity prior knowledge and perform multi-dimensional similarity retrieval on all parallel sentences in the training set; incorporate the multi-dimensional similarity prior knowledge into the neural translation model to guide the training process.
Step 5: translate the multi-dimensional high-potential sentences to be translated with the trained neural translation model.
Advantageous effects
Compared with the prior art, the invention has the following beneficial effects and advantages:
1. The invention uses self-defined structure translation memory and template translation memory methods to represent the textual, structural, and semantic similarity between sentences, which can serve as criteria for retrieving sentences; combining traditional translation memory with the translation memory methods defined in this invention makes it possible to retrieve the most similar sentences at a finer granularity.
2. Unlike existing machine translation systems, the system focuses on improving the translation quality of the subset of high-potential sentences rather than of all sentences; it can inform the human translator that these translations are of high quality and need less review, reducing both review and post-editing time.
3. The recognition module of the system is also applicable as a test-set preprocessing step for general machine translation systems, and can identify high-potential sentences simply and conveniently, thereby reducing the human translator's review time.
4. The invention is the first to define human translator workload, namely the sum of review time and post-editing time, and the experiments use a metric that better reflects this workload, the Translation Error Rate (TER), to verify the effectiveness of the system.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of data samples for various translation memory methods used in the method of the present invention;
FIG. 3 is a graphical illustration of the effect of prior knowledge of similarity in different dimensions on human translator review time.
Detailed Description
The invention is further described in the following with reference to the accompanying figure 1 of the specification.
A neural machine translation method based on similarity perception comprises the following steps:
Step 1: construct a structure translation memory and a corresponding structure similarity algorithm for retrieving high-potential structure sentences.
Specifically, constituency syntax tree analysis is first used to extract parallel syntax tree pairs from the parallel sentence pairs of the training corpus. A structure translation memory is then built from the parallel syntax tree pairs, and a corresponding structure similarity algorithm is designed to retrieve high-potential structure sentences and compute structural similarity.
As shown in fig. 1, the structure translation memory includes two data composition methods:
(1) In the recognition module, one structure translation memory is formed from all source-language syntax trees of the training set.
(2) In the translation module, following a cross-validation scheme, K structure translation memories are formed from partial training-set pairs of source- and target-language parallel syntax trees. For example, with K = 10 in cross validation, the training set is divided into 10 subsets, and each structure translation memory is built from 9 of them, yielding 10 structure translation memories in total.
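For illustration, the following is a minimal sketch of this cross-validation construction (K = 10, per the example above); the list-of-pairs representation and the function name are assumptions of the sketch, not the patented data structure.

```python
# Hypothetical sketch: build K translation memories, each from the K-1 folds
# that exclude one held-out fold, so every training sentence can be retrieved
# against a memory that does not contain it.
def build_fold_memories(parallel_pairs, K=10):
    folds = [parallel_pairs[i::K] for i in range(K)]  # K disjoint subsets
    return [
        [pair for j, fold in enumerate(folds) if j != k for pair in fold]
        for k in range(K)                             # memory k omits fold k
    ]

memories = build_fold_memories([("src%d" % i, "tgt%d" % i) for i in range(100)])
assert len(memories) == 10 and len(memories[0]) == 90
```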
Wherein, the construction of the structure translation memory library by using the parallel syntax tree pair comprises the following steps:
firstly, a component tree syntactic analysis tool is utilized to carry out component tree syntactic analysis on parallel sentences in a training set, and the extracted parallel syntactic trees are serialized.
Then, all leaf nodes (i.e., terminal words) of the serialized parallel syntax trees are deleted, yielding serialized syntax trees without lexical information. Because syntactic structures within the same domain overlap, deduplicating the syntax trees leaves roughly 90% of the original number of serialized trees.
Then, an open-source indexing tool such as Lucene is used to build an index over the deduplicated syntax trees, with the number of returned matches set to 10, yielding a structure translation memory engine that can be searched by string similarity. This engine provides the first-stage structure retrieval service.
Finally, a tree structure similarity algorithm based on the Zhang-Shasha tree edit distance is used as the second-stage structure retrieval service to find the syntax tree most similar to the currently retrieved syntax tree, together with the structural similarity between them. The tree structure similarity is computed as follows:
Score_Tree-TM(X, X′) = 1 - D_tree_edit(X, X′) / max(|X|, |X′|)   (1)

where X is the syntax tree of the current query sentence, X′ is a syntax tree returned by the first-stage translation memory engine, D_tree_edit is the Zhang-Shasha tree edit distance, and |·| is the number of tree nodes.
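As a hedged illustration of formula (1), the sketch below computes the normalized tree edit distance with the open-source `zss` package (a Python implementation of the Zhang-Shasha algorithm); the toy node labels and the use of node count for |X| are assumptions of the example, not the patented implementation. The template similarity of step 2 (formula (2)) follows the same pattern.

```python
from zss import Node, simple_distance

def tree_size(node):
    """Number of nodes in a zss tree, used as |X| in formula (1)."""
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

def tree_similarity(x, x_prime):
    """Score = 1 - D_tree_edit(X, X') / max(|X|, |X'|)."""
    d = simple_distance(x, x_prime)  # Zhang-Shasha tree edit distance
    return 1.0 - d / max(tree_size(x), tree_size(x_prime))

# Two lexicon-free constituency trees (all leaf words already deleted):
X1 = Node("IP").addkid(Node("NP")).addkid(Node("VP").addkid(Node("NP")))
X2 = Node("IP").addkid(Node("NP")).addkid(Node("VP").addkid(Node("PP")))
print(tree_similarity(X1, X2))       # 0.75: one relabel among 4 nodes
```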
Step 2: construct a template translation memory and a corresponding template similarity algorithm for retrieving high-potential template sentences.
Specifically, first, semantic information is retained and a parallel template syntax tree pair is obtained on the parallel component syntax tree pair by using a template rule method. And then, constructing a template translation memory library by using the parallel template syntax tree pairs, and searching high-potential template sentences and calculating the template similarity by using a corresponding template similarity algorithm.
As shown in FIG. 1, the template translation memory has two data composition modes:
(1) In the recognition module, one template translation memory is formed from all source-language template syntax trees of the training set.
(2) In the translation module, following the cross-validation scheme, K template translation memories are formed from partial training-set pairs of source- and target-language parallel template syntax trees. For example, with K = 10 in cross validation, the training set is divided into 10 subsets, and each template translation memory is built from 9 of them, yielding 10 template translation memories in total.
Constructing the template translation memory from the parallel syntax tree pairs comprises the following steps:
firstly, a syntactic analysis tool is used for carrying out component tree syntactic analysis on parallel sentences in a training set, and the extracted parallel syntactic trees are serialized.
And then, deleting partial leaf nodes, namely terminal characters, of the serialized parallel syntax tree pairs according to the template rule to obtain the serialized parallel syntax tree pairs containing key semantic information words.
Because losing important semantic information makes the similarity representation inaccurate, the invention combines traditional translation memory with structure translation memory, characterizing important semantic information by preserving the leaf nodes (words) under certain part-of-speech tags in the syntax tree. Specifically, the template rules are shown in Table 1:
TABLE 1
[Table 1: template rules (rendered as an image in the original); it lists the part-of-speech tags whose leaf words are retained in the template tree.]
Meanwhile, the template translation memory filters out words under the other part-of-speech tags, which have high lexical variation and carry limited semantic information, such as 'NN' (common noun), 'VV' (verb), and 'JJ' (adjective or number). This resembles the translation-template concept, in which a word in a sentence is replaced by the symbol '$' to form a generalized template; here, the template translation memory retains the selected leaf nodes of the constituency tree while deleting the others, producing a template tree that keeps the important semantic information. Because syntactic structures within the same domain overlap, deduplicating the template syntax trees likewise leaves roughly 90% of the original number of serialized parallel template trees.
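As an illustration of the template rule, the sketch below prunes a constituency tree with NLTK; the retained-tag set is a hypothetical stand-in for the full rule set of Table 1.

```python
from nltk.tree import Tree

RETAINED_TAGS = {"PN", "DT", "CC", "P"}  # hypothetical retained POS tags (cf. Table 1)
# High-variation tags such as NN, VV, JJ fall through to the "delete word" branch.

def to_template_tree(t):
    """Keep leaf words only under retained tags; delete the other leaves."""
    if isinstance(t, str):                        # bare leaf string
        return t
    if len(t) == 1 and isinstance(t[0], str):     # preterminal (POS tag + word)
        kids = [t[0]] if t.label() in RETAINED_TAGS else []
        return Tree(t.label(), kids)
    return Tree(t.label(), [to_template_tree(c) for c in t])

tree = Tree.fromstring("(IP (NP (PN 他)) (VP (VV 阅读) (NP (NN 报告))))")
print(to_template_tree(tree))  # (IP (NP (PN 他)) (VP (VV ) (NP (NN ))))
```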
Then, the open-source indexing tool Lucene is used to build an index over the deduplicated, serialized parallel template syntax tree pairs, with the number of returned matches set to 10, yielding a template translation memory engine that can be searched by string similarity; this engine provides the first-stage template retrieval service.
Finally, a template structure similarity algorithm based on the Zhang-Shasha tree edit distance is used as the second-stage template retrieval service. Compared with the structure translation memory, the template translation memory computes, in addition to syntax tree similarity, the similarity of the retained part-of-speech-tagged terminal words, which are treated as child nodes of their part-of-speech tags. This finds the template syntax tree most similar to the currently retrieved template syntax tree, together with the template similarity between them, computed as follows:
Score_Template-TM(Y, Y′) = 1 - D_tree_edit(Y, Y′) / max(|Y|, |Y′|)   (2)
wherein, Y represents the template syntax tree of the current retrieval statement, and Y' represents the template syntax tree returned by the translation memory engine in the first stage retrieval.
Step 3: pre-identify high-potential sentences in the string, structure, and template dimensions of the test set, based on the coarse-to-fine-grained identification strategies of the multi-dimensional translation memory method.
And combining the structure translation memory base constructed in the step 1 and a corresponding structure similarity algorithm, and the template translation memory base constructed in the step 2 and a corresponding template similarity algorithm with a traditional translation memory method to generate an identification module based on multi-dimensional similarity perception, wherein the identification module is used for identifying the multi-dimensional high-potential sentences to be translated in the test set.
On top of the conventional machine translation pipeline, the invention adds a recognition module that preprocesses the test set, identifying the multi-dimensional high-potential sentences to be translated for the human translator and thereby reducing review time. In contrast to the translation module, the recognition module is non-parametric and applicable to any machine translation model. Since only the source-side sentences of the test set are needed during recognition, the translation memory of this module is composed of training-set source-side sentences.
In order to identify high-potential sentences on the dimensions of texts, structures and templates, the invention provides three identification strategies from coarse granularity to fine granularity based on a multi-dimensional translation memory method:
Coarse-grained identification strategy: a string-level coarse-grained identification strategy (SACI) that identifies string high-potential sentences of the test set by retrieving string similarity. In the present invention, the similarity gating threshold of SACI (SACI-gate) is 0.6.
For each source sentence s_i in the test set, the m most similar candidate sentences {s′_1, s′_2, …, s′_m} are obtained by searching the traditional translation memory, where m is the number of retrieved candidates. Then the highest string similarity score is calculated with an edit-distance algorithm. Finally, 9 groups S_1, S_2, …, S_9 are obtained according to the string similarity score (e.g., FMS ∈ [0.7, 0.8)), and {S_6, S_7, S_8, S_9} are defined as string high-potential sentences (when the similarity gating threshold is 0.6).
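A minimal sketch of the SACI strategy under the stated threshold; the word-level Levenshtein distance and the mapping from FMS score to group index (S_k for FMS ∈ [k/10, (k+1)/10)) are illustrative assumptions, not the patented procedure.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance (single-row dynamic programming)."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
    return dp[-1]

def fms(s, s_prime):
    """Fuzzy match score: 1 - edit distance / max sentence length."""
    a, b = s.split(), s_prime.split()
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def saci(source, tm_sources, gate=0.6):
    """Return (best FMS, group index k of S_k, string-high-potential flag)."""
    score = max(fms(source, s) for s in tm_sources)
    k = min(9, max(1, int(score * 10)))  # assumed bucketing into S_1..S_9
    return score, k, score >= gate

print(saci("the cat sat on the mat", ["the cat sat on a mat", "dogs bark"]))
# -> (0.833..., 8, True): one differing word out of six
```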
Fine-grained identification strategy: two fine-grained identification strategies are involved.
The first is the structure fine-grained identification strategy (SAFI). The SAFI strategy identifies high-potential structure sentences within each string high-potential group S_i (S_i ∈ {S_6, S_7, S_8, S_9}). The strategy searches the structure translation memory to obtain an initial syntax tree set {t′_1, t′_2, …, t′_m} and calculates the highest structural similarity score with the tree similarity function. The SAFI similarity gating threshold (SAFI-gate) is set to 0.9 to decide whether the current sentence has high structural translation potential. This finally yields a structural high-potential set {T_6, T_7, T_8, T_9}.
The other is the template fine-grained identification strategy (TAFI). The TAFI strategy identifies high-potential template sentences within each string high-potential group S_i (S_i ∈ {S_6, S_7, S_8, S_9}); the TAFI similarity gating threshold (TAFI-gate) is set to 0.9. This finally yields a template high-potential set {T_6, T_7, T_8, T_9}.
Three examples of translation memory data retrieved by the recognition strategy are shown in FIG. 2.
Step 4: construct multi-dimensional similarity prior knowledge and perform multi-dimensional similarity retrieval on all parallel sentences in the training set; incorporate the multi-dimensional similarity prior knowledge into the neural translation model to guide the training process.
The translation performance of the high-potential sentences is improved by utilizing the similarity information among the sentences, so that the post-translation editing time of a human translator can be reduced.
Firstly, according to a multidimensional translation memory method, multidimensional similarity calculation is carried out on parallel sentences in a training set.
Specifically, the multi-dimensional similarity prior knowledge performs multi-dimensional similarity retrieval over all parallel sentences in the training set using a cross-validation method; a posterior-regularized objective function is then built, penalizing the log-likelihood training objective through the K-L divergence between the log-linear model encoding of the prior knowledge and the model posterior probability; the discrete similarity prior knowledge is thereby incorporated into the conventional neural machine translation objective, and the prior-knowledge parameters are updated iteratively.
The translation module uses three features, structure similarity, template similarity, and string similarity, all encoded into a log-linear model. The structure similarity and template similarity features search the structure translation memory and the template translation memory, respectively. The string similarity feature pushes the translation closer to the training set in the string dimension and is defined as:
φ_String-Simi(x, y) = (Score_TM(x, x′), Score_TM(y, y′))   (3)

where φ_String-Simi(x, y) returns a two-dimensional tensor, and Score_TM(x, x′) and Score_TM(y, y′) respectively represent the similarity scores of the current training source sentence x and of the corresponding sampled-subspace candidate translation y. x′ and y′ come from the m most similar parallel sentence pairs {(x′_1, y′_1), …, (x′_m, y′_m)}, the initial set returned by the translation memory search engine in the first retrieval step. The scores are computed as

Score_TM(x, x′) = 1 - D_string_edit(x, x′) / max(|x|, |x′|)
Score_TM(y, y′) = 1 - D_string_edit(y, y′) / max(|y|, |y′|),
with x ∈ x^(n), y ∈ S(x^(n)),

where D_string_edit(x, x′) is the string edit distance between the source-language sentence and the most similar sentence in the training set; D_string_edit(y, y′) is the string edit distance between the target-language sentence and the most similar sentence in the training set; and n indexes the N training-set sentences, with S(x^(n)) the sampled candidate translation subspace of x^(n).
The above equation shows the calculation of the string fuzzy match scores for x and y using a conventional edit distance algorithm, which corresponds to the second search step. In this way, the String-Simi feature encourages the NMT model to generate candidate translations y that are closer to the training set at the sentence level.
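A hedged sketch of the String-Simi feature of formula (3); `fms` is the fuzzy-match scorer from the SACI sketch above, and `tm_pairs` stands in for the initial set of m parallel pairs returned by first-stage retrieval — both are assumptions of the example, not the patented interface.

```python
def string_simi_feature(x, y, tm_pairs, fms):
    """phi_String-Simi(x, y) = (Score_TM(x, x'), Score_TM(y, y')).

    x:        current training source sentence
    y:        sampled candidate translation
    tm_pairs: [(x'_1, y'_1), ..., (x'_m, y'_m)] from the first retrieval step
    fms:      string fuzzy-match scorer (the second retrieval step)
    """
    # Pick the pair whose source side is most similar to x; its target side
    # then serves as y' for scoring the candidate translation.
    x_best, y_best = max(tm_pairs, key=lambda pair: fms(x, pair[0]))
    return (fms(x, x_best), fms(y, y_best))
```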
The three similarity features incorporate string, syntactic, and key semantic similarity information into the neural machine translation model. In particular, the benefit of guiding translation with syntactic and semantic similarity information is most pronounced for high-potential sentences: they are already highly similar to the training set in the string dimension, so the key factor blocking high-quality translation is erroneous syntactic or semantic information.
Next, posterior regularization is used to fuse the multi-dimensional similarity prior knowledge into the neural translation model and guide the training process.
Specifically, the multi-dimensional similarity features of the training set are encoded, as prior knowledge, into a neural machine translation model with an attention-based encoder-decoder architecture via posterior regularization. The machine translation objective function improved by posterior regularization is:
J(θ, γ) = λ_1 Σ_{n=1..N} log P(y^(n) | x^(n); θ) - λ_2 Σ_{n=1..N} KL( Q(y | x^(n); γ) ‖ P(y | x^(n); θ) )   (4)

In the above formula, the first summation is the loss function of maximum likelihood estimation, where θ represents the model parameters of the machine translation system. λ_1 and λ_2 respectively represent the weights of the maximum likelihood term and the posterior regularization term in the final loss function. Posterior regularization penalizes the log-likelihood training target through the K-L divergence between the log-linear model encoding of the prior knowledge, Q(y | x^(n); γ), and the model posterior probability P(y | x^(n); θ). The prior knowledge Q(y | x^(n); γ) is calculated from the following formula:

Q(y | x^(n); γ) = exp(γ · φ(x^(n), y)) / Σ_{y′} exp(γ · φ(x^(n), y′))   (5)
First, the translation module is initialized by maximum likelihood training, updating θ to convergence so as to avoid the risk of a poor posterior regularization initialization. Then the log-linear posterior regularization method encodes the similarity features into the attention-based neural machine translation model, yielding updated model parameters θ and the feature weights γ of the KL term.
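The following PyTorch sketch shows how a KL term of the form in formulas (4)-(5) can be added to the likelihood loss; the tensor shapes, the sampled candidate set, and the renormalization of P over that set are assumptions of the sketch, not the patented training code.

```python
import torch
import torch.nn.functional as F

def pr_loss(log_p_gold, log_p_cands, features, gamma, lam1=1.0, lam2=0.5):
    """Posterior-regularized loss: lam1 * MLE + lam2 * KL(Q || P).

    log_p_gold:  (N,)      log P(y_gold | x; theta)
    log_p_cands: (N, K)    log P(y_k | x; theta) over K sampled candidates
    features:    (N, K, D) similarity features phi(x, y_k)
    gamma:       (D,)      log-linear feature weights
    """
    mle = -log_p_gold.mean()                             # maximum-likelihood term
    log_q = torch.log_softmax(features @ gamma, dim=-1)  # Q(y | x; gamma), formula (5)
    log_p = torch.log_softmax(log_p_cands, dim=-1)       # P renormalized over candidates
    kl = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")  # KL(Q || P)
    return lam1 * mle + lam2 * kl
```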
Step 5: translate the multi-dimensional high-potential sentences to be translated with the trained neural translation model.
Example verification
The present invention was tested on Chinese-English (zh-en) translation.
(1) Experimental data set-up
To allow comparison with prior work and to approximate real translation scenarios, the invention was evaluated in the legal and news domains of Chinese-English translation, using all available parallel corpora in the LDC04 dataset to train, validate, and test the model. Most sentences in LDC04 are related, making it an ideal test bed for building translation memory datasets and evaluating the model. Chinese sentences are segmented with the word segmentation tool Jieba, and English sentences are tokenized with the scripts provided by Moses. For the training set, the maximum sentence length is limited to 50.
In both domains, test sets of different sizes are used. For the news domain, 1,000 and 3,000 sentences were randomly selected from the corpus as the development and test sets, respectively, and the remaining data was used for training. Because the similarity distribution of the test set is unbalanced, the test set is divided into 9 string-similarity groups by string similarity score, and roughly the same number of sentences is randomly selected from each group to create a new test set. For the legal domain, the development and test sets are each of size 1,000, and the same steps are repeated. The specific data sizes are shown in the table below.
TABLE 2
[Table 2: corpus statistics for the news and legal domains (rendered as an image in the original).]
The BLEU value is used as an evaluation metric. In addition, to evaluate the primary objective, i.e., reducing human translator workload, the invention introduces another automatic metric, the Translation Error Rate (TER). TER is defined as the minimum number of edit steps required to turn an output translation into the standard translation, divided by the length of the standard translation:
TER = D_edit(y_output, y_standard) / |y_standard|
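A hedged sketch of this metric: the minimal edit steps are approximated here by a word-level Levenshtein distance (full TER additionally counts block shifts as single edits, which this sketch omits).

```python
def ter(hypothesis, reference):
    """Approximate TER: word-level edit steps / reference length."""
    h, r = hypothesis.split(), reference.split()
    dp = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        prev, dp[0] = dp[0], i
        for j, rw in enumerate(r, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (hw != rw))
    return dp[-1] / len(r)

print(f"{ter('the cat sat on mat', 'the cat sat on the mat'):.3f}")  # 0.167
```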
(2) baseline system experimental setup:
RNNSEARCH: a standard attention-based neural machine translation system.
PR-NMT: a machine translation system that extends RNNSEARCH with a posterior regularization of a log-linear model.
For the PR-NMT baseline system, the two best-performing features are selected as its prior knowledge: Bilingual Dictionary (BD) and Length Ratio (LR). The neural network and prior-knowledge parameters are kept unchanged. To save computation time, the experiments approximate the P and Q distributions in posterior regularization and sample 10 rather than 80 candidate translations. The baseline systems and the proposed model are implemented on the open-source machine translation toolkit THUMT.
(3) The main experimental results are as follows: table 3 shows the results of the experiment.
TABLE 3
[Table 3: main results (BLEU and TER) on the full test sets and on the multi-dimensional high-potential sentence sets (rendered as images in the original).]
First, for each domain, a clear improvement in BLEU and TER scores is observed from left to right: translation performance differs markedly between the full test set and the string high-potential column, and again between the string high-potential column and the structure or template high-potential columns. The TER value, which reflects post-editing workload, drops sharply from left to right. For example, in the third-to-last row of the news domain, the full test set yields a TER of 53.29%, meaning that a 50-word translation would need at least 26 words to be post-edited. Relative to the full test set, the post-editing steps for string high-potential and structure high-potential sentences fall to 19 and 9, respectively. High-potential sentences thus contain fewer translation errors, which reduces review time for human translation; moreover, these results show that when a test-set source sentence is highly similar to the training set in the syntactic or template dimension, the review time is minimal.
Second, the results show that the proposed similarity features of the translation module help reduce post-editing time, the most effective being string similarity and structure similarity. On the one hand, in the news domain the method with the string similarity feature performs much better than RNNSEARCH and PR-NMT (with the LR feature), lowering TER by 2.29 and 1.37 points, respectively, on the full test set; on the multi-dimensional high-potential sets the gap widens to roughly 3 and 2 TER points, indicating that the proposed similarity features are more effective on high-potential sentences. On the other hand, the method with the structure similarity feature obtains the highest BLEU scores, leading the baseline systems by 2 to 3 BLEU points on almost all test sets in both domains.
(4) Review time comparison analysis:
FIG. 3 illustrates the effect of the different similarity-feature prior knowledge on human translator review time. On the left of FIG. 3, the structure and template high-potential sentences reach almost the same high BLEU score of 80%. Meanwhile, the BLEU gap to the string high-potential sentences keeps widening as the similarity interval decreases, indicating that the recognition module identifies source sentences whose translations approach the standard translation and require almost no review, while still recalling a large number of candidate sentences. On the other hand, the right of FIG. 3 shows that the structure and template high-potential sentences are only a small fraction of the string high-potential sentences. Taking both factors together, the structure and template high-potential sentences are superior to the string high-potential sentences.
(5) Post-translation edit time analysis:
This experiment analyzes the model's ability to reduce post-editing time. Assuming each post-editing operation takes the human translator 1 second on average, the TER formula gives the approximate post-editing time for the full test set and for the multi-dimensional high-potential sentences. As shown in Table 4, the human translator saves 4.12% of the time compared with RNNSEARCH (427.67 vs 446.03). Moreover, the method achieves notable gains on high-potential sentences, especially structure and template high-potential sentences (saving 13.92% and 10.88% of post-editing time, respectively). Time-based metrics are important for measuring post-editing effort, and these results directly demonstrate that the similarity features substantially reduce post-editing time.
TABLE 4
[Table 4: approximate post-editing time comparison (rendered as an image in the original).]

Claims (7)

1. A neural machine translation method based on similarity perception is characterized by comprising the following steps:
step 1: constructing a structure translation memory base and a corresponding structure similarity algorithm for retrieving high-potential structure sentences;
step 2: constructing a template translation memory base and a corresponding template similarity algorithm for retrieving high-potential template sentences;
step 3: pre-identifying high-potential sentences in the string, structure, and template dimensions of a test set based on the coarse-to-fine-grained identification strategies of a multi-dimensional translation memory method;
combining the structure translation memory base constructed in the step 1 and a corresponding structure similarity algorithm, and the template translation memory base constructed in the step 2 and a corresponding template similarity algorithm with a traditional translation memory method to form an identification module based on multi-dimensional similarity perception, wherein the identification module is used for identifying multi-dimensional high-potential sentences to be translated in a test set;
in order to identify high-potential sentences on the dimensions of texts, structures and templates, three identification strategies from coarse granularity to fine granularity based on a multi-dimensional translation memory method are adopted:
coarse-grained identification strategy: a string-level coarse-grained identification strategy identifies string high-potential sentences of the test set by retrieving string similarity, with a similarity gating threshold of 0.6;
for each source sentence s_i in the test set, the m most similar candidate sentences {s′_1, s′_2, …, s′_m} are obtained by searching the traditional translation memory, where m represents the number of retrieved candidate sentences; then the highest string similarity score is calculated using an edit-distance algorithm; finally, 9 groups S_1, S_2, …, S_9 are obtained according to the string similarity score, and {S_6, S_7, S_8, S_9} are defined as string high-potential sentences;
the fine-grained identification strategy comprises two fine-grained identification strategies:
the first is a structure fine-grained identification strategy, which identifies high-potential structure sentences in each string high-potential sentence set S_i (S_i ∈ {S_6, S_7, S_8, S_9}); the strategy searches the structure translation memory to obtain an initial syntax tree set {t′_1, t′_2, …, t′_m} and calculates the highest structural similarity score using a tree similarity function; the similarity gating threshold of the strategy is set to 0.9 to determine whether the current sentence has structurally high translation potential; a structural high-potential set {T_6, T_7, T_8, T_9} is finally obtained;
the other is a template fine-grained identification strategy, which identifies high-potential template sentences in each string high-potential group S_i (S_i ∈ {S_6, S_7, S_8, S_9}), with the similarity gating threshold of the strategy set to 0.9; a template high-potential set {T_6, T_7, T_8, T_9} is finally obtained;
step 4: constructing multi-dimensional similarity prior knowledge and performing multi-dimensional similarity retrieval on all parallel sentences in a training set; incorporating the multi-dimensional similarity prior knowledge into a neural translation model for guiding a training process;
step 5: translating the multi-dimensional high-potential sentences to be translated with the trained neural translation model.
2. The neural machine translation method based on similarity perception according to claim 1, wherein the step 1 comprises the following steps:
firstly, extracting parallel syntactic tree pairs on parallel sentence pairs of a training corpus set by utilizing a component syntactic tree analysis method;
and then, constructing a structure translation memory library by using the parallel syntax tree pair, and designing a corresponding structure similarity algorithm to retrieve the high-potential structure sentence and calculate the structure similarity.
3. The neural machine translation method based on similarity perception according to claim 2, wherein the building of the structural translation memory base by using the parallel syntax tree pairs comprises the following steps:
firstly, carrying out component tree syntactic analysis on parallel sentences in a training set by utilizing a component tree syntactic analysis tool, and carrying out serialization processing on an extracted parallel syntactic tree;
then, deleting all leaf nodes of the serialized parallel syntax trees to obtain serialized syntax trees without lexical information; deduplication of the syntax trees leaves roughly 90% of the original number of serialized trees;
then, an index is built for the deduplicated serialized syntax trees, and the number of returned matches is set to 10, so that a structure translation memory engine searchable by string similarity is obtained, the engine providing a first-stage structure retrieval service;
and finally, using a tree structure similarity algorithm as a second-stage structure retrieval service to find the syntax tree most similar to the current retrieved syntax tree and the structure similarity information between the syntax tree and the current retrieved syntax tree, wherein the tree structure similarity is shown as the following formula:
Score_Tree-TM(X, X′) = 1 - D_tree_edit(X, X′) / max(|X|, |X′|)   (1)
wherein, X represents the syntax tree of the current retrieval statement, and X' represents the syntax tree returned by the translation memory engine in the first stage.
4. The neural machine translation method based on similarity perception according to claim 1, wherein the step 2 comprises the following steps:
firstly, reserving semantic information and obtaining a parallel template syntax tree pair on the parallel component syntax tree pair by utilizing a template rule method;
and then, constructing a template translation memory base by using the parallel template syntax tree pairs, and searching high-potential template sentences and calculating the template similarity by using a corresponding template similarity algorithm.
5. The neural machine translation method based on similarity perception according to claim 4, wherein the building of the template translation memory library by using the parallel syntax tree pairs comprises the following steps:
firstly, carrying out component tree syntactic analysis on parallel sentences in a training set by utilizing a syntactic analysis tool, and carrying out serialization processing on an extracted parallel syntactic tree;
then, according to a set template rule, deleting partial leaf nodes, namely terminal characters, of the serialized parallel syntax tree pairs to obtain serialized parallel syntax tree pairs containing key semantic information words;
then, an index is established for the deduplicated serialized parallel template syntax tree pairs, and the number of returned matches is set to 10, so that a template translation memory engine searchable by string similarity is obtained, the engine providing a first-stage template retrieval service;
finally, a template structure similarity algorithm is used as a second-stage template retrieval service; the template translation memory calculates the similarity of terminal words of reserved part-of-speech tags besides the syntactic tree similarity, and the similarity is calculated by taking the terminal words as sub-nodes of the part-of-speech tags, so that the most similar template syntactic tree to the current retrieval template syntactic tree and the structural similarity information between the syntactic tree and the current retrieval template syntactic tree are found; the template structural similarity is shown as follows:
Score_Template-TM(Y, Y′) = 1 - D_tree_edit(Y, Y′) / max(|Y|, |Y′|)   (2)
wherein Y represents the template syntax tree of the currently retrieved sentence, and Y' represents the template syntax tree returned by the translation memory engine at the first stage.
6. The neural machine translation method based on similarity perception according to claim 4, wherein the method for performing multidimensional similarity calculation on parallel sentences in the training set according to the multidimensional translation memory method in step 4 is as follows:
multidimensional similarity priori knowledge is used for carrying out multidimensional similarity retrieval on all parallel sentences in a training set by using a cross validation method; a posterior regularization method is utilized, and a training target of log likelihood is punished through K-L divergence between the logarithm linear model coding of prior knowledge and the model posterior probability; the discrete similarity priori knowledge is merged into a traditional neural machine translation target function, and parameters of the priori knowledge are continuously updated in an iterative mode;
integrating character strings, syntax and important semantic similar information into a neural machine translation model through three similarity characteristics;
the three similarity characteristics comprise a structure similarity characteristic, a template similarity characteristic and a character string similarity characteristic and are used for being coded into a log-linear model; the structure similarity characteristic and the template similarity characteristic are respectively used for searching a structure translation memory library and a template translation memory library; the character string similarity characteristic enables the translated text to be closer to a training set in the dimension of the character string, and is defined as follows:
φ_String-Simi(x, y) = (Score_TM(x, x′), Score_TM(y, y′))   (3)

wherein φ_String-Simi(x, y) returns a two-dimensional tensor, and Score_TM(x, x′) and Score_TM(y, y′) respectively represent the similarity scores of the current training source sentence x and the corresponding sampled-subspace candidate translation y; x′ and y′ come from the m most similar parallel sentence pairs {(x′_1, y′_1), …, (x′_m, y′_m)}, the initial set returned by the translation memory search engine in the first retrieval step;

Score_TM(x, x′) = 1 - D_string_edit(x, x′) / max(|x|, |x′|)
Score_TM(y, y′) = 1 - D_string_edit(y, y′) / max(|y|, |y′|)

wherein D_string_edit(x, x′) represents the string edit distance between the source-language sentence and the most similar sentence in the training set; D_string_edit(y, y′) represents the string edit distance between the target-language sentence and the most similar sentence in the training set; and N represents the number of training-set sentences.
7. The neural machine translation method based on similarity perception according to claim 4, wherein step 4 incorporates multi-dimensional similarity prior knowledge into the neural translation model by using posterior regularization knowledge, and the method for guiding the training process is as follows:
encoding the multi-dimensional similarity features contained in the training set, as prior knowledge, into a neural machine translation model with an attention-based encoder-decoder framework by using posterior regularization; the machine translation objective function improved by posterior regularization is shown as follows:
J(θ, γ) = λ_1 Σ_{n=1..N} log P(y^(n) | x^(n); θ) - λ_2 Σ_{n=1..N} KL( Q(y | x^(n); γ) ‖ P(y | x^(n); θ) )   (4)

in the above formula, the first summation represents the loss function of the maximum likelihood estimate, wherein θ represents the model parameters of the machine translation system; λ_1 and λ_2 respectively represent the weights of the maximum likelihood term and the posterior regularization term in the final loss function; posterior regularization penalizes the log-likelihood training target through the K-L divergence between the log-linear model encoding of the prior knowledge, Q(y | x^(n); γ), and the model posterior probability P(y | x^(n); θ); wherein the prior knowledge Q(y | x^(n); γ) is calculated from the following formula:

Q(y | x^(n); γ) = exp(γ · φ(x^(n), y)) / Σ_{y′} exp(γ · φ(x^(n), y′))   (5)
firstly, training an initialization translation module through maximum likelihood estimation, and updating theta into a convergence state so as to avoid the risk of posterior regular initialization;
then, by utilizing a log-linear posterior regularization method, similarity features are coded into the attention-based neural machine translation model, so as to obtain a new parameter theta and a KL divergence parameter gamma.
CN202010384024.7A 2020-05-08 2020-05-08 Neural machine translation method based on similarity perception Active CN111611814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010384024.7A CN111611814B (en) 2020-05-08 2020-05-08 Neural machine translation method based on similarity perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010384024.7A CN111611814B (en) 2020-05-08 2020-05-08 Neural machine translation method based on similarity perception

Publications (2)

Publication Number Publication Date
CN111611814A CN111611814A (en) 2020-09-01
CN111611814B true CN111611814B (en) 2022-09-23

Family

ID=72203430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010384024.7A Active CN111611814B (en) 2020-05-08 2020-05-08 Neural machine translation method based on similarity perception

Country Status (1)

Country Link
CN (1) CN111611814B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113283250B (en) * 2021-05-26 2024-06-21 南京大学 Automatic machine translation test method based on syntactic component analysis
CN113408307B (en) * 2021-07-14 2022-06-14 北京理工大学 Neural machine translation method based on translation template
CN113919371B (en) * 2021-09-06 2022-05-31 山东智慧译百信息技术有限公司 Matching method of translation corpus
CN114564933A (en) * 2022-01-12 2022-05-31 甲骨易(北京)语言科技股份有限公司 Personalized machine translation training method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107329961A (en) * 2017-07-03 2017-11-07 西安市邦尼翻译有限公司 A kind of method of cloud translation memory library Fast incremental formula fuzzy matching
CN109299479A (en) * 2018-08-21 2019-02-01 苏州大学 Translation memory is incorporated to the method for neural machine translation by door control mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107329961A (en) * 2017-07-03 2017-11-07 西安市邦尼翻译有限公司 A kind of method of cloud translation memory library Fast incremental formula fuzzy matching
CN109299479A (en) * 2018-08-21 2019-02-01 苏州大学 Translation memory is incorporated to the method for neural machine translation by door control mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Understanding Subtitles by Character-Level Sequence-to-Sequence Learning";Haijun Zhang 等;《IEEE Transactions on Industrial Informatics》;20170430;全文 *
"多策略切分粒度的藏汉双向神经机器翻译研究";沙九 等;《厦门大学学报(自然科学版)》;20200323;第59卷(第2期);全文 *

Also Published As

Publication number Publication date
CN111611814A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN111611814B (en) Neural machine translation method based on similarity perception
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
Yu et al. Beyond Word Attention: Using Segment Attention in Neural Relation Extraction.
Ahmed et al. Improving tree-LSTM with tree attention
Zettlemoyer et al. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars
CN105068997B (en) The construction method and device of parallel corpora
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN112926337B (en) End-to-end aspect level emotion analysis method combined with reconstructed syntax information
CN113408307B (en) Neural machine translation method based on translation template
Zhang et al. Similarity-aware neural machine translation: reducing human translator efforts by leveraging high-potential sentences with translation memory
Long [Retracted] The Construction of Machine Translation Model and Its Application in English Grammar Error Detection
CN113515955A (en) Semantic understanding-based online translation system and method from text sequence to instruction sequence
Vashistha et al. Active learning for neural machine translation
Rikters Hybrid machine translation by combining output from multiple machine translation systems
NL2031111B1 (en) Translation method, device, apparatus and medium for spanish geographical names
Seifossadat et al. Stochastic Data-to-Text Generation Using Syntactic Dependency Information
CN115186671A (en) Method for mapping noun phrases to descriptive logic concepts based on extension
Shen et al. Evaluating Code Summarization with Improved Correlation with Human Assessment
CN113971403A (en) Entity identification method and system considering text semantic information
Gao et al. Syntax-based chinese-vietnamese tree-to-tree statistical machine translation with bilingual features
Yao et al. Chinese long text summarization using improved sequence-to-sequence lstm
Botha Probabilistic modelling of morphologically rich languages
Zhu Exploration on Korean-Chinese collaborative translation method based on recursive recurrent neural network
CN109977418B (en) Short text similarity measurement method based on semantic vector
Zhao Design of Intelligent Proofreading System Based on Artificial Intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant