Summary of the invention
The phrase rule table extracted by the heuristic rule extraction methods of the prior art is very large, occupies a large amount of disk space, and contains considerable noise. To address these shortcomings, the technical problem to be solved by the present invention is to provide a combination-based phrase rule extraction method that generates a compact phrase rule set containing more contextual information.
To solve the above technical problem, the present invention adopts the following technical solution:
The combination-based phrase rule extraction method of the present invention comprises the following steps: constructing a set of "minimal phrase rules" from a bilingual corpus;
Constructing, by combining minimal phrase rules, a phrase rule set containing more contextual information, namely the "combined phrase rule set"; for the combination-based extraction, generating the minimal phrase rule set from a given bilingual parallel corpus containing word alignment information, and storing it in a hash data structure;
Setting the value of the combination count n, constructing combined phrase rules, and determining from the minimal phrase rule set how many minimal phrase rules each combined phrase rule consists of;
If a combined phrase rule consists of n or fewer minimal phrase rules from the minimal phrase rule set, placing it in a new hash data structure;
Outputting the phrase rules in the minimal phrase rule set and in the combined phrase rule set, which completes one pass of the combination-based phrase rule extraction process.
The minimal phrase rule is a phrase rule that, while remaining consistent with the word alignment information, cannot be further decomposed into two or more rules.
The combined phrase rule is a phrase rule that is consistent with the word alignment information and is formed by merging n or fewer minimal phrase rules from the same training sentence pair.
If a combined phrase rule consists of more than n minimal phrase rules from the minimal phrase rule set, it is not processed, and this pass of the combination-based phrase rule extraction process ends.
The size of the combined phrase rule set is adjusted through the value of the combination count n: the larger the value of n, the larger the resulting combined phrase rule set.
The present invention has the following beneficial effects and advantages:
1. The present invention can effectively generate a high-quality, compact phrase rule set that also contains more contextual information; with translation performance not reduced, the phrase rule set extracted by the method of the invention is 56.5% smaller than the phrase rule set extracted by the baseline method.
2. Analysis of the experimental results shows that, on some data sets, the combination-based phrase extraction method yields an improvement in BLEU score; in addition, extensive experiments reasonably verify the effectiveness of the combination-based phrase rule extraction method.
Embodiment
The present invention is further described below with reference to the accompanying drawings.
The combination-based phrase rule extraction method of the present invention comprises the following steps:
Constructing a "minimal phrase rule set" from a bilingual corpus;
Constructing, by combining rules from the minimal phrase rule set, a high-quality phrase rule set containing more contextual information, namely the "combined phrase rule set" n-composed;
For the combination-based extraction, generating the minimal phrase rule set minimal from a given bilingual parallel corpus containing word alignment information, and storing it in a hash data structure named minimal;
Setting the value of the combination count n and constructing the combined phrase rule set n-composed: all possible phrase rules are checked against the minimal phrase rule set minimal, that is, it is determined how many minimal phrase rules each combined phrase rule consists of;
If a combined phrase rule consists of n or fewer minimal phrase rules from the minimal phrase rule set minimal, placing it in a new hash data structure named composed;
Outputting the phrase rules in minimal and composed, which completes one pass of the combination-based phrase rule extraction process.
If a combined phrase rule consists of more than n minimal phrase rules from the minimal phrase rule set minimal, it is not processed.
To obtain a workable phrase rule set with a reasonable number of rules, the present invention proposes the combination-based phrase rule extraction method.
As shown in Fig. 1, before the method of the invention is carried out, the bilingual parallel data and the word alignments are first prepared, and the combination count n is set in advance;
Read one line of data, comprising the source language sentence, the target language sentence, and the word alignment;
Construct the minimal phrase rule set and put it into hash structure 1;
Construct a combined rule and judge whether it satisfies the combination count n; if it does, i.e. the combined phrase rule consists of n or fewer minimal phrase rules from the minimal phrase rule set minimal, put it into hash structure 2;
Judge whether there are other possible combined rules; if there are none, output and save the contents of hash structures 1 and 2, which completes one pass of the combination-based phrase rule extraction process;
Judge whether there is unprocessed data; if there is none, end the whole control procedure.
If there is unprocessed data, return to the step of reading one line of data, comprising the source language sentence, the target language sentence, and the word alignment.
If there are other possible combined rules, return to the step of constructing a combined rule and continue with the judgment of whether the combination count n is satisfied.
If the combination count n is not satisfied, i.e. the combined phrase rule consists of more than n minimal phrase rules from the minimal phrase rule set minimal, go to the step of judging whether there are other possible combined rules.
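The control flow of Fig. 1 can be sketched roughly as follows. This is only an illustrative outline, not the implementation used in the embodiment: the `src ||| tgt ||| i-j ...` line format, the function names, and the two callables `minimal_fn` and `candidates_fn` are all assumptions introduced here. Unlike the flowchart, which outputs after each line, the sketch accumulates counts over the whole input for brevity.

```python
from collections import Counter


def parse_line(line):
    """Split one data line into source words, target words and alignment links.
    The 'src ||| tgt ||| 0-0 1-2' layout is an assumed input format."""
    src, tgt, align = (field.strip() for field in line.split("|||"))
    links = {tuple(int(x) for x in pair.split("-")) for pair in align.split()}
    return src.split(), tgt.split(), links


def extract_corpus(lines, n, minimal_fn, candidates_fn):
    """Loop of Fig. 1: fill hash structure 1 (minimal) and 2 (composed).

    minimal_fn(src, tgt, links)    -> iterable of minimal rules (hypothetical)
    candidates_fn(src, tgt, links) -> iterable of (rule, k), where k is the
                                      number of minimal rules in the candidate
    """
    minimal = Counter()   # hash structure 1
    composed = Counter()  # hash structure 2
    for line in lines:                      # "is there unprocessed data?"
        src, tgt, links = parse_line(line)
        for rule in minimal_fn(src, tgt, links):
            minimal[rule] += 1
        for rule, k in candidates_fn(src, tgt, links):
            if k <= n:                      # satisfies the combination count n
                composed[rule] += 1
            # otherwise the candidate is skipped and the next one is examined
    return minimal, composed


# Toy stand-ins for the real extraction routines (illustration only).
def toy_minimal(src, tgt, links):
    return [(src[i], tgt[j]) for i, j in sorted(links)]


def toy_candidates(src, tgt, links):
    return [((" ".join(src), " ".join(tgt)), len(links))]


minimal, composed = extract_corpus(["a b ||| x y ||| 0-0 1-1"], 2,
                                   toy_minimal, toy_candidates)
```

Passing the extraction routines in as parameters keeps the loop independent of how minimal and combined rules are actually found.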
As shown in Fig. 2, the basic idea of this rule extraction algorithm is first to construct, from a bilingual corpus (large-scale parallel sentence pairs), the "minimal phrase rules" minimal (the most basic rules, of the smallest unit granularity), and then to construct, by combining minimal phrase rules, a high-quality phrase rule set containing more contextual information, namely the combined phrase rule set n-composed. In the present invention, an n-composed phrase rule set is one in which each rule may consist of 1 to n minimal phrase rules; accordingly, the (n-1)-composed phrase rule set is a subset of the n-composed rule set. In the method of the invention, the size of the rule set is adjusted through the value of n: the larger n is, the larger the resulting rule set. This differs from previous rule extraction algorithms, which limit the maximum number of words in the source language and target language phrases.
In the combination-based phrase rule extraction method proposed by the present invention, the first question is what kind of rule counts as a minimal phrase rule.
A minimal phrase rule is one that, while remaining consistent with the word alignment information, cannot be further decomposed into two or more rules. The minimal rule set is the smallest unit of translation and contains the basic information required for translation.
The minimal rule set constitutes the most concise translation model. The minimal phrase rules that the proposed extraction method extracts from the example sentence pair with word alignment information are shown in the Minimal list on the right side of Fig. 2. Among the phrase rules shown in Fig. 2, the first five satisfy the invention's definition of a minimal rule. For example, (Liaoning, liaoning) cannot be decomposed into two or more phrase rules, so it is a minimal phrase rule.
Minimal rules are not limited to phrase rules whose source language and target language sides each contain only one word. When the word alignment is one-to-many or many-to-one, an extracted phrase rule that is consistent with the word alignment also satisfies the definition of a minimal rule. For example, in the rule (import and export, import and export), the source word "import and export" corresponds in the word alignment information to the target language words "import" and "export"; this rule is consistent with the word alignment information, is a reasonable phrase rule, and satisfies the definition of a minimal phrase rule, so it is added to the minimal phrase rule set when that set is constructed. In addition, if a word adjacent to the source language or target language side of a minimal phrase rule is unaligned (aligned to null), the minimal rule may be extended to include the unaligned word, and the resulting phrase rule still satisfies the definition of a minimal phrase rule. For example, in the rule (Liaoning, liaoning 's), the target language word 's appears at the edge of the target language phrase and is aligned to null in the word alignment information; this rule still consists of only one minimal phrase rule, (Liaoning, liaoning), so it is a minimal phrase rule.
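The definition above can be illustrated with a small brute-force sketch. It rests on assumptions not stated in the text: alignments are given as a set of (source index, target index) links, spans are inclusive index ranges, and "cannot be decomposed" is tested by checking that no smaller consistent span pair inside the rule covers only part of its links. The function names and the toy tokens are hypothetical.

```python
def is_consistent(links, s1, s2, t1, t2):
    """A span pair is consistent if it holds at least one alignment link
    and no link has exactly one endpoint inside it."""
    has_link = False
    for i, j in links:
        src_in = s1 <= i <= s2
        tgt_in = t1 <= j <= t2
        if src_in != tgt_in:
            return False
        has_link = has_link or src_in
    return has_link


def consistent_pairs(links, src_len, tgt_len):
    """Enumerate all consistent span pairs (s1, s2, t1, t2) by brute force."""
    return [
        (s1, s2, t1, t2)
        for s1 in range(src_len) for s2 in range(s1, src_len)
        for t1 in range(tgt_len) for t2 in range(t1, tgt_len)
        if is_consistent(links, s1, s2, t1, t2)
    ]


def links_inside(links, pair):
    """Links whose source index falls in the pair's source span."""
    s1, s2, _, _ = pair
    return frozenset(link for link in links if s1 <= link[0] <= s2)


def inside(inner, outer):
    """True if both spans of `inner` lie within the spans of `outer`."""
    return (outer[0] <= inner[0] and inner[1] <= outer[1]
            and outer[2] <= inner[2] and inner[3] <= outer[3])


def minimal_pairs(links, src_len, tgt_len):
    """Keep consistent pairs that contain no consistent pair with fewer links
    (assumed reading of 'cannot be decomposed into two or more rules')."""
    pairs = consistent_pairs(links, src_len, tgt_len)
    return [
        p for p in pairs
        if not any(inside(q, p)
                   and links_inside(links, q) != links_inside(links, p)
                   for q in pairs)
    ]


# Toy sentence pair mirroring the example in the text (tokens are placeholders).
src = ["LIAONING", "IMPORT-EXPORT"]
tgt = ["liaoning", "'s", "import", "and", "export"]
links = {(0, 0), (1, 2), (1, 4)}   # "'s" and "and" carry no link
minimal = minimal_pairs(links, len(src), len(tgt))
```

Under these assumptions both (LIAONING, liaoning) and its extension over the unaligned 's, spans (0, 0, 0, 1), are kept as minimal, while the pair covering the whole sentence is not, because it contains smaller consistent pairs that cover only part of its links.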
The definition of a minimal phrase rule matches intuition: when translating, one always wants the translation rules used to be as short as possible while translation quality remains high. However, precisely because minimal phrase rules contain only the most basic words used in translation, the resulting minimal phrase rule set loses a large amount of contextual information, and this contextual information is one of the key factors behind the strong performance of phrase-based statistical machine translation systems. In the extreme case where the source language and target language sides of every extracted minimal phrase rule contain only one word, the translation system degenerates into a word-based translation system. To improve the quality of phrase rules and allow them to carry more contextual information, the present invention proposes a phrase rule extraction method that combines minimal phrase rules to obtain rules containing more words and more contextual information.
A phrase rule that is consistent with the word alignment information and is formed by combining n or fewer minimal phrase rules from the same training sentence pair is called an n-composed phrase rule, i.e. a combined phrase rule.
It can be seen that the (n-1)-composed phrase rule set is contained in the n-composed phrase rule set. The 2-Composed list on the right side of Fig. 2 shows the combined phrase rules, each formed from two or fewer minimal phrase rules, that the combination-based extraction method of the invention extracts from the word-aligned sentence pair in Fig. 2. For example, (Liaoning import and export, liaoning 's import and export) is formed by combining the minimal rules (Liaoning, liaoning 's) and (import and export, import and export), so it is a 2-composed phrase rule. For generality, a minimal phrase rule is defined as a 1-composed phrase rule.
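The subset relation between (n-1)-composed and n-composed rule sets can be illustrated with a simplified sketch. It assumes a monotone alignment, so that the minimal rules of a sentence pair can be listed in the same order on both sides and combining adjacent rules is plain concatenation; the general case would also need the word alignment consistency check. The function name and the third toy rule ("company", "company") are invented for illustration.

```python
def compose_rules(minimal_rules, n):
    """Return {(src, tgt): k} for every run of k <= n adjacent minimal rules.

    minimal_rules: list of (source phrase, target phrase) in sentence order
    (monotone alignment assumed, so adjacency holds on both sides).
    """
    composed = {}
    for start in range(len(minimal_rules)):
        for k in range(1, n + 1):
            run = minimal_rules[start:start + k]
            if len(run) < k:        # ran past the end of the sentence
                break
            src = " ".join(s for s, _ in run)
            tgt = " ".join(t for _, t in run)
            # Record the smallest number of minimal rules that yields this rule.
            composed.setdefault((src, tgt), k)
    return composed


rules = [
    ("Liaoning", "liaoning 's"),
    ("import-export", "import and export"),
    ("company", "company"),          # hypothetical third rule
]
one_composed = compose_rules(rules, 1)   # the minimal rules themselves
two_composed = compose_rules(rules, 2)   # adds pairs of adjacent minimal rules
```

In this toy case the 1-composed set holds the three minimal rules and the 2-composed set adds the two rules built from adjacent pairs, so raising n can only enlarge the rule set.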
Clearly, if the number of minimal phrase rules contained in a combined phrase rule is not limited, the method proposed by the present invention can extract phrase rules of arbitrary length. In most cases, however, allowing too many minimal phrase rules in a combined phrase rule does not noticeably improve the quality of the resulting phrase rule set.
The combination-based phrase rule extraction method proposed by the present invention is easy to implement by making simple modifications to the baseline phrase rule extraction algorithm. Given a bilingual parallel corpus containing word alignment information, the extraction is controlled by setting the parameter n of n-composed appropriately.
In this embodiment, the combination-based phrase rule extraction method of the invention is applied to the phrase-based translation system of the open-source NiuTrans system. On the NIST (National Institute of Standards and Technology) Chinese-English translation task, the effect of the combined phrase rule extraction method on translation system performance is evaluated by comparison with the baseline phrase extraction method.
The phrase-based translation framework serving as the baseline translation system uses all the standard features used by the open-source system Moses. In addition, two reordering models are integrated into the translation system: a maximum-entropy-based lexicalized reordering model and a hierarchical phrase reordering model. The baseline decoder uses beam pruning and cube pruning to speed up decoding, and feature weights are optimized with minimum error rate training. The default maximum reordering distance is set to 8, and the number of words on the source language side and the target language side of a phrase rule is limited to 7 (the same as the Moses default). For the phrase rule set, only the top 30 translation candidates, ranked by phrase translation probability, are kept for each source language phrase.
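The last filtering step, keeping only the best 30 candidates per source phrase, might look like the following sketch. The entry layout (source phrase, target phrase, probability), the function name, and the toy values are assumptions for illustration; they are not taken from NiuTrans or Moses.

```python
import heapq
from collections import defaultdict


def prune_phrase_table(entries, limit=30):
    """Keep, for each source phrase, the `limit` candidates with the highest
    phrase translation probability.

    entries: iterable of (source phrase, target phrase, probability) tuples.
    Returns {source phrase: [(target phrase, probability), ...]}, best first.
    """
    by_source = defaultdict(list)
    for src, tgt, prob in entries:
        by_source[src].append((tgt, prob))
    return {
        src: heapq.nlargest(limit, candidates, key=lambda c: c[1])
        for src, candidates in by_source.items()
    }


# Toy table with made-up probabilities; limit lowered to 2 to show the effect.
table = [
    ("import-export", "import and export", 0.6),
    ("import-export", "imports and exports", 0.3),
    ("import-export", "trade", 0.1),
    ("Liaoning", "liaoning", 0.9),
]
pruned = prune_phrase_table(table, limit=2)
```

With the limit set to 2, the lowest-probability candidate for "import-export" is dropped and the remaining candidates are returned in descending order of probability.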
The training data used in this embodiment comprises 1.9 million Chinese-English bilingual sentence pairs, drawn from part of the large-scale bilingual corpus provided by NIST for NIST MT 2008. First, the GIZA++ tool is used to perform bidirectional word alignment on the training data, after which the "grow-diag-final-and" heuristic is used to symmetrize the bidirectional word alignment results. In addition, a 5-gram language model is trained in this experiment on the Xinhua portion of the English Gigaword corpus and the target language side of the bilingual data. For the development and test sets, this embodiment uses the NIST MT 2003 test set (919 sentences) as the development set for weight tuning, and the NIST MT 2004 and NIST MT 2005 test sets (1788 and 1082 sentences, respectively) as the test sets for evaluating translation quality. Translation quality is evaluated with the case-insensitive IBM version of the BLEU metric.
Table 1. Comparison of experimental results for the baseline system and the combination method on the development set (NIST MT 2003) and the test sets (NIST MT 2004 and NIST MT 2005); each result is the average of 5 runs.
Table 1 shows the experimental results of the baseline extraction method and of the combined rule extraction method proposed by the present invention under different settings of the combination value n; results are reported as BLEU scores. The "minimal rules" row of Table 1 shows that when only minimal rules are extracted, the method of the invention obtains a very small phrase rule set; however, because the minimal rule set loses a large amount of contextual information during extraction, the average translation performance on the development and test sets is 1.37 BLEU points lower than that of the baseline system. When combined rules are extracted, a phrase rule set containing more contextual information is obtained, and the BLEU score rises steadily as the number of rules increases. For example, comparing the "baseline" and "2-Composed" methods in Table 1 shows that extracting the 2-composed phrase rule set yields translation performance comparable to the baseline method, while the phrase rule set obtained by the 2-Composed method is 44.3% smaller than that of the baseline method. The experiments further show that when 3-Composed and 4-Composed phrase rules are extracted, the average BLEU score on the development and test sets improves over both the baseline system and the 2-Composed method. Considering translation performance and phrase rule set size together, the translation performance of the 2-Composed phrase rules is comparable to the best performance in the Table 1 experiments while the rule set size is drastically reduced; that is, the 2-Composed phrase rules are essentially optimal. The experimental results in Table 1 show that the method proposed by the present invention can effectively generate a high-quality, compact phrase rule set that also contains more contextual information.
In the baseline phrase rule extraction method, the size of the phrase rule set can be adjusted effectively by setting the maximum number of words allowed in the source language and target language phrases to different values. Fig. 3 compares the BLEU scores of the baseline method and the combination method under different settings. The horizontal axis is the size of the phrase table (in millions) and the vertical axis is the BLEU score. The solid line in Fig. 3 shows the baseline rule extraction algorithm with the phrase length set to different values; each solid square on the line marks a specific experimental setting. For example, "length=3" means that the maximum lengths of the source language and target language phrases of the phrase rules in the baseline system are both set to 3, and likewise for the other points. The dotted line in Fig. 3 shows the combination-based phrase extraction method with n set to different values. As can be seen from Fig. 3, the n-composed phrase rule extraction method proposed by the present invention achieves translation performance comparable to the baseline extraction method when n ≥ 2; it can also be seen that the proposed combined phrase rule extraction method reaches a balance between rule set size and translation performance more quickly. The figure also shows that when only the minimal rule set is used, translation performance is markedly lower than with the (≥2)-composed combination method. This indirectly demonstrates the effectiveness of the combination-based phrase extraction method proposed by the present invention, and shows that phrase rules containing more contextual information have a large effect on translation system performance.
The present invention also gathers statistics on the proportions of minimal phrase rules and combined rules used by the decoder; the statistics are computed over the 30-best translation results on the development and test sets. Fig. 4 shows these statistics, where n-composed* denotes combined rules formed from exactly n minimal rules. As can be seen from Fig. 4, when translating with phrase rules the decoder in most cases tends to select shorter rules (such as minimal and 2-composed*). Combined rules made up of more minimal phrase rules (such as 4-composed*) are seldom used in translation. The experimental results in Fig. 4 also explain why the 2-Composed combined rules achieve high performance in Table 1.
With the phrase rule extraction method proposed by the present invention, a high-quality, compact phrase rule set can be obtained for use by phrase-based statistical machine translation systems. Compared with the most widely used and best-performing heuristic phrase extraction method, and with translation performance not reduced, the phrase rule set extracted by the method proposed by the present invention is 56.5% smaller than the phrase rule set extracted by the baseline method. Analysis of the experimental results shows that, on some data sets, the combination-based phrase extraction method yields an improvement in BLEU score. In addition, extensive experiments reasonably verify the effectiveness of the combination-based phrase rule extraction method.
Verification on the phrase-based statistical machine translation system of the open-source NiuTrans system shows that, compared with the rule extraction algorithm under the Moses default settings, the combination-based rule extraction method proposed by the present invention obtains a more compact phrase rule set without reducing translation performance. When 2-composed phrase rules are extracted, the quality of the translation rules obtained by the extraction method of the invention is comparable to that of the Moses default rule set, while the size of the phrase rule set is 56.5% of that of the rule set under the Moses default settings. The experimental results also show that as the number of combined minimal phrase rules increases, translation system performance does not improve markedly over that of the 2-composed phrase rules. Considering translation performance and phrase rule set size together, the 2-composed phrase rules are essentially optimal.