Summary of the invention
The phrase rule table extracted by the heuristic rule extraction methods of the prior art is very large, occupies a great deal of hard disk space, and contains a considerable amount of noisy data. To address these weaknesses, the technical problem to be solved by the present invention is to provide a combination-based phrase rule extraction method that generates a compact phrase rule set containing more contextual information.
To solve the above technical problem, the technical solution adopted by the present invention is as follows:
A combination-based phrase rule extraction method of the present invention comprises the following steps: constructing a set of "minimum phrase rules" from a bilingual corpus;
constructing a phrase rule set containing more contextual information by combining minimum phrase rules, forming the "combined phrase rule set";
for combination-based phrase rule extraction, generating the minimum phrase rule set from a given bilingual parallel corpus containing word alignment information, and storing it in a hash structure;
setting the value of the combination count n, constructing combined phrase rules, and determining from the minimum phrase rule set how many minimum phrase rules each combined phrase rule is composed of;
if a combined phrase rule is composed of n or fewer minimum phrase rules from the minimum phrase rule set, placing it in a new hash structure;
outputting the phrase rules in the minimum phrase rule set and in the combined phrase rule set, at which point one pass of combination-based phrase rule extraction ends.
The minimum phrase rule is a phrase rule that, while remaining consistent with the word alignment information, cannot be further decomposed into two or more rules.
The combined phrase rule is a phrase rule that is consistent with the word alignment information and is formed by merging n or fewer minimum phrase rules from the same training sentence pair.
If a combined phrase rule is composed of more than n minimum phrase rules from the minimum phrase rule set, it is not processed, and this pass of combination-based phrase rule extraction ends.
The size of the combined phrase rule set is adjusted through the value of the combination count n used in combining phrase rules: the larger the value of n, the larger the resulting combined phrase rule set.
The invention has the following beneficial effects and advantages:
1. The present invention can effectively generate a high-quality, compact phrase rule set that also contains more contextual information. Without reducing translation performance, the phrase rule set extracted by the method of the invention is 56.5% smaller than the one extracted by the baseline method.
2. Analysis of the experimental results shows that, on some data sets, the combination-based phrase extraction method improves the BLEU score. In addition, the effectiveness of the combination-based phrase rule extraction method has been reasonably verified through a large number of experiments.
Detailed description of the invention
The present invention is further described below with reference to the accompanying drawings.
A combination-based phrase rule extraction method of the present invention comprises the following steps:
constructing a "minimum phrase rule set" from a bilingual corpus;
constructing a high-quality phrase rule set containing more contextual information by combining rules from the minimum phrase rule set, forming the "combined phrase rule set" n-composed;
for combination-based phrase rule extraction, generating the minimum phrase rule set minimal from a given bilingual parallel corpus containing word alignment information, and storing it in a hash structure named minimal;
setting the value of the combination count n and constructing the combined phrase rule set n-composed, using the minimum phrase rule set minimal to examine every possible phrase rule, that is, to determine how many minimum phrase rules each combined phrase rule is composed of;
if a combined phrase rule is composed of n or fewer minimum phrase rules from minimal, placing it in a new hash structure composed;
outputting the phrase rules in minimal and composed, at which point one pass of combination-based phrase rule extraction ends.
If a combined phrase rule is composed of more than n minimum phrase rules from minimal, it is not processed.
To obtain a workable phrase rule set of reasonable size, the present invention proposes a combination-based phrase rule extraction method.
As shown in Fig. 1, before the method of the invention is carried out, bilingual parallel data and word alignments are prepared, and the combination count n is set in advance;
one line of data is read, comprising the source language sentence, the target language sentence and the word alignment;
the minimum phrase rule set is constructed and placed in hash data structure 1;
a combined rule is constructed, and it is determined whether it meets the requirement of the combination count n; if it does, that is, the combined phrase rule is composed of n or fewer minimum phrase rules from minimal, it is placed in hash structure 2;
it is determined whether any other possible combined rules remain; if none remain, the contents of hash structures 1 and 2 are output and saved, at which point one pass of combination-based phrase rule extraction ends;
it is determined whether any unprocessed data remain; if none remain, the whole control flow ends.
If unprocessed data remain, the flow returns to the step of reading one line of data, comprising the source language sentence, the target language sentence and the word alignment.
If other possible combined rules remain, the flow returns to the step of constructing a combined rule and then continues with the step of checking the requirement of the combination count n.
If the requirement of the combination count n is not met, that is, the combined phrase rule is composed of more than n minimum phrase rules from minimal, the flow goes to the step of determining whether any other possible combined rules remain.
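The control flow of Fig. 1 can be sketched in Python roughly as follows. This is only an illustrative sketch under stated assumptions: the function name extract_rules and the three callables minimal_of, candidates_of and parts_of are hypothetical stand-ins for steps that the text describes but does not spell out, and the two Counter objects play the role of hash structures 1 and 2.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Tuple


def extract_rules(
    sentences: Iterable[object],
    n: int,
    minimal_of: Callable[[object], List[Hashable]],
    candidates_of: Callable[[object], Iterable[Hashable]],
    parts_of: Callable[[Hashable, List[Hashable]], int],
) -> Tuple[Counter, Counter]:
    """Run one extraction pass per sentence pair, following Fig. 1.

    All three callables are hypothetical placeholders:
      minimal_of(sentence)         -> minimum phrase rules of the sentence pair
      candidates_of(sentence)      -> candidate combined rules to examine
      parts_of(candidate, minimal) -> number of minimum rules in the candidate
    """
    minimal: Counter = Counter()   # hash structure 1
    composed: Counter = Counter()  # hash structure 2
    for sentence in sentences:                     # read one line of data
        sentence_minimal = minimal_of(sentence)    # construct minimum rules
        minimal.update(sentence_minimal)
        for candidate in candidates_of(sentence):  # construct a combined rule
            if parts_of(candidate, sentence_minimal) <= n:
                composed[candidate] += 1           # meets the requirement of n
            # otherwise skip it and move on to the next candidate
    return minimal, composed
```

Unlike the flow in Fig. 1, which outputs and saves the two hash structures after each sentence pair, this sketch accumulates counts over all sentence pairs and returns them at the end; that choice is made only to keep the example short.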
As shown in Fig. 2, the basic idea of this rule extraction algorithm is as follows. First, a set of "minimum phrase rules", minimal, is constructed from a bilingual corpus (large-scale parallel sentence pairs); these are the most basic rules with the smallest unit granularity and constitute one particular definition of phrase rule. Then, by combining minimum phrase rules, a high-quality phrase rule set containing more contextual information is constructed, namely the combined phrase rule set n-composed. In the present invention, an n-composed phrase rule set is one whose rules may each be composed of 1 to n minimum phrase rules; the (n-1)-composed phrase rule set is therefore contained in the n-composed rule set, that is, the (n-1)-composed rule set is a subset of the n-composed rule set. In the method of the invention, the size of the rule set is adjusted through the value of n used in combining rules: the larger n is, the larger the resulting rule set. This differs from rule extraction algorithms that adjust the size by limiting the maximum number of words contained in the source and target language phrases.
In the combination-based phrase rule extraction method proposed by the present invention, the first question is what kind of rule counts as a minimum phrase rule.
A minimum phrase rule is a rule that, while remaining consistent with the word alignment information, cannot be further decomposed into two or more rules. The minimum rule set is the smallest unit of translation and contains the basic information needed for translation.
The minimum rule set constitutes the most compact translation model. In Fig. 2, the Minimal list on the right shows the minimum phrase rules that the proposed extraction method extracts from the example sentence pair with word alignment information. Among the phrase rules shown in Fig. 2, the first five meet the invention's definition of a minimum rule. For example, (Liaoning, liaoning) cannot be decomposed into two or more phrase rules, so it is a minimum phrase rule.
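The consistency test and the search for minimum rules can be illustrated with a small Python sketch. It rests on assumptions that are not stated in the text: phrase pairs are represented as inclusive index spans, "consistent" is taken in the usual sense that no alignment link crosses the border of the pair, and a rule is treated as minimum when no smaller consistent pair with aligned boundary words lies inside it. The toy sentence pair and its indices are modelled on the Liaoning example and are likewise assumed.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]    # (source index, target index)
Span = Tuple[int, int]    # inclusive (start, end)
Pair = Tuple[Span, Span]  # (source span, target span)


def is_consistent(links: Set[Link], src: Span, tgt: Span) -> bool:
    """At least one link inside the pair, and no link crossing its border."""
    inside = 0
    for s, t in links:
        s_in = src[0] <= s <= src[1]
        t_in = tgt[0] <= t <= tgt[1]
        if s_in != t_in:
            return False
        inside += s_in
    return inside > 0


def tight_pairs(links: Set[Link], src_len: int, tgt_len: int) -> List[Pair]:
    """Consistent pairs whose four boundary words are all aligned."""
    a_src = {s for s, _ in links}
    a_tgt = {t for _, t in links}
    found = []
    for i1 in range(src_len):
        for i2 in range(i1, src_len):
            for j1 in range(tgt_len):
                for j2 in range(j1, tgt_len):
                    if ({i1, i2} <= a_src and {j1, j2} <= a_tgt
                            and is_consistent(links, (i1, i2), (j1, j2))):
                        found.append(((i1, i2), (j1, j2)))
    return found


def contains(outer: Pair, inner: Pair) -> bool:
    """True if both spans of `inner` lie within the spans of `outer`."""
    (o_src, o_tgt), (i_src, i_tgt) = outer, inner
    return (o_src[0] <= i_src[0] and i_src[1] <= o_src[1]
            and o_tgt[0] <= i_tgt[0] and i_tgt[1] <= o_tgt[1])


def minimal_rules(links: Set[Link], src_len: int, tgt_len: int) -> List[Pair]:
    """Tight consistent pairs that contain no other tight consistent pair."""
    tight = tight_pairs(links, src_len, tgt_len)
    return [p for p in tight if not any(q != p and contains(p, q) for q in tight)]


# Assumed toy sentence pair:
# source: 0 "Liaoning", 1 "import and export" (one source word)
# target: 0 liaoning, 1 's, 2 import, 3 and, 4 export
links = {(0, 0), (1, 2), (1, 4)}
print(minimal_rules(links, src_len=2, tgt_len=5))
# prints [((0, 0), (0, 0)), ((1, 1), (2, 4))]
```

The exhaustive four-level loop is chosen for readability only; it is far too slow for real corpora, where span boundaries would be derived from the alignment links directly.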
Minimum rules are not limited to phrase rules whose source and target sides each contain a single word. When the word alignment is one-to-many or many-to-one, an extracted phrase rule that is consistent with the word alignment also meets the definition of a minimum rule. For example, in the rule (import and export, import and export), the source word "import and export" is aligned to the target words "import" and "export"; this rule is consistent with the word alignment information, is a valid phrase rule, and meets the definition of a minimum phrase rule, so it is added to the minimum phrase rule set when that set is constructed. In addition, if a word adjacent to the source or target side of a minimum phrase rule is aligned to null, the minimum rule may be extended over that null-aligned word, and the resulting phrase rule still meets the definition of a minimum phrase rule. For example, in the rule (Liaoning, liaoning's), the target word 's lies at the edge of the target phrase and is aligned to null in the word alignment; the rule is still composed of only one minimum phrase rule, (Liaoning, liaoning), so it is a minimum phrase rule.
The definition of the minimum phrase rule matches intuition: when translating, one always hopes that the translation rules used are as short and small as possible while translation quality remains high. However, precisely because minimum phrase rules contain only the most basic words used in translation, the resulting minimum phrase rule set loses a large amount of contextual information, and this contextual information is one of the key factors behind the strong performance of phrase-based statistical machine translation systems. In the extreme case, when the source and target sides of every extracted minimum phrase rule contain only one word, the translation system degenerates into a word-based translation system. To improve the quality of the phrase rules and let them carry more contextual information, the present invention proposes extracting phrase rules containing more words and more contextual information by combining minimum phrase rules.
A phrase rule that is consistent with the word alignment information and is formed by combining n or fewer minimum phrase rules from the same training sentence pair is called an n-composed phrase rule, that is, a combined phrase rule.
It follows that the (n-1)-composed phrase rule set is contained in the n-composed phrase rule set. In Fig. 2, the 2-Composed list on the right shows the combined phrase rules, each formed from two or fewer minimum phrase rules, that the combination-based extraction method of the invention extracts from the sentence pair with word alignment information in Fig. 2. For example, (Liaoning import and export, Liaoning's import and export) is formed by combining the minimum rules (Liaoning, liaoning's) and (import and export, import and export), so it is a 2-composed phrase rule. For generality, a minimum phrase rule is defined as a 1-composed phrase rule.
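The n-composed test can be illustrated with a short Python sketch. It assumes that the candidate phrase pairs have already been checked for consistency with the word alignment, that rules are represented as inclusive index spans, and that the number of minimum rules in a candidate can be obtained by counting the unextended minimum rules lying inside it; the function names and the toy spans are hypothetical.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]    # inclusive (start, end)
Pair = Tuple[Span, Span]  # (source span, target span)


def contains(outer: Pair, inner: Pair) -> bool:
    """True if both spans of `inner` lie within the spans of `outer`."""
    (o_src, o_tgt), (i_src, i_tgt) = outer, inner
    return (o_src[0] <= i_src[0] and i_src[1] <= o_src[1]
            and o_tgt[0] <= i_tgt[0] and i_tgt[1] <= o_tgt[1])


def minimal_parts(candidate: Pair, minimal: Iterable[Pair]) -> int:
    """Number of (unextended) minimum rules lying inside the candidate."""
    return sum(contains(candidate, m) for m in minimal)


def n_composed(candidates: Iterable[Pair], minimal: Iterable[Pair], n: int) -> List[Pair]:
    """Keep the candidates that are built from 1 to n minimum rules."""
    minimal = list(minimal)
    return [c for c in candidates if 1 <= minimal_parts(c, minimal) <= n]


# Assumed toy sentence pair:
# source: 0 "Liaoning", 1 "import and export" (one source word)
# target: 0 liaoning, 1 's, 2 import, 3 and, 4 export
minimal = [((0, 0), (0, 0)), ((1, 1), (2, 4))]
candidates = [
    ((0, 0), (0, 0)),  # (Liaoning, liaoning)
    ((0, 0), (0, 1)),  # (Liaoning, liaoning's)
    ((1, 1), (2, 4)),  # (import and export, import and export)
    ((1, 1), (1, 4)),  # same rule extended over the null-aligned 's
    ((0, 1), (0, 4)),  # whole sentence pair, built from two minimum rules
]
one = n_composed(candidates, minimal, n=1)
two = n_composed(candidates, minimal, n=2)
print(len(one), len(two))  # prints 4 5
```

With n=1 only the four minimum rules survive, and with n=2 the whole-sentence rule is added, so the 1-composed set is a subset of the 2-composed set, as the text states.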
Clearly, if no limit is placed on the number of minimum phrase rules contained in a combined phrase rule, the method proposed by the invention can extract phrase rules of arbitrary length. In most cases, however, setting that number too high does not improve the quality of the resulting phrase rule set.
The combination-based phrase rule extraction method proposed by the invention is very easy to implement through a simple modification of the baseline phrase rule extraction algorithm: given a bilingual parallel corpus containing word alignment information, the parameter n of n-composed is set to a reasonable value.
In this embodiment, the combination-based phrase rule extraction method of the invention is applied to the phrase-based translation system of the NiuTrans open source system. On the NIST (National Institute of Standards and Technology) Chinese-English translation task, the effect of the combined phrase rule extraction method on translation system performance is evaluated by comparison with the baseline phrase extraction method.
The phrase-based translation framework, used as the baseline translation system, employs all the standard features used by the open source system Moses. In addition, two reordering models are integrated into the translation system: a maximum-entropy lexicalized reordering model and a hierarchical phrase reordering model. The baseline decoder uses beam pruning and cube pruning to speed up decoding, and minimum error rate training to optimize the feature weights. By default, the maximum reordering distance is set to 8, and the number of words on the source and target sides of a phrase rule is limited to 7 (the same as the Moses default setting). For the phrase rule set, each source language phrase keeps only its top 30 translation candidates ranked by phrase translation probability.
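The candidate limit described above can be illustrated with a minimal Python sketch. The tuple layout of a phrase table entry and the function name prune_phrase_table are assumptions made for illustration; they do not reflect the actual file formats or interfaces of NiuTrans or Moses.

```python
from collections import defaultdict
from heapq import nlargest
from typing import Dict, Iterable, List, Tuple

# (source phrase, target phrase, phrase translation probability); assumed layout
Entry = Tuple[str, str, float]


def prune_phrase_table(
    entries: Iterable[Entry], k: int = 30
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep, for each source phrase, its k most probable translation candidates."""
    by_source: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for src, tgt, prob in entries:
        by_source[src].append((tgt, prob))
    # nlargest returns the kept candidates in descending order of probability
    return {
        src: nlargest(k, cands, key=lambda cand: cand[1])
        for src, cands in by_source.items()
    }


table = [("a", "x", 0.5), ("a", "y", 0.3), ("a", "z", 0.2), ("b", "w", 1.0)]
print(prune_phrase_table(table, k=2))
# prints {'a': [('x', 0.5), ('y', 0.3)], 'b': [('w', 1.0)]}
```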
The training data used in this embodiment comprise 1.9 million Chinese-English bilingual sentence pairs, taken from the NIST portion of the large-scale bilingual corpus provided for the NIST MT 2008 evaluation. First, the GIZA++ tool is used to perform bidirectional word alignment on the training data, and the bidirectional alignments are then symmetrized with the "grow-diag-final-and" heuristic. In addition, a 5-gram language model is trained on the Xinhua portion of the English Gigaword corpus and the target language side of the bilingual data. As for development and test sets, this embodiment uses the NIST MT 2003 test set (919 sentences) as the development set for weight tuning, and the NIST MT 2004 and NIST MT 2005 test sets (1788 and 1082 sentences respectively) as test sets for evaluating translation quality. Translation quality is measured with the case-insensitive IBM version of the BLEU metric.
Table 1. Comparison of experimental results for the baseline system and the combination method on the development set (NIST MT 2003) and the test sets (NIST MT 2004 and NIST MT 2005); each group of results is averaged over 5 runs.
Table 1 gives the experimental results of the baseline extraction method and of the combined rule extraction method proposed by the invention under different settings of the combination value n, with BLEU as the evaluation metric. The "minimum rule" row of Table 1 shows that when only minimum rules are extracted, the method of the invention yields a very small phrase rule set; however, because the minimum rule set loses a large amount of contextual information during extraction, average translation performance on the development and test sets is 1.37 BLEU points lower than that of the baseline system. When combined rules are extracted, the phrase rule set contains more contextual information, and the BLEU score rises steadily as the number of rules increases. For example, comparing the "baseline method" with the "2-Composed" method in Table 1 shows that extracting the 2-composed phrase rule set gives translation performance comparable to the baseline method, while the phrase rule set obtained by the 2-Composed method is 44.3% smaller than that of the baseline method. Further experiments show that when 3-Composed and 4-Composed phrase rules are extracted, the average BLEU score on the development and test sets improves over both the baseline system and the 2-Composed method. Taking translation performance and phrase rule set size into account together, the translation performance of 2-Composed phrase rules is comparable to the best performance in Table 1 while the phrase rule set is markedly smaller; that is, 2-Composed phrase rules are essentially optimal. The results in Table 1 show that the method proposed by the invention can effectively generate a high-quality, compact phrase rule set that also contains more contextual information.
In the baseline phrase rule extraction method, the size of the phrase rule set can be adjusted effectively by setting the maximum number of words in the source and target language phrases to different values. Fig. 3 compares the BLEU scores of the baseline method and the combination method under different settings. The horizontal axis is the size of the phrase table (in millions) and the vertical axis is the BLEU score. The solid line in Fig. 3 shows the baseline rule extraction algorithm with the phrase length set to different values, and the solid square points on it mark the specific experimental settings; for example, "length=3" means that the maximum length of both the source and target language phrases of a phrase rule in the baseline system is set to 3, and the others are analogous. The dotted line in Fig. 3 shows the combination-based phrase extraction method with n set to different values. Fig. 3 shows that, in the n-composed phrase rule extraction method proposed by the invention, translation performance comparable to the baseline extraction method is obtained when n >= 2; it also shows that the proposed combined phrase rule extraction method reaches a balance between rule set size and translation performance more quickly. The figure further shows that when only the minimum rule set is used, translation performance is significantly lower than that of the (>=2)-composed combination methods. This indirectly demonstrates the effectiveness of the proposed combination-based phrase extraction method, and also shows that phrase rules containing more contextual information have a large effect on translation system performance.
The present invention also compiles statistics on the proportions of minimum phrase rules and combined rules used by the decoder, computed over the 30-best translation results on the development and test sets. Fig. 4 shows these statistics for the development and test sets, where n-composed* denotes combined rules formed from exactly n minimum rules. Fig. 4 shows that, when translating with phrase rules, the decoder in most cases tends to select shorter rules (such as minimal and 2-composed*), whereas combined rules formed from more minimum phrase rules (such as 4-composed*) are rarely used. The results in Fig. 4 also explain why the 2-Composed combined rules achieve relatively high performance in Table 1.
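Usage statistics of this kind can be computed along the following lines. The sketch assumes a hypothetical input format in which each translation in the n-best list is given as a list of integers, one per rule used, each integer being the number of minimum rules that the rule is composed of; the text does not describe how the decoder actually reports this information.

```python
from collections import Counter
from typing import Dict, Iterable, List


def usage_ratio(derivations: Iterable[List[int]]) -> Dict[str, float]:
    """Share of rule applications by the number of minimum rules per rule.

    Each derivation is a list of integers (assumed format): the number of
    minimum rules that each rule used in that translation is composed of.
    """
    counts = Counter(k for derivation in derivations for k in derivation)
    total = sum(counts.values())
    if total == 0:
        return {}

    def label(k: int) -> str:
        return "minimal" if k == 1 else f"{k}-composed*"

    return {label(k): count / total for k, count in sorted(counts.items())}


# Two made-up translations, invented only to show the calculation
print(usage_ratio([[1, 1, 2], [1, 4]]))
# prints {'minimal': 0.6, '2-composed*': 0.2, '4-composed*': 0.2}
```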
With the phrase rule extraction method proposed by the invention, a high-quality, compact phrase rule set can be obtained for use in phrase-based statistical machine translation systems. Compared with the most widely used and best performing heuristic phrase extraction method, and without reducing translation performance, the phrase rule set extracted by the proposed method is 56.5% smaller than that extracted by the baseline method. Analysis of the experimental results shows that, on some data sets, the combination-based phrase extraction method improves the BLEU score. In addition, the effectiveness of the combination-based phrase rule extraction method has been reasonably verified through a large number of experiments.
Verification on the phrase-based statistical machine translation system of the NiuTrans open source system shows that, compared with the rule extraction algorithm under the Moses default setting, the combination-based rule extraction method proposed by the invention obtains a more compact phrase rule set without reducing translation performance. When 2-composed phrase rules are extracted, the quality of the translation rules obtained by the extraction method of the invention is comparable to that of the Moses default rule set, while the size of the phrase rule set is 56.5% of that of the Moses default rule set. The experimental results also show that, as the number of combined minimum phrase rules increases further, translation performance does not improve markedly over that of 2-composed phrase rules. Taking system translation performance and phrase rule set size into account together, 2-composed phrase rules are essentially optimal.