Summary of the invention
Syntactic translation systems in the prior art cannot translate and reorder short segments of a sentence well, and they suffer from robustness problems caused by rule sparsity. The models in non-syntactic translation systems cannot effectively reorder sentence constituents over long distances. Manually annotating skeleton information is time-consuming and laborious. To address these problems, the technical problem solved by the invention is to provide a statistical machine translation system based on a syntactic skeleton. The system models the high-level syntactic skeleton of the source language and translates low-level phrases well, and it proposes a novel representation of the syntactic skeleton for use by machine translation systems.
To solve the above technical problems, the present invention adopts the following technical solution:
A statistical machine translation system based on a syntactic skeleton according to the present invention comprises the following steps:
1) A probabilistic SCFG hierarchical rule extraction algorithm extracts non-syntactic translation rules, which are used to translate the non-skeleton parts of the sentence to be translated:
Using the heuristically constrained method for extracting hierarchical rules, probabilistic SCFG grammar rules are extracted from parallel sentence pairs that have been word-aligned but not syntactically parsed. These hierarchical phrase rules, i.e. the non-syntactic translation rules, handle the translation of the low-level structure of the sentence to be translated;
2) The GHKM rule extraction method extracts syntactic translation rules, which are used to translate the skeleton part of the sentence to be translated:
Using the GHKM rule extraction algorithm, GHKM rules are extracted from word-aligned parallel sentence pairs together with the syntactic parse of the source-language side, and the extracted GHKM rules are rewritten into syntactic translation rules. The syntactic translation rules handle the generation and translation of the high-level skeleton structure;
3) Generation of non-fully syntactic translation rules:
Non-fully syntactic translation rules are generated from the syntactic translation rules and are used together with the non-syntactic and syntactic translation rules, so that the advantages of the non-syntactic translation system and the syntactic translation system are integrated;
4) Model generation:
Based on the above non-fully syntactic translation rules, the grammars (i.e. the translation rule sets) of the syntactic translation system and the non-syntactic translation system are integrated according to the different translation tasks, and non-fully syntactic translation derivations are generated. Non-syntactic translation rules handle the translation of the low-level words and phrases of the sentence to be translated; syntactic translation rules complete the translation of its high-level syntactic skeleton structure; and non-fully syntactic translation rules guide skeleton generation and the translation process. The non-syntactic, syntactic, and non-fully syntactic translation rules are collected to form an SCFG grammar system with large coverage, and the non-fully syntactic translation rules connect the different forms of grammar.
The extracted GHKM rules are rewritten into syntactic translation rules as follows. An extracted GHKM rule has the format:
source-language phrase syntactic label <source-language syntax tree fragment rooted at that syntactic label> → target-language string
Here the "source-language phrase syntactic label" on the left-hand side is a phrase-structure type label defined by linguistic syntactic knowledge, i.e. a syntactic nonterminal. The "syntax tree fragment" on the left-hand side is a fragment of the sentence's parse tree; it is a tree structure whose leaf nodes can be terminal words or nonterminals, and these nonterminals must belong to one of the syntactic label classes of the source-language parse. The "target-language string" on the right-hand side is a string of target-language terminal words and nonterminals, whose nonterminals correspond one-to-one to the nonterminals at the leaf nodes of the source-language syntax tree fragment.
By keeping the nonterminals on the boundary of the syntax tree fragment and discarding the internal tree structure, the above GHKM rule can be rewritten as a syntactic translation rule:
source-language phrase syntactic label → <source-language string, target-language string>
Here the "source-language string" is a sequence of source-language terminal words and nonterminals with their corresponding syntactic labels; this sequence is the leaf-node sequence of the source-language syntax tree fragment in the GHKM rule that corresponds to the syntactic rule. The "target-language string" is a string of target-language terminal words and nonterminals with their corresponding syntactic labels, whose nonterminals correspond one-to-one to the nonterminals at the leaf nodes of the source-language syntax tree fragment.
Non-fully syntactic translation rules are generated from the non-syntactic and syntactic translation rules. A non-fully syntactic translation rule has the format:
source-language phrase syntactic label → <source-language string*, target-language string*>
Here the "source-language phrase syntactic label" on the left-hand side is a nonterminal; "source-language string*" is a string of source-language terminal words, nonterminals, and the generalized label X; and "target-language string*" is a string of target-language terminal words, nonterminals, and the generalized label X, whose nonterminals correspond one-to-one to the nonterminals at the leaf nodes of the source-language syntax tree fragment;
A non-fully syntactic translation rule differs from a syntactic translation rule in that it does not require every nonterminal in the rule to belong to one of the phrase syntactic labels of the source-language parse: some of its nonterminals are generalized to X, which indicates that the nonterminal does not belong to any syntactic category.
The advantages of the non-syntactic translation system and the syntactic translation system are combined as follows:
The large-coverage SCFG grammar generated from the source-side syntactic translation rules, the non-syntactic translation rules, and the non-fully syntactic translation rules creates the syntactic skeleton during decoding;
While the syntactic skeleton structure is being generated, the reordering between syntactic constituents of the source language is captured, and the high-level translation tasks of the sentence to be translated are assigned to the syntactic translation system, whereas its low-level translation tasks are assigned to the non-syntactic translation system. In this way each translation system contributes its strengths to the translation tasks it handles best.
The grammars of the non-syntactic translation system and the syntactic translation system are integrated according to the different translation tasks as follows:
In the SCFG system, a weight is computed for each translation derivation so that the various translation rules are used to derive translations more accurately. The score of each translation derivation d is calculated with the following formula:
$$s(d) = \prod_{r_i \in d_s} w(r_i) \times \prod_{r_j \in d_h} w(r_j) \times lm(t)^{\lambda_{lm}} \times \exp(\lambda_{wb} \cdot |t|)$$
where s(d) is the score of the translation derivation d and t is the target-language string. The score of d is defined as the product of several factors:
Factor 1: the product of the weights of all rules in the syntactic skeleton d_s of d, $\prod_{r_i \in d_s} w(r_i)$, where r_i is the i-th rule in d_s and w(r*) is the weight of rule r*;
Factor 2: the product of the weights of all rules in the non-skeleton part d_h of d, $\prod_{r_j \in d_h} w(r_j)$, where r_j is the j-th rule in d_h and w(r*) is the weight of rule r*;
Factor 3: the exponentially weighted n-gram language model score $lm(t)^{\lambda_{lm}}$, where λ_lm is the weight of the n-gram language model;
Factor 4: the word bonus exp(λ_wb · |t|), where exp(|t|) is the exponential of the translation length, so the longer the sentence, the larger this bonus; λ_wb is the weight of the word bonus.
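As a rough illustration, the four factors above can be multiplied together as in the following sketch. The function name `derivation_score`, its parameter names, and the numbers in the usage lines are assumptions made for this example and are not part of the invention as filed.

```python
import math

def derivation_score(skeleton_weights, rest_weights, lm_prob, target_len,
                     lambda_lm, lambda_wb):
    """Score s(d) as the product of the four factors described above."""
    factor1 = math.prod(skeleton_weights)        # rules in the skeleton d_s
    factor2 = math.prod(rest_weights)            # rules in the remainder d_h
    factor3 = lm_prob ** lambda_lm               # weighted n-gram LM score
    factor4 = math.exp(lambda_wb * target_len)   # word bonus on |t|
    return factor1 * factor2 * factor3 * factor4

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
score = derivation_score([0.5, 0.5], [0.4], lm_prob=0.01, target_len=3,
                         lambda_lm=0.5, lambda_wb=0.0)
```

In practice such products are usually accumulated as sums of logarithms to avoid numerical underflow; the sketch keeps the multiplicative form so that it mirrors the text.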
The invention has the following beneficial effects and advantages:
1. The system of the present invention translates using a self-defined kind of syntactic structure information (the syntactic skeleton, or simply the skeleton). It models the high-level syntactic skeleton of the source language for use by the machine translation system and combines two advantages in one framework: 1) syntactic translation rules handle the translation of the syntactic skeleton and long-distance reordering; 2) the rules of the non-syntactic translation system handle low-level lexical translation and reordering.
2. The model of the invention is very flexible. A single, concise grammar-based decoding paradigm covers non-syntactic, non-fully syntactic, and fully syntactic derivations, and a gradual two-way transition between syntactic and non-syntactic translation rules is possible, so the translation system can choose selectively between the syntactic and the non-syntactic translation system. The non-syntactic translation system and the syntactic translation system can therefore be regarded as two special cases of this method. The model is easy to implement and its effect is significant.
3. The system of the present invention is also applicable to translation systems generally based on the synchronous context-free grammar (SCFG) framework. It can be implemented easily in any translation system that supports an SCFG decoder, and it has been confirmed to speed up translation.
4. The invention defines a novel representation of skeleton structure and is the first to acquire syntactic skeleton information automatically. Under the guidance of the syntactic, non-fully syntactic, and non-syntactic translation rules, skeleton information is obtained automatically, which avoids the large amount of manual labor otherwise spent annotating skeleton information.
5. Unlike traditional syntactic translation systems, during decoding the system first translates the syntactic skeleton and controls reordering, and then performs non-syntactic translation of local fragments under a good syntactic skeleton. This is the first time such a method has been used in a translation system.
Specific embodiment
The present invention is further elaborated with reference to the accompanying drawings of the specification.
As shown in Figure 1, a statistical machine translation system based on a syntactic skeleton according to the present invention comprises the following steps:
1) A probabilistic SCFG hierarchical rule extraction algorithm extracts non-syntactic translation rules, which are used to translate the non-skeleton parts of the sentence to be translated:
Using the heuristically constrained method for extracting hierarchical rules, probabilistic SCFG grammar rules are extracted from parallel sentence pairs that have been word-aligned but not syntactically parsed. These hierarchical phrase rules, i.e. the non-syntactic translation rules, handle the translation of the low-level structure of the sentence to be translated;
2) The GHKM rule extraction method extracts syntactic translation rules, which are used to translate the skeleton part of the sentence to be translated:
Using the GHKM rule extraction algorithm, hierarchical GHKM rules are extracted from word-aligned parallel sentence pairs together with the syntactic parse of the source-language side. The extracted GHKM rules are rewritten into syntactic translation rules, which handle the high-level structure of the sentence to be translated, that is, the syntactic translation of the sentence's syntactic structure;
3) Generation of non-fully syntactic translation rules:
Non-fully syntactic translation rules are generated from the syntactic translation rules and are used together with the non-syntactic translation rules, so that the advantages of the non-syntactic translation system and the syntactic translation system are combined;
4) Model generation:
Based on the above non-fully syntactic translation rules, the grammars (translation rule sets) of the non-syntactic translation system and the syntactic translation system are integrated according to the different translation tasks, and non-fully syntactic translation derivations are generated. The non-fully syntactic rules identify the different translation tasks: non-syntactic translation rules handle the translation of the low level of the text (words and phrases), while syntactic and non-fully syntactic translation rules complete the translation of the high level of the text (syntactic structure). The non-syntactic, syntactic, and non-fully syntactic translation rules are collected to form an SCFG grammar system with large coverage.
In step 1), probabilistic SCFG hierarchical rule extraction: from parallel sentence pairs that have been word-aligned but not syntactically parsed, the present invention extracts probabilistic SCFG grammar rules using the heuristically constrained method for extracting hierarchical phrase rules, and uses these hierarchical phrase rules, i.e. the non-syntactic translation rules, to handle the translation of the low-level structure of the sentence to be translated;
In step 2), GHKM rules are extracted from the word-aligned parallel sentence pair data under the guidance of source-language syntax tree information and are rewritten into SCFG-style syntactic translation rules. That is, an extracted GHKM rule of the following format:
source-language phrase syntactic label (source-language syntax tree fragment) → target-language translation
is rewritten, by keeping the nonterminals on the boundary of the syntax tree fragment and discarding the internal tree structure, into a syntactic translation rule of the form:
source-language phrase syntactic label → <source-language string, target-language string>
Here the "source-language string" is a sequence of source-language terminal words and nonterminals with their corresponding syntactic labels; this sequence is the leaf-node sequence of the source-language syntax tree fragment in the GHKM rule that corresponds to the syntactic rule. The "target-language string" is a string of target-language terminal words and nonterminals with their corresponding syntactic labels, whose nonterminals correspond one-to-one to the nonterminals at the leaf nodes of the source-language syntax tree fragment.
In step 3), syntactic skeleton information is obtained from the source-side syntactic information, and non-fully syntactic translation rules are obtained by adjusting and recombining the syntactic and non-syntactic translation rules. A non-fully syntactic translation rule has the form:
source-language phrase syntactic label → <source-language string*, target-language string*>
Here the "source-language phrase syntactic label" on the left-hand side is a nonterminal; "source-language string*" is a sequence of source-language words (terminals), nonterminals, and the generalized label X; and "target-language string*" is a string of target-language words (terminals), nonterminals, and the generalized label X, whose nonterminals correspond one-to-one to the nonterminals at the leaf nodes of the source-language syntax tree fragment;
A non-fully syntactic translation rule differs from a syntactic translation rule in that it does not require every nonterminal in the rule to belong to one of the phrase syntactic labels of the source-language parse: some of its nonterminals are generalized to X, which indicates that the nonterminal does not belong to any syntactic category.
Each syntactic translation rule can be rewritten into non-fully syntactic translation rules. Specifically, one or two nonterminals on the right-hand side of the rule are generalized to X while the left-hand side is kept unchanged, which converts the rule into a non-fully syntactic translation rule.
After the syntactic, non-syntactic, and non-fully syntactic translation rules have all been collected, all of the rules are used to build a larger SCFG grammar system. When the sentence to be translated is decoded, the non-fully syntactic translation rules guide the derivation and the corresponding syntactic structure is generated, so that the strengths of the different translation approaches are used at different levels of the sentence: the non-syntactic translation system is used for low-level translation (such as phrases), and the syntactic translation system is used for high-level translation tasks (such as syntactic structure).
The advantages of the non-syntactic translation system and the syntactic translation system are combined as follows:
With the generated large-coverage SCFG grammar system, the non-fully syntactic translation rules realize a gradual transition from the syntactic translation system to the non-syntactic translation system, and the syntactic skeleton is created during derivation;
The non-fully syntactic and syntactic translation rules capture the reordering between the different constituents of the sentence to be translated. The low-level translation tasks are assigned to the non-syntactic translation rules, and the translation tasks of the high-level skeleton part are assigned to the syntactic and non-fully syntactic translation rules.
In step 4), based on the above non-fully syntactic translation rules, the grammars of the non-syntactic translation system and the syntactic translation system are adjusted according to the different translation tasks, and a large-coverage SCFG grammar system composed of the three types of rules is generated. This system not only reorders the skeleton constituents of the sentence well but also generates the syntactic skeleton of the sentence within the SCFG system. A weight is computed for each translation derivation so that the various translation rules are used to derive translations more accurately. The score of each translation derivation d is calculated with the following formula:
$$s(d) = \prod_{r_i \in d_s} w(r_i) \times \prod_{r_j \in d_h} w(r_j) \times lm(t)^{\lambda_{lm}} \times \exp(\lambda_{wb} \cdot |t|)$$
where s(d) is the score of the translation derivation d and t is the target-language string. The score of d is defined as the product of several factors:
Factor 1: the product of the weights of all rules in the syntactic skeleton d_s of d, $\prod_{r_i \in d_s} w(r_i)$, where r_i is the i-th rule in d_s and w(r*) is the weight of rule r*;
Factor 2: the product of the weights of all rules in the non-skeleton part d_h of d, $\prod_{r_j \in d_h} w(r_j)$, where r_j is the j-th rule in d_h and w(r*) is the weight of rule r*;
Factor 3: the exponentially weighted n-gram language model score $lm(t)^{\lambda_{lm}}$, where λ_lm is the weight of the n-gram language model;
Factor 4: the word bonus exp(λ_wb · |t|), where exp(|t|) is the exponential of the translation length, so the longer the sentence, the larger the bonus; λ_wb is the weight of the word bonus.
Decoding application:
When the model is applied in decoding, the source-side sentence to be translated is parsed with the generated large-coverage synchronous context-free grammar (SCFG). During parsing, the non-fully syntactic rules and the syntactic rules are used to analyze the sentence according to the structure of the syntactic skeleton; the syntactic skeleton of the sentence is generated, and the target-language sides of the rules in the generated large-coverage SCFG produce the target-language translation. For each fragment, if a derivation with a corresponding non-fully syntactic translation rule exists, the structural information of the local fragment is obtained; if no corresponding non-fully syntactic translation rule is found, the model searches the derivation space (which contains syntactic, non-syntactic, and non-fully syntactic translation rules) for the best translation derivation.
In the present invention, the machine translation model framework based on the syntactic skeleton can be divided roughly into three parts: rule acquisition, model generation, and model application. The model framework is shown in Figure 1.
First, using the methods described above, different types of translation rules are extracted in different ways from the bilingual aligned data and the source-language syntax tree information. Then, according to the source-language syntactic features, some of the syntactic translation rules are rewritten to generate appropriate non-fully syntactic translation derivations, which connect the various types of derivation rules. Finally, during decoding the skeleton model finds a suitable derivation for the translation tasks at different levels.
One. Translation rule acquisition:
In the present invention, different rules are extracted with different methods:
1) Non-syntactic translation rule extraction:
Because the present invention is implemented on top of an SCFG grammar, SCFG grammar rules can be expressed in the following form:
LHS → <α, β, ~>
where LHS is a nonterminal, α and β are the source-side and target-side sequences of terminals and nonterminals respectively, and ~ denotes the one-to-one correspondence between the nonterminals in α and β.
For non-syntactic translation rules, probabilistic SCFG grammar rules are extracted, using the heuristically constrained method for extracting hierarchical rules, from parallel sentence pairs that have been word-aligned but not syntactically parsed. With the resulting probabilistic SCFG, a given sentence can be decoded by finding the most probable derivation. Fig. 2 gives an example of extracting non-syntactic translation rules, in which every nonterminal is labelled X. If a sequence of such SCFG rules completely covers and derives the source sentence, it is regarded as an SCFG derivation of that source sentence. For example, rules h5, h1, and h3 in the figure can produce a derivation of a sentence pair.
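To make the notion of a derivation concrete, the following minimal sketch expands a small set of SCFG rules of the form LHS → <α, β, ~> into a source string and a target string. The rule encoding (tuples, nonterminals written as a label followed by an index such as X1) and the placeholder words are assumptions made for this illustration; they are not the rules h5, h1, and h3 of Fig. 2.

```python
import re

NONTERMINAL = re.compile(r"^[A-Z]+(\d+)$")  # e.g. X1, NP2 (assumed notation)

def realize(derivation):
    """Expand a derivation into (source tokens, target tokens).

    A derivation is (rule, children): rule is (lhs, alpha, beta) and
    children maps a nonterminal index to the sub-derivation that fills it.
    """
    (lhs, alpha, beta), children = derivation
    filled = {idx: realize(sub) for idx, sub in children.items()}

    def expand(side, pick):
        out = []
        for token in side:
            match = NONTERMINAL.match(token)
            if match:
                # Linked nonterminals share an index on both sides (the ~ relation).
                out.extend(filled[match.group(1)][pick])
            else:
                out.append(token)
        return out

    return expand(alpha, 0), expand(beta, 1)

# Toy rules with placeholder words; every nonterminal is labelled X.
top = ("X", ["X1", "a", "X2"], ["X2", "A", "X1"])
left = ("X", ["b"], ["B"])
right = ("X", ["c"], ["C"])
source, target = realize((top, {"1": (left, {}), "2": (right, {})}))
```

Here the top rule swaps its two sub-derivations on the target side, which is the kind of local reordering that hierarchical phrase rules capture.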
2) Syntactic translation rule acquisition:
The form of a non-syntactic translation rule is essentially the same as that of an ordinary syntactic translation rule; the difference is that non-syntactic translation rules are generated without any syntactic constraint (on either the source or the target side). If the syntactic information of either side (source or target) is used as a constraint, derivation rules that conform to the syntactic information, i.e. syntactic translation rules, can be obtained. Syntactic translation rules are obtained in the following way.
GHKM rule extraction:
To generate rules in syntactic form, the present invention uses the mainstream method: GHKM rules are extracted from word-aligned bilingual sentence pairs, with the source-side syntax tree information as constraint and guidance.
In GHKM extraction, the present invention models the mapping from a source-language syntax tree to a target-language string. A GHKM rule consists of a source-language fragment s_r, a target-language fragment t_r, and the correspondence between the nonterminals in the two fragments. For example, the following is a GHKM rule:
VP(VV(improve) x1:NN) → increase x1
Rewriting GHKM rules:
In this embodiment the above rule format is rewritten into SCFG rule format. Specifically, the annotations of the frontier nonterminals are kept unchanged and the tree structure inside the fragment is discarded, for example:
VP → <improve NN1, increase NN1>
where VP is a verb phrase, VV is the verb part of speech, NN is the noun part of speech, x1 is a nonterminal, and NN1 is a variable whose part of speech is noun.
In the invention GHKM rules are converted into SCFG rules. Because all nonterminals are labelled with source-side syntactic labels, applying the resulting syntactic translation rules is always subject to the constraints of correct syntax.
Fig. 2 shows the process of extracting syntactic translation rules from a pair consisting of a source-language tree and a target-language string. This embodiment ignores the multi-level tree structure of the original GHKM rules but keeps the frontier nodes of each rule, so the rewriting allows the system to generalize relatively well when translating new sentences.
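The rewriting described above (keep the frontier, drop the internal tree structure) can be sketched as follows. The nested-tuple tree encoding, the `x1:NN` variable notation, and the function names `frontier` and `ghkm_to_scfg` are assumptions made for this example.

```python
def frontier(node):
    """Return the leaves of a tree fragment from left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    leaves = []
    for child in children:
        leaves.extend(frontier(child))
    return leaves

def ghkm_to_scfg(tree, target):
    """Flatten a GHKM rule (tree fragment -> target string) into an SCFG rule.

    Variables such as "x1:NN" become labelled nonterminals such as "NN1"
    on both sides; the internal nodes of the fragment are discarded.
    """
    root = tree[0]
    renamed = {}
    source = []
    for leaf in frontier(tree):
        if leaf.startswith("x") and ":" in leaf:
            variable, label = leaf.split(":")
            renamed[variable] = label + variable[1:]
            source.append(renamed[variable])
        else:
            source.append(leaf)
    return root, source, [renamed.get(token, token) for token in target]

# The example rule from the text: VP(VV(improve) x1:NN) -> increase x1
ghkm_tree = ("VP", [("VV", ["improve"]), "x1:NN"])
scfg_rule = ghkm_to_scfg(ghkm_tree, ["increase", "x1"])
```

The result corresponds to the rule VP → <improve NN1, increase NN1>: the VV node has disappeared, while the root label and the frontier nonterminal are preserved.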
In addition, decoding with syntactic translation rules can be regarded as an SCFG parsing process. One popular approach is string parsing (string-based decoding), in which a chart decoder (for example, a CYK decoder) decodes the input sentence. When source-side parse information is available for the test set, tree parsing (tree-based decoding) can be used to decode the parse tree instead. In that case every derivation must conform to the input parse tree, so the source-side syntactic information acts as a hard constraint, which increases accuracy.
3) Acquisition of non-fully syntactic translation rules (defined by the present invention)
The non-syntactic and syntactic translation systems each have strengths and weaknesses. For example, non-syntactic translation models are very good at lexical selection and lexicalized reordering, but are heavily constrained when handling complex constituent movement. Syntactic translation models can describe the hierarchical movement of constituents through linguistic syntactic annotation and also perform very well in high-level syntactic reordering. Both kinds of model, however, suffer from sparsity and limited coverage.
Ideally, each model is applied where it is most effective: 1) the syntactic translation model handles the generation of the high-level syntactic skeleton and the reordering between syntactic constituents; 2) non-syntactic translation rules handle low-level lexical translation and reordering. To this end, the invention proposes a method that combines the two kinds of advantage in one model. The grammars of non-syntactic and syntactic translation are both reused in translation, and a novel kind of rule, the non-fully syntactic translation rule, is developed to connect syntactic and non-syntactic translation rules as a transition between them.
A rule is a non-fully syntactic translation rule if its left-hand side (LHS) is a source-side syntactic label and at least one nonterminal on its right-hand side (RHS) is labelled X. The following are non-fully syntactic translation rules:
VP → <improve X1, increase X1>
NT → <improve X2, increase X2>
In the first rule the left-hand side represents a verb phrase (VP), while the right-hand side, like that of a standard non-syntactic translation rule, contains the nonterminal X. Such a rule can be applied on top of a partial non-syntactic derivation to produce a derivation rooted at VP. Syntactic translation rules can then substitute this VP derivation, as is usual in a syntactic machine translation system, which realizes the transition from the syntactic translation system to the non-syntactic translation system.
Two. Skeleton model generation
Because non-fully syntactic translation rules connect non-syntactic and syntactic translation rules, all of these rules can be used to build non-fully syntactic derivations, which form a grammar system capable of generating the syntactic skeleton, i.e. the basis of the skeleton model. Fig. 3 shows a derivation built from non-syntactic, syntactic, and non-fully syntactic translation rules. In this derivation, the non-syntactic translation rules (h3, h6, and h8) are applied to low-level translation. By applying syntactic rules (the non-fully syntactic translation rule p3 and the syntactic translation rules r1 and r4) on top of the partial X derivations, a derivation that conforms to the syntactic skeleton of the sentence is built.
This syntactic structure (upper right corner of Fig. 3) is created by the source-side syntactic rules and is called the syntactic skeleton in the present invention. In general it is a tree fragment that has high-level syntax and terminals or nonterminals at its leaf nodes. With this skeleton structure, the reordering between the constituents in "NP VP" can be captured easily, while the low-level translations ("answer" and "satisfied") are assigned to non-syntactic translation rules.
To obtain non-fully syntactic translation rules, a simple and direct method is used. For each syntactic translation rule, one or two nonterminals on the right-hand side (RHS) are generalized to X while the left-hand side (LHS) is kept unchanged, which converts the rule into a non-fully syntactic translation rule. For example, from rule r5 (VP → <to NP1 VP2, VP2 with NP1>) in the diagram of tree-based decoding with syntactic translation rules (Fig. 4), three non-fully syntactic translation rules are obtained:
VP → <to X1 X2, X2 with X1>
VP → <to X1 VP2, VP2 with X1>
VP → <to NP1 X2, X2 with NP1>
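The generalization step can be sketched as below: every choice of one or two labelled nonterminals is replaced by X on both sides of the rule, and the left-hand side is left untouched. The tuple encoding of rules, the assumption that nonterminals are written as an upper-case label followed by an index, and the name `generalize` are choices made for this illustration only.

```python
import re
from itertools import combinations

LABELLED = re.compile(r"^([A-Z]+)(\d+)$")  # e.g. NP1, VP2 (assumed notation)

def generalize(lhs, alpha, beta, max_generalized=2):
    """Generate non-fully syntactic rules by relabelling nonterminals as X."""
    # Collect the indices of the syntactically labelled nonterminals.
    indices = []
    for token in alpha:
        match = LABELLED.match(token)
        if match and match.group(1) != "X":
            indices.append(match.group(2))

    rules = []
    for k in range(1, max_generalized + 1):
        for chosen in combinations(indices, k):
            def relabel(token):
                match = LABELLED.match(token)
                if match and match.group(2) in chosen:
                    return "X" + match.group(2)
                return token
            # The same indices are relabelled on both sides, so links are kept.
            rules.append((lhs,
                          tuple(relabel(t) for t in alpha),
                          tuple(relabel(t) for t in beta)))
    return rules

# Rule r5 from the text: VP -> <to NP1 VP2, VP2 with NP1>
r5 = ("VP", ("to", "NP1", "VP2"), ("VP2", "with", "NP1"))
partial_rules = generalize(*r5)
```

For r5 this yields exactly the three rules listed above, although in a different order (the two single-nonterminal generalizations come first).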
Once all rules include, non-syntactic translation rule, syntactic translation rule and non-fully syntactic translation rule standard
It is standby to finish, a bigger SCFG derivation syntax just are established using them and are applied it in decoder.Utilize weight
Logarithmic linear method carrys out the weight of computation rule.With standardized based on as SCFG model, for LHS →<α, β ,~>and have with
Under several features:
1. The translation probabilities P(α | β) and P(β | α), estimated by relative frequency; these two probabilities are the forward translation probability and the inverse translation probability, respectively.
2. The lexical weights Plex(α | β) and Plex(β | α), estimated with a heuristic method.
3. The rule bonus (exp(1)), kept separate for non-syntactic translation rules, syntactic translation rules and non-fully syntactic translation rules.
4. Indicators for glue rules, lexicalized rules and non-lexicalized rules, which allow the model to learn a preference for specific types of rules.
5. The number of X nonterminal symbols on the source-language side of a non-fully syntactic translation rule (exp(#)), which controls the degree to which the model violates syntactic compatibility.
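As a minimal sketch of how a log-linear rule weight combines such features, the following Python fragment raises each feature value to its feature weight and multiplies the results. The feature names, the numeric values and the function name `rule_weight` are hypothetical and chosen only for illustration; they are not taken from the invention.

```python
import math

def rule_weight(features, lambdas):
    """Log-linear rule weight: the product of f_i ** lambda_i over all features.

    The sum is computed in log space and exponentiated at the end, which is
    the usual way to avoid underflow when many small probabilities are combined.
    """
    log_w = sum(lambdas[name] * math.log(value)
                for name, value in features.items())
    return math.exp(log_w)

# Hypothetical feature values for one non-fully syntactic rule.
features = {
    "p_fwd": 0.5,             # forward translation probability
    "p_inv": 0.25,            # inverse translation probability
    "plex_fwd": 0.4,          # lexical weight, one direction
    "plex_inv": 0.2,          # lexical weight, other direction
    "bonus_partial": math.e,  # rule bonus exp(1) for this rule type
    "num_x": math.exp(2),     # exp(#) with two X nonterminals on the source side
}
# Hypothetical feature weights; in practice they would come from tuning.
lambdas = {name: 1.0 for name in features}

print(rule_weight(features, lambdas))
```

Setting a feature weight to zero removes that feature's influence, since any value raised to the power 0 contributes a factor of 1.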
The present invention defines a derivation weight (score) in the model. Let d be a derivation of the above grammar. In order to distinguish the syntactic rules (that is, the syntactic translation rules and the non-fully syntactic translation rules) from the non-syntactic translation rules, d is defined as a tuple <ds, dh>, where ds is the partial derivation of the skeleton structure and dh is the set of rules used to build the remaining part of d. For example, in Fig. 2, ds = {r4, r1, p3} and dh = {h6, h8, h3}.
Let t be the string encoded on the target-language side. The score of d is then defined as the product of the rule weights, the n-gram language model lm(t) and the word bonus exp(|t|),
where w(r*) is the weight of rule r*, and λlm and λwb are the feature weights of the language model and the word bonus, respectively.
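Written out, the score described above can take the following form. This is a reconstruction from the definitions in the text, not a formula quoted from it: the split of the rule product into one product over ds and one over dh, and the index names r_i and r_j, are assumptions.

```latex
s(d) = \prod_{r_i \in d_s} w(r_i) \times \prod_{r_j \in d_h} w(r_j)
       \times lm(t)^{\lambda_{lm}} \times \exp\bigl(\lambda_{wb} \cdot |t|\bigr)
```

Here the first product collects the rules of the skeleton part of the derivation and the second the rules of the remaining part, so w(r*) in the text stands for either w(r_i) or w(r_j).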
In addition, the framework of the model of the present invention is very flexible; in particular, it subsumes both the syntactic translation model and the non-syntactic translation model. For example, if a derivation d consists only of non-syntactic translation rules, it is exactly a non-syntactic translation derivation. Likewise, if a derivation d consists only of syntactic translation rules, it is exactly a syntactic translation derivation. What the present invention shows is how non-fully syntactic translation rules are used to guide the search across the derivation spaces of non-syntactic translation and syntactic translation. The decoder can then select the best derivation from the enlarged derivation space according to the model score.
Three, application of the model in decoding:
In use, the model of the present invention can be regarded as a string-parsing problem, because it parses the source-language string with the syntactic rules on the source-language side and generates the target-language translation with the target-language side of the rules. The translation result can therefore be taken to be the highest-scoring target-language string generated by the rules. In the present invention, the system is implemented on the basis of a CYK decoder; beam search and cube pruning are used in the decoder, and the decoder can use the binarized rules obtained by the synchronous binarization method.
Because a large number of non-fully syntactic translation rules are introduced, decoding becomes very slow. To speed up the decoding system, several further pruning methods are applied to reduce the search space. First, lexicalized rules or non-fully syntactic translation rules whose scope is greater than 3 are discarded. These rules are removed because they are a main cause of slow decoding and contribute little to the final translation result. In addition, non-lexicalized rules and non-fully syntactic translation rules whose right-hand side (RHS) consists only of the nonterminal symbol X are discarded. In most cases, rules of this type provide no syntactic constraint or guidance. For example, the rule VP → <X1 X2, X2 X1> is too general: introducing a VP constituent over two consecutive blocks that carry no lexical or syntactic signal is unreasonable, because doing so has no effect.
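The two pruning criteria can be sketched as a simple filter. This is a hedged illustration only: the `Rule` class, its fields and the function `keep_rule` are hypothetical, and the scope of a rule is assumed to be computed elsewhere and stored in the `scope` field, since the text does not say how it is calculated.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    lhs: str
    src: list              # source side; nonterminals are (label, index) tuples
    scope: int             # assumed to be precomputed elsewhere
    partial: bool = False  # True for a non-fully syntactic translation rule

def is_nonterminal(tok):
    return isinstance(tok, tuple)

def keep_rule(rule, max_scope=3):
    """Return False for rules that the two pruning criteria would discard."""
    lexicalized = any(not is_nonterminal(t) for t in rule.src)
    # Criterion 1: lexicalized or non-fully syntactic rules with scope > max_scope.
    if (lexicalized or rule.partial) and rule.scope > max_scope:
        return False
    # Criterion 2: the source side consists only of X nonterminals.
    nonterminals = [t for t in rule.src if is_nonterminal(t)]
    if nonterminals and not lexicalized and all(
            label == "X" for label, _ in nonterminals):
        return False
    return True

# VP -> <X1 X2, X2 X1> from the text: no lexical or syntactic signal.
too_general = Rule("VP", [("X", 1), ("X", 2)], scope=3, partial=True)
# VP -> <to X1 X2, X2 with X1>: lexicalized, small scope (scope value is made up).
useful = Rule("VP", ["to", ("X", 1), ("X", 2)], scope=2, partial=True)
print(keep_rule(too_general), keep_rule(useful))
```

A filter of this kind would be applied once, when the grammar is loaded, so that the discarded rules never enter the chart.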
Besides pruning the rules, a parameter ws can also be used to control the depth of the syntax skeleton. If ws is given a small value, the system is forced to use a smaller syntax skeleton (and fewer syntactic rules). In the extreme case ws = 0, the system falls back to a typical non-syntactic translation system; conversely, if ws = +∞, the system considers syntax skeletons of any depth. The parameter ws can therefore be tuned on the test set to find a balance point.
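The effect of ws can be illustrated with a small depth check. This is a sketch under stated assumptions: the text does not specify how skeleton depth is measured, so here it is taken to be the number of syntactic rules on the longest path from the root of the skeleton, and the nested-tuple representation and the names `skeleton_depth` and `skeleton_allowed` are hypothetical.

```python
import math

def skeleton_depth(node):
    """Depth of a skeleton given as nested (rule_name, [children]) tuples.

    None stands for an empty skeleton (no syntactic rule used at all).
    """
    if node is None:
        return 0
    _, children = node
    return 1 + max((skeleton_depth(child) for child in children), default=0)

def skeleton_allowed(skeleton, w_s=math.inf):
    """A derivation is kept only if its skeleton is no deeper than w_s."""
    return skeleton_depth(skeleton) <= w_s

# A two-level skeleton: r4 at the root with r1 and p3 below it.
skeleton = ("r4", [("r1", []), ("p3", [])])

print(skeleton_allowed(skeleton, w_s=0))         # only an empty skeleton passes
print(skeleton_allowed(skeleton, w_s=2))
print(skeleton_allowed(skeleton, w_s=math.inf))  # any depth passes
```

With `w_s=0` only derivations without any skeleton survive, which corresponds to the fallback to a non-syntactic translation system described above.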
To speed up the system, some tree-parsing techniques are also used. In addition to the source sentence, the parse tree of the source sentence is also given to the decoder. First, the source sentence is parsed with the non-syntactic translation rules commonly used in non-syntactic translation systems; however, when a segment of the source sentence that corresponds to a constituent of the syntax tree is processed, no limit is placed on the span over which rules are applied. Then, the syntactic translation rules are applied on the parse tree. If the source-language side of a syntactic translation rule matches a fragment of the input tree, then: 1) the rule is converted to non-fully syntactic translation rules (see point 3); 2) the syntactic translation rule and the corresponding non-fully syntactic translation rules are added to the rule list, and this list is linked to the CYK chart cell that corresponds to the source-language tree fragment. Fig. 4 gives an example of tree matching in the decoder. Afterwards, the remaining decoding steps (such as building the translation hypergraph and intersecting it with the language model) proceed as usual. This method efficiently matches the (non-fully) syntactic translation rules needed for decoding and does not require the rules to be binarized. Because the given source-language syntax tree is a hard constraint, this treatment is a trade-off: it introduces some derivations that are sensitive to the syntax.
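The tree-matching step can be sketched as follows. This is a deliberately simplified illustration, not the decoder of the invention: it matches only one-level tree fragments, it attaches only the matched syntactic rule (the converted non-fully syntactic variants would be appended to the same list), and the `Node` class, the rule format and the function names are all assumptions.

```python
from collections import defaultdict

class Node:
    """A parse-tree node covering source words [span[0], span[1])."""
    def __init__(self, label, span, children=(), word=None):
        self.label = label
        self.span = span
        self.children = list(children)
        self.word = word  # set only on leaf nodes

def matches(lhs, src, node):
    """Check a one-level rule source side against a node and its children.

    src mixes terminals (strings) and nonterminals ((label, index) tuples).
    """
    if node.label != lhs or len(src) != len(node.children):
        return False
    for tok, child in zip(src, node.children):
        if isinstance(tok, tuple):
            if child.label != tok[0]:
                return False
        elif child.word != tok:
            return False
    return True

def attach_rules(root, rules):
    """Link every matching rule to the chart cell of the matched tree fragment."""
    chart = defaultdict(list)
    stack = [root]
    while stack:
        node = stack.pop()
        for lhs, src in rules:
            if matches(lhs, src, node):
                chart[node.span].append((lhs, src))
        stack.extend(node.children)
    return chart

# Toy input tree for "to answer satisfied", with hypothetical spans and labels.
tree = Node("VP", (0, 3), [
    Node("P", (0, 1), word="to"),
    Node("NP", (1, 2), word="answer"),
    Node("VP", (2, 3), word="satisfied"),
])
rules = [("VP", ["to", ("NP", 1), ("VP", 2)]), ("NP", [("NN", 1)])]
chart = attach_rules(tree, rules)
print(dict(chart))
```

Because each rule is looked up directly at the tree node it matches, no span enumeration is needed for the syntactic rules, which is consistent with the statement that rule binarization is not required in this mode.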
Four, experiments
The present invention tests the method on English-Chinese (en-zh) and Chinese-English (zh-en) translation.
1) Experimental setup of the baseline systems
The present invention uses 2.74 million Chinese-English bilingual sentence pairs selected from NIST12 OpenMT. After the GIZA++ tool is run on the bilingual text to produce word alignments in both directions, the symmetrized word alignment file is obtained with the grow-diag-final-and method. For syntactic analysis, both sides of the data are first parsed with the Berkeley parser, and the parse trees are then binarized with the widely used leftmost derivation method so as to generalize better on the test set. The syntax-based (syntactic translation) rules are extracted from the entire training data, with at most five nonterminal symbols per rule. For the non-syntactic translation system, the hierarchical rules (non-syntactic translation) are extracted from a subset of 940,000 sentences, with no more than two nonterminal symbols per rule, while the phrase rules are extracted from the entire training set. All rules are obtained with the open-source toolkit NiuTrans.
The present invention trains two 5-gram language models. One is trained on the Xinhua portion of the English Gigaword data and the English side of the bilingual data, and is used in the Chinese-English translation system. The other is trained on the Xinhua portion of the Chinese Gigaword data and the Chinese side of the bilingual data, and is used in the English-Chinese translation system. All language models are smoothed with the modified Kneser-Ney smoothing method.
For the Chinese-English translation system, the present invention evaluates the system separately on news-domain data and web data. The tuning sets (news domain: 1198 sentences; web data: 1308 sentences) are drawn from the NIST machine translation 04-06 evaluation data and the GALE data. The test sets (news domain: 1779 sentences; web data: 1768 sentences) contain all news-domain and web evaluation data of the NIST 08 and 12 machine translation evaluations and of 08-progress. For the English-Chinese translation system, the tuning set (995 sentences) and the test set (1859 sentences) are the evaluation data of the SSMT07 and NIST MT08 Chinese-English translation tracks, respectively. All source-language-side parse trees are produced with the same method used to process the training data.
2) Experiments with the machine translation system based on the syntax skeleton
The present invention implements its CYK decoder according to the method described in the section on applying the model in decoding. Under the default setting, string parsing is used in the experiments, and the parameter ws is initially set to +∞. All feature weights are tuned with MERT. Because MERT may end in a local optimum, each experiment is run 5 times, each time with different initial feature values. For evaluation, unmodified BLEU4 and unmodified BLEU5 are used for the Chinese-English and the English-Chinese translation systems, respectively.
3) Experimental results of the machine translation system based on the syntax skeleton
Table 1 gives the experimental results, in which the system based on the syntax skeleton is abbreviated as SYNSKEL. First, it can be seen that the SYNSKEL system achieves a significant improvement on all 3 test sets. Using CTB-style parse trees yields an average improvement of more than 0.6 BLEU, and using binarized syntax trees yields an average improvement of more than 0.9 BLEU. Moreover, with tree parsing, the (partial) syntactic rules can be applied well alongside the normally used non-syntactic translation rules, which gives good results: the BLEU scores obtained are comparable to those of string parsing. However, putting more trees into a binarized forest brings no improvement to the results. These interesting results show that, in a very large derivation space, it is difficult to introduce novel derivations by considering more alternative binarized syntactic structures.
Table 1. Experimental results under different systems
In addition, the results of the system are studied under different maximum skeleton depths (that is, different values of the parameter ws). Fig. 5 shows that an overly large skeleton does not necessarily give better results, where BLEU is the index used to evaluate translation quality. When the parameter is restricted to ws ≤ 5, the skeleton-based system still obtains a satisfactory improvement, and compared with using skeletons of unrestricted depth, decoding time is reduced by nearly 27%.
The usage rates of the different types of rules are shown in Table 2. It can be seen that the non-fully syntactic translation rules defined by the present invention have the highest usage rate and achieve good translation results.
Table 2. Usage rates of different derivations on the tuning sets
4) analysis of experimental results
After the experiments, the present invention studies how often the system uses the different types of derivations. Table 2 shows, for three different tasks, the tendency of the system to choose non-fully syntactic translation derivations versus non-syntactic translation derivations. The English-Chinese translation task shows heavy use of syntactic and non-fully syntactic translation derivations, followed by the Chinese-English news-domain and web-data translation tasks. To some extent, this result reflects that parsing quality differs across languages and domains.
1. Improved translation quality:
The test results show that the system of the present invention based on the syntax skeleton (abbreviated as SYNSKEL) achieves a significant improvement on all 3 test sets. Using CTB-style syntax trees yields an average improvement of more than 0.6 BLEU, and using binarized syntax trees yields an average improvement of more than 0.9 BLEU. Moreover, with tree parsing, syntactic translation rules and non-fully syntactic translation rules can be applied well alongside the normally used non-syntactic translation rules, which gives good results: the BLEU scores obtained are comparable to those of string parsing.
2. Good reordering control:
According to the comparative experimental data provided by the present invention, Fig. 6 lists the top 5 translation results on the tuning set, so that different translation results on the same tuning set can be compared. The translation output by the non-syntactic translation system is disordered and its reordering is also wrong, and the syntactic translation system, working from its parse of the source sentence, also translates the difficult "right3 ... difficult20" structure poorly. In contrast, the SYNSKEL system uses a rule that spans the source-language words "also2 ... worry22", and, because it builds on the non-syntactic translation system, it also translates the "right3 ... difficult20" structure, which is covered by a partial non-syntactic translation derivation, relatively well.
3. Better identification of syntactic structure:
The bottom of Fig. 6 shows a real translation example generated by a derivation that uses this rule. It can be seen that, in the SYNSKEL system, this rule covers the source-language words "right3 ... annual pay20" and successfully identifies the reordering of the "... ..." structure. Note that the non-syntactic translation system also has a rule of this kind, X → <X1 X2, X2 of X1>, that can translate the "... ..." (de) structure. However, when the span becomes larger, for example when the longer word sequence "should8 ... dollar15" is translated, the non-syntactic translation system loses this reordering ability.