CN102193915A

CN102193915A - Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation

Info

Publication number: CN102193915A
Application number: CN2011101486920A
Authority: CN
Inventors: 奚宁; 李博渊; 汤光超; 赵迎功; 陈家骏; 戴新宇; 张建兵
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2011-06-03
Filing date: 2011-06-03
Publication date: 2011-09-21
Anticipated expiration: 2031-06-03
Also published as: CN102193915B

Abstract

The invention provides a word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation. The method comprises the following steps: 1, determining the skeleton alignment: searching for and selecting the optimal skeleton connections with a connection selection algorithm based on connection confidence, the selected connections forming the skeleton alignment; and 2, projecting the selected skeleton alignment onto each word segmentation to obtain a word alignment for each segmentation. The method improves on conventional word alignment algorithms that rely on a single word segmentation, and can improve both the word alignment quality under each segmentation and the machine translation quality. By fusing the features used for word alignment under multiple segmentations, the final word alignment becomes more robust, and the number of word alignment errors caused by segmentation errors or by bilingual segmentation inconsistency can be reduced.

Description

A word alignment fusion method based on a word segmentation network for computer Chinese-to-English translation

Technical field

The present invention relates to the field of computer software for language translation, and in particular to a word alignment fusion method based on a word segmentation network for computer Chinese-to-English translation.

Background technology

With the rapid growth in the amount of information in today's world and increasingly frequent international exchange, computer network technology has spread and developed rapidly, language barriers have become more apparent and more serious, and the potential demand for machine translation keeps growing. Machine translation uses a computer to translate between different languages. The language being translated is called the source language and the language translated into is called the target language; machine translation is the process of converting from the source language to the target language. In recent years, statistical machine translation methods based on large-scale corpora have made a series of significant advances. Statistical machine translation uses statistical methods to learn a large number of bilingual translation rules and features from a large-scale bilingual parallel corpus, then uses these rules and features to decode (translate) a source-language sentence, searching for the target-language sentence of maximum probability as the translation. In this pipeline, bilingual word alignment is a prerequisite step for obtaining translation rules. Word alignment finds the correspondence between the words of a bilingual parallel sentence pair. The quality of the word alignment directly affects the quality of the extracted translation rules, and in turn the final performance of the machine translation system.

If one or both of the two languages require word segmentation (as Chinese does), the common practice is to segment the corpus with some word segmentation tool before word alignment. Such a tool is usually trained on a monolingual segmentation corpus or a monolingual dictionary. Current mainstream segmentation tools perform well on the monolingual segmentation task, but a tool designed for a monolingual task does not necessarily meet the needs of bilingual word alignment; that is, its segmentation may not be optimal for word alignment. The present invention calls this phenomenon bilingual segmentation inconsistency.

Existing methods for addressing bilingual segmentation inconsistency fall roughly into two classes. One, directly optimizing a segmentation for word alignment. The optimization process is usually time-consuming and complex, and these methods must train their models from an initial word alignment result, which is itself not very reliable. Two, using different segmentations to obtain different sets of translation rules, and then merging these rule sets by some means at the decoding (translation) stage. Methods of this class do not aim to improve the word alignment quality of the individual segmentations.

The task of word alignment is to find the correspondence between the words of a bilingual sentence pair. Fig. 1 shows the correct word alignment of the Chinese-English sentence pair "下雨路滑 - Road is slippery when raining", where the Chinese sentence is segmented as "下/雨/路滑" ("/" denotes a word boundary).

Under the Chinese segmentation shown in Fig. 1, "路滑" must correspond to the two words "Road" and "slippery", and the two words "下" and "雨" must both correspond to "raining", to form a correct word alignment. In this segmentation, such "one-to-many" and "many-to-one" alignment patterns make the bilingual segmentation inconsistent and increase the difficulty of word alignment. Conversely, if the Chinese sentence is segmented as "下雨/路/滑", a more natural "one-to-one" alignment pattern can be formed, which makes the alignment task for this pair relatively easy.

Table 1 shows the segmentation of "下雨路滑" produced by three segmentation tools. As can be seen, only Stanford Segmenter with PKU standard gives a segmentation that helps word alignment; the others are either bilingually inconsistent (the word-frequency-based segmentation) or wrong (Stanford Segmenter with CTB standard). However, there is currently no effective method that can quickly choose, for each sentence pair in a word alignment corpus, a Chinese segmentation that helps word alignment.

Table 1: Segmentations produced by three segmentation tools

Segmentation tool	Segmentation
Word-frequency-based segmentation	下雨/路滑
Stanford Segmenter with PKU standard	下雨/路/滑
Stanford Segmenter with CTB standard	下雨路/滑

Summary of the invention

Object of the invention: the technical problem to be solved by the present invention is to address the deficiencies of the prior art by providing a word alignment fusion method based on a word segmentation network for computer Chinese-to-English translation.

To solve the above technical problem, the invention discloses a word alignment fusion method based on a word segmentation network for computer Chinese-to-English translation, characterized by comprising the following steps:

Step 1, determine the skeleton alignment: use a connection selection algorithm based on connection confidence to search for and select the optimal skeleton connections, which constitute the skeleton alignment;

Step 2, project the selected skeleton alignment onto each segmentation to obtain a word alignment based on each segmentation.

Step 1 of the present invention comprises the following steps:

The Chinese sentence c is segmented with each of K segmentation tools, and the resulting segmentations are denoted s_k (k = 1, ..., K), where s_k = s^k_1 s^k_2 ... s^k_{J_k}, s^k_j (j = 1, ..., J_k) is the j-th word of segmentation s_k, and J_k is the number of words in s_k. The English sentence parallel to the Chinese sentence c is E = e_1 e_2 ... e_I, where e_i (i = 1, ..., I) is the i-th English word of E and I is the total number of words in the English sentence;

For each of the K sentence pairs formed by a segmentation s_k and the English sentence E, a word alignment model is used to obtain a word alignment result, giving K results denoted a_k (k = 1, ..., K);

A word segmentation network is constructed from the K segmentations s_k of the Chinese sentence. The network is denoted C = c_1, c_2, ..., c_J, where c_j (j = 1, 2, ..., J) is the j-th skeleton word of the network C. The connection between the j-th skeleton word c_j of C and the i-th English word e_i is the skeleton connection A_{ij}; the connection between the j-th word s^k_j of segmentation s_k and the English word e_i is denoted a^k_{ij};

The confidence of a skeleton connection is calculated with the following formula:

C(A_{ij} | C, E) = \sum_{k=1}^{K} w_k \cdot c(a^k_{i \delta_k(j)} | C, E);

where c(a^k_{i \delta_k(j)} | C, E) is the confidence score of the connection a^k_{i \delta_k(j)}, and a^k_{i \delta_k(j)} is the projection of the skeleton connection A_{ij} onto segmentation s_k; w_k is the weight coefficient of segmentation k and can be obtained with a hill-climbing algorithm whose objective is to maximize the F-score on word-alignment-annotated data under some segmentation k.

A skeleton connection receives one vote from a_k when its projection onto segmentation s_k appears in a_k. The set of skeleton connections that have received at least one vote is denoted B₀ and serves as the initial skeleton connection set;

All skeleton connections in B₀ are sorted in descending order of confidence score;

Each skeleton connection is examined in turn, and a skeleton connection A_{ij} that satisfies the following conditions is added to the final skeleton connection set:

its confidence score is higher than a threshold α (the threshold can likewise be determined with the hill-climbing algorithm), and at the same time one of the following holds:

neither the skeleton word c_j nor the English word e_i is aligned; or, the skeleton word c_j is not aligned to any English word, and a skeleton connection formed by its left or right neighbor and the English word e_i has already been added to the final skeleton connection set; or, the English word e_i is not aligned to any skeleton word, and a skeleton connection formed by its left or right neighbor and the skeleton word c_j has already been added to the final skeleton connection set;

This is repeated until no new skeleton connection can be added to the final skeleton connection set, which is denoted B₁.

In step 1 of the present invention, if a skeleton connection is found to have received K votes, it is added directly to the final connection set.

Step 2 of the present invention comprises the following steps:

According to the projection function, each skeleton connection in the skeleton connection set B₁ is projected onto each segmentation s_k, yielding K new word alignment results a'_k (k = 1, ..., K);

For each word alignment result a'_k, the connections a^k_{i \delta_k(j)} in a'_k are arranged in ascending order of the confidence of the skeleton connection A_{ij} from which they were projected, and a'_k is compared with a_k connection by connection. If a new connection a^k_{i \delta_k(j)} is not in the word alignment result a_k and satisfies any one of the following conditions, it is deleted from a'_k:

the word s^k_{\delta_k(j)} of segmentation s_k and the English word e_i are both already aligned in a'_k; or, some English word e that is not a left or right neighbor of e_i is already aligned in a'_k to the word s^k_{\delta_k(j)}; or, some Chinese word c that is not a left or right neighbor of the word s^k_{\delta_k(j)} is already aligned in a'_k to the English word e_i.

The connections remaining in a'_k are taken as the final alignment result, which is the fused word alignment on segmentation s_k.

In step 2 of the present invention, if two or more skeleton connections in B₁ project onto the same connection on some segmentation s_k, only one copy of that connection is kept.

Beneficial effects: when one of the two languages of a Chinese-English pair requires word segmentation, the present invention can, before word alignment, effectively fuse the multiple segmentations produced by different segmentation tools into a single linear structure. The invention uses the features contained in the different segmentations to perform word alignment fusion, so that the word alignment quality of every segmentation can be improved, which in turn improves the performance of computer translation software.

The present invention improves on existing word alignment algorithms based on a single segmentation, and can improve both the word alignment quality and the machine translation quality under each segmentation. By fusing the features used for word alignment under multiple segmentations, the word alignment process becomes more robust, and the number of word alignment errors caused by segmentation errors or bilingual segmentation inconsistency can be reduced.

Description of drawings

The present invention is further described below with reference to the drawings and specific embodiments, from which the above and other advantages of the invention will become apparent.

Fig. 1 is a schematic diagram of a word alignment.

Fig. 2 is an example of a word segmentation lattice (WSL).

Fig. 3 is an example of a word segmentation network (WSN).

Fig. 4a and Fig. 4b are examples, respectively, of a skeleton connection and of a skeleton alignment between the WSN and an English sentence.

Fig. 5 is a flowchart of the method of the invention.

Embodiment

The present invention proposes a structure that fuses multiple word segmentations, called a word segmentation network (WSN), and on that basis proposes a word alignment fusion method based on the word segmentation network, to alleviate the word alignment problems caused by bilingual segmentation inconsistency. In the prior art, natural language processing tasks usually fuse multiple segmentations with a word segmentation lattice (WSL). For two segmentations of "下雨路滑", S₁: "下/雨/路滑" and S₂: "下雨/路/滑", Fig. 2 and Fig. 3 show the word segmentation lattice and the word segmentation network representations respectively.

The first and second rows of the WSN represent segmentation S₁ and segmentation S₂ respectively; the third row is a further segmentation distinct from S₁ and S₂, which the present invention calls the skeleton segmentation.

The skeleton segmentation is the segmentation whose word boundaries are the union of the word boundaries of S₁ and S₂; that is, its set of segmentation points is the set of segmentation points of all the segmentations. For example, "下/雨/路/滑" in Fig. 3 is a skeleton segmentation (the third row of Fig. 3). In S₁ there is a word boundary between "下" and "雨", so there is also a word boundary between "下" and "雨" in the skeleton segmentation; likewise, in S₂ there is a word boundary between "路" and "滑", so "路" and "滑" are two separate skeleton words in the skeleton segmentation.

A skeleton word is any word of the skeleton segmentation. For example, the skeleton segmentation in Fig. 3 has four skeleton words in total.

Each column of the WSN consists of a skeleton word together with the words of S₁ and S₂ that cover that skeleton word at the corresponding position. The WSN of Fig. 3 has 4 columns in total. As can be seen, the number of columns of the WSN equals the number of words in the skeleton segmentation.
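As a minimal sketch of the definition above (not the C# implementation described in the embodiment), the skeleton segmentation can be computed by cutting the sentence wherever any of the input segmentations cuts it. The function names `boundaries` and `skeleton_segmentation` are illustrative assumptions, and the example uses the sentence 下雨路滑 ("Road is slippery when raining").

```python
def boundaries(words):
    """Return the set of character offsets at which a word ends."""
    cuts, pos = set(), 0
    for w in words:
        pos += len(w)
        cuts.add(pos)
    return cuts


def skeleton_segmentation(segmentations):
    """Build the skeleton segmentation: cut wherever any segmentation cuts."""
    sentence = "".join(segmentations[0])
    if any("".join(s) != sentence for s in segmentations):
        raise ValueError("all segmentations must cover the same sentence")
    # Union of the word-end offsets of every segmentation, in sentence order.
    cuts = sorted(set().union(*(boundaries(s) for s in segmentations)))
    starts = [0] + cuts[:-1]
    return [sentence[a:b] for a, b in zip(starts, cuts)]


s1 = ["下", "雨", "路滑"]   # segmentation S1
s2 = ["下雨", "路", "滑"]   # segmentation S2
skeleton = skeleton_segmentation([s1, s2])
print(skeleton)
```

Each element of `skeleton` corresponds to one column of the WSN, so the length of the list is the number of columns.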

Note that some non-skeleton words may span several columns. For example, "路滑" in S₁ spans two columns because it is split into two words in S₂; likewise, "下雨" in S₂ spans two columns because it is split into two words in S₁.

The present invention indexes the words of every row of the WSN (including the skeleton segmentation), with indices starting from 1. The j-th skeleton word is defined to project onto the δ_k(j)-th word of segmentation s_k if and only if that word of s_k lies in the same column as the skeleton word. For example, δ₁(4) = 3 and δ₂(3) = 2.
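The projection function can be sketched in the same way. The following Python fragment is an illustration under the assumption that "lying in the same column" means that the character span of the skeleton word is contained in the character span of the segmentation word; the names `spans` and `projection` are hypothetical.

```python
def spans(words):
    """Return the (start, end) character offsets of each word."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out


def projection(skeleton, segmentation):
    """Map each 1-based skeleton index j to delta_k(j), the 1-based index of
    the word of `segmentation` whose span covers skeleton word j."""
    seg_spans = spans(segmentation)
    delta = {}
    for j, (start, end) in enumerate(spans(skeleton), start=1):
        for idx, (a, b) in enumerate(seg_spans, start=1):
            if a <= start and end <= b:
                delta[j] = idx
                break
    return delta


skeleton = ["下", "雨", "路", "滑"]
delta1 = projection(skeleton, ["下", "雨", "路滑"])   # S1
delta2 = projection(skeleton, ["下雨", "路", "滑"])   # S2
print(delta1[4], delta2[3])
```

With these inputs the sketch reproduces the two values quoted in the text, δ₁(4) = 3 and δ₂(3) = 2.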

Next, the skeleton connection and the skeleton alignment between the WSN and the English sentence are defined.

Skeleton connection: a skeleton connection denotes a translation relation between a skeleton word of the WSN and an English word.

Skeleton alignment: a skeleton alignment is a set of skeleton connections.

Fig. 4a and Fig. 4b are examples of a correct skeleton connection and a correct skeleton alignment between the above WSN and the English sentence "Road is slippery when raining". Fig. 4a shows one skeleton connection, and Fig. 4b shows the skeleton alignment formed by four skeleton connections.

The present invention fuses word alignments by selecting the optimal skeleton connections with a connection selection algorithm based on connection confidence, thereby obtaining the final skeleton alignment. By the projection function described above, any skeleton word can be projected onto a word of S₁ and a word of S₂, so any skeleton connection can be converted into conventional connections between words of S₁ or S₂ and English words. For example, the projection function maps the skeleton word "路" in Fig. 4 to "路滑" in S₁, so the skeleton connection in Fig. 4a maps to the connection from "路滑" to "Road" in S₁ and the connection from "路" to "Road" in S₂.

To evaluate the improvement in word alignment performance, the present invention uses 491 Chinese-English sentence pairs with manually annotated word alignments as its test set. The Chinese side of the test set is segmented with the Stanford segmenter based on the Penn Chinese Treebank (CTB) annotation standard. The word alignment connections in the test set fall into two classes: sure connections, denoted S, and possible connections, denoted P. If the word alignment to be evaluated is A, its F-score is calculated with the following formulas:

precision(S, A) = \frac{|A \cap S|}{|A|}

recall(S, A) = \frac{|A \cap S|}{|S|}

Fscore(S, α, A) = \frac{1}{\frac{α}{precision(S, A)} + \frac{1 - α}{recall(S, A)}}    (1)

In the above formulas, precision is the precision of the word alignment A and recall is its recall. In computing the F-score, the present invention chooses α = 0.5 to balance precision and recall.
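The evaluation measure can be illustrated with a short Python sketch. It assumes the usual normalization, with precision divided by the size of the evaluated alignment A and recall divided by the size of the sure set S, and represents each connection as an (English index, Chinese index) pair; the function name `alignment_fscore` and the sample links are hypothetical.

```python
def alignment_fscore(sure, predicted, alpha=0.5):
    """F-score of formula (1), with links given as (i, j) index pairs."""
    sure, predicted = set(sure), set(predicted)
    hits = len(sure & predicted)
    if hits == 0:
        return 0.0  # also covers an empty gold set or empty prediction
    precision = hits / len(predicted)
    recall = hits / len(sure)
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Illustrative data only: four sure links, three predicted links, two correct.
gold = {(1, 1), (2, 2), (3, 3), (4, 4)}
guess = {(1, 1), (2, 2), (3, 4)}
f = alignment_fscore(gold, guess)
print(round(f, 4))
```

Here precision is 2/3 and recall is 1/2, so with α = 0.5 the score is the harmonic mean of the two.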

The word alignment fusion method based on the word segmentation network has two steps: first, a connection selection algorithm based on connection confidence searches for and selects the optimal skeleton connections, that is, finds the skeleton alignment; second, the selected skeleton alignment is projected onto each segmentation to obtain conventional word alignments.

The connection selection algorithm of the present invention operates on connection confidence scores. The Chinese sentence c is segmented with each of K segmentation tools. For example, three tools may be used: the Chinese Academy of Sciences segmenter ICTCLAS (denoted I), the Stanford segmenter based on the Penn Chinese Treebank annotation standard (denoted C), and the Stanford segmenter based on the Peking University annotation standard (denoted P). Their segmentations are denoted s_k (k = 1, ..., K), where s_k = s^k_1 s^k_2 ... s^k_{J_k}, s^k_j (j = 1, ..., J_k) is the j-th word of s_k, and J_k is the number of words in s_k. The English sentence parallel to the Chinese sentence c is E = e_1 e_2 ... e_I, where e_i (i = 1, ..., I) is the i-th English word of E and I is the length of the English sentence. For each of the K sentence pairs formed by a segmentation s_k (k = 1, ..., K) and the English sentence E, a conventional word alignment model is used to obtain a word alignment result, giving K results denoted a_k (k = 1, ..., K). Next, the WSN is constructed from the K segmentations s_k (k = 1, ..., K) of the Chinese sentence by the method described above, and is denoted C = c_1 c_2 ... c_J, where c_j (j = 1, 2, ..., J) is the j-th skeleton word of C. Further, let A_{ij} be the skeleton connection between the j-th skeleton word c_j of C and the i-th English word e_i, and let a^k_{ij} be the connection between the j-th word s^k_j of segmentation s_k and e_i. The confidence score of a skeleton connection is defined as follows:

C(A_{ij} | C, E) = \sum_{k=1}^{K} w_k \cdot c(a^k_{i \delta_k(j)} | C, E)    (2)

where c(a^k_{i \delta_k(j)} | C, E) is the confidence score of the connection a^k_{i \delta_k(j)}, and a^k_{i \delta_k(j)} is the projection of the skeleton connection A_{ij} onto segmentation s_k.

Here w_k is the weight coefficient of segmentation k and can be obtained with a hill-climbing algorithm (Russell, Stuart J. & Norvig, Peter (2003), Artificial Intelligence: A Modern Approach). In this experiment, the optimization objective is the F-score on the first 250 sentence pairs of the test data. The hill-climbing algorithm can be summarized as follows: the weights are initialized at random to give the current solution; the neighborhood of the current solution is then searched, and if some neighboring solution is better than the current one, it replaces the current solution; this is repeated until no neighboring solution is better than the current one, at which point the current solution is the final solution. The present invention tries 20 different initial values and takes the final solution with the highest F-score as w_k (k = 1, ..., K).
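The hill-climbing procedure summarized above can be sketched as follows. This is only an illustration: the neighborhood (one coordinate changed by a fixed step), the restriction of weights to [0, 1], and the toy objective are assumptions; in the setting described in the text, the objective would be the F-score on the held-out sentence pairs.

```python
import random


def hill_climb(objective, dim, step=0.05, restarts=20, seed=0):
    """Maximize `objective` over weight vectors in [0, 1]^dim by greedy
    coordinate moves, restarting from several random initial values."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(restarts):
        current = [rng.random() for _ in range(dim)]
        score = objective(current)
        improved = True
        while improved:
            improved = False
            for d in range(dim):
                for delta in (-step, step):
                    neighbor = list(current)
                    neighbor[d] = min(1.0, max(0.0, neighbor[d] + delta))
                    s = objective(neighbor)
                    if s > score:  # move only to a strictly better neighbor
                        current, score, improved = neighbor, s, True
        if score > best_score:
            best, best_score = current, score
    return best, best_score


# Hypothetical stand-in objective with a known optimum at (0.3, 0.7).
target = [0.3, 0.7]


def toy_objective(w):
    return -sum((a - b) ** 2 for a, b in zip(w, target))


weights, value = hill_climb(toy_objective, dim=2)
print([round(w, 2) for w in weights])
```

The `restarts=20` default mirrors the 20 initial values mentioned in the text; the same routine could also be used to tune the threshold α by adding it as an extra coordinate.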

The confidence score of a connection is defined as follows:

c(a^k_{i \delta_k(j)} | C, E) = \sqrt{q_{c2e}(a^k_{i \delta_k(j)} | C, E) \cdot q_{e2c}(a^k_{i \delta_k(j)} | C, E)}    (3)

The posterior probability q_{c2e} of a connection in the C-E direction is defined as follows:

The posterior probability q_{e2c} in the E-C direction can be defined similarly. The probability in the above formula is the translation probability of the word s^k_{\delta_k(j)} of segmentation s_k being translated into the English word e_i; it can be obtained by training the open-source word alignment tool GIZA++ on s_k and E.
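Formulas (2) and (3) can be illustrated with a small Python sketch. It takes the two directional posteriors as given inputs and does not attempt to reproduce how they are derived from the GIZA++ translation probabilities; the function names and the numeric values are hypothetical.

```python
import math


def link_confidence(q_c2e, q_e2c):
    """Formula (3): geometric mean of the two directional posteriors."""
    return math.sqrt(q_c2e * q_e2c)


def skeleton_confidence(weights, link_scores):
    """Formula (2): weighted sum, over the K segmentations, of the confidence
    of the connection that the skeleton connection projects to."""
    if len(weights) != len(link_scores):
        raise ValueError("one weight per segmentation is required")
    return sum(w * c for w, c in zip(weights, link_scores))


# Illustrative posteriors for one skeleton connection under K = 2 segmentations.
c1 = link_confidence(0.9, 0.4)   # projected connection in segmentation 1
c2 = link_confidence(0.5, 0.5)   # projected connection in segmentation 2
score = skeleton_confidence([0.6, 0.4], [c1, c2])
print(round(score, 3))
```

The geometric mean penalizes a connection that is supported in only one translation direction, which is why a connection with posteriors 0.9 and 0.4 scores lower than its arithmetic mean would suggest.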

As can be seen, skeleton connections can be defined, and their confidence scores computed, conveniently on the linear WSN. On a WSL, by contrast, the nonlinear structure makes it difficult to define the correspondence between Chinese words and English words, and hence to improve existing word alignment algorithms. The linearity of the WSN suggests that existing word alignment techniques can readily be extended to WSN-based word alignment.

Embodiment:

All algorithms used in the present invention are implemented in C#. The machine used in the experiments has an Intel Xeon X5550 processor with a clock frequency of 2.66 GHz and 16 GB of memory. The GIZA++ word alignment toolkit used is a widely used open-source toolkit, which this laboratory compiled under Cygwin to obtain a version that runs on the Windows platform. The remaining machine translation modules were rewritten by this laboratory in C# after the open-source phrase-based statistical machine translation software Moses.

Before implementation, the data are prepared as follows: the Chinese side of the Chinese-English parallel corpus is segmented with K segmentation tools to obtain K segmentations s_k (k = 1, ..., K), and each s_k is aligned with the parallel English side by a conventional word aligner to give a_k (k = 1, ..., K).

More specifically, as shown in Fig. 5, the present invention operates as follows:

1. Obtain the initial skeleton connection set: build the word segmentation network from the segmentations s_k (k = 1, ..., K) of the Chinese sentence, and compute the confidence scores of all skeleton connections between the Chinese word segmentation network C and the English sentence E according to formula (2).

If the projection a^k_{i \delta_k(j)} of a skeleton connection A_{ij} appears in some a_k (k = 1, ..., K), the skeleton connection A_{ij} is said to receive one vote from a_k. The set of skeleton connections that have received at least one vote is denoted B₀ and serves as the initial skeleton connection set.
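The voting step can be sketched in Python as below. The data layout is an assumption made for illustration: each alignment a_k is a set of (English index, word index) pairs, each projection function is a dictionary from skeleton index to word index, and the two sample alignments are invented for the example sentence rather than taken from the experiments.

```python
def initial_skeleton_links(num_english, deltas, alignments):
    """Count votes for every skeleton connection (i, j) and keep those with
    at least one vote; the keys of the result form the initial set B0."""
    votes = {}
    num_skeleton = len(deltas[0])
    for i in range(1, num_english + 1):
        for j in range(1, num_skeleton + 1):
            # One vote from a_k if the projected connection appears in a_k.
            n = sum((i, delta[j]) in a_k
                    for delta, a_k in zip(deltas, alignments))
            if n > 0:
                votes[(i, j)] = n
    return votes


# English: 1 Road, 2 is, 3 slippery, 4 when, 5 raining.
delta1 = {1: 1, 2: 2, 3: 3, 4: 3}          # skeleton -> S1 (下/雨/路滑)
delta2 = {1: 1, 2: 1, 3: 2, 4: 3}          # skeleton -> S2 (下雨/路/滑)
a1 = {(1, 3), (3, 3), (5, 2)}              # hypothetical alignment on S1
a2 = {(1, 2), (3, 3), (5, 1)}              # hypothetical alignment on S2
b0 = initial_skeleton_links(5, [delta1, delta2], [a1, a2])
print(b0)
```

Connections whose vote count equals the number of segmentations are the unanimous ones that the algorithm notes treat as directly selected.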

2. Obtain the final skeleton connection set: sort all skeleton connections in B₀ in descending order of confidence score and examine each in turn. A skeleton connection A_{ij} is added to the final skeleton connection set only if its confidence score is higher than a threshold α (which can likewise be determined with the hill-climbing algorithm described above) and one of the following conditions holds:

A) neither the skeleton word c_j nor the English word e_i is aligned;

B) the skeleton word c_j is not aligned to any English word, and a skeleton connection formed by its left or right neighbor and the English word e_i has already been added to the final skeleton connection set;

C) the English word e_i is not aligned to any skeleton word, and a skeleton connection formed by its left or right neighbor and the skeleton word c_j has already been added to the final skeleton connection set.

The above step is repeated until no new skeleton connection can be added to the final skeleton connection set; the final set is denoted B₁.
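The selection loop of step 2 can be sketched as follows. This is one possible reading of conditions A) to C), in which "aligned" means "already part of a selected skeleton connection"; the function name, the optional `unanimous` argument for connections that received K votes, and the confidence values are assumptions made for illustration.

```python
def select_skeleton_links(confidence, alpha, unanimous=()):
    """confidence: {(i, j): score} for the connections in B0.
    Returns the final set B1 of selected (English i, skeleton j) pairs."""
    selected = set(unanimous)
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    changed = True
    while changed:                       # repeat until nothing new is added
        changed = False
        for (i, j) in ranked:
            if (i, j) in selected or confidence[(i, j)] <= alpha:
                continue
            e_free = all(i2 != i for i2, _ in selected)   # e_i unaligned
            c_free = all(j2 != j for _, j2 in selected)   # c_j unaligned
            ok = (
                (e_free and c_free)                                   # A)
                or (c_free and ((i, j - 1) in selected
                                or (i, j + 1) in selected))           # B)
                or (e_free and ((i - 1, j) in selected
                                or (i + 1, j) in selected))           # C)
            )
            if ok:
                selected.add((i, j))
                changed = True
    return selected


# Hypothetical confidence scores for six candidate connections.
conf = {(1, 3): 0.9, (3, 4): 0.85, (5, 2): 0.8,
        (5, 1): 0.5, (1, 4): 0.3, (3, 3): 0.25}
b1 = select_skeleton_links(conf, alpha=0.4)
print(sorted(b1))
```

In this example the connection (5, 1) is admitted through condition B): skeleton word 1 is still unaligned and its right neighbor, skeleton word 2, is already connected to English word 5.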

3. Obtain the final word alignment connection sets: according to the projection function, project each skeleton connection A_{ij} in B₁ onto each s_k, yielding K new alignments a'_k (k = 1, ..., K). For each a'_k, the connections in a'_k are arranged in ascending order of the confidence of the skeleton connection A_{ij} from which they were projected, and a'_k is compared with a_k connection by connection. If a new connection a^k_{i \delta_k(j)} is not in a_k and one of the following conditions is satisfied, it is deleted from a'_k:

A) the Chinese word s^k_{\delta_k(j)} and the English word e_i are both already aligned in a'_k;

B) some English word e that is not a left or right neighbor of e_i is already aligned in a'_k to the Chinese word s^k_{\delta_k(j)};

C) some Chinese word c that is not a left or right neighbor of the Chinese word s^k_{\delta_k(j)} is already aligned in a'_k to the English word e_i.

The remaining connections are taken as the final connections, that is, the fused word alignment on each segmentation.
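Step 3 can be sketched as below. Several details are assumptions rather than statements of the source: "already aligned" is read as "aligned by some other connection of a'_k", a non-neighbor is a word whose index differs by more than one, and when several skeleton connections project onto the same connection the highest confidence is kept for ordering. The sample data continue the hypothetical example used for the sentence 下雨路滑.

```python
def project_and_prune(b1_conf, delta, a_k):
    """b1_conf: {(i, j): skeleton confidence} for the connections in B1.
    delta: {j: delta_k(j)}; a_k: original alignment, a set of (i, word) pairs.
    Returns the fused alignment on segmentation s_k."""
    projected = {}
    for (i, j), score in b1_conf.items():
        link = (i, delta[j])
        # Identical projections are merged so that only one copy is kept.
        projected[link] = max(score, projected.get(link, score))
    result = set(projected)
    for link in sorted(projected, key=projected.get):   # ascending confidence
        if link in a_k:
            continue                     # only new connections can be deleted
        i, w = link
        others = result - {link}
        e_aligned = any(i2 == i for i2, _ in others)
        c_aligned = any(w2 == w for _, w2 in others)
        far_e = any(w2 == w and abs(i2 - i) > 1 for i2, w2 in others)
        far_c = any(i2 == i and abs(w2 - w) > 1 for i2, w2 in others)
        if (e_aligned and c_aligned) or far_e or far_c:
            result.discard(link)
    return result


b1_conf = {(1, 3): 0.9, (3, 4): 0.85, (5, 2): 0.8, (5, 1): 0.5}
fused1 = project_and_prune(b1_conf, {1: 1, 2: 2, 3: 3, 4: 3},
                           {(1, 3), (3, 3), (5, 2)})   # S1: 下/雨/路滑
fused2 = project_and_prune(b1_conf, {1: 1, 2: 1, 3: 2, 4: 3},
                           {(1, 2), (3, 3), (5, 1)})   # S2: 下雨/路/滑
print(sorted(fused1), sorted(fused2))
```

On S1 the fusion adds the connection between "raining" and 下, which the hypothetical single-segmentation alignment lacked; on S2 the two skeleton connections for "raining" collapse into one.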

Notes on the algorithm:

A) in step 1, if a skeleton connection is found to have received K votes, it is added directly to the final connection set without further checks;

B) in step 3, if two or more skeleton connections in B₁ project onto the same connection on some s_k, only one copy of that connection is kept;

C) the conditions in step 2 serve to remove potentially erroneous skeleton connections; step 3 uses a similar approach to remove potentially erroneous connections.

To verify its effectiveness, the present invention was evaluated in two groups of experiments. The first group tests whether the invention effectively improves word alignment quality; the second tests whether it effectively improves the performance of a machine translation system.

The experimental data are prepared as follows. Bilingual parallel sentence pairs from LDC2003E14, about 190,000 pairs, are used as the training set. The NIST '06 set is used as the development set of the machine translation system, for estimating the weights of the system's features. The NIST '08 set is used as the test set of the machine translation system, for evaluating system performance. The Chinese side of all these corpora is processed with three segmentation tools: the Chinese Academy of Sciences segmenter ICTCLAS (denoted I), the Stanford segmenter based on the Penn Chinese Treebank annotation standard (denoted C), and the Stanford segmenter based on the Peking University annotation standard (denoted P). The machine translation system is this laboratory's own implementation of a phrase-based system similar to the one proposed by Koehn in 2003. It uses a 5-gram language model trained on the Xinhua portion of the Gigaword corpus, and its parameters are tuned with the minimum error rate training method proposed by Och in 2003. Two baselines are used for the word alignment fusion. The first is the GIZA++ word alignment tool: the word alignments of the two directions are obtained with this tool and then merged with the GDF heuristic; this baseline is abbreviated GIZA. The second is the linear discriminative word alignment model proposed by Liu Yang, abbreviated DIWA. To evaluate word alignment performance, the test data described earlier are used: the first 250 pairs are used to train the weights w_k (k = 1, ..., K) in formula (2) and the threshold α of the connection selection algorithm, and the remaining 241 pairs are used to evaluate the word alignment results. The Chinese side of these 491 pairs is segmented with C.

In the first group of experiments, the improvement in word alignment quality is evaluated on these 241 pairs. In Table 2, GIZA and DIWA indicate that the fused word alignments come from the GIZA and DIWA models respectively, and P, R and F denote the precision, recall and F-score of the word alignment. The F-score is normally taken as the measure of final word alignment quality; P and R are for reference only. Four fusion settings are used: setting C means no fusion, that is, the conventional word alignment method based on segmentation C; setting C+P means fusing the word alignments based on segmentations C and P; and so on.

As can be seen, the method of the invention significantly improves the F-score of the word alignment in both the GIZA group and the DIWA group. For the GIZA group, the F-score drops slightly under the C+I+P setting. This is related to the GIZA model's own bias toward recall: for a model with high recall, excessive fusion harms precision (69.68%). For the DIWA group, however, the more segmentations are fused, the better the word alignment. This is related to the DIWA model's own bias toward precision: the fusion method effectively raises recall and hence the F-score.

Table 2. Word alignment experimental results

In the second group of experiments, the performance of the machine translation system is evaluated on the NIST '08 test set, with the BLEU score as the evaluation metric. Here B denotes the baseline, and Comb denotes the result after C+P+I fusion.

Table 3. Machine translation experimental results

It can be seen that the present invention significantly improves the performance of the machine translation system in both the GIZA group and the DIWA group.

The present invention provides an approach to a word alignment fusion method based on a word segmentation network (participle net) for computer-aided Chinese-to-English translation. There are many methods and ways of implementing this technical solution, and the above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make a number of improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention. Any component not explicitly specified in this embodiment can be implemented with the prior art.

Claims

1. A Chinese-English word alignment fusion method based on a word segmentation network (participle net) for computer-aided Chinese-to-English translation, characterized in that it comprises the following steps:

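Step 1, determining the skeleton alignment: using a connection selection algorithm based on connection confidence to search for and select the optimal skeleton connections, which form the skeleton alignment;

Step 2, projecting the selected skeleton alignment onto each segmentation to obtain the word alignment based on each segmentation.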

2. The Chinese-English word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation according to claim 1, characterized in that step 1 comprises the following steps:

For a Chinese sentence c with K segmentations s_k (k = 1, ..., K), j indexes the words of segmentation s_k and J_k is the number of words in segmentation s_k; the English sentence parallel to the Chinese sentence c is E = e_1 e_2 ... e_I, where e_i is the i-th English word;

For the K sentence pairs formed by each of the K segmentations s_k with the English sentence E, a traditional word alignment model based on a single segmentation is applied to obtain K word alignment results, denoted a_k (k = 1, ..., K);

From the K segmentations s_k of the Chinese sentence, a segmentation network is constructed, denoted C, C = c_1, c_2, ..., c_J, where c_j (j = 1, 2, ..., J) is the j-th skeleton word in the segmentation network C; A_{ij} is the skeleton connection between the j-th skeleton word c_j in the segmentation network C and the i-th English word e_i; a^{k}_{ij} is the connection between the j-th word of segmentation s_k and the English word e_i;

C(A_{ij} \mid C, E) = \sum_{k=1}^{K} w_k \cdot c(a^{k}_{i\delta_k(j)} \mid C, E)

where C(A_{ij} \mid C, E) is the confidence score of the skeleton connection A_{ij}; a^{k}_{i\delta_k(j)} is the connection obtained by projecting the skeleton connection A_{ij} onto segmentation s_k; w_k is the weight coefficient of segmentation k; and K is the total number of segmentations;

All skeleton connections A_{ij} in B_0 are sorted in descending order of confidence score;

Each skeleton connection A_{ij} is examined in turn, and a skeleton connection A_{ij} that satisfies the following conditions is selected into the final skeleton connection set:

(1) the confidence score is higher than the threshold a; and, at the same time, one of the following conditions is met:

neither the skeleton word c_j nor the English word e_i has been aligned; or, the skeleton word c_j is not aligned to any English word, and the skeleton connection formed by its left or right neighbor with the English word e_i has already been selected into the final skeleton connection set; or, the English word e_i is not aligned to any skeleton word, and the skeleton connection formed by its left or right neighbor with the skeleton word c_j has already been selected into the final skeleton connection set;

This step is repeated until no new skeleton connection can be selected into the final skeleton connection set; the final skeleton connection set is denoted B_1.

3. The Chinese-English word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation according to claim 2, characterized in that, in step 1, if a skeleton connection is found to have obtained K votes, it is selected directly into the final skeleton connection set.
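By way of illustration only, the confidence score and the selection procedure of claims 2 and 3 may be sketched in Python as follows. Several details are assumptions rather than part of the claims: the per-segmentation confidence c(.) is taken to be a 0/1 vote (1 if the alignment a_k contains the projected connection), the candidate set B_0 is passed in directly, each projection function delta_k is represented as a dictionary from skeleton index to word index, and ties in confidence are broken by index. All function and variable names are hypothetical.

```python
def confidence(i, j, alignments, delta, weights):
    """Weighted vote C(A_ij | C, E) = sum_k w_k * c(a^k_{i, delta_k(j)} | C, E).

    Assumption: c(.) is 1 if a_k contains the projected link, else 0.
    """
    return sum(w for a_k, d_k, w in zip(alignments, delta, weights)
               if (i, d_k[j]) in a_k)


def select_skeleton(candidates, alignments, delta, weights, threshold):
    """Greedily pick skeleton links (i, j) from the candidate set B_0."""
    score = {(i, j): confidence(i, j, alignments, delta, weights)
             for (i, j) in candidates}
    # Links supported by all K alignments are accepted directly (claim 3).
    selected = {(i, j) for (i, j) in candidates
                if all((i, d_k[j]) in a_k
                       for a_k, d_k in zip(alignments, delta))}
    # Descending confidence; ties broken by index for a deterministic result.
    ordered = sorted(candidates, key=lambda link: (-score[link], link))
    changed = True
    while changed:  # repeat until no new link can be added
        changed = False
        for (i, j) in ordered:
            if (i, j) in selected or score[(i, j)] <= threshold:
                continue
            e_free = all(i2 != i for (i2, _) in selected)
            c_free = all(j2 != j for (_, j2) in selected)
            neighbour_of_c = (i, j - 1) in selected or (i, j + 1) in selected
            neighbour_of_e = (i - 1, j) in selected or (i + 1, j) in selected
            if ((e_free and c_free) or (c_free and neighbour_of_c)
                    or (e_free and neighbour_of_e)):
                selected.add((i, j))
                changed = True
    return selected


# Two segmentations of a sentence with three skeleton words; the second
# segmentation merges skeleton words 1 and 2 into a single word.
delta = [{0: 0, 1: 1, 2: 2}, {0: 0, 1: 1, 2: 1}]
alignments = [{(0, 0), (1, 1), (2, 2)}, {(0, 0), (1, 1)}]
candidates = {(0, 0), (1, 1), (1, 2), (2, 2)}
skeleton = select_skeleton(candidates, alignments, delta, [0.5, 0.5], 0.4)
```

In this toy example the connections (0, 0) and (1, 1) receive both votes and are accepted at once, while (1, 2) and (2, 2) each receive one vote and are added only because their confidence exceeds the threshold and a neighboring connection is already in the set.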

4. The Chinese-English word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation according to claim 2 or 3, characterized in that step 2 comprises the following steps:

According to the projection function, each skeleton connection in the skeleton connection set B_1 is projected onto each segmentation s_k, giving K new word alignment results a'_k;

For each word alignment result a'_k (k = 1, ..., K), the connections in the word alignment result a'_k are arranged in ascending order of the confidence of the skeleton connection A_{ij} from which each was projected, and the word alignment result a'_k is compared with the word alignment result a_k connection by connection; a new connection is deleted from the word alignment result a'_k in the following cases:

the j-th word of segmentation s_k

or, there is already one that is not the j-th Chinese word of segmentation s_k

The connections remaining in the alignment result a'_k are taken as the final alignment result a'_k, which is the fused word alignment result on segmentation s_k.

5. The word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation according to claim 4, characterized in that, in step 2, if two or more skeleton connections in the skeleton connection set B_1 yield the same connection when projected onto some segmentation s_k, only one connection is kept.
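By way of illustration only, the projection of step 2 and the duplicate handling of claim 5 may be sketched in Python as follows. The sketch assumes that the projection function of a segmentation is given as a dictionary `delta_k` from skeleton word index to word index in that segmentation; the name `project_skeleton` and the sample data are hypothetical, and the subsequent deletion of new connections described in claim 4 is not covered.

```python
def project_skeleton(selected, delta_k):
    """Project skeleton links (i, j) onto one segmentation: (i, delta_k(j)).

    Building a set keeps a single copy when several skeleton links land on
    the same projected link (claim 5).
    """
    return {(i, delta_k[j]) for (i, j) in selected}


skeleton = {(0, 0), (1, 1), (1, 2), (2, 2)}
delta_2 = {0: 0, 1: 1, 2: 1}  # skeleton words 1 and 2 form one word here
fused = project_skeleton(skeleton, delta_2)
```

Here the skeleton connections (1, 1) and (1, 2) both project onto the connection (1, 1) of the second segmentation, so the projected alignment contains three connections rather than four.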