CN102193915B

CN102193915B - Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation

Info

Publication number: CN102193915B
Application number: CN2011101486920A
Authority: CN
Inventors: 奚宁; 李博渊; 汤光超; 赵迎功; 陈家骏; 戴新宇; 张建兵
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2011-06-03
Filing date: 2011-06-03
Publication date: 2012-11-28
Anticipated expiration: 2031-06-03
Also published as: CN102193915A

Abstract

The invention provides a word alignment fusion method, based on a word segmentation network, for computer-aided Chinese-to-English translation. The method comprises two steps: (1) determining the skeleton alignment, in which a connection selection algorithm based on connection confidence searches for and selects the optimal skeleton connections, which together form the skeleton alignment; and (2) projecting the selected skeleton alignment onto each word segmentation to obtain a word alignment for each segmentation. The method improves on conventional word alignment algorithms that rely on a single segmentation, and can improve both the word alignment quality under each segmentation and the machine translation quality. By fusing the word alignment features available under multiple segmentations, the final word alignment becomes more robust, and the number of alignment errors caused by segmentation errors or by bilingual segmentation inconsistency is reduced.

Description

A word alignment fusion method, based on a word segmentation network, for computer Chinese-to-English translation

Technical field

The present invention relates to the field of computer software for language translation, and in particular to a word alignment fusion method, based on a word segmentation network, for computer Chinese-to-English translation.

Background technology

With the rapid growth of information and the increasing frequency of international exchange, computer network technology has spread and developed quickly, language barriers have become more apparent and more serious, and the potential demand for machine translation keeps growing. Machine translation uses a computer to translate between different languages. The language being translated is called the source language and the language translated into is called the target language; machine translation is the process of converting from the source language to the target language. In recent years, statistical machine translation methods based on large-scale corpora have made a series of significant advances. Statistical machine translation uses statistical methods to learn a large number of bilingual translation rules and features from a large-scale bilingual parallel corpus, then uses these rules and features to decode (translate) a source-language sentence, searching for the target-language sentence with the highest probability as the translation. In this pipeline, bilingual word alignment is the step that precedes translation rule extraction. Word alignment finds the word-to-word correspondences within a bilingual parallel sentence pair. The quality of the word alignment directly affects the quality of the extracted translation rules, and hence the final performance of the machine translation system. If one or both of the languages require word segmentation (as Chinese does), the common practice is to segment the corpus with some word segmentation tool before word alignment. Such a tool is normally trained on a monolingual segmentation corpus or a monolingual dictionary. Current mainstream segmentation tools perform well on the monolingual segmentation task, but a tool designed for a monolingual task does not necessarily meet the needs of bilingual word alignment; that is, its segmentation may not be optimal for word alignment. The present invention calls this phenomenon bilingual segmentation inconsistency.

Existing methods for addressing bilingual segmentation inconsistency fall roughly into two categories. The first directly optimizes a segmentation for word alignment. This optimization is usually time-consuming and complicated, and these methods need an initial word alignment result from which to train a model, yet that initial alignment is itself not very reliable. The second uses different segmentations to obtain different sets of translation rules and then merges these rule sets by some means at the decoding (translation) stage. Methods of this kind do not aim to improve the word alignment quality of the individual segmentations.

The task of word alignment is to find the word-to-word correspondences in a bilingual sentence pair. Fig. 1 shows the correct word alignment of the Chinese-English sentence pair "下雨路滑 / Road is slippery when raining", where the Chinese sentence is segmented as "下/雨/路滑" ("/" marks a word boundary).

Under the Chinese segmentation shown in Fig. 1, "路滑" must align to the two words "Road" and "slippery", and the two words "下" and "雨" must both align to "raining" to form a correct word alignment. Under this segmentation, such one-to-many and many-to-one alignment patterns make the segmentation bilingually inconsistent and increase the difficulty of word alignment. If, instead, the Chinese sentence is segmented as "下雨/路/滑", a more natural one-to-one alignment pattern arises, and aligning this sentence pair becomes relatively easy.

Table 1 shows the segmentations of "下雨路滑" produced by three segmentation tools. Only the Stanford Segmenter with the PKU standard produces a segmentation that is favorable for word alignment; the others are either bilingually inconsistent (the word-frequency-based segmentation) or wrong (the Stanford Segmenter with the CTB standard). At present, however, there is no effective method for quickly choosing, for each sentence pair in a word alignment corpus, a Chinese segmentation that is favorable for word alignment.

Table 1: Segmentation examples from three word segmentation tools

Word segmentation tool	Segmentation
Word-frequency-based segmentation	下雨/路滑
Stanford Segmenter with PKU standard	下雨/路/滑
Stanford Segmenter with CTB standard	下雨路/滑

Summary of the invention

Object of the invention: the technical problem to be solved by the present invention is to address the deficiencies of the prior art by providing a word alignment fusion method, based on a word segmentation network, for computer Chinese-to-English translation.

To solve the above technical problem, the invention discloses a word alignment fusion method, based on a word segmentation network, for computer Chinese-to-English translation, comprising the following steps:

Step 1, determine the skeleton alignment: use a connection selection algorithm based on connection confidence to search for and select the optimal skeleton connections, which form the skeleton alignment;

Step 2, project the selected skeleton alignment onto each segmentation to obtain a word alignment for each segmentation.

Step 1 of the present invention comprises the following steps:

Segment the Chinese sentence c with each of K word segmentation tools, and denote the resulting segmentations s_k (k=1, ..., K), where s_k = c_1^k c_2^k ... c_{J_k}^k, c_j^k is the j-th word of segmentation s_k, and J_k is the number of words in s_k. The English sentence parallel to the Chinese sentence c is E = e_1 e_2 ... e_I, where e_i is the i-th English word of E and I is the total number of words in the English sentence;

For each of the K sentence pairs formed by a segmentation s_k and the English sentence E, use a word alignment model to obtain a word alignment result, giving K results denoted a_k (k=1, ..., K);

Construct the word segmentation network from the K segmentations s_k of the Chinese sentence. The network is denoted C = c_1, c_2, ..., c_J, where c_j (j=1, 2, ..., J) is the j-th skeleton word of the network C. The link between the j-th skeleton word c_j of C and the i-th English word e_i is the skeleton connection A_{ij}; the link between the j-th word c_j^k of segmentation s_k and the English word e_i is the connection a_{ij}^k;

The confidence of a skeleton connection is calculated with the following formula:

C (A_{ij} | C, E) = Σ_{k = 1}^{K} w_{k} * c (a_{i δ_{k} (j)}^{k} | C, E);

where c (a_{i δ_{k} (j)}^{k} | C, E) is the confidence score of the connection a_{i δ_{k} (j)}^{k}, and a_{i δ_{k} (j)}^{k} is the projection of the skeleton connection A_{ij} onto segmentation s_k. w_k is the weight coefficient of segmentation k and can be obtained with a hill-climbing algorithm whose objective is to maximize the F-score on word-alignment-annotated data under a given segmentation k.

A skeleton connection A_{ij} receives one vote from a_k when its projection onto s_k appears in a_k. The set of skeleton connections that have received at least one vote is denoted B_0 and serves as the initial skeleton connection set;

Sort all skeleton connections in B_0 in descending order of confidence score;

Examine each skeleton connection in turn. A skeleton connection A_{ij} is added to the final skeleton connection set if it satisfies the following conditions:

(1) its confidence score is higher than a threshold α (the threshold can likewise be determined by the hill-climbing algorithm); and (2) at the same time, one of the following holds:

the skeleton word c_j and the English word e_i are both unaligned; or the skeleton word c_j is not aligned to any English word, and the skeleton connection formed by its left or right neighbor and the English word e_i has already been added to the final skeleton connection set; or the English word e_i is not aligned to any skeleton word, and the skeleton connection formed by its left or right neighbor and the skeleton word c_j has already been added to the final skeleton connection set;

This is repeated until no new skeleton connection can be added to the final skeleton connection set, which is denoted B_1.

In Step 1, any skeleton connection that has received K votes is selected directly into the final connection set.
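The selection procedure of Step 1 can be sketched in Python as follows. This is an illustrative sketch only, not the implementation of the invention (which the description says is written in C#). The function name, the representation of a skeleton connection as a 1-based pair (i, j) of English index and skeleton index, and the reading of "aligned" as "appearing in a connection already selected" are assumptions made for the example.

```python
def select_skeleton_links(candidates, scores, votes, num_segmentations, threshold):
    """Pick the final skeleton connection set B1 from the initial set B0.

    candidates: set of (i, j) pairs with at least one vote (B0)
    scores:     dict (i, j) -> confidence score
    votes:      dict (i, j) -> number of votes received
    """
    # Connections voted for by every segmentation are accepted outright.
    selected = {link for link in candidates if votes[link] == num_segmentations}
    ranked = sorted((l for l in candidates if l not in selected),
                    key=lambda l: scores[l], reverse=True)
    changed = True
    while changed:  # repeat until no new connection can be added
        changed = False
        for i, j in ranked:
            if (i, j) in selected or scores[(i, j)] <= threshold:
                continue
            en_free = all(a != i for a, _ in selected)
            zh_free = all(b != j for _, b in selected)
            # c_j unaligned and a neighbouring skeleton word already links to e_i
            grow_zh = zh_free and ((i, j - 1) in selected or (i, j + 1) in selected)
            # e_i unaligned and a neighbouring English word already links to c_j
            grow_en = en_free and ((i - 1, j) in selected or (i + 1, j) in selected)
            if (en_free and zh_free) or grow_zh or grow_en:
                selected.add((i, j))
                changed = True
    return selected


# Toy data (invented for illustration): English words 1..5, skeleton words 1..4.
toy_scores = {(5, 2): 0.9, (1, 3): 0.7, (3, 4): 0.6, (5, 1): 0.8, (4, 3): 0.3, (4, 1): 0.1}
toy_votes = {(5, 2): 2, (1, 3): 2, (3, 4): 2, (5, 1): 1, (4, 3): 1, (4, 1): 1}
b1 = select_skeleton_links(set(toy_scores), toy_scores, toy_votes, 2, 0.2)
```

In this toy run the connection (5, 1) is added because skeleton word 1 is unaligned and its right neighbor already links to English word 5, whereas (4, 3) and (4, 1) are rejected.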

Step 2 of the present invention comprises the following steps:

According to the projection function, project every skeleton connection in the skeleton connection set B_1 onto each segmentation s_k, obtaining K new word alignment results a'_k, that is,

a_{k}^{'} = {a_{i δ_{k} (j)}^{k} | A_{ij} ∈ B_{1}}, (k = 1, ..., K);

For each word alignment result a'_k, sort the connections in a'_k in ascending order of the confidence of the skeleton connection A_{ij} from which each was projected, and compare a'_k with the word alignment result a_k connection by connection. If a new connection a_{i δ_{k} (j)}^{k} is not in a_k and satisfies any of the following conditions, delete it from a'_k:

the j-th word of segmentation s_k and the English word e_i are both already aligned in the word alignment result; or an English word e that is not a left or right neighbor of e_i is already aligned, in the word alignment result, to the j-th word of s_k; or a Chinese word c that is not a left or right neighbor of the j-th Chinese word of s_k is already aligned, in the word alignment result, to the English word e_i.

The connections remaining in a'_k form the final alignment result, which is the fused word alignment on segmentation s_k.

In Step 2, if two or more skeleton connections in B_1 project onto the same connection on some segmentation s_k, only one copy of that connection is kept.
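The projection and pruning of Step 2 can be illustrated with the following Python sketch. It is a hedged example rather than the invention's own code. Several points are assumptions made here: the function and parameter names; that the deletion conditions are checked against the projected alignment with the connection under test excluded; and that, when several skeleton connections collapse onto one projected connection, the highest confidence among them is kept for sorting.

```python
def project_and_prune(skeleton_links, scores, delta_k, original_links):
    """Project B1 onto one segmentation s_k and prune doubtful new connections.

    skeleton_links: set of (i, j) skeleton connections (B1), 1-based indices
    scores:         dict (i, j) -> confidence of the skeleton connection
    delta_k:        dict skeleton index j -> index of the covering word in s_k
    original_links: set of (i, n) connections of the single-segmentation alignment a_k
    """
    # Project; identical projected connections are kept only once.
    projected = {}
    for i, j in skeleton_links:
        link = (i, delta_k[j])
        projected[link] = max(projected.get(link, float("-inf")), scores[(i, j)])
    result = set(projected)
    for i, n in sorted(projected, key=projected.get):  # ascending confidence
        if (i, n) in original_links:
            continue  # only connections that are new relative to a_k are tested
        others = result - {(i, n)}
        en_aligned = any(a == i for a, _ in others)
        zh_aligned = any(b == n for _, b in others)
        far_en = any(b == n and abs(a - i) > 1 for a, b in others)
        far_zh = any(a == i and abs(b - n) > 1 for a, b in others)
        if (en_aligned and zh_aligned) or far_en or far_zh:
            result.discard((i, n))
    return result


# Toy data (invented): projection onto a three-word segmentation.
toy_delta = {1: 1, 2: 1, 3: 2, 4: 3}
toy_b1 = {(5, 1), (5, 2), (1, 3), (3, 4), (1, 4)}
toy_conf = {(5, 1): 0.8, (5, 2): 0.9, (1, 3): 0.7, (3, 4): 0.6, (1, 4): 0.1}
fused = project_and_prune(toy_b1, toy_conf, toy_delta, {(5, 1), (1, 2)})
```

Here the low-confidence skeleton connection (1, 4) projects to (1, 3), which is new relative to a_k and whose two words are both already aligned elsewhere, so it is deleted.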

Beneficial effects: when one language of a Chinese-English pair must be segmented before word alignment, the present invention can effectively fuse the multiple segmentations produced by different segmentation tools into a linear structure. The invention uses the features contained in the different segmentations to fuse word alignments, so that the word alignment quality of every segmentation improves, which in turn improves the performance of computer translation software.

The present invention improves on existing word alignment algorithms based on a single segmentation and can improve both the word alignment quality and the machine translation quality under each segmentation. By fusing the word alignment features available under multiple segmentations, it makes the word alignment process more robust and reduces the number of alignment errors caused by segmentation errors or bilingual segmentation inconsistency.

Description of drawings

The present invention is described in further detail below with reference to the accompanying drawings and embodiments, from which the above and other advantages of the invention will become apparent.

Fig. 1 is a schematic diagram of a word alignment.

Fig. 2 is an example of a word segmentation lattice (WSL).

Fig. 3 is an example of a word segmentation network (WSN).

Fig. 4a and Fig. 4b are examples of a skeleton connection and a skeleton alignment, respectively, between the WSN and an English sentence.

Fig. 5 is a flow chart of the method of the invention.

Embodiment

The present invention proposes a structure for fusing multiple segmentations, called the word segmentation network (WSN), and on that basis a word alignment fusion method based on the WSN, to alleviate the word alignment problems caused by bilingual segmentation inconsistency. In the prior art, natural language processing tasks usually fuse multiple segmentations with a word segmentation lattice (WSL). For two segmentations of "下雨路滑", S_1: "下/雨/路滑" and S_2: "下雨/路/滑", Fig. 2 and Fig. 3 show the corresponding word segmentation lattice and word segmentation network, respectively.

The first and second rows of the WSN represent segmentations S_1 and S_2, respectively. The third row is a further segmentation, distinct from S_1 and S_2, which the present invention calls the skeleton segmentation.

The skeleton segmentation is the segmentation whose word boundaries are the union of the word boundaries of S_1 and S_2; that is, its set of segmentation points is the set of segmentation points found in any of the segmentations. For example, "下/雨/路/滑" in Fig. 3 is a skeleton segmentation (the third row of Fig. 3). In S_1 there is a word boundary between "下" and "雨", so there is also a word boundary between "下" and "雨" in the skeleton segmentation. Likewise, in S_2 there is a word boundary between "路" and "滑", so "路" and "滑" are two separate skeleton words in the skeleton segmentation.

A skeleton word is any word of the skeleton segmentation. For example, the skeleton segmentation in Fig. 3 has four skeleton words.

Each column of the WSN consists of one skeleton word together with the words of S_1 and S_2 that cover that skeleton word at the corresponding position. The WSN of Fig. 3 has four columns. The number of columns of the WSN thus equals the number of words in the skeleton segmentation.

Note that some non-skeleton words may cover several columns. For example, "路滑" in S_1 covers two columns because it is split into two words in S_2, and "下雨" in S_2 covers two columns because it is split into two words in S_1.

The present invention indexes the words in each row of the WSN (including the skeleton segmentation), with indices starting from 1. The j-th skeleton word is defined to project onto the δ_k(j)-th word of segmentation s_k if and only if the δ_k(j)-th word of s_k lies in the same column as that skeleton word. For example, δ_1(4)=3 and δ_2(3)=2.
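The construction of the skeleton segmentation and of the projection function δ_k can be made concrete with a short Python sketch. It is offered as an illustration under stated assumptions, not as the invention's implementation: the function names are invented, segmentations are represented as lists of words, and the example sentence is assumed to be 下雨路滑 ("Road is slippery when raining") with S_1 = 下/雨/路滑 and S_2 = 下雨/路/滑.

```python
from itertools import accumulate


def boundaries(seg):
    """Character offsets at which each word of a segmentation ends."""
    return list(accumulate(len(w) for w in seg))


def build_wsn(segmentations):
    """Return (skeleton_words, delta) for several segmentations of one sentence.

    delta[k][j] is the 1-based index of the word in segmentation k that
    covers the j-th (1-based) skeleton word.
    """
    sentence = "".join(segmentations[0])
    assert all("".join(s) == sentence for s in segmentations)
    # Skeleton boundaries are the union of all word boundaries.
    cuts = sorted(set().union(*(boundaries(s) for s in segmentations)))
    starts = [0] + cuts[:-1]
    skeleton = [sentence[a:b] for a, b in zip(starts, cuts)]
    delta = []
    for seg in segmentations:
        ends = boundaries(seg)
        # The covering word is the first one whose end reaches the skeleton word's end.
        delta.append({j: next(n for n, e in enumerate(ends, start=1) if e >= cut)
                      for j, cut in enumerate(cuts, start=1)})
    return skeleton, delta


s1 = ["下", "雨", "路滑"]
s2 = ["下雨", "路", "滑"]
skeleton, delta = build_wsn([s1, s2])
```

With these inputs the skeleton has four words, and delta[0][4] and delta[1][3] correspond to the values δ_1(4)=3 and δ_2(3)=2 given in the text.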

Next, the skeleton connection and the skeleton alignment between the WSN and an English sentence are defined.

A skeleton connection represents a translation correspondence between a skeleton word of the WSN and an English word.

A skeleton alignment is a set of skeleton connections.

Fig. 4a and Fig. 4b are examples of a correct skeleton connection and a correct skeleton alignment between the above WSN and the English sentence "Road is slippery when raining". Fig. 4a shows a single skeleton connection, and Fig. 4b shows the skeleton alignment formed by four skeleton connections.

The present invention fuses word alignments by using a connection selection algorithm based on connection confidence to select the optimal skeleton connections, thereby obtaining the final skeleton alignment. By the projection function described above, any skeleton word can be projected onto a word of S_1 or S_2, so any skeleton connection can be converted into conventional connections between words of S_1 and S_2 and English words. For example, under the projection function the skeleton word "路" in Fig. 4 maps to "路滑" in S_1, so the skeleton connection in Fig. 4a maps to the connection from "路滑" to "Road" in S_1 and to the connection from "路" to "Road" in S_2.

To evaluate the improvement in word alignment performance, the present invention uses 491 Chinese-English sentence pairs with manually annotated word alignments as its test set. The Chinese side of the test set is segmented with the Stanford segmentation tool based on the Penn Chinese Treebank (CTB) annotation standard. The word alignment connections in the test set are divided into two types: sure connections, denoted S, and possible connections, denoted P. If the word alignment to be evaluated is A, its F-score is computed with the following formulas:

precision (S, A) = \frac{| A \cap S |}{| A |}

recall (S, A) = \frac{| A \cap S |}{| S |}

Fscore (S, α, A) = \frac{1}{\frac{α}{precision (S, A)} + \frac{1 - α}{recall (S, A)}}  (1)

In the above formulas, precision is the precision of the word alignment A and recall is the recall of A. In computing the F-score, the present invention sets α=0.5 to balance precision and recall.
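As a worked illustration of formula (1), the following Python sketch computes the F-score of a predicted alignment against the sure connections. The function name and the representation of connections as index pairs are assumptions for the example; precision is taken here in the usual sense (correct connections divided by the size of A) and recall as correct connections divided by the size of S. With α=0.5 the F-score is symmetric in the two quantities.

```python
def alignment_fscore(sure, predicted, alpha=0.5):
    """F-score of a predicted alignment A against the sure connections S."""
    sure, predicted = set(sure), set(predicted)
    hits = len(sure & predicted)
    if hits == 0:
        return 0.0  # avoids division by zero when nothing matches
    precision = hits / len(predicted)
    recall = hits / len(sure)
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Toy data (invented): four sure connections, three predicted, two of them correct.
gold = {(1, 1), (2, 2), (3, 3), (4, 4)}
guess = {(1, 1), (2, 2), (3, 4)}
f = alignment_fscore(gold, guess)
```

For this toy input, precision is 2/3 and recall is 1/2, so the F-score is 1 / (0.75 + 1) = 4/7.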

The word alignment fusion method based on the word segmentation network has two steps. First, a connection selection algorithm based on connection confidence searches for and selects the optimal skeleton connections, that is, finds the skeleton alignment. Second, the selected skeleton alignment is projected onto each segmentation to obtain conventional word alignments.

The connection selection algorithm of the present invention is driven by connection confidence scores. The Chinese sentence c is segmented with each of K segmentation tools. For example, three tools can be used: the Chinese Academy of Sciences segmenter ICTCLAS (denoted I), the Stanford segmenter based on the Penn Chinese Treebank annotation standard (denoted C), and the Stanford segmenter based on the Peking University annotation standard (denoted P). Their segmentations are denoted s_k (k=1, ..., K), where s_k = c_1^k c_2^k ... c_{J_k}^k, c_j^k is the j-th word of s_k, and J_k is the number of words in s_k. The English sentence parallel to c is E = e_1 e_2 ... e_I, where e_i is the i-th English word of E and I is the length of the English sentence. For each of the K sentence pairs formed by a segmentation s_k (k=1, ..., K) and the English sentence E, a conventional word alignment model is used to obtain a word alignment result, giving K results denoted a_k (k=1, ..., K). Next, the WSN is constructed from the K segmentations s_k (k=1, ..., K) of the Chinese sentence by the method described above and is denoted C = c_1 c_2 ... c_J, where c_j (j=1, 2, ..., J) is the j-th skeleton word of C. Further, let A_{ij} be the skeleton connection between the j-th skeleton word c_j of C and the i-th English word e_i, and let a_{ij}^k be the connection between the j-th word of segmentation s_k (that is, c_j^k) and e_i. The confidence score of a skeleton connection is defined as follows:

C (A_{ij} | C, E) = Σ_{k = 1}^{K} w_{k} * c (a_{i δ_{k} (j)}^{k} | C, E)  (2)

where c (a_{i δ_{k} (j)}^{k} | C, E) is the confidence score of the connection a_{i δ_{k} (j)}^{k}, and a_{i δ_{k} (j)}^{k} is the projection of the skeleton connection A_{ij} onto segmentation s_k.

Here w_k is the weight coefficient of segmentation k, which can be obtained with a hill-climbing algorithm (Russell, Stuart J. & Norvig, Peter (2003), Artificial Intelligence: A Modern Approach). In this experiment, the optimization objective is the F-score on the first 250 sentence pairs of the test corpus. The hill-climbing algorithm can be summarized as follows: set the weights to random initial values as the current solution; search among the neighbors of the current solution; if some neighbor is better than the current solution, replace the current solution with it; repeat until no neighbor is better than the current solution, at which point the current solution is taken as the final solution. The present invention tried 20 different initial values and chose the solution with the highest F-score as the final w_k (k=1, ..., K).
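The hill-climbing search summarized above might look like the following Python sketch. It is a generic illustration, not the invention's code: the function name, the fixed step size, the axis-aligned neighborhood and the fixed random seed are all assumptions, and the toy objective in the usage lines stands in for the F-score on held-out sentence pairs.

```python
import random


def hill_climb(objective, dim, restarts=20, step=0.05, seed=0):
    """Maximise `objective` over a dim-dimensional weight vector.

    Each restart begins at a random point and moves to a better neighbour
    (one coordinate changed by +/- step) until no neighbour improves.
    """
    rng = random.Random(seed)
    best_point, best_value = None, float("-inf")
    for _ in range(restarts):
        current = [rng.random() for _ in range(dim)]
        current_value = objective(current)
        improved = True
        while improved:
            improved = False
            for d in range(dim):
                for change in (step, -step):
                    neighbour = list(current)
                    neighbour[d] += change
                    value = objective(neighbour)
                    if value > current_value:
                        current, current_value = neighbour, value
                        improved = True
        # Keep the best local optimum over all restarts.
        if current_value > best_value:
            best_point, best_value = current, current_value
    return best_point, best_value


# Toy objective (invented): a smooth function with its peak at weights (0.5, 0.5).
weights, value = hill_climb(lambda w: -sum((x - 0.5) ** 2 for x in w), dim=2)
```

In practice the objective would evaluate the fused alignment produced with the candidate weights (and threshold) and return its F-score; the 20 restarts mirror the 20 initial values mentioned in the text.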

The confidence score of a connection is defined as follows:

c (a_{i δ_{k} (j)}^{k} | C, E) = \sqrt{q_{c 2 e} (a_{i δ_{k} (j)}^{k} | C, E) * q_{e 2 c} (a_{i δ_{k} (j)}^{k} | C, E)}  (3)

The posterior probability of a connection in the C-E (Chinese-to-English) direction is defined as follows:

q_{c 2 e} (a_{i δ_{k} (j)}^{k} | C, E) = \frac{p_{k} (e_{i} | c_{δ_{k} (j)}^{k})}{Σ_{i^{'} = 1}^{I} p_{k} (e_{i^{'}} | c_{δ_{k} (j)}^{k})}  (4)

The posterior probability in the E-C direction, q_{e 2 c} (a_{i δ_{k} (j)}^{k} | C, E), is defined analogously. The probability p_{k} (e_{i} | c_{δ_{k} (j)}^{k}) in the above formula is the probability that the word c_{δ_{k} (j)}^{k} of segmentation s_k translates to the English word e_i; it can be obtained by training the open-source word alignment tool GIZA++ on s_k and E.
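Formulas (2) to (4) can be combined in a small Python sketch. This is a hedged illustration: the function names and the dictionary-based translation tables are assumptions, the probabilities in the usage lines are invented, and the E-C posterior is taken to be the mirror image of formula (4), normalized over the Chinese words of the segmentation, which is one reading of "defined analogously".

```python
import math


def posterior(table, given, target, candidates):
    """p(target | given), normalised over all candidate targets (formula 4)."""
    total = sum(table.get((given, c), 0.0) for c in candidates)
    return table.get((given, target), 0.0) / total if total > 0 else 0.0


def link_confidence(t_c2e, t_e2c, zh_word, en_word, zh_words, en_words):
    """Geometric mean of the two directional posteriors (formula 3)."""
    q_c2e = posterior(t_c2e, zh_word, en_word, en_words)
    q_e2c = posterior(t_e2c, en_word, zh_word, zh_words)
    return math.sqrt(q_c2e * q_e2c)


def skeleton_confidence(i, j, weights, deltas, segmentations, tables, en_words):
    """Weighted sum over segmentations (formula 2); i and j are 1-based."""
    score = 0.0
    for w, delta, seg, (t_c2e, t_e2c) in zip(weights, deltas, segmentations, tables):
        zh_word = seg[delta[j] - 1]  # word of s_k covering skeleton word j
        score += w * link_confidence(t_c2e, t_e2c, zh_word, en_words[i - 1], seg, en_words)
    return score


# Toy tables (invented): t_c2e[(zh, en)] = p(en | zh), t_e2c[(en, zh)] = p(zh | en).
english = ["Road", "is", "slippery"]
seg = ["路", "滑"]
t_c2e = {("路", "Road"): 0.6, ("路", "is"): 0.2, ("路", "slippery"): 0.2}
t_e2c = {("Road", "路"): 0.9, ("Road", "滑"): 0.1}
conf = skeleton_confidence(1, 1, [1.0], [{1: 1, 2: 2}], [seg], [(t_c2e, t_e2c)], english)
```

With a single segmentation of weight 1.0, the skeleton confidence reduces to the connection confidence, here the square root of 0.6 × 0.9.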

As can be seen, on the linear WSN it is easy to define skeleton connections and to compute their confidence scores. On a WSL, by contrast, the nonlinear structure makes it difficult to define correspondences between Chinese words and English words, and therefore difficult to adapt existing word alignment algorithms. The linearity of the WSN means that existing word alignment techniques can readily be extended to word alignment based on the WSN.

Embodiment:

All algorithms used in the present invention are implemented in C#. The machine used in the experiments has an Intel Xeon X5550 processor with a clock frequency of 2.66 GHz and 16 GB of memory. The GIZA++ word alignment toolkit used by the invention is a widely used open-source word alignment toolkit; this laboratory compiled it under Cygwin to obtain a version that runs on the Windows platform. The remaining machine translation modules used by the invention were rewritten in C# by this laboratory, following the open-source phrase-based statistical machine translation software Moses.

Before implementation, the data are prepared as follows: the Chinese side of the Chinese-English parallel corpus is segmented with K segmentation tools, giving K segmentations s_k (k=1, ..., K); a conventional word alignment a_k (k=1, ..., K) is then computed between each s_k and the parallel English side.

More specifically, as shown in Fig. 5, the present invention proceeds as follows:

1. Obtain the initial skeleton connection set: build the word segmentation network from the multiple segmentations s_k (k=1, ..., K) of the Chinese sentence, and compute the confidence scores of all skeleton connections between the Chinese word segmentation network C and the English sentence E according to formula (2). If the projection a_{i δ_{k} (j)}^{k} of a skeleton connection A_{ij} appears in some a_k (k=1, ..., K), the skeleton connection A_{ij} is said to receive one vote from a_k. The set of skeleton connections that have received at least one vote is denoted B_0 and serves as the initial skeleton connection set.

2. Obtain the final skeleton connection set: sort all skeleton connections in B_0 in descending order of confidence score and examine them one by one. A skeleton connection A_{ij} is added to the final skeleton connection set only if its confidence score is higher than the threshold α and one of the following conditions holds (the threshold can likewise be determined by the hill-climbing algorithm described above):

A) the skeleton word c_j and the English word e_i are both unaligned;

B) the skeleton word c_j is not aligned to any English word, and the skeleton connection formed by its left or right neighbor and the English word e_i has already been added to the final skeleton connection set;

C) the English word e_i is not aligned to any skeleton word, and the skeleton connection formed by its left or right neighbor and the skeleton word c_j has already been added to the final skeleton connection set.

The above is repeated until no new skeleton connection can be added to the final skeleton connection set; the final set is denoted B_1.

3. Obtain the final word alignment connection sets: according to the projection function, project each skeleton connection A_{ij} in B_1 onto each s_k, obtaining K new alignments a'_k, that is, a_{k}^{'} = {a_{i δ_{k} (j)}^{k} | A_{ij} ∈ B_{1}} (k = 1, ..., K). For each a'_k, sort the connections in a'_k in ascending order of the confidence of the skeleton connection A_{ij} from which each was projected, and compare a'_k with a_k connection by connection. If a new connection a_{i δ_{k} (j)}^{k} is not in a_k and one of the following conditions is satisfied, delete it from a'_k:

A) the Chinese word c_{δ_{k} (j)}^{k} and the English word e_i are both already aligned in the alignment;

B) an English word e that is not a left or right neighbor of e_i is already aligned, in the alignment, to the Chinese word c_{δ_{k} (j)}^{k};

C) a Chinese word c that is not a left or right neighbor of the Chinese word c_{δ_{k} (j)}^{k} is already aligned, in the alignment, to the English word e_i.

The remaining connections are taken as the final connections, that is, the fused word alignment on each of the different segmentations.

Notes on the algorithm:

A) In step 1, if a skeleton connection has received K votes, it is selected directly into the final connection set without further checks;

B) In step 3, if two or more skeleton connections in B_1 project onto the same connection on some s_k, only one copy of that connection is kept;

C) The conditions in step 2 serve to remove skeleton connections that are potentially wrong; step 3 uses a similar approach to remove connections that are potentially wrong.

To verify the effectiveness of the present invention, two groups of experiments were carried out. The first group tests whether the invention can effectively improve word alignment quality; the second tests whether it can effectively improve the performance of a machine translation system.

The experimental data are prepared as follows. The bilingual parallel sentence pairs of LDC2003E14, about 190,000 pairs, are used as the training set. The NIST '06 data set is used as the development set of the machine translation system, for estimating the weights of the system's features; the NIST '08 data set is used as the test set, for evaluating the system's performance. The Chinese side of all these corpora is processed with three segmentation tools: the Chinese Academy of Sciences segmenter ICTCLAS (denoted I), the Stanford segmenter based on the Penn Chinese Treebank annotation standard (denoted C), and the Stanford segmenter based on the Peking University annotation standard (denoted P). The machine translation system is a phrase-based system implemented by this laboratory, similar to the one proposed by Koehn in 2003. It uses a 5-gram language model trained on the Xinhua portion of the Gigaword corpus. The system parameters are trained with the minimum error rate training method proposed by Och in 2003. Two groups of baselines are used for the word alignment fusion. The first is the GIZA++ word alignment tool: after word alignment results are obtained in both directions with this tool, the two directions are merged with the grow-diag-final (GDF) heuristic; this baseline is abbreviated GIZA. The second is the linear discriminative word alignment model proposed by Yang Liu, abbreviated DIWA. To evaluate word alignment performance, the test corpus described earlier is used: its first 250 sentence pairs are used to train the weights w_k (k=1, ..., K) in formula (2) and the threshold α of the connection selection algorithm, and the remaining 241 pairs are used to evaluate the word alignment results. The Chinese side of these 491 pairs uses segmentation C. In the first group of experiments, the improvement in word alignment quality is evaluated on these 241 pairs. In Table 2, GIZA and DIWA indicate that the fused word alignment results come from the GIZA and DIWA models respectively, and P, R and F denote the precision, recall and F-score of the word alignment results. The F-score is normally taken as the measure of final word alignment quality, with P and R given for reference only. Four fusion settings are used: setting C means no fusion, that is, the conventional word alignment method based on segmentation C; setting C+P means fusing the word alignment results based on segmentations C and P; and so on.

As can be seen, the method significantly improves the word alignment F-score in both the GIZA group and the DIWA group. For the GIZA group, the F-score drops back slightly under the C+I+P setting. This is related to the GIZA model's own bias toward recall: excessive fusion on a high-recall model harms precision (69.68%). For the DIWA group, however, the more segmentations are fused, the better the word alignment result. This is related to the DIWA model's own bias toward precision: fusion effectively raises recall and thereby raises the F-score.

Table 2. Word alignment experimental results

In the second group of experiments, the performance of the machine translation system was evaluated on the NIST '08 test set, with the BLEU score as the evaluation metric. In Table 3, B denotes the baseline and Comb denotes the result after C+P+I fusion.

Table 3. Machine translation experimental results

As can be seen, the present invention significantly improves the performance of the machine translation system in both the GIZA group and the DIWA group.

The present invention provides an approach to a word alignment fusion method based on a segmentation network for computer-aided Chinese-to-English translation. There are many methods and ways to implement this technical solution, and the above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention. Any component not specified in this embodiment can be implemented with existing technology.

Claims

1. A Chinese-English word alignment fusion method based on a segmentation network (participle net) for computer-aided Chinese-to-English translation, characterized by comprising the following steps:

Step 1, determining the skeleton alignment: searching for and selecting the optimal skeleton connections using a connection selection algorithm based on connection confidence;

Step 2, projecting the selected skeleton alignment onto each segmentation to obtain word alignments based on the various segmentations;

wherein step 1 comprises the following steps:

For k = 1, ..., K, on the K sentence pairs formed by the K segmentations s_k of the Chinese sentence and the English sentence E, obtaining K word alignment results with a conventional word alignment model based on a single segmentation, denoted a_k (k = 1, ..., K);

Constructing a segmentation network from the K segmentations s_k of the Chinese sentence, the segmentation network being denoted C, C = c_1, c_2, ..., c_J, where c_j (j = 1, 2, ..., J) is the j-th skeleton word in the segmentation network C; A_{ij} is the skeleton connection between the j-th skeleton word c_j in the segmentation network C and the i-th English word e_i, and a_{i δ_k(j)}^{k} is the connection between the word of segmentation s_k onto which c_j projects and the English word e_i; the confidence of a skeleton connection is calculated with the following formula:

C(A_{ij} | C, E) = Σ_{k=1}^{K} w_k · c(a_{i δ_k(j)}^{k} | C, E)

where C(A_{ij} | C, E) is the confidence score of the skeleton connection A_{ij}; a_{i δ_k(j)}^{k} is the connection obtained by projecting the skeleton connection A_{ij} onto segmentation s_k; w_k is the weight coefficient of segmentation k; and K is the total number of segmentations;

Sorting all skeleton connections A_{ij} in B_0 in descending order of confidence score;

Examining each skeleton connection A_{ij} in turn, and selecting into the final skeleton connection set each skeleton connection A_{ij} that satisfies the following conditions:

(1) its confidence score is higher than the threshold α; and, at the same time, one of the following holds:

the skeleton word c_j and the English word e_i are both unaligned; or the skeleton word c_j is not aligned to any English word, and a skeleton connection formed by its left or right neighbor and the English word e_i has already been selected into the final skeleton connection set; or the English word e_i is not aligned to any skeleton word, and a skeleton connection formed by its left or right neighbor and the skeleton word c_j has already been selected into the final skeleton connection set;

Repeating this step until no new skeleton connection can be selected into the final skeleton connection set, the final skeleton connection set being denoted B_1.

2. The Chinese-English word alignment fusion method based on a segmentation network for computer-aided Chinese-to-English translation according to claim 1, characterized in that, in step 1, if a skeleton connection is found to have received K votes, it is directly selected into the final skeleton connection set.
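The confidence formula and the greedy selection of step 1 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: connections are (i, j) index pairs, `delta[k][j]` plays the role of the projection δ_k(j), the per-segmentation score c(·) is taken to be 1 when the projected connection is present in a_k and 0 otherwise, and B_0 is passed in as the keys of `conf`. All function and variable names are hypothetical.

```python
def connection_confidence(i, j, alignments, delta, weights):
    """C(A_ij | C, E) = sum over k of w_k * c(a^k_{i delta_k(j)} | C, E).

    Assumption: c(.) is an indicator, 1 if the projected connection is in a_k.
    """
    return sum(w for a_k, d_k, w in zip(alignments, delta, weights)
               if (i, d_k[j]) in a_k)


def select_skeleton_connections(conf, alpha, seed=()):
    """Greedily select skeleton connections (i, j) into the final set.

    conf maps each candidate connection to its confidence score;
    seed holds connections accepted outright before the greedy passes.
    """
    # Descending confidence; ties broken by index for a deterministic order.
    ranked = sorted(conf, key=lambda link: (-conf[link], link))
    selected = set(seed)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for i, j in ranked:
            if (i, j) in selected or conf[(i, j)] <= alpha:
                continue
            e_free = all(i2 != i for i2, _ in selected)  # e_i unaligned
            c_free = all(j2 != j for _, j2 in selected)  # c_j unaligned
            if (
                (e_free and c_free)
                or (c_free and ((i, j - 1) in selected or (i, j + 1) in selected))
                or (e_free and ((i - 1, j) in selected or (i + 1, j) in selected))
            ):
                selected.add((i, j))
                changed = True
    return selected


# Toy example: K = 2 segmentations, 3 skeleton words, 3 English words.
delta = [[0, 1, 1], [0, 1, 2]]  # segmentation 0 merges skeleton words 1 and 2
alignments = [{(0, 0), (1, 1)}, {(0, 0), (1, 1), (2, 2)}]
weights = [0.5, 0.5]
candidates = [(0, 0), (1, 1), (1, 2), (2, 2)]
conf = {(i, j): connection_confidence(i, j, alignments, delta, weights)
        for i, j in candidates}
B1 = select_skeleton_connections(conf, alpha=0.6)
# B1 == {(0, 0), (1, 1)}
```

If "K votes" in claim 2 is read as support from all K alignments, such connections could be passed through the `seed` argument so that they bypass the neighbor conditions.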

3. The Chinese-English word alignment fusion method based on a segmentation network for computer-aided Chinese-to-English translation according to claim 1 or 2, characterized in that step 2 comprises the following steps:

According to the projection function, projecting each skeleton connection in the skeleton connection set B_1 onto each segmentation s_k, obtaining K new word alignment results a_k' respectively, namely

a_k' = { a_{i δ_k(j)}^{k} | A_{ij} ∈ B_1 }, (k = 1, ..., K);

For each word alignment result a_k', comparing the connections in a_k' with the word alignment result a_k; if a new connection a_{i δ_k(j)}^{k} is not in the word alignment result a_k and satisfies any one of the following conditions, deleting the connection from the word alignment result a_k':

the Chinese word of the connection in segmentation s_k and the English word e_i [...] in the word alignment result [...];

a left or right neighboring Chinese word c is aligned to the English word e_i in the word alignment result [...];

Taking the connections remaining in the word alignment result a_k' as the final alignment result, which is the fusion result of the word alignment on segmentation s_k.

4. The Chinese-English word alignment fusion method based on a segmentation network for computer-aided Chinese-to-English translation according to claim 3, characterized in that, in step 2, if two or more skeleton connections in the skeleton connection set B_1 yield identical connections when projected onto some segmentation s_k, only one connection is kept.
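The projection of step 2 and the duplicate handling of claim 4 can be sketched as follows. This is a minimal illustration, assuming `delta[k][j]` gives the index of the word in segmentation s_k that covers skeleton word c_j; the names are hypothetical, and the deletion conditions of claim 3 are not reproduced here because they are incomplete in this text.

```python
def project_skeleton_connections(B1, delta):
    """Project skeleton connections (i, j) onto each segmentation s_k.

    Returns one set of (i, word index) connections per segmentation. Using a
    set keeps a single copy when several skeleton connections project to the
    same connection.
    """
    return [{(i, d_k[j]) for i, j in B1} for d_k in delta]


# Toy example: segmentation 0 merges skeleton words 1 and 2 into one word.
delta = [[0, 1, 1], [0, 1, 2]]
B1 = {(0, 0), (1, 1), (1, 2)}
projected = project_skeleton_connections(B1, delta)
# projected[0] == {(0, 0), (1, 1)}   (1, 1) and (1, 2) collapse to one
# projected[1] == {(0, 0), (1, 1), (1, 2)}
```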