CN102193915A - Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation - Google Patents


Publication number
CN102193915A
Authority
CN
China
Prior art keywords
participle
skeleton
english
word alignment
alignment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101486920A
Other languages
Chinese (zh)
Other versions
CN102193915B (en)
Inventor
奚宁
李博渊
汤光超
赵迎功
陈家骏
戴新宇
张建兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN2011101486920A priority Critical patent/CN102193915B/en
Publication of CN102193915A publication Critical patent/CN102193915A/en
Application granted granted Critical
Publication of CN102193915B publication Critical patent/CN102193915B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a word alignment fusion method, based on a word segmentation network, for computer-aided Chinese-to-English translation. The method comprises two steps: (1) determining the skeleton alignment, in which a connection selection algorithm based on connection confidence searches for and selects the optimal skeleton connections, which together form the skeleton alignment; and (2) projecting the selected skeleton alignment onto each word segmentation to obtain a word alignment for every segmentation. The method improves on conventional word alignment algorithms that rely on a single segmentation, and it improves both the word alignment quality under each segmentation and the machine translation quality. By fusing the word alignment features available under multiple segmentations, the final word alignment becomes more robust, and fewer alignment errors are caused by segmentation errors or by bilingual segmentation inconsistency.

Description

Word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation
Technical field
The present invention relates to the field of computer software for language translation, and in particular to a word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation.
Background art
With the rapid growth in the volume of information, the increasing frequency of international exchange, and the rapid spread of computer network technology, language barriers have become more apparent and more serious, and the potential demand for machine translation keeps growing. Machine translation uses computers to translate between different languages. The language being translated is called the source language and the language translated into is called the target language; machine translation is the process of converting from the source language to the target language. In recent years, statistical machine translation methods based on large-scale corpora have made a series of important advances. Statistical machine translation uses statistical methods to learn a large number of bilingual translation rules and features from a large-scale bilingual parallel corpus, then uses these rules and features to decode (translate) a source-language sentence, searching for the target-language sentence with the highest probability as the translation. Bilingual word alignment is a prerequisite step for extracting translation rules in this pipeline. Word alignment finds the correspondences between words in a pair of parallel bilingual sentences. The quality of the word alignment directly affects the quality of the extracted translation rules and, in turn, the final performance of the machine translation system. If one or both of the languages require word segmentation (as Chinese does), the usual practice is to segment the corpus with some word segmentation tool before word alignment. Such a tool is normally trained on a monolingual segmentation corpus or a monolingual dictionary. Current mainstream segmentation tools perform well on the monolingual segmentation task, but a tool built for a monolingual task does not necessarily meet the needs of bilingual word alignment; that is, the segmentation it produces may not be optimal for word alignment. The present invention calls this phenomenon bilingual segmentation inconsistency.
Existing methods for addressing bilingual segmentation inconsistency fall roughly into two classes. The first directly optimizes a segmentation for word alignment. This optimization is usually time-consuming and complex, and these methods must train their models from an initial word alignment result, which is itself not very reliable. The second uses different segmentations to obtain different sets of translation rules and then merges these rule sets by some means at the decoding (translation) stage. This class of methods does not aim to improve the word alignment quality of the individual segmentations.
The task of word alignment is to find the word-to-word correspondences in a bilingual sentence pair. Fig. 1 shows the correct word alignment of the Chinese-English sentence pair "下雨路滑 - Road is slippery when raining", where the Chinese sentence is segmented as "下/雨/路滑" ("/" marks a word boundary).
Under the Chinese segmentation shown in Fig. 1, "路滑" must be aligned to the two words "Road" and "slippery", and the two words "下" and "雨" must both be aligned to "raining", to form a correct word alignment. With this segmentation, such one-to-many and many-to-one alignment patterns make the bilingual segmentation inconsistent and make word alignment harder. If instead the Chinese sentence is segmented as "下雨/路/滑", a more natural one-to-one alignment pattern can be formed, and aligning this sentence pair becomes comparatively easy.
Table 1 shows the segmentations of "下雨路滑" produced by three segmentation tools. Only the Stanford Segmenter with the PKU standard gives a segmentation that favors word alignment; the others are either bilingually inconsistent (the word-frequency-based segmentation) or wrong (the Stanford Segmenter with the CTB standard). However, there is currently no effective method that can quickly choose, for each sentence pair in a word alignment corpus, a Chinese segmentation that favors word alignment.
Table 1. Example segmentations from three segmentation tools
Segmentation tool | Segmentation
Word-frequency-based segmentation | 下/雨/路滑
Stanford Segmenter with PKU standard | 下雨/路/滑
Stanford Segmenter with CTB standard | 下雨路/滑
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is to address the deficiencies of the prior art by providing a word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation.
To solve the above technical problem, the invention discloses a word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation, characterized by comprising the following steps:
Step 1, determine the skeleton alignment: use a connection selection algorithm based on connection confidence to search for and select the optimal skeleton connections, which form the skeleton alignment;
Step 2, project the selected skeleton alignment onto each segmentation to obtain a word alignment for each segmentation.
Step 1 of the present invention comprises the following:
The Chinese sentence c is segmented with each of K word segmentation tools, and the resulting segmentations are denoted $s_k$, where
$$ s_k = c^k_1 c^k_2 \cdots c^k_{J_k} $$
Here $c^k_j$ is the j-th word of segmentation $s_k$ and $J_k$ is the number of words in $s_k$. The English sentence parallel to the Chinese sentence c is $E = e_1 e_2 \ldots e_I$, where $e_i$ is the i-th English word of E and I is the total number of words in the English sentence;
For each of the K sentence pairs formed by a segmentation $s_k$ and the English sentence E, a word alignment model is used to obtain a word alignment result, denoted $a_k$ ($k = 1, \ldots, K$);
From the K segmentations $s_k$ of the Chinese sentence, a word segmentation network is constructed and denoted $C = c_1 c_2 \ldots c_J$, where $c_j$ ($j = 1, 2, \ldots, J$) is the j-th skeleton word of C. The connection between the j-th skeleton word $c_j$ and the i-th English word $e_i$ is the skeleton connection $A_{ij}$; the connection between the j-th word $c^k_j$ of segmentation $s_k$ and the English word $e_i$ is $a^k_{ij}$;
The confidence of a skeleton connection is computed with the following formula:
$$ C(A_{ij} \mid C, E) = \sum_{k=1}^{K} w_k \cdot c\left(a^{k}_{i\,\delta_k(j)} \mid C, E\right) $$
where $c(a^{k}_{i\,\delta_k(j)} \mid C, E)$ is the confidence score of the connection $a^{k}_{i\,\delta_k(j)}$, and $a^{k}_{i\,\delta_k(j)}$ is the projection of the skeleton connection $A_{ij}$ onto segmentation $s_k$. $w_k$ is the weight of segmentation k and can be obtained with a hill-climbing algorithm whose objective is to maximize the F-score on word-alignment-annotated data under a given segmentation k.
The set of skeleton connections that have received at least one vote is denoted $B_0$ and serves as the initial skeleton connection set (a skeleton connection receives a vote from $a_k$ when its projection onto $s_k$ appears in $a_k$);
All skeleton connections in $B_0$ are sorted in descending order of confidence score;
Each skeleton connection is examined in turn, and a skeleton connection $A_{ij}$ is added to the final skeleton connection set if it satisfies both of the following:
(1) its confidence score is higher than the threshold α (the threshold can likewise be determined by hill climbing); and
(2) one of the following conditions holds: neither the skeleton word $c_j$ nor the English word $e_i$ is aligned; or the skeleton word $c_j$ is not aligned to any English word, and a skeleton connection between its left or right neighbor and $e_i$ has already been added to the final skeleton connection set; or the English word $e_i$ is not aligned to any skeleton word, and a skeleton connection between its left or right neighbor and $c_j$ has already been added to the final skeleton connection set;
This is repeated until no new skeleton connection can be added to the final skeleton connection set, which is denoted $B_1$.
In step 1 of the present invention, a skeleton connection that has received K votes is added directly to the final connection set.
Step 2 of the present invention comprises the following:
Using the projection function, each skeleton connection in $B_1$ is projected onto each segmentation $s_k$, giving K new word alignment results $a'_k$, i.e. $a'_k = \{ a^{k}_{i\,\delta_k(j)} \mid A_{ij} \in B_1 \}$.
For each word alignment result $a'_k$, the connections $a^{k}_{i\,\delta_k(j)}$ in $a'_k$ are sorted in ascending order of the confidence of the skeleton connection $A_{ij}$ from which they were projected, and $a'_k$ is compared with $a_k$ connection by connection. If a new connection $a^{k}_{i\,\delta_k(j)}$ is not in $a_k$ and satisfies any one of the following conditions, it is deleted from $a'_k$:
the word $c^k_{\delta_k(j)}$ of segmentation $s_k$ and the English word $e_i$ are both already aligned in $a'_k$; or an English word e that is not a left or right neighbor of $e_i$ is aligned to $c^k_{\delta_k(j)}$ in $a'_k$; or a Chinese word c that is not a left or right neighbor of $c^k_{\delta_k(j)}$ is aligned to $e_i$ in $a'_k$.
The connections remaining in $a'_k$ form the final alignment result, which is the fused word alignment under segmentation $s_k$.
In step 2 of the present invention, if two or more skeleton connections in $B_1$ project onto the same connection in some segmentation $s_k$, only one copy of that connection is kept.
Beneficial effects: when one of the two languages in a Chinese-English pair requires word segmentation, the present invention can, before word alignment, effectively fuse the multiple segmentations produced by different segmentation tools into a single linear structure. The invention uses the features contained in the different segmentations to fuse word alignments, so the word alignment quality of every segmentation improves, which in turn improves the performance of machine translation software.
The invention improves on existing word alignment algorithms that rely on a single segmentation, and it improves both the word alignment quality under each segmentation and the machine translation quality. By fusing the word alignment features available under multiple segmentations, it makes the word alignment process more robust and reduces the number of alignment errors caused by segmentation errors or by bilingual segmentation inconsistency.
Brief description of the drawings
The invention is further described below with reference to the drawings and specific embodiments, from which the above and other advantages of the invention will become apparent.
Fig. 1 is a schematic diagram of a word alignment.
Fig. 2 is an example of a word segmentation lattice (WSL).
Fig. 3 is an example of a word segmentation network (WSN).
Fig. 4a and Fig. 4b are examples of a skeleton connection and a skeleton alignment, respectively, between the WSN and the English sentence.
Fig. 5 is a flowchart of the method of the invention.
Detailed description of the embodiments
The present invention proposes a structure that fuses multiple segmentations, called the word segmentation network (Word Segmentation Network, hereinafter WSN), and on that basis a word alignment fusion method that mitigates the word alignment problems caused by bilingual segmentation inconsistency. In the prior art, multiple segmentations are usually fused in natural language processing tasks with a word segmentation lattice (Word Segmentation Lattice, hereinafter WSL). For two segmentations of "下雨路滑", S_1 = "下/雨/路滑" and S_2 = "下雨/路/滑", Fig. 2 and Fig. 3 show the corresponding word segmentation lattice and word segmentation network, respectively.
The first and second rows of the WSN represent segmentations S_1 and S_2, respectively. The third row is a further segmentation distinct from S_1 and S_2, which the invention calls the skeleton segmentation.
The skeleton segmentation is the segmentation whose word boundaries are the union of the word boundaries of S_1 and S_2; that is, its split points are the split points of all the segmentations taken together. For example, "下/雨/路/滑" in Fig. 3 is a skeleton segmentation (the third row of Fig. 3). There is a word boundary between "下" and "雨" in S_1, so there is also a word boundary between "下" and "雨" in the skeleton segmentation. Likewise, there is a word boundary between "路" and "滑" in S_2, so "路" and "滑" are two separate skeleton words in the skeleton segmentation.
A skeleton word is any word of the skeleton segmentation. For example, the skeleton segmentation in Fig. 3 has four skeleton words.
Each column of the WSN consists of one skeleton word together with the words of S_1 and S_2 that cover that skeleton word at the corresponding position. The WSN in Fig. 3 has 4 columns. The number of columns of a WSN equals the number of words in its skeleton segmentation.
Note that a non-skeleton word may span several columns. For example, "路滑" in S_1 spans two columns because it is split into two words in S_2, and "下雨" in S_2 spans two columns because it is split into two words in S_1.
The invention indexes the words in each row of the WSN (including the skeleton segmentation), with indices starting from 1. The j-th skeleton word is defined to project onto word δ_k(j) of segmentation s_k if and only if the δ_k(j)-th word of s_k lies in the same column as that skeleton word. For example, δ_1(4) = 3 and δ_2(3) = 2.
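The construction of the skeleton segmentation and of the projection function δ_k can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: the function names (`boundaries`, `skeleton`, `projection`) are hypothetical, segmentations are represented as lists of words, and the example assumes the sentence is 下雨路滑 with S_1 = 下/雨/路滑 and S_2 = 下雨/路/滑.

```python
from itertools import accumulate


def boundaries(seg):
    """Character offsets at which the words of a segmentation end."""
    return set(accumulate(len(w) for w in seg))


def skeleton(segs):
    """Skeleton segmentation: its boundaries are the union of all boundaries."""
    text = "".join(segs[0])
    cuts = sorted(set().union(*(boundaries(s) for s in segs)))
    starts = [0] + cuts[:-1]
    return [text[a:b] for a, b in zip(starts, cuts)]


def projection(seg, skel):
    """delta_k: 1-based skeleton index j -> 1-based index of the word of seg
    that covers skeleton word j (i.e. lies in the same WSN column)."""
    ends = list(accumulate(len(w) for w in seg))
    delta, pos, k = {}, 0, 0
    for j, word in enumerate(skel, start=1):
        pos += len(word)
        while ends[k] < pos:  # advance to the word of seg that reaches pos
            k += 1
        delta[j] = k + 1
    return delta


S1 = ["下", "雨", "路滑"]   # assumed word-frequency-based segmentation
S2 = ["下雨", "路", "滑"]   # assumed PKU-style segmentation
skel = skeleton([S1, S2])   # ['下', '雨', '路', '滑']
d1 = projection(S1, skel)   # d1[4] == 3, matching delta_1(4) = 3
d2 = projection(S2, skel)   # d2[3] == 2, matching delta_2(3) = 2
```

Representing word boundaries as character offsets keeps the union operation trivial and makes the linear, column-wise structure of the WSN explicit.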
Next, skeleton connections and the skeleton alignment between the WSN and the English sentence are defined.
Skeleton connection: marks a translation relation between a skeleton word in the WSN and an English word.
Skeleton alignment: a set of skeleton connections.
Fig. 4a and Fig. 4b give examples of a correct skeleton connection and skeleton alignment between the above WSN and the English sentence "Road is slippery when raining". Fig. 4a shows one skeleton connection; Fig. 4b shows the skeleton alignment formed by four skeleton connections.
The invention fuses word alignments by selecting the optimal skeleton connections with a connection selection algorithm based on connection confidence, which yields the final skeleton alignment. By the projection function described above, any skeleton word can be projected onto words in S_1 and S_2, so any skeleton connection can be converted into conventional connections between words of S_1 and S_2 and English words. For example, the projection function maps the skeleton word "路" in Fig. 4 to "路滑" in S_1, so the skeleton connection in Fig. 4a maps to the connection from "路滑" to "Road" in S_1 and to the connection from "路" to "Road" in S_2.
To evaluate the improvement in word alignment performance, the invention uses 491 Chinese-English sentence pairs with manually annotated word alignments as its test set. The Chinese side of the test set is segmented with the Stanford segmenter based on the Penn Chinese Treebank annotation standard. The word alignment connections in the test set fall into two classes: sure connections, denoted S, and possible connections, denoted P. If the word alignment to be evaluated is A, its F-score is computed with the following formulas:
$$ \mathrm{precision}(S, A) = \frac{|A \cap S|}{|A|} $$
$$ \mathrm{recall}(S, A) = \frac{|A \cap S|}{|S|} $$
$$ \mathrm{Fscore}(S, \alpha, A) = \frac{1}{\dfrac{\alpha}{\mathrm{precision}(S, A)} + \dfrac{1 - \alpha}{\mathrm{recall}(S, A)}} \qquad (1) $$
In the formulas above, precision is the precision of the word alignment A and recall is the recall of A. In computing the F-score, the invention sets α = 0.5 to balance precision and recall.
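As a worked illustration of formula (1), the sketch below computes precision, recall and F-score for alignments represented as sets of (English index, Chinese index) pairs. The function name and the toy data are assumptions made for the example; precision is taken over |A| and recall over |S|, which is the usual convention.

```python
def alignment_scores(sure, predicted, alpha=0.5):
    """Precision, recall and alpha-weighted F-score of `predicted` against
    the sure connections `sure` (both are sets of (i, j) index pairs)."""
    hits = len(sure & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(sure) if sure else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f_score = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f_score


# Toy data (hypothetical): 4 sure connections, 3 predicted, 2 of them correct.
S = {(1, 1), (2, 2), (3, 3), (4, 4)}
A = {(1, 1), (2, 2), (3, 4)}
p, r, f = alignment_scores(S, A)  # p = 2/3, r = 1/2, f = 4/7
```

With α = 0.5 the F-score reduces to the harmonic mean of precision and recall.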
The word alignment fusion method based on the WSN has two steps. First, a connection selection algorithm based on connection confidence searches for and selects the optimal skeleton connections, i.e. finds the skeleton alignment. Second, the selected skeleton alignment is projected onto each segmentation to obtain conventional word alignments.
The connection selection algorithm of the invention is driven by connection confidence scores. The Chinese sentence c is segmented with each of K segmentation tools. For example, three tools may be used: the Chinese Academy of Sciences segmenter ICTCLAS (denoted I), the Stanford segmenter based on the Penn Chinese Treebank annotation standard (denoted C), and the Stanford segmenter based on the Peking University annotation standard (denoted P). Their segmentations are denoted $s_k$ ($k = 1, \ldots, K$), where
$$ s_k = c^k_1 c^k_2 \cdots c^k_{J_k} $$
Here $c^k_j$ is the j-th word of segmentation $s_k$ and $J_k$ is the number of words in $s_k$. The English sentence parallel to the Chinese sentence c is $E = e_1 e_2 \ldots e_I$, where $e_i$ is the i-th English word of E and I is the length of the English sentence. For each of the K sentence pairs formed by a segmentation $s_k$ ($k = 1, \ldots, K$) and the English sentence E, a conventional word alignment model is used to obtain a word alignment result, denoted $a_k$ ($k = 1, \ldots, K$). Next, a WSN is constructed from the K segmentations $s_k$ of the Chinese sentence as described above and denoted $C = c_1 c_2 \ldots c_J$, where $c_j$ ($j = 1, 2, \ldots, J$) is the j-th skeleton word of C. Let $A_{ij}$ be the skeleton connection between the j-th skeleton word $c_j$ of C and the i-th English word $e_i$, and let $a^k_{ij}$ be the connection between the j-th word of segmentation $s_k$ (i.e. $c^k_j$) and $e_i$. The confidence score of a skeleton connection is defined as follows:
$$ C(A_{ij} \mid C, E) = \sum_{k=1}^{K} w_k \cdot c\left(a^{k}_{i\,\delta_k(j)} \mid C, E\right) \qquad (2) $$
where $c(a^{k}_{i\,\delta_k(j)} \mid C, E)$ is the confidence score of the connection $a^{k}_{i\,\delta_k(j)}$, and $a^{k}_{i\,\delta_k(j)}$ is the projection of the skeleton connection $A_{ij}$ onto segmentation $s_k$.
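Formula (2) is a weighted vote over segmentations. A minimal sketch, assuming each segmentation k comes with a projection map `delta` (skeleton index to word index) and a dictionary of per-connection confidence scores, could look as follows. The function name, the data layout and all numbers are hypothetical.

```python
def skeleton_confidence(i, j, deltas, weights, conn_conf):
    """C(A_ij | C, E): weighted sum, over the K segmentations, of the
    confidence of the connection that A_ij projects to."""
    total = 0.0
    for delta, w, conf in zip(deltas, weights, conn_conf):
        # (i, delta[j]) is the projection of skeleton connection A_ij onto s_k;
        # a connection with no recorded confidence contributes 0.
        total += w * conf.get((i, delta[j]), 0.0)
    return total


# Hypothetical numbers: skeleton word 3 projects to word 3 of s_1 and word 2 of s_2.
deltas = [{3: 3}, {3: 2}]
weights = [0.6, 0.4]
conn_conf = [{(1, 3): 0.5}, {(1, 2): 0.9}]
score = skeleton_confidence(1, 3, deltas, weights, conn_conf)  # 0.6*0.5 + 0.4*0.9
```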
Here $w_k$ is the weight of segmentation k, which can be obtained with a hill-climbing algorithm (Russell, Stuart J. & Norvig, Peter (2003), Artificial Intelligence: A Modern Approach). In this experiment, the optimization objective is the F-score on the first 250 sentence pairs of the test data. Hill climbing works as follows: the weights are initialized at random to give the current solution; the neighborhood of the current solution is then searched, and if some neighboring solution is better than the current one, it replaces the current solution; this is repeated until no better solution can be found in the neighborhood, at which point the current solution is the final solution. The invention tried 20 different initial values and chose the solution with the highest F-score as the final $w_k$ ($k = 1, \ldots, K$).
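The hill-climbing procedure with random restarts can be sketched as below. This is a generic illustration under stated assumptions: the neighborhood is a fixed step of ±0.05 along each coordinate, the random seed and the toy objective are invented for the example, and in the patent's setting the objective would instead be the alignment F-score on the held-out sentence pairs.

```python
import random


def hill_climb(objective, init, step=0.05, max_iters=1000):
    """Greedy coordinate-wise hill climbing that maximizes `objective`."""
    current, best = list(init), objective(init)
    for _ in range(max_iters):
        improved = False
        for d in range(len(current)):
            for delta in (step, -step):
                cand = list(current)
                cand[d] += delta
                score = objective(cand)
                if score > best:  # accept only strictly better neighbors
                    current, best, improved = cand, score, True
        if not improved:  # no better neighbor: local optimum reached
            break
    return current, best


def restart_hill_climb(objective, dim, restarts=20, seed=0):
    """Run hill climbing from several random starts and keep the best result."""
    rng = random.Random(seed)
    runs = [hill_climb(objective, [rng.random() for _ in range(dim)])
            for _ in range(restarts)]
    return max(runs, key=lambda run: run[1])


def toy_objective(w):
    """Stand-in for the F-score: peaks at w = (0.3, 0.7)."""
    return -((w[0] - 0.3) ** 2 + (w[1] - 0.7) ** 2)


weights, score = restart_hill_climb(toy_objective, dim=2)
```

Restarts matter because hill climbing only finds a local optimum; taking the best of 20 runs mirrors the 20 initial values mentioned above.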
The confidence score of a connection is defined as follows:
$$ c\left(a^{k}_{i\,\delta_k(j)} \mid C, E\right) = q_{c2e}\left(a^{k}_{i\,\delta_k(j)} \mid C, E\right) \cdot q_{e2c}\left(a^{k}_{i\,\delta_k(j)} \mid C, E\right) \qquad (3) $$
The posterior probability of a connection in the C-E direction is defined as follows:
Figure BDA0000066156770000082
The posterior probability in the E-C direction, $q_{e2c}(a^{k}_{i\,\delta_k(j)} \mid C, E)$, is defined analogously. The translation probability appearing in the formula above is the probability that the word $c^k_{\delta_k(j)}$ of segmentation $s_k$ translates into the English word $e_i$; it can be obtained by training the open-source word alignment tool GIZA++ on $s_k$ and E.
As can be seen, skeleton connections are easy to define on the linear WSN, and their confidence scores are easy to compute. On a WSL, by contrast, the nonlinear structure makes it hard to define correspondences between Chinese words and English words, and hence hard to improve existing word alignment algorithms on it. The linearity of the WSN suggests that existing word alignment techniques can be readily extended to WSN-based word alignment.
Embodiment:
All algorithms used by the invention were implemented in C#. The machine used in the experiments has an Intel Xeon X5550 processor clocked at 2.66 GHz and 16 GB of memory. The GIZA++ word alignment toolkit used by the invention is the widely used open-source toolkit, compiled by this laboratory under Cygwin into a version that runs on the Windows platform. The remaining machine translation modules were rewritten in C# by this laboratory, following the open-source phrase-based statistical machine translation software Moses.
The data are prepared as follows before the method is run: the Chinese side of the Chinese-English parallel corpus is segmented with K segmentation tools, giving K segmentations s_k (k=1, ..., K); a conventional word alignment a_k (k=1, ..., K) is then computed between each s_k and the parallel English side.
More specifically, as shown in Fig. 5, the invention proceeds as follows:
1. Obtain the initial skeleton connection set: build the WSN from the multiple segmentations $s_k$ ($k = 1, \ldots, K$) of the Chinese sentence, and compute the confidence score of every skeleton connection between the Chinese WSN C and the English sentence E according to formula (2).
If the projected connection $a^{k}_{i\,\delta_k(j)}$ of a skeleton connection $A_{ij}$ appears in some $a_k$ ($k = 1, \ldots, K$), the skeleton connection $A_{ij}$ is said to receive one vote from $a_k$. The set of skeleton connections that have received at least one vote is denoted $B_0$ and serves as the initial skeleton connection set.
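The voting step that produces B_0 can be sketched as follows. The sketch assumes each alignment a_k is a set of (English index, word index) pairs and each projection δ_k is a dictionary from skeleton index to word index; the function name and the two toy alignments are hypothetical.

```python
def count_votes(alignments, deltas, n_english, n_skeleton):
    """Votes per skeleton connection (i, j): A_ij gets one vote from a_k
    when its projection (i, delta_k[j]) is a connection of a_k."""
    votes = {}
    for a_k, delta in zip(alignments, deltas):
        for i in range(1, n_english + 1):
            for j in range(1, n_skeleton + 1):
                if (i, delta[j]) in a_k:
                    votes[(i, j)] = votes.get((i, j), 0) + 1
    return votes


# Projections for S1 = 下/雨/路滑 and S2 = 下雨/路/滑 over the skeleton 下/雨/路/滑.
delta1 = {1: 1, 2: 2, 3: 3, 4: 3}
delta2 = {1: 1, 2: 1, 3: 2, 4: 3}
# Hypothetical aligner output against "Road is slippery when raining".
a1 = {(1, 3), (3, 3), (5, 2)}
a2 = {(5, 1), (1, 2), (3, 3)}

votes = count_votes([a1, a2], [delta1, delta2], n_english=5, n_skeleton=4)
B0 = set(votes)  # every skeleton connection with at least one vote
```

Because a multi-character word such as 路滑 covers two skeleton words, a single connection in a_k can cast votes for two skeleton connections.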
2. Obtain the final skeleton connection set: sort all skeleton connections in $B_0$ in descending order of confidence score and examine them one by one. A skeleton connection $A_{ij}$ is added to the final skeleton connection set only if its confidence score is higher than the threshold α (which can likewise be determined by the hill-climbing algorithm above) and one of the following conditions holds:
a) neither the skeleton word $c_j$ nor the English word $e_i$ is aligned;
b) the skeleton word $c_j$ is not aligned to any English word, and a skeleton connection between its left or right neighbor and $e_i$ has already been added to the final skeleton connection set;
c) the English word $e_i$ is not aligned to any skeleton word, and a skeleton connection between its left or right neighbor and $c_j$ has already been added to the final skeleton connection set;
The above step is repeated until no new skeleton connection can be added to the final skeleton connection set; the final set is denoted $B_1$.
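The selection loop of step 2 can be sketched as below. It is a simplified reading of the conditions a) to c): connections are (English index, skeleton index) pairs with 1-based indices, "neighbor" means index ± 1, the direct acceptance of connections with K votes is omitted, and the confidence values and the threshold are invented for the example.

```python
def select_connections(candidates, alpha):
    """Greedy selection of skeleton connections.

    candidates: dict mapping (i, j) -> confidence score (the set B_0)
    alpha:      confidence threshold
    Returns the final set B_1.
    """
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    selected = set()
    changed = True
    while changed:  # repeat until no new connection can be added
        changed = False
        for (i, j) in ranked:
            if (i, j) in selected or candidates[(i, j)] <= alpha:
                continue
            e_free = all(a != i for a, _ in selected)  # e_i not yet aligned
            c_free = all(b != j for _, b in selected)  # c_j not yet aligned
            cond_a = e_free and c_free
            cond_b = c_free and ((i, j - 1) in selected or (i, j + 1) in selected)
            cond_c = e_free and ((i - 1, j) in selected or (i + 1, j) in selected)
            if cond_a or cond_b or cond_c:
                selected.add((i, j))
                changed = True
    return selected


# English: 1=Road 2=is 3=slippery 4=when 5=raining; skeleton: 1=下 2=雨 3=路 4=滑.
B0 = {(1, 3): 0.9, (3, 4): 0.8, (5, 1): 0.7, (5, 2): 0.6, (3, 3): 0.5, (2, 4): 0.3}
B1 = select_connections(B0, alpha=0.4)
```

In this toy run, (5, 2) is accepted through condition b) because its neighbor (5, 1) was already selected, (3, 3) is rejected because both of its words are already aligned, and (2, 4) falls below the threshold, leaving the four connections of a correct skeleton alignment.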
3. Obtain the final word alignment connection sets: using the projection function, project each skeleton connection $A_{ij}$ in $B_1$ onto each $s_k$, giving K new alignments $a'_k$, i.e. $a'_k = \{ a^{k}_{i\,\delta_k(j)} \mid A_{ij} \in B_1 \}$. For each $a'_k$, sort its connections in ascending order of the confidence of the skeleton connection $A_{ij}$ from which they were projected, and compare $a'_k$ with $a_k$ connection by connection. If a new connection $a^{k}_{i\,\delta_k(j)}$ is not in $a_k$ and any of the following conditions holds, delete it from $a'_k$:
a) the Chinese word $c^k_{\delta_k(j)}$ and the English word $e_i$ are both already aligned in $a'_k$;
b) an English word e that is not a left or right neighbor of $e_i$ is aligned to the Chinese word $c^k_{\delta_k(j)}$ in $a'_k$;
c) a Chinese word c that is not a left or right neighbor of $c^k_{\delta_k(j)}$ is aligned to the English word $e_i$ in $a'_k$.
The remaining connections are the final connections, i.e. the fused word alignment for each segmentation.
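The projection and pruning of step 3 can be sketched for a single segmentation s_k as follows. Several details are assumptions of the sketch rather than statements of the patent: "already aligned" is read as "aligned by some other connection of a'_k", duplicates produced by projection keep the highest confidence, and the function name and toy data are hypothetical.

```python
def project_and_prune(final_skeleton, conf, delta, original):
    """Project B_1 onto one segmentation and prune risky new connections.

    final_skeleton: set of skeleton connections (i, j), i.e. B_1
    conf:           dict (i, j) -> confidence of the skeleton connection
    delta:          dict skeleton index j -> word index in s_k
    original:       set of connections (i, m) in the original alignment a_k
    """
    projected = {}
    for (i, j) in final_skeleton:
        conn = (i, delta[j])
        # Identical projected connections are kept only once.
        projected[conn] = max(projected.get(conn, 0.0), conf[(i, j)])

    kept = set(projected)
    for conn in sorted(projected, key=projected.get):  # ascending confidence
        if conn in original:
            continue  # only connections absent from a_k are candidates for deletion
        i, m = conn
        others = kept - {conn}
        e_aligned = any(a == i for a, _ in others)
        c_aligned = any(b == m for _, b in others)
        far_english = any(b == m and abs(a - i) > 1 for a, b in others)
        far_chinese = any(a == i and abs(b - m) > 1 for a, b in others)
        if (e_aligned and c_aligned) or far_english or far_chinese:
            kept.discard(conn)
    return kept


B1 = {(1, 3), (3, 4), (5, 1), (5, 2)}
conf = {(1, 3): 0.9, (3, 4): 0.8, (5, 1): 0.7, (5, 2): 0.6}
delta1 = {1: 1, 2: 2, 3: 3, 4: 3}   # projection onto S1 = 下/雨/路滑
a1 = {(1, 3), (3, 3), (5, 2)}       # hypothetical original alignment on S1
fused = project_and_prune(B1, conf, delta1, a1)
```

Here the only new connection, (5, 1), survives because the other connection of English word 5 goes to the adjacent Chinese word, so none of the deletion conditions applies.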
Notes on the algorithm:
A) in step 1, a skeleton connection that has received K votes is added directly to the final connection set without further checks;
B) in step 3, if two or more skeleton connections in $B_1$ project onto the same connection in some $s_k$, only one copy of that connection is kept;
C) the conditions in step 2 serve to discard skeleton connections that are potentially wrong; step 3 uses a similar approach to delete connections that are potentially wrong.
To verify the effectiveness of the invention, two groups of experiments were carried out. The first checks whether the invention effectively improves word alignment quality; the second checks whether it effectively improves the performance of a machine translation system.
The experimental data were prepared as follows. About 190,000 bilingual parallel sentence pairs selected from LDC2003E14 serve as the training set. The NIST '06 set serves as the development set of the machine translation system and is used to estimate the weights of the system's features. The NIST '08 set serves as the test set and is used to evaluate system performance. The Chinese side of all of these corpora was processed with three word segmentation (participle) tools: ICTCLAS from the Chinese Academy of Sciences (denoted I), the Stanford segmenter following the Penn Chinese Treebank annotation standard (denoted C), and the Stanford segmenter following the Peking University annotation standard (denoted P).
The machine translation system is a phrase-based system implemented in this laboratory, similar to the one proposed by Koehn in 2003. It uses a 5-gram language model trained on the Xinhua portion of the Gigaword corpus. System parameters are tuned with the minimum error rate training method proposed by Och in 2003.
Word alignment fusion was applied on top of two baselines. The first is the GIZA++ word alignment tool: word alignments are obtained in both directions and then merged with the GDF (grow-diag-final) heuristic; this baseline is abbreviated GIZA. The second is the linear discriminative word alignment model proposed by Liu Yang, abbreviated DIWA.
To evaluate word alignment performance, the test corpus mentioned above was used: its first 250 sentences were used to train the weights w_k (k = 1, ..., K) in formula (1) and the threshold a of the connection selection algorithm, and the remaining 241 sentences were used to evaluate the word alignment results. The Chinese side of these 491 sentences is segmented with C.
In the first group of experiments, the improvement in word alignment quality was evaluated on these 241 sentences. In the table below, GIZA and DIWA indicate that the fused word alignments come from the GIZA and DIWA models respectively, and P, R and F denote the precision, recall and F-score of the word alignment. The F-score is normally taken as the measure of final word alignment quality; P and R are given for reference only. Four fusion settings were used: setting C means no fusion, i.e. the traditional word alignment method based on segmentation C alone; setting C+P means fusing the word alignments based on segmentations C and P; and so on.
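For illustration, the three evaluation measures can be computed as in the following minimal Python sketch. It is not taken from the patent: it assumes a single set of gold-standard links per sentence (no distinction between sure and possible links) and the balanced F-score, and the function name `alignment_prf` is hypothetical.

```python
def alignment_prf(predicted, gold):
    """Return (precision, recall, F-score) of predicted links against gold links.

    Each link is an (english_index, chinese_index) pair. Assumption: a single
    gold set and the balanced F-score 2PR / (P + R).
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # links present in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

With four predicted links of which two are correct, against three gold links, this gives P = 1/2, R = 2/3 and F = 4/7.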
As can be seen, the method significantly improves the F-score of word alignment in both the GIZA group and the DIWA group. For the GIZA group, the F-score falls back slightly under the C+I+P setting. This is because the GIZA model itself favors recall: fusing too much on top of a high-recall model hurts precision (69.68%). For the DIWA group, by contrast, the more segmentations are fused, the better the word alignment result. This is because the DIWA model itself favors precision, so fusion effectively raises recall and thereby the F-score.
Table 2. Word alignment experimental results
Figure BDA0000066156770000111
In the second group of experiments, the performance of the machine translation system was evaluated on the NIST '08 test set, with the BLEU score as the metric. In the table, B denotes the baseline and Comb denotes the result after C+P+I fusion.
Table 3. Machine translation experimental results
Figure BDA0000066156770000121
As can be seen, the invention significantly improves the performance of the machine translation system in both the GIZA group and the DIWA group.
The invention provides an approach to word alignment fusion based on a segmentation network (participle net) for computer-aided Chinese-to-English translation. There are many ways to implement this technical solution, and the above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the invention. Any component not explicitly specified in this embodiment can be implemented with existing technology.

Claims (5)

1. A Chinese-English word alignment fusion method based on a segmentation network (participle net) for computer-aided Chinese-to-English translation, characterized in that it comprises the following steps:
Step 1, determine the skeleton alignment: use a connection selection algorithm based on connection confidence to search for and select the optimal skeleton connections;
Step 2, project the selected skeleton alignment onto each segmentation (participle) to obtain a word alignment for each segmentation.
2. The Chinese-English word alignment fusion method based on a segmentation network (participle net) for computer-aided Chinese-to-English translation according to claim 1, characterized in that step 1 comprises the following steps:
The Chinese sentence c is segmented with each of K word segmentation tools, and the resulting segmentations are denoted $s_k$, where segmentation $s_k$ is
Figure FDA0000066156760000011
where
Figure FDA0000066156760000012
is the j-th word of segmentation $s_k$ and $J_k$ is the number of words in $s_k$; the English sentence parallel to the Chinese sentence c is $E = e_1 e_2 \ldots e_I$, where $e_i$ is the i-th English word of the English sentence E and I is the total number of words in the English sentence;
For each of the K sentence pairs formed by a segmentation $s_k$ and the English sentence E, a traditional word alignment model based on a single segmentation is used to obtain a word alignment, giving K word alignment results denoted $a_k$ ($k = 1, \ldots, K$);
From the K segmentations $s_k$ of the Chinese sentence, a segmentation network is constructed, denoted $C = c_1, c_2, \ldots, c_J$, where $c_j$ ($j = 1, 2, \ldots, J$) is the j-th skeleton word of the segmentation network C; $A_{ij}$ is the skeleton connection between the j-th skeleton word $c_j$ of C and the i-th English word $e_i$; $a^{k}_{ij}$ is the connection between the j-th word of segmentation $s_k$
Figure FDA0000066156760000015
and the English word $e_i$;
The confidence of a skeleton connection is computed with the following formula:
$$C(A_{ij} \mid C, E) = \sum_{k=1}^{K} w_k \cdot c\left(a^{k}_{i\,\delta_k(j)} \mid C, E\right)$$
where $C(A_{ij} \mid C, E)$ is the confidence score of the skeleton connection $A_{ij}$; $a^{k}_{i\,\delta_k(j)}$ is the connection obtained by projecting the skeleton connection $A_{ij}$ onto segmentation $s_k$; $w_k$ is the weight coefficient of segmentation k; and K is the total number of segmentations;
The set of skeleton connections that have received at least one vote is denoted $B_0$ and serves as the initial skeleton connection set;
All skeleton connections $A_{ij}$ in $B_0$ are sorted in descending order of confidence score;
Each skeleton connection $A_{ij}$ is examined in turn, and a skeleton connection $A_{ij}$ satisfying the following conditions is selected into the final skeleton connection set:
(1) its confidence score is higher than the threshold a; and, at the same time, one of the following conditions holds:
neither the skeleton word $c_j$ nor the English word $e_i$ has been aligned; or, the skeleton word $c_j$ has not been aligned to any English word, and the skeleton connection formed by its left or right neighbor with the English word $e_i$ has already been selected into the final skeleton connection set; or, the English word $e_i$ has not been aligned to any skeleton word, and the skeleton connection formed by its left or right neighbor with the skeleton word $c_j$ has already been selected into the final skeleton connection set;
This step is repeated until no new skeleton connection can be selected into the final skeleton connection set; the final skeleton connection set is denoted $B_1$.
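The confidence computation and the construction of the initial set B_0 described in claim 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: c(·) is assumed to be a 0/1 vote indicating whether the projected connection appears in a_k; each projection function δ_k is assumed to be stored as a list mapping a skeleton word index to the index of the word of s_k that contains it; connections are (English index, Chinese index) pairs with 0-based indices; the function names are hypothetical.

```python
def skeleton_confidence(i, j, alignments, deltas, weights):
    """Weighted vote for skeleton connection A_ij.

    alignments[k] is the link set a_k, deltas[k][j] is delta_k(j), and
    weights[k] is w_k. Assumption: c(.) is 1 if the projected link is in
    a_k and 0 otherwise.
    """
    return sum(
        w_k
        for a_k, delta_k, w_k in zip(alignments, deltas, weights)
        if (i, delta_k[j]) in a_k
    )


def initial_skeleton_set(alignments, deltas, weights, num_english):
    """Return B_0 as a list of ((i, j), confidence), highest confidence first."""
    num_skeleton = len(deltas[0])  # every delta_k covers all skeleton words
    b0 = []
    for i in range(num_english):
        for j in range(num_skeleton):
            # Keep only connections that received at least one vote.
            voted = any((i, d[j]) in a for a, d in zip(alignments, deltas))
            if voted:
                score = skeleton_confidence(i, j, alignments, deltas, weights)
                b0.append(((i, j), score))
    b0.sort(key=lambda item: item[1], reverse=True)
    return b0
```

For example, with three skeleton words, one segmentation grouping them as [c0 c1][c2] (δ = [0, 0, 1]) and another as [c0][c1 c2] (δ = [0, 1, 1]), a skeleton connection supported by both segmentations receives the sum of both weights, while one supported by a single segmentation receives only that segmentation's weight.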
3. The Chinese-English word alignment fusion method based on a segmentation network (participle net) for computer-aided Chinese-to-English translation according to claim 2, characterized in that, in step 1, any skeleton connection found to have received K votes is selected directly into the final skeleton connection set.
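The selection loop of claim 2, together with the shortcut of claim 3, might look like the following Python sketch. It is one possible reading, not the patented implementation: "aligned" is taken to mean "already has a connection in the final set", passes over the sorted candidates are repeated until nothing more can be added, and connections with K votes are passed in through the hypothetical `unanimous` argument. The function name `select_skeleton_links` is likewise hypothetical.

```python
def select_skeleton_links(scored_links, threshold, unanimous=()):
    """Grow the final skeleton connection set B_1.

    scored_links: iterable of ((i, j), confidence) pairs, i.e. B_0.
    threshold:    the threshold a; only higher-scoring links are eligible.
    unanimous:    links that received all K votes (claim 3), added directly.
    """
    selected = set(unanimous)
    candidates = [
        link
        for link, score in sorted(scored_links, key=lambda item: -item[1])
        if score > threshold
    ]
    changed = True
    while changed:  # repeat until no new connection can be selected
        changed = False
        for i, j in candidates:
            if (i, j) in selected:
                continue
            english_aligned = any(si == i for si, _ in selected)
            skeleton_aligned = any(sj == j for _, sj in selected)
            # Condition 1: neither word has been aligned yet.
            both_free = not english_aligned and not skeleton_aligned
            # Condition 2: c_j is free and a neighbor of c_j is linked to e_i.
            via_skeleton_neighbor = not skeleton_aligned and (
                (i, j - 1) in selected or (i, j + 1) in selected
            )
            # Condition 3: e_i is free and a neighbor of e_i is linked to c_j.
            via_english_neighbor = not english_aligned and (
                (i - 1, j) in selected or (i + 1, j) in selected
            )
            if both_free or via_skeleton_neighbor or via_english_neighbor:
                selected.add((i, j))
                changed = True
    return selected
```

Because candidates are visited in descending order of confidence, high-confidence connections are fixed first and lower-confidence ones are admitted only when they extend an existing connection to an adjacent, still unaligned word.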
4. The Chinese-English word alignment fusion method based on a segmentation network (participle net) for computer-aided Chinese-to-English translation according to claim 2 or 3, characterized in that step 2 comprises the following steps:
According to the projection function, each skeleton connection in the skeleton connection set $B_1$ is projected onto each segmentation $s_k$, giving K new word alignment results $a'_k$, i.e.
Figure FDA0000066156760000021
For each word alignment result $a'_k$ ($k = 1, \ldots, K$), the connections in $a'_k$
Figure FDA0000066156760000022
are sorted in ascending order of the confidence of the skeleton connection $A_{ij}$ from which they were projected, and $a'_k$ is compared with the word alignment result $a_k$ connection by connection; if a new connection
Figure FDA0000066156760000023
is not in the word alignment result $a_k$ and satisfies any one of the following conditions, the connection
Figure FDA0000066156760000024
is deleted from the word alignment result $a'_k$:
the j-th word of segmentation $s_k$
Figure FDA0000066156760000031
and the English word $e_i$ are both already aligned in $a'_k$; or, there is an English word e that is not a left or right neighbor of $e_i$ and is aligned in $a'_k$ to the j-th word of segmentation $s_k$
Figure FDA0000066156760000032
or, there is a Chinese word c that is not a left or right neighbor of the j-th Chinese word of segmentation $s_k$
Figure FDA0000066156760000033
and is aligned in $a'_k$ to the English word $e_i$;
The connections remaining in $a'_k$ form the final word alignment result $a'_k$, which is the fused word alignment for segmentation $s_k$.
5. The word alignment fusion method based on a segmentation network (participle net) for computer-aided Chinese-to-English translation according to claim 4, characterized in that, in step 2, if two or more skeleton connections in the skeleton connection set $B_1$ project onto the same connection on some segmentation $s_k$, only one such connection is kept.
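The projection and pruning of claims 4 and 5 can be sketched for a single segmentation s_k as follows. Several points are assumptions rather than statements of the patent: δ_k is again a list from skeleton index to word index in s_k; when several skeleton connections project onto the same connection, one copy is kept and it inherits the highest of their confidences; "already aligned" is read as "aligned by some other connection in a'_k"; the name `project_and_prune` is hypothetical.

```python
def project_and_prune(final_links, confidence, delta_k, a_k):
    """Project B_1 onto segmentation s_k and prune doubtful new links.

    final_links: the set B_1 of skeleton links (i, j).
    confidence:  dict mapping each skeleton link to its confidence score.
    delta_k:     list mapping skeleton index j to a word index in s_k.
    a_k:         the original single-segmentation alignment for s_k.
    """
    # Projection; identical projected links are kept once (claim 5).
    projected = {}
    for i, j in final_links:
        link = (i, delta_k[j])
        projected[link] = max(projected.get(link, 0.0), confidence[(i, j)])

    result = set(projected)
    # Examine the least confident links first.
    for i, jk in sorted(projected, key=projected.get):
        if (i, jk) in a_k:
            continue  # links already present in a_k are never deleted
        others = result - {(i, jk)}
        english_partners = [oi for oi, oj in others if oj == jk]
        chinese_partners = [oj for oi, oj in others if oi == i]
        both_aligned = bool(english_partners) and bool(chinese_partners)
        far_english = any(abs(oi - i) > 1 for oi in english_partners)
        far_chinese = any(abs(oj - jk) > 1 for oj in chinese_partners)
        if both_aligned or far_english or far_chinese:
            result.discard((i, jk))
    return result
```

Examining connections in ascending order of confidence means that, when two new connections conflict, the less reliable one is removed first, and the more reliable one may then no longer violate any condition.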
CN2011101486920A 2011-06-03 2011-06-03 Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation Expired - Fee Related CN102193915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101486920A CN102193915B (en) 2011-06-03 2011-06-03 Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011101486920A CN102193915B (en) 2011-06-03 2011-06-03 Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation

Publications (2)

Publication Number Publication Date
CN102193915A true CN102193915A (en) 2011-09-21
CN102193915B CN102193915B (en) 2012-11-28

Family

ID=44601998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101486920A Expired - Fee Related CN102193915B (en) 2011-06-03 2011-06-03 Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation

Country Status (1)

Country Link
CN (1) CN102193915B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684648A (en) * 2019-01-14 2019-04-26 浙江大学 A kind of Chinese automatic translating method at all times of multiple features fusion
CN116070643A (en) * 2023-04-03 2023-05-05 武昌理工学院 Fixed style translation method and system from ancient text to English

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101008943A (en) * 2006-01-23 2007-08-01 富士施乐株式会社 Word alignment apparatus, example sentence bilingual dictionary, word alignment method, and program product for word alignment
JP2009140499A (en) * 2007-12-07 2009-06-25 Toshiba Corp Method and apparatus for training target language word inflection model based on bilingual corpus, tlwi method and apparatus, and translation method and system for translating source language text into target language
CN101676898A (en) * 2008-09-17 2010-03-24 中国科学院自动化研究所 Method and device for translating Chinese organization name into English with the aid of network knowledge
CN101714136A (en) * 2008-10-06 2010-05-26 株式会社东芝 Method and device for adapting a machine translation system based on language database to new field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李茂西,宗成庆: "机器翻译系统融合技术综述", 《中文信息学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684648A (en) * 2019-01-14 2019-04-26 浙江大学 A kind of Chinese automatic translating method at all times of multiple features fusion
CN116070643A (en) * 2023-04-03 2023-05-05 武昌理工学院 Fixed style translation method and system from ancient text to English
CN116070643B (en) * 2023-04-03 2023-08-15 武昌理工学院 Fixed style translation method and system from ancient text to English

Also Published As

Publication number Publication date
CN102193915B (en) 2012-11-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121128

Termination date: 20150603

EXPY Termination of patent right or utility model