CN102193915B - Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation - Google Patents

Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation

Info

Publication number
CN102193915B
Authority
CN
China
Prior art keywords
participle
skeleton
english
word alignment
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2011101486920A
Other languages
Chinese (zh)
Other versions
CN102193915A (en)
Inventor
奚宁
李博渊
汤光超
赵迎功
陈家骏
戴新宇
张建兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN2011101486920A priority Critical patent/CN102193915B/en
Publication of CN102193915A publication Critical patent/CN102193915A/en
Application granted granted Critical
Publication of CN102193915B publication Critical patent/CN102193915B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a participle-network-based (word-segmentation-network-based) word alignment fusion method for computer-aided Chinese-to-English translation. The method comprises two steps: 1, determining the skeleton alignment: a connection selection algorithm based on connection confidence searches for and selects the optimal skeleton connections, which form the skeleton alignment; and 2, projecting the selected skeleton alignment onto each participle (word segmentation) to obtain a word alignment for each segmentation. The method improves on conventional word alignment algorithms that rely on a single segmentation, and can improve both the word alignment quality under each segmentation and the machine translation quality. By fusing the word alignment features from multiple segmentations, the final word alignment is more robust, and the number of word alignment errors caused by segmentation errors or bilingual segmentation inconsistency is reduced.

Description

Word alignment fusion method based on a word segmentation network (participle network) for computer-aided Chinese-to-English translation
Technical field
The present invention relates to the field of computer software for language translation, and in particular to a word alignment fusion method, based on a word segmentation network, for computer-aided Chinese-to-English translation.
Background
With the rapid growth of information and the increasing frequency of international exchange, computer networking technology has spread and developed quickly, language barriers have become more apparent and more serious, and the potential demand for machine translation keeps growing. Machine translation uses a computer to translate between different languages. The language being translated is called the source language, the language translated into is called the target language, and machine translation is the process of converting from the source language to the target language. In recent years, statistical machine translation methods based on large-scale corpora have made a series of important advances. Statistical machine translation uses statistical methods to learn a large number of bilingual translation rules and features from a large-scale bilingual parallel corpus, then uses these rules and features to decode (translate) a source-language sentence, searching for the target-language sentence with the highest probability as the translation. Bilingual word alignment is a prerequisite step for obtaining translation rules in this pipeline. Word alignment finds the correspondences between the words of a bilingual parallel sentence pair. The quality of the word alignment directly affects the quality of the extracted translation rules, and thus the final performance of the machine translation system. If one or both of the languages require word segmentation (as Chinese does), the common practice is to segment the corpus with some segmentation tool before word alignment. Such a tool is usually trained on a monolingual segmentation corpus or a monolingual dictionary. Current mainstream segmentation tools perform well on the monolingual segmentation task; however, a tool designed for a monolingual task does not necessarily meet the needs of bilingual word alignment. That is, such a segmentation may not be optimal for word alignment. The present invention calls this phenomenon bilingual segmentation inconsistency.
Existing methods for addressing bilingual segmentation inconsistency fall roughly into two types. First, directly obtain a segmentation optimized for word alignment. This optimization is usually time-consuming and complex, and these methods must train a model from an initial word alignment result, which is itself not very reliable. Second, use different segmentations to obtain different sets of translation rules, then merge these rule sets by some means at the decoding (translation) stage. Methods of this type do not aim to improve the word alignment quality of the individual segmentations.
The task of word alignment is to find the word-to-word correspondences in a bilingual sentence pair. Fig. 1 shows the correct word alignment for the Chinese-English sentence pair "下雨路滑 / Road is slippery when raining", where the Chinese sentence is segmented as "下/雨/路滑" ("/" marks a word boundary).
Under the Chinese segmentation shown in Fig. 1, "路滑" must align to the two words "Road" and "slippery", and the two words "下" and "雨" must both align to "raining", to form a correct word alignment. Under this segmentation, such one-to-many and many-to-one alignment patterns cause bilingual segmentation inconsistency and make word alignment harder. Conversely, if the Chinese sentence is segmented as "下雨/路/滑", a more natural one-to-one alignment pattern forms, and aligning this sentence pair becomes relatively easy.
Table 1 shows the segmentation of "下雨路滑" under three segmentation tools. Only Stanford Segmenter with the PKU standard gives a segmentation that helps word alignment; the others are either bilingually inconsistent (the word-frequency-based segmentation) or wrong (Stanford Segmenter with the CTB standard). However, there is currently no effective method that can quickly choose, for each sentence pair in a word alignment corpus, a Chinese segmentation that helps word alignment.
Table 1 Segmentation examples from three segmentation tools
Segmentation tool                        Segmentation
Word-frequency-based segmentation        下雨/路滑
Stanford Segmenter with PKU standard     下雨/路/滑
Stanford Segmenter with CTB standard     下雨路/滑
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is to address the deficiencies of the prior art by providing a word alignment fusion method, based on a word segmentation network, for computer-aided Chinese-to-English translation.
To solve the above technical problem, the invention discloses a word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation, characterized by comprising the following steps:
Step 1, determine the skeleton alignment: use a connection selection algorithm based on connection confidence to search for and select the optimal skeleton connections, which constitute the skeleton alignment;
Step 2, project the selected skeleton alignment onto each segmentation to obtain a word alignment for each segmentation.
Step 1 of the present invention comprises the following steps:
The Chinese sentence c is segmented with each of K segmentation tools, and the resulting segmentations are denoted $s_k$, where $s_k = c^k_1 c^k_2 \ldots c^k_{J_k}$, $c^k_j$ is the j-th word of segmentation $s_k$, and $J_k$ is the number of words in $s_k$. The English sentence parallel to the Chinese sentence c is $E = e_1 e_2 \ldots e_I$, where $e_i$ is the i-th English word of E and I is the total number of words in the English sentence;
For each of the K sentence pairs formed by a segmentation $s_k$ and the English sentence E, a word alignment model is used to obtain a word alignment result; the K results are denoted $a_k$ (k = 1, …, K);
A segmentation network is constructed from the K segmentations $s_k$ of the Chinese sentence and denoted $C = c_1, c_2, \ldots, c_J$, where $c_j$ (j = 1, 2, …, J) is the j-th skeleton word of the segmentation network C. The skeleton connection between the j-th skeleton word $c_j$ of C and the i-th English word $e_i$ is denoted $A_{ij}$; the connection between the j-th word $c^k_j$ of segmentation $s_k$ and the English word $e_i$ is denoted $a^k_{ij}$;
The confidence of a skeleton connection is computed with the following formula:
$$C(A_{ij} \mid C, E) = \sum_{k=1}^{K} w_k \cdot c(a^k_{i\delta_k(j)} \mid C, E)$$
where $c(a^k_{i\delta_k(j)} \mid C, E)$ is the confidence score of the connection $a^k_{i\delta_k(j)}$, and $a^k_{i\delta_k(j)}$ is the connection obtained by projecting the skeleton connection $A_{ij}$ onto segmentation $s_k$. $w_k$ is the weight of segmentation k and can be found with a hill-climbing algorithm whose objective is to maximize the F-score on a word-alignment-annotated corpus under some segmentation k.
A skeleton connection $A_{ij}$ receives one vote from $a_k$ if its projection onto $s_k$ appears in $a_k$. The set of skeleton connections that have received at least one vote is denoted $B_0$ and serves as the initial skeleton connection set;
All skeleton connections in $B_0$ are sorted in descending order of confidence score;
Each skeleton connection is examined in turn, and a skeleton connection $A_{ij}$ is selected into the final skeleton connection set if its confidence score is higher than the threshold α (which can also be determined with the hill-climbing algorithm) and, at the same time, one of the following conditions holds:
neither the skeleton word $c_j$ nor the English word $e_i$ has been aligned; or, the skeleton word $c_j$ is not aligned to any English word, and the skeleton connection between its left or right neighbour and the English word $e_i$ has already been selected into the final skeleton connection set; or, the English word $e_i$ is not aligned to any skeleton word, and the skeleton connection between its left or right neighbour and the skeleton word $c_j$ has already been selected into the final skeleton connection set;
This is repeated until no new skeleton connection can be selected into the final skeleton connection set, which is denoted $B_1$.
In step 1 of the present invention, if a skeleton connection is found to have received K votes, it is selected directly into the final connection set.
Step 2 of the present invention comprises the following steps:
According to the projection function, each skeleton connection in the skeleton connection set $B_1$ is projected onto each segmentation $s_k$, giving K new word alignment results $a'_k$, i.e. $a'_k = \{ a^k_{i\delta_k(j)} \mid A_{ij} \in B_1 \}$ (k = 1, …, K);
For each word alignment result $a'_k$, the connections $a^k_{i\delta_k(j)}$ in $a'_k$ are sorted in ascending order of the confidence of their pre-projection skeleton connection $A_{ij}$, and $a'_k$ is compared with the word alignment result $a_k$ connection by connection. If a new connection $a^k_{i\delta_k(j)}$ is not in $a_k$ and any of the following conditions holds, the connection is deleted from $a'_k$:
the word $c^k_{\delta_k(j)}$ of segmentation $s_k$ and the English word $e_i$ are both already aligned in $a'_k$; or, an English word e that is not a left or right neighbour of $e_i$ is already aligned in $a'_k$ to the word $c^k_{\delta_k(j)}$ of $s_k$; or, a Chinese word c that is not a left or right neighbour of the word $c^k_{\delta_k(j)}$ of $s_k$ is already aligned in $a'_k$ to the English word $e_i$.
The connections remaining in $a'_k$ form the final alignment result, i.e. the fused word alignment on segmentation $s_k$.
In step 2 of the present invention, if two or more skeleton connections in $B_1$ project to the same connection on some segmentation $s_k$, only one copy of that connection is kept.
Beneficial effects: when one of the languages in a Chinese-English pair must be segmented before word alignment, the present invention can effectively fuse the multiple segmentations produced by different segmentation tools into a single linear structure. The invention uses the features contained in the different segmentations to fuse word alignments, so that the word alignment quality of every segmentation improves, which in turn improves the performance of computer translation software.
The invention improves on existing word alignment algorithms based on a single segmentation, and can simultaneously improve the word alignment quality under each segmentation and the machine translation quality. By fusing the word alignment features from multiple segmentations, the word alignment process becomes more robust, and the number of word alignment errors caused by segmentation errors or bilingual segmentation inconsistency is reduced.
Description of the drawings
The present invention is described in further detail below with reference to the drawings and embodiments; the above and other advantages of the invention will become more apparent.
Fig. 1 is a schematic diagram of a word alignment.
Fig. 2 is an example of a word segmentation lattice (WSL).
Fig. 3 is an example of a word segmentation network (WSN).
Fig. 4a and Fig. 4b are, respectively, examples of a skeleton connection and a skeleton alignment between a WSN and an English sentence.
Fig. 5 is a flowchart of the method of the invention.
Embodiment
The present invention proposes a structure that fuses multiple segmentations, called a word segmentation network (WSN), and on that basis proposes a WSN-based word alignment fusion method to alleviate the word alignment problems caused by bilingual segmentation inconsistency. In the prior art, natural language processing tasks usually fuse multiple segmentations with a word segmentation lattice (WSL). For the two segmentations of "下雨路滑", S1: "下/雨/路滑" and S2: "下雨/路/滑", Fig. 2 and Fig. 3 show the corresponding lattice and network representations, respectively.
The first and second rows of the WSN represent segmentation $S_1$ and segmentation $S_2$ respectively. The third row is a further segmentation, different from $S_1$ and $S_2$, which the present invention calls the skeleton segmentation.
The skeleton segmentation is the segmentation whose set of word boundaries is the union of the word boundaries of $S_1$ and $S_2$; that is, its segmentation points are the segmentation points of all the segmentations taken together. For example, "下/雨/路/滑" in Fig. 3 is a skeleton segmentation (the third row of Fig. 3). In $S_1$ there is a word boundary between "下" and "雨", so there is also a word boundary between "下" and "雨" in the skeleton segmentation. Likewise, in $S_2$ there is a word boundary between "路" and "滑", so "路" and "滑" are two separate skeleton words in the skeleton segmentation.
A skeleton word is any word of the skeleton segmentation. For example, the skeleton segmentation in Fig. 3 has four skeleton words.
Each column of the WSN consists of one skeleton word together with the words of $S_1$ and $S_2$ that cover this skeleton word at the corresponding position. The WSN in Fig. 3 has 4 columns. The number of columns of a WSN therefore equals the number of words in the skeleton segmentation.
Note that a non-skeleton word may cover several columns. For example, "路滑" in $S_1$ covers two columns because it is split into two words in $S_2$, and "下雨" in $S_2$ covers two columns because it is split into two words in $S_1$.
The present invention indexes the words in each row of the WSN (including the skeleton segmentation), starting from 1. The j-th skeleton word is defined to project onto word $\delta_k(j)$ of segmentation $s_k$ if and only if the $\delta_k(j)$-th word of $s_k$ lies in the same column as that skeleton word. For example, $\delta_1(4)=3$ and $\delta_2(3)=2$.
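The skeleton segmentation and the projection function described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `boundaries`, `skeleton_segmentation` and `projection` are hypothetical and do not come from the patent, and each segmentation is assumed to be a list of strings that concatenate to the same sentence.

```python
from itertools import accumulate

def boundaries(seg):
    """Character offsets at which each word of a segmentation ends."""
    return list(accumulate(len(w) for w in seg))

def skeleton_segmentation(sentence, segmentations):
    """Cut the sentence at the union of all word boundaries."""
    cuts = sorted(set().union(*(boundaries(s) for s in segmentations)))
    starts = [0] + cuts[:-1]
    return [sentence[a:b] for a, b in zip(starts, cuts)]

def projection(seg, skeleton):
    """delta_k: 1-based skeleton index j -> 1-based index of the word of
    `seg` that covers skeleton word j (i.e. shares its WSN column)."""
    ends = boundaries(seg)
    delta, pos, k = {}, 0, 0
    for j, word in enumerate(skeleton, start=1):
        pos += len(word)
        while ends[k] < pos:      # advance to the word covering this offset
            k += 1
        delta[j] = k + 1
    return delta

s1 = ["下", "雨", "路滑"]
s2 = ["下雨", "路", "滑"]
skeleton = skeleton_segmentation("下雨路滑", [s1, s2])
delta1 = projection(s1, skeleton)
delta2 = projection(s2, skeleton)
```

With the two segmentations of the running example, `skeleton` has four words, and `delta1[4]` and `delta2[3]` reproduce the values $\delta_1(4)=3$ and $\delta_2(3)=2$ given in the text.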
Next, the skeleton connection and the skeleton alignment between a WSN and an English sentence are defined.
A skeleton connection indicates a translation relation between a skeleton word of the WSN and an English word.
A skeleton alignment is a set of skeleton connections.
Fig. 4a and Fig. 4b are examples of a correct skeleton connection and a correct skeleton alignment between the above WSN and the English sentence "Road is slippery when raining". Fig. 4a shows a single skeleton connection, and Fig. 4b shows the skeleton alignment formed by four skeleton connections.
The present invention uses a connection selection algorithm based on connection confidence to select the optimal skeleton connections and thereby fuse the word alignments, obtaining the final skeleton alignment. With the projection function described above, any skeleton word can be projected onto a word of $S_1$ or $S_2$, and so any skeleton connection can be converted into conventional connections between words of $S_1$ and $S_2$ and English words. For example, by the projection function, the skeleton word "路" in Fig. 4 maps to "路滑" in $S_1$, so the skeleton connection in Fig. 4a maps to the connection between "路滑" and "Road" in $S_1$ and to the connection between "路" and "Road" in $S_2$.
To evaluate the improvement in word alignment performance, the present invention uses 491 Chinese-English sentence pairs with manually annotated word alignments as the test set. The Chinese side of the test set is segmented with the Stanford segmenter based on the Penn Chinese Treebank (CTB) annotation standard. The word alignment connections in the test set are divided into two types: sure connections, denoted S (sure), and possible connections, denoted P (possible). If the word alignment to be evaluated is A, its F-score is computed as follows:
$$\mathrm{precision}(S, A) = \frac{|A \cap S|}{|A|}$$
$$\mathrm{recall}(S, A) = \frac{|A \cap S|}{|S|}$$
$$\mathrm{Fscore}(S, \alpha, A) = \frac{1}{\dfrac{\alpha}{\mathrm{precision}(S, A)} + \dfrac{1-\alpha}{\mathrm{recall}(S, A)}} \tag{1}$$
In the above formulas, precision is the precision of the word alignment A and recall is the recall of A. In the Fscore formula, the present invention chooses α = 0.5 to balance precision and recall.
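As a quick illustration of formula (1), the following Python sketch computes precision, recall and F-score for an alignment represented as a set of (English index, Chinese index) pairs. The function name `alignment_scores` and the set representation are assumptions made for this example; precision is taken over the evaluated alignment A and recall over the sure set S, which is the usual convention.

```python
def alignment_scores(sure, predicted, alpha=0.5):
    """Return (precision, recall, F-score) of `predicted` against `sure`.

    Both arguments are sets of (english_index, chinese_index) pairs.
    """
    hits = len(sure & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(sure) if sure else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f_score = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f_score

# Toy data (not from the patent): three of four predicted links are correct.
sure = {(1, 3), (3, 4), (5, 1), (5, 2)}
predicted = {(1, 3), (3, 4), (5, 1), (2, 4)}
p, r, f = alignment_scores(sure, predicted)
```

With α = 0.5 the F-score reduces to the harmonic mean of precision and recall, so it does not favour either quantity.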
The word alignment fusion method based on the segmentation network has two steps. In the first step, a connection selection algorithm based on connection confidence searches for and selects the optimal skeleton connections, i.e. finds the skeleton alignment. In the second step, the selected skeleton alignment is projected onto each segmentation to obtain conventional word alignments.
The connection selection algorithm of the present invention is driven by connection confidence scores. The Chinese sentence c is segmented with each of K segmentation tools. For example, three tools can be used: the Chinese Academy of Sciences segmenter ICTCLAS (denoted I), the Stanford segmenter based on the Penn Chinese Treebank annotation standard (denoted C), and the Stanford segmenter based on the Peking University annotation standard (denoted P). Their segmentations are denoted $s_k$ (k = 1, …, K), where $s_k = c^k_1 c^k_2 \ldots c^k_{J_k}$, $c^k_j$ is the j-th word of segmentation $s_k$, and $J_k$ is the number of words in $s_k$. The English sentence parallel to the Chinese sentence c is $E = e_1 e_2 \ldots e_I$, where $e_i$ is the i-th English word of E and I is the length of the English sentence. For each of the K sentence pairs formed by a segmentation $s_k$ (k = 1, …, K) and the English sentence E, a conventional word alignment model is used to obtain a word alignment result; the K results are denoted $a_k$ (k = 1, …, K). Next, a WSN is constructed from the K segmentations $s_k$ (k = 1, …, K) of the Chinese sentence by the method described above and denoted $C = c_1 c_2 \ldots c_J$, where $c_j$ (j = 1, 2, …, J) is the j-th skeleton word of C. Further, let $A_{ij}$ be the skeleton connection between the j-th skeleton word $c_j$ of C and the i-th English word $e_i$, and let $a^k_{ij}$ be the connection between the j-th word of segmentation $s_k$ (i.e. $c^k_j$) and $e_i$. The confidence score of a skeleton connection is defined as follows:
$$C(A_{ij} \mid C, E) = \sum_{k=1}^{K} w_k \cdot c(a^k_{i\delta_k(j)} \mid C, E) \tag{2}$$
where $c(a^k_{i\delta_k(j)} \mid C, E)$ is the confidence score of the connection $a^k_{i\delta_k(j)}$, and $a^k_{i\delta_k(j)}$ is the connection obtained by projecting the skeleton connection $A_{ij}$ onto segmentation $s_k$.
Here $w_k$ is the weight of segmentation k and can be found with a hill-climbing algorithm (Russell, Stuart J. & Norvig, Peter (2003), Artificial Intelligence: A Modern Approach). In this experiment, the optimization objective is the F-score on the first 250 sentence pairs of the test corpus. The hill-climbing algorithm can be summarized as follows: the weights are initialized at random to give the current solution; the neighbourhood of the current solution is then searched; if some neighbouring solution is better than the current solution, it replaces the current solution; this is repeated until no better solution can be found in the neighbourhood, at which point the current solution is the final solution. The present invention tries 20 different initial values and chooses the one with the highest F-score as the final solution for $w_k$ (k = 1, …, K).
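The hill-climbing procedure with random restarts summarized above can be sketched as follows. This is a generic illustration, not the patent's implementation: the neighbourhood (one coordinate moved by a fixed `step`), the step size, and the names `hill_climb` and `random_restart` are all assumptions. In the patent's setting, `objective` would be the F-score on the 250 tuning pairs as a function of the weights; here it is left as a parameter.

```python
import random

def hill_climb(objective, start, step=0.05):
    """Greedy ascent: move one coordinate by +/- step while it improves."""
    current = list(start)
    best = objective(current)
    improved = True
    while improved:
        improved = False
        for d in range(len(current)):
            for delta in (step, -step):
                candidate = current[:]
                candidate[d] += delta
                score = objective(candidate)
                if score > best:              # accept strictly better neighbours
                    current, best, improved = candidate, score, True
    return current, best

def random_restart(objective, dim, restarts=20, seed=0, step=0.05):
    """Run hill climbing from several random starts and keep the best run."""
    rng = random.Random(seed)
    runs = [hill_climb(objective, [rng.random() for _ in range(dim)], step)
            for _ in range(restarts)]
    return max(runs, key=lambda run: run[1])

# Toy objective standing in for the F-score (maximum at weights 0.3 and 0.7).
toy = lambda w: -((w[0] - 0.3) ** 2 + (w[1] - 0.7) ** 2)
weights, score = random_restart(toy, dim=2)
```

Restarting from several initial values, as the text describes with 20 starts, reduces the chance of returning a poor local optimum, which matters because the F-score is not a smooth function of the weights.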
The confidence score of a connection is defined as follows:
$$c(a^k_{i\delta_k(j)} \mid C, E) = q_{c2e}(a^k_{i\delta_k(j)} \mid C, E) \cdot q_{e2c}(a^k_{i\delta_k(j)} \mid C, E) \tag{3}$$
The posterior probability of a connection in the C-E direction is defined as follows:
$$q_{c2e}(a^k_{i\delta_k(j)} \mid C, E) = \frac{p_k(e_i \mid c^k_{\delta_k(j)})}{\sum_{i'=1}^{I} p_k(e_{i'} \mid c^k_{\delta_k(j)})} \tag{4}$$
The posterior probability in the E-C direction, $q_{e2c}(a^k_{i\delta_k(j)} \mid C, E)$, can be defined similarly. The probability $p_k(e_i \mid c^k_{\delta_k(j)})$ in the formula above is the probability that the word $c^k_{\delta_k(j)}$ of segmentation $s_k$ translates to the English word $e_i$; it can be obtained by training the open-source word alignment tool GIZA++ on $s_k$ and E.
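Formulas (2) to (4) can be combined into a small Python sketch. It assumes that the translation probabilities have already been estimated (for example with GIZA++) and are available as plain dictionaries; the dictionary layout and all function names are hypothetical. The E-C posterior is written by symmetry with formula (4), normalizing over the Chinese words of the segmentation, which is an assumption since the text only says it is defined similarly.

```python
def posterior_c2e(p_c2e, chinese_word, i, english):
    """Formula (4): p(e_i | c), normalised over the English sentence."""
    total = sum(p_c2e.get((chinese_word, e), 0.0) for e in english)
    return p_c2e.get((chinese_word, english[i]), 0.0) / total if total else 0.0

def posterior_e2c(p_e2c, english_word, m, chinese):
    """Assumed mirror of formula (4): p(c_m | e), normalised over the segmentation."""
    total = sum(p_e2c.get((english_word, c), 0.0) for c in chinese)
    return p_e2c.get((english_word, chinese[m]), 0.0) / total if total else 0.0

def link_confidence(p_c2e, p_e2c, i, m, english, chinese):
    """Formula (3): product of the two directional posteriors (0-based i, m)."""
    return (posterior_c2e(p_c2e, chinese[m], i, english)
            * posterior_e2c(p_e2c, english[i], m, chinese))

def skeleton_confidence(i, j, english, segmentations, deltas, tables, weights):
    """Formula (2): weighted sum over the K segmentations (1-based i, j)."""
    total = 0.0
    for seg, delta, (p_c2e, p_e2c), w in zip(segmentations, deltas, tables, weights):
        m = delta[j] - 1                      # projected word, 0-based
        total += w * link_confidence(p_c2e, p_e2c, i - 1, m, english, seg)
    return total

# Toy probabilities, invented for illustration only.
english = ["Road", "slippery"]
segs = [["路滑"], ["路", "滑"]]
deltas = [{1: 1, 2: 1}, {1: 1, 2: 2}]
tables = [
    ({("路滑", "Road"): 0.5, ("路滑", "slippery"): 0.5},
     {("Road", "路滑"): 1.0, ("slippery", "路滑"): 1.0}),
    ({("路", "Road"): 0.75, ("路", "slippery"): 0.25,
      ("滑", "Road"): 0.25, ("滑", "slippery"): 0.75},
     {("Road", "路"): 0.75, ("Road", "滑"): 0.25,
      ("slippery", "路"): 0.25, ("slippery", "滑"): 0.75}),
]
conf = skeleton_confidence(1, 1, english, segs, deltas, tables, [0.5, 0.5])
```

In this toy case the connection between "Road" and the skeleton word "路" scores 0.5 under the first segmentation and 0.5625 under the second, so the equally weighted skeleton confidence is 0.53125.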
As can be seen, on the linear WSN it is easy to define skeleton connections and to compute their confidence scores. On a WSL, because of its nonlinear nature, it is difficult to define correspondences between Chinese words and English words, and hence difficult to improve existing word alignment algorithms. The linearity of the WSN suggests that existing word alignment techniques can easily be extended to WSN-based word alignment.
Embodiment:
All algorithms used in the present invention are implemented in C#. The machine used in the experiments has an Intel Xeon X5550 processor with a clock frequency of 2.66 GHz and 16 GB of memory. The GIZA++ word alignment toolkit used by the invention is a widely used open-source word alignment toolkit; this laboratory compiled it under Cygwin to obtain a version that runs on the Windows platform. The remaining machine translation modules used by the invention were rewritten in C# by this laboratory, based on the open-source phrase-based statistical machine translation software Moses.
Before implementation, the data are prepared as follows: the Chinese side of the Chinese-English parallel corpus is segmented with K segmentation tools, giving K segmentations $s_k$ (k = 1, …, K); a conventional word alignment $a_k$ (k = 1, …, K) is then computed between each $s_k$ and the parallel English side.
More specifically, as shown in Fig. 5, the present invention proceeds as follows:
1. Obtain the initial skeleton connection set: build the segmentation network from the segmentations $s_k$ (k = 1, …, K) of the Chinese sentence, and compute, according to formula (2), the confidence score of every skeleton connection between the Chinese segmentation network C and the English sentence E. If the projection $a^k_{i\delta_k(j)}$ of a skeleton connection $A_{ij}$ appears in some $a_k$ (k = 1, …, K), the skeleton connection $A_{ij}$ is said to receive one vote from $a_k$. The set of skeleton connections that have received at least one vote is denoted $B_0$ and serves as the initial skeleton connection set.
2. Obtain the final skeleton connection set: sort all skeleton connections in $B_0$ in descending order of confidence score, and examine each skeleton connection in turn. A skeleton connection $A_{ij}$ is selected into the final skeleton connection set only if its confidence score is higher than the threshold α (which can also be determined with the hill-climbing algorithm above) and one of the following conditions holds:
A) neither the skeleton word $c_j$ nor the English word $e_i$ has been aligned;
B) the skeleton word $c_j$ is not aligned to any English word, and the skeleton connection between its left or right neighbour and the English word $e_i$ has already been selected into the final skeleton connection set;
C) the English word $e_i$ is not aligned to any skeleton word, and the skeleton connection between its left or right neighbour and the skeleton word $c_j$ has already been selected into the final skeleton connection set;
The above step is repeated until no new skeleton connection can be selected into the final skeleton connection set; the final set is denoted $B_1$.
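Steps 1 and 2 above can be sketched as a short Python function. This is one possible reading of the selection rules, not a definitive implementation: connections are (i, j) pairs with 1-based English index i and skeleton index j, "aligned" is interpreted as "already has a connection in the final set", and connections with K votes are seeded into the final set as the patent states for step 1. The name `select_skeleton_links` and the input dictionaries are hypothetical.

```python
def select_skeleton_links(confidence, votes, K, alpha):
    """Grow the final skeleton connection set B1 from the candidates in B0.

    confidence: {(i, j): score}, votes: {(i, j): number of a_k containing it}.
    """
    # Connections supported by all K alignments are accepted directly.
    final = {link for link, v in votes.items() if v == K}
    # B0: at least one vote, examined in descending order of confidence.
    ranked = sorted((l for l, v in votes.items() if v >= 1),
                    key=lambda l: confidence[l], reverse=True)
    changed = True
    while changed:                            # repeat until nothing new is added
        changed = False
        for i, j in ranked:
            if (i, j) in final or confidence[(i, j)] <= alpha:
                continue
            e_free = all(fi != i for fi, _ in final)   # e_i not yet aligned
            c_free = all(fj != j for _, fj in final)   # c_j not yet aligned
            if ((e_free and c_free)                                    # A)
                    or (c_free and ((i, j - 1) in final
                                    or (i, j + 1) in final))           # B)
                    or (e_free and ((i - 1, j) in final
                                    or (i + 1, j) in final))):         # C)
                final.add((i, j))
                changed = True
    return final

# Toy scores (invented): English "Road is slippery when raining", skeleton 下/雨/路/滑.
confidence = {(1, 3): 0.9, (3, 4): 0.8, (5, 1): 0.7, (5, 2): 0.6,
              (2, 4): 0.5, (1, 1): 0.45, (4, 1): 0.3}
votes = {link: 1 for link in confidence}
votes[(1, 3)] = 2
b1 = select_skeleton_links(confidence, votes, K=2, alpha=0.4)
```

In this toy run, (1, 1) is rejected because both words are already aligned, and (4, 1) is rejected because its confidence is below the threshold. The neighbour conditions B) and C) let a connection attach to a word that is already aligned only when it extends an existing connection to an adjacent position.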
3. Obtain the final word alignment connection sets: according to the projection function, each skeleton connection $A_{ij}$ in $B_1$ is projected onto each $s_k$, giving K new alignments $a'_k$, i.e. $a'_k = \{ a^k_{i\delta_k(j)} \mid A_{ij} \in B_1 \}$ (k = 1, …, K). For each $a'_k$, the connections in $a'_k$ are sorted in ascending order of the confidence of their pre-projection skeleton connection $A_{ij}$, and $a'_k$ is compared with $a_k$ connection by connection. If a new connection $a^k_{i\delta_k(j)}$ is not in $a_k$ and any of the following conditions holds, it is deleted from $a'_k$:
A) the Chinese word $c^k_{\delta_k(j)}$ and the English word $e_i$ are both already aligned in $a'_k$;
B) an English word e that is not a left or right neighbour of $e_i$ is already aligned in $a'_k$ to the Chinese word $c^k_{\delta_k(j)}$;
C) a Chinese word c that is not a left or right neighbour of $c^k_{\delta_k(j)}$ is already aligned in $a'_k$ to the English word $e_i$.
The remaining connections are the final connections, i.e. the fused word alignment on each segmentation.
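The projection and pruning of step 3 can be sketched as follows. Several details are interpretations rather than statements from the patent: "already aligned" is read as "aligned by some other connection still kept in the projected alignment", neighbours are positions whose index differs by exactly 1, and when several skeleton connections project to the same connection the highest confidence is kept. The names `project` and `prune` are hypothetical.

```python
def project(final_links, delta, confidence):
    """Project skeleton connections (i, j) onto one segmentation.

    Returns {(i, delta[j]): confidence}; duplicate projections are merged.
    """
    projected = {}
    for (i, j) in final_links:
        key = (i, delta[j])
        projected[key] = max(projected.get(key, 0.0), confidence[(i, j)])
    return projected

def prune(projected, baseline):
    """Drop projected connections that are absent from the baseline
    alignment a_k and conflict with the other kept connections."""
    kept = set(projected)
    for (i, m) in sorted(projected, key=projected.get):   # ascending confidence
        if (i, m) in baseline:
            continue
        others = kept - {(i, m)}
        both_aligned = (any(oi == i for oi, _ in others)
                        and any(om == m for _, om in others))             # A)
        far_english = any(om == m and abs(oi - i) > 1 for oi, om in others)  # B)
        far_chinese = any(oi == i and abs(om - m) > 1 for oi, om in others)  # C)
        if both_aligned or far_english or far_chinese:
            kept.discard((i, m))
    return kept

# Toy data (invented): skeleton connections projected onto S2 = 下雨/路/滑.
b1 = {(1, 3), (3, 4), (5, 1), (5, 2), (2, 4)}
confidence = {(1, 3): 0.9, (3, 4): 0.8, (5, 1): 0.7, (5, 2): 0.6, (2, 4): 0.5}
delta2 = {1: 1, 2: 1, 3: 2, 4: 3}
projected = project(b1, delta2, confidence)
fused = prune({(1, 1): 0.9, (4, 1): 0.2, (2, 2): 0.8}, baseline={(1, 1), (2, 2)})
```

In the first toy call, the skeleton connections (5, 1) and (5, 2) both project to (5, 1) on S2 and are merged into one connection. In the second, the low-confidence connection (4, 1) is removed because a non-adjacent English word is already aligned to the same Chinese word.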
Notes on the algorithm:
A) In step 1, if a skeleton connection is found to have received K votes, it is selected directly into the final connection set without further checks;
B) In step 3, if two or more skeleton connections in $B_1$ project to the same connection on some $s_k$, only one copy of that connection is kept;
C) The conditions in step 2 serve to exclude potentially erroneous skeleton connections; step 3 uses a similar approach to delete potentially erroneous connections.
In order to verify validity of the present invention, the present invention has carried out two groups of experiments.First group of experiment is used for checking the present invention whether can effectively improve the quality of word alignment; Second group of experiment is used for checking the present invention whether can effectively improve the performance of machine translation system.
The experimental data were prepared as follows. About 190,000 bilingual parallel sentence pairs selected from LDC2003E14 serve as the training set. The NIST '06 set serves as the development set of the machine translation system and is used to estimate the weights of the system's features. The NIST '08 set serves as the test set and is used to evaluate system performance. The Chinese side of these corpora was segmented with three word segmentation tools: ICTCLAS, the segmenter of the Chinese Academy of Sciences (denoted I); the Stanford segmenter following the Penn Chinese Treebank annotation standard (denoted C); and the Stanford segmenter following the Peking University annotation standard (denoted P).
The machine translation system is an in-house implementation of a phrase-based system similar to the one proposed by Koehn in 2003. It uses a 5-gram language model trained on the Xinhua portion of the Gigaword corpus, and its parameters are tuned with the minimum error rate training method proposed by Och in 2003.
Word alignment fusion was applied on top of two baselines. The first is the GIZA++ word alignment tool: alignments are obtained in both directions and then merged with the grow-diag-final (GDF) heuristic. This baseline is abbreviated GIZA. The second is the linear discriminative word alignment model proposed by Liu Yang, abbreviated DIWA.
To evaluate word alignment quality, 491 sentence pairs from the test data described above were used: the first 250 to train the weights w_k (k = 1, ..., K) in formula (1) and the threshold α of the connection selection algorithm, and the remaining 241 to evaluate the word alignment results. The Chinese side of these 491 sentences uses segmentation C.
In the first experiment, the improvement in word alignment quality was evaluated on these 241 sentences. In Table 2, GIZA and DIWA indicate that the fused alignments come from the GIZA and DIWA models respectively, and P, R and F denote the precision, recall and F-score of the word alignment. F-score is normally taken as the measure of final alignment quality; P and R are given for reference only. Four fusion settings were used: setting C means no fusion, that is, the conventional word alignment method based on segmentation C alone; setting C+P means fusing the word alignments based on segmentations C and P; and so on.
The results show that the method significantly improves the F-score of word alignment in both the GIZA group and the DIWA group. In the GIZA group, the F-score drops back slightly under the C+I+P setting. This is related to the GIZA model's inherent bias toward recall: fusing too many segmentations on a high-recall model damages precision (69.68%). In the DIWA group, by contrast, the more segmentations are fused, the better the word alignment. This is related to the DIWA model's inherent bias toward precision: fusion effectively raises recall and therefore the F-score.
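The P, R and F values discussed above can be illustrated with a minimal sketch. It assumes that an alignment is a set of (English index, Chinese index) pairs and that F is the balanced harmonic mean of P and R; the function name `alignment_prf` and the absence of any sure/possible link distinction are assumptions made for illustration, since the exact evaluation formula is not given here.

```python
def alignment_prf(predicted, gold):
    """Precision, recall and balanced F-score of predicted links against gold links.

    Both arguments are iterables of (english_index, chinese_index) pairs.
    The balanced F-score is an assumption; the source does not state the formula.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)          # links present in both sets
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical toy data: 4 predicted links, 3 gold links, 2 in common.
predicted = {(0, 0), (1, 1), (2, 2), (3, 3)}
gold = {(0, 0), (1, 1), (2, 3)}
p, r, f = alignment_prf(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

A recall-biased model yields many predicted links, so adding further links through fusion tends to lower P faster than it raises R, which matches the behaviour described for the GIZA group.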
Table 2. Word alignment experimental results
[Table image: Figure GDA00001966841600111]
In the second experiment, the performance of the machine translation system was evaluated on the NIST '08 test set, with BLEU score as the evaluation metric. In Table 3, B denotes the baseline and Comb denotes the result after C+P+I fusion.
Table 3. Machine translation experimental results
The results show that the invention significantly improves the performance of the machine translation system in both the GIZA group and the DIWA group.
The invention provides an approach to word alignment fusion based on a word segmentation network (participle net) for computer Chinese-to-English translation. There are many methods and ways to implement this technical solution, and the above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make a number of improvements and refinements without departing from the principle of the invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the invention. Any component not specified in this embodiment can be implemented with existing technology.

Claims (4)

1. A Chinese-English word alignment fusion method based on a word segmentation network (participle net) for computer Chinese-to-English translation, characterized by comprising the following steps:
Step 1, determining the skeleton alignment: searching for and selecting the optimal skeleton connections with a connection selection algorithm based on connection confidence;
Step 2, projecting the selected skeleton alignment onto each segmentation to obtain a word alignment for each segmentation;
wherein step 1 comprises the following steps:
segmenting the Chinese sentence c with each of K word segmentation tools, the resulting segmentations being denoted s_k, where s_k = s^k_1 s^k_2 … s^k_{J_k}, k = 1, ..., K, s^k_j is the j-th word of segmentation s_k, and J_k is the number of words in segmentation s_k; the English sentence parallel to the Chinese sentence c is E = e_1 e_2 … e_I, where e_i is the i-th English word of the English sentence E and I is the total number of words in the English sentence;
for each of the K sentence pairs formed by a segmentation s_k and the English sentence E, obtaining a word alignment with a conventional word alignment model based on a single segmentation, giving K word alignment results denoted a_k (k = 1, ..., K);
constructing a segmentation network from the K segmentations s_k of the Chinese sentence, the segmentation network being denoted C, C = c_1, c_2, ..., c_J, where c_j (j = 1, 2, ..., J) is the j-th skeleton word of the segmentation network C; A_{ij} is the skeleton connection between the j-th skeleton word c_j of C and the i-th English word e_i, and a^k_{ij} is the connection between the j-th word s^k_j of segmentation s_k and the English word e_i; the confidence of a skeleton connection is computed with the following formula:
\[ C(A_{ij} \mid C, E) = \sum_{k=1}^{K} w_k \cdot c\left(a^{k}_{i\,\delta_k(j)} \mid C, E\right) \]
where c(a^k_{iδ_k(j)} | C, E) is the confidence score of the connection a^k_{iδ_k(j)}, a^k_{iδ_k(j)} is the connection obtained by projecting the skeleton connection A_{ij} onto segmentation s_k, w_k is the weight coefficient of segmentation k, and K is the total number of segmentations;
taking the set of skeleton connections that have received at least one vote, denoted B_0, as the initial skeleton connection set;
sorting all skeleton connections A_{ij} in B_0 in descending order of confidence score;
examining each skeleton connection A_{ij} in turn, a skeleton connection A_{ij} being added to the final skeleton connection set if it satisfies the following conditions:
(1) its confidence score is higher than the threshold α; and, at the same time, one of the following holds:
neither the skeleton word c_j nor the English word e_i has been aligned; or the skeleton word c_j has not been aligned to any English word, and the skeleton connection formed by its left or right neighbor with the English word e_i has already been added to the final skeleton connection set; or the English word e_i has not been aligned to any skeleton word, and the skeleton connection formed by its left or right neighbor with the skeleton word c_j has already been added to the final skeleton connection set;
repeating this step until no new skeleton connection can be added to the final skeleton connection set, the final skeleton connection set being denoted B_1.
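The segmentation network, the projection δ_k and the confidence formula of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: skeleton words are assumed to be the pieces obtained by cutting the sentence at the union of all segmentations' word boundaries, and the per-segmentation score c(·) is assumed to be a 0/1 vote (1 if the projected link is in a_k). The function names, toy sentence, alignments and weights are all hypothetical.

```python
from itertools import accumulate


def boundaries(seg):
    """Character offsets at which the words of one segmentation end."""
    return list(accumulate(len(w) for w in seg))


def build_skeleton(segs):
    """Skeleton words: pieces between the union of all word boundaries (assumption)."""
    sentence = "".join(segs[0])
    cuts = sorted(set().union(*(boundaries(s) for s in segs)))
    starts = [0] + cuts[:-1]
    return [sentence[a:b] for a, b in zip(starts, cuts)]


def projection(seg, skeleton):
    """delta_k as a list: skeleton index j -> index of the word of seg containing c_j."""
    ends = boundaries(seg)
    delta, pos, w = [], 0, 0
    for piece in skeleton:
        pos += len(piece)
        delta.append(w)
        if pos == ends[w]:      # this skeleton word closes word w of the segmentation
            w += 1
    return delta


def confidence(i, j, alignments, deltas, weights):
    """C(A_ij | C, E) = sum_k w_k * c(a^k_{i,delta_k(j)}), with c as a 0/1 vote (assumption)."""
    return sum(w for a_k, d, w in zip(alignments, deltas, weights) if (i, d[j]) in a_k)


# Hypothetical example: two segmentations of the same five-character sentence.
segs = [["南京", "大学生"], ["南京大学", "生"]]
skeleton = build_skeleton(segs)                      # ['南京', '大学', '生']
deltas = [projection(s, skeleton) for s in segs]     # [[0, 1, 1], [0, 0, 1]]
# Links are (english_index, chinese_word_index) pairs, one set per segmentation.
alignments = [{(0, 0), (1, 1), (2, 1)}, {(0, 0), (1, 0), (2, 1)}]
weights = [0.6, 0.4]
print(confidence(1, 1, alignments, deltas, weights))  # both segmentations vote
print(confidence(2, 1, alignments, deltas, weights))  # only the first one votes
```

With a 0/1 vote, the confidence of a skeleton connection is simply the total weight of the segmentations whose alignment contains its projection, which is why a connection "with K votes" reaches the maximum score.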
2. The Chinese-English word alignment fusion method based on a word segmentation network for computer Chinese-to-English translation according to claim 1, characterized in that, in step 1, if a skeleton connection is found to have received K votes, it is directly selected into the final skeleton connection set.
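The connection selection procedure of claims 1 and 2 can be sketched as a greedy loop. This is one possible reading of the claim language, given for illustration only: links are (English index, skeleton index) pairs, "neighbor" means index ±1, and the inputs `conf` and `votes` (confidence and vote count per candidate link) are hypothetical precomputed dictionaries.

```python
def select_skeleton_links(conf, votes, K, alpha):
    """Greedy selection of skeleton connections (one reading of claims 1 and 2).

    conf:  {(i, j): confidence score}, votes: {(i, j): number of voting segmentations}.
    Returns the final skeleton connection set B1.
    """
    # B0: links with at least one vote, in descending order of confidence.
    b0 = sorted((l for l in conf if votes[l] >= 1), key=lambda l: -conf[l])
    # Claim 2: links voted for by all K segmentations are selected directly.
    b1 = {l for l in b0 if votes[l] == K}
    changed = True
    while changed:                      # repeat until no new link can be added
        changed = False
        for i, j in b0:
            if (i, j) in b1 or conf[(i, j)] <= alpha:
                continue
            e_aligned = any(x == i for x, _ in b1)
            c_aligned = any(y == j for _, y in b1)
            # Neither word is aligned yet.
            neither = not e_aligned and not c_aligned
            # c_j is free and a neighbor of c_j is already linked to e_i.
            via_c = not c_aligned and ((i, j - 1) in b1 or (i, j + 1) in b1)
            # e_i is free and a neighbor of e_i is already linked to c_j.
            via_e = not e_aligned and ((i - 1, j) in b1 or (i + 1, j) in b1)
            if neither or via_c or via_e:
                b1.add((i, j))
                changed = True
    return b1


# Hypothetical scores for a 3-word English sentence and 3 skeleton words, K = 2.
conf = {(0, 0): 1.0, (1, 1): 0.6, (1, 2): 0.4, (2, 2): 0.3}
votes = {(0, 0): 2, (1, 1): 1, (1, 2): 1, (2, 2): 1}
print(sorted(select_skeleton_links(conf, votes, K=2, alpha=0.35)))
```

In this toy run, (0, 0) enters through the K-vote rule, (1, 1) because both words are free, and (1, 2) because skeleton word 2 is free and its left neighbor is already linked to English word 1; (2, 2) is rejected by the threshold.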
3. The Chinese-English word alignment fusion method based on a word segmentation network for computer Chinese-to-English translation according to claim 1 or 2, characterized in that step 2 comprises the following steps:
projecting, according to the projection function, each skeleton connection in the skeleton connection set B_1 onto each segmentation s_k, giving K new word alignment results a'_k, namely
\[ a'_k = \{\, a^{k}_{i\,\delta_k(j)} \mid A_{ij} \in B_1 \,\}, \quad k = 1, \dots, K; \]
for each word alignment result a'_k: arranging the connections a^k_{iδ_k(j)} in a'_k in ascending order of the confidence of the skeleton connection A_{ij} from which each was projected, and comparing the word alignment result a'_k with the word alignment result a_k connection by connection; if a new connection a^k_{iδ_k(j)} is not in the word alignment result a_k and satisfies any one of the following conditions, deleting that connection from a'_k:
the j-th word s^k_j of segmentation s_k and the English word e_i are both already aligned in a'_k; or an English word e that is not a left or right neighbor of the English word e_i is already aligned in a'_k to the j-th word s^k_j of segmentation s_k; or a Chinese word c that is not a left or right neighbor of the j-th Chinese word s^k_j of segmentation s_k is already aligned in a'_k to the English word e_i;
taking the connections remaining in a'_k as the final alignment result, which is the fusion result of the word alignment on segmentation s_k.
4. The Chinese-English word alignment fusion method based on a word segmentation network for computer Chinese-to-English translation according to claim 3, characterized in that, in step 2, if two or more skeleton connections in the skeleton connection set B_1 project onto the same connection on some segmentation s_k, only one such connection is kept.
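The projection of step 2 and the duplicate handling of claim 4 can be sketched in a few lines. The sketch assumes that the projection function δ_k is available as a list mapping each skeleton word index to a word index in segmentation s_k, and it covers only the projection itself, not the subsequent pruning of conflicting connections; the name `project_links` and the sample data are hypothetical.

```python
def project_links(b1, delta_k):
    """a'_k = {(i, delta_k(j)) | (i, j) in B1}.

    Using a set means that skeleton links projecting onto the same
    connection are kept only once, as required by claim 4.
    """
    return {(i, delta_k[j]) for i, j in b1}


# Hypothetical data: skeleton words 1 and 2 both fall inside word 1 of s_k.
b1 = {(0, 0), (1, 1), (1, 2)}
delta_k = [0, 1, 1]
print(sorted(project_links(b1, delta_k)))  # (1, 1) and (1, 2) collapse into one link
```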
CN2011101486920A 2011-06-03 2011-06-03 Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation Expired - Fee Related CN102193915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101486920A CN102193915B (en) 2011-06-03 2011-06-03 Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation

Publications (2)

Publication Number Publication Date
CN102193915A CN102193915A (en) 2011-09-21
CN102193915B true CN102193915B (en) 2012-11-28

Family

ID=44601998

Country Status (1)

Country Link
CN (1) CN102193915B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684648B (en) * 2019-01-14 2020-09-01 浙江大学 Multi-feature fusion automatic translation method for ancient and modern Chinese
CN116070643B (en) * 2023-04-03 2023-08-15 武昌理工学院 Fixed style translation method and system from ancient text to English

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4961755B2 (en) * 2006-01-23 2012-06-27 富士ゼロックス株式会社 Word alignment device, word alignment method, word alignment program
CN101452446A (en) * 2007-12-07 2009-06-10 株式会社东芝 Target language word deforming method and device
CN101676898B (en) * 2008-09-17 2011-12-07 中国科学院自动化研究所 Method and device for translating Chinese organization name into English with the aid of network knowledge
CN101714136B (en) * 2008-10-06 2012-04-11 株式会社东芝 Method and device for adapting a machine translation system based on language database to new field

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121128

Termination date: 20150603

EXPY Termination of patent right or utility model