CN102193915B - Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation - Google Patents
- Publication number
- CN102193915B
- Authority
- CN
- China
- Prior art keywords
- participle
- skeleton
- english
- word alignment
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Machine Translation (AREA)
Abstract
The invention provides a word alignment fusion method based on a word segmentation network (the "participle network" of the title) for computer-aided Chinese-to-English translation. The method comprises two steps: step 1, determining the skeleton alignment, in which a connection selection algorithm based on connection confidence searches for and selects the optimal skeleton connections, which together form the skeleton alignment; and step 2, projecting the selected skeleton alignment onto each word segmentation to obtain a word alignment for each segmentation. The method improves on conventional word alignment algorithms that rely on a single segmentation, and can improve both the word alignment quality under each segmentation and the machine translation quality. By fusing the word alignment features available under multiple segmentations, the final word alignment becomes more robust, and the number of alignment errors caused by segmentation errors or bilingual segmentation inconsistency is reduced.
Description
Technical field
The present invention relates to the field of computer software language translation, and in particular to a word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation.
Background technology
With the rapid growth of information and increasingly frequent international exchange, computer network technology has spread and developed rapidly, language barriers have become more apparent and serious, and the potential demand for machine translation keeps growing. Machine translation uses a computer to translate between different languages. The language being translated is called the source language, the language translated into is called the target language, and machine translation is the process of converting from the source language to the target language. In recent years, statistical machine translation methods based on large-scale corpora have made a series of important advances. Statistical machine translation uses statistical methods to learn a large number of bilingual translation rules and features from a large-scale bilingual parallel corpus, then uses these rules and features to decode (translate) source-language sentences, searching for the target-language sentence with the highest probability as the translation. Bilingual word alignment is a prerequisite step for obtaining translation rules in this pipeline. Word alignment means finding the correspondences between words in a bilingual parallel sentence pair. The quality of the word alignment directly affects the quality of the extracted translation rules, and in turn the final performance of the machine translation system. If one or both of the languages require word segmentation (as Chinese does), the usual practice is to segment the corpus with some word segmentation tool before word alignment. Such a tool is usually trained on a monolingual segmentation corpus or a monolingual dictionary. Current mainstream segmenters perform well on the monolingual segmentation task, but a segmenter designed for a monolingual task does not necessarily meet the needs of bilingual word alignment; that is, its segmentation may not be optimal for word alignment. The present invention calls this phenomenon bilingual segmentation inconsistency.
Existing methods for addressing bilingual segmentation inconsistency fall roughly into two types. The first type directly optimizes a segmentation for word alignment. This optimization is usually time-consuming and complex, and these methods need to train a model from an initial word alignment result, which is itself not very reliable. The second type uses different segmentations to obtain different sets of translation rules and then merges these rule sets by some means at the decoding (translation) stage. This type of method does not aim to improve the word alignment quality of the individual segmentations.
The task of word alignment is to find the word-to-word correspondences in a bilingual sentence pair. Fig. 1 shows the correct word alignment of the Chinese-English sentence pair "下雨路滑 - Road is slippery when raining", where the Chinese sentence is segmented as "下/雨/路滑" ("/" denotes a word boundary).
Under the Chinese segmentation shown in Fig. 1, "路滑" must be aligned to the two words "Road" and "slippery", and the two words "下" and "雨" must both be aligned to "raining", to form a correct word alignment. Under this segmentation, such "one-to-many" and "many-to-one" alignment patterns make the bilingual segmentation inconsistent and make word alignment harder. If instead the Chinese sentence is segmented as "下雨/路/滑", a more natural "one-to-one" alignment pattern results, and aligning this sentence pair becomes relatively easy.
Table 1 shows the segmentations of "下雨路滑" produced by three segmentation tools. Only Stanford Segmenter with the PKU standard yields a segmentation that favors word alignment; the others are either bilingually inconsistent (the word-frequency-based segmenter) or wrong (Stanford Segmenter with the CTB standard). However, there is currently no effective method that can quickly choose, for each sentence pair in a word alignment corpus, a Chinese segmentation that favors word alignment.
Table 1. Segmentation examples from three segmentation tools
Segmentation tool | Segmentation |
Word-frequency-based segmenter | 下雨/路滑 |
Stanford Segmenter with PKU standard | 下雨/路/滑 |
Stanford Segmenter with CTB standard | 下雨路/滑 |
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is to address the deficiencies of the prior art by providing a word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation.
To solve the above technical problem, the invention discloses a word alignment fusion method based on a word segmentation network for computer-aided Chinese-to-English translation, characterized by comprising the following steps:
Step 1: determine the skeleton alignment, using a connection selection algorithm based on connection confidence to search for and select the optimal skeleton connections, which form the skeleton alignment.
Step 2: project the selected skeleton alignment onto each segmentation to obtain a word alignment for each segmentation.
The Chinese sentence $c$ is segmented with each of $K$ segmentation tools, and the resulting segmentations are denoted $s_k$ ($k=1,\dots,K$), where $s_k = s^k_1 s^k_2 \dots s^k_{J_k}$, $s^k_j$ ($j=1,\dots,J_k$) is the $j$-th word of segmentation $s_k$, and $J_k$ is the number of words in $s_k$. The English sentence parallel to the Chinese sentence $c$ is $E = e_1 e_2 \dots e_I$, where $e_i$ ($i=1,\dots,I$) is the $i$-th English word of $E$ and $I$ is the total number of words in the English sentence.
For each of the $K$ sentence pairs formed by a segmentation $s_k$ and the English sentence $E$, a word alignment model is used to obtain a word alignment result, giving $K$ results denoted $a_k$ ($k=1,\dots,K$).
A word segmentation network is constructed from the $K$ segmentations $s_k$ of the Chinese sentence and is denoted $C = c_1, c_2, \dots, c_J$, where $c_j$ ($j=1,2,\dots,J$) is the $j$-th skeleton word in $C$. The connection between the $j$-th skeleton word $c_j$ and the $i$-th English word $e_i$ is the skeleton connection $A_{ij}$, and the connection between the $j$-th word $s^k_j$ of segmentation $s_k$ and the English word $e_i$ is denoted $a^k_{ij}$.
The confidence score of a skeleton connection $A_{ij}$ is computed by formula (1) (the formula is not preserved in this text). The formula is expressed in terms of the confidence score of each connection obtained by projecting the skeleton connection $A_{ij}$ onto a segmentation $s_k$, and of $w_k$, the weight coefficient of segmentation $k$. The weights can be obtained with a hill-climbing algorithm whose objective is to maximize the F-score on word-alignment-annotated data under a given segmentation $k$.
A skeleton connection receives a vote from $a_k$ when its projection onto $s_k$ appears in $a_k$. The set of skeleton connections that have received at least one vote is denoted $B_0$ and serves as the initial skeleton connection set.
All skeleton connections in $B_0$ are sorted in descending order of confidence score.
Each skeleton connection is examined in turn, and a skeleton connection $A_{ij}$ is selected into the final skeleton connection set if its confidence score is higher than a threshold $\alpha$ (which can likewise be determined by the hill-climbing algorithm) and, in addition, one of the following conditions holds:
(1) neither the skeleton word $c_j$ nor the English word $e_i$ has been aligned; or
(2) the skeleton word $c_j$ is not aligned to any English word, and the skeleton connection formed by the left or right neighbor of $c_j$ with the English word $e_i$ has already been selected into the final skeleton connection set; or
(3) the English word $e_i$ is not aligned to any skeleton word, and the skeleton connection formed by the left or right neighbor of $e_i$ with the skeleton word $c_j$ has already been selected into the final skeleton connection set.
This is repeated until no new skeleton connection can be selected into the final skeleton connection set, which is denoted $B_1$.
In step 1 of the present invention, if a skeleton connection is found to have received $K$ votes, it is selected directly into the final connection set.
Step 2 of the present invention comprises the following steps:
According to the projection function, each skeleton connection in the skeleton connection set $B_1$ is projected onto each segmentation $s_k$, giving $K$ new word alignment results $\hat{a}_k$ ($k=1,\dots,K$).
For each word alignment result $\hat{a}_k$, the connections in $\hat{a}_k$ are arranged in ascending order of the confidence of the skeleton connection $A_{ij}$ from which each was projected, and $\hat{a}_k$ is compared with the word alignment result $a_k$ connection by connection. If a new connection $\hat{a}^k_{ij}$ is not in $a_k$ and any one of the following conditions is satisfied, the connection is deleted from $\hat{a}_k$:
the $j$-th word $s^k_j$ of segmentation $s_k$ and the English word $e_i$ are both already aligned in $\hat{a}_k$; or
an English word $e$ that is not a left or right neighbor of $e_i$ is already aligned to the $j$-th word $s^k_j$ of $s_k$ in $\hat{a}_k$; or
a Chinese word $c$ that is not a left or right neighbor of the $j$-th Chinese word $s^k_j$ of $s_k$ is already aligned to the English word $e_i$ in $\hat{a}_k$.
The connections remaining in $\hat{a}_k$ form the final alignment result, that is, the fusion result of the word alignment on segmentation $s_k$.
In step 2 of the present invention, if two or more skeleton connections in $B_1$ project to the same connection on some segmentation $s_k$, only one copy of that connection is kept.
Beneficial effects: when one of the languages in a Chinese-English pair must be segmented before word alignment, the present invention can effectively fuse the multiple segmentations produced by different segmentation tools into a linear structure. The invention fuses word alignments using the features contained in the different segmentations, so that the word alignment quality of every segmentation improves, which in turn improves the performance of computer software translation.
The invention improves on existing word alignment algorithms based on a single segmentation and can improve both the word alignment quality and the machine translation quality under each segmentation. By fusing the word alignment features available under multiple segmentations, the word alignment process becomes more robust, and the number of word alignment errors caused by segmentation errors or bilingual segmentation inconsistency is reduced.
Description of drawings
The present invention is described in further detail below with reference to the accompanying drawings and embodiments, from which the above and other advantages of the invention will become apparent.
Fig. 1 is a schematic diagram of a word alignment.
Fig. 2 is an example of a word segmentation lattice (WSL).
Fig. 3 is an example of a word segmentation network (WSN).
Fig. 4a and Fig. 4b are examples of a skeleton connection and a skeleton alignment, respectively, between the WSN and an English sentence.
Fig. 5 is a flowchart of the method of the invention.
Embodiment
The present invention proposes a structure for fusing multiple word segmentations, called the word segmentation network (WSN), and on that basis proposes a word alignment fusion method based on the WSN to alleviate the word alignment problems caused by bilingual segmentation inconsistency. In the prior art, natural language processing tasks usually fuse multiple segmentations with a word segmentation lattice (WSL). For two segmentations of "下雨路滑", S1: "下/雨/路滑" and S2: "下雨/路/滑", Fig. 2 and Fig. 3 show the corresponding word segmentation lattice and word segmentation network, respectively.
The first and second rows of the WSN represent segmentations $S_1$ and $S_2$, respectively; the third row is a further segmentation distinct from $S_1$ and $S_2$, which the present invention calls the skeleton segmentation.
The skeleton segmentation is the segmentation whose set of word boundaries is the union of the word boundaries of $S_1$ and $S_2$; that is, its segmentation points are the union of the segmentation points of all segmentations. For example, "下/雨/路/滑" in Fig. 3 is a skeleton segmentation (the third row of Fig. 3). There is a word boundary between "下" and "雨" in $S_1$, so there is also a word boundary between "下" and "雨" in the skeleton segmentation. Likewise, there is a word boundary between "路" and "滑" in $S_2$, so "路" and "滑" are two separate skeleton words in the skeleton segmentation.
A skeleton word is any word of the skeleton segmentation. For example, the skeleton segmentation in Fig. 3 has four skeleton words in total.
Each column of the WSN consists of one skeleton word together with the words of $S_1$ and $S_2$ that cover that skeleton word at the corresponding position. The WSN in Fig. 3 has 4 columns. The number of columns of the WSN equals the number of words in the skeleton segmentation.
Note that some non-skeleton words may span several columns. For example, "路滑" in $S_1$ spans two columns because it is split into two words in $S_2$, and "下雨" in $S_2$ spans two columns because it is split into two words in $S_1$.
The present invention indexes the words in each row of the WSN (including the skeleton segmentation), with subscripts starting from 1. The $j$-th skeleton word is defined to project onto the word $\delta_k(j)$ of segmentation $s_k$ if and only if the $\delta_k(j)$-th word of $s_k$ lies in the same column as that skeleton word. For example, $\delta_1(4)=3$ and $\delta_2(3)=2$.
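The skeleton segmentation and the projection function can be illustrated with a short sketch. It is only an illustration, not the patent's implementation (which is stated to be in C#): the function names `boundaries`, `build_skeleton`, and `projection` are hypothetical, segmentations are represented as lists of strings, and the example sentence is taken to be "下雨路滑" with the two segmentations S1 and S2 discussed above.

```python
from itertools import accumulate

def boundaries(segmentation):
    """Character offsets at which each word of a segmentation ends."""
    return list(accumulate(len(word) for word in segmentation))

def build_skeleton(segmentations):
    """Skeleton segmentation: its boundaries are the union of all boundaries."""
    sentence = "".join(segmentations[0])
    cuts = sorted({b for seg in segmentations for b in boundaries(seg)})
    starts = [0] + cuts[:-1]
    return [sentence[s:e] for s, e in zip(starts, cuts)]

def projection(skeleton, segmentation):
    """delta_k as a dict: 1-based skeleton index j -> 1-based index of the
    word of the segmentation that lies in the same column."""
    ends = boundaries(segmentation)
    delta, pos, k = {}, 0, 0
    for j, word in enumerate(skeleton, start=1):
        pos += len(word)        # end offset of skeleton word j
        while ends[k] < pos:    # advance to the word that covers it
            k += 1
        delta[j] = k + 1
    return delta

s1 = ["下", "雨", "路滑"]   # S1 (assumed example)
s2 = ["下雨", "路", "滑"]   # S2 (assumed example)
skeleton = build_skeleton([s1, s2])
delta1 = projection(skeleton, s1)
delta2 = projection(skeleton, s2)
print(skeleton, delta1[4], delta2[3])
```

Under these assumptions the skeleton has four words, and the two projections reproduce the values $\delta_1(4)=3$ and $\delta_2(3)=2$ given in the text.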
Next, skeleton connections and the skeleton alignment between the WSN and the English sentence are defined.
A skeleton connection indicates a translation relation between a skeleton word in the WSN and an English word.
A skeleton alignment is a set of skeleton connections.
Fig. 4a and Fig. 4b are examples of a correct skeleton connection and a correct skeleton alignment between the above WSN and the English sentence "Road is slippery when raining". Fig. 4a shows a single skeleton connection, and Fig. 4b shows a skeleton alignment formed by four skeleton connections.
The present invention fuses word alignments with a connection selection algorithm based on connection confidence, which selects the optimal skeleton connections and thereby obtains the final skeleton alignment. With the projection function described above, any skeleton word can be projected to a word in $S_1$ or $S_2$, so any skeleton connection can be converted into conventional connections between words of $S_1$ or $S_2$ and English words. For example, the projection function maps the skeleton word "路" in Fig. 4 to "路滑" in $S_1$, so the skeleton connection in Fig. 4a maps to the connection from "路滑" in $S_1$ to "Road" and the connection from "路" in $S_2$ to "Road".
To evaluate the improvement in word alignment performance, the present invention uses 491 Chinese-English sentence pairs with manually annotated word alignments as the test set. The Chinese side of the test set is segmented with the Stanford segmenter under the Penn Chinese Treebank (CTB) annotation standard. The word alignment connections in the test set are divided into two classes: sure connections, denoted S, and possible connections, denoted P. Let A be the word alignment to be evaluated; its F-score is computed from its precision and recall (the formula is not preserved in this text).
In the formula, precision is the precision of word alignment A and recall is the recall of A. In the F-score formula, the present invention sets α = 0.5 to balance precision and recall.
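Since the F-score formula is not shown, the following sketch uses a definition that is common for alignments annotated with sure and possible connections. This is an assumption, not something the text confirms: precision is measured against the possible connections (with sure connections counted as possible), recall against the sure connections, and the two are combined as $F = 1 / (\alpha/\text{precision} + (1-\alpha)/\text{recall})$. The function name `alignment_f_score` and the toy connections are hypothetical.

```python
def alignment_f_score(A, S, P, alpha=0.5):
    """F-score of alignment A against sure links S and possible links P.

    Assumed definition (not confirmed by the text):
      precision = |A & (P | S)| / |A|
      recall    = |A & S| / |S|
      F         = 1 / (alpha / precision + (1 - alpha) / recall)
    """
    possible = P | S                      # sure links also count as possible
    if not A or not S:
        return 0.0
    precision = len(A & possible) / len(A)
    recall = len(A & S) / len(S)
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

# Toy example: links are (English index, Chinese index) pairs.
A = {(1, 1), (2, 2), (3, 3)}
S = {(1, 1), (2, 2)}
P = {(3, 4)}
f = alignment_f_score(A, S, P)
print(round(f, 3))
```

With α = 0.5 this reduces to the harmonic mean of precision and recall, which matches the stated purpose of balancing the two.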
The word alignment fusion method based on the WSN consists of two steps. In the first step, the connection selection algorithm based on connection confidence searches for and selects the optimal skeleton connections, that is, it finds the skeleton alignment. In the second step, the selected skeleton alignment is projected onto each segmentation to obtain conventional word alignments.
The connection selection algorithm of the present invention is based on connection confidence scores. The Chinese sentence $c$ is segmented with each of $K$ segmentation tools. For example, three tools can be used: the Chinese Academy of Sciences segmenter ICTCLAS (denoted I), the Stanford segmenter under the Penn Chinese Treebank annotation standard (denoted C), and the Stanford segmenter under the Peking University annotation standard (denoted P). Their segmentations are denoted $s_k$ ($k=1,\dots,K$), where $s_k = s^k_1 s^k_2 \dots s^k_{J_k}$, $s^k_j$ ($j=1,\dots,J_k$) is the $j$-th word of segmentation $s_k$, and $J_k$ is the number of words in $s_k$. The English sentence parallel to the Chinese sentence $c$ is $E = e_1 e_2 \dots e_I$, where $e_i$ ($i=1,\dots,I$) is the $i$-th English word of $E$ and $I$ is the length of the English sentence. For each of the $K$ sentence pairs formed by a segmentation $s_k$ ($k=1,\dots,K$) and the English sentence $E$, a conventional word alignment model is used to obtain a word alignment result, giving $K$ results denoted $a_k$ ($k=1,\dots,K$). Next, the WSN is constructed from the $K$ segmentations $s_k$ ($k=1,\dots,K$) of the Chinese sentence by the method described above and is denoted $C = c_1 c_2 \dots c_J$, where $c_j$ ($j=1,2,\dots,J$) is the $j$-th skeleton word in $C$. Let $A_{ij}$ be the skeleton connection between the $j$-th skeleton word $c_j$ in $C$ and the $i$-th English word $e_i$, and let $a^k_{ij}$ be the connection between the $j$-th word of $s_k$ (that is, $s^k_j$) and $e_i$. The confidence score of a skeleton connection is defined by formula (1) (the formula is not preserved in this text). Formula (1) is expressed in terms of the confidence score of each connection $a^k_{i\delta_k(j)}$, which is the connection obtained by projecting the skeleton connection $A_{ij}$ onto segmentation $s_k$.
Here $w_k$ is the weight coefficient of segmentation $k$ and can be obtained with a hill-climbing algorithm (Russell, Stuart J. & Norvig, Peter (2003), Artificial Intelligence: A Modern Approach). In this experiment, the optimization objective is the F-score on the first 250 sentence pairs of the test data. The hill-climbing algorithm can be summarized as follows: the weights are initialized at random to give the current solution; the neighborhood of the current solution is then searched; if some neighboring solution is better than the current solution, it replaces the current solution; this is repeated until no neighboring solution is better than the current one, at which point the current solution is taken as the final solution. The present invention tried 20 different initial values and chose the solution with the highest F-score as the final $w_k$ ($k=1,\dots,K$).
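The hill-climbing procedure summarized above can be sketched as follows. This is a generic illustration with random restarts, not the patent's code: the neighborhood (one coordinate moved by a fixed `step`), the `max_rounds` guard, and the toy objective are all assumptions. In the setting described, the objective would instead be the alignment F-score on the 250 held-out sentence pairs as a function of the weights.

```python
import random

def hill_climb(objective, dim, step=0.05, restarts=20, max_rounds=1000, seed=0):
    """Maximise objective(weights) by coordinate-wise hill climbing with
    random restarts; returns the best weights found and their score."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _restart in range(restarts):
        current = [rng.random() for _ in range(dim)]   # random initial solution
        score = objective(current)
        for _round in range(max_rounds):
            improved = False
            for d in range(dim):
                for delta in (step, -step):            # neighbours of current
                    neighbor = list(current)
                    neighbor[d] += delta
                    candidate = objective(neighbor)
                    if candidate > score:              # accept a better neighbour
                        current, score, improved = neighbor, candidate, True
            if not improved:                           # local optimum reached
                break
        if score > best_score:                         # keep the best restart
            best, best_score = current, score
    return best, best_score

# Hypothetical stand-in objective with a known optimum at `target`.
target = (0.2, 0.5, 0.3)

def toy_objective(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

weights, score = hill_climb(toy_objective, dim=3)
print([round(w, 2) for w in weights])
```

The 20 restarts mirror the 20 initial values mentioned in the text; keeping the best-scoring restart corresponds to choosing the solution with the highest F-score.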
The confidence score of a connection and the posterior probability of a connection in the C-E direction are each defined by a formula (the formulas are not preserved in this text); the posterior probability in the E-C direction can be defined analogously. In the posterior probability formula, $p(e_i \mid s^k_j)$ is the probability of translating the word $s^k_j$ of segmentation $s_k$ into the English word $e_i$; this probability can be obtained by training the open-source word alignment tool GIZA++ on $s_k$ and $E$.
As can be seen, on the linear WSN it is easy to define skeleton connections and to compute their confidence scores. The WSL, by contrast, is nonlinear, which makes it difficult to define the correspondence between Chinese words and English words on it and therefore difficult to improve existing word alignment algorithms with it. The linearity of the WSN suggests that existing word alignment techniques can easily be extended to WSN-based word alignment.
Embodiment:
All algorithms used in the present invention are implemented in C#. The machine used in the experiments has an Intel Xeon X5550 processor with a clock frequency of 2.66 GHz and 16 GB of memory. The GIZA++ word alignment toolkit used is a widely used open-source toolkit, which this laboratory compiled under Cygwin to obtain a version that runs on the Windows platform. The remaining machine translation modules were rewritten by this laboratory in C# following the open-source phrase-based statistical machine translation software Moses.
Data are prepared as follows before implementation: the Chinese side of the Chinese-English parallel corpus is segmented with $K$ segmentation tools, giving $K$ segmentations $s_k$ ($k=1,\dots,K$); a conventional word alignment $a_k$ ($k=1,\dots,K$) is then computed between each $s_k$ and the parallel English side.
More specifically, as shown in Fig. 5, the present invention operates as follows:
1. Obtain the initial skeleton connection set: build the WSN from the multiple segmentations $s_k$ ($k=1,\dots,K$) of the Chinese sentence, and compute, according to formula (1), the confidence scores of all skeleton connections between the Chinese WSN $C$ and the English sentence $E$. If the projection $a^k_{i\delta_k(j)}$ of the skeleton connection $A_{ij}$ appears in some $a_k$ ($k=1,\dots,K$), the skeleton connection $A_{ij}$ is said to receive a vote from $a_k$. The set of skeleton connections that have received at least one vote is denoted $B_0$ and serves as the initial skeleton connection set.
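The voting in step 1 can be sketched as below. The representation is an assumption made for illustration: each connection is an `(i, j)` pair of 1-based English and Chinese indices, each projection function is a dict, and the example alignments `a1` and `a2` are made-up results for the segmentations S1 and S2 of the running example against "Road is slippery when raining". The function name `initial_skeleton_links` is hypothetical.

```python
def initial_skeleton_links(num_en, num_skeleton, deltas, alignments):
    """Count, for every skeleton connection (i, j), how many of the K
    alignments contain its projection (i, delta_k(j)); keep those with
    at least one vote. Returns a dict (i, j) -> number of votes (B0)."""
    votes = {}
    for i in range(1, num_en + 1):
        for j in range(1, num_skeleton + 1):
            n = sum((i, delta[j]) in a_k
                    for delta, a_k in zip(deltas, alignments))
            if n:
                votes[(i, j)] = n
    return votes

# Projection functions for S1 = 下/雨/路滑 and S2 = 下雨/路/滑.
delta1 = {1: 1, 2: 2, 3: 3, 4: 3}
delta2 = {1: 1, 2: 1, 3: 2, 4: 3}
# Hypothetical alignments; English: Road(1) is(2) slippery(3) when(4) raining(5).
a1 = {(5, 1), (5, 2), (1, 3), (3, 3)}
a2 = {(5, 1), (1, 2), (3, 3)}

B0 = initial_skeleton_links(5, 4, [delta1, delta2], [a1, a2])
print(sorted(B0.items()))
```

A connection whose vote count equals the number of segmentations (here 2) is one that every alignment agrees on; the algorithm notes later in the text treat such connections specially.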
2. Obtain the final skeleton connection set: sort all skeleton connections in $B_0$ in descending order of confidence score and examine them one by one. A skeleton connection $A_{ij}$ is selected into the final skeleton connection set only if its confidence score is higher than the threshold $\alpha$ (which can likewise be determined by the hill-climbing algorithm above) and one of the following conditions holds:
A) neither the skeleton word $c_j$ nor the English word $e_i$ has been aligned;
B) the skeleton word $c_j$ is not aligned to any English word, and the skeleton connection formed by the left or right neighbor of $c_j$ with the English word $e_i$ has already been selected into the final skeleton connection set;
C) the English word $e_i$ is not aligned to any skeleton word, and the skeleton connection formed by the left or right neighbor of $e_i$ with the skeleton word $c_j$ has already been selected into the final skeleton connection set.
The above step is repeated until no new skeleton connection can be selected into the final skeleton connection set; the final set is denoted $B_1$.
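The selection loop in step 2 can be sketched as follows. The data layout is an assumption: connections are `(i, j)` pairs of 1-based English and skeleton indices, `confidence` maps each connection in $B_0$ to its score, and `full_votes` holds connections that received all $K$ votes and are therefore taken directly. The function name and the toy scores are hypothetical.

```python
def select_skeleton_links(confidence, alpha, full_votes=()):
    """Greedy selection of skeleton connections from B0 into B1.

    confidence: dict (i, j) -> confidence score for every connection in B0.
    alpha:      confidence threshold.
    full_votes: connections with K votes, selected without further checks.
    """
    selected = set(full_votes)
    ordered = sorted(confidence, key=confidence.get, reverse=True)
    changed = True
    while changed:                      # repeat until nothing new is selected
        changed = False
        for (i, j) in ordered:
            if (i, j) in selected or confidence[(i, j)] <= alpha:
                continue
            en_aligned = any(a == i for a, _ in selected)
            sk_aligned = any(b == j for _, b in selected)
            cond_a = not en_aligned and not sk_aligned
            cond_b = not sk_aligned and (
                (i, j - 1) in selected or (i, j + 1) in selected)
            cond_c = not en_aligned and (
                (i - 1, j) in selected or (i + 1, j) in selected)
            if cond_a or cond_b or cond_c:
                selected.add((i, j))
                changed = True
    return selected

# Toy scores for the six connections of a hypothetical B0.
confidence = {(5, 1): 0.9, (5, 2): 0.8, (1, 3): 0.7,
              (3, 4): 0.6, (1, 4): 0.3, (3, 3): 0.2}
B1 = select_skeleton_links(confidence, alpha=0.25)
print(sorted(B1))
```

In this toy run the connection (5, 2) enters through condition B, because its skeleton word is still unaligned and its neighbor's connection (5, 1) was already selected, while (1, 4) is rejected because both of its words are already aligned.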
3. Obtain the final word alignment connection set: according to the projection function, project each skeleton connection $A_{ij}$ in $B_1$ onto each $s_k$, obtaining $K$ new alignments $\hat{a}_k$ ($k=1,\dots,K$). For each $\hat{a}_k$, arrange its connections in ascending order of the confidence of the skeleton connection $A_{ij}$ from which each was projected, and compare $\hat{a}_k$ with $a_k$ connection by connection. If a new connection $\hat{a}^k_{ij}$ is not in $a_k$ and one of the following conditions is satisfied, it is deleted from $\hat{a}_k$:
A) the Chinese word $s^k_j$ and the English word $e_i$ are both already aligned in $\hat{a}_k$;
B) an English word $e$ that is not a left or right neighbor of $e_i$ is already aligned to the Chinese word $s^k_j$ in $\hat{a}_k$;
C) a Chinese word $c$ that is not a left or right neighbor of the Chinese word $s^k_j$ is already aligned to the English word $e_i$ in $\hat{a}_k$.
The remaining connections are taken as the final connections, that is, the fusion result of the word alignment on the different segmentations.
Algorithm notes:
A) In step 1, if a skeleton connection is found to have received $K$ votes, it is selected directly into the final connection set without further checks;
B) In step 3, if two or more skeleton connections in $B_1$ project to the same connection on some $s_k$, only one copy is kept;
C) The conditions in step 2 serve to exclude potentially erroneous skeleton connections; step 3 uses a similar approach to delete potentially erroneous connections.
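The projection and pruning of step 3, together with note B, can be sketched as below. Several details are interpretations rather than statements from the text: "already aligned" is read as "aligned by some other connection in the projected alignment", a neighbor is a word whose index differs by one, and when several skeleton connections project to the same connection the highest source confidence is kept for ordering. The function name `project_and_prune` and the toy inputs are hypothetical.

```python
def project_and_prune(B1, confidence, delta, a_k):
    """Project skeleton connections onto one segmentation and prune.

    B1:         set of skeleton connections (i, j), 1-based indices.
    confidence: dict (i, j) -> confidence of each skeleton connection.
    delta:      dict j -> delta_k(j), the projection function for s_k.
    a_k:        set of connections (i, j_k) in the original alignment.
    """
    projected = {}
    for (i, j) in B1:
        link = (i, delta[j])            # identical projections collapse to one
        projected[link] = max(projected.get(link, float("-inf")),
                              confidence[(i, j)])
    result = set(projected)
    for (i, j) in sorted(projected, key=projected.get):   # ascending confidence
        if (i, j) in a_k:
            continue                    # connections confirmed by a_k are kept
        others = result - {(i, j)}
        en_aligned = any(a == i for a, _ in others)
        zh_aligned = any(b == j for _, b in others)
        far_en = any(b == j and abs(a - i) > 1 for a, b in others)
        far_zh = any(a == i and abs(b - j) > 1 for a, b in others)
        if (en_aligned and zh_aligned) or far_en or far_zh:
            result.discard((i, j))
    return result

# Case 1: a low-confidence connection not supported by a_k is pruned.
identity = {1: 1, 2: 2, 3: 3}
conf1 = {(1, 1): 0.9, (2, 2): 0.9, (3, 3): 0.9, (1, 3): 0.3}
a_k = {(1, 1), (2, 2), (3, 3)}
pruned = project_and_prune(set(conf1), conf1, identity, a_k)

# Case 2: two skeleton connections project to the same connection on S2.
delta2 = {1: 1, 2: 1, 3: 2, 4: 3}
conf2 = {(5, 1): 0.9, (5, 2): 0.8, (1, 3): 0.7, (3, 4): 0.6}
merged = project_and_prune(set(conf2), conf2, delta2, {(5, 1), (1, 2), (3, 3)})
print(sorted(pruned), sorted(merged))
```

Processing in ascending order of confidence means the least reliable projected connections are tested for removal first, which is consistent with the stated aim of deleting potentially erroneous connections.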
To verify the effectiveness of the invention, two groups of experiments were carried out. The first group tests whether the invention effectively improves word alignment quality; the second tests whether it effectively improves the performance of a machine translation system.
The experimental data are prepared as follows. About 190,000 bilingual parallel sentence pairs are selected from LDC2003E14 as the training set. The NIST '06 development set is used as the development set of the machine translation system to estimate the weights of the system's features, and the NIST '08 test set is used as its test set to evaluate system performance. The Chinese side of these corpora is processed with three segmentation tools: the Chinese Academy of Sciences segmenter ICTCLAS (denoted I), the Stanford segmenter under the Penn Chinese Treebank annotation standard (denoted C), and the Stanford segmenter under the Peking University annotation standard (denoted P). The machine translation system is a phrase-based system implemented by this laboratory, similar to the one proposed by Koehn in 2003. It uses a 5-gram language model trained on the Xinhua portion of the Gigaword corpus. System parameters are tuned with the minimum error rate training method proposed by Och in 2003.
Two baselines are used for the word alignment fusion. The first is the GIZA++ word alignment tool: word alignments are obtained in both directions and then merged with the grow-diag-final (GDF) heuristic; this baseline is abbreviated GIZA. The second is the linear discriminative word alignment model proposed by Liu Yang, abbreviated DIWA.
To evaluate word alignment performance, the test data described above are used: the first 250 sentence pairs are used to train the weights $w_k$ ($k=1,\dots,K$) in formula (1) and the threshold $\alpha$ in the connection selection algorithm, and the remaining 241 sentence pairs are used to evaluate the word alignment results. The Chinese side of these 491 sentence pairs uses segmentation C. In the first group of experiments, the improvement in word alignment quality is evaluated on these 241 sentence pairs. In the table below, GIZA and DIWA indicate that the fused word alignments come from the GIZA and DIWA models respectively, and P, R, and F denote the precision, recall, and F-score of the word alignment results. The F-score is normally used as the measure of final word alignment quality, with P and R given for reference only. Four fusion settings are used: setting C means no fusion, that is, the conventional word alignment method based on segmentation C; setting C+P means fusing the word alignment results based on segmentations C and P; and so on.
It can be seen that the method significantly improves the word alignment F-score in both the GIZA group and the DIWA group. In the GIZA group, the F-score drops slightly under the C+I+P setting. This is because the GIZA model itself favors recall: fusing too many alignments for a high-recall model damages precision (69.68%). In the DIWA group, by contrast, the more segmentations are fused, the better the word alignment result. This is because the DIWA model itself favors precision, so fusion effectively raises recall and thereby the F-score.
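The three measures discussed above can be sketched in a few lines. This is a minimal illustration that assumes the balanced F-score (the harmonic mean of precision and recall) computed over sets of alignment links; the text does not state the exact formula used in the experiments, and the function name `alignment_prf` and the toy links are hypothetical.

```python
def alignment_prf(predicted, gold):
    """Return (precision, recall, F-score) for two collections of (i, j) links.

    Assumption: balanced F-score, F = 2PR / (P + R). The experiments in the
    text may have used a different variant.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # links present in both alignments
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data: each link is (English word index, Chinese word index).
gold_links = {(0, 0), (1, 1), (2, 2), (3, 2)}
predicted_links = {(0, 0), (1, 1), (2, 3)}
p, r, f = alignment_prf(predicted_links, gold_links)
```

In this toy case two of the three predicted links are correct and two of the four gold links are recovered, which illustrates the trade-off described above: adding links through fusion tends to raise recall, and lowers precision only when the added links are wrong.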
Table 2. Word alignment experimental results
In the second group of experiments, the performance of the machine translation system was evaluated on the NIST '08 test set, with the BLEU score as the evaluation metric. B denotes the baseline, and Comb denotes the result after C+P+I fusion.
Table 3. Machine translation experimental results
It can be seen that the invention significantly improves the performance of the machine translation system in both the GIZA group and the DIWA group.
The invention provides an approach to word alignment fusion based on a segmentation network (participle net) for computer Chinese-to-English translation. There are many ways to implement this technical scheme, and the above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the invention. Any component not specified in this embodiment can be implemented with existing technology.
Claims (4)
1. A Chinese-English word alignment fusion method based on a segmentation network (participle net) for computer Chinese-to-English translation, characterized in that it comprises the following steps:
Step 1, determining the skeleton alignment: a connection selection algorithm based on connection confidence is used to search for and select the optimal skeleton connections;
Step 2, projecting the selected skeleton alignment onto each segmentation to obtain a word alignment for each segmentation;
characterized in that step 1 comprises the following steps:
The Chinese sentence c is segmented with each of K word segmentation tools, and the resulting segmentations are denoted s_k (k = 1, …, K), where J_k is the number of words in segmentation s_k; the English sentence parallel to the Chinese sentence c is E = e_1 e_2 … e_I, where e_i is the i-th English word of the English sentence E and I is the total number of words in the English sentence;
For each of the K sentence pairs formed by a segmentation s_k and the English sentence E, a traditional word alignment model based on a single segmentation is used to obtain a word alignment result, giving K results denoted a_k (k = 1, …, K);
A segmentation network is constructed from the K segmentations s_k of the Chinese sentence and denoted C, C = c_1, c_2, …, c_J, where c_j (j = 1, 2, …, J) is the j-th skeleton word in the segmentation network C; A_ij is the skeleton connection between the j-th skeleton word c_j in C and the i-th English word e_i, and a connection in segmentation s_k links the j-th word of s_k with the English word e_i; the confidence of each skeleton connection is computed with a formula (not reproduced in this text) whose terms are: the confidence score of the skeleton connection A_ij; the connection obtained by projecting the skeleton connection A_ij onto segmentation s_k; the weight coefficient w_k of segmentation k; and the total number of segmentations K;
The set of skeleton connections that have received at least one vote is denoted B_0 and serves as the initial skeleton connection set;
All skeleton connections A_ij in B_0 are sorted in descending order of confidence score;
Each skeleton connection A_ij is examined in turn, and a skeleton connection A_ij that satisfies the following conditions is added to the final skeleton connection set:
(1) its confidence score is higher than the threshold α; and, at the same time, one of the following holds:
neither the skeleton word c_j nor the English word e_i is aligned; or the skeleton word c_j is not aligned to any English word, and a skeleton connection between its left or right neighbor and the English word e_i has already been added to the final skeleton connection set; or the English word e_i is not aligned to any skeleton word, and a skeleton connection between its left or right neighbor and the skeleton word c_j has already been added to the final skeleton connection set;
This step is repeated until no new skeleton connection can be added to the final skeleton connection set; the final skeleton connection set is denoted B_1.
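The selection procedure of claim 1 can be illustrated with a short sketch. Because the confidence formula is not reproduced in this text, the code assumes a weighted vote: the confidence of a skeleton connection is the sum of the weights w_k of the segmentations whose alignment a_k contains its projection. The names `confidence`, `select_skeleton_links` and `votes`, as well as the toy weights and threshold, are hypothetical. Links are written (i, j), with i an English word index and j a skeleton word index.

```python
def confidence(voters, weights):
    """Assumed weighted vote: sum of w_k over the segmentations k that vote
    for the connection. The patent's actual formula is not shown in the text."""
    return sum(weights[k] for k in voters)


def select_skeleton_links(votes, weights, alpha):
    """Greedy selection of skeleton connections, following the claim's conditions.

    votes maps a skeleton link (i, j) to the set of segmentation indices k
    whose alignment a_k contains the projection of that link.
    """
    # B_0: links with at least one vote, ranked by descending confidence.
    scored = {link: confidence(v, weights) for link, v in votes.items() if v}
    ranked = sorted(scored, key=lambda link: (-scored[link], link))

    final = set()  # becomes B_1
    changed = True
    while changed:  # repeat until no new link can be added
        changed = False
        for i, j in ranked:
            if (i, j) in final or scored[(i, j)] <= alpha:
                continue
            e_free = all(fi != i for fi, _ in final)  # e_i not yet aligned
            c_free = all(fj != j for _, fj in final)  # c_j not yet aligned
            both_free = e_free and c_free
            # c_j is free and a neighbor of c_j is already linked to e_i
            grow_c = c_free and ((i, j - 1) in final or (i, j + 1) in final)
            # e_i is free and a neighbor of e_i is already linked to c_j
            grow_e = e_free and ((i - 1, j) in final or (i + 1, j) in final)
            if both_free or grow_c or grow_e:
                final.add((i, j))
                changed = True
    return final


# Toy example with K = 3 segmentations (weights and threshold are made up).
weights = [0.5, 0.3, 0.2]
votes = {
    (0, 0): {0, 1, 2},
    (1, 1): {0, 1},
    (1, 2): {0, 2},
    (3, 3): {0},
    (2, 2): {1},
    (0, 1): {2},
}
selected = select_skeleton_links(votes, weights, alpha=0.4)
```

With these toy values the links (2, 2) and (0, 1) fall below the threshold, while (1, 2) is accepted through the neighbor condition because (1, 1) was selected first.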
2. The Chinese-English word alignment fusion method based on a segmentation network for computer Chinese-to-English translation according to claim 1, characterized in that, in step 1, if a skeleton connection is found to have received K votes, it is added directly to the final skeleton connection set.
3. The Chinese-English word alignment fusion method based on a segmentation network for computer Chinese-to-English translation according to claim 1 or 2, characterized in that step 2 comprises the following steps:
According to the projection function, each skeleton connection in the skeleton connection set B_1 is projected onto each segmentation s_k, yielding K new word alignment results, one for each segmentation (the defining formula is not reproduced in this text);
For each new word alignment result, its connections are arranged in ascending order of the confidence of the skeleton connection A_ij from which each was projected, and the new word alignment result is compared in turn with the word alignment result a_k; if a new connection is not in the word alignment result a_k and satisfies any of the following conditions, the connection is deleted from the new word alignment result:
the j-th word of segmentation s_k and the English word e_i are both already aligned in the new word alignment result; or an English word e that is not a left or right neighbor of the English word e_i is already aligned, in the new word alignment result, to the j-th word of segmentation s_k; or a Chinese word c that is not a left or right neighbor of the j-th Chinese word of segmentation s_k is already aligned, in the new word alignment result, to the English word e_i.
4. The Chinese-English word alignment fusion method based on a segmentation network for computer Chinese-to-English translation according to claim 3, characterized in that, in step 2, if two or more skeleton connections in the skeleton connection set B_1 project onto the same connection in some segmentation s_k, only one such connection is kept.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011101486920A CN102193915B (en) | 2011-06-03 | 2011-06-03 | Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102193915A (en) | 2011-09-21 |
CN102193915B (en) | 2012-11-28 |
Family
ID=44601998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011101486920A Expired - Fee Related CN102193915B (en) | 2011-06-03 | 2011-06-03 | Participle-network-based word alignment fusion method for computer-aided Chinese-to-English translation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102193915B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109684648B (en) * | 2019-01-14 | 2020-09-01 | 浙江大学 | Multi-feature fusion automatic translation method for ancient and modern Chinese |
CN116070643B (en) * | 2023-04-03 | 2023-08-15 | 武昌理工学院 | Fixed style translation method and system from ancient text to English |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4961755B2 (en) * | 2006-01-23 | 2012-06-27 | 富士ゼロックス株式会社 | Word alignment device, word alignment method, word alignment program |
CN101452446A (en) * | 2007-12-07 | 2009-06-10 | 株式会社东芝 | Target language word deforming method and device |
CN101676898B (en) * | 2008-09-17 | 2011-12-07 | 中国科学院自动化研究所 | Method and device for translating Chinese organization name into English with the aid of network knowledge |
CN101714136B (en) * | 2008-10-06 | 2012-04-11 | 株式会社东芝 | Method and device for adapting a machine translation system based on language database to new field |
Also Published As
Publication number | Publication date |
---|---|
CN102193915A (en) | 2011-09-21 |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20121128; Termination date: 20150603 |
EXPY | Termination of patent right or utility model | |