CN104391885A - Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training - Google Patents

Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training

Info

Publication number
CN104391885A
Authority
CN
China
Prior art keywords
phrase
parallel
translation
word
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410624648.6A
Other languages
Chinese (zh)
Other versions
CN104391885B (en)
Inventor
曹海龙
张捷鑫
赵铁军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Technology High-Tech Development Corporation
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201410624648.6A priority Critical patent/CN104391885B/en
Publication of CN104391885A publication Critical patent/CN104391885A/en
Application granted granted Critical
Publication of CN104391885B publication Critical patent/CN104391885B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention discloses a method for extracting chapter-level parallel phrase pairs from a comparable corpus based on parallel-corpus training, and relates to methods for extracting parallel phrase pairs from comparable corpora. The method addresses two problems: acquiring a parallel corpus is expensive, and applying the assumption that the two words or fragments with the most similar contexts are mutual translations to a comparable corpus depends heavily on a bilingual dictionary. The method comprises the following steps: 1, providing a source-language sentence set S and a target-language sentence set T; 2, obtaining the phrase-pair set of the parallel corpus; 3, obtaining the parallel phrase pairs of the parallel corpus; 4, obtaining the non-parallel phrase pairs of the parallel corpus; 5, obtaining a binary support vector machine classifier; 6, extracting candidate parallel phrase pairs <s, t>; 7, obtaining the noisy parallel phrase pairs of the comparable corpus; 8, obtaining the parallel phrase pairs of the comparable corpus; 9, obtaining an extended decoder. The method is applied to the field of extracting parallel phrase pairs from comparable corpora.

Description

A method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus
Technical field
The present invention relates to methods for extracting phrase translation pairs, and in particular to a chapter-level phrase translation pair extraction method.
Background technology
With the advent of high-coverage communication media such as radio, television and the Internet, the distance in time and space between people has suddenly shrunk, international exchanges have become ever more frequent and convenient, and the whole planet has come to feel like a small village in the boundless universe. To let people communicate without barriers, machine translation, the automatic translation from one language into another, has an enormous market demand and broad application prospects.
In recent years computing power has advanced by leaps and bounds, and the development and spread of the Internet, bilingual countries, and the multilingual documents of the United Nations have provided bilingual parallel corpora numbering in the tens of millions of sentence pairs. These have laid the necessary foundation for statistical machine translation methods, and many new models and methods proposed on this basis have achieved good results.
Building a statistical machine translation system is generally divided into two main steps, training and translation. The training step learns statistical knowledge from corpora and performs parameter training. The training of a typical phrase-based statistical machine translation system consists of three main parts: translation-model training on a large-scale bilingual corpus, language-model training on a monolingual target-language corpus, and parameter tuning; the scale of the parallel corpus used for training is the main factor affecting translation performance. For some language pairs, such as Chinese and English or Arabic and English, large amounts of parallel data are available, but for most language pairs this is not the case: their parallel data resources are very scarce or even nonexistent, for example Dard and English, or French and Japanese, which severely limits the performance of machine translation systems. Acquiring parallel corpora is quite expensive, so it is necessary to exploit other resources to train statistical machine translation systems. Compared with parallel corpora, comparable corpora exist in large quantities for every language pair and are easy to obtain; the web, news and magazines provide rich natural resources. These comparable corpora contain many bilingual documents carrying similar information, and how to incorporate this comparable-corpus information into statistical machine translation systems has attracted more and more attention. Researchers have used various methods to extract richer and more accurate parallel knowledge from comparable corpora and to add it to translation systems, improving translation performance.
Extraction of parallel knowledge from comparable corpora is mostly based on the distributional hypothesis. This hypothesis holds that two words or fragments that are translations of each other across languages also have similar, or even identical, contexts. On this basis, researchers map the contexts of unknown source-language and target-language words into a vector space via a bilingual dictionary and then compute the similarity between vectors, for example by cosine distance, Euclidean distance or skew distance, and take the two words or fragments with the most similar contexts to be mutual translations. Many new methods have been derived from this original approach, for example adding topic information, semantic information or transliteration information, and such methods achieve a certain effect. However, by the nature of the hypothesis itself, a parallel corpus has a symmetric structure and satisfies the hypothesis well, whereas a comparable corpus has an asymmetric structure and sometimes cannot satisfy it. Applying the assumption that the two words or fragments with the most similar contexts are mutual translations to comparable corpora therefore raises problems, and such methods depend heavily on a bilingual dictionary, whose seed-dictionary size directly affects the quality of parallel knowledge extraction.
Summary of the invention
The object of the invention is to solve the problems that the parallel data resources of statistical machine translation systems are very scarce or even nonexistent, that acquiring a parallel corpus is expensive, and that applying the assumption that the two words or fragments with the most similar contexts are mutual translations to a comparable corpus depends heavily on a bilingual dictionary; to this end a chapter-level comparable-corpus phrase translation pair extraction method is proposed.
The above object of the invention is achieved through the following technical solution:
Step 1: take a source-language sentence set S and a target-language sentence set T from the corpus, where the corpus comprises a parallel corpus and a comparable corpus;
Step 2: split S and T separately into phrases of the specified lengths, each phrase being 2 to 7 words long, and combine the resulting phrases pairwise to obtain the phrase-pair set of the parallel corpus, where every phrase pair must contain one phrase from S and one phrase from T;
Step 3: use the GIZA++ tool to extract bidirectional word translation tables from the parallel corpus, and use the parallel corpus to build a phrase-based statistical machine translation system with the Moses toolkit to obtain a phrase translation table; extract the training positive examples, namely the parallel phrase pairs of the parallel corpus, from the information in the bidirectional word translation tables and the phrase translation table; in the bidirectional word translation tables every word translation pair is followed by its translation probability, and the phrase translation table contains the bidirectional phrase translation probabilities, the bidirectional lexical weights and the word penalty (five probabilities) together with the word alignment inside each phrase pair;
Step 4: remove the parallel phrase pairs of the parallel corpus obtained in Step 3 from the phrase-pair set of the parallel corpus obtained in Step 2 to obtain the training negative examples, namely the non-parallel phrase pairs of the parallel corpus;
Step 5: extract classification features from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus; input the classification features into the SVMlight system and use the radial basis function kernel to obtain a binary support vector machine classifier;
Step 6: combine the sentences of the source-language articles of the comparable corpus with the sentences of the target language of the comparable corpus and filter them to obtain pseudo-parallel sentence pairs <S, T>; extract candidate parallel phrase pairs <s, t> from the pseudo-parallel sentence pairs, where s is a substring of sentence S of length i, minimum source-language phrase length ≤ i ≤ maximum source-language phrase length, and t is a substring of sentence T of length j, minimum target-language phrase length ≤ j ≤ maximum target-language phrase length;
Step 7: classify the candidate parallel phrase pairs <s, t> with the binary support vector machine classifier to obtain the noisy parallel phrase pairs of the comparable corpus;
Step 8: filter the noisy parallel phrase pairs of the comparable corpus by setting a threshold θ, θ ∈ (0, 1), and removing every phrase pair whose average word translation probability logarithm is lower than θ, thereby obtaining the parallel phrase pairs of the comparable corpus;
Step 9: add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder to obtain the extended decoder, where the baseline decoder is evaluated by a baseline BLEU score and the extended decoder by an extended BLEU score; this completes the method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus.
Invention effect
The object of the invention is to mine parallel phrases from comparable corpora and thus alleviate the scarcity of parallel data. The aim is to make full use of abundant comparable-corpus resources, obtain parallel phrases from them, and use them to improve the performance of phrase-based statistical machine translation systems.
The present invention converts the problem of extracting parallel phrases from a comparable corpus into a binary classification problem. Useful feature information is extracted from the training data to build a binary support vector machine classifier, and the classifier is used to separate parallel phrases from non-parallel phrases; the parallel phrases that the system finally extracts from the comparable corpus are added to the translation system to improve machine translation quality. This is a fully automatic generation and testing method.
The binary classifier is built in two parts, data acquisition and training:
In the training-data acquisition stage, given parallel source- and target-language sentences S and T, S and T are each split by the specified lengths to generate all possible phrases, which are then paired, each phrase pair containing one phrase from S and one phrase from T; the parallel-data information obtained from S and T with the GIZA++ tool is used to label the training phrases as positive or negative examples.
In the training stage, the parallel-data information is used to extract nineteen features from the training data as classification features. Because this is a nonlinear classification problem, the radial basis function kernel is applied to the support vector machine classifier. The training phrases obtained from the parallel corpus can thus be used to build the support vector machine classifier.
The performance of the invention is evaluated from two aspects, classifier performance and translation system performance:
The classification performance of the classifier is evaluated with standard measures, including precision, recall and accuracy. Test phrases are generated in the same way as the training phrases, but to ensure a fair test the parallel-data information used to label positive and negative examples must be consistent with that used to generate the training phrases.
The significance of the present invention is to obtain parallel phrases from comparable corpora so as to improve machine translation performance, so it must be tested whether the parallel phrases obtained by classification from the comparable corpus actually improve the machine translation system, evaluated according to a translation quality criterion. First a baseline decoder is trained with the existing small amount of parallel corpus; then the parallel phrases extracted from the comparable corpus by the classifier are added to the phrase table of the baseline system, an extended decoder is retrained, and the translation quality of the two decoders is evaluated separately (a minimal evaluation sketch follows).
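The patent reports its evaluation as BLEU scores but does not tie it to any particular scoring implementation. The fragment below is only a minimal sketch of comparing the two decoders with the sacrebleu package; the file names and the one-sentence-per-line layout are assumptions made for illustration.

    # Minimal sketch: compare the baseline and the extended decoder by corpus BLEU.
    # File names are hypothetical; one hypothesis/reference per line is assumed.
    import sacrebleu

    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f]

    references = read_lines("test.ref")          # human reference translations
    baseline_out = read_lines("baseline.hyp")    # output of the baseline decoder
    extended_out = read_lines("extended.hyp")    # output of the extended decoder

    baseline_bleu = sacrebleu.corpus_bleu(baseline_out, [references]).score
    extended_bleu = sacrebleu.corpus_bleu(extended_out, [references]).score
    print(f"baseline BLEU: {baseline_bleu:.2f}, extended BLEU: {extended_bleu:.2f}")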
The experimental results, namely the baseline BLEU score and the extended BLEU score, are shown in Table 3:
As can be seen from Table 3, the present invention can classify parallel and non-parallel phrases well; when the parallel phrases extracted from the comparable corpus with the method of the present invention are added to the translation system, the translation results are closer in meaning to the human reference translations.
Brief description of the drawings
Fig. 1 is a flow chart of the method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus.
Embodiment
Embodiment 1: the method of this embodiment, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus is carried out according to the following steps:
Step 1: take a source-language sentence set S and a target-language sentence set T from the corpus, where the corpus comprises a parallel corpus and a comparable corpus;
Step 2: split S and T separately into phrases of the specified lengths, each phrase being 2 to 7 words long, and combine the resulting phrases pairwise to obtain all phrase pairs of the parallel corpus, where every phrase pair must contain one phrase from S and one phrase from T;
Step 3: use the GIZA++ tool to extract bidirectional word translation tables from the parallel corpus, and use the parallel corpus to build a phrase-based statistical machine translation system with the Moses toolkit to obtain a phrase translation table, most of whose phrase pairs are parallel phrase pairs; the training positive examples (the labelling of positive examples), namely the parallel phrase pairs of the parallel corpus, are obtained from the information in the bidirectional word translation tables and the phrase translation table; in the bidirectional word translation tables every word translation pair is followed by its translation probability, and by the normalization principle the probabilities of all possible translations of a given word sum to 1; the phrase translation table contains the bidirectional phrase translation probabilities, the bidirectional lexical weights and the word penalty (five probabilities) together with the word alignment inside each phrase pair (a minimal sketch of loading such word translation tables follows);
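The exact layout of GIZA++/Moses lexical tables varies between toolchains, so the sketch below simply assumes a whitespace-separated file with one "source-word target-word probability" triple per line and builds the two direction-specific dictionaries used in the later steps; the file names are placeholders.

    # Minimal sketch: load a bidirectional word translation table.
    # Assumed line format: source_word target_word probability
    from collections import defaultdict

    def load_lexical_table(path):
        table = defaultdict(dict)
        with open(path, encoding="utf-8") as f:
            for line in f:
                src, tgt, prob = line.split()
                table[src][tgt] = float(prob)   # translation probability for this direction
        return dict(table)

    s2t = load_lexical_table("lex.s2t")   # source-to-target table (hypothetical file name)
    t2s = load_lexical_table("lex.t2s")   # target-to-source table (hypothetical file name)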
Step 4: remove the parallel phrase pairs of the parallel corpus obtained in Step 3 from all phrase pairs of the parallel corpus obtained in Step 2 to obtain the training negative examples, namely the non-parallel phrase pairs of the parallel corpus;
Two issues should be noted while building the training data set:
(1) A training example may appear many times while positive and negative examples are being extracted, so duplicates are removed during extraction to ensure that every training example is unique;
(2) The training set obtained from the parallel corpus may be very large. If the training data used to train the classifier is too large it can cause overfitting, which severely degrades classifier performance, so the training data must be sampled and an appropriate number chosen as the final training examples. Provided the positive- and negative-example sets are of good quality, random sampling can be used, and suitable manual error checking can of course also be carried out (a small de-duplication and sampling sketch follows);
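As a small illustration of the two notes above, the sketch below de-duplicates extracted positive and negative phrase pairs and randomly samples a fixed number of each for classifier training; the pair data and the sample sizes are placeholders, not values from the patent.

    # Minimal sketch: de-duplicate training examples and sample a manageable subset.
    import random

    def dedup_and_sample(pairs, sample_size, seed=0):
        unique_pairs = list(set(pairs))      # keep each (source phrase, target phrase) once
        random.seed(seed)
        if len(unique_pairs) <= sample_size:
            return unique_pairs
        return random.sample(unique_pairs, sample_size)

    positive_pairs = [("平行 语料", "parallel corpus"), ("平行 语料", "parallel corpus")]
    negative_pairs = [("平行 语料", "the cat"), ("机器 翻译", "parallel corpus")]
    positives = dedup_and_sample(positive_pairs, 50000)   # sample sizes are illustrative only
    negatives = dedup_and_sample(negative_pairs, 50000)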
Step 5: extract classification features from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus; input the classification features into the SVMlight system and use the radial basis function kernel to obtain a binary support vector machine classifier (a minimal training sketch follows);
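The patent trains the classifier with SVMlight and a radial basis function kernel; as an illustration only, the sketch below uses scikit-learn's SVC as a stand-in for SVMlight, with a toy feature matrix in place of the real nineteen-dimensional feature vectors.

    # Minimal sketch: train a binary SVM with an RBF kernel (scikit-learn standing in for SVMlight).
    from sklearn.svm import SVC

    # Each row is a toy 3-dimensional feature vector for one phrase pair;
    # label 1 = parallel phrase pair (positive example), 0 = non-parallel (negative example).
    X = [[0, 1, 0.8], [1, 1, 0.9], [5, 0, 0.1], [4, 0, 0.2]]
    y = [1, 1, 0, 0]

    classifier = SVC(kernel="rbf", gamma="scale")
    classifier.fit(X, y)
    print(classifier.predict([[0, 1, 0.7]]))   # expected: class 1 (parallel)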
Step 6: the present invention converts the problem of extracting parallel phrase pairs from a comparable corpus into a binary classification problem. Before extracting parallel phrase pairs, the sentences of the source-language articles of the comparable corpus are first combined with the sentences of the target language of the comparable corpus and filtered to obtain pseudo-parallel sentence pairs <S, T>; candidate parallel phrase pairs <s, t> are then extracted from the pseudo-parallel sentence pairs, where s is a substring of sentence S of length i, minimum source-language phrase length ≤ i ≤ maximum source-language phrase length, and t is a substring of sentence T of length j, minimum target-language phrase length ≤ j ≤ maximum target-language phrase length; in this way all candidate phrase pairs are obtained (a minimal enumeration sketch follows);
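A minimal sketch of the enumeration described in Step 6, assuming whitespace tokenization and the 2-to-7-word phrase lengths used elsewhere in the patent; real pseudo-parallel sentence pairs would come from the filtering step.

    # Minimal sketch: enumerate candidate phrase pairs <s, t> from one pseudo-parallel sentence pair.
    MIN_LEN, MAX_LEN = 2, 7   # minimum and maximum phrase length in words

    def substrings(words, min_len=MIN_LEN, max_len=MAX_LEN):
        for start in range(len(words)):
            for length in range(min_len, max_len + 1):
                if start + length <= len(words):
                    yield tuple(words[start:start + length])

    def candidate_phrase_pairs(source_sentence, target_sentence):
        source_words = source_sentence.split()
        target_words = target_sentence.split()
        return [(s, t) for s in substrings(source_words) for t in substrings(target_words)]

    pairs = candidate_phrase_pairs("我们 使用 平行 语料 训练 分类器",
                                   "we train the classifier with a parallel corpus")
    print(len(pairs))   # number of candidate phrase pairs for this sentence pair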
Step 7: classify the candidate parallel phrase pairs <s, t> with the binary support vector machine classifier to obtain the noisy parallel phrase pairs of the comparable corpus; left untreated, this noise would degrade translation system performance;
Step 8: filter the noisy parallel phrase pairs of the comparable corpus; a threshold θ, θ ∈ (0, 1), is set empirically according to the actual situation, and every phrase pair whose average word translation probability logarithm is lower than θ is removed, giving the parallel phrase pairs of the comparable corpus;
Step 9: add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder, i.e. the phrase-based statistical machine translation system, to obtain the extended decoder; the baseline decoder is evaluated by its baseline BLEU score and the extended decoder by its extended BLEU score, as shown in Fig. 1 (a minimal phrase-table extension sketch follows). This completes the method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus.
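Phrase-table formats differ between Moses versions, so the sketch below only assumes the classic plain-text layout with '|||'-separated fields and five scores per entry, as described in Step 3; the phrase pair, the score values and the file name are placeholders.

    # Minimal sketch: append extracted phrase pairs to a Moses-style phrase table.
    new_pairs = [
        ("平行 语料", "parallel corpus", [0.5, 0.4, 0.5, 0.4, 2.718]),   # placeholder scores
    ]

    with open("phrase-table.extended", "a", encoding="utf-8") as table:   # hypothetical file name
        for source, target, scores in new_pairs:
            score_field = " ".join(f"{s:.6f}" for s in scores)
            table.write(f"{source} ||| {target} ||| {score_field}\n")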
Effect of this embodiment:
This embodiment converts the problem of extracting parallel phrases from a comparable corpus into a binary classification problem. Useful feature information is extracted from the training data to build a binary support vector machine classifier, and the classifier is used to separate parallel phrases from non-parallel phrases; the parallel phrases that the system finally extracts from the comparable corpus are added to the translation system to improve machine translation quality. This is a fully automatic generation and testing method.
The binary classifier is built in two parts, data acquisition and training:
In the training-data acquisition stage, given parallel source- and target-language sentences S and T, S and T are each split by the specified lengths to generate all possible phrases, which are then paired, each phrase pair containing one phrase from S and one phrase from T; the parallel-data information obtained from S and T with the GIZA++ tool is used to label the training phrases as positive or negative examples.
In the training stage, the parallel-data information is used to extract nineteen features from the training data as classification features. Because this is a nonlinear classification problem, the radial basis function kernel is applied to the support vector machine classifier. The training phrases obtained from the parallel corpus can thus be used to build the support vector machine classifier.
The performance of this embodiment is evaluated from two aspects, classifier performance and translation system performance:
The classification performance of the classifier is evaluated with standard measures, including precision, recall and accuracy. Test phrases are generated in the same way as the training phrases, but to ensure a fair test the parallel-data information used to label positive and negative examples must be consistent with that used to generate the training phrases.
The significance of this embodiment is to obtain parallel phrases from the comparable corpus so as to improve machine translation performance, so it must be tested whether the parallel phrases obtained by classification from the comparable corpus actually improve the machine translation system, evaluated according to a translation quality criterion. First a baseline decoder is trained with the existing small amount of parallel corpus; then the parallel phrases extracted from the comparable corpus by the classifier are added to the phrase table of the baseline system, an extended decoder is retrained, and the translation quality of the two decoders is evaluated separately.
The experimental results, namely the baseline BLEU score and the extended BLEU score, are shown in Table 3:
As can be seen from Table 3, this embodiment can classify parallel and non-parallel phrases well; when the parallel phrases extracted from the comparable corpus with the method of the present invention are added to the translation system, the translation results are closer in meaning to the human reference translations.
Embodiment 2: this embodiment differs from Embodiment 1 in that the detailed process for extracting the training positive examples (the labelling of positive examples) in Step 3 is:
(1) Let S_k be the word at the k-th position of a sentence in the source-language sentence set S and S(i, j) the word sequence from position i to position j in S, and let T_k' be the word at the k'-th position of a sentence in the target-language sentence set T and T(i', j') the word sequence from position i' to position j' in T; assume a threshold ε, ε ∈ (0, 1);
(2) The threshold is chosen empirically according to the actual situation; if the translation probability of two words in the bidirectional word translation tables is greater than the threshold ε, the two words S_k and T_k' are considered to translate each other;
(3) If and only if S_k and T_k' translate each other, i.e. are aligned, only when k ∈ [i, j] and k' ∈ [i', j'], S_k and T_k' do not translate each other, i.e. are not aligned, when k ∈ [i, j] and k' ∉ [i', j'], and S_k and T_k' do not translate each other when k ∉ [i, j] and k' ∈ [i', j'], then S(i, j) and T(i', j') are considered to translate each other and form an extracted training positive example (a minimal sketch of this check follows). The other steps and parameters are identical to Embodiment 1.
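Under the reconstruction above, a pair of spans S(i, j) and T(i', j') is a positive example when every word pair that translates each other (probability above ε in the bidirectional tables) lies either entirely inside or entirely outside the two spans. The sketch below checks that condition for one sentence pair; the s2t and t2s dictionaries are assumed to be the word translation tables of Step 3, the threshold value ε = 0.5 is the one used later in the example, and the toy data is illustrative only.

    # Minimal sketch: test whether spans S(i, j) and T(i', j') form a training positive example.
    EPSILON = 0.5   # alignment threshold epsilon

    def aligned(s_word, t_word, s2t, t2s, eps=EPSILON):
        # Two words are taken as translating each other if either directional probability exceeds eps.
        return (s2t.get(s_word, {}).get(t_word, 0.0) > eps
                or t2s.get(t_word, {}).get(s_word, 0.0) > eps)

    def is_positive_example(src_words, tgt_words, i, j, i2, j2, s2t, t2s):
        for k, s_word in enumerate(src_words):
            for k2, t_word in enumerate(tgt_words):
                if aligned(s_word, t_word, s2t, t2s):
                    inside_src = i <= k <= j
                    inside_tgt = i2 <= k2 <= j2
                    if inside_src != inside_tgt:   # an alignment link crosses a span boundary
                        return False
        return True

    s2t = {"猫": {"cat": 0.9}}
    t2s = {"cat": {"猫": 0.9}}
    print(is_positive_example(["一", "只", "猫"], ["a", "cat"], 2, 2, 1, 1, s2t, t2s))   # True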
Embodiment 3: this embodiment differs from Embodiment 1 or 2 in that the classification features extracted in Step 5 from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus are as follows:
(1) Phrase length difference: the absolute value of the difference between the source-language phrase length and the target-language phrase length;
(2) Same beginning: 1 if the first word of the source-language phrase and the first word of the target-language phrase translate each other, otherwise 0;
(3) Same ending: 1 if the last word of the source-language phrase and the last word of the target-language phrase translate each other, otherwise 0;
(4) Number of words in the phrase: the number of words contained in the source-language phrase and in the target-language phrase;
(5) Phrase length ratio: the ratio of the source-language phrase length to the target-language phrase length;
(6) Translated-word count: the number of words in the source-language phrase that have a corresponding translation in the target-language phrase, a word's translation probability p(s|t) being greater than a threshold η;
(7) Untranslated-word count: the number of words in the source-language phrase that have no corresponding translation in the target-language phrase;
(8) Translation ratio: the ratio of the number of words with translations to the total number of words in the phrase;
(9) Half translated: 1 if at least half of the words of the source-language phrase have translations in the target-language phrase, otherwise 0;
(10) Longest translated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have translations in the target-language phrase;
(11) Longest untranslated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have no translation in the target-language phrase;
Features (1) to (3) are independent of the direction between source and target language, while features (4) to (11) are direction-dependent and are computed in both the forward and reverse directions; therefore 19 features are extracted in total (a sketch computing a subset of these features follows). The other steps and parameters are identical to Embodiment 1 or 2.
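As an illustration of these features, the sketch below computes a representative subset for one phrase pair in the source-to-target direction; the full feature vector would repeat features (4) to (11) in the reverse direction. Treating "has a translation" as having some word in the other phrase with probability above a threshold η, and the table layout, are the same assumptions as in Step 3.

    # Minimal sketch: a subset of the classification features for one phrase pair.
    ETA = 0.1   # translation-probability threshold eta (illustrative value)

    def has_translation(word, other_words, table, eta=ETA):
        return any(table.get(word, {}).get(o, 0.0) > eta for o in other_words)

    def features(src_words, tgt_words, s2t):
        translated = [has_translation(w, tgt_words, s2t) for w in src_words]
        longest_run = longest_gap = run = gap = 0
        for t in translated:
            run, gap = (run + 1, 0) if t else (0, gap + 1)
            longest_run, longest_gap = max(longest_run, run), max(longest_gap, gap)
        return {
            "length_difference": abs(len(src_words) - len(tgt_words)),                  # feature (1)
            "same_beginning": int(has_translation(src_words[0], [tgt_words[0]], s2t)),  # feature (2)
            "same_ending": int(has_translation(src_words[-1], [tgt_words[-1]], s2t)),   # feature (3)
            "length_ratio": len(src_words) / len(tgt_words),                            # feature (5)
            "translated_count": sum(translated),                                        # feature (6)
            "untranslated_count": len(translated) - sum(translated),                    # feature (7)
            "translation_ratio": sum(translated) / len(translated),                     # feature (8)
            "half_translated": int(sum(translated) * 2 >= len(translated)),             # feature (9)
            "longest_translated_unit": longest_run,                                     # feature (10)
            "longest_untranslated_unit": longest_gap,                                   # feature (11)
        }

    s2t = {"平行": {"parallel": 0.7}, "语料": {"corpus": 0.6}}
    print(features(["平行", "语料"], ["parallel", "corpus"], s2t))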
Embodiment 4: this embodiment differs from one of Embodiments 1 to 3 in that in Step 6 the sentences of the source-language articles of the comparable corpus are combined with the sentences of the target language of the comparable corpus and filtered to obtain pseudo-parallel sentence pairs:
(1) the ratio of the word counts of the two sentences does not exceed 2;
(2) a dictionary is used to check that at least half of the words in one sentence have a translation in the other sentence;
A sentence pair is discarded if it fails either condition; sentence pairs meeting both conditions are regarded as pseudo-parallel sentence pairs. This process removes most non-parallel sentence pairs, but it also removes some approximately parallel pairs that do not satisfy the two filtering conditions, mainly because the dictionary does not contain every entry; such pairs are few and not necessarily reliable, so on the whole this filtering greatly helps the precision and robustness of the system. Inevitably the filter cannot remove all non-parallel sentence pairs, because the word-overlap condition is weak: stop words, for example, almost always have translations in the other language, and if they happen to match some content words and the overlap threshold is met, a non-parallel sentence pair may be mistaken for a parallel one (a minimal filtering sketch follows). The other steps and parameters are identical to one of Embodiments 1 to 3.
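A minimal sketch of the two filtering conditions, assuming whitespace tokenization and a simple source-to-target dictionary that maps each word to a set of possible translations; a real dictionary and tokenizer would of course be richer.

    # Minimal sketch: keep a sentence pair only if it passes both pseudo-parallel conditions.
    def is_pseudo_parallel(src_sentence, tgt_sentence, dictionary):
        src_words, tgt_words = src_sentence.split(), tgt_sentence.split()

        # Condition (1): the word-count ratio of the two sentences does not exceed 2.
        longer = max(len(src_words), len(tgt_words))
        shorter = min(len(src_words), len(tgt_words))
        if shorter == 0 or longer / shorter > 2:
            return False

        # Condition (2): at least half of the source words have a dictionary translation in the target sentence.
        tgt_set = set(tgt_words)
        covered = sum(1 for w in src_words if dictionary.get(w, set()) & tgt_set)
        return covered * 2 >= len(src_words)

    dictionary = {"平行": {"parallel"}, "语料": {"corpus", "corpora"}}
    print(is_pseudo_parallel("平行 语料", "a parallel corpus", dictionary))   # True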
Embodiment 5: this embodiment differs from one of Embodiments 1 to 4 in that the formula in Step 8 for the average of the logarithms of the word translation probabilities of each noisy parallel phrase pair in the comparable corpus is as follows:
phrasepri = (ln S_1 + ln S_2 + ... + ln S_n) / n
where S_i is the translation probability of the i-th word of the source phrase that has a translation in the target-language phrase, and n is the number of words of the source phrase that have translations in the target-language phrase; the translation probabilities of stop words are not included, because stop words contribute very little to the translation and are simply ignored (a minimal sketch of this filter follows). The other steps and parameters are identical to one of Embodiments 1 to 4.
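A minimal sketch of the phrasepri filter, assuming the word translation table of Step 3, a tiny illustrative stop-word list, and the value θ = 0.3 used later in the example. Because the average of logarithms is never positive, the sketch interprets the comparison with θ ∈ (0, 1) as a comparison against ln θ, i.e. it keeps a pair when the geometric mean of the word translation probabilities is at least θ; this interpretation is an assumption, not something stated in the patent.

    # Minimal sketch: filter noisy phrase pairs by the average log word translation probability.
    import math

    STOP_WORDS = {"the", "a", "an", "of", "的", "了"}   # illustrative stop-word list
    THETA = 0.3                                          # threshold theta from the example

    def phrasepri(src_words, tgt_words, s2t):
        # Average of ln(p) over source words with a translation in the target phrase; stop words skipped.
        probs = []
        for word in src_words:
            if word in STOP_WORDS:
                continue
            candidates = [p for t, p in s2t.get(word, {}).items() if t in tgt_words]
            if candidates:
                probs.append(max(candidates))
        if not probs:
            return float("-inf")
        return sum(math.log(p) for p in probs) / len(probs)

    def keep_pair(src_words, tgt_words, s2t, theta=THETA):
        return phrasepri(src_words, tgt_words, s2t) >= math.log(theta)

    s2t = {"平行": {"parallel": 0.7}, "语料": {"corpus": 0.6}}
    print(keep_pair(["平行", "语料"], ["parallel", "corpus"], s2t))   # True: geometric mean ~0.65 >= 0.3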
The following example is used to verify the beneficial effect of the present invention:
Example 1:
The method of this example, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus is carried out according to the following steps:
Step 1: take a source-language sentence set S and a target-language sentence set T from the corpus, where the corpus comprises a parallel corpus and a comparable corpus;
Step 2: split S and T separately into phrases of the specified lengths, each phrase being 2 to 7 words long, and combine the resulting phrases pairwise to obtain all phrase pairs of the parallel corpus, where every phrase pair must contain one phrase from S and one phrase from T;
Step 3: use the GIZA++ tool to extract bidirectional word translation tables from the parallel corpus, and use the parallel corpus to build a phrase-based statistical machine translation system with the Moses toolkit to obtain a phrase translation table, most of whose phrase pairs are parallel phrase pairs; the training positive examples (the labelling of positive examples), namely the parallel phrase pairs of the parallel corpus, are obtained from the information in the bidirectional word translation tables and the phrase translation table; in the bidirectional word translation tables every word translation pair is followed by its translation probability, and by the normalization principle the probabilities of all possible translations of a given word sum to 1; the phrase translation table contains the bidirectional phrase translation probabilities, the bidirectional lexical weights and the word penalty (five probabilities) together with the word alignment inside each phrase pair;
(1) Let S_k be the word at the k-th position of a sentence in the source-language sentence set S and S(i, j) the word sequence from position i to position j in S, and let T_k' be the word at the k'-th position of a sentence in the target-language sentence set T and T(i', j') the word sequence from position i' to position j' in T; assume the threshold ε = 0.5, ε ∈ (0, 1);
(2) The threshold is chosen empirically according to the actual situation; if the translation probability of two words in the bidirectional word translation tables is greater than the threshold ε, the two words S_k and T_k' are considered to translate each other;
(3) If and only if S_k and T_k' translate each other, i.e. are aligned, only when k ∈ [i, j] and k' ∈ [i', j'], S_k and T_k' do not translate each other, i.e. are not aligned, when k ∈ [i, j] and k' ∉ [i', j'], and S_k and T_k' do not translate each other when k ∉ [i, j] and k' ∈ [i', j'], then S(i, j) and T(i', j') are considered to translate each other and form an extracted training positive example.
Step 4: remove the parallel phrase pairs of the parallel corpus obtained in Step 3 from all phrase pairs of the parallel corpus obtained in Step 2 to obtain the training negative examples, namely the non-parallel phrase pairs of the parallel corpus;
Two issues should be noted while building the training data set:
(1) A training example may appear many times while positive and negative examples are being extracted, so duplicates are removed during extraction to ensure that every training example is unique.
(2) The training set obtained from the parallel corpus may be very large. If the training data used to train the classifier is too large it can cause overfitting, which severely degrades classifier performance, so the training data must be sampled and an appropriate number chosen as the final training examples. Provided the positive- and negative-example sets are of good quality, random sampling can be used, and suitable manual error checking can of course also be carried out.
Step 5: extract classification features from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus; input the classification features into the SVMlight system and use the radial basis function kernel to obtain a binary support vector machine classifier;
The classification features extracted from the parallel phrase pairs of the parallel corpus and the non-parallel phrase pairs of the parallel corpus are as follows:
(1) Phrase length difference: the absolute value of the difference between the source-language phrase length and the target-language phrase length;
(2) Same beginning: 1 if the first word of the source-language phrase and the first word of the target-language phrase translate each other, otherwise 0;
(3) Same ending: 1 if the last word of the source-language phrase and the last word of the target-language phrase translate each other, otherwise 0;
(4) Number of words in the phrase: the number of words contained in the source-language phrase and in the target-language phrase;
(5) Phrase length ratio: the ratio of the source-language phrase length to the target-language phrase length;
(6) Translated-word count: the number of words in the source-language phrase that have a corresponding translation in the target-language phrase, a word's translation probability p(s|t) being greater than a threshold η;
(7) Untranslated-word count: the number of words in the source-language phrase that have no corresponding translation in the target-language phrase;
(8) Translation ratio: the ratio of the number of words with translations to the total number of words in the phrase;
(9) Half translated: 1 if at least half of the words of the source-language phrase have translations in the target-language phrase, otherwise 0;
(10) Longest translated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have translations in the target-language phrase;
(11) Longest untranslated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have no translation in the target-language phrase;
Features (1) to (3) are independent of the direction between source and target language, while features (4) to (11) are direction-dependent and are computed in both the forward and reverse directions; therefore 19 features are extracted in total.
The classification performance of the binary support vector machine classifier is evaluated in three respects: precision, recall and accuracy. Five groups of training data are randomly selected to obtain five different classifiers, which are tested with the same group of test data; the final results are shown in Table 1:
Table 1
Using the same classifier, five groups of test data are randomly selected and tested separately; the final results are shown in Table 2:
Table 2
From the above results it can be judged that the binary support vector machine classification method described in the invention classifies parallel and non-parallel phrase pairs well; testing with different training data and test data shows that the method performs stably on different data sets and achieves good results.
Step 6: the present invention converts the problem of extracting parallel phrase pairs from a comparable corpus into a binary classification problem. Before extracting parallel phrase pairs, the sentences of the source-language articles of the comparable corpus are first combined with the sentences of the target language of the comparable corpus and filtered to obtain pseudo-parallel sentence pairs <S, T>; candidate parallel phrase pairs <s, t> are then extracted from the pseudo-parallel sentence pairs, where s is a substring of sentence S of length i, minimum source-language phrase length ≤ i ≤ maximum source-language phrase length (2 ≤ i ≤ 7), and t is a substring of sentence T of length j, minimum target-language phrase length ≤ j ≤ maximum target-language phrase length (2 ≤ j ≤ 7); in this way all candidate phrase pairs are obtained.
Pseudo-parallel sentence pairs are obtained by filtering:
(1) the ratio of the word counts of the two sentences does not exceed 2;
(2) a dictionary is used to check that at least half of the words in one sentence have a translation in the other sentence;
A sentence pair is discarded if it fails either condition; sentence pairs meeting both conditions are regarded as pseudo-parallel sentence pairs. This process removes most non-parallel sentence pairs, but it also removes some approximately parallel pairs that do not satisfy the two filtering conditions, mainly because the dictionary does not contain every entry; such pairs are few and not necessarily reliable, so on the whole this filtering greatly helps the precision and robustness of the system. Inevitably the filter cannot remove all non-parallel sentence pairs, because the word-overlap condition is weak: stop words, for example, almost always have translations in the other language, and if they happen to match some content words and the overlap threshold is met, a non-parallel sentence pair may be mistaken for a parallel one.
Step 7: classify the candidate parallel phrase pairs <s, t> with the binary support vector machine classifier to obtain the noisy parallel phrase pairs of the comparable corpus; left untreated, this noise would degrade translation system performance;
Step 8: filter the noisy parallel phrase pairs of the comparable corpus; the threshold θ is set empirically according to the actual situation, here θ = 0.3, and every phrase pair whose average word translation probability logarithm is lower than θ is removed, giving the parallel phrase pairs of the comparable corpus;
The formula for the average of the logarithms of the word translation probabilities of each noisy parallel phrase pair in the comparable corpus is as follows:
phrasepri = (ln S_1 + ln S_2 + ... + ln S_n) / n
where S_i is the translation probability of the i-th word of the source phrase that has a translation in the target-language phrase, and n is the number of words of the source phrase that have translations in the target-language phrase; the translation probabilities of stop words are not included, because stop words contribute very little to the translation and are simply ignored.
Step 9: add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder, i.e. the phrase-based statistical machine translation system, to obtain the extended decoder; the baseline decoder is evaluated by its baseline BLEU score and the extended decoder by its extended BLEU score, which are shown in Table 3:
Table 3
From the above results it can be judged that the parallel phrases extracted with the binary classification method of the present invention are of high quality; adding the parallel phrase pairs extracted from the comparable corpus to the translation system improves translation performance and brings the results closer to human translations, and as the number of parallel phrase pairs increases the translation results become better and better. The experimental results show that the present invention classifies parallel and non-parallel phrase pairs well; the parallel phrase pairs extracted from the comparable corpus with the method of the present invention, once added to the translation system, yield translation results whose meaning is closer to human translations. This completes the method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus.
The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications according to the present invention, and all such corresponding changes and modifications shall fall within the protection scope of the claims appended to the present invention.

Claims (5)

1. A method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus, characterized in that the chapter-level comparable-corpus phrase translation pair extraction method is carried out according to the following steps:
Step 1: take a source-language sentence set S and a target-language sentence set T from the corpus, where the corpus comprises a parallel corpus and a comparable corpus;
Step 2: split S and T separately into phrases of the specified lengths, each phrase being 2 to 7 words long, and combine the resulting phrases pairwise to obtain the phrase-pair set of the parallel corpus, where every phrase pair must contain one phrase from S and one phrase from T;
Step 3: use the GIZA++ tool to extract bidirectional word translation tables from the parallel corpus, and use the parallel corpus to build a phrase-based statistical machine translation system with the Moses toolkit to obtain a phrase translation table; extract the training positive examples, namely the parallel phrase pairs of the parallel corpus, from the information in the bidirectional word translation tables and the phrase translation table; in the bidirectional word translation tables every word translation pair is followed by its translation probability, and the phrase translation table contains the bidirectional phrase translation probabilities, the bidirectional lexical weights and the word penalty (five probabilities) together with the word alignment inside each phrase pair;
Step 4: remove the parallel phrase pairs of the parallel corpus obtained in Step 3 from the phrase-pair set of the parallel corpus obtained in Step 2 to obtain the training negative examples, namely the non-parallel phrase pairs of the parallel corpus;
Step 5: extract classification features from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus; input the classification features into the SVMlight system and use the radial basis function kernel to obtain a binary support vector machine classifier;
Step 6: combine the sentences of the source-language articles of the comparable corpus with the sentences of the target language of the comparable corpus and filter them to obtain pseudo-parallel sentence pairs <S, T>; extract candidate parallel phrase pairs <s, t> from the pseudo-parallel sentence pairs, where s is a substring of sentence S of length i, minimum source-language phrase length ≤ i ≤ maximum source-language phrase length, and t is a substring of sentence T of length j, minimum target-language phrase length ≤ j ≤ maximum target-language phrase length;
Step 7: classify the candidate parallel phrase pairs <s, t> with the binary support vector machine classifier to obtain the noisy parallel phrase pairs of the comparable corpus;
Step 8: filter the noisy parallel phrase pairs of the comparable corpus by setting a threshold θ, θ ∈ (0, 1), and removing every phrase pair whose average word translation probability logarithm is lower than θ, thereby obtaining the parallel phrase pairs of the comparable corpus;
Step 9: add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder to obtain the extended decoder, where the baseline decoder is evaluated by a baseline BLEU score and the extended decoder by an extended BLEU score; this completes the method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus.
2. The method for extracting parallel phrase pairs from a chapter-level comparable corpus based on parallel-corpus training according to claim 1, characterized in that the detailed process for extracting the training positive examples in Step 3 is:
(1) let S_k be the word at the k-th position of a sentence in the source-language sentence set S and S(i, j) the word sequence from position i to position j in S, and let T_k' be the word at the k'-th position of a sentence in the target-language sentence set T and T(i', j') the word sequence from position i' to position j' in T; assume a threshold ε, ε ∈ (0, 1);
(2) if the translation probability of two words in the bidirectional word translation tables is greater than the threshold ε, the two words S_k and T_k' are considered to translate each other;
(3) if and only if S_k and T_k' translate each other only when k ∈ [i, j] and k' ∈ [i', j'], S_k and T_k' do not translate each other when k ∈ [i, j] and k' ∉ [i', j'], and S_k and T_k' do not translate each other when k ∉ [i, j] and k' ∈ [i', j'], then S(i, j) and T(i', j') are considered to translate each other and form an extracted training positive example.
3. The method for extracting parallel phrase pairs from a chapter-level comparable corpus based on parallel-corpus training according to claim 1, characterized in that the classification features extracted in Step 5 from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus are as follows:
(1) Phrase length difference: the absolute value of the difference between the source-language phrase length and the target-language phrase length;
(2) Same beginning: 1 if the first word of the source-language phrase and the first word of the target-language phrase translate each other, otherwise 0;
(3) Same ending: 1 if the last word of the source-language phrase and the last word of the target-language phrase translate each other, otherwise 0;
(4) Number of words in the phrase: the number of words contained in the source-language phrase and in the target-language phrase;
(5) Phrase length ratio: the ratio of the source-language phrase length to the target-language phrase length;
(6) Translated-word count: the number of words in the source-language phrase that have a corresponding translation in the target-language phrase, a word's translation probability p(s|t) being greater than a threshold η;
(7) Untranslated-word count: the number of words in the source-language phrase that have no corresponding translation in the target-language phrase;
(8) Translation ratio: the ratio of the number of words with translations to the total number of words in the phrase;
(9) Half translated: 1 if at least half of the words of the source-language phrase have translations in the target-language phrase, otherwise 0;
(10) Longest translated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have translations in the target-language phrase;
(11) Longest untranslated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have no translation in the target-language phrase.
4. The method for extracting parallel phrase pairs from a chapter-level comparable corpus based on parallel-corpus training according to claim 1, characterized in that in Step 6 the sentences of the source-language articles of the comparable corpus are combined with the sentences of the target language of the comparable corpus and filtered to obtain pseudo-parallel sentence pairs:
(1) the ratio of the word counts of the two sentences does not exceed 2;
(2) a dictionary is used to check that at least half of the words in one sentence have a translation in the other sentence; sentence pairs meeting these two conditions are regarded as pseudo-parallel sentence pairs.
5. The method for extracting parallel phrase pairs from a chapter-level comparable corpus based on parallel-corpus training according to claim 1, characterized in that in Step 8 the threshold is θ, θ ∈ (0, 1), and the formula for the average of the logarithms of the word translation probabilities of each noisy parallel phrase pair in the comparable corpus is as follows:
phrasepri = (ln S_1 + ln S_2 + ... + ln S_n) / n
where S_i is the translation probability of the i-th word of the source phrase that has a translation in the target-language phrase, and n is the number of words of the source phrase that have translations in the target-language phrase.
CN201410624648.6A 2014-11-07 2014-11-07 A kind of abstracting method of the chapter level than the parallel phrase pair of language material trained based on parallel corpora Active CN104391885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410624648.6A CN104391885B (en) 2014-11-07 2014-11-07 A kind of abstracting method of the chapter level than the parallel phrase pair of language material trained based on parallel corpora

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410624648.6A CN104391885B (en) 2014-11-07 2014-11-07 A kind of abstracting method of the chapter level than the parallel phrase pair of language material trained based on parallel corpora

Publications (2)

Publication Number Publication Date
CN104391885A true CN104391885A (en) 2015-03-04
CN104391885B CN104391885B (en) 2017-07-28

Family

ID=52609789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410624648.6A Active CN104391885B (en) 2014-11-07 2014-11-07 A kind of abstracting method of the chapter level than the parallel phrase pair of language material trained based on parallel corpora

Country Status (1)

Country Link
CN (1) CN104391885B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101989287A (en) * 2009-07-31 2011-03-23 富士通株式会社 Method and equipment for generating rule for statistics-based machine translation
KR20120060666A (en) * 2010-12-02 2012-06-12 에스케이플래닛 주식회사 Apparatus and method for extracting noun-phrase translation pairs of statistical machine translation
US8874433B2 (en) * 2011-05-20 2014-10-28 Microsoft Corporation Syntax-based augmentation of statistical machine translation phrase tables
CN102968411A (en) * 2012-10-24 2013-03-13 橙译中科信息技术(北京)有限公司 Multi-language machine intelligent auxiliary processing method and system
CN102999486A (en) * 2012-11-16 2013-03-27 沈阳雅译网络技术有限公司 Phrase rule extracting method based on combination

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017004179A (en) * 2015-06-08 2017-01-05 日本電信電話株式会社 Information processing method, device, and program
CN105260483A (en) * 2015-11-16 2016-01-20 金陵科技学院 Microblog-text-oriented cross-language topic detection device and method
CN106021224A (en) * 2016-05-13 2016-10-12 中国科学院自动化研究所 Bilingual discourse annotation method
CN106021224B (en) * 2016-05-13 2019-03-15 中国科学院自动化研究所 A kind of bilingual chapter mask method
CN107491441A (en) * 2016-06-13 2017-12-19 沈阳雅译网络技术有限公司 A kind of method based on the dynamic extraction translation template for forcing decoding
CN107491441B (en) * 2016-06-13 2020-07-17 沈阳雅译网络技术有限公司 Method for dynamically extracting translation template based on forced decoding
CN107610693A (en) * 2016-07-11 2018-01-19 科大讯飞股份有限公司 The construction method and device of text corpus
CN107610693B (en) * 2016-07-11 2021-01-29 科大讯飞股份有限公司 Text corpus construction method and device
CN106502997A (en) * 2016-10-08 2017-03-15 新译信息科技(深圳)有限公司 The appraisal procedure and system of phrase table filter efficiency
CN108153835A (en) * 2017-12-14 2018-06-12 新疆大学 A kind of dimension-Chinese is than language material automatic obtaining method
CN109783809B (en) * 2018-12-22 2022-04-12 昆明理工大学 Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus
CN109783809A (en) * 2018-12-22 2019-05-21 昆明理工大学 A method of alignment sentence is extracted from Laos-Chinese chapter grade alignment corpus
CN110309516A (en) * 2019-05-30 2019-10-08 清华大学 Training method, device and the electronic equipment of Machine Translation Model
CN110362820A (en) * 2019-06-17 2019-10-22 昆明理工大学 A kind of bilingual parallel sentence extraction method of old man based on Bi-LSTM algorithm
CN110362820B (en) * 2019-06-17 2022-11-01 昆明理工大学 Bi-LSTM algorithm-based method for extracting bilingual parallel sentences in old and Chinese
CN110489624A (en) * 2019-07-12 2019-11-22 昆明理工大学 The method that the pseudo- parallel sentence pairs of the Chinese based on sentence characteristics vector extract
CN110489624B (en) * 2019-07-12 2022-07-19 昆明理工大学 Method for extracting Hanyue pseudo parallel sentence pair based on sentence characteristic vector
JP2022511139A (en) * 2019-10-25 2022-01-31 北京小米智能科技有限公司 Information processing methods, devices and storage media
US11461561B2 (en) 2019-10-25 2022-10-04 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and device for information processing, and storage medium
JP7208968B2 (en) 2019-10-25 2023-01-19 北京小米智能科技有限公司 Information processing method, device and storage medium
CN110852117A (en) * 2019-11-08 2020-02-28 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN110852117B (en) * 2019-11-08 2023-02-24 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN113806496A (en) * 2021-11-19 2021-12-17 航天宏康智能科技(北京)有限公司 Method and device for extracting entity from text sequence

Also Published As

Publication number Publication date
CN104391885B (en) 2017-07-28

Similar Documents

Publication Publication Date Title
CN104391885A (en) Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training
CN106547739B (en) A kind of text semantic similarity analysis method
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN105868184A (en) Chinese name recognition method based on recurrent neural network
CN101093478B (en) Method and system for identifying Chinese full name based on Chinese shortened form of entity
CN105808525A (en) Domain concept hypernym-hyponym relation extraction method based on similar concept pairs
Rigouts Terryn et al. Termeval 2020: Shared task on automatic term extraction using the annotated corpora for term extraction research (acter) dataset
CN112541343B (en) Semi-supervised counterstudy cross-language abstract generation method based on word alignment
CN112131872A (en) Document author duplicate name disambiguation method and construction system
CN102779135B (en) Method and device for obtaining cross-linguistic search resources and corresponding search method and device
CN107329960B (en) Unregistered word translating equipment and method in a kind of neural network machine translation of context-sensitive
CN102682000A (en) Text clustering method, question-answering system applying same and search engine applying same
CN104008092A (en) Method and system of relation characterizing, clustering and identifying based on the semanteme of semantic space mapping
Bustamante et al. No data to crawl? monolingual corpus creation from PDF files of truly low-resource languages in Peru
CN109190099B (en) Sentence pattern extraction method and device
CN106055560A (en) Method for collecting data of word segmentation dictionary based on statistical machine learning method
CN104298663A (en) Method for evaluating translation consistency in term field and statistical machine translation method
CN104572634A (en) Method for interactively extracting comparable corpus and bilingual dictionary and device thereof
CN106227836B (en) Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters
CN106202039B (en) Vietnamese portmanteau word disambiguation method based on condition random field
CN104035918A (en) Chinese organization name abbreviation recognition system adopting context feature matching
Wei et al. Cross-modal knowledge distillation in multi-modal fake news detection
Khana et al. Named entity dataset for urdu named entity recognition task
CN108268669A (en) A kind of crucial new word discovery method based on multidimensional words and phrases feature and sentiment analysis
CN101763403A (en) Query translation method facing multi-lingual information retrieval system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200402

Address after: 150001 No. 118 West Dazhi Street, Nangang District, Harbin, Heilongjiang

Patentee after: Harbin University of Technology High-Tech Development Corporation

Address before: 150001 No. 92 West Dazhi Street, Nangang District, Harbin

Patentee before: HARBIN INSTITUTE OF TECHNOLOGY

TR01 Transfer of patent right