CN103034627B - Method and apparatus for calculating sentence similarity, and method and apparatus for machine translation

Info

Publication number: CN103034627B
Application number: CN201110303522.5A
Authority: CN
Inventors: 刘占一; 吴华; 王海峰
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2011-10-09
Filing date: 2011-10-09
Publication date: 2016-05-25
Anticipated expiration: 2031-10-09
Also published as: CN103034627A

Abstract

The invention provides a method and apparatus for calculating sentence similarity and a method and apparatus for machine translation. The method for calculating sentence similarity comprises: comparing a first sentence with a second sentence to determine difference word pairs; scoring each difference word in a difference word pair using the collocation probabilities between that difference word and the other words in the first sentence or second sentence in which it occurs, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics on the number of co-occurrences of the two words in a preset corpus; determining the score of each difference word pair using the scores of the difference words in that pair; and determining the similarity between the first sentence and the second sentence using the scores of the difference word pairs. The invention reflects the degree of matching between two sentences more accurately, thereby improving the quality of applications such as machine translation.

Description

Method and apparatus for calculating sentence similarity, and method and apparatus for machine translation

[technical field]

The present invention relates to the field of computer technology, and in particular to a method and apparatus for calculating sentence similarity and a method and apparatus for machine translation.

[background technology]

Sentence similarity calculation has very important application value in fields such as question retrieval, bilingual example sentence retrieval, machine translation, and document summarization. Which sentence similarity calculation method is adopted, so as to accurately reflect how similar two sentences are, is the key factor affecting the quality of these applications.

Take machine translation as an example. Machine translation methods commonly use preprocessed bilingual example sentences as the main translation resource, and generate the final translation by editing a similar example sentence that matches the sentence to be translated. Specifically, this includes the following steps:

1) Search the translation example library for a similar example sentence that matches the sentence to be translated.

For example, the sentence to be translated is: This is a pencil.

The similar example sentence is: That is a pen.

2) Identify the difference words between the sentence to be translated and the similar example sentence.

"This" and "That" are difference words, and "pencil" and "pen" are difference words.

3) Take the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments.

The translations of "This" and "pencil" serve as the candidate translation fragments.

4) In the translation of the similar example sentence, replace the translations of the difference words of the similar example sentence with the candidate translation fragments to obtain the translation of the sentence to be translated.

The translation of the similar example sentence is the translation of "That is a pen". Replacing the translation of "That" with the translation of "This", and the translation of "pen" with the translation of "pencil", gives the translation of the sentence to be translated, that is, the translation of "This is a pencil".
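The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions that are not in the source: the two sentences have the same number of tokens and are compared position by position, and the function names, the `alignment` map, and the `lexicon` are hypothetical. The strings of the form `T(x)` are placeholders standing in for the target-language translation of `x`.

```python
def difference_positions(src_tokens, ex_tokens):
    """Return the indices at which the two sentences differ.

    Simplifying assumption: both sentences have the same number of tokens
    and are compared position by position.
    """
    return [i for i, (s, e) in enumerate(zip(src_tokens, ex_tokens)) if s != e]


def translate_by_example(src_tokens, ex_tokens, ex_translation, alignment, lexicon):
    """Replace the translations of the example's difference words.

    alignment maps a source position to a position in ex_translation;
    lexicon maps a source word to its candidate translation fragment.
    Both are hypothetical inputs supplied by the caller.
    """
    out = list(ex_translation)
    for i in difference_positions(src_tokens, ex_tokens):
        out[alignment[i]] = lexicon[src_tokens[i]]
    return out


src = ["This", "is", "a", "pencil"]
ex = ["That", "is", "a", "pen"]
# "T(x)" is a placeholder for the target-language translation of x.
ex_translation = ["T(That)", "T(is)", "T(a)", "T(pen)"]
alignment = {0: 0, 1: 1, 2: 2, 3: 3}
lexicon = {"This": "T(This)", "pencil": "T(pencil)"}

result = translate_by_example(src, ex, ex_translation, alignment, lexicon)
print(result)
```

A real system would obtain the alignment and the candidate fragments from its bilingual resources rather than from hand-written dictionaries.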

As the above machine translation process shows, how the similar example sentence is chosen is a key factor affecting translation quality.

Existing sentence similarity calculation usually computes the edit distance between sentences. The edit distance is the minimum number of operations needed to transform one sentence into another, where the operations may include insertion, deletion, substitution, and so on. The smaller the edit distance between two sentences, the higher their similarity is deemed to be. This approach, however, has certain shortcomings.

For example, suppose the sentence to be translated is: Can I take a picture of the painting?

The similar example sentence selected by computing edit distance is: Can I take a picture of the car?

The translation formed using this similar example sentence is (rendered literally in English): "Can I take a photo for this oil painting?"

If the sentence "Can we take a photo of the painting?" is instead used as the similar example sentence for the sentence to be translated, the translation formed is (rendered literally in English): "Can I take a photo for this [width] oil painting?", where the bracketed word is the literal rendering of a word that is absent from the first translation.

It can be seen that although the edit distance between the sentence "Can we take a photo of the painting?" and the sentence to be translated is greater than the edit distance between the sentence "Can I take a picture of the car?" and the sentence to be translated, its similarity to the sentence to be translated is higher than that of "Can I take a picture of the car?", and the quality of the resulting translation is therefore also higher.
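The edit distances compared above can be checked with a short sketch. It assumes a word-level Levenshtein distance with unit cost for insertion, deletion, and substitution; the source does not fix the operation costs, so this choice and the function name are assumptions.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


query = "Can I take a picture of the painting".split()
d_car = edit_distance(query, "Can I take a picture of the car".split())
d_photo = edit_distance(query, "Can we take a photo of the painting".split())
print(d_car, d_photo)  # 1 2
```

The "car" sentence is closer by edit distance (1 versus 2), which is exactly why a purely edit-distance-based selection prefers it.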

The above problem arises because the relationship between the difference words of the two sentences is not considered when computing sentence similarity. Although it has been proposed to take the degree of similarity between difference words into account on the basis of a synonym dictionary, in many applications, such as the machine translation application above, the collocation relationship between a difference word and its context matters more to the similarity calculation than semantics does: it reflects the degree of matching between two sentences more accurately and has a greater influence on the quality of the application.

[summary of the invention]

The invention provides a method and apparatus for calculating sentence similarity and a method and apparatus for machine translation, so as to reflect the degree of matching between two sentences more accurately and thereby improve the quality of applications such as machine translation.

The specific technical solutions are as follows:

A method for calculating sentence similarity, the method comprising:

A. comparing a first sentence with a second sentence to determine difference word pairs;

B. scoring each difference word in a difference word pair using the collocation probabilities between that difference word and the other words in the first sentence or second sentence in which it occurs, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics on the number of co-occurrences of the two words in a preset corpus;

C. determining the score of each difference word pair using the scores of the difference words in that pair;

D. determining the similarity between the first sentence and the second sentence using the scores of the difference word pairs.

Specifically, in step B each difference word is scored according to the following formula:

r(w_i, E) = \frac{\sum_{w_i \in E,\, w_j \in E,\, w_i \neq w_j} r(w_i, w_j)}{m},

where r(w_i, E) is the score of difference word w_i, E is the first sentence or second sentence in which w_i occurs, w_j is a word in E other than w_i, r(w_i, w_j) is the collocation probability of w_i and w_j, and m is the number of words in E.

In step C, each difference word pair is scored according to the following formula:

S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2};

or,

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2);

where S(w, \tilde{w}) is the score of the difference word pair formed by difference words w and \tilde{w}, r(w, E_1) is the score of difference word w in the first sentence E1, r(\tilde{w}, E_2) is the score of difference word \tilde{w} in the second sentence E2, and α1, α2, β1 and β2 are preset weight parameters.

Further, the method also comprises: determining the feature vectors of the two difference words in a difference word pair, and calculating the similarity distance of the two difference words using their feature vectors;

when the score of the difference word pair is determined in step C, the similarity distance of the two difference words in the pair is additionally used.

The feature vector of a difference word is determined as follows:

the collocation probability model is queried, and the words whose collocation probability with the difference word reaches a preset collocation probability threshold form the feature vector of that difference word.

Specifically, the similarity distance of the two difference words is calculated according to the following formula:

dist(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})),

where dist(w, \tilde{w}) is the similarity distance between difference words w and \tilde{w}, A is a preset positive number, F(w) is the feature vector of difference word w, F(\tilde{w}) is the feature vector of difference word \tilde{w}, and \mathrm{Cosine}(F(w), F(\tilde{w})) is the cosine of the angle between F(w) and F(\tilde{w}).

In this case, the difference word pair is scored in step C according to the following formula:

S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot dist(w, \tilde{w})^{\alpha_3};

or,

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot dist(w, \tilde{w});

where S(w, \tilde{w}) is the score of the difference word pair formed by difference words w and \tilde{w}, r(w, E_1) is the score of difference word w in the first sentence E1, r(\tilde{w}, E_2) is the score of difference word \tilde{w} in the second sentence E2, dist(w, \tilde{w}) is the similarity distance between difference words w and \tilde{w}, and α1, α2, α3, β1, β2 and β3 are preset weight parameters.

A method for machine translation, the method comprising:

S1. calculating the similarity between a sentence to be translated and the sentences in a preset example sentence library using the above method for calculating sentence similarity;

S2. selecting the sentences ranked in the top N by similarity as the similar example sentences of the sentence to be translated, where N is a preset positive integer;

S3. obtaining the translation of the sentence to be translated using the translations of the similar example sentences.

Step S1 specifically comprises:

S11. determining the sentences in the example sentence library whose edit distance from the sentence to be translated meets a preset requirement;

S12. calculating the similarity between the sentence to be translated and the sentences determined in step S11 using the above method for calculating sentence similarity.

Step S3 specifically comprises:

S31. identifying the difference words between the sentence to be translated and the similar example sentence;

S32. taking the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments;

S33. in the translation of the similar example sentence, replacing the translations of the corresponding difference words of the similar example sentence with the candidate translation fragments to obtain the translation of the sentence to be translated.

Preferably, the method for machine translation also comprises: when displaying the translation of the sentence to be translated, also displaying the similar example sentence that was used and the scores of the difference word pairs between that similar example sentence and the sentence to be translated.

An apparatus for calculating sentence similarity, the apparatus comprising:

a sentence comparison unit, configured to compare a first sentence with a second sentence to determine difference word pairs;

a difference word scoring unit, configured to score each difference word in a difference word pair using the collocation probabilities between that difference word and the other words in the first sentence or second sentence in which it occurs, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics on the number of co-occurrences of the two words in a preset corpus;

a difference word pair scoring unit, configured to determine the score of each difference word pair using the scores of the difference words in that pair;

a similarity determining unit, configured to determine the similarity between the first sentence and the second sentence using the scores of the difference word pairs.

Specifically, the difference word scoring unit scores each difference word according to the following formula:

r(w_i, E) = \frac{\sum_{w_i \in E,\, w_j \in E,\, w_i \neq w_j} r(w_i, w_j)}{m},

In this case, the difference word pair scoring unit scores each difference word pair according to the following formula:

S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2};

or,

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2);

In another embodiment, the apparatus also comprises: a similarity distance determining unit, configured to determine the feature vectors of the two difference words in a difference word pair and to calculate the similarity distance of the two difference words using their feature vectors;

when determining the score of a difference word pair, the difference word pair scoring unit additionally uses the similarity distance of the two difference words in the pair.

The similarity distance determining unit queries the collocation probability model and forms the feature vector of a difference word from the words whose collocation probability with that difference word reaches a preset collocation probability threshold.

The similarity distance determining unit calculates the similarity distance of the two difference words according to the following formula:

dist(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})),

where \mathrm{Cosine}(F(w), F(\tilde{w})) is the cosine of the angle between F(w) and F(\tilde{w}).

In this case, the difference word pair scoring unit scores each difference word pair according to the following formula:

S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot dist(w, \tilde{w})^{\alpha_3};

or,

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot dist(w, \tilde{w});

An apparatus for machine translation, the apparatus comprising:

the above apparatus for calculating sentence similarity, configured to calculate the similarity between a sentence to be translated and the sentences in a preset example sentence library;

a similar example sentence selecting unit, configured to select the sentences ranked in the top N by similarity as the similar example sentences of the sentence to be translated, where N is a preset positive integer;

a translation forming unit, configured to obtain the translation of the sentence to be translated using the translations of the similar example sentences.

Further, the apparatus for machine translation also comprises: an initial selection unit, configured to determine the sentences in the example sentence library whose edit distance from the sentence to be translated meets a preset requirement;

the apparatus for calculating sentence similarity calculates the similarity between the sentence to be translated and the sentences determined by the initial selection unit.

The translation forming unit specifically comprises:

a difference word identifying subunit, configured to identify the difference words between the sentence to be translated and the similar example sentence;

a fragment constructing subunit, configured to take the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments;

a translation forming subunit, configured to replace, in the translation of the similar example sentence, the translations of the corresponding difference words of the similar example sentence with the candidate translation fragments to obtain the translation of the sentence to be translated.

Preferably, the apparatus for machine translation also comprises: a display unit, configured to display, together with the translation of the sentence to be translated, the similar example sentence that was used and the scores of the difference word pairs between that similar example sentence and the sentence to be translated.

As can be seen from the above technical solutions, the method and apparatus provided by the invention incorporate word-to-word collocation probabilities into the calculation of sentence similarity: difference words are scored on the basis of their collocation probabilities with the other words in the sentences in which they occur, and the degree of difference between the sentences is then calculated. Compared with the prior art, this reflects the degree of matching between sentences more accurately, thereby improving the quality of applications such as machine translation.

[brief description of the drawings]

Fig. 1 is a flowchart of the method for calculating sentence similarity provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart of the method for calculating sentence similarity provided by Embodiment 2 of the present invention;

Fig. 3 is a flowchart of the machine translation method provided by Embodiment 3 of the present invention;

Fig. 4 is an example of the translation display provided by Embodiment 3 of the present invention;

Fig. 5 is a structural diagram of the apparatus for calculating sentence similarity provided by Embodiment 4 of the present invention;

Fig. 6 is a structural diagram of the machine translation apparatus provided by Embodiment 5 of the present invention.

[detailed description of the invention]

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.

The similarity calculation method provided by the present invention is first described through Embodiment 1 and Embodiment 2. Both embodiments calculate the similarity between a sentence E1 and a sentence E2, which can be chosen according to the specific application. For example, when applied to question retrieval, sentence E1 may be the query entered by the user and sentence E2 may be an existing question in the question database; when applied to machine translation, sentence E1 may be the sentence to be translated and sentence E2 may be a sentence in the example sentence library used for translation; and so on.

Embodiment 1

Fig. 1 is a flowchart of the method for calculating sentence similarity provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method may comprise the following steps:

Step 101: Compare sentence E1 with sentence E2 to determine difference word pairs.

The embodiments of the present invention build on basic text processing of the sentences, such as word segmentation and alignment. This processing is prior art and is not described again here.

The words in sentence E1 and sentence E2 are compared, and the words that differ are identified and formed into difference word pairs. For example:

Sentence E1 is: Can I take a picture of the painting?

Sentence E2 is: Can we take a photo of the painting?

The difference word pairs are determined to be: the pair formed by "I" and "we", and the pair formed by "picture" and "photo".
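Step 101 can be illustrated with Python's standard `difflib` module. This is a sketch only: the source leaves segmentation and alignment to the prior art, so using `SequenceMatcher` as the aligner, ignoring pure insertions and deletions, and the function name are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def difference_word_pairs(e1, e2):
    """Pair up the words that differ between two tokenized sentences.

    Only 'replace' spans of equal length are turned into pairs; insertions,
    deletions, and unequal-length replacements are ignored in this sketch.
    """
    pairs = []
    matcher = SequenceMatcher(a=e1, b=e2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(e1[i1:i2], e2[j1:j2]))
    return pairs


e1 = "Can I take a picture of the painting".split()
e2 = "Can we take a photo of the painting".split()
pairs = difference_word_pairs(e1, e2)
print(pairs)  # [('I', 'we'), ('picture', 'photo')]
```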

Step 102: Score each difference word in a difference word pair using the collocation probabilities between that difference word and the other words in the sentence E1 or sentence E2 in which it occurs, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics on the number of co-occurrences of the two words in a preset corpus.

By counting in advance the number of co-occurrences of word pairs in the preset corpus, the collocation probabilities between words can be obtained, forming the collocation probability model. For example, when used for machine translation, the preset corpus can be the corpus used for machine translation: counting the co-occurrences of "take" and "picture" gives the collocation probability of "take" and "picture", which is stored in the collocation probability model; counting the co-occurrences of "take" and "photo" gives the collocation probability of "take" and "photo", which is stored in the collocation probability model; and so on. The larger the collocation probability between two words, the stronger the dependence between them.
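A minimal sketch of building such a model from co-occurrence counts follows. The source does not say how counts are turned into probabilities, so the normalization used here (the fraction of corpus sentences in which both words appear), the sentence-level co-occurrence window, the function names, and the toy corpus are all assumptions.

```python
from collections import Counter
from itertools import combinations


def build_collocation_model(corpus):
    """Map each unordered word pair to an estimated collocation probability.

    Assumption: two words co-occur when they appear in the same sentence, and
    the probability is the co-occurrence count divided by the sentence count.
    """
    counts = Counter()
    for sentence in corpus:
        words = sorted({w.lower() for w in sentence.split()})
        for a, b in combinations(words, 2):  # a < b, each pair once per sentence
            counts[(a, b)] += 1
    n = len(corpus)
    return {pair: c / n for pair, c in counts.items()}


def collocation_probability(model, w1, w2):
    """Look up r(w1, w2); unseen pairs get probability 0."""
    a, b = sorted((w1.lower(), w2.lower()))
    return model.get((a, b), 0.0)


# Toy corpus for illustration only.
toy_corpus = ["take a picture", "take a photo", "take the picture", "draw a picture"]
model = build_collocation_model(toy_corpus)
print(collocation_probability(model, "take", "picture"))  # 0.5
print(collocation_probability(model, "take", "photo"))    # 0.25
```

Other estimators, such as pointwise mutual information or window-based counts, would fit the same lookup interface.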

Because a word in a sentence is not an isolated individual, each word has, to a greater or lesser extent, some collocation relationship with the other words in the sentence. This collocation relationship reflects how strongly the word depends on its context in the sentence, and hence the risk of editing it. When scoring each difference word, the collocation probabilities between the difference word and the other words in the sentence in which it occurs can be obtained and combined to give the score of the difference word. For example, for a difference word w_i, the score r(w_i, E) can be obtained with the following formula:

r(w_i, E) = \frac{\sum_{w_i \in E,\, w_j \in E,\, w_i \neq w_j} r(w_i, w_j)}{m}, \qquad (1)

where E is the sentence in which difference word w_i occurs, which may be the above sentence E1 or sentence E2, w_j is a word in E other than w_i, r(w_i, w_j) is the collocation probability of w_i and w_j, obtained by querying the collocation probability model, and m is the number of words in E.

Taking sentence E1 as an example, the collocation probabilities of the difference word "picture" with "can", "I", "take", "a", "of", "the" and "painting" can be obtained, and m is 8; substituting these into formula (1) gives the score of the difference word "picture".
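Formula (1) translates directly into code. In the sketch below the collocation probabilities are made-up toy values (not from the source), and `toy_r` is a hypothetical stand-in for querying the collocation probability model.

```python
def score_difference_word(word, sentence, r):
    """Formula (1): sum of r(word, other) over the other words, divided by m."""
    m = len(sentence)  # number of words in E
    total = sum(r(word, other) for other in sentence if other != word)
    return total / m


# Hypothetical collocation probabilities for illustration only.
TOY_R = {
    frozenset(("picture", "take")): 0.5,
    frozenset(("picture", "a")): 0.25,
    frozenset(("picture", "of")): 0.25,
    frozenset(("picture", "the")): 0.125,
    frozenset(("picture", "painting")): 0.375,
}


def toy_r(w1, w2):
    """Stand-in for the model lookup; unseen pairs get probability 0."""
    return TOY_R.get(frozenset((w1.lower(), w2.lower())), 0.0)


e1 = "Can I take a picture of the painting".split()
score = score_difference_word("picture", e1, toy_r)
print(score)  # (0.5 + 0.25 + 0.25 + 0.125 + 0.375) / 8 = 0.1875
```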

Step 103: Determine the score of each difference word pair using the scores of the difference words in that pair.

Once the score of each difference word in sentence E1 and sentence E2 has been determined, the difference word pairs can be scored. The pair score can be obtained by combining the scores of the two difference words in the pair, for example according to the following formula (2) or formula (3):

S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2}; \qquad (2)

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2); \qquad (3)

where S(w, \tilde{w}) is the score of the difference word pair formed by difference words w and \tilde{w}, r(w, E_1) is the score of difference word w in sentence E1, r(\tilde{w}, E_2) is the score of difference word \tilde{w} in sentence E2, and α1, α2, β1 and β2 are preset weight parameters. α1, α2, β1 and β2 can usually be set to numbers between -1 and 1; α1 and α2 are usually both chosen to be positive or both negative, and β1 and β2 are usually both chosen to be positive or both negative.

For example, when calculating the score of the difference word pair formed by "picture" and "photo", formula (1) is first used to calculate the score of "picture" and the score of "photo", and these are then substituted into formula (2) or (3) to obtain the score of the difference word pair formed by "picture" and "photo".
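Formulas (2) and (3) can be written as two small functions. The function names, the default weights, and the input scores in the usage lines are illustrative assumptions; the source only states that the weights are preset.

```python
def pair_score_product(r1, r2, alpha1=1.0, alpha2=1.0):
    """Formula (2): weighted product of the two difference word scores.

    Note: a score of 0 combined with a negative exponent raises
    ZeroDivisionError, so callers may need to smooth zero scores.
    """
    return (r1 ** alpha1) * (r2 ** alpha2)


def pair_score_sum(r1, r2, beta1=0.5, beta2=0.5):
    """Formula (3): weighted sum of the two difference word scores."""
    return beta1 * r1 + beta2 * r2


# r1 = r(w, E1) and r2 = r(w~, E2); the values are hypothetical.
r1, r2 = 0.25, 0.5
print(pair_score_product(r1, r2))  # 0.125
print(pair_score_sum(r1, r2))      # 0.375
```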

Step 104: Determine the similarity between sentence E1 and sentence E2 using the scores of the difference word pairs.

In this step, the scores of all the difference word pairs of sentence E1 and sentence E2 are combined, for example by summing them, to determine the similarity between sentence E1 and sentence E2. With the scoring described in Embodiment 1, the higher the combined value of the difference word pair scores, the more similar the two sentences are in terms of collocation relationships, and the higher their degree of matching.
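Putting steps 102 to 104 together, a compact sketch of Embodiment 1 might look as follows. It assumes the difference word pairs are already known, uses the additive formula (3) with equal weights, and sums the pair scores; the lookup `toy_r` returns made-up probabilities and is not part of the source.

```python
def word_score(word, sentence, r):
    """Formula (1): average collocation probability with the other words."""
    return sum(r(word, other) for other in sentence if other != word) / len(sentence)


def sentence_similarity(e1, e2, pairs, r, beta1=0.5, beta2=0.5):
    """Sum of formula (3) pair scores over all difference word pairs."""
    total = 0.0
    for w, w_tilde in pairs:
        total += beta1 * word_score(w, e1, r) + beta2 * word_score(w_tilde, e2, r)
    return total


def toy_r(w1, w2):
    """Hypothetical lookup: every word collocates with 'take' with probability 0.5."""
    return 0.5 if "take" in (w1, w2) else 0.0


e1 = "Can I take a picture of the painting".split()
e2 = "Can we take a photo of the painting".split()
pairs = [("I", "we"), ("picture", "photo")]
similarity = sentence_similarity(e1, e2, pairs, toy_r)
print(similarity)  # each pair scores 0.0625, so the sum is 0.125
```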

Embodiment 2

Fig. 2 is a flowchart of the method for calculating sentence similarity provided by Embodiment 2 of the present invention. As shown in Fig. 2, the method may comprise the following steps:

Step 201 is the same as step 101 in Embodiment 1.

Step 202 is the same as step 102 in Embodiment 1.

Step 203: Determine the feature vectors of the two difference words in a difference word pair, and calculate the similarity distance of the two difference words using their feature vectors.

Embodiment 2 can additionally take into account the degree of similarity of the difference words in a specific corpus. This degree of similarity is expressed by the distance between the feature vectors of the two difference words in the pair.

The feature vector of a difference word can be formed from the words that have a high collocation probability with that difference word. Specifically, the collocation probability model can be queried, and the words whose collocation probability with the difference word reaches a preset collocation probability threshold form the feature vector of the difference word.

Taking the difference word "picture" as an example, if querying the collocation probability model shows that the collocation probabilities of "take", "draw", "of", "gallery" and so on with "picture" reach the preset collocation probability threshold, the words "take", "draw", "of", "gallery" and so on can form the feature vector of "picture". The feature vector of the difference word "photo" can be determined in the same way.
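The construction of a feature vector can be sketched as a threshold filter over the collocation probability model. Representing the model as a dictionary keyed by `frozenset` pairs, representing the feature vector as a set of words, and the toy probabilities are all assumptions made for illustration.

```python
def feature_vector(word, model, threshold):
    """Collect the words whose collocation probability with `word` reaches the threshold.

    model: dict mapping frozenset({w1, w2}) to a collocation probability
    (a hypothetical layout chosen for this sketch).
    """
    features = set()
    for pair, prob in model.items():
        if word in pair and prob >= threshold:
            features.update(pair - {word})
    return features


# Hypothetical collocation probabilities for illustration only.
toy_model = {
    frozenset(("picture", "take")): 0.5,
    frozenset(("picture", "draw")): 0.4,
    frozenset(("picture", "of")): 0.3,
    frozenset(("picture", "gallery")): 0.25,
    frozenset(("picture", "banana")): 0.01,
    frozenset(("photo", "take")): 0.5,
}

f_picture = feature_vector("picture", toy_model, threshold=0.2)
print(sorted(f_picture))  # ['draw', 'gallery', 'of', 'take']
```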

When calculating the similarity distance of the two difference words in a difference word pair, the cosine of the angle between the feature vectors of the two difference words can be used. For example, the following formula can be adopted:

dist(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})), \qquad (4)

where dist(w, \tilde{w}) is the similarity distance between difference words w and \tilde{w}, A is a preset positive number, F(w) is the feature vector of difference word w, F(\tilde{w}) is the feature vector of difference word \tilde{w}, and \mathrm{Cosine}(F(w), F(\tilde{w})) is the cosine of the angle between F(w) and F(\tilde{w}).

The cosine of the angle can be computed using any of several concrete formulas in the prior art; taking one of them as an example gives formula (5), which is not reproduced in this text.

Because the collocations and collocation probabilities in the collocation probability model are trained and counted on a specific corpus, the approach of this step effectively describes the degree of similarity of the two difference words on that specific corpus.
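Formula (4) can be sketched as follows. Because the concrete cosine formula is not reproduced in this text, the sketch assumes the standard cosine over binary feature vectors, where each feature vector is treated as a set of words; A = 1 and the feature words listed for "photo" are likewise assumptions.

```python
import math


def cosine(f1, f2):
    """Cosine of the angle between two binary feature vectors given as word sets."""
    if not f1 or not f2:
        return 0.0  # convention chosen here for empty vectors
    return len(f1 & f2) / math.sqrt(len(f1) * len(f2))


def similarity_distance(f1, f2, a=1.0):
    """Formula (4): dist = A - Cosine(F(w), F(w~)), with A a preset positive number."""
    return a - cosine(f1, f2)


f_picture = {"take", "draw", "of", "gallery"}
f_photo = {"take", "of", "album", "camera"}  # hypothetical features

dist = similarity_distance(f_picture, f_photo)
print(dist)  # 1 - 2 / sqrt(4 * 4) = 0.5
```

If the feature vectors carried collocation probabilities as weights instead of binary membership, the dot product and norms would be computed over those weights.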

Step 204: Determine the score of each difference word pair using the scores of the difference words in that pair and the similarity distance of the two difference words.

Embodiment 2 differs from Embodiment 1 in that the similarity distance of the difference words is additionally taken into account when scoring a difference word pair, so that both the similarity distance of the two difference words in the pair and the editing risk are considered. For example, the following formula (6) or (7) can be used to score a difference word pair:

S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot dist(w, \tilde{w})^{\alpha_3}; \qquad (6)

S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot dist(w, \tilde{w}); \qquad (7)

where S(w, \tilde{w}) is the score of the difference word pair formed by difference words w and \tilde{w}, r(w, E_1) is the score of difference word w in sentence E1, r(\tilde{w}, E_2) is the score of difference word \tilde{w} in sentence E2, dist(w, \tilde{w}) is the similarity distance between difference words w and \tilde{w}, and α1, α2, α3, β1, β2 and β3 are preset weight parameters. α1, α2, α3, β1, β2 and β3 can usually be set to numbers between -1 and 1; α1 and α2 are usually both chosen to be positive or both negative, and β1 and β2 are usually both chosen to be positive or both negative.

For example, when calculating the score of the difference word pair formed by "picture" and "photo", formula (1) is first used to calculate the score of "picture" and the score of "photo"; formulas (4) and (5) are used to calculate the similarity distance between "picture" and "photo"; and formula (6) or (7) is then used to obtain the score of the difference word pair formed by "picture" and "photo".
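Formulas (6) and (7) extend the pair score with the similarity distance. In the sketch below the function names, the weights, and the input values are illustrative assumptions; in particular, giving β3 a negative sign so that a larger distance lowers the score is a choice made for the example, since the source only says the weights lie between -1 and 1.

```python
def pair_score_product(r1, r2, dist, alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """Formula (6): weighted product of the two word scores and the distance."""
    return (r1 ** alpha1) * (r2 ** alpha2) * (dist ** alpha3)


def pair_score_sum(r1, r2, dist, beta1, beta2, beta3):
    """Formula (7): weighted sum of the two word scores and the distance."""
    return beta1 * r1 + beta2 * r2 + beta3 * dist


# r1 = r(w, E1), r2 = r(w~, E2), dist = dist(w, w~); the values are hypothetical.
r1, r2, dist = 0.25, 0.5, 0.5
print(pair_score_product(r1, r2, dist))         # 0.0625
print(pair_score_sum(r1, r2, dist, 0.5, 0.5, -0.5))  # 0.125
```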

Step 205 is the same as step 104 in Embodiment 1.

The two embodiments above are described using English sentences as examples, but they are not limited to English sentences and apply equally to calculating sentence similarity in other languages, such as Chinese.

The sentence similarity calculated by the two embodiments above can be used in fields such as question retrieval, bilingual example sentence retrieval, machine translation, and document summarization. Embodiment 3 below describes its use in machine translation.

Embodiment 3

Fig. 3 is a flowchart of the machine translation method provided by Embodiment 3 of the present invention. As shown in Fig. 3, the method may comprise the following steps:

Step 301: Calculate the similarity between the sentence to be translated and the sentences in a preset example sentence library.

In this step, the method described in Embodiment 1 or Embodiment 2 can be used to calculate the similarity between the sentence to be translated and the sentences in the example sentence library, in preparation for the subsequent selection of similar example sentences.

Because the number of example sentences in the example sentence library is very large, calculating the similarity between every sentence in the library and the sentence to be translated one by one in the manner of Embodiment 1 or Embodiment 2 would be inefficient. To improve efficiency, the edit distance between each example sentence in the library and the sentence to be translated can be calculated first, the sentences in the library whose edit distance from the sentence to be translated meets a preset requirement are determined, and the similarity between each of those sentences and the sentence to be translated is then calculated. For example, the sentences whose edit distance is less than a preset threshold may be selected, or the sentences ranked in the top M by edit distance may be selected, where M is a preset positive integer.

The edit distance is the minimum number of operations needed to transform one sentence into another, where the operations may include insertion, deletion, substitution, and so on. Because the calculation of edit distance is prior art, it is not described again here.
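The preselection described above can be sketched as a filter over the example sentence library. The sketch assumes a word-level Levenshtein distance with unit costs and tokenized sentences; the function names, the parameter names `max_distance` and `top_m`, and the third library sentence are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def preselect(query, examples, max_distance=None, top_m=None):
    """Keep examples whose edit distance is below a threshold and/or in the top M."""
    ranked = sorted((edit_distance(query, ex), idx) for idx, ex in enumerate(examples))
    if max_distance is not None:
        ranked = [(d, idx) for d, idx in ranked if d < max_distance]
    if top_m is not None:
        ranked = ranked[:top_m]
    return [examples[idx] for _, idx in ranked]


query = "Can I take a picture of the painting".split()
library = [
    "Can I take a picture of the car".split(),
    "Can we take a photo of the painting".split(),
    "Where is the station".split(),  # hypothetical unrelated example
]
candidates = preselect(query, library, max_distance=3)
print([" ".join(c) for c in candidates])
```

Only the surviving candidates would then be scored with the collocation-based similarity of Embodiment 1 or Embodiment 2.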

Step 302: Select the sentences ranked in the top N by similarity as the similar example sentences of the sentence to be translated, where N is a preset positive integer.

The similar example sentences selected using the similarity calculation of Embodiment 1 or Embodiment 2 take into account the collocation relationship between the difference words and the other words in the sentence, that is, the editing risk of the difference words; Embodiment 2 further takes into account the similarity distance between the difference words. Sentences with a higher degree of matching are thus selected as similar example sentences for generating the translation, which improves translation quality.

In a preferred embodiment, the single sentence with the highest similarity can be selected as the similar example sentence.

For sentence to be translated: for CanItakeapictureofthepainting, sentence CanweThe takeaphotoofthepainting CanItakeapictureofthecar that compares, difference word to " I " andSimilarity distance to difference word in " picture " and " photo " of " we " and difference word and with sentence in the collocation of other wordsProbability is all larger, and the similarity distance of difference word to " painting " and " car " and with sentence in the collocation probability of other wordsLess, therefore, the sentence Canwetakeaphotoofthepainting CanItakeapicture that comparesOfthecar and sentence to be translated have higher similarity, can choose CanwetakeaphotoofthePainting is as similar example sentence.

Step 303: Use the translation of the similar example sentence to obtain the translation of the sentence to be translated.

After the similar example sentence is determined, the translation of the sentence to be translated can be generated by the following steps:

Identify the difference words between the sentence to be translated and the similar example sentence; take the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments; in the translation of the similar example sentence, replace the translations of the corresponding difference words in the similar example sentence with the candidate translation fragments, obtaining the translation of the sentence to be translated. This part is the same as the prior art and is not described further.

For example, the difference words identified between the similar example sentence "Can we take a photo of the painting" and the sentence to be translated "Can I take a picture of the painting" are "we" and "I", and "photo" and "picture". The translation of "I" and the translation of "picture" are taken as the candidate translation fragments. The translation of the similar example sentence is "We can take a photo of this oil painting"; replacing the translations of the difference words in it with the candidate translation fragments yields the translation of the sentence to be translated, "I can take a photo of this oil painting".
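The replacement step in this example can be sketched as a simple token substitution. The sketch assumes the translation of the similar example sentence is already tokenized and that the translation of each difference word is a single token; English glosses stand in for the target-language tokens, and the function name `substitute_fragments` is hypothetical.

```python
from typing import Dict, List


def substitute_fragments(example_translation: List[str],
                         replacements: Dict[str, str]) -> List[str]:
    """Replace the translations of the example's difference words.

    `replacements` maps the translation of a difference word in the similar
    example sentence to the candidate translation fragment taken from the
    sentence to be translated.
    """
    return [replacements.get(token, token) for token in example_translation]


example = ["We", "can", "take", "a", "photo", "of", "this", "oil", "painting"]
# "We" -> translation of "I"; "photo" -> translation of "picture"
# (identical to the existing token in this gloss).
result = substitute_fragments(example, {"We": "I", "photo": "photo"})
```

A real system would need word alignment between each example sentence and its translation to locate the tokens to replace; that alignment is outside the scope of this sketch.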

When the translation of the sentence to be translated is displayed, the similar example sentence can be displayed as well, and the scoring result of each difference word pair between the similar example sentence and the sentence to be translated can further be displayed. When displaying the scoring result of a difference word pair, a preset correspondence between scoring results and confidence levels can be used: for example, confidence is divided into three levels (high, medium, and low) according to the scoring result, the confidence level corresponding to the scoring result of the difference word pair is determined, and that confidence level is displayed.
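The mapping from a scoring result to one of three confidence levels can be sketched as below. The cut-off values 0.3 and 0.7 are placeholders chosen for illustration only; the text states only that the correspondence is set in advance.

```python
def confidence_level(score: float,
                     low_cut: float = 0.3,
                     high_cut: float = 0.7) -> str:
    """Map a difference word pair score to 'high', 'medium', or 'low'.

    The two cut-offs are hypothetical; in practice they would come from the
    preset correspondence between scoring results and confidence levels.
    """
    if score >= high_cut:
        return "high"
    if score >= low_cut:
        return "medium"
    return "low"
```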

As shown in Fig. 4, the sentence to be translated, the similar example sentence, the translation of the similar example sentence, and the translation of the sentence to be translated are displayed. The difference words of the similar example sentence and the sentence to be translated can be highlighted, and the candidate translation fragments are highlighted as well. The confidence levels of the difference word pairs are displayed on the right side at the same time. The manner of highlighting is not limited to that shown in Fig. 4.

The above is a detailed description of the method provided by the present invention. The device for calculating sentence similarity provided by the present invention is described below through Embodiment 4.

Embodiment 4

Fig. 5 is a structural diagram of the device for calculating sentence similarity provided by Embodiment 4 of the present invention. As shown in Fig. 5, the device may comprise: a sentence comparison unit 501, a difference word scoring unit 502, a difference word pair scoring unit 503, and a similarity determining unit 504.

The sentence comparison unit 501 compares sentence E1 with sentence E2 and determines the difference word pairs.

In effect, the words in sentence E1 and sentence E2 are compared, and the words that differ are identified and formed into difference word pairs.
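One possible way to form difference word pairs is to align the two token sequences and pair up the substituted words. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment; this choice, the whitespace tokenization, and the decision to ignore pure insertions and deletions are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def difference_word_pairs(e1: str, e2: str) -> List[Tuple[str, str]]:
    """Return (word in E1, word in E2) pairs for aligned substitutions."""
    t1, t2 = e1.split(), e2.split()
    pairs: List[Tuple[str, str]] = []
    matcher = SequenceMatcher(a=t1, b=t2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replaced spans are paired word by word.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(t1[i1:i2], t2[j1:j2]))
    return pairs
```

For the example sentences used earlier, `difference_word_pairs("Can I take a picture of the painting", "Can we take a photo of the painting")` pairs "I" with "we" and "picture" with "photo".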

The difference word scoring unit 502 scores each difference word by using the collocation probabilities between the difference word in a difference word pair and the other words in the sentence E1 or sentence E2 in which it is located. The collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics of the number of co-occurrences of the two words in a preset corpus.

The collocation probability model is built in advance: the number of co-occurrences between words in the preset corpus is counted, the collocation probability between each pair of words is obtained from these counts, and the collocation probabilities together form the model. The magnitude of the collocation probability between words reflects the degree to which a word depends on its context in the sentence, and thus its editing risk. When scoring each difference word, the collocation probabilities between the difference word and the other words in its sentence can be obtained separately and then integrated to give the scoring result of the difference word.
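A collocation probability model built from co-occurrence counts might look like the following sketch. The text does not fix the estimator, so two assumptions are made here: co-occurrence is counted at sentence level, and the counts are normalized with the Dice coefficient. The class name `CollocationModel` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable


class CollocationModel:
    """Sentence-level co-occurrence statistics with a Dice-style estimate."""

    def __init__(self, corpus: Iterable[str]) -> None:
        self.word_counts: Counter = Counter()
        self.pair_counts: Counter = Counter()
        for sentence in corpus:
            words = sorted(set(sentence.lower().split()))
            self.word_counts.update(words)
            # Each unordered word pair is counted once per sentence.
            self.pair_counts.update(
                frozenset(pair) for pair in combinations(words, 2))

    def prob(self, w1: str, w2: str) -> float:
        """Dice coefficient of w1 and w2; 0.0 for unseen or identical words."""
        w1, w2 = w1.lower(), w2.lower()
        denom = self.word_counts[w1] + self.word_counts[w2]
        if w1 == w2 or denom == 0:
            return 0.0
        return 2.0 * self.pair_counts[frozenset((w1, w2))] / denom
```

Other association measures (for example relative frequency or pointwise mutual information) could be substituted in `prob` without changing the rest of the scoring pipeline.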

For example, the difference word scoring unit 502 may score each difference word according to the following formula:

where $r(w_i, E)$ is the scoring result of difference word $w_i$; $E$ is the sentence in which difference word $w_i$ is located, which may be sentence E1 or sentence E2; $w_j$ is a word in $E$ other than $w_i$; $r(w_i, w_j)$ is the collocation probability of $w_i$ and $w_j$; and $m$ is the number of words contained in $E$.
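A minimal sketch of the difference word scoring follows. It assumes that the collocation probabilities are integrated by averaging over the other m - 1 words of the sentence; the averaging is an assumption made for illustration, since the surrounding text states only that the probabilities are integrated. The function name and the `prob` callable are hypothetical.

```python
from typing import Callable, List


def score_difference_word(index: int, sentence: List[str],
                          prob: Callable[[str, str], float]) -> float:
    """Average collocation probability of sentence[index] with the other words."""
    m = len(sentence)
    if m < 2:
        return 0.0  # no other words to collocate with
    w_i = sentence[index]
    total = sum(prob(w_i, w_j)
                for j, w_j in enumerate(sentence) if j != index)
    return total / (m - 1)
```

Here `prob` stands for a lookup into the collocation probability model, so any model exposing a two-word probability function can be plugged in.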

The difference word pair scoring unit 503 determines the score of each difference word pair by using the scoring results of the difference words in the pair.

The similarity determining unit 504 determines the similarity between sentence E1 and sentence E2 by using the scoring results of the difference word pairs. The scoring results of all difference word pairs in sentence E1 and sentence E2 are integrated, for example by summing them, to determine the similarity between sentence E1 and sentence E2.

The difference word pair scoring unit 503 may score difference word pairs in two ways, corresponding to the methods of Embodiment 1 and Embodiment 2 respectively, as follows:

First way: the difference word pair scoring unit 503 may score a difference word pair according to the following formula:

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2};$$

or,

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2).$$

where $S(w, \tilde{w})$ is the scoring result of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E_1)$ is the scoring result of difference word $w$ in sentence E1, $r(\tilde{w}, E_2)$ is the scoring result of difference word $\tilde{w}$ in sentence E2, and $\alpha_1$, $\alpha_2$, $\beta_1$, and $\beta_2$ are preset weight parameters.
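The two first-way formulas, together with the summation over pairs described for the similarity determining unit 504, can be sketched as follows. The default weights are placeholders chosen for illustration, and all function names are hypothetical.

```python
from typing import Iterable, Tuple


def pair_score_product(r1: float, r2: float,
                       alpha1: float = 1.0, alpha2: float = 1.0) -> float:
    """S = r(w, E1)^alpha1 * r(w~, E2)^alpha2."""
    return (r1 ** alpha1) * (r2 ** alpha2)


def pair_score_linear(r1: float, r2: float,
                      beta1: float = 0.5, beta2: float = 0.5) -> float:
    """S = beta1 * r(w, E1) + beta2 * r(w~, E2)."""
    return beta1 * r1 + beta2 * r2


def sentence_similarity(word_scores: Iterable[Tuple[float, float]],
                        alpha1: float = 1.0, alpha2: float = 1.0) -> float:
    """Sum the product-form pair scores over all difference word pairs."""
    return sum(pair_score_product(r1, r2, alpha1, alpha2)
               for r1, r2 in word_scores)
```

Each tuple passed to `sentence_similarity` holds the scores of the two difference words of one pair, the first from sentence E1 and the second from sentence E2.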

Second way: as shown in Fig. 5, the device further comprises a similarity distance determining unit 505, configured to determine the feature vectors of the two difference words in a difference word pair and to calculate the similarity distance between the two difference words by using their feature vectors.

In this case, when determining the score of a difference word pair, the difference word pair scoring unit 503 further uses the similarity distance between the two difference words in the pair.

Specifically, in the second way, the similarity distance determining unit 505 may query the collocation probability model and form the feature vector of a difference word from the words whose collocation probability with that difference word reaches a preset collocation probability threshold.

When calculating the similarity distance between the two difference words, the similarity distance determining unit 505 may use the following formula:

$$\mathrm{dist}(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})),$$

where $A$ is a preset positive number, $F(w)$ and $F(\tilde{w})$ are the feature vectors of difference words $w$ and $\tilde{w}$, and $\mathrm{Cosine}(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.

The cosine of the angle can be calculated using any of several specific formulas in the prior art; taking one of them as an example yields formula (5) in Embodiment 2.
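The feature vector construction and the similarity distance can be sketched as below. Several details are assumptions: the collocation model is represented as a nested dictionary, the vector components are weighted by collocation probability (the text says only which words form the vector), the standard dot-product cosine is used, and A defaults to 1. All names are hypothetical.

```python
import math
from typing import Dict


def feature_vector(word: str, colloc: Dict[str, Dict[str, float]],
                   threshold: float) -> Dict[str, float]:
    """Words whose collocation probability with `word` reaches the threshold."""
    return {w: p for w, p in colloc.get(word, {}).items() if p >= threshold}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_distance(u: Dict[str, float], v: Dict[str, float],
                        a: float = 1.0) -> float:
    """dist = A - Cosine(F(w), F(w~)), with A a preset positive number."""
    return a - cosine(u, v)
```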

In the second way, the difference word pair scoring unit 503 may score a difference word pair according to the following formula:

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot \mathrm{dist}(w, \tilde{w})^{\alpha_3};$$

or,

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot \mathrm{dist}(w, \tilde{w}).$$
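The second-way formulas extend the first-way ones with a distance term, as in the sketch below. The default weights of 1.0 are placeholders, and the function names are hypothetical.

```python
def pair_score_product_with_distance(r1: float, r2: float, dist: float,
                                     alpha1: float = 1.0,
                                     alpha2: float = 1.0,
                                     alpha3: float = 1.0) -> float:
    """S = r(w, E1)^alpha1 * r(w~, E2)^alpha2 * dist(w, w~)^alpha3.

    Assumes dist >= 0; a negative base with a fractional exponent would
    not give a real-valued score.
    """
    return (r1 ** alpha1) * (r2 ** alpha2) * (dist ** alpha3)


def pair_score_linear_with_distance(r1: float, r2: float, dist: float,
                                    beta1: float = 1.0,
                                    beta2: float = 1.0,
                                    beta3: float = 1.0) -> float:
    """S = beta1 * r(w, E1) + beta2 * r(w~, E2) + beta3 * dist(w, w~)."""
    return beta1 * r1 + beta2 * r2 + beta3 * dist
```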

Embodiment 5

Fig. 6 is a structural diagram of the machine translation device provided by Embodiment 5 of the present invention. As shown in Fig. 6, the device may comprise: a device 600 for calculating sentence similarity, a similar example sentence selecting unit 610, and a translation forming unit 620.

The device 600 for calculating sentence similarity calculates the similarity between the sentence to be translated and the sentences in a preset example sentence library; its structure may be as shown in Fig. 5.

The similar example sentence selecting unit 610 selects the N sentences with the highest similarity as the similar example sentences of the sentence to be translated, where N is a preset positive integer. In a preferred embodiment, N is usually 1.
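Selecting the N most similar example sentences can be sketched with the standard `heapq.nlargest` function. The `similarity` callable is a hypothetical stand-in for the device 600, and `select_similar_examples` is an illustrative name.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def select_similar_examples(query: str, candidates: Iterable[str],
                            similarity: Callable[[str, str], float],
                            n: int = 1) -> List[Tuple[float, str]]:
    """Return the n (similarity, sentence) pairs with the highest similarity."""
    scored = ((similarity(query, c), c) for c in candidates)
    return heapq.nlargest(n, scored, key=lambda pair: pair[0])
```

With `n=1` this reduces to picking the single best example sentence, matching the preferred embodiment.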

The translation forming unit 620 uses the translation of the similar example sentence to obtain the translation of the sentence to be translated.

Because the number of example sentences in the example sentence library is very large, it would be inefficient for the device 600 for calculating sentence similarity to calculate the similarity with the sentence to be translated for every sentence in the library one by one. To improve efficiency, the machine translation device may further comprise a preliminary selection unit 630, configured to determine the sentences in the example sentence library whose edit distance to the sentence to be translated meets a preset requirement. For example, the sentences whose edit distance is less than a preset threshold may be selected, or the sentences ranked in the top M by edit distance may be selected, where M is a preset positive integer.

Correspondingly, the device 600 for calculating sentence similarity only needs to calculate the similarity between the sentence to be translated and the sentences determined by the preliminary selection unit 630.

The translation forming unit 620 may specifically comprise: a difference word identifying subunit 621, a fragment construction subunit 622, and a translation forming subunit 623.

The difference word identifying subunit 621 identifies the difference words between the sentence to be translated and the similar example sentence.

The fragment construction subunit 622 takes the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments.

The translation forming subunit 623 is configured to replace, in the translation of the similar example sentence, the translations of the corresponding difference words in the similar example sentence with the candidate translation fragments, obtaining the translation of the sentence to be translated.

The machine translation device may further comprise a display unit 640, configured to display, together with the translation of the sentence to be translated, the similar example sentence used and the scoring result of each difference word pair between the similar example sentence used and the sentence to be translated.

When the scoring results are displayed, a preset correspondence between scoring results and confidence levels can be used: for example, confidence is divided into three levels (high, medium, and low) according to the scoring result, the confidence level corresponding to the scoring result of the difference word pair is determined, and that confidence level is displayed.

As a preferred display scheme, the sentence to be translated, the similar example sentence, the translation of the similar example sentence, and the translation of the sentence to be translated can be displayed, with the difference words of the similar example sentence and the sentence to be translated highlighted and the candidate translation fragments highlighted as well, as shown in Fig. 4; the confidence levels of the difference word pairs are displayed on the right side at the same time.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for calculating sentence similarity, characterized in that the method comprises:

B. scoring each difference word by using the collocation probabilities between the difference word in a difference word pair and the other words in the sentence in which it is located, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics of the number of co-occurrences of the two words in a preset corpus;

2. The method according to claim 1, characterized in that, in step B, each difference word is scored according to the following formula:

where $r(w_i, E)$ is the scoring result of difference word $w_i$, $E$ is the first sentence or the second sentence in which difference word $w_i$ is located, $w_j$ is a word in $E$ other than $w_i$, $r(w_i, w_j)$ is the collocation probability of $w_i$ and $w_j$, and $m$ is the number of words contained in $E$.

3. method according to claim 1 and 2, is characterized in that, in described step C, is difference according to following formulaWord is to giving a mark:

S (w, \tilde{w}) = r {(w, E 1)}^{α 1} * r {(\tilde{w}, E 2)}^{α 2};

Or,

S (w, \tilde{w}) = β 1 * r (w, E 1) + β 2 * r (\tilde{w}, E 2);

Wherein,For by difference word w andThe right marking result of difference word forming, r (w, E1) is in the first sentence E1The marking result of difference word w,It is the difference word in the second sentence E2Marking result, α 1, α 2, β 1 and β 2 areDefault weighting parameter.

4. The method according to claim 1, characterized in that the method further comprises: determining the feature vectors of the two difference words in a difference word pair, and calculating the similarity distance between the two difference words by using the feature vectors of the two difference words;

when the score of the difference word pair is determined in step C, the similarity distance between the two difference words in the pair is further used;

wherein the feature vector of a difference word is determined as follows:

the collocation probability model is queried, and the words whose collocation probability with the difference word reaches a preset collocation probability threshold form the feature vector of the difference word.

5. The method according to claim 4, characterized in that the similarity distance between the two difference words is calculated according to the following formula:

$$\mathrm{dist}(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})),$$

where $\mathrm{dist}(w, \tilde{w})$ is the similarity distance between difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of difference word $w$, $F(\tilde{w})$ is the feature vector of difference word $\tilde{w}$, and $\mathrm{Cosine}(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.

6. The method according to claim 4 or 5, characterized in that, in step C, a difference word pair is scored according to the following formula:

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot \mathrm{dist}(w, \tilde{w})^{\alpha_3};$$

or,

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot \mathrm{dist}(w, \tilde{w});$$

where $S(w, \tilde{w})$ is the scoring result of the difference word pair formed by difference words $w$ and $\tilde{w}$, $r(w, E_1)$ is the scoring result of difference word $w$ in the first sentence E1, $r(\tilde{w}, E_2)$ is the scoring result of difference word $\tilde{w}$ in the second sentence E2, $\mathrm{dist}(w, \tilde{w})$ is the similarity distance between difference words $w$ and $\tilde{w}$, and $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$, and $\beta_3$ are preset weight parameters.

7. A method of machine translation, characterized in that the method of machine translation comprises:

S1. calculating, by the method of claim 1, the similarity between the sentence to be translated and the sentences in a preset example sentence library;

S2. selecting the N sentences with the highest similarity as the similar example sentences of the sentence to be translated, where N is a preset positive integer;

8. The method of machine translation according to claim 7, characterized in that step S1 specifically comprises:

S12. calculating, by the method of claim 1, the similarity between the sentence to be translated and the sentences determined in step S11.

9. The method of machine translation according to claim 7, characterized in that step S3 specifically comprises:

S33. in the translation of the similar example sentence, replacing the translations of the corresponding difference words in the similar example sentence with the candidate translation fragments, to obtain the translation of the sentence to be translated.

10. The method of machine translation according to claim 7, characterized in that the method of machine translation further comprises: when displaying the translation of the sentence to be translated, displaying the similar example sentence used and the scoring result of each difference word pair between the similar example sentence used and the sentence to be translated.

11. A device for calculating sentence similarity, characterized in that the device comprises:

a difference word scoring unit, configured to score each difference word by using the collocation probabilities between the difference word in a difference word pair and the other words in the sentence in which it is located, wherein the collocation probability between two words is obtained by querying a collocation probability model, and the collocation probability between two words in the collocation probability model is obtained from statistics of the number of co-occurrences of the two words in a preset corpus;

a difference word pair scoring unit, configured to determine the score of a difference word pair by using the scoring results of the difference words in the pair;

a similarity determining unit, configured to determine the similarity between the first sentence and the second sentence by using the scoring results of the difference word pairs.

12. The device according to claim 11, characterized in that the difference word scoring unit scores each difference word according to the following formula:

13. The device according to claim 11 or 12, characterized in that the difference word pair scoring unit scores a difference word pair according to the following formula:

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2};$$

or,

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2);$$

14. The device according to claim 11, characterized in that the device further comprises: a similarity distance determining unit, configured to determine the feature vectors of the two difference words in a difference word pair and to calculate the similarity distance between the two difference words by using the feature vectors of the two difference words;

the difference word pair scoring unit further uses the similarity distance between the two difference words in the pair when determining the score of the difference word pair;

wherein, when determining the feature vector of a difference word, the similarity distance determining unit queries the collocation probability model and forms the feature vector of the difference word from the words whose collocation probability with the difference word reaches a preset collocation probability threshold.

15. The device according to claim 14, characterized in that the similarity distance determining unit calculates the similarity distance between the two difference words according to the following formula:

$$\mathrm{dist}(w, \tilde{w}) = A - \mathrm{Cosine}(F(w), F(\tilde{w})),$$

where $\mathrm{dist}(w, \tilde{w})$ is the similarity distance between difference words $w$ and $\tilde{w}$, $A$ is a preset positive number, $F(w)$ is the feature vector of difference word $w$, $F(\tilde{w})$ is the feature vector of difference word $\tilde{w}$, and $\mathrm{Cosine}(F(w), F(\tilde{w}))$ is the cosine of the angle between $F(w)$ and $F(\tilde{w})$.

16. The device according to claim 14 or 15, characterized in that the difference word pair scoring unit scores a difference word pair according to the following formula:

$$S(w, \tilde{w}) = r(w, E_1)^{\alpha_1} \cdot r(\tilde{w}, E_2)^{\alpha_2} \cdot \mathrm{dist}(w, \tilde{w})^{\alpha_3};$$

or,

$$S(w, \tilde{w}) = \beta_1 \cdot r(w, E_1) + \beta_2 \cdot r(\tilde{w}, E_2) + \beta_3 \cdot \mathrm{dist}(w, \tilde{w});$$

17. A device for machine translation, characterized in that the device for machine translation comprises:

the device for calculating sentence similarity according to claim 11, configured to calculate the similarity between the sentence to be translated and the sentences in a preset example sentence library;

a similar example sentence selecting unit, configured to select the N sentences with the highest similarity as the similar example sentences of the sentence to be translated, where N is a preset positive integer;

18. The device for machine translation according to claim 17, characterized in that the device for machine translation further comprises: a preliminary selection unit, configured to determine the sentences in the example sentence library whose edit distance to the sentence to be translated meets a preset requirement;

the device for calculating sentence similarity calculates the similarity between the sentence to be translated and the sentences determined by the preliminary selection unit.

19. The device for machine translation according to claim 17, characterized in that the translation forming unit specifically comprises:

a fragment construction subunit, configured to take the translations corresponding to the difference words in the sentence to be translated as candidate translation fragments;

a translation forming subunit, configured to replace, in the translation of the similar example sentence, the translations of the corresponding difference words in the similar example sentence with the candidate translation fragments, to obtain the translation of the sentence to be translated.

20. The device for machine translation according to claim 17, characterized in that the device for machine translation further comprises: a display unit, configured to display, together with the translation of the sentence to be translated, the similar example sentence used and the scoring result of each difference word pair between the similar example sentence used and the sentence to be translated.