CN109710950B

CN109710950B - Bilingual alignment method, apparatus and system

Info

Publication number: CN109710950B
Application number: CN201811567535.1A
Authority: CN
Inventors: 聂镭; 徐泓洋; 郑权; 张峰; 聂颖
Original assignee: Dragon Horse Zhixin (Zhuhai Hengqin) Technology Co Ltd
Current assignee: Dragon Horse Zhixin (Zhuhai Hengqin) Technology Co Ltd
Priority date: 2018-12-20
Filing date: 2018-12-20
Publication date: 2019-10-18
Anticipated expiration: 2038-12-20
Also published as: CN109710950A

Abstract

The invention discloses a bilingual alignment method, apparatus and system. The method comprises: Step S1: segmenting two texts to be aligned at the same linguistic unit level; Step S2: calculating the text similarity between each part obtained by segmenting one of the two texts and each part obtained by segmenting the other text; Step S3: establishing a text similarity matrix B; Step S4: pairing the parts obtained by segmenting the two texts by successively taking the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the row and the column of the text similarity matrix B that correspond to the determined pairing are set to an end identifier. The invention helps to improve the accuracy of bilingual alignment.

Description

Bilingual alignment method, apparatus and system

Technical field

The present invention relates to the technical field of natural language processing, and in particular to a bilingual alignment method, apparatus and system.

Background technique

Bilingual alignment is an important topic in natural language processing. It establishes, in a bilingual corpus, the correspondence between units of the source language and units of the target language at the same linguistic unit level (such as sentences or paragraphs); that is, it determines which part of the target-language text is the translation of each part of the source-language text. Bilingual alignment is a prerequisite for extracting further linguistic knowledge from parallel corpora and is used in the preprocessing stage of systems such as machine translation. How to improve the accuracy of bilingual alignment is a problem that urgently needs to be solved.

Summary of the invention

In view of this, the purpose of the present invention is to provide a bilingual alignment method, apparatus and system that help to improve the accuracy of bilingual alignment.

To achieve the above purpose, the technical solution of the present invention provides a bilingual alignment method, comprising:

Step S1: segmenting two texts to be aligned at the same linguistic unit level;

Step S2: calculating the text similarity between each part obtained by segmenting one of the two texts and each part obtained by segmenting the other text;

Step S3: establishing a text similarity matrix B:

$$B = \begin{bmatrix} K_{11} & K_{12} & \cdots & K_{1m} \\ K_{21} & K_{22} & \cdots & K_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ K_{n1} & K_{n2} & \cdots & K_{nm} \end{bmatrix}$$

wherein n is the number of parts obtained by segmenting the one text, m is the number of parts obtained by segmenting the other text, and the element K_ij of the text similarity matrix B is the text similarity between the i-th part obtained by segmenting the one text and the j-th part obtained by segmenting the other text;

Step S4: pairing the parts obtained by segmenting the two texts by successively taking the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the row and the column of the text similarity matrix B that correspond to the determined pairing are set to an end identifier.

Further, Step S4 comprises:

Step S401: judging whether the value of the current maximum element in the text similarity matrix B is greater than a preset value; if so, executing Step S402; if not, executing Step S405;

Step S402: pairing the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, setting the values of the elements in the row and the column of the current maximum element to the end identifier, and then executing Step S403;

Step S403: judging whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeating Step S401; if not, executing Step S404;

Step S404: ending the pairing process;

Step S405: judging whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier; if so, executing Step S406; if not, executing Step S407;

Step S406: merging the part of the other text corresponding to the adjacent column that has not been set to the end identifier with the part of the other text corresponding to the current maximum element, pairing the merged part with the part of the one text corresponding to the current maximum element, setting the values of the elements in that adjacent column and in the row and the column of the current maximum element to the end identifier, and then executing Step S408;

Step S407: pairing the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, setting the values of the elements in the row and the column of the current maximum element to the end identifier, and then executing Step S408;

Step S408: judging whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeating Step S405; if not, executing Step S404.

Further, the one text is an English text, and in Step S2 the text similarity K_ij is calculated as follows:

obtaining the English text produced by translating the j-th part obtained by segmenting the other text;

comparing the number of words in the i-th part obtained by segmenting the one text with the number of words in the English text produced by translating the j-th part obtained by segmenting the other text;

calculating

$$K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$$

wherein L is the number of words in whichever of the two compared texts has more words, and N_v is the value assigned to the v-th word of that longer text: if the text with fewer words contains a word with the same root as the v-th word, N_v is 1; otherwise N_v is 0.
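As a non-limiting illustration, the similarity calculation may be sketched in Python as follows. The sketch assumes that K_ij is the sum of the N_v values divided by L, which is how the definitions of L and N_v above read. The names `crude_root` and `text_similarity` are hypothetical, and the suffix-stripping rule is only a rough stand-in for whatever root comparison an implementation actually uses.

```python
import re

def crude_root(word: str) -> str:
    """Rough stand-in for root extraction (an assumption, not the patent's rule):
    lowercase the word and strip one common English suffix."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def text_similarity(part_i: str, translated_part_j: str) -> float:
    """Fraction of words in the longer text whose root also occurs in the shorter text."""
    words_i = re.findall(r"[A-Za-z']+", part_i)
    words_j = re.findall(r"[A-Za-z']+", translated_part_j)
    # On a tie, either text may be treated as the longer one.
    if len(words_i) >= len(words_j):
        longer, shorter = words_i, words_j
    else:
        longer, shorter = words_j, words_i
    if not longer:
        return 0.0  # both texts are empty
    shorter_roots = {crude_root(w) for w in shorter}
    # N_v is 1 when the v-th word of the longer text shares a root with the shorter text.
    hits = sum(1 for w in longer if crude_root(w) in shorter_roots)
    return hits / len(longer)  # L is len(longer)
```

Because the count is divided by the length of the longer text, the value stays between 0 and 1, which fits preset thresholds such as 0.5, 0.6 or 0.7.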

Further, the linguistic unit level is the paragraph level, so as to achieve paragraph alignment between the two texts.

Further, the linguistic unit level is the sentence level, so as to achieve sentence alignment between the two texts.

To achieve the above purpose, the technical solution of the present invention further provides a bilingual alignment method, comprising:

performing paragraph alignment between a first text and a second text using the above bilingual alignment method for paragraph alignment, to obtain several pairs of paired paragraphs;

for each pair of paired paragraphs, performing sentence alignment using the above bilingual alignment method for sentence alignment.
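As a non-limiting illustration, the two-level procedure (paragraph alignment first, then sentence alignment within each paired paragraph) may be sketched in Python as follows. The function name `align_two_levels` and its parameters are hypothetical. The splitting functions and the `align` function are supplied by the caller, and `align` is assumed to return a list of (unit from text A, unit from text B) pairs.

```python
def align_two_levels(text_a, text_b, split_paragraphs, split_sentences, align):
    """Align paragraphs first, then align sentences inside each paired paragraph."""
    paragraphs_a = split_paragraphs(text_a)
    paragraphs_b = split_paragraphs(text_b)
    sentence_pairs = []
    for para_a, para_b in align(paragraphs_a, paragraphs_b):
        # Sentence alignment is only attempted inside an already paired paragraph.
        sentence_pairs.extend(align(split_sentences(para_a), split_sentences(para_b)))
    return sentence_pairs

# Toy stand-ins used only to exercise the control flow.
toy_split_paragraphs = lambda text: text.split("\n")
toy_split_sentences = lambda para: [s.strip() + "." for s in para.split(".") if s.strip()]
toy_align = lambda units_a, units_b: list(zip(units_a, units_b))  # pairs units in order

pairs = align_two_levels("A1. A2.\nB1.", "a1. a2.\nb1.",
                         toy_split_paragraphs, toy_split_sentences, toy_align)
```

Restricting sentence alignment to paired paragraphs keeps each sentence-level similarity matrix small and prevents a sentence from being paired with one in an unrelated paragraph.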

To achieve the above purpose, the technical solution of the present invention further provides a bilingual alignment apparatus, comprising:

a segmentation module, configured to segment two texts to be aligned at the same linguistic unit level;

a text similarity calculation module, configured to calculate the text similarity between each part obtained by segmenting one of the two texts and each part obtained by segmenting the other text;

a matrix module, configured to establish a text similarity matrix B;

a pairing module, configured to pair the parts obtained by segmenting the two texts by successively taking the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the row and the column of the text similarity matrix B that correspond to the determined pairing are set to an end identifier.

Further, the pairing module comprises:

a first judging unit, configured to judge whether the value of the current maximum element in the text similarity matrix B is greater than a preset value;

a first pairing unit, configured to pair the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, and to set the values of the elements in the row and the column of the current maximum element to the end identifier;

a second judging unit, configured to judge whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, to control the first judging unit to judge again whether the value of the current maximum element in the text similarity matrix B is greater than the preset value; if not, to control the ending unit to end the pairing process;

an ending unit, configured to end the pairing process;

a third judging unit, configured to judge whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier;

a second pairing unit, configured to merge the part of the other text corresponding to the adjacent column that has not been set to the end identifier with the part of the other text corresponding to the current maximum element, to pair the merged part with the part of the one text corresponding to the current maximum element, and to set the values of the elements in that adjacent column and in the row and the column of the current maximum element to the end identifier;

a third pairing unit, configured to pair the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, and to set the values of the elements in the row and the column of the current maximum element to the end identifier;

a fourth judging unit, configured to judge whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, to control the third judging unit to judge again whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier; if not, to control the ending unit to end the pairing process.

Further, the one text is an English text, and the text similarity calculation module comprises:

an acquisition unit, configured to obtain the English text produced by translating the j-th part obtained by segmenting the other text;

a comparison unit, configured to compare the number of words in the i-th part obtained by segmenting the one text with the number of words in the English text produced by translating the j-th part obtained by segmenting the other text;

a calculation unit, configured to calculate

$$K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$$

To achieve the above purpose, the technical solution of the present invention further provides a bilingual alignment system, comprising the above bilingual alignment apparatus for paragraph alignment and the above bilingual alignment apparatus for sentence alignment;

paragraph alignment between a first text and a second text is performed by the above bilingual alignment apparatus for paragraph alignment, to obtain several pairs of paired paragraphs;

for each pair of paired paragraphs, sentence alignment is performed by the above bilingual alignment apparatus for sentence alignment.

The bilingual alignment method provided by the present invention calculates the text similarity between each part obtained by segmenting one of the two texts to be aligned and each part obtained by segmenting the other text, establishes a text similarity matrix, and then carries out the pairing process step by step, each time starting from the current maximum element of the matrix and proceeding to progressively smaller values, which helps to improve the accuracy of bilingual alignment.

Brief description of the drawings

The above and other purposes, features and advantages of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:

Fig. 1 is a flow chart of a bilingual alignment method provided by an embodiment of the present invention;

Fig. 2 is a flow chart of another bilingual alignment method provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of a text similarity matrix provided by an embodiment of the present invention;

Fig. 4 is a flow chart of yet another bilingual alignment method provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of a bilingual alignment apparatus provided by an embodiment of the present invention.

Detailed description of the embodiments

The present invention is described below on the basis of embodiments, but the present invention is not limited to these embodiments. In the following detailed description, certain specific details are described at length. To avoid obscuring the essence of the invention, well-known methods, procedures, processes and elements are not described in detail.

In addition, those of ordinary skill in the art should understand that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.

Unless the context clearly requires otherwise, words such as "include" and "comprise" throughout the description and claims are to be construed in an inclusive sense rather than an exclusive or exhaustive sense; that is, in the sense of "including but not limited to".

In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise stated, "a plurality of" means two or more.

Referring to Fig. 1, Fig. 1 is a flow chart of a bilingual alignment method provided by an embodiment of the present invention. The method comprises:

Step S1: segmenting two texts to be aligned at the same linguistic unit level;

Step S2: calculating the text similarity between each part obtained by segmenting one of the two texts and each part obtained by segmenting the other text;

Step S3: establishing a text similarity matrix B:

$$B = \begin{bmatrix} K_{11} & K_{12} & \cdots & K_{1m} \\ K_{21} & K_{22} & \cdots & K_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ K_{n1} & K_{n2} & \cdots & K_{nm} \end{bmatrix}$$

wherein n is the number of parts obtained by segmenting the one text, m is the number of parts obtained by segmenting the other text, and the element K_ij of the text similarity matrix B is the text similarity between the i-th part obtained by segmenting the one text and the j-th part obtained by segmenting the other text. The number of each part obtained by segmenting the one text (that is, the value of i) corresponds to its position in the one text (the part at the start of the one text is numbered 1, the next part is numbered 2, ..., and the part at the end of the text is numbered n), and the number of each part obtained by segmenting the other text (that is, the value of j) corresponds to its position in the other text (the part at the start of the other text is numbered 1, the next part is numbered 2, ..., and the part at the end of the text is numbered m);

Step S4: pairing the parts obtained by segmenting the two texts by successively taking the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the row and the column of the text similarity matrix B that correspond to the determined pairing are set to an end identifier; for example, the end identifier may be 0.

The bilingual alignment method provided by this embodiment of the present invention calculates the text similarity between each part obtained by segmenting one of the two texts to be aligned and each part obtained by segmenting the other text, establishes a text similarity matrix, and then carries out the pairing process step by step, each time starting from the current maximum element of the matrix and proceeding to progressively smaller values, which helps to improve the accuracy of bilingual alignment.
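As a non-limiting illustration, the pairing of Step S4 may be sketched in Python as follows, using 0 as the end identifier. The function name `pair_by_current_maximum` is hypothetical. The sketch assumes that ties are resolved in favour of the first maximum found in row-major order, and that an element whose similarity is genuinely 0 is treated the same as an ended element; neither choice is specified in the description.

```python
def pair_by_current_maximum(B, end_marker=0.0):
    """Repeatedly pair the row and column of the current maximum element,
    then overwrite that row and column with the end identifier.
    Returns 1-based (i, j) pairs in the order in which they were determined."""
    work = [list(row) for row in B]  # work on a copy so the caller's matrix is untouched
    pairs = []
    while True:
        best = None
        for i, row in enumerate(work):
            for j, value in enumerate(row):
                if value != end_marker and (best is None or value > work[best[0]][best[1]]):
                    best = (i, j)
        if best is None:  # every remaining element carries the end identifier
            break
        i, j = best
        pairs.append((i + 1, j + 1))
        for c in range(len(work[i])):
            work[i][c] = end_marker  # end the paired row
        for r in range(len(work)):
            work[r][j] = end_marker  # end the paired column
    return pairs

# Hypothetical 2 x 3 matrix: rows are parts of the one text, columns parts of the other.
example_pairs = pair_by_current_maximum([[0.9, 0.1, 0.2],
                                         [0.3, 0.8, 0.1]])
```

In this hypothetical matrix the third column is never paired, because both rows are ended before it is reached.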

Referring to Fig. 2, Fig. 2 is a flow chart of another bilingual alignment method provided by an embodiment of the present invention. This method is used to achieve paragraph alignment and comprises:

Step A1: segmenting two texts to be aligned at the paragraph level (that is, splitting them into paragraphs);

Since a text is usually stored as a single character string in which every paragraph ends with a newline character, the text can be split here using the newline character "\n" as the delimiter;

For example, splitting one of the two texts yields n paragraphs, and splitting the other text yields m paragraphs. In addition, after the bilingual parallel corpus has been split into paragraphs, the paragraphs may be numbered according to their order in the text, in a form such as 1_text;
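As a non-limiting illustration, Step A1 may be sketched in Python as follows. The function name `split_paragraphs` is hypothetical. The sketch assumes that the form 1_text means the paragraph number joined to the paragraph text by an underscore, and it discards blank lines, which the description does not address.

```python
def split_paragraphs(text: str) -> list[str]:
    """Split a text on newline characters and prefix each paragraph with its 1-based number."""
    paragraphs = [p.strip() for p in text.split("\n")]
    paragraphs = [p for p in paragraphs if p]  # assumption: blank lines are not paragraphs
    return [f"{k}_{p}" for k, p in enumerate(paragraphs, start=1)]

numbered = split_paragraphs("First paragraph.\n\nSecond paragraph.\n")
# numbered == ["1_First paragraph.", "2_Second paragraph."]
```

Because the numbering follows the order of the text, the numbers can serve directly as the row index i or column index j of the similarity matrix.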

Step A2: calculating the text similarity between each paragraph obtained by segmenting one of the two texts and each paragraph obtained by segmenting the other text;

For example, the one text is an English text (the source-language text) and the other text is a Chinese text (the target-language text), and the text similarity K_ij between two paragraphs may be calculated as follows:

Step A201: obtaining the English text produced by translating the j-th paragraph obtained by segmenting the other text;

For example, the m Chinese paragraphs may be translated into English by calling a translation API; after translation, each paragraph keeps the same number as the Chinese paragraph before translation;

Step A202: comparing the number of words in the i-th paragraph obtained by segmenting the one text with the number of words in the English text produced by translating the j-th paragraph obtained by segmenting the other text;

Step A203: calculating

$$K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$$

wherein L is the number of words in whichever of the two compared texts has more words, and N_v is the value assigned to the v-th word of that longer text: if the text with fewer words contains a word with the same root as the v-th word, N_v is 1; otherwise N_v is 0;

It should be noted that if the comparison in Step A202 shows that the two texts contain the same number of words, either one may be treated as the text with more words and the other as the text with fewer words;

Once the text similarity between two paragraphs has been calculated, the larger its value, the more likely it is that the two paragraphs correspond to each other;

Step A3: establishing a text similarity matrix B:

$$B = \begin{bmatrix} K_{11} & K_{12} & \cdots & K_{1m} \\ K_{21} & K_{22} & \cdots & K_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ K_{n1} & K_{n2} & \cdots & K_{nm} \end{bmatrix}$$

wherein n is the number of paragraphs obtained by segmenting the one text, m is the number of paragraphs obtained by segmenting the other text, and the element K_ij of the text similarity matrix B is the text similarity between the i-th paragraph obtained by segmenting the one text and the j-th paragraph obtained by segmenting the other text. The number of each paragraph obtained by segmenting the one text (that is, the value of i) corresponds to its paragraph position in the one text (the paragraph at the start of the one text is numbered 1, the next paragraph is numbered 2, ..., and the paragraph at the end of the text is numbered n), and the number of each paragraph obtained by segmenting the other text (that is, the value of j) corresponds to its paragraph position in the other text (the paragraph at the start of the other text is numbered 1, the next paragraph is numbered 2, ..., and the paragraph at the end of the text is numbered m);
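As a non-limiting illustration, building the matrix B from the two lists of paragraphs may be sketched in Python as follows. The function name `build_similarity_matrix` is hypothetical; the translation function and the similarity function are passed in by the caller. The toy dictionary and the exact-match similarity below are stand-ins used only to show the n-row by m-column layout, not the translation API or the similarity measure of the embodiment.

```python
def build_similarity_matrix(parts_one, parts_other, translate, similarity):
    """Return B as a list of n rows (parts of the one text),
    each holding m values (one per part of the other text)."""
    translated = [translate(p) for p in parts_other]  # translate each part only once
    return [[similarity(a, t) for t in translated] for a in parts_one]

# Toy stand-ins (assumptions for illustration only).
toy_dictionary = {"你好": "hello", "世界": "world"}

def toy_translate(part):
    return toy_dictionary.get(part, "")

def toy_similarity(a, b):
    return 1.0 if a == b else 0.0

B = build_similarity_matrix(["hello", "world", "extra"], ["你好", "世界"],
                            toy_translate, toy_similarity)
# B has n = 3 rows and m = 2 columns; the third row matches nothing.
```

Translating each part of the other text once, before the double loop, avoids calling the translation function n times for the same part.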

Step A4: pairing the paragraphs obtained by segmenting the two texts by successively taking the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the row and the column of the text similarity matrix B that correspond to the determined pairing are set to an end identifier;

It should be noted that the values of m and n are not necessarily equal. Although the two texts to be aligned are two articles with the same content, differences in expression may cause one paragraph of the source-language text to be split into several paragraphs in the target-language text, so that one row of the matrix B may correspond to several columns. Therefore, to further improve pairing accuracy, the separated paragraphs need to be merged again while the paragraphs are paired one by one. Specifically, Step A4 may comprise:

Step A401: judging whether the value of the current maximum element in the text similarity matrix B is greater than a preset value (that is, a threshold set in advance; for example, the preset value may be 0.5, 0.6 or 0.7); if so, executing Step A402; if not, executing Step A405;

Step A402: pairing the paragraph of the one text corresponding to the current maximum element with the paragraph of the other text corresponding to the current maximum element, setting the values of all elements in the row and the column of the current maximum element to the end identifier (elements set to the end identifier no longer take part in subsequent pairing; for example, the end identifier is 0), and then executing Step A403;

Step A403: judging whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeating Step A401; if not, executing Step A404;

It should be noted that a column set to the end identifier is a column in which the values of all elements have been set to the end identifier, and a row set to the end identifier is a row in which the values of all elements have been set to the end identifier;

Step A404: ending the pairing process;

Step A405: judging whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier; if so, executing Step A406; if not, executing Step A407;

Step A406: merging the paragraph of the other text corresponding to the adjacent column that has not been set to the end identifier with the paragraph of the other text corresponding to the current maximum element, pairing the merged paragraph with the paragraph of the one text corresponding to the current maximum element, setting the values of all elements in that adjacent column and in the row and the column of the current maximum element to the end identifier, and then executing Step A408;

Step A407: pairing the paragraph of the one text corresponding to the current maximum element with the paragraph of the other text corresponding to the current maximum element, setting the values of all elements in the row and the column of the current maximum element to the end identifier, and then executing Step A408;

Step A408: judging whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeating Step A405; if not, executing Step A404;

The pairing process in this embodiment comprises two stages. The first stage comprises Steps A401 to A403 above. In this stage, pairings are found using the preset value: first, the current maximum element of the matrix B is found, whose value is the text similarity between the two paragraphs corresponding to its row and column, and it is judged whether this value is greater than the preset value. If it is, the paragraphs corresponding to that row and column are regarded as a matching pair, the coordinates are taken out and saved, and the values of all elements in that row and column of the matrix B are set to the end identifier. The maximum element of the updated matrix is then found again, and the above process is repeated;

When the value of the current maximum element in the matrix B is less than or equal to the preset value, the second stage begins, comprising Steps A405 to A408 above. At this point, the rows and columns of the matrix that have not been set to the end identifier correspond to paragraphs that have not yet been paired. These are generally cases in which the paragraph of a row has been split into several paragraphs across columns, which is why no highly similar paragraph was found for them in the first stage. In this stage, the current maximum element of the matrix B is found first, and a search is made from its position forward (that is, to the left of the column of the maximum element) and backward (that is, to the right of the column of the maximum element) to judge whether the left and right adjacent columns have been set to the end identifier. An adjacent column that has not been set to the end identifier is regarded as separated text; it is merged with the column of the maximum element, and the merged text is aligned with the row of the current maximum element. The values of all elements in the corresponding row and columns are then set to the end identifier. These operations are repeated until the matrix no longer contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier;

By the above method, the paragraphs of the source-language article and the target-language article (that is, the two texts to be aligned) are put into correspondence one by one;
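As a non-limiting illustration, the two-stage pairing of Steps A401 to A408 may be sketched in Python as follows. The function name `two_stage_pairing` is hypothetical. Instead of overwriting matrix elements with the end identifier, the sketch tracks the rows and columns that are still open, which is equivalent bookkeeping. It assumes that only the immediately adjacent left and right columns are candidates for merging, and that ties are resolved in favour of the first maximum in row-major order; the description does not fix either point.

```python
def two_stage_pairing(B, preset=0.6):
    """Return 1-based pairings as (row, [columns]); a pairing with several
    columns means those parts of the other text were merged."""
    n = len(B)
    m = len(B[0]) if n else 0
    open_rows, open_cols = set(range(n)), set(range(m))
    pairs = []

    def current_max():
        # Maximum over elements whose row and column are both still open.
        candidates = ((i, j) for i in sorted(open_rows) for j in sorted(open_cols))
        return max(candidates, key=lambda ij: B[ij[0]][ij[1]])

    def close(i, cols):
        # Equivalent to setting row i and the given columns to the end identifier.
        open_rows.discard(i)
        open_cols.difference_update(cols)

    # Stage 1 (A401 to A403): pair while the current maximum exceeds the preset value.
    while open_rows and open_cols:
        i, j = current_max()
        if B[i][j] <= preset:
            break
        pairs.append((i + 1, [j + 1]))
        close(i, [j])

    # Stage 2 (A405 to A408): merge still-open adjacent columns before pairing.
    while open_rows and open_cols:
        i, j = current_max()
        cols = [c for c in (j - 1, j, j + 1) if c in open_cols]
        pairs.append((i + 1, [c + 1 for c in cols]))
        close(i, cols)
    return pairs

# Hypothetical 3 x 4 matrix: row 2 of the one text was split into columns 2 and 3.
example = two_stage_pairing([[0.9, 0.1, 0.1, 0.1],
                             [0.1, 0.4, 0.5, 0.1],
                             [0.1, 0.1, 0.1, 0.8]], preset=0.6)
```

In this hypothetical matrix, rows 1 and 3 are paired in the first stage because 0.9 and 0.8 exceed the preset value; row 2 falls through to the second stage, where columns 2 and 3 are merged and paired with it.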

For example, the matrix established in Step A3 is as shown in Fig. 3 and the preset value is 0.6. The processing is as follows:

First stage:

1. The maximum element of the current matrix is the element at position a1, and its value is greater than 0.6, so the 9th paragraph obtained by segmenting the other text is paired with the 7th paragraph obtained by segmenting the one text, and the values of all elements in column 9 and row 7 are then set to 0;

2. The maximum element of the current matrix is then the element at position a2, and its value is greater than 0.6, so the 3rd paragraph obtained by segmenting the other text is paired with the 3rd paragraph obtained by segmenting the one text, and the values of all elements in column 3 and row 3 are then set to 0;

3. The above steps are repeated, finding in turn the elements at positions a3 and a4: the 2nd paragraph obtained by segmenting the other text is paired with the 2nd paragraph obtained by segmenting the one text, then the 6th paragraph obtained by segmenting the other text is paired with the 5th paragraph obtained by segmenting the one text, and the elements in the corresponding columns and rows are set to 0;

When the value of the maximum element of the current matrix is less than or equal to 0.6, the second stage is carried out:

1. The element at position b1 is the current maximum element. A search is made to the left and to the right for an adjacent column that has not been set to 0, and none is found, so the 10th paragraph obtained by segmenting the other text is paired with the 9th paragraph obtained by segmenting the one text, and the elements in the corresponding column and row are set to 0;

2. The element at position b2 is then the current maximum element. A search is made to the left and to the right for an adjacent column that has not been set to 0, and the column containing the element at position b3 is found not to have been set to 0, so the 7th and 8th paragraphs obtained by segmenting the other text are merged, the merged paragraph is paired with the 8th paragraph obtained by segmenting the one text, and the elements in the corresponding columns and row are set to 0;

3. The above steps are repeated: the 4th and 5th paragraphs obtained by segmenting the other text are merged and paired with the 4th paragraph obtained by segmenting the one text, and the elements in the corresponding columns and row are set to 0; the 1st paragraph obtained by segmenting the other text is paired with the 1st paragraph obtained by segmenting the one text, and the elements in the corresponding column and row are set to 0. In the end, only row 6 has not been set to 0;

It should be noted that a row or column left unpaired (such as row 6 in Fig. 3) indicates that one of the two language texts contains content that is not present in the other language text.

Referring to Fig. 4, Fig. 4 is a flow chart of yet another bilingual alignment method provided by an embodiment of the present invention. This method is used to achieve sentence alignment and comprises:

Step B1: segmenting two texts to be aligned at the sentence level (that is, splitting them into sentences);

For example, the two texts to be aligned may be two paragraphs that have been paired with each other;

For example, the paragraph text may be split into sentences according to punctuation marks. Taking Chinese-English alignment as an example, Chinese is split at the full stop "。" and English is split at the period ".";

In addition, in practice, irregular writing often makes punctuation-based splitting unsatisfactory, and sentences in the two languages may be semantically run together. To address this, regular expressions may additionally be used to split sentences, or sentences that split poorly may simply be merged into one, so as to avoid pairing errors;
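As a non-limiting illustration, punctuation-based sentence splitting with a regular expression may be sketched in Python as follows. The function name `split_sentences` is hypothetical. The pattern handles only the Chinese full stop "。" and the English period "."; it is a simplification that would also split at decimal points and abbreviations, which a practical implementation would need to handle separately.

```python
import re

# A run of non-terminator characters followed by any terminators that close it.
_SENTENCE = re.compile(r"[^。.]+[。.]*")

def split_sentences(text: str) -> list[str]:
    """Split text into sentences at "。" or ".", keeping the closing punctuation."""
    sentences = [s.strip() for s in _SENTENCE.findall(text)]
    return [s for s in sentences if s]  # drop fragments that were only whitespace

chinese = split_sentences("你好。再见。")
english = split_sentences("Hi there. Bye.")
```

Keeping the closing punctuation attached to each sentence means that a run of periods such as an ellipsis stays with the sentence it ends rather than becoming separate fragments.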

For example, n sentence is obtained, by another by carrying out subordinate sentence processing to a text in above-mentioned two text Text carries out subordinate sentence processing, obtains m sentence, wherein each sentence can be numbered according to text tandem；

Step B2: it calculates each sentence that a text segmentation obtains in two text and is obtained with another text segmentation Each sentence text similarity；

For example, said one text is English text (text of original language), another text is Chinese text (target language The text of speech), the text similarity K between two sentences can be calculated in the following ways_ij:

Step B201: the English text that j-th of sentence translation that another described text segmentation obtains obtains is obtained；

It is English form by above-mentioned m Chinese sentence translation for example, API can be translated by calling, it is each after translation The tag number of sentence is still identical as the Chinese sentence number before translation；

Step B202: i-th of sentence that more one text segmentation obtains is obtained with another described text segmentation The obtained English text of j-th of sentence translation in word quantity；

Step B203: it calculates

It should be noted that if the comparison in step B202 finds the two word counts equal, either one may be treated as the one with more words and the other as the one with fewer words;

Once the text similarity between two sentences has been calculated, the larger its value, the more likely it is that the two sentences correspond to each other;
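Steps B202 and B203 can be sketched as follows, assuming the similarity is the fraction of words in the longer sentence that share a root with some word in the shorter one, which is how the definitions of L and N_v in the claims read. The function names are hypothetical, and `crude_root` is only a rough stand-in for a real stemmer, which the text does not specify.

```python
import re


def crude_root(word):
    """Rough stand-in for a stemmer: lowercase and strip a few common suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(sentence):
    """Extract English words, ignoring punctuation and digits."""
    return re.findall(r"[A-Za-z']+", sentence)


def text_similarity(sentence_a, sentence_b):
    """Share of words in the longer sentence whose root also occurs in the shorter one."""
    words_a, words_b = tokenize(sentence_a), tokenize(sentence_b)
    # Step B202: compare word counts; on a tie either sentence may act as the longer one.
    if len(words_a) >= len(words_b):
        longer, shorter = words_a, words_b
    else:
        longer, shorter = words_b, words_a
    if not longer:
        return 0.0
    shorter_roots = {crude_root(w) for w in shorter}
    # Step B203: N_v is 1 when the v-th word of the longer sentence has a matching root.
    hits = sum(1 for w in longer if crude_root(w) in shorter_roots)
    return hits / len(longer)


score = text_similarity("The cat sleeps", "The cats sleep on the mat")
```

Dividing by the length of the longer sentence keeps the value between 0 and 1, which is consistent with the thresholds such as 0.5 to 0.7 used in the matching steps.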

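Step B3: establish the text similarity matrix B:

$$B = \begin{pmatrix} K_{11} & K_{12} & \cdots & K_{1m} \\ K_{21} & K_{22} & \cdots & K_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ K_{n1} & K_{n2} & \cdots & K_{nm} \end{pmatrix}$$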

where n is the number of sentences obtained by splitting the one text, m is the number of sentences obtained by splitting the other text, and the element K_ij of the text similarity matrix B is the text similarity between the i-th sentence of the one text and the j-th sentence of the other text. The number of each sentence of the one text (the value of i) corresponds to its position in that text: the first sentence is numbered 1, the next is numbered 2, and so on, up to n for the last sentence. Likewise, the number of each sentence of the other text (the value of j) corresponds to its position in that text: the first sentence is numbered 1, the next is numbered 2, and so on, up to m for the last sentence;
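Building the matrix can be sketched as a nested loop, with rows indexed by the sentences of the one text and columns by the sentences of the other text. The function name `build_similarity_matrix` is hypothetical and the similarity function is passed in; the toy similarity below is only for illustration. Python lists are 0-based, so B[i-1][j-1] holds K_ij.

```python
def build_similarity_matrix(sentences_a, sentences_b, similarity):
    """Return an n x m list of lists; row i-1, column j-1 holds K_ij."""
    return [[similarity(a, b) for b in sentences_b] for a in sentences_a]


# Toy similarity for illustration: 1.0 for identical strings, 0.0 otherwise.
exact_match = lambda a, b: 1.0 if a == b else 0.0

B = build_similarity_matrix(["x", "y", "z"], ["y", "x"], exact_match)
```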

Step B4: pair the sentences obtained by splitting the two texts one pair at a time, each time using the current maximum element of the text similarity matrix B, where, after each pairing is determined, the values of the elements in the column and row of the text similarity matrix B corresponding to that pairing are set to an end identifier; for example, the end identifier can be 0.
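The basic form of step B4, before the threshold and merging refinements, can be sketched as a greedy loop. The names `END`, `current_maximum` and `greedy_pairs` are hypothetical. The end identifier is taken as 0, as suggested above, which means a genuine similarity of 0 is never paired in this sketch.

```python
END = 0.0  # end identifier, following the example value given in the text


def current_maximum(b):
    """Return (value, row, col) of the largest element not set to END, or None."""
    best = None
    for i, row in enumerate(b):
        for j, value in enumerate(row):
            if value != END and (best is None or value > best[0]):
                best = (value, i, j)
    return best


def greedy_pairs(matrix):
    """Repeatedly pair the row and column of the current maximum element."""
    b = [row[:] for row in matrix]  # work on a copy so the input is left unchanged
    pairs = []
    while True:
        best = current_maximum(b)
        if best is None:
            break
        _, i, j = best
        pairs.append((i + 1, j + 1))  # report 1-based sentence numbers
        for col in range(len(b[i])):
            b[i][col] = END           # close the paired row
        for row in b:
            row[j] = END              # close the paired column
    return sorted(pairs)


pairs = greedy_pairs([[0.9, 0.2, 0.1], [0.3, 0.8, 0.4]])
```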

It should be noted that the values of m and n are not necessarily equal. Although the two texts to be aligned are two articles with the same meaning, differences in expression can cause one sentence in the source-language text to be split into several sentences in the target language, so that one row of matrix B corresponds to several columns. Therefore, to further improve matching accuracy, sentences that were split apart need to be merged again when sentences are matched one by one. Specifically, step B4 may include:

Step B401: judge whether the value of the current maximum element in the text similarity matrix B is greater than a preset value (a preset threshold; for example, the preset value can be 0.5, 0.6 or 0.7); if so, execute step B402; if not, execute step B405;

Step B402: pair the sentence of the one text corresponding to the current maximum element with the sentence of the other text corresponding to the current maximum element, and set the values of all elements in the row and column of the current maximum element to the end identifier (elements set to the end identifier take no part in subsequent matching; for example, the end identifier is 0); then execute step B403;

Step B403: judge whether the text similarity matrix B currently contains both a column not set to the end identifier and a row not set to the end identifier; if so, repeat step B401; if not, execute step B404;

Step B404: end the pairing process;

Step B405: judge whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier; if so, execute step B406; if not, execute step B407;

Step B406: merge the sentence of the other text corresponding to the adjacent column not set to the end identifier with the sentence of the other text corresponding to the current maximum element, pair the merged sentence with the sentence of the one text corresponding to the current maximum element, and set to the end identifier the values of all elements in that adjacent column and in the row and column of the current maximum element; then execute step B408;

Step B407: pair the sentence of the one text corresponding to the current maximum element with the corresponding sentence of the other text, and set the values of all elements in the row and column of the current maximum element to the end identifier; then execute step B408;

Step B408: judge whether the text similarity matrix B currently contains both a column not set to the end identifier and a row not set to the end identifier; if so, repeat step B405; if not, execute step B404.
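The flow of steps B401 to B408 can be sketched as a single loop with two phases. This is one reading of the steps under stated assumptions, not a definitive implementation: the name `align_with_merging` is hypothetical, sets of open rows and columns stand in for writing the end identifier into the matrix, and when both neighbouring columns are still open the sketch picks the one with the higher similarity in the same row, a choice the text leaves unspecified.

```python
def align_with_merging(matrix, threshold=0.6):
    """Return (row_number, [column_numbers]) pairings, numbered from 1."""
    n = len(matrix)
    m = len(matrix[0]) if n else 0
    open_rows, open_cols = set(range(n)), set(range(m))
    pairs = []
    confident = True  # True while step B401 still applies
    while open_rows and open_cols:  # steps B403 and B408
        value, i, j = max(
            ((matrix[r][c], r, c) for r in sorted(open_rows) for c in sorted(open_cols)),
            key=lambda t: t[0],
        )
        if confident and value <= threshold:
            confident = False  # step B401 answered "no": stay in B405-B408 from now on
        cols = [j]
        if not confident:
            # Step B405: look for an adjacent column that is still open.
            neighbours = [c for c in (j - 1, j + 1) if c in open_cols]
            if neighbours:
                # Step B406: merge the neighbouring column's sentence into the pairing.
                cols.append(max(neighbours, key=lambda c: matrix[i][c]))
        # Steps B402, B406, B407: record the pairing and close its row and columns.
        pairs.append((i + 1, sorted(c + 1 for c in cols)))
        open_rows.discard(i)
        open_cols.difference_update(cols)
    return sorted(pairs)


result = align_with_merging([[0.9, 0.1, 0.0], [0.1, 0.4, 0.3]], threshold=0.6)
```

In the example, the first row pairs confidently with column 1, and the second row, whose best score is below the threshold, absorbs the two remaining adjacent columns.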

An embodiment of the invention further provides a bilingual alignment method, comprising:

Step C1: align paragraphs between a first text and a second text using the method shown in Fig. 2, obtaining several pairs of matched paragraphs;

Step C2: for each pair of matched paragraphs obtained in step C1, align sentences using the method shown in Fig. 3.
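The two-stage structure of steps C1 and C2 can be sketched as follows. The aligner and the sentence splitter are passed in as callables, since this sketch is only about the order of operations; `two_stage_align`, `align_units` and `split_sentences` are hypothetical names, and the toy aligner in the example simply pairs units by position.

```python
def two_stage_align(paragraphs_a, paragraphs_b, align_units, split_sentences):
    """Align paragraphs first, then align sentences inside each matched pair."""
    aligned = []
    for para_a, para_b in align_units(paragraphs_a, paragraphs_b):  # step C1
        sentences_a = split_sentences(para_a)
        sentences_b = split_sentences(para_b)
        aligned.append(align_units(sentences_a, sentences_b))       # step C2
    return aligned


# Toy stand-ins for illustration only.
by_position = lambda a, b: list(zip(a, b))
split_on_bar = lambda paragraph: paragraph.split("|")

aligned = two_stage_align(["a1|a2", "a3"], ["b1|b2", "b3"], by_position, split_on_bar)
```

Restricting sentence matching to already matched paragraphs keeps each similarity matrix small and limits the candidates a repeated or highly similar sentence can be confused with.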

The bilingual alignment method provided by the embodiments of the invention addresses problems that occur in texts, such as non-standard writing, positional mismatch, semantic repetition and semantic splitting. It uses a two-stage approach that first matches paragraphs and then matches sentences within each paragraph, which avoids the many-to-one matches, and the mismatching of repeated sentences or highly similar sentences within a text, that arise when sentences are matched directly. In addition, results are selected on a one-to-one, maximum-probability basis: starting from the sentence pair with the highest text similarity and working downward, the pairings that can be determined are settled first and the uncertain pairings are handled afterwards, which helps ensure the accuracy of bilingual alignment.

An embodiment of the invention further provides a bilingual alignment device, comprising:

a segmentation module 1, for splitting the two texts to be aligned at the same linguistic unit level;

a text similarity calculation module 2, for calculating the text similarity between each part obtained by splitting one of the two texts and each part obtained by splitting the other text;

a matrix module 3, for establishing the text similarity matrix B;

a matching module 4, for pairing the parts obtained by splitting the two texts one pair at a time, each time using the current maximum element of the text similarity matrix B, where, after each pairing is determined, the values of the elements in the column and row of the text similarity matrix B corresponding to that pairing are set to an end identifier.

In one embodiment, the matching module includes:

an end unit, for ending the pairing process;

In one embodiment, the one text is an English text, and the text similarity calculation module includes:

a calculation unit, for calculating the text similarity K_ij.

In one embodiment, the linguistic unit level is the paragraph level, so as to align paragraphs between the two texts.

In one embodiment, the linguistic unit level is the sentence level, so as to align sentences between the two texts.

An embodiment of the invention further provides a bilingual alignment system, comprising the above bilingual alignment device for paragraph alignment and the above bilingual alignment device for sentence alignment.

Those skilled in the art will readily appreciate that the above preferred embodiments can be freely combined and superimposed, provided they do not conflict.

It should be understood that the above embodiments are merely exemplary and not restrictive. Various obvious or equivalent modifications or replacements that those skilled in the art may make to the above details without departing from the basic principles of the invention all fall within the scope of the claims of the invention.

Claims

1. A bilingual alignment method, characterized by comprising:

Step S2: calculate the text similarity between each part obtained by splitting one of the two texts and each part obtained by splitting the other text;

Step S3: establish the text similarity matrix B;

where n is the number of parts obtained by splitting the one text, m is the number of parts obtained by splitting the other text, and the element K_ij of the text similarity matrix B is the text similarity between the i-th part obtained by splitting the one text and the j-th part obtained by splitting the other text;

Step S4: pair the parts obtained by splitting the two texts one pair at a time, each time using the current maximum element of the text similarity matrix B, where, after each pairing is determined, the values of the elements in the column and row of the text similarity matrix B corresponding to that pairing are set to an end identifier;

wherein step S4 includes:

Step S401: judge whether the value of the current maximum element in the text similarity matrix B is greater than a preset value; if so, execute step S402; if not, execute step S405;

Step S402: pair the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, and set the values of the elements in the row and column of the current maximum element to the end identifier; then execute step S403;

Step S403: judge whether the text similarity matrix B currently contains both a column not set to the end identifier and a row not set to the end identifier; if so, repeat step S401; if not, execute step S404;

Step S404: end the pairing process;

Step S405: judge whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier; if so, execute step S406; if not, execute step S407;

Step S406: merge the part of the other text corresponding to the adjacent column not set to the end identifier with the part of the other text corresponding to the current maximum element, pair the merged part with the part of the one text corresponding to the current maximum element, and set to the end identifier the values of the elements in that adjacent column and in the row and column of the current maximum element; then execute step S408;

Step S407: pair the part of the one text corresponding to the current maximum element with the corresponding part of the other text, and set the values of the elements in the row and column of the current maximum element to the end identifier; then execute step S408;

Step S408: judge whether the text similarity matrix B currently contains both a column not set to the end identifier and a row not set to the end identifier; if so, repeat step S405; if not, execute step S404.

2. The method according to claim 1, characterized in that the one text is an English text, and in step S2 the text similarity K_ij is calculated as follows:

compare the number of words in the i-th part obtained by splitting the one text with the number of words in the English text produced by translating the j-th part obtained by splitting the other text;

calculate

$$K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$$

where L is the number of words in whichever of the two compared texts has more words, and N_v is the value for the v-th word of that text: N_v is 1 if the text with fewer words contains a word with the same root as the v-th word, and 0 otherwise.

3. The method according to any one of claims 1 to 2, characterized in that the linguistic unit level is the paragraph level, so as to align paragraphs between the two texts.

4. The method according to any one of claims 1 to 2, characterized in that the linguistic unit level is the sentence level, so as to align sentences between the two texts.

5. A bilingual alignment method, characterized by comprising:

aligning paragraphs between a first text and a second text using the method according to claim 3, obtaining several pairs of matched paragraphs;

for each pair of matched paragraphs, aligning sentences using the method according to claim 4.

6. A bilingual alignment device, characterized by comprising:

a text similarity calculation module, for calculating the text similarity between each part obtained by splitting one of the two texts and each part obtained by splitting the other text;

a matrix module, for establishing the text similarity matrix B;

a matching module, for pairing the parts obtained by splitting the two texts one pair at a time, each time using the current maximum element of the text similarity matrix B, where, after each pairing is determined, the values of the elements in the column and row of the text similarity matrix B corresponding to that pairing are set to an end identifier;

wherein the matching module includes:

a first judging unit, for judging whether the value of the current maximum element in the text similarity matrix B is greater than a preset value;

a first pairing unit, for pairing the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, and setting the values of the elements in the row and column of the current maximum element to the end identifier;

a second judging unit, for judging whether the text similarity matrix B currently contains both a column not set to the end identifier and a row not set to the end identifier; if so, controlling the first judging unit to judge again whether the value of the current maximum element in the text similarity matrix B is greater than the preset value; if not, controlling the end unit to end the pairing process;

an end unit, for ending the pairing process;

a third judging unit, for judging whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier;

a second pairing unit, for merging the part of the other text corresponding to the adjacent column not set to the end identifier with the part of the other text corresponding to the current maximum element, pairing the merged part with the part of the one text corresponding to the current maximum element, and setting to the end identifier the values of the elements in that adjacent column and in the row and column of the current maximum element;

a third pairing unit, for pairing the part of the one text corresponding to the current maximum element with the corresponding part of the other text, and setting the values of the elements in the row and column of the current maximum element to the end identifier;

a fourth judging unit, for judging whether the text similarity matrix B currently contains both a column not set to the end identifier and a row not set to the end identifier; if so, controlling the third judging unit to judge again whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier; if not, controlling the end unit to end the pairing process.

7. The device according to claim 6, characterized in that the one text is an English text, and the text similarity calculation module includes:

a comparing unit, for comparing the number of words in the i-th part obtained by splitting the one text with the number of words in the English text produced by translating the j-th part obtained by splitting the other text;

a calculation unit, for calculating the text similarity K_ij.

8. A bilingual alignment system, characterized by comprising a device implementing the method of claim 3 and a device implementing the method of claim 4;

the device implementing the method of claim 3 aligns paragraphs between a first text and a second text, obtaining several pairs of matched paragraphs;

for each pair of matched paragraphs, the device implementing the method of claim 4 aligns sentences.