CN109710950B - Bilingual alignment method, apparatus and system

Publication number
CN109710950B
Authority
CN
China
Legal status
Active
Application number
CN201811567535.1A
Other languages
Chinese (zh)
Other versions
CN109710950A (en
Inventor
聂镭
徐泓洋
郑权
张峰
聂颖
Current Assignee
Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Original Assignee
Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Application filed by Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Priority to CN201811567535.1A
Publication of CN109710950A
Application granted
Publication of CN109710950B

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a bilingual alignment method, apparatus and system. The method comprises: step S1: splitting the two texts to be aligned at the same linguistic-unit level; step S2: calculating the text similarity between each part obtained by splitting one of the two texts and each part obtained by splitting the other text; step S3: establishing a text similarity matrix B; step S4: successively pairing the parts obtained by splitting the two texts using the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the row and the column of the text similarity matrix B that correspond to the determined pairing are set to an end identifier. The invention helps improve the accuracy of bilingual alignment.

Description

Bilingual alignment method, apparatus and system
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a bilingual alignment method, apparatus and system.
Background art
Bilingual alignment is an important research topic in natural language processing. It establishes, in a bilingual corpus, the correspondence between units of the same linguistic level (such as sentences or paragraphs) in the source language and the target language, that is, which part of the source-language text and which part of the target-language text are translations of each other. Bilingual alignment is a prerequisite for using a parallel corpus to acquire further linguistic knowledge, and it is used in the preprocessing stage of systems such as machine translation. How to improve the accuracy of bilingual alignment is a problem that urgently needs to be solved.
Summary of the invention
In view of this, the object of the present invention is to provide a bilingual alignment method, apparatus and system that help improve the accuracy of bilingual alignment.
To achieve the above object, the technical solution of the present invention provides a bilingual alignment method, comprising:
Step S1: splitting the two texts to be aligned at the same linguistic-unit level;
Step S2: calculating the text similarity between each part obtained by splitting one of the two texts and each part obtained by splitting the other text;
Step S3: establishing a text similarity matrix B:
$$B = \begin{bmatrix} K_{11} & K_{12} & \cdots & K_{1m} \\ K_{21} & K_{22} & \cdots & K_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ K_{n1} & K_{n2} & \cdots & K_{nm} \end{bmatrix}$$
wherein n is the number of parts obtained by splitting the one text, m is the number of parts obtained by splitting the other text, and the element $K_{ij}$ of the text similarity matrix B is the text similarity between the i-th part obtained by splitting the one text and the j-th part obtained by splitting the other text;
Step S4: successively pairing the parts obtained by splitting the two texts using the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the row and the column of the text similarity matrix B that correspond to the determined pairing are set to an end identifier.
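The core of step S4 can be sketched as below. This is a minimal illustration rather than the patented implementation: the function name `greedy_pairs` is hypothetical, the end identifier is assumed to be 0, and every live similarity is assumed to be strictly positive so that it can be told apart from the end identifier.

```python
from typing import List, Tuple

END = 0.0  # assumed end identifier


def greedy_pairs(B: List[List[float]]) -> List[Tuple[int, int]]:
    """Pair rows with columns by repeatedly taking the current maximum element.

    Returns 1-based (i, j) pairs, matching the K_ij numbering.
    """
    B = [row[:] for row in B]  # work on a copy so the caller's matrix is untouched
    pairs = []
    while True:
        # Locate the current maximum element that is not the end identifier.
        best, bi, bj = END, -1, -1
        for i, row in enumerate(B):
            for j, value in enumerate(row):
                if value > best:
                    best, bi, bj = value, i, j
        if bi < 0:  # every remaining element is the end identifier
            break
        pairs.append((bi + 1, bj + 1))
        # Set the whole row and the whole column of the pairing to the end identifier.
        for j in range(len(B[bi])):
            B[bi][j] = END
        for row in B:
            row[bj] = END
    return pairs


print(greedy_pairs([[0.1, 0.9, 0.3], [0.7, 0.2, 0.4]]))  # [(1, 2), (2, 1)]
```

Because the row and column of each pairing are retired immediately, no part can be paired twice.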
Further, step S4 comprises:
Step S401: judging whether the value of the current maximum element of the text similarity matrix B is greater than a preset value; if so, executing step S402; if not, executing step S405;
Step S402: pairing the part of the one text that corresponds to the current maximum element with the part of the other text that corresponds to the current maximum element, setting the values of the elements in the row and the column containing the current maximum element to the end identifier, and then executing step S403;
Step S403: judging whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeating step S401; if not, executing step S404;
Step S404: ending the pairing process;
Step S405: judging whether the current maximum element of the text similarity matrix B has an adjacent column that has not been set to the end identifier; if so, executing step S406; if not, executing step S407;
Step S406: merging the part of the other text that corresponds to the adjacent column not set to the end identifier with the part of the other text that corresponds to the current maximum element, pairing the merged part with the part of the one text that corresponds to the current maximum element, setting the values of the elements in the adjacent column and in the row and the column containing the current maximum element to the end identifier, and then executing step S408;
Step S407: pairing the part of the one text that corresponds to the current maximum element with the part of the other text that corresponds to the current maximum element, setting the values of the elements in the row and the column containing the current maximum element to the end identifier, and then executing step S408;
Step S408: judging whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeating step S405; if not, executing step S404.
Further, the one text is an English text, and in step S2 the text similarity $K_{ij}$ is calculated as follows:
obtaining the English text produced by translating the j-th part obtained by splitting the other text;
comparing the number of words in the i-th part obtained by splitting the one text with the number of words in the English text produced by translating the j-th part obtained by splitting the other text;
calculating
$$K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$$
wherein L is the number of words in whichever of the two compared texts has more words, and $N_v$ is the value assigned to the v-th word of that text: if the text with fewer words contains a word with the same root as the v-th word, $N_v$ is 1; otherwise it is 0.
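A minimal sketch of this similarity follows, assuming $K_{ij}$ is the fraction of words in the longer text whose root also occurs in the shorter text (the sum of $N_v$ divided by L). The suffix-stripping helper `naive_root` is a hypothetical stand-in for a real stemmer, and both function names are illustrative only.

```python
import re


def naive_root(word: str) -> str:
    """Hypothetical stand-in for a stemmer: lowercase and strip a few common suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def similarity(part_i: str, translated_part_j: str) -> float:
    """Share of words in the longer text that have a same-root word in the shorter text."""
    words_a = re.findall(r"[A-Za-z']+", part_i)
    words_b = re.findall(r"[A-Za-z']+", translated_part_j)
    # On a tie either text may play the role of the longer one.
    if len(words_a) >= len(words_b):
        longer, shorter = words_a, words_b
    else:
        longer, shorter = words_b, words_a
    if not longer:
        return 0.0
    shorter_roots = {naive_root(w) for w in shorter}
    hits = sum(1 for w in longer if naive_root(w) in shorter_roots)  # sum of N_v
    return hits / len(longer)  # divide by L


print(similarity("The cats are sleeping", "The cat sleeps"))  # 0.75
```

In the usage line, "The", "cats" and "sleeping" each have a same-root word in the shorter text while "are" does not, giving 3 out of 4.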
Further, the linguistic-unit level is the paragraph level, so that paragraph alignment between the two texts is achieved.
Further, the linguistic-unit level is the sentence level, so that sentence alignment between the two texts is achieved.
To achieve the above object, the technical solution of the present invention further provides a bilingual alignment method, comprising:
performing paragraph alignment between a first text and a second text using the above bilingual alignment method for paragraph alignment, to obtain several pairs of paired paragraphs;
for each pair of paired paragraphs, performing sentence alignment using the above bilingual alignment method for sentence alignment.
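The two-level procedure can be sketched as follows. This is only an outline under stated assumptions: `align_two_levels`, `split_paragraphs`, `split_sentences` and `align` are hypothetical names, and `align` stands for any implementation of steps S2 to S4 that returns the matched pairs of units.

```python
from typing import Callable, List, Tuple

Pairs = List[Tuple[str, str]]


def align_two_levels(
    first_text: str,
    second_text: str,
    split_paragraphs: Callable[[str], List[str]],
    split_sentences: Callable[[str], List[str]],
    align: Callable[[List[str], List[str]], Pairs],
) -> List[Pairs]:
    """Align paragraphs first, then align sentences inside each paired paragraph."""
    paragraph_pairs = align(split_paragraphs(first_text), split_paragraphs(second_text))
    return [
        align(split_sentences(par_a), split_sentences(par_b))
        for par_a, par_b in paragraph_pairs
    ]
```

Restricting sentence alignment to paragraphs that are already paired keeps each sentence-level similarity matrix small.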
To achieve the above object, the technical solution of the present invention further provides a bilingual alignment apparatus, comprising:
a splitting module, configured to split the two texts to be aligned at the same linguistic-unit level;
a text similarity calculation module, configured to calculate the text similarity between each part obtained by splitting one of the two texts and each part obtained by splitting the other text;
a matrix module, configured to establish a text similarity matrix B:
$$B = \begin{bmatrix} K_{11} & K_{12} & \cdots & K_{1m} \\ K_{21} & K_{22} & \cdots & K_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ K_{n1} & K_{n2} & \cdots & K_{nm} \end{bmatrix}$$
wherein n is the number of parts obtained by splitting the one text, m is the number of parts obtained by splitting the other text, and the element $K_{ij}$ of the text similarity matrix B is the text similarity between the i-th part obtained by splitting the one text and the j-th part obtained by splitting the other text;
a matching module, configured to successively pair the parts obtained by splitting the two texts using the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the row and the column of the text similarity matrix B that correspond to the determined pairing are set to an end identifier.
Further, the matching module comprises:
a first judging unit, configured to judge whether the value of the current maximum element of the text similarity matrix B is greater than a preset value;
a first pairing unit, configured to pair the part of the one text that corresponds to the current maximum element with the part of the other text that corresponds to the current maximum element, and to set the values of the elements in the row and the column containing the current maximum element to the end identifier;
a second judging unit, configured to judge whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, to control the first judging unit to judge again whether the value of the current maximum element of the text similarity matrix B is greater than the preset value; if not, to control the end unit to end the pairing process;
an end unit, configured to end the pairing process;
a third judging unit, configured to judge whether the current maximum element of the text similarity matrix B has an adjacent column that has not been set to the end identifier;
a second pairing unit, configured to merge the part of the other text that corresponds to the adjacent column not set to the end identifier with the part of the other text that corresponds to the current maximum element, to pair the merged part with the part of the one text that corresponds to the current maximum element, and to set the values of the elements in the adjacent column and in the row and the column containing the current maximum element to the end identifier;
a third pairing unit, configured to pair the part of the one text that corresponds to the current maximum element with the part of the other text that corresponds to the current maximum element, and to set the values of the elements in the row and the column containing the current maximum element to the end identifier;
a fourth judging unit, configured to judge whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, to control the third judging unit to judge again whether the current maximum element of the text similarity matrix B has an adjacent column that has not been set to the end identifier; if not, to control the end unit to end the pairing process.
Further, the one text is an English text, and the text similarity calculation module comprises:
an acquiring unit, configured to obtain the English text produced by translating the j-th part obtained by splitting the other text;
a comparing unit, configured to compare the number of words in the i-th part obtained by splitting the one text with the number of words in the English text produced by translating the j-th part obtained by splitting the other text;
a computing unit, configured to calculate
$$K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$$
wherein L is the number of words in whichever of the two compared texts has more words, and $N_v$ is the value assigned to the v-th word of that text: if the text with fewer words contains a word with the same root as the v-th word, $N_v$ is 1; otherwise it is 0.
Further, the linguistic-unit level is the paragraph level, so that paragraph alignment between the two texts is achieved.
Further, the linguistic-unit level is the sentence level, so that sentence alignment between the two texts is achieved.
To achieve the above object, the technical solution of the present invention further provides a bilingual alignment system, comprising the above bilingual alignment apparatus for paragraph alignment and the above bilingual alignment apparatus for sentence alignment;
the bilingual alignment apparatus for paragraph alignment performs paragraph alignment between a first text and a second text to obtain several pairs of paired paragraphs;
for each pair of paired paragraphs, the bilingual alignment apparatus for sentence alignment performs sentence alignment.
The bilingual alignment method provided by the present invention calculates the text similarity between each part obtained by splitting one of the two texts to be aligned and each part obtained by splitting the other text, and establishes a text similarity matrix. Pairing then starts from the current maximum element of the matrix and proceeds step by step toward smaller values, which helps improve the accuracy of bilingual alignment.
Brief description of the drawings
The above and other objects, features and advantages of the present invention will become clearer from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
Fig. 1 is a flow chart of a bilingual alignment method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of another bilingual alignment method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a text similarity matrix provided by an embodiment of the present invention;
Fig. 4 is a flow chart of a further bilingual alignment method provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of a bilingual alignment apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described below on the basis of embodiments, but it is not limited to these embodiments. Some specific details are described at length in the detailed description below. To avoid obscuring the essence of the invention, well-known methods, procedures, processes and elements are not described in detail.
In addition, those of ordinary skill in the art should understand that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, words such as "include" and "comprise" throughout the specification and the claims are to be construed in an inclusive sense rather than an exclusive or exhaustive sense; that is, in the sense of "including but not limited to".
In the description of the present invention, it should be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise indicated, "multiple" means two or more.
Referring to Fig. 1, Fig. 1 is a flow chart of a bilingual alignment method provided by an embodiment of the present invention. The method comprises:
Step S1: splitting the two texts to be aligned at the same linguistic-unit level;
Step S2: calculating the text similarity between each part obtained by splitting one of the two texts and each part obtained by splitting the other text;
Step S3: establishing a text similarity matrix B:
$$B = \begin{bmatrix} K_{11} & K_{12} & \cdots & K_{1m} \\ K_{21} & K_{22} & \cdots & K_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ K_{n1} & K_{n2} & \cdots & K_{nm} \end{bmatrix}$$
wherein n is the number of parts obtained by splitting the one text, m is the number of parts obtained by splitting the other text, and the element $K_{ij}$ of the text similarity matrix B is the text similarity between the i-th part obtained by splitting the one text and the j-th part obtained by splitting the other text; the number of each part of the one text (the value of i) corresponds to its position in the one text (the part at the start of the text is numbered 1, the next part 2, ..., and the part at the end of the text n), and the number of each part of the other text (the value of j) corresponds to its position in the other text (the part at the start of the text is numbered 1, the next part 2, ..., and the part at the end of the text m);
Step S4: successively pairing the parts obtained by splitting the two texts using the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the row and the column of the text similarity matrix B that correspond to the determined pairing are set to an end identifier; for example, the end identifier may be 0.
The bilingual alignment method provided by this embodiment calculates the text similarity between each part obtained by splitting one of the two texts to be aligned and each part obtained by splitting the other text, and establishes a text similarity matrix. Pairing then starts from the current maximum element of the matrix and proceeds step by step toward smaller values, which helps improve the accuracy of bilingual alignment.
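Steps S2 and S3 amount to filling an n-by-m table, one row per part of the one text and one column per part of the other. The sketch below assumes nothing about the similarity measure itself: `build_similarity_matrix` is a hypothetical name, and `shared_word_ratio` is a toy measure used only to make the example runnable.

```python
from typing import Callable, List


def build_similarity_matrix(
    parts_one: List[str],
    parts_other: List[str],
    sim: Callable[[str, str], float],
) -> List[List[float]]:
    """Return B with B[i-1][j-1] = K_ij for the i-th part of one text and the j-th of the other."""
    return [[sim(a, b) for b in parts_other] for a in parts_one]


def shared_word_ratio(a: str, b: str) -> float:
    """Toy similarity (an assumption for illustration): shared words over the larger vocabulary."""
    words_a, words_b = set(a.split()), set(b.split())
    largest = max(len(words_a), len(words_b))
    return len(words_a & words_b) / largest if largest else 0.0


B = build_similarity_matrix(["a b", "c d e"], ["a b", "c d", "x"], shared_word_ratio)
print(len(B), len(B[0]))  # 2 3
```

Keeping the parts in text order means that neighbouring columns of B are neighbouring parts of the other text, which the later merging step relies on.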
Referring to Fig. 2, Fig. 2 is a flow chart of another bilingual alignment method provided by an embodiment of the present invention. This method performs paragraph alignment and comprises:
Step A1: splitting the two texts to be aligned at the paragraph level (that is, dividing them into paragraphs);
Since a text is usually stored as a single character string in which every paragraph ends with a newline character, the text can be split here using the newline character "\n" as the delimiter;
For example, splitting one of the two texts yields n paragraphs and splitting the other yields m paragraphs. In addition, after the bilingual parallel corpus has been divided into paragraphs, the paragraphs can be numbered according to their order in the text, in a form such as 1_text;
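Step A1 can be sketched in a few lines. The function name `split_paragraphs` is hypothetical, and dropping empty lines is an assumption made here so that blank lines between paragraphs do not receive numbers.

```python
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split on newline characters and label each paragraph as '<number>_<text>'."""
    paragraphs = [p.strip() for p in text.split("\n")]
    paragraphs = [p for p in paragraphs if p]  # assumption: ignore empty lines
    return [f"{number}_{p}" for number, p in enumerate(paragraphs, start=1)]


print(split_paragraphs("First paragraph.\n\nSecond paragraph.\n"))
# ['1_First paragraph.', '2_Second paragraph.']
```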
Step A2: calculating the text similarity between each paragraph obtained by splitting one of the two texts and each paragraph obtained by splitting the other text;
For example, the one text is an English text (the source-language text) and the other text is a Chinese text (the target-language text), and the text similarity $K_{ij}$ between two paragraphs can be calculated as follows:
Step A201: obtaining the English text produced by translating the j-th paragraph obtained by splitting the other text;
For example, a translation API can be called to translate the m Chinese paragraphs into English; after translation, each paragraph keeps the same number as the Chinese paragraph it was translated from;
Step A202: comparing the number of words in the i-th paragraph obtained by splitting the one text with the number of words in the English text produced by translating the j-th paragraph obtained by splitting the other text;
Step A203: calculating
$$K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$$
wherein L is the number of words in whichever of the two compared texts has more words, and $N_v$ is the value assigned to the v-th word of that text: if the text with fewer words contains a word with the same root as the v-th word, $N_v$ is 1; otherwise it is 0;
It should be noted that if the comparison in step A202 shows that the two texts have the same number of words, either of them can be treated as the text with more words and the other as the text with fewer words;
Once the text similarity between two paragraphs has been calculated, the larger its value, the more likely it is that the two paragraphs correspond to each other;
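As a worked example, assume the similarity is the normalized sum $K_{ij} = \frac{1}{L}\sum_{v} N_v$ and take illustrative figures that are not from the patent: the i-th English paragraph has 5 words and the translated j-th paragraph has 4, so L = 5. If the first, second and fourth words of the longer paragraph each have a same-root word in the shorter one, then:

```latex
K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v = \frac{1 + 1 + 0 + 1 + 0}{5} = 0.6
```

With a preset value of 0.5 this pair would pass the threshold test used in the pairing step; with a preset value of 0.6 it would not, because the test requires a value strictly greater than the preset value.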
Step A3: establishing a text similarity matrix B:
$$B = \begin{bmatrix} K_{11} & K_{12} & \cdots & K_{1m} \\ K_{21} & K_{22} & \cdots & K_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ K_{n1} & K_{n2} & \cdots & K_{nm} \end{bmatrix}$$
wherein n is the number of paragraphs obtained by splitting the one text, m is the number of paragraphs obtained by splitting the other text, and the element $K_{ij}$ of the text similarity matrix B is the text similarity between the i-th paragraph obtained by splitting the one text and the j-th paragraph obtained by splitting the other text; the number of each paragraph of the one text (the value of i) corresponds to its position in the one text (the paragraph at the start of the text is numbered 1, the next paragraph 2, ..., and the paragraph at the end of the text n), and the number of each paragraph of the other text (the value of j) corresponds to its position in the other text (the paragraph at the start of the text is numbered 1, the next paragraph 2, ..., and the paragraph at the end of the text m);
Step A4: successively pairing the paragraphs obtained by splitting the two texts using the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the row and the column of the text similarity matrix B that correspond to the determined pairing are set to an end identifier;
It should be noted that the values of m and n are not necessarily equal. Although the two texts to be aligned are two articles with the same meaning, differences in expression can cause one paragraph of the source-language text to be split into several paragraphs in the target-language text, so that one row of matrix B corresponds to several columns. To further improve matching accuracy, the separated paragraphs therefore need to be merged again while the paragraphs are matched one by one. Specifically, step A4 may comprise:
Step A401: judging whether the value of the current maximum element of the text similarity matrix B is greater than a preset value (a threshold set in advance; for example, the preset value may be 0.5, 0.6 or 0.7); if so, executing step A402; if not, executing step A405;
Step A402: pairing the paragraph of the one text that corresponds to the current maximum element with the paragraph of the other text that corresponds to the current maximum element, setting the values of all elements in the row and the column containing the current maximum element to the end identifier (elements set to the end identifier no longer take part in subsequent matching; for example, the end identifier is 0), and then executing step A403;
Step A403: judging whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeating step A401; if not, executing step A404;
It should be noted that a column set to the end identifier is a column in which the values of all elements have been set to the end identifier, and a row set to the end identifier is a row in which the values of all elements have been set to the end identifier;
Step A404: ending the pairing process;
Step A405: judging whether the current maximum element of the text similarity matrix B has an adjacent column that has not been set to the end identifier (that is, whether the column immediately to the left or right of the column containing the current maximum element has not been set to the end identifier); if so, executing step A406; if not, executing step A407;
Step A406: merging the paragraph of the other text that corresponds to the adjacent column not set to the end identifier with the paragraph of the other text that corresponds to the current maximum element, pairing the merged paragraph with the paragraph of the one text that corresponds to the current maximum element, setting the values of all elements in the adjacent column and in the row and the column containing the current maximum element to the end identifier, and then executing step A408;
Step A407: pairing the paragraph of the one text that corresponds to the current maximum element with the paragraph of the other text that corresponds to the current maximum element, setting the values of all elements in the row and the column containing the current maximum element to the end identifier, and then executing step A408;
Step A408: judging whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeating step A405; if not, executing step A404;
The pairing process in this embodiment has two stages. The first stage comprises steps A401 to A402 above and finds pairings using the preset value. The current maximum element of matrix B is located first; its value is the text similarity between the two paragraphs of the corresponding row and column. If the value is greater than the preset value, the two paragraphs of that row and column are considered a match, and the coordinates are taken out and saved. The values of all elements of that row and that column in matrix B are then set to the end identifier, the maximum element of the updated matrix is located again, and the process is repeated;
When the value of the current maximum element of matrix B is less than or equal to the preset value, the second stage begins, comprising steps A405 to A408 above. At this point the rows and columns of the matrix that have not been set to the end identifier correspond to the paragraphs that have not been paired. These are generally cases in which the paragraph of a row has been split into several paragraphs across columns, which is why no sufficiently similar match was found in the first stage. In this stage the current maximum element of matrix B is located first, and the search proceeds from its position forward (to the left of the column containing the maximum element) and backward (to the right of that column) to judge whether the left and right adjacent columns have been set to the end identifier. An adjacent column that has not been set to the end identifier is regarded as text that was split off; it is merged and aligned with the row of the current maximum element, and the values of all elements in the corresponding row and columns are set to the end identifier. These operations are repeated until the matrix no longer contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier;
With the above method, the paragraphs of the two articles in the source language and the target language (the two texts to be aligned) are put into one-to-one correspondence;
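The two stages can be sketched as below. This is one possible reading of steps A401 to A408, not the definitive implementation: `two_phase_pairs` is a hypothetical name, live rows and columns are tracked in sets instead of overwriting matrix values with the end identifier, and in the second stage both the left and the right neighbouring column are merged whenever they are still live.

```python
from typing import List, Tuple


def two_phase_pairs(B: List[List[float]], preset: float = 0.6) -> List[Tuple[int, List[int]]]:
    """Return 1-based (row, [columns]) pairings; a pairing may merge adjacent columns."""
    live_rows = set(range(len(B)))
    live_cols = set(range(len(B[0]))) if B else set()
    pairs = []
    second_stage = False
    # Stop as soon as no live row or no live column remains (steps A403 and A408).
    while live_rows and live_cols:
        # Current maximum element among rows and columns not yet retired.
        candidates = ((r, c) for r in sorted(live_rows) for c in sorted(live_cols))
        i, j = max(candidates, key=lambda rc: B[rc[0]][rc[1]])
        if not second_stage and B[i][j] <= preset:
            second_stage = True  # step A401 failed: switch to A405-A408 for good
        if second_stage:
            # Merge the column with any immediately adjacent live column (A405/A406).
            cols = [c for c in (j - 1, j, j + 1) if c in live_cols]
        else:
            cols = [j]  # plain one-to-one pairing (A402)
        pairs.append((i + 1, [c + 1 for c in cols]))
        live_rows.discard(i)
        live_cols.difference_update(cols)
    return pairs


B = [
    [0.9, 0.1, 0.1, 0.0],
    [0.1, 0.5, 0.4, 0.1],
    [0.0, 0.1, 0.2, 0.8],
]
print(two_phase_pairs(B))  # [(1, [1]), (3, [4]), (2, [2, 3])]
```

In the usage example the second paragraph of the one text is split across columns 2 and 3 of the other text; neither similarity exceeds the preset value of 0.6, so the pair is only recovered in the second stage, where the two columns are merged.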
For example, the matrix established in step A3 is as shown in Fig. 3 and the preset value is 0.6. Processing proceeds as follows:
First stage:
1. The maximum element of the current matrix is the element at position a1, and its value is greater than 0.6, so the 9th paragraph of the other text is paired with the 7th paragraph of the one text, and the values of all elements in column 9 and row 7 are set to 0;
2. The maximum element of the current matrix is then the element at position a2, and its value is greater than 0.6, so the 3rd paragraph of the other text is paired with the 3rd paragraph of the one text, and the values of all elements in column 3 and row 3 are set to 0;
3. The above steps are repeated and the elements at positions a3 and a4 are found in turn: the 2nd paragraph of the other text is paired with the 2nd paragraph of the one text, then the 6th paragraph of the other text is paired with the 5th paragraph of the one text, and the elements in the corresponding columns and rows are set to 0;
When the value of the maximum element of the current matrix is less than or equal to 0.6, the second stage is carried out:
1. The element at position b1 is the current maximum element. A search to the left and to the right finds no adjacent column that has not been set to 0, so the 10th paragraph of the other text is paired with the 9th paragraph of the one text, and the elements in the corresponding column and row are set to 0;
2. The element at position b2 is then the current maximum element. A search to the left and to the right finds that the column containing the element at position b3 has not been set to 0, so the 7th and 8th paragraphs of the other text are merged, the merged paragraph is paired with the 8th paragraph of the one text, and the elements in the corresponding columns and row are set to 0;
3. The above steps are repeated: the 4th and 5th paragraphs of the other text are merged and paired with the 4th paragraph of the one text, and the elements in the corresponding columns and row are set to 0; the 1st paragraph of the other text is paired with the 1st paragraph of the one text, and the elements in the corresponding column and row are set to 0; in the end only row 6 has not been set to 0;
It should be noted that if a column or row is left unpaired (such as row 6 in Fig. 3), this indicates that one of the two language texts contains content that the other language text does not mention.
Referring to Fig. 4, Fig. 4 is a flow chart of a further bilingual alignment method provided by an embodiment of the present invention. This method performs sentence alignment and comprises:
Step B1: splitting the two texts to be aligned at the sentence level (that is, dividing them into sentences);
For example, the two texts to be aligned may be two paragraphs that have been paired with each other;
For example, the paragraph text can be divided into sentences according to punctuation marks. Taking Chinese-English alignment as an example, Chinese is split at "。" and English is split at ".";
In addition, in practice, nonstandard writing often makes punctuation-based sentence splitting unsatisfactory, and sentences in the two languages may be semantically chained together. To address this, sentences can also be split with added regular-expression rules, or poorly split sentences can simply be merged into one sentence, to avoid matching errors;
For example, n sentence is obtained, by another by carrying out subordinate sentence processing to a text in above-mentioned two text Text carries out subordinate sentence processing, obtains m sentence, wherein each sentence can be numbered according to text tandem;
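The punctuation-based splitting of step B1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `split_sentences`, the language codes `"zh"` and `"en"`, and the choice to keep the terminating punctuation on each sentence are assumptions, and abbreviations such as "e.g." are not handled.

```python
import re
from typing import List

# Split points taken from the description: "。" for Chinese, "." for English.
# Zero-width lookbehinds keep the punctuation attached to the sentence.
_ZH_SPLIT = re.compile(r"(?<=。)")
_EN_SPLIT = re.compile(r"(?<=\.)\s+")


def split_sentences(text: str, lang: str) -> List[str]:
    """Split a paragraph into sentences; list order gives the sentence numbering."""
    pattern = _ZH_SPLIT if lang == "zh" else _EN_SPLIT
    return [part.strip() for part in pattern.split(text) if part.strip()]


zh_sentences = split_sentences("今天天气很好。我们去公园。", "zh")
en_sentences = split_sentences("It is sunny today. We go to the park.", "en")
```

The position of a sentence in the returned list plays the role of its number (starting from 0 here, whereas the description numbers from 1).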
Step B2: calculate the text similarity between each sentence obtained by segmenting one of the two texts and each sentence obtained by segmenting the other text;
For example, where the one text is an English text (the source-language text) and the other text is a Chinese text (the target-language text), the text similarity $K_{ij}$ between two sentences can be calculated as follows:
Step B201: obtain the English text produced by translating the j-th sentence obtained by segmenting the other text;
For example, the m Chinese sentences can be translated into English by calling a translation API, with each translated sentence keeping the same number as the Chinese sentence before translation;
Step B202: compare the number of words in the i-th sentence obtained by segmenting the one text with the number of words in the English text produced by translating the j-th sentence obtained by segmenting the other text;
Step B203: calculate $K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$;
Wherein L is the number of words in whichever of the two compared texts has more words, and $N_v$ is the value assigned to the v-th word of that text: if the text with fewer words contains a word with the same root as the v-th word, $N_v$ is 1; otherwise it is 0;
It should be noted that if the comparison in step B202 shows that the two word counts are equal, either text may be treated as the one with more words and the other as the one with fewer words;
After the text similarity between two sentences has been calculated, the larger its value, the more likely it is that the two sentences correspond to each other;
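Steps B202 and B203 can be sketched as below, assuming the similarity is the fraction of words in the longer text whose root also appears in the shorter text. The translation of step B201 is assumed to have been done already, and `crude_root` is a hypothetical stand-in for a real stemmer, not something the description specifies.

```python
import re
from typing import List


def crude_root(word: str) -> str:
    """Hypothetical stand-in for a stemmer: lowercase and strip a few suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def tokenize(sentence: str) -> List[str]:
    """Extract alphabetic word tokens from an English sentence."""
    return re.findall(r"[A-Za-z']+", sentence)


def similarity(source_en: str, translated_en: str) -> float:
    """Share of words in the longer text that have a same-root word in the shorter one."""
    a, b = tokenize(source_en), tokenize(translated_en)
    # On equal word counts either side may serve as the longer one (see step B202).
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    if not longer:
        return 0.0
    shorter_roots = {crude_root(w) for w in shorter}
    hits = sum(1 for w in longer if crude_root(w) in shorter_roots)  # sum of N_v
    return hits / len(longer)  # divide by L


k = similarity("The cat sleeps", "A cat is sleeping")
```

In this example the longer text has four words, of which "cat" and "sleeping" share a root with a word in the shorter text, so the value is 2/4.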
Step B3: establish the text similarity matrix B; $B = \begin{pmatrix} K_{11} & \cdots & K_{1m} \\ \vdots & \ddots & \vdots \\ K_{n1} & \cdots & K_{nm} \end{pmatrix}$
Wherein n is the number of sentences obtained by segmenting the one text, m is the number of sentences obtained by segmenting the other text, and the element $K_{ij}$ of the text similarity matrix B is the text similarity between the i-th sentence obtained by segmenting the one text and the j-th sentence obtained by segmenting the other text. The number of each sentence obtained by segmenting the one text (i.e. the value of i) corresponds to its position in the one text (the sentence at the start of the one text is numbered 1, the next sentence is numbered 2, ..., and the sentence at the end of the text is numbered n), and the number of each sentence obtained by segmenting the other text (i.e. the value of j) corresponds to its position in the other text (the sentence at the start of the other text is numbered 1, the next sentence is numbered 2, ..., and the sentence at the end of the text is numbered m);
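Building the matrix of step B3 amounts to evaluating the similarity for every (i, j) combination. The sketch below assumes a caller-supplied similarity function; `build_similarity_matrix` and the toy `exact_match` function are illustrative names, not part of the described method.

```python
from typing import Callable, List, Sequence


def build_similarity_matrix(
    parts_one: Sequence[str],
    parts_other: Sequence[str],
    sim: Callable[[str, str], float],
) -> List[List[float]]:
    """Return an n-by-m nested list with rows for the one text and columns for the other."""
    return [[sim(a, b) for b in parts_other] for a in parts_one]


def exact_match(a: str, b: str) -> float:
    """Toy similarity used only for this demo: 1.0 for identical strings, else 0.0."""
    return 1.0 if a == b else 0.0


B = build_similarity_matrix(["x", "y"], ["x", "z", "y"], exact_match)
```

Because Python lists are 0-based, the element written as K with subscripts i and j in the description is stored at `B[i - 1][j - 1]`.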
Step B4: pair the sentences obtained by segmenting the two texts one pairing at a time, using the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the column and the row of the text similarity matrix B corresponding to that pairing are set to an end identifier; for example, the end identifier may be 0.
It should be noted that the values of m and n are not necessarily equal. Although the two texts to be aligned are two articles with the same meaning, differences in expression can cause one sentence in the source-language text to be split into several sentences in the target-language text, so that one row of matrix B may correspond to several columns. Therefore, to further improve matching accuracy, sentences that have been split apart need to be merged again when sentences are matched one by one. Specifically, step B4 may include:
Step B401: judge whether the value of the current maximum element in the text similarity matrix B is greater than a preset value (i.e. a preset threshold; for example, the preset value may be 0.5, 0.6 or 0.7); if so, execute step B402; if not, execute step B405;
Step B402: pair the sentence of the one text corresponding to the current maximum element with the sentence of the other text corresponding to the current maximum element, and set the values of all elements in the row and the column in which the current maximum element is located to the end identifier (elements set to the end identifier take no part in subsequent matching; for example, the end identifier is 0); then execute step B403;
Step B403: judge whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeat step B401; if not, execute step B404;
It should be noted that a column set to the end identifier is a column in which the values of all elements have been set to the end identifier, and a row set to the end identifier is a row in which the values of all elements have been set to the end identifier;
Step B404: end the pairing process;
Step B405: judge whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier; if so, execute step B406; if not, execute step B407;
Step B406: merge the sentence of the other text corresponding to the adjacent column not set to the end identifier with the sentence of the other text corresponding to the current maximum element, pair the merged sentence with the sentence of the one text corresponding to the current maximum element, and set the values of all elements in that adjacent column and in the row and the column in which the current maximum element is located to the end identifier; then execute step B408;
Step B407: pair the sentence of the one text corresponding to the current maximum element with the sentence of the other text corresponding to the current maximum element, and set the values of all elements in the row and the column in which the current maximum element is located to the end identifier; then execute step B408;
Step B408: judge whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeat step B405; if not, execute step B404.
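The control flow of steps B401 to B408 can be sketched as follows. Two choices here are assumptions rather than statements of the described method: rows and columns that have been "set to the end identifier" are tracked with sets of still-active indices instead of overwriting values with 0, and when both the left and the right adjacent columns are still active, the one with the higher similarity in the same row is merged. The name `greedy_align` is illustrative.

```python
from typing import List, Tuple

Pairing = Tuple[int, List[int]]  # (row index, merged column indices), 0-based


def greedy_align(B: List[List[float]], threshold: float = 0.6) -> List[Pairing]:
    """Pair rows with columns by repeatedly taking the current maximum element."""
    n = len(B)
    m = len(B[0]) if n else 0
    rows, cols = set(range(n)), set(range(m))  # rows/columns not yet ended
    pairs: List[Pairing] = []
    stage_two = False

    while rows and cols:  # B403 / B408: an active row and an active column remain
        # Current maximum element among active rows and columns.
        i, j = max(
            ((r, c) for r in sorted(rows) for c in sorted(cols)),
            key=lambda rc: B[rc[0]][rc[1]],
        )
        if not stage_two and B[i][j] <= threshold:
            stage_two = True  # B401 failed: switch to the second stage
        merged = [j]
        if stage_two:
            # B405: look for adjacent columns that are still active.
            neighbours = [c for c in (j - 1, j + 1) if c in cols]
            if neighbours:
                # B406 (tie-break is an assumption): merge the better neighbour.
                best = max(neighbours, key=lambda c: B[i][c])
                merged = sorted([j, best])
        pairs.append((i, merged))       # B402 / B406 / B407: record the pairing
        rows.discard(i)                 # end the row
        cols.difference_update(merged)  # end the column(s)
    return pairs


demo = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.5, 0.4, 0.1],
    [0.0, 0.1, 0.2, 0.8],
]
result = greedy_align(demo)
```

In the demo, the first and third rows are paired one-to-one in the first stage; the remaining maximum of 0.5 does not exceed the threshold, so the second stage merges columns 1 and 2 and pairs them with the second row. Rows or columns still active when the loop ends are left unpaired, which corresponds to content present in only one of the two texts.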
An embodiment of the present invention further provides a bilingual alignment method, comprising:
Step C1: perform paragraph alignment between a first text and a second text using the method shown in Fig. 2, obtaining several pairs of paired paragraphs;
Step C2: for each pair of paired paragraphs obtained in step C1, perform sentence alignment using the method shown in Fig. 3.
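The composition of steps C1 and C2 can be sketched with the aligner and the sentence splitters passed in as parameters. All names below (`two_stage_align`, `align`, `split_one`, `split_other`) are hypothetical, and the positional `zip` aligner in the demo is only a placeholder for the matrix-based pairing.

```python
from typing import Callable, List, Sequence, Tuple

Aligner = Callable[[Sequence[str], Sequence[str]], List[Tuple[str, str]]]
Splitter = Callable[[str], List[str]]


def two_stage_align(
    paras_one: Sequence[str],
    paras_other: Sequence[str],
    align: Aligner,
    split_one: Splitter,
    split_other: Splitter,
) -> List[Tuple[str, str]]:
    """Align paragraphs first (C1), then align sentences inside each paragraph pair (C2)."""
    sentence_pairs: List[Tuple[str, str]] = []
    for para_one, para_other in align(paras_one, paras_other):
        sentence_pairs.extend(align(split_one(para_one), split_other(para_other)))
    return sentence_pairs


# Demo with placeholder components: positional pairing and "|" as sentence separator.
pairs = two_stage_align(
    ["a1|a2", "a3"],
    ["b1|b2", "b3"],
    align=lambda xs, ys: list(zip(xs, ys)),
    split_one=lambda p: p.split("|"),
    split_other=lambda p: p.split("|"),
)
```

Restricting sentence matching to already paired paragraphs keeps each similarity matrix small and prevents a sentence from being matched with a similar sentence in an unrelated paragraph.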
The bilingual alignment method provided by the embodiments of the present invention addresses problems that occur in texts such as non-standard writing, position mismatch, semantic repetition and semantic splitting. It adopts a two-stage approach that first matches paragraphs and then matches sentences within paragraphs, which avoids the many-to-one matches, and the mismatches caused by repeated sentences or by highly similar sentences within a text, that arise when sentences are matched directly. In addition, following the idea that a one-to-one pairing is the most probable, it starts from the pair with the greatest text similarity and works downwards step by step, first settling the pairings that can be determined and then handling the uncertain ones, thereby helping to ensure the accuracy of bilingual alignment.
An embodiment of the present invention further provides a bilingual alignment apparatus, comprising:
A segmentation module 1, configured to segment two texts to be aligned at the same linguistic unit level;
A text similarity calculation module 2, configured to calculate the text similarity between each part obtained by segmenting one of the two texts and each part obtained by segmenting the other text;
A matrix module 3, configured to establish a text similarity matrix B; $B = \begin{pmatrix} K_{11} & \cdots & K_{1m} \\ \vdots & \ddots & \vdots \\ K_{n1} & \cdots & K_{nm} \end{pmatrix}$
Wherein n is the number of parts obtained by segmenting the one text, m is the number of parts obtained by segmenting the other text, and the element $K_{ij}$ of the text similarity matrix B is the text similarity between the i-th part obtained by segmenting the one text and the j-th part obtained by segmenting the other text;
A matching module 4, configured to pair the parts obtained by segmenting the two texts one pairing at a time, using the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the column and the row of the text similarity matrix B corresponding to that pairing are set to an end identifier.
In one embodiment, the matching module comprises:
A first judging unit, configured to judge whether the value of the current maximum element in the text similarity matrix B is greater than a preset value;
A first pairing unit, configured to pair the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, and to set the values of the elements in the row and the column in which the current maximum element is located to the end identifier;
A second judging unit, configured to judge whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, to control the first judging unit to judge again whether the value of the current maximum element in the text similarity matrix B is greater than the preset value; if not, to control an end unit to end the pairing process;
The end unit, configured to end the pairing process;
A third judging unit, configured to judge whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier;
A second pairing unit, configured to merge the part of the other text corresponding to the adjacent column not set to the end identifier with the part of the other text corresponding to the current maximum element, to pair the merged part with the part of the one text corresponding to the current maximum element, and to set the values of the elements in that adjacent column and in the row and the column in which the current maximum element is located to the end identifier;
A third pairing unit, configured to pair the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, and to set the values of the elements in the row and the column in which the current maximum element is located to the end identifier;
A fourth judging unit, configured to judge whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, to control the third judging unit to judge again whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier; if not, to control the end unit to end the pairing process.
In one embodiment, the one text is an English text, and the text similarity calculation module comprises:
An acquiring unit, configured to obtain the English text produced by translating the j-th part obtained by segmenting the other text;
A comparing unit, configured to compare the number of words in the i-th part obtained by segmenting the one text with the number of words in the English text produced by translating the j-th part obtained by segmenting the other text;
A calculating unit, configured to calculate $K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$;
Wherein L is the number of words in whichever of the two compared texts has more words, and $N_v$ is the value assigned to the v-th word of that text: if the text with fewer words contains a word with the same root as the v-th word, $N_v$ is 1; otherwise it is 0.
In one embodiment, the linguistic unit level is paragraph level, so as to realize paragraph alignment between the two texts.
In one embodiment, the linguistic unit level is sentence level, so as to realize sentence alignment between the two texts.
An embodiment of the present invention further provides a bilingual alignment system, comprising the above bilingual alignment apparatus for realizing paragraph alignment and the above bilingual alignment apparatus for realizing sentence alignment;
Paragraph alignment between a first text and a second text is realized by the above bilingual alignment apparatus for realizing paragraph alignment, obtaining several pairs of paired paragraphs;
For each pair of paired paragraphs, sentence alignment is realized by the above bilingual alignment apparatus for realizing sentence alignment.
Those skilled in the art will readily understand that the above preferred embodiments can be freely combined and superimposed provided that they do not conflict.
It should be understood that the above embodiments are merely exemplary and not restrictive. Without departing from the basic principles of the present invention, any obvious or equivalent modifications or replacements that those skilled in the art may make to the above details fall within the scope of the claims of the present invention.

Claims (8)

1. A bilingual alignment method, characterized by comprising:
Step S1: segmenting two texts to be aligned at the same linguistic unit level;
Step S2: calculating the text similarity between each part obtained by segmenting one of the two texts and each part obtained by segmenting the other text;
Step S3: establishing a text similarity matrix B; $B = \begin{pmatrix} K_{11} & \cdots & K_{1m} \\ \vdots & \ddots & \vdots \\ K_{n1} & \cdots & K_{nm} \end{pmatrix}$
Wherein n is the number of parts obtained by segmenting the one text, m is the number of parts obtained by segmenting the other text, and the element $K_{ij}$ of the text similarity matrix B is the text similarity between the i-th part obtained by segmenting the one text and the j-th part obtained by segmenting the other text;
Step S4: pairing the parts obtained by segmenting the two texts one pairing at a time, using the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the column and the row of the text similarity matrix B corresponding to that pairing are set to an end identifier;
Wherein the step S4 comprises:
Step S401: judging whether the value of the current maximum element in the text similarity matrix B is greater than a preset value; if so, executing step S402; if not, executing step S405;
Step S402: pairing the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, and setting the values of the elements in the row and the column in which the current maximum element is located to the end identifier; then executing step S403;
Step S403: judging whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeating step S401; if not, executing step S404;
Step S404: ending the pairing process;
Step S405: judging whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier; if so, executing step S406; if not, executing step S407;
Step S406: merging the part of the other text corresponding to the adjacent column not set to the end identifier with the part of the other text corresponding to the current maximum element, pairing the merged part with the part of the one text corresponding to the current maximum element, and setting the values of the elements in that adjacent column and in the row and the column in which the current maximum element is located to the end identifier; then executing step S408;
Step S407: pairing the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, and setting the values of the elements in the row and the column in which the current maximum element is located to the end identifier; then executing step S408;
Step S408: judging whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, repeating step S405; if not, executing step S404.
2. The method according to claim 1, characterized in that the one text is an English text, and in the step S2 the text similarity $K_{ij}$ is calculated as follows:
Obtaining the English text produced by translating the j-th part obtained by segmenting the other text;
Comparing the number of words in the i-th part obtained by segmenting the one text with the number of words in the English text produced by translating the j-th part obtained by segmenting the other text;
Calculating $K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$;
Wherein L is the number of words in whichever of the two compared texts has more words, and $N_v$ is the value assigned to the v-th word of that text: if the text with fewer words contains a word with the same root as the v-th word, $N_v$ is 1; otherwise it is 0.
3. The method according to claim 1 or 2, characterized in that the linguistic unit level is paragraph level, so as to realize paragraph alignment between the two texts.
4. The method according to claim 1 or 2, characterized in that the linguistic unit level is sentence level, so as to realize sentence alignment between the two texts.
5. A bilingual alignment method, characterized by comprising:
Realizing paragraph alignment between a first text and a second text using the method according to claim 3, obtaining several pairs of paired paragraphs;
For each pair of paired paragraphs, realizing sentence alignment using the method according to claim 4.
6. A bilingual alignment apparatus, characterized by comprising:
A segmentation module, configured to segment two texts to be aligned at the same linguistic unit level;
A text similarity calculation module, configured to calculate the text similarity between each part obtained by segmenting one of the two texts and each part obtained by segmenting the other text;
A matrix module, configured to establish a text similarity matrix B; $B = \begin{pmatrix} K_{11} & \cdots & K_{1m} \\ \vdots & \ddots & \vdots \\ K_{n1} & \cdots & K_{nm} \end{pmatrix}$
Wherein n is the number of parts obtained by segmenting the one text, m is the number of parts obtained by segmenting the other text, and the element $K_{ij}$ of the text similarity matrix B is the text similarity between the i-th part obtained by segmenting the one text and the j-th part obtained by segmenting the other text;
A matching module, configured to pair the parts obtained by segmenting the two texts one pairing at a time, using the current maximum element of the text similarity matrix B, wherein, after each pairing is determined, the values of the elements in the column and the row of the text similarity matrix B corresponding to that pairing are set to an end identifier;
Wherein the matching module comprises:
A first judging unit, configured to judge whether the value of the current maximum element in the text similarity matrix B is greater than a preset value;
A first pairing unit, configured to pair the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, and to set the values of the elements in the row and the column in which the current maximum element is located to the end identifier;
A second judging unit, configured to judge whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, to control the first judging unit to judge again whether the value of the current maximum element in the text similarity matrix B is greater than the preset value; if not, to control an end unit to end the pairing process;
The end unit, configured to end the pairing process;
A third judging unit, configured to judge whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier;
A second pairing unit, configured to merge the part of the other text corresponding to the adjacent column not set to the end identifier with the part of the other text corresponding to the current maximum element, to pair the merged part with the part of the one text corresponding to the current maximum element, and to set the values of the elements in that adjacent column and in the row and the column in which the current maximum element is located to the end identifier;
A third pairing unit, configured to pair the part of the one text corresponding to the current maximum element with the part of the other text corresponding to the current maximum element, and to set the values of the elements in the row and the column in which the current maximum element is located to the end identifier;
A fourth judging unit, configured to judge whether the text similarity matrix B currently contains both a column that has not been set to the end identifier and a row that has not been set to the end identifier; if so, to control the third judging unit to judge again whether the current maximum element in the text similarity matrix B has an adjacent column that has not been set to the end identifier; if not, to control the end unit to end the pairing process.
7. The apparatus according to claim 6, characterized in that the one text is an English text, and the text similarity calculation module comprises:
An acquiring unit, configured to obtain the English text produced by translating the j-th part obtained by segmenting the other text;
A comparing unit, configured to compare the number of words in the i-th part obtained by segmenting the one text with the number of words in the English text produced by translating the j-th part obtained by segmenting the other text;
A calculating unit, configured to calculate $K_{ij} = \frac{1}{L}\sum_{v=1}^{L} N_v$;
Wherein L is the number of words in whichever of the two compared texts has more words, and $N_v$ is the value assigned to the v-th word of that text: if the text with fewer words contains a word with the same root as the v-th word, $N_v$ is 1; otherwise it is 0.
8. A bilingual alignment system, characterized by comprising an apparatus that implements the method according to claim 3 and an apparatus that implements the method according to claim 4;
Paragraph alignment between a first text and a second text is realized by the apparatus that implements the method according to claim 3, obtaining several pairs of paired paragraphs;
For each pair of paired paragraphs, sentence alignment is realized by the apparatus that implements the method according to claim 4.
CN201811567535.1A 2018-12-20 2018-12-20 Bilingual alignment method, apparatus and system Active CN109710950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811567535.1A CN109710950B (en) 2018-12-20 2018-12-20 Bilingual alignment method, apparatus and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811567535.1A CN109710950B (en) 2018-12-20 2018-12-20 Bilingual alignment method, apparatus and system

Publications (2)

Publication Number Publication Date
CN109710950A CN109710950A (en) 2019-05-03
CN109710950B true CN109710950B (en) 2019-10-18

Family

ID=66257105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811567535.1A Active CN109710950B (en) 2018-12-20 2018-12-20 Bilingual alignment method, apparatus and system

Country Status (1)

Country Link
CN (1) CN109710950B (en)

Also Published As

Publication number Publication date
CN109710950A (en) 2019-05-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 519031 office 1316, No. 1, lianao Road, Hengqin new area, Zhuhai, Guangdong

Patentee after: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.

Address before: 519031 room 417, building 20, creative Valley, Hengqin New District, Zhuhai City, Guangdong Province

Patentee before: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.