CN109670178A - Sentence-level bilingual alignment method and device, computer readable storage medium - Google Patents
Sentence-level bilingual alignment method and device, computer readable storage medium
- Publication number
- CN109670178A (application CN201811562126.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- sentence
- punctuate
- handled
- aligned
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Abstract
The invention discloses a sentence-level bilingual alignment method and device, and a computer-readable storage medium. The method comprises: step S1: obtaining Z trained convolution kernels, where Z is an integer greater than or equal to 1; step S2: performing sentence segmentation on each of two texts to be aligned, and establishing a text similarity matrix U of the two texts to be aligned; step S3: convolving the text similarity matrix U with each of the Z trained convolution kernels to obtain Z optimized text similarity matrices; step S4: obtaining the sentence alignment result of the two texts to be aligned from the Z optimized text similarity matrices. The invention helps to improve the efficiency of sentence alignment between texts.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a sentence-level bilingual alignment method and device, and a computer-readable storage medium.
Background technique
Parallel corpora are important data for natural-language-processing-based translation algorithms. A parallel (or corresponding) corpus is a bilingual or multilingual corpus consisting of source texts and their parallel translations. By degree of alignment, corpora can be divided into word-level, sentence-level, paragraph-level, and document-level corpora, of which sentence-level parallel corpora are the most commonly used; paragraph-level and document-level parallel corpora therefore usually have to be converted into sentence-level ones. In a corpus, however, source text and translation do not necessarily correspond one to one: because of differences in text structure and in authors' writing habits, 15 Chinese sentences may correspond to 22 English sentences, or 16 Chinese sentences to 50 English sentences, so complex and varied sentence-matching situations must be handled. At present, paragraph- and document-level corpora are mainly split and recombined into one-to-one sentence pairs manually, which consumes considerable manpower and time and thus hinders improvement of sentence alignment efficiency.
Summary of the invention
In view of this, one object of the present invention is to provide a sentence-level bilingual alignment method and device, and a computer-readable storage medium, which help to improve sentence alignment efficiency.
In order to achieve the above object, the technical solution of the present invention provides a sentence-level bilingual alignment method, comprising:
Step S1: obtaining Z trained convolution kernels, where Z is an integer greater than or equal to 1, each trained convolution kernel being obtained through steps S11 to S15;
Step S11: performing sentence segmentation (splitting at sentence-ending punctuation) on each of two training texts, and establishing the text similarity matrix B of the two training texts:
where n is the number of sentences obtained by segmenting one of the two training texts, m is the number of sentences obtained by segmenting the other training text, and the element Kij of the text similarity matrix B is the text similarity between the i-th sentence obtained from the one training text and the j-th sentence obtained from the other training text;
Step S12: initializing a convolution kernel;
Step S13: convolving the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and computing a loss value loss; if the loss value meets a preset requirement, executing step S14; otherwise, executing step S16;
where Lij is 1 if the i-th sentence obtained by segmenting the one training text matches the j-th sentence obtained by segmenting the other training text, and 0 otherwise;
Step S14: verifying the current convolution kernel on a validation set, and judging whether the verification result meets a preset requirement; if so, executing step S15; if not, executing step S16;
Step S15: taking the current convolution kernel as a trained convolution kernel;
Step S16: adjusting the weights of the current convolution kernel according to the loss value, and judging whether the current number of training iterations has reached a preset number; if so, executing step S15; if not, returning to step S13;
Step S2: performing sentence segmentation on each of two texts to be aligned, and establishing the text similarity matrix U of the two texts to be aligned:
where a is the number of sentences obtained by segmenting one of the two texts to be aligned, b is the number of sentences obtained by segmenting the other text to be aligned, and the element Kij of the text similarity matrix U is the text similarity between the i-th sentence obtained from the one text to be aligned and the j-th sentence obtained from the other text to be aligned;
Step S3: convolving the text similarity matrix U with each of the Z trained convolution kernels to obtain Z optimized text similarity matrices;
Step S4: obtaining the sentence alignment result of the two texts to be aligned from the Z optimized text similarity matrices.
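Steps S2 to S4 can be illustrated with a rough sketch. This is not the patented implementation: the similarity values in U and the two kernel weight sets below are made-up placeholders, and real kernels would come from the training of steps S11 to S16.

```python
import numpy as np

def conv2d_same(M, kernel):
    # Zero-padded convolution whose output has the same size as the input
    # (cross-correlation form; the kernel flip is omitted for simplicity).
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(M, ((ph, ph), (pw, pw)))
    out = np.zeros_like(M, dtype=float)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Step S2: a hypothetical 3x3 text similarity matrix U of two texts to be aligned
U = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.3],
              [0.0, 0.2, 0.7]])

# Step S3: Z = 2 "trained" kernels (placeholder weights)
kernels = [np.array([[0.0, 0.25, 0.0],
                     [0.25, 1.0, 0.25],
                     [0.0, 0.25, 0.0]]),
           np.eye(3) * 0.5]
optimized = [conv2d_same(U, k) for k in kernels]

# Step S4 (see also S41/S42 below): average the optimized matrices into a
# matching degree matrix T, then match each row to its largest-valued column
T = np.mean(optimized, axis=0)
pairs = [(i, int(np.argmax(T[i]))) for i in range(T.shape[0])]
print(pairs)  # → [(0, 0), (1, 1), (2, 2)]
```

With this diagonal-dominant U, the convolution reinforces the diagonal and each sentence pairs with its counterpart of the same index.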
Further, Z is an integer greater than or equal to 2, and different trained convolution kernels differ in size and weights.
Further, step S4 comprises:
Step S41: calculating a text matching degree matrix T from the Z optimized text similarity matrices, where the element Yij of the text matching degree matrix T is the text matching degree between the i-th sentence obtained by segmenting the one text to be aligned and the j-th sentence obtained by segmenting the other text to be aligned, and the value of each element of T is the average of the elements at the same position in the Z optimized text similarity matrices;
Step S42: traversing the rows of the text matching degree matrix T in turn, choosing the element with the largest value from each row, and matching the two sentences corresponding to the chosen element.
Further, after step S42 the method further comprises:
Step S43: judging whether any of the b sentences obtained by segmenting the other text to be aligned remains unmatched; if so, looking up in the text matching degree matrix T the sentence with the highest text matching degree to the unmatched sentence, and matching the found sentence with it.
Further, after step S4 the method further comprises:
Step S5: detecting the sentence alignment result according to the position order, within the other text to be aligned, of the b sentences obtained by segmenting the other text to be aligned, and the position order, within the one text to be aligned, of the a sentences obtained by segmenting the one text to be aligned.
Further, step S5 comprises:
Step S51: sorting the a sentences according to the position order of the b sentences in the other text to be aligned and the sentence alignment result;
Step S52: if two of the a sentences have a position order after sorting that is opposite to their position order in the one text to be aligned, judging that an error exists.
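Steps S51 and S52 amount to an order-consistency check on the pairing result. A minimal sketch (the pair-list representation and 1-based index convention are illustrative assumptions, not fixed by the patent text):

```python
def detect_order_errors(pairs):
    """pairs: list of (a_index, b_index) sentence pairings, where each index is
    the position of the sentence in its own text. Sort by the b-side position
    (step S51); any inversion on the a-side between two consecutive pairs
    signals a likely alignment error (step S52)."""
    ordered = sorted(pairs, key=lambda p: p[1])
    errors = []
    for (a1, _), (a2, _) in zip(ordered, ordered[1:]):
        if a1 > a2:  # relative order reversed between the two texts
            errors.append((a1, a2))
    return errors

print(detect_order_errors([(1, 1), (3, 2), (2, 3)]))  # → [(3, 2)]
```

A monotone pairing such as [(1, 1), (2, 2), (3, 3)] yields no errors; the inversion between sentences 3 and 2 above is flagged.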
Further, each of the two training texts and the two texts to be aligned comprises one English text and one non-English text, and the text similarity K between each sentence obtained by segmenting the English text and each sentence obtained by segmenting the non-English text is calculated as follows:
translating each sentence obtained by segmenting the non-English text to obtain a corresponding English rendering;
for the two sentences whose text similarity is to be calculated, comparing the number of words in the sentence obtained by segmenting the English text with the number of words in the English rendering obtained by translating the sentence obtained by segmenting the non-English text;
calculating
K = (N1 + N2 + ... + NE) / E
where E is the word count of whichever of the two compared sentences has more words, and Nv is the value for the v-th word of that longer sentence: Nv is 1 if the shorter of the two compared sentences contains a word with the same root as the v-th word, and 0 otherwise.
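The computation of K can be sketched as follows. The translation step is stubbed out (the second argument is assumed to be the English rendering already produced by a translation tool), and `crude_root` is only a rough stand-in for root extraction, since the patent does not fix a particular stemmer:

```python
def crude_root(word):
    """Very rough stand-in for root extraction (illustrative assumption)."""
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def similarity_k(english_sentence, translated_sentence):
    """K = (number of root-matched words) / (word count of the longer sentence)."""
    w1 = english_sentence.lower().split()
    w2 = translated_sentence.lower().split()
    # E is the length of the longer sentence; on a tie either may be chosen
    longer, shorter = (w1, w2) if len(w1) >= len(w2) else (w2, w1)
    shorter_roots = {crude_root(w) for w in shorter}
    matched = sum(1 for w in longer if crude_root(w) in shorter_roots)
    return matched / len(longer)

print(round(similarity_k("the cats sat on the mat",
                         "the cat sits on a mat"), 3))  # → 0.833
```

Here 5 of the 6 words root-match ("sat" does not reduce to "sit" under this crude rule), giving K = 5/6.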
To achieve the above object, the technical solution of the present invention further provides a sentence-level bilingual alignment device, comprising:
an obtaining module, for obtaining Z trained convolution kernels, where Z is an integer greater than or equal to 1, each trained convolution kernel being obtained through steps S11 to S15;
Step S11: performing sentence segmentation on each of two training texts, and establishing the text similarity matrix B of the two training texts:
where n is the number of sentences obtained by segmenting one of the two training texts, m is the number of sentences obtained by segmenting the other training text, and the element Kij of the text similarity matrix B is the text similarity between the i-th sentence obtained from the one training text and the j-th sentence obtained from the other training text;
Step S12: initializing a convolution kernel;
Step S13: convolving the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and computing a loss value loss; if the loss value meets a preset requirement, executing step S14; otherwise, executing step S16;
where Lij is 1 if the i-th sentence obtained by segmenting the one training text matches the j-th sentence obtained by segmenting the other training text, and 0 otherwise;
Step S14: verifying the current convolution kernel on a validation set, and judging whether the verification result meets a preset requirement; if so, executing step S15; if not, executing step S16;
Step S15: taking the current convolution kernel as a trained convolution kernel;
Step S16: adjusting the weights of the current convolution kernel according to the loss value, and judging whether the current number of training iterations has reached a preset number; if so, executing step S15; if not, returning to step S13;
a first processing module, for performing sentence segmentation on each of two texts to be aligned, and establishing the text similarity matrix U of the two texts to be aligned:
where a is the number of sentences obtained by segmenting one of the two texts to be aligned, b is the number of sentences obtained by segmenting the other text to be aligned, and the element Kij of the text similarity matrix U is the text similarity between the i-th sentence obtained from the one text to be aligned and the j-th sentence obtained from the other text to be aligned;
a second processing module, for convolving the text similarity matrix U with each of the Z trained convolution kernels to obtain Z optimized text similarity matrices;
a third processing module, for obtaining the sentence alignment result of the two texts to be aligned from the Z optimized text similarity matrices.
To achieve the above object, the technical solution of the present invention further provides a sentence-level bilingual alignment device, comprising a processor and a memory coupled to the processor, wherein the processor is configured to execute instructions in the memory so as to implement the above sentence-level bilingual alignment method.
To achieve the above object, the technical solution of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above sentence-level bilingual alignment method.
In the sentence-level bilingual alignment method provided by the invention, the text similarity matrix of two texts to be aligned is convolved with trained convolution kernels, and sentence alignment of the two texts is performed according to the convolution result. This not only reduces manual involvement and achieves automatic sentence alignment, but also improves alignment accuracy, thereby helping to improve the efficiency of sentence alignment between texts.
Brief description of the drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of a sentence-level bilingual alignment method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of training a convolution kernel provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a text similarity matrix provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of an objective matrix provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of calculating a text matching degree matrix provided by an embodiment of the present invention.
Specific embodiment
The present invention is described below on the basis of embodiments, but the present invention is not limited to these embodiments. Some specific details are described in the following detailed description of the invention; to avoid obscuring the essence of the invention, well-known methods, procedures, flows and elements are not described in detail.
In addition, those skilled in the art should understand that the drawings provided herein are for purposes of illustration and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the specification and claims words such as "include" and "comprise" should be construed in an inclusive rather than an exclusive or exhaustive sense, that is, in the sense of "including but not limited to".
In the description of the present invention, it should be understood that the terms "first", "second", etc. are used for descriptive purposes only and are not to be interpreted as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise indicated, "multiple" means two or more.
Referring to Fig. 1, which is a flowchart of a sentence-level bilingual alignment method provided by an embodiment of the present invention, the method comprises:
Step S1: obtaining Z trained convolution kernels, where Z is an integer greater than or equal to 1, each trained convolution kernel being obtained through steps S11 to S15;
Step S11: performing sentence segmentation on each of two training texts, and establishing the text similarity matrix B of the two training texts:
where n is the number of sentences obtained by segmenting one of the two training texts, m is the number of sentences obtained by segmenting the other training text, and the element Kij of the text similarity matrix B is the text similarity between the i-th sentence obtained from the one training text and the j-th sentence obtained from the other training text;
Step S12: initializing a convolution kernel;
Step S13: convolving the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and computing a loss value loss; if the loss value meets a preset requirement, executing step S14; otherwise, executing step S16;
where Lij is 1 if the i-th sentence obtained by segmenting the one training text matches the j-th sentence obtained by segmenting the other training text, and 0 otherwise;
Step S14: verifying the current convolution kernel on a validation set, and judging whether the verification result meets a preset requirement; if so, executing step S15; if not, executing step S16;
Step S15: taking the current convolution kernel as a trained convolution kernel;
Step S16: adjusting the weights of the current convolution kernel according to the loss value, and judging whether the current number of training iterations has reached a preset number; if so, executing step S15; if not, returning to step S13;
Step S2: performing sentence segmentation on each of two texts to be aligned, and establishing the text similarity matrix U of the two texts to be aligned:
where a is the number of sentences obtained by segmenting one of the two texts to be aligned, b is the number of sentences obtained by segmenting the other text to be aligned, and the element Kij of the text similarity matrix U is the text similarity between the i-th sentence obtained from the one text to be aligned and the j-th sentence obtained from the other text to be aligned;
Step S3: convolving the text similarity matrix U with each of the Z trained convolution kernels to obtain Z optimized text similarity matrices;
Step S4: obtaining the sentence alignment result of the two texts to be aligned from the Z optimized text similarity matrices.
In the sentence-level bilingual alignment method provided by the embodiment of the present invention, the text similarity matrix of two texts to be aligned is convolved with trained convolution kernels, and sentence alignment of the two texts is performed according to the convolution result. This not only reduces manual involvement and achieves automatic sentence alignment, but also improves alignment accuracy, helping to improve the efficiency of sentence alignment between texts.
Each trained convolution kernel in the embodiment of the present invention can be obtained by training a convolutional neural network. As shown in Fig. 2, the text similarity matrix B of two training texts whose sentence alignment result is known is used as the training-set input, and an objective matrix is also input; the objective matrix (i.e., the model answer) is compared with the matrix returned by the neural network, so that the output of the network approaches the objective matrix as closely as possible, thereby obtaining the required convolution kernel. The detailed process is as follows:
Step A1: obtaining two training texts from the training set, for example one English training text (the source text) and one Chinese training text (the translation), the sentence alignment result of the two training texts being known;
Step A2: performing sentence segmentation on each of the two training texts;
Sentence segmentation can use the punctuation marks that delimit sentences in the text. Taking Chinese-English alignment as an example, Chinese sentences end with "。" or "！" and English sentences end with ".": wherever such a mark occurs, the text is split. Segmentation yields two lists: an English (source) sentence list containing n English sentences and a Chinese (translation) sentence list containing m Chinese sentences. Each sentence in the English list is an independent sentence of the source text, and each sentence in the Chinese list is an independent sentence of the translation. In addition, for ease of processing, the sentences in each list can be numbered according to their order in the text (i.e., the position order of the sentences in the text) as sentence indices: in the English sentence list, the sentence at the beginning of the English text is numbered 1, ..., and the sentence at the end is numbered n; in the Chinese sentence list, the sentence at the beginning of the Chinese text is numbered 1, ..., and the sentence at the end is numbered m;
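Step A2 can be sketched as follows. The terminator sets are examples (the patent mentions "。" and "！" for Chinese and "." for English), and the regex-based splitting is an illustrative choice:

```python
import re

def split_sentences(text, terminators):
    """Split text into sentences at the given end-of-sentence marks (step A2),
    keeping each mark attached and numbering sentences by position (1-based)."""
    pattern = "([" + re.escape("".join(terminators)) + "])"
    parts = re.split(pattern, text)
    sentences = []
    # re.split with a capturing group alternates text and terminator
    for i in range(0, len(parts) - 1, 2):
        s = (parts[i] + parts[i + 1]).strip()
        if s:
            sentences.append(s)
    return {idx + 1: s for idx, s in enumerate(sentences)}

print(split_sentences("Hello world. How are you? Fine!", ".?!"))
# → {1: 'Hello world.', 2: 'How are you?', 3: 'Fine!'}
```

For the Chinese side, the same function can be called with terminators such as "。！".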
Step A3: establishing the text similarity matrix B of the two training texts, i.e., comparing each of the m sentences in the Chinese list with each of the n sentences in the English list for similarity. The detailed process is as follows:
First, a translation tool is used to translate the Chinese into the same language as the source (English), i.e., each sentence in the Chinese sentence list is translated to obtain its corresponding English rendering;
For the two sentences whose text similarity is to be calculated (one Chinese sentence and one English sentence), the number of words in the English sentence is compared with the number of words in the English rendering obtained by translating the Chinese sentence;
Then K = (N1 + N2 + ... + NE) / E is calculated, where E is the word count of whichever of the two compared sentences has more words, and Nv is the value for the v-th word of that longer sentence: Nv is 1 if the shorter of the two compared sentences contains a word with the same root as the v-th word, and 0 otherwise;
It should be noted that if the two word counts are equal, either sentence may be taken as the one with more words and the other as the one with fewer words;
That is, the words in the sentences are matched exactly by taking their roots, and the text similarity between the two sentences is calculated with the above formula: whenever the roots are identical the match count is incremented by 1; the sum of matches is the numerator, and the length of the sentence (i.e., the number of words in it) is the denominator; if the lengths differ, the word count of the longer sentence is taken as the denominator;
In the above manner, m*n text similarities are obtained and represented by a matrix of size m*n, which serves as the text similarity matrix B;
where the element Kij of the text similarity matrix B is the text similarity between the i-th sentence in the above English sentence list (i.e., the sentence numbered i) and the j-th sentence in the above Chinese sentence list (i.e., the sentence numbered j);
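Under the same assumptions (the translated sentences already available, and a caller-supplied pairwise similarity function), building the matrix B can be sketched as:

```python
import numpy as np

def build_similarity_matrix(en_sentences, zh_translations, sim):
    """Build an m x n text similarity matrix B (step A3): n English source
    sentences against m target sentences already rendered into English.
    sim(s1, s2) -> float is the pairwise similarity, e.g. the root-match K.
    The row/column orientation here is an illustrative choice."""
    m, n = len(zh_translations), len(en_sentences)
    B = np.zeros((m, n))
    for j, zh in enumerate(zh_translations):
        for i, en in enumerate(en_sentences):
            B[j, i] = sim(en, zh)
    return B

# Toy similarity: word-set overlap over the longer sentence (illustrative only)
def toy_sim(s1, s2):
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / max(len(a), len(b))

B = build_similarity_matrix(["a b c", "d e f"],
                            ["a b c", "x d e", "f g"], toy_sim)
print(B.shape)  # → (3, 2)
```

Any sentence-level similarity can be plugged in for `sim`; the matrix shape m*n matches the two sentence counts.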
For example, after the two training texts are processed, the text similarity matrix shown in Fig. 3 is obtained. It can be seen that the elements with larger values (i.e., higher text similarity) cluster along the diagonal running from the upper-left corner to the lower-right corner, because the Chinese and English texts have the same sentence ordering;
Step A4: initializing a convolution kernel, taking the initialized kernel as the current convolution kernel, and executing step A5;
Step A5: establishing an objective matrix J according to the sentence alignment result of the above two training texts;
where the element Lij of the objective matrix J corresponds to the i-th sentence in the above English sentence list and the j-th sentence in the above Chinese sentence list, and its value is determined by the known sentence alignment result: if the i-th sentence in the English sentence list matches the j-th sentence in the Chinese sentence list, Lij is 1, and otherwise 0;
For example, the objective matrix J established from the above two training texts is shown in Fig. 4;
Step A6: convolving the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and calculating the loss value loss against the established objective matrix J; if the loss value meets a preset requirement (e.g., is below a threshold), executing step A7; otherwise, executing step A9;
Step A7: verifying the current convolution kernel on a validation set, and judging whether the verification result meets a preset requirement; if so, executing step A8; if not, executing step A9;
where the validation set comprises several validation text pairs, each consisting of an English text (source) and a Chinese text (translation);
where the verification process is essentially the same as the training process and is not described again here; when, in the verification result, the loss value on the validation set is below a certain threshold and the accuracy on the validation set exceeds a certain threshold, the verification result is judged to meet the preset requirement;
Step A8: taking the current convolution kernel as a trained convolution kernel;
Step A9: adjusting the weights of the current convolution kernel according to the loss value, and judging whether the current number of training iterations has reached a preset number; if so, executing step A8; if not, returning to step A6.
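The loop of steps A4, A6 and A9 can be sketched in miniature. The patent does not specify the loss function or the weight update rule, so mean squared error and a finite-difference gradient step are illustrative stand-ins for the backpropagation of a real convolutional network:

```python
import numpy as np

def conv2d_same(M, kernel):
    # Zero-padded "same"-size convolution (cross-correlation form).
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(M, ((ph, ph), (pw, pw)))
    out = np.zeros_like(M, dtype=float)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def train_kernel(B, J, size=3, lr=0.05, max_iters=200, tol=1e-3, seed=0):
    """Steps A4/A6/A9 in miniature: initialise a kernel, convolve B, compare
    the result P with the objective matrix J via an MSE loss, and adjust the
    kernel weights by finite-difference gradient descent until the loss is
    small enough or the iteration budget runs out."""
    rng = np.random.default_rng(seed)
    kernel = rng.normal(scale=0.1, size=(size, size))  # A4: initialisation
    losses = []
    for _ in range(max_iters):
        P = conv2d_same(B, kernel)                     # A6: convolution
        loss = float(np.mean((P - J) ** 2))            # A6: loss vs objective
        losses.append(loss)
        if loss < tol:                                 # A6: preset requirement
            break
        grad = np.zeros_like(kernel)                   # A9: numerical gradient
        eps = 1e-5
        for r in range(size):
            for c in range(size):
                k2 = kernel.copy()
                k2[r, c] += eps
                loss2 = np.mean((conv2d_same(B, k2) - J) ** 2)
                grad[r, c] = (loss2 - loss) / eps
        kernel -= lr * grad                            # A9: weight adjustment
    return kernel, losses

B = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.3],
              [0.0, 0.2, 0.7]])
J = np.eye(3)  # known alignment: sentence i pairs with sentence i
kernel, losses = train_kernel(B, J)
print(losses[-1] < losses[0])  # → True
```

Because the loss is a convex quadratic in the kernel weights, this small learning rate makes the loss decrease monotonically toward the objective matrix.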
Preferably, in one embodiment, Z is an integer greater than or equal to 2, and different trained convolution kernels differ in size and weights; for example, Z may be 3, 5 or 6.
To obtain multiple trained convolution kernels, multiple convolution kernels can be initialized separately (the initialized kernels differing in size and weights), and each kernel is then used to convolve the text similarity matrix B of the above two training texts. The results are several matrices with changed values, each of which is compared with the objective matrix to obtain the loss value of the corresponding convolutional neural network. A larger loss value indicates that a network performs worse and needs a larger parameter adjustment; a smaller loss value indicates that it performs better and needs a smaller adjustment. The different loss values can therefore be propagated back to their respective convolutional neural networks, and each network adjusts its parameters layer by layer in reverse according to its own loss value, i.e., adjusts the weights of its convolution kernel; the weight adjustment of each kernel differs from one backpropagation pass to another, until the loss value meets the expected requirement.
It should be noted that a convolution kernel trained in the above manner can be stored in a memory and, when needed, read directly from the memory.
For example, in one embodiment, of the two texts to be aligned, one is an English text (the source) and the other is a Chinese text (the translation). The method of establishing the text similarity matrix U of the two texts to be aligned is the same as that of establishing the text similarity matrix B of the two training texts (i.e., steps A1, A2 and A3 above) and is not described again here.
In step S3 above, the text similarity matrix U of the two texts to be aligned is convolved with the trained convolution kernels, so that the matrix U is optimized and corrected, yielding the optimized text similarity matrices.
For example, in one embodiment, the above step S4 includes:

Step S41: calculating a text matching degree matrix T from the Z optimization text similarity matrices, wherein the element Yij of T is the text matching degree between the i-th sentence obtained by sentence-breaking the one text to be aligned and the j-th sentence obtained by sentence-breaking the other text to be aligned, and the value of each element of T is the average of the elements at the same position in the Z optimization text similarity matrices;

that is, the Z optimization text similarity matrices are added position by position and each element is divided by Z to obtain the text matching degree matrix T.

It should be noted that if Z is 1, the single optimization text similarity matrix can be used directly as the text matching degree matrix.

For example, referring to Fig. 5, convolving the text similarity matrix U of the two texts to be aligned with 3 trained convolution kernels yields 3 optimization text similarity matrices, from which the text matching degree matrix is then calculated;
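The averaging of step S41 is a single element-wise mean over the Z matrices. A minimal numpy sketch, using the three first-row vectors of the Fig. 5 discussion (the only values the text states) as stand-ins for full matrices:

```python
import numpy as np

# Z = 3 optimization text similarity matrices, one per trained kernel.
# Only the first rows are given in the Fig. 5 discussion, so single-row
# matrices stand in for the full ones here.
opt = [np.array([[0.7, 0.6, 0.3]]),
       np.array([[0.7, 0.6, 0.2]]),
       np.array([[0.7, 0.9, 0.4]])]

# Step S41: element-wise average over the Z matrices.
T = np.mean(np.stack(opt), axis=0)  # first row of T becomes [0.7, 0.7, 0.3]
```

The averaged row [0.7, 0.7, 0.3] matches the first row of the text matching degree matrix discussed below.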
Step S42: traversing the rows of the text matching degree matrix T in turn, choosing from each row the element with the maximum value, and pairing the two sentences corresponding to the chosen element;

For example, for the text matching degree matrix obtained in Fig. 5, the maximum element is chosen from each row and the two corresponding sentences are paired, yielding three pairings: row 1 (the 1st sentence obtained by sentence-breaking the one text to be aligned) with column 1 (the 1st sentence obtained by sentence-breaking the other text to be aligned), row 2 (the 2nd sentence of the one text to be aligned) with column 3 (the 3rd sentence of the other text to be aligned), and row 3 (the 3rd sentence of the one text to be aligned) with column 3 (the 3rd sentence of the other text to be aligned).

In this step, if a row of the text matching degree matrix T contains several elements sharing the maximum value, the maximum value of that row is first determined and taken as the current lookup value. The elements at the same positions as the tied elements are then looked up in the Z optimization text similarity matrices, the position at which the current lookup value occurs most often is determined, and the two sentences corresponding to that position are paired. For example, the first row of the text matching degree matrix in Fig. 5 is [0.7, 0.7, 0.3], so the elements at row 1 column 1 and row 1 column 2 share the maximum value 0.7. Looking up those two positions in the 3 optimization text similarity matrices, whose first rows are [0.7, 0.6, 0.3], [0.7, 0.6, 0.2] and [0.7, 0.9, 0.4], the value 0.7 occurs most often at row 1 column 1; row 1 (the 1st sentence of the one text to be aligned) is therefore paired with column 1 (the 1st sentence of the other text to be aligned). Alternatively, when a row of T contains several maximal elements, one of them may simply be selected at random as the maximal element;
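The row-wise selection of step S42, including the tie-breaking rule just described (among tied columns, prefer the one at which the tied value occurs most often across the Z optimization matrices; a random choice is the stated fallback), can be sketched as follows. The function and variable names are ours:

```python
import numpy as np

def pair_rows(T, opt_mats):
    """Row-wise pairing of step S42: for each row of T take the maximal
    column; ties are broken in favour of the column at which the tied
    value occurs most often across the Z optimization matrices."""
    pairs = {}
    for i, row in enumerate(T):
        best = np.max(row)
        cols = list(np.flatnonzero(np.isclose(row, best)))
        if len(cols) > 1:
            counts = [sum(bool(np.isclose(m[i, j], best)) for m in opt_mats)
                      for j in cols]
            cols = [cols[int(np.argmax(counts))]]
        pairs[i] = int(cols[0])
    return pairs
```

On the Fig. 5 first-row example ([0.7, 0.7, 0.3] averaged from [0.7, 0.6, 0.3], [0.7, 0.6, 0.2], [0.7, 0.9, 0.4]), the tie between columns 1 and 2 resolves to column 1, as in the text.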
Through the above step S42, every sentence of the one text to be aligned can be paired, but one or more sentences of the other text to be aligned may remain unpaired. Preferably, therefore, step S4 further includes, after step S42:

Step S43: judging whether any of the b sentences obtained by sentence-breaking the other text to be aligned is unpaired and, if so, finding in the text matching degree matrix T the sentence with which it has the greatest text matching degree and pairing the two, thereby detecting missed pairings in the columns of the matrix;

For example, after the pairing of step S42, column 2 of the text matching degree matrix obtained in Fig. 5 still has no matched row (i.e. the 2nd sentence obtained by sentence-breaking the other text to be aligned is unpaired). The maximum element of column 2 of T is therefore looked up; the result is the element at row 1, column 2, so row 1 (the 1st sentence of the one text to be aligned) is paired with column 2 (the 2nd sentence of the other text to be aligned). Through the above steps S42-S43, the pairing result for the text matching degree matrix in Fig. 5 is: row 1 with column 1, row 1 with column 2, row 2 with column 3, and row 3 with column 3;
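The column check of step S43 can be sketched on top of the step S42 row pairing. `repair_columns` is our name; it returns the extra (row, column) pairs for columns left unmatched:

```python
import numpy as np

def repair_columns(T, pairs):
    """Step S43 sketch: any column of T absent from the row-wise pairing
    is paired with the row where that column's value is largest.
    `pairs` maps row index -> matched column index (output of step S42)."""
    matched = set(pairs.values())
    return [(int(np.argmax(T[:, j])), j)
            for j in range(T.shape[1]) if j not in matched]
```

With a pairing that leaves column 2 unmatched (as in the Fig. 5 walk-through), the repair pairs it with the row holding the column maximum.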
Preferably, in one embodiment, after step S4 the method further includes:

Step S5: detecting the sentence alignment result according to the positional order, within the other text to be aligned, of the b sentences obtained by sentence-breaking it and the positional order, within the one text to be aligned, of the a sentences obtained by sentence-breaking it;

For example, step S5 may specifically include:

Step S51: sorting the a sentences according to the positional order of the b sentences in the other text to be aligned and the sentence alignment result;

Step S52: if two of the a sentences appear in the order produced by the sorting in the opposite order to their positions in the one text to be aligned, judging that an error exists. It should be noted that "opposite positional order" here means: for two sentences of the one text to be aligned, if the order produced by the sorting of step S51 places one sentence before the other but, in the one text to be aligned, that sentence is located after the other, the positional orders are deemed opposite.
For example, suppose the one text to be aligned is an English text and the other is a Chinese text. After sentence alignment of the Chinese and English texts, matching pairs of the form [Chinese 20, English 25] are normally obtained. To further improve pairing accuracy, the matching result can be checked as follows: first the matching pairs are sorted in ascending order of the Chinese sentence numbers (the positional order, within the Chinese text, of all sentences obtained by breaking the Chinese text), which effectively sorts all the English sentences broken out of the English text; then the variation of the English sentence numbers (the positional order, within the English text, of all sentences obtained by breaking the English text) along this sorted order is examined to judge whether it is monotonically increasing. Monotonic increase here means: within a sorted sequence, each number in a later position is greater than the number in the position before it. Matching pairs that do not conform to the monotonic increase can be marked, so that an error prompt is given to the user.
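The monotonicity check above can be sketched as follows. The (Chinese number, English number) pair representation and the strict "greater than" test follow the description; treating a repeated English number as a violation is our reading, since the text defines monotonic increase strictly:

```python
def check_monotonic(pairs):
    """Flag matches that break the expected monotonic increase: sort the
    (chinese_no, english_no) pairs by Chinese sentence number and report
    every pair whose English number is not greater than the last
    accepted one."""
    flagged, prev = [], float("-inf")
    for zh, en in sorted(pairs):
        if en <= prev:
            flagged.append((zh, en))
        else:
            prev = en
    return flagged
```

For [(1, 1), (2, 3), (3, 2)] the English numbers run 1, 3, 2, so the pair (3, 2) is flagged for the user; a fully monotone alignment yields no flags.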
The sentence-level bilingual alignment method provided by the embodiments of the present invention takes into account that diverse text structures and authors' writing habits produce complicated and varied sentence-pairing situations during sentence alignment. By convolving the text similarity matrix of the two texts to be aligned with multiple trained convolution kernels, the text similarity matrix is optimized and corrected so that the optimized matrices take account of the temporal order (i.e. positional order) in which sentences occur in the text. This not only avoids the interference that identical sentences cause during sentence matching, but also avoids the interference caused by complicated and varied sentence-pairing situations, guaranteeing the accuracy of sentence matching and greatly improving the robustness of the algorithm.
An embodiment of the present invention further provides a sentence-level bilingual alignment device, comprising:

an obtaining module for obtaining Z trained convolution kernels, wherein Z is an integer greater than or equal to 1 and each trained convolution kernel is obtained through steps S11-S15;

Step S11: performing sentence-breaking on two training texts respectively and establishing the text similarity matrix B of the two training texts:

wherein n is the number of sentences obtained by sentence-breaking one of the two training texts, m is the number of sentences obtained by sentence-breaking the other training text, and the element Kij of the text similarity matrix B is the text similarity between the i-th sentence obtained by sentence-breaking the one training text and the j-th sentence obtained by sentence-breaking the other training text;

Step S12: initializing a convolution kernel;

Step S13: convolving the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and calculating a loss value loss; if the loss meets a preset requirement, executing step S14, otherwise executing step S16;

wherein Lij is 1 if the i-th sentence obtained by sentence-breaking the one training text matches the j-th sentence obtained by sentence-breaking the other training text, and 0 otherwise;

Step S14: verifying the current convolution kernel with a verification set and judging whether the verification result meets a preset requirement; if so, executing step S15; if not, executing step S16;

Step S15: taking the current convolution kernel as a trained convolution kernel;

Step S16: adjusting the weights of the current convolution kernel according to the loss value loss and judging whether the current number of training iterations has reached a preset number; if so, executing step S15; if not, repeating step S13;

a first processing module for performing sentence-breaking on two texts to be aligned respectively and establishing the text similarity matrix U of the two texts to be aligned:

wherein a is the number of sentences obtained by sentence-breaking one of the two texts to be aligned, b is the number of sentences obtained by sentence-breaking the other text to be aligned, and the element Kij of the text similarity matrix U is the text similarity between the i-th sentence obtained by sentence-breaking the one text to be aligned and the j-th sentence obtained by sentence-breaking the other text to be aligned;

a second processing module for convolving the text similarity matrix U with each of the Z trained convolution kernels to obtain Z optimization text similarity matrices;

a third processing module for obtaining the sentence alignment result of the two texts to be aligned using the Z optimization text similarity matrices.
In one embodiment, Z is an integer greater than or equal to 2 and the different trained convolution kernels differ in size and weight.

In one embodiment, the third processing module includes:

a computing unit for calculating a text matching degree matrix T from the Z optimization text similarity matrices, wherein the element Yij of T is the text matching degree between the i-th sentence obtained by sentence-breaking the one text to be aligned and the j-th sentence obtained by sentence-breaking the other text to be aligned, and the value of each element of T is the average of the elements at the same position in the Z optimization text similarity matrices;

a first pairing unit for traversing the rows of the text matching degree matrix T in turn, choosing from each row the element with the maximum value, and pairing the two sentences corresponding to the chosen element.

In one embodiment, the third processing module further includes:

a second pairing unit for judging whether any of the b sentences obtained by sentence-breaking the other text to be aligned is unpaired and, if so, finding in the text matching degree matrix T the sentence with which it has the greatest text matching degree and pairing the two.

In one embodiment, the sentence-level bilingual alignment device further includes:

a result detection module for detecting the sentence alignment result according to the positional order, within the other text to be aligned, of its b sentence-broken sentences and the positional order, within the one text to be aligned, of its a sentence-broken sentences.

In one embodiment, the result detection module includes:

a sorting unit for sorting the a sentences according to the positional order of the b sentences in the other text to be aligned and the sentence alignment result;

a detection unit for judging that an error exists if two of the a sentences appear in the sorted order in the opposite order to their positions in the one text to be aligned.
In one embodiment, the two training texts include one English text and one non-English-language text, and so do the two texts to be aligned, wherein the text similarity K between each sentence obtained by sentence-breaking the English text and each sentence obtained by sentence-breaking the non-English-language text is calculated as follows:

the sentences obtained by sentence-breaking the non-English-language text are translated, giving corresponding English text;

for the two sentences whose text similarity is to be calculated, the word counts of the sentence obtained by breaking the English text and of the English text obtained by translating the sentence broken out of the non-English-language text are compared;

K is then calculated as K = (Σv Nv) / E,

wherein E is the word count of whichever of the two compared sentences has more words, Nv is the value taken for the v-th word of that longer sentence, and Nv is 1 if the sentence with fewer words contains a word with the same root as the v-th word, and 0 otherwise.
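A sketch of this similarity is given below. Two points are assumptions of ours, not fixed by the text: K is read as the mean of the Nv values (the proportion of words in the longer sentence with a root match in the shorter), and root matching is approximated by a shared character prefix, since the text names no stemming method:

```python
def similarity_k(english_sentence, translated_sentence, root_len=4):
    """Sketch of the similarity K: E is the word count of the longer of
    the two word lists; Nv is 1 when the shorter list contains a word
    sharing a root with the v-th word of the longer, else 0; K is taken
    as the mean of the Nv.  Root matching by a shared prefix of
    `root_len` characters is a simplification of ours."""
    a = english_sentence.lower().split()
    b = translated_sentence.lower().split()
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)

    def root(word):
        return word[:root_len]

    shorter_roots = {root(w) for w in shorter}
    hits = sum(1 for w in longer if root(w) in shorter_roots)  # sum of Nv
    return hits / len(longer)                                  # divide by E
```

For a sentence and its own translation-copy, K is 1; when only half of the longer sentence's words find a root match, K is 0.5.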
An embodiment of the present invention further provides a sentence-level bilingual alignment device including a processor and a memory coupled to the processor, wherein the processor executes instructions in the memory to implement the sentence-level bilingual alignment method described above.

An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the sentence-level bilingual alignment method described above.

Those skilled in the art will readily recognize that the preferred embodiments above can be freely combined and superposed provided they do not conflict.

It should be understood that the embodiments above are merely exemplary and not restrictive; without departing from the basic principle of the present invention, those skilled in the art can make various obvious or equivalent modifications or replacements to the details above, all of which fall within the scope of the claims of the present invention.
Claims (10)
1. A sentence-level bilingual alignment method, characterized by comprising:

Step S1: obtaining Z trained convolution kernels, wherein Z is an integer greater than or equal to 1 and each trained convolution kernel is obtained through steps S11-S15;

Step S11: performing sentence-breaking on two training texts respectively and establishing the text similarity matrix B of the two training texts:

wherein n is the number of sentences obtained by sentence-breaking one of the two training texts, m is the number of sentences obtained by sentence-breaking the other training text, and the element Kij of the text similarity matrix B is the text similarity between the i-th sentence obtained by sentence-breaking the one training text and the j-th sentence obtained by sentence-breaking the other training text;

Step S12: initializing a convolution kernel;

Step S13: convolving the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and calculating a loss value loss; if the loss meets a preset requirement, executing step S14, otherwise executing step S16;

wherein Lij is 1 if the i-th sentence obtained by sentence-breaking the one training text matches the j-th sentence obtained by sentence-breaking the other training text, and 0 otherwise;

Step S14: verifying the current convolution kernel with a verification set and judging whether the verification result meets a preset requirement; if so, executing step S15; if not, executing step S16;

Step S15: taking the current convolution kernel as a trained convolution kernel;

Step S16: adjusting the weights of the current convolution kernel according to the loss value loss and judging whether the current number of training iterations has reached a preset number; if so, executing step S15; if not, repeating step S13;

Step S2: performing sentence-breaking on two texts to be aligned respectively and establishing the text similarity matrix U of the two texts to be aligned:

wherein a is the number of sentences obtained by sentence-breaking one of the two texts to be aligned, b is the number of sentences obtained by sentence-breaking the other text to be aligned, and the element Kij of the text similarity matrix U is the text similarity between the i-th sentence obtained by sentence-breaking the one text to be aligned and the j-th sentence obtained by sentence-breaking the other text to be aligned;

Step S3: convolving the text similarity matrix U with each of the Z trained convolution kernels to obtain Z optimization text similarity matrices;

Step S4: obtaining the sentence alignment result of the two texts to be aligned using the Z optimization text similarity matrices.
2. The method according to claim 1, characterized in that Z is an integer greater than or equal to 2 and the different trained convolution kernels differ in size and weight.
3. The method according to claim 1, characterized in that step S4 comprises:

Step S41: calculating a text matching degree matrix T from the Z optimization text similarity matrices, wherein the element Yij of T is the text matching degree between the i-th sentence obtained by sentence-breaking the one text to be aligned and the j-th sentence obtained by sentence-breaking the other text to be aligned, and the value of each element of T is the average of the elements at the same position in the Z optimization text similarity matrices;

Step S42: traversing the rows of the text matching degree matrix T in turn, choosing from each row the element with the maximum value, and pairing the two sentences corresponding to the chosen element.
4. The method according to claim 3, characterized in that after step S42 it further comprises:

Step S43: judging whether any of the b sentences obtained by sentence-breaking the other text to be aligned is unpaired and, if so, finding in the text matching degree matrix T the sentence with which it has the greatest text matching degree and pairing the two.
5. The method according to claim 1, characterized in that after step S4 it further comprises:

Step S5: detecting the sentence alignment result according to the positional order, within the other text to be aligned, of its b sentence-broken sentences and the positional order, within the one text to be aligned, of its a sentence-broken sentences.
6. The method according to claim 5, characterized in that step S5 comprises:

Step S51: sorting the a sentences according to the positional order of the b sentences in the other text to be aligned and the sentence alignment result;

Step S52: if two of the a sentences appear in the sorted order in the opposite order to their positions in the one text to be aligned, judging that an error exists.
7. The method according to any one of claims 1-6, characterized in that the two training texts include one English text and one non-English-language text, and so do the two texts to be aligned, wherein the text similarity K between each sentence obtained by sentence-breaking the English text and each sentence obtained by sentence-breaking the non-English-language text is calculated as follows:

the sentences obtained by sentence-breaking the non-English-language text are translated, giving corresponding English text;

for the two sentences whose text similarity is to be calculated, the word counts of the sentence obtained by breaking the English text and of the English text obtained by translating the sentence broken out of the non-English-language text are compared;

K is then calculated as K = (Σv Nv) / E,

wherein E is the word count of whichever of the two compared sentences has more words, Nv is the value taken for the v-th word of that longer sentence, and Nv is 1 if the sentence with fewer words contains a word with the same root as the v-th word, and 0 otherwise.
8. A sentence-level bilingual alignment device, characterized by comprising:

an obtaining module for obtaining Z trained convolution kernels, wherein Z is an integer greater than or equal to 1 and each trained convolution kernel is obtained through steps S11-S15;

Step S11: performing sentence-breaking on two training texts respectively and establishing the text similarity matrix B of the two training texts:

wherein n is the number of sentences obtained by sentence-breaking one of the two training texts, m is the number of sentences obtained by sentence-breaking the other training text, and the element Kij of the text similarity matrix B is the text similarity between the i-th sentence obtained by sentence-breaking the one training text and the j-th sentence obtained by sentence-breaking the other training text;

Step S12: initializing a convolution kernel;

Step S13: convolving the text similarity matrix B of the two training texts with the current convolution kernel to obtain a matrix P, and calculating a loss value loss; if the loss meets a preset requirement, executing step S14, otherwise executing step S16;

wherein Lij is 1 if the i-th sentence obtained by sentence-breaking the one training text matches the j-th sentence obtained by sentence-breaking the other training text, and 0 otherwise;

Step S14: verifying the current convolution kernel with a verification set and judging whether the verification result meets a preset requirement; if so, executing step S15; if not, executing step S16;

Step S15: taking the current convolution kernel as a trained convolution kernel;

Step S16: adjusting the weights of the current convolution kernel according to the loss value loss and judging whether the current number of training iterations has reached a preset number; if so, executing step S15; if not, repeating step S13;

a first processing module for performing sentence-breaking on two texts to be aligned respectively and establishing the text similarity matrix U of the two texts to be aligned:

wherein a is the number of sentences obtained by sentence-breaking one of the two texts to be aligned, b is the number of sentences obtained by sentence-breaking the other text to be aligned, and the element Kij of the text similarity matrix U is the text similarity between the i-th sentence obtained by sentence-breaking the one text to be aligned and the j-th sentence obtained by sentence-breaking the other text to be aligned;

a second processing module for convolving the text similarity matrix U with each of the Z trained convolution kernels to obtain Z optimization text similarity matrices;

a third processing module for obtaining the sentence alignment result of the two texts to be aligned using the Z optimization text similarity matrices.
9. A sentence-level bilingual alignment device, characterized by comprising a processor and a memory coupled to the processor, wherein the processor is configured to execute instructions in the memory to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811562126.2A CN109670178B (en) | 2018-12-20 | 2018-12-20 | Sentence-level bilingual alignment method and device, computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109670178A true CN109670178A (en) | 2019-04-23 |
CN109670178B CN109670178B (en) | 2019-10-08 |
Family
ID=66144024
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723587A (en) * | 2020-06-23 | 2020-09-29 | 桂林电子科技大学 | Chinese-Thai entity alignment method oriented to cross-language knowledge graph |
CN112906371A (en) * | 2021-02-08 | 2021-06-04 | 北京有竹居网络技术有限公司 | Parallel corpus acquisition method, device, equipment and storage medium |
CN113657421A (en) * | 2021-06-17 | 2021-11-16 | 中国科学院自动化研究所 | Convolutional neural network compression method and device and image classification method and device |
CN114564932A (en) * | 2021-11-25 | 2022-05-31 | 阿里巴巴达摩院(杭州)科技有限公司 | Chapter alignment method, apparatus, computer device and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868187A (en) * | 2016-03-25 | 2016-08-17 | 北京语言大学 | A multi-translation version parallel corpus establishing method |
US20170004121A1 (en) * | 2015-06-30 | 2017-01-05 | Facebook, Inc. | Machine-translation based corrections |
CN108897740A (en) * | 2018-05-07 | 2018-11-27 | 内蒙古工业大学 | A kind of illiteracy Chinese machine translation method based on confrontation neural network |
Non-Patent Citations (2)
Title |
---|
WU Honglin et al.: "A Sentence Alignment Model Based on Combined Clues and Kernel Extensional Matrix Matching Method", AASRI Procedia |
DING Ying et al.: "Sentence Alignment Research Based on Word-Pair Modeling", Computer Engineering (online first) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109670178B (en) | Sentence-level bilingual alignment method and device, computer readable storage medium | |
US11016966B2 (en) | Semantic analysis-based query result retrieval for natural language procedural queries | |
Wu et al. | Learning to extract coherent summary via deep reinforcement learning | |
CN109032375A (en) | Candidate text sorting method, device, equipment and storage medium |
CN110489523B (en) | Fine-grained sentiment analysis method based on online shopping reviews |
CN108804677A (en) | Deep learning question classification method and system combining a multi-layer attention mechanism |
CN106815252A (en) | Search method and device |
CN104636466A (en) | Entity attribute extraction method and system for open web pages |
CN105068997B (en) | Construction method and device for parallel corpora |
CN104679728A (en) | Text similarity detection device | |
JP2005122533A (en) | Question-answering system and question-answering processing method | |
CN106569993A (en) | Method and device for mining hypernym-hyponym relation between domain-specific terms | |
CN109657204A (en) | Automatic font matching using asymmetric metric learning |
CN105760359B (en) | Question processing system and method thereof | |
CN109697288B (en) | Instance alignment method based on deep learning | |
Rodríguez-Fernández et al. | Semantics-driven recognition of collocations using word embeddings | |
CN113010657B (en) | Answer processing method and answer recommendation method based on answer text | |
CN110489554B (en) | Attribute-level emotion classification method based on location-aware mutual attention network model | |
WO2022151594A1 (en) | Intelligent recommendation method and apparatus, and computer device | |
Rücklé et al. | Representation learning for answer selection with LSTM-based importance weighting | |
CN111026815B (en) | Entity pair specific relation extraction method based on user-assisted correction | |
CN110633467A (en) | Semantic relation extraction method based on improved feature fusion | |
CN109766547B (en) | Sentence similarity calculation method | |
Rahman et al. | NLP-based automatic answer script evaluation | |
CN112434134A (en) | Search model training method and device, terminal equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP02 | Change in the address of a patent holder |
Address after: Office 1316, No. 1 Lian'ao Road, Hengqin New Area, Zhuhai, Guangdong 519031 |
Patentee after: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd. |
Address before: Room 417, Building 20, Creative Valley, Hengqin New District, Zhuhai City, Guangdong Province 519031 |
Patentee before: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd. |
|
PP01 | Preservation of patent right |
Effective date of registration: 2024-07-18 |
Granted publication date: 2019-10-08 |
|