CN105653516A

CN105653516A - Parallel corpus aligning method and device

Info

Publication number: CN105653516A
Application number: CN201511022223.9A
Authority: CN
Inventors: 江潮; 张芃
Original assignee: WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Current assignee: WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority date: 2015-12-30
Filing date: 2015-12-30
Publication date: 2016-06-08
Anticipated expiration: 2035-12-30
Also published as: CN105653516B

Abstract

The invention discloses a parallel corpus alignment method. The method comprises the following steps: converting all original sentences in an original text and all translated sentences in a translated text into characters with the same encoding; performing word segmentation on the converted original sentences and removing stop words to obtain content words; obtaining all translated items of each content word of each original sentence; matching all translated items of each content word of each original sentence against the converted translated sentences to obtain the similarity between each content word of each original sentence and the translated sentences; matching each original sentence with the translated sentences according to those content-word similarities to obtain the similarity between each original sentence and the translated sentences; and matching and aligning each original sentence with the translated sentence most similar to it. The invention further discloses a parallel corpus alignment device. The method and device solve the problem of aligning original texts with their translations.

Description

Method and device for parallel corpus alignment

Technical field

The present invention relates to the field of translation technology, and in particular to a method and device for parallel corpus alignment.

Background art

Parallel corpora play a fundamental role in fields such as machine translation, computer-aided translation, semantic disambiguation, and dictionary compilation. Aligning a parallel corpus means putting the original text and its translation into correspondence at a chosen segmentation granularity, forming standardized language pairs. The units of corpus alignment range, from coarse to fine, over chapter, paragraph, sentence, and word. The finer the granularity, the richer the linguistic information the parallel corpus provides and the greater its practical value.

In general, when a corpus is aligned by chapter or paragraph, the original text and the translation can simply be aligned in order. Aligning at the sentence level or finer within a paragraph cannot be handled so simply: because of the style of the source language, the style of the target language, the translator's writing style, content adjustments, and other factors, aligning the original sentences and translated sentences of a paragraph strictly in order often produces a large number of mismatches. Alignment at sentence granularity or finer has therefore generally required manual processing, which is time-consuming, laborious, and inefficient.

Summary of the invention

An object of the embodiments of the present invention is to overcome the above deficiencies of the prior art by providing a method for parallel corpus alignment that, based on the similarity of notional words, solves the problem of aligning an original text with its translation.

Another object of the embodiments of the present invention is to overcome the above deficiencies of the prior art by providing a device for parallel corpus alignment that, based on the similarity of notional words, solves the problem of aligning an original text with its translation.

To achieve the above objects, the technical solutions of the embodiments of the present invention are as follows:

A method for parallel corpus alignment, comprising: converting all original sentences in an original text and all translated sentences in a translated text into characters of the same encoding; segmenting all of the original sentences in the converted original text into words and removing stop words to obtain notional words (content words); obtaining all translated items of each notional word of the original sentences; matching all translated items of each notional word of each original sentence against all of the translated sentences in the converted translated text to obtain the similarity between each notional word of each original sentence and the translated sentences; matching each original sentence against the translated sentences according to the similarities between all notional words of that original sentence and the translated sentences, to obtain the similarity between each original sentence and the translated sentences; and matching and aligning each original sentence with the translated sentence that has the highest similarity to it.

Further, the process of matching all translated items of each notional word of each original sentence against all of the translated sentences, to obtain the similarity between each notional word of each original sentence and the translated sentences, comprises the following. According to

\mathrm{sim}(nw_{jl}, TR_{i}nw_{r}) = L / (\mathrm{dis}(nw_{jl}, TR_{i}nw_{r}) + L)

obtain the similarity between the l-th translated item nw_{jl} of the j-th notional word of an original sentence OR and the r-th notional word TR_{i}nw_{r} of the i-th translated sentence TR_{i}. According to

\mathrm{sim}(nw_{jl}, TR_{i}) = \max_{r = 1, 2, \ldots, p} \mathrm{sim}(nw_{jl}, TR_{i}nw_{r})

obtain the similarity between the l-th translated item nw_{jl} of the j-th notional word of the original sentence OR and the translated sentence TR_{i}. According to

\mathrm{sim}(nw_{j}, TR_{i}) = \max_{l = 1, 2, \ldots, k} \mathrm{sim}(nw_{jl}, TR_{i})

obtain the similarity between the j-th notional word nw_{j} of the original sentence OR and the i-th translated sentence TR_{i}.

Here, the original sentence OR has m notional words, the translated text contains n translated sentences in total, the translated sentence TR_{i} has p notional words, the j-th notional word has k translated items, L denotes an adjustment parameter, and dis(nw_{jl}, TR_{i}nw_{r}) denotes the distance between the dictionary codes of the l-th translated item nw_{jl} of the j-th notional word of the original sentence OR and the r-th notional word TR_{i}nw_{r} of the i-th translated sentence TR_{i}, with i = 1, 2, ..., n; j = 1, 2, ..., m; l = 1, 2, ..., k; r = 1, 2, ..., p.

Further, the process of matching each original sentence against the translated sentences according to the similarities between all notional words of that original sentence and the translated sentences, to obtain the similarity between each original sentence and the translated sentences, comprises the following. According to

\mathrm{sim}(OR, TR_{i}) = \prod_{j = 1, 2, \ldots, m} \mathrm{sim}(nw_{j}, TR_{i})

obtain the similarity between the original sentence OR and the translated sentence TR_{i}.

Further, the process of matching and aligning each original sentence with the translated sentence that has the highest similarity to it comprises the following. According to

\max_{i = 1, 2, \ldots, n} \mathrm{sim}(OR, TR_{i}) = \max_{i = 1, 2, \ldots, n} \prod_{j = 1, 2, \ldots, m} \mathrm{sim}(nw_{j}, TR_{i})

obtain the translated sentence that has the highest similarity to the original sentence OR; then match that translated sentence with the original sentence OR and align the two.

Further, the method also comprises: numbering the original sentences in the converted original text in order; numbering the translated sentences in the converted translated text in order; if the same translated sentence has the highest similarity to several original sentences, obtaining the numbers of those original sentences in the original text and the number of the translated sentence in the translated text; if one of those original sentences has the number closest to the number of the translated sentence, matching and aligning that original sentence with the translated sentence; if two of those original sentences have numbers equally close to the number of the translated sentence, matching and aligning the original sentence with the smaller number with the translated sentence, then comparing the similarities between the original sentence with the larger number and the remaining translated sentences, and matching and aligning the larger-numbered original sentence with whichever remaining translated sentence is most similar to it; and repeating the above process until every original sentence has been matched and aligned with a translated sentence.

Further, the dictionary is a synonym classification dictionary encoded as a tree structure, in which each node has a unique code.

Further, the adjustment parameter L is the number of levels of the tree-structured synonym classification dictionary.

Further, the process of converting all original sentences in the original text and all translated sentences in the translated text into characters of the same encoding comprises: reading the characters or character strings in the original sentences according to the character encoding of the original sentences in the original text, and reading the characters or character strings in the translated sentences according to the character encoding of the translated sentences in the translated text; and converting the characters or character strings read from the original sentences and from the translated sentences into target-encoded characters or character strings, each according to the same target encoding.

Further, the notional words include nouns, verbs, adjectives, and adverbs.

Also provided is a device for parallel corpus alignment, comprising: a first unit, configured to convert all original sentences in an original text and all translated sentences in a translated text into characters of the same encoding; a second unit, configured to segment all of the original sentences in the converted original text into words and remove stop words to obtain notional words; a third unit, configured to obtain all translated items of each notional word of the original sentences; a fourth unit, configured to match all translated items of each notional word of each original sentence against all of the translated sentences in the labelled translated text, to obtain the similarity between each notional word of each original sentence and the translated sentences; a fifth unit, configured to match each original sentence against the translated sentences according to the similarities between all notional words of that original sentence and the translated sentences, to obtain the similarity between each original sentence and the translated sentences; and a sixth unit, configured to match and align each original sentence with the translated sentence that has the highest similarity to it.

The beneficial effects of the embodiments of the present invention are as follows:

1. The parallel corpus alignment method of the embodiments of the present invention, based on the similarity of notional words, solves the problem of aligning the original text and the translation in translation post-processing.

2. The parallel corpus alignment method of the embodiments of the present invention requires no manual processing, which saves time and improves efficiency.

3. The parallel corpus alignment method of the embodiments of the present invention converts the original sentences and translated sentences into characters of the same encoding. This avoids the garbled characters caused by differing encodings and, by unifying the character encoding of the original text and the translation, makes the two easier to align.

4. The parallel corpus alignment device of the embodiments of the present invention, based on the similarity of notional words, solves the problem of aligning the original text and the translation in translation post-processing.

5. The parallel corpus alignment device of the embodiments of the present invention automates the process, which saves time and improves efficiency.

6. The parallel corpus alignment device of the embodiments of the present invention converts the original sentences and translated sentences into characters of the same encoding. This avoids the garbled characters caused by differing encodings and, by unifying the character encoding of the original text and the translation, makes the two easier to align.

Brief description of the drawings

Fig. 1 is a flow chart of the parallel corpus alignment method according to an embodiment of the present invention;

Fig. 2 is a diagram of the parallel corpus alignment device according to an embodiment of the present invention.

Detailed description of the invention

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

An embodiment of the present invention provides a method for parallel corpus alignment. Fig. 1 is a flow chart of this method. The method proceeds as follows:

Step S10: Convert all original sentences in the original text and all translated sentences in the translated text into characters of the same encoding.

Step S10 comprises the following sub-steps:

Step S101: Read the characters or character strings in the original sentences according to the character encoding of the original sentences in the original text, and read the characters or character strings in the translated sentences according to the character encoding of the translated sentences in the translated text.

Step S102: Convert the characters or character strings read from the original sentences and from the translated sentences into target-encoded characters or character strings, each according to the same target encoding.

Specifically, the above steps can be carried out as follows:

(1) Obtain the character encoding of all original sentences in the original text to obtain the encoded character data set of the original sentences, and obtain the character encoding of all translated sentences in the translated text to obtain the encoded character data set of the translated sentences.

(2) Stream-read the characters or character strings in the encoded character data set of the original sentences according to the character encoding of the original sentences, and stream-read the characters or character strings in the encoded character data set of the translated sentences according to the character encoding of the translated sentences.

(3) Convert, according to the same target encoding, the characters or character strings in the encoded character data sets of the original sentences and of the translated sentences, to obtain the target-encoded characters or character strings of the original sentences and of the translated sentences.

(4) Add the target-encoded characters or character strings of the original sentences to the dynamic target character set of the original sentences, and add the target-encoded characters or character strings of the translated sentences to the dynamic target character set of the translated sentences.

(5) Repeat steps (1) to (4) until all characters or character strings in the encoded character data sets of the original sentences and of the translated sentences have been read.

(6) Take out the dynamic target character set of the original sentences and convert it into the target-encoded character data set of the original sentences, and take out the dynamic target character set of the translated sentences and convert it into the target-encoded character data set of the translated sentences.

Through step S10, the original sentences and translated sentences are converted into characters of the same encoding. This avoids the garbled characters caused by differing encodings and, by unifying the character encoding of the original text and the translation, makes the two easier to align.
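The streaming conversion of step S10 might be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `convert_stream`, the choice of UTF-8 as the target encoding, and the sample byte strings are all assumptions, and the source encodings are taken as already known.

```python
import io

def convert_stream(raw: bytes, source_encoding: str,
                   target_encoding: str = "utf-8", chunk_size: int = 4096) -> bytes:
    """Stream-decode raw bytes from source_encoding and re-encode them in target_encoding."""
    # newline="" disables newline translation so the text passes through unchanged.
    reader = io.TextIOWrapper(io.BytesIO(raw), encoding=source_encoding, newline="")
    converted = bytearray()  # plays the role of the "dynamic target character set"
    while True:
        chunk = reader.read(chunk_size)  # reads up to chunk_size characters, not bytes
        if not chunk:
            break  # the encoded character data set has been read completely
        converted.extend(chunk.encode(target_encoding))
    return bytes(converted)

# Hypothetical inputs: an original text stored as GB18030 and a translation stored as Latin-1.
original_raw = "平行语料对齐".encode("gb18030")
translated_raw = "parallel corpus alignment".encode("latin-1")

original_utf8 = convert_stream(original_raw, "gb18030", chunk_size=2)
translated_utf8 = convert_stream(translated_raw, "latin-1")
```

After the call, both byte strings share one encoding, so later steps can compare and number the sentences without encoding mismatches.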

Step S20: Segment all original sentences in the converted original text into words and remove stop words to obtain notional words.

Specifically, the notional words include nouns, verbs, adjectives, and adverbs. Preferably, all notional words are collected into a set of notional words.
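A minimal sketch of the filtering in step S20, assuming the sentence has already been segmented and part-of-speech tagged by some external tool. The tag names, the stop-word list, and the function `notional_words` are hypothetical and chosen only for illustration.

```python
# Parts of speech that count as notional words in this embodiment.
NOTIONAL_TAGS = {"noun", "verb", "adjective", "adverb"}

def notional_words(tagged_tokens, stop_words):
    """Keep tokens that are not stop words and whose tag marks a notional word."""
    result = []
    for word, tag in tagged_tokens:
        if word in stop_words:
            continue  # stop words are removed first
        if tag in NOTIONAL_TAGS:
            result.append(word)
    return result

# Hypothetical segmented and tagged sentence.
tagged = [("the", "determiner"), ("translator", "noun"), ("quickly", "adverb"),
          ("aligned", "verb"), ("the", "determiner"), ("long", "adjective"),
          ("sentences", "noun")]
stop_words = {"the", "a", "of"}

words = notional_words(tagged, stop_words)
```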

Step S30: Obtain all translated items of each notional word of the original sentences.

A notional word often has several meanings, and each distinct meaning is called a translated item (that is, a sense) of that notional word. A translated item is the smallest unit in a synonym dictionary and has a corresponding code in the dictionary. For example, the word "proud" can carry two meanings, "pride" and "arrogance", and these two meanings are the two translated items of the word "proud". All translated items of each notional word therefore need to be obtained. For example, a correspondence table between notional words and translated items, as shown in Table 1, can be built.

Table 1: Correspondence between notional words and translated items
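As an illustration only, a correspondence of the kind Table 1 describes could be held in a simple mapping from each notional word to its translated items. The entry reuses the "proud" example from the text; the dictionary codes are made-up placeholders, not values taken from Table 1 or from any real thesaurus.

```python
# Hypothetical correspondence table: notional word -> list of (translated item, dictionary code).
# The codes stand in for nodes of a tree-structured synonym dictionary and are invented here.
SENSE_TABLE = {
    "proud": [("pride", "Ga01"), ("arrogance", "Ee34")],
}

def translated_items(notional_word, table=SENSE_TABLE):
    """Return all (translated item, code) pairs of a notional word, or an empty list if unknown."""
    return list(table.get(notional_word, []))

items = translated_items("proud")
```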

Step S40: Match all translated items of each notional word of each original sentence against all translated sentences in the labelled translated text, to obtain the similarity between each notional word of each original sentence and the translated sentences.

Step S40 specifically includes the following process:

Step S401:

According to

\mathrm{sim}(nw_{jl}, TR_{i}nw_{r}) = L / (\mathrm{dis}(nw_{jl}, TR_{i}nw_{r}) + L)

obtain the similarity between the l-th translated item nw_{jl} of the j-th notional word of the original sentence OR and the r-th notional word TR_{i}nw_{r} of the i-th translated sentence TR_{i}.

In this embodiment, the similarity between two notional words is a value in the range [0, 1]. If one notional word is itself a translated item of the other, the similarity between the two is 1; if the two notional words cannot be substituted for each other in any context, the similarity between them is 0.

Here, the original sentence OR has m notional words, and the translated text contains n translated sentences in total. The translated sentence TR_{i} has p notional words, where i indexes the translated sentences, i = 1, 2, ..., n. The j-th notional word has k translated items, where j indexes the notional words within one original sentence, j = 1, 2, ..., m; l indexes the translated items of one notional word, l = 1, 2, ..., k; and r indexes the notional words within one translated sentence, r = 1, 2, ..., p. dis(nw_{jl}, TR_{i}nw_{r}) denotes the distance, in the dictionary, between the l-th translated item nw_{jl} of the j-th notional word of the original sentence OR and the r-th notional word TR_{i}nw_{r} of the i-th translated sentence TR_{i}. L denotes an adjustment parameter. The similarity between nw_{jl} and TR_{i}nw_{r} decreases as the distance between the two in the dictionary increases.

Specifically, the dictionary referred to in the present invention is a synonym classification dictionary encoded as a tree structure, for example a synonym ontology resource such as the "Chinese thesaurus" or "WordNet". In this dictionary, each node of the tree has a unique code, and each code corresponds to several translated items. The adjustment parameter L is the number of levels of the tree-structured synonym classification dictionary, that is, the number of levels of the tree. dis(nw_{jl}, TR_{i}nw_{r}) is specifically the distance between the dictionary codes of the l-th translated item nw_{jl} of the j-th notional word of the original sentence OR and the r-th notional word TR_{i}nw_{r} of the i-th translated sentence TR_{i}, that is, the difference between the two codes.
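The formula of step S401 can be sketched in Python as follows. The text states only that the distance is the difference between two dictionary codes, so the concrete distance used here is an assumption: each code is modelled as a tuple of node labels from the root of the tree, and the distance is the number of levels below the deepest common ancestor. The names `code_distance` and `word_similarity` and the sample codes are illustrative.

```python
def code_distance(code_a, code_b):
    """Assumed distance: number of levels below the deepest shared ancestor of two codes."""
    common = 0
    for a, b in zip(code_a, code_b):
        if a != b:
            break
        common += 1
    return max(len(code_a), len(code_b)) - common

def word_similarity(code_a, code_b, levels):
    """sim = L / (dis + L), where L is the number of levels of the dictionary tree."""
    return levels / (code_distance(code_a, code_b) + levels)

L = 3  # hypothetical three-level dictionary
same = word_similarity(("A", "a", "01"), ("A", "a", "01"), L)
near = word_similarity(("A", "a", "01"), ("A", "b", "02"), L)
far = word_similarity(("A", "a", "01"), ("B", "c", "07"), L)
```

With L = 3 and this assumed distance, identical codes give 1.0, codes that share only the top level give 3 / (2 + 3) = 0.6, and codes that diverge at the root give 3 / (3 + 3) = 0.5.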

Step S402:

According to the following formula

\mathrm{sim}(nw_{jl}, TR_{i}) = \max_{r = 1, 2, \ldots, p} \mathrm{sim}(nw_{jl}, TR_{i}nw_{r}) = \max_{r = 1, 2, \ldots, p} (L / (\mathrm{dis}(nw_{jl}, TR_{i}nw_{r}) + L))

obtain the similarity between the l-th translated item nw_{jl} of the j-th notional word of the original sentence OR and the translated sentence TR_{i}.

Step S403:

According to the following formula

\mathrm{sim}(nw_{j}, TR_{i}) = \max_{l = 1, 2, \ldots, k} \mathrm{sim}(nw_{jl}, TR_{i}) = \max_{l = 1, 2, \ldots, k} (\max_{r = 1, 2, \ldots, p} (L / (\mathrm{dis}(nw_{jl}, TR_{i}nw_{r}) + L)))

obtain the similarity between the j-th notional word nw_{j} of the original sentence OR and the translated sentence TR_{i}.
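Steps S402 and S403 take two nested maxima, which might be sketched as below. The distance function is passed in as a parameter because the text does not fix its exact form; the example uses a plain numeric difference between integer codes purely as a stand-in. All names and code values are hypothetical, and both code lists are assumed to be non-empty.

```python
def item_sentence_similarity(item_code, sentence_codes, levels, distance):
    """Step S402: best similarity of one translated item over the p words of a translated sentence."""
    return max(levels / (distance(item_code, code) + levels) for code in sentence_codes)

def word_sentence_similarity(item_codes, sentence_codes, levels, distance):
    """Step S403: best similarity over the k translated items of one notional word."""
    return max(item_sentence_similarity(code, sentence_codes, levels, distance)
               for code in item_codes)

def numeric_difference(a, b):
    """Stand-in distance: absolute difference between two integer codes (an assumption)."""
    return abs(a - b)

# Hypothetical codes: a notional word with two translated items, a translated sentence with three words.
item_codes = [10, 40]
sentence_codes = [12, 45, 90]
similarity = word_sentence_similarity(item_codes, sentence_codes, 5, numeric_difference)
```

Here the first translated item is closest to code 12 (distance 2, similarity 5/7) and the second to code 45 (distance 5, similarity 0.5), so the notional word scores 5/7 against the sentence.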

Step S50: Match each original sentence against the translated sentences according to the similarities between all notional words of that original sentence and the translated sentences, to obtain the similarity between each original sentence and the translated sentences.

According to the following formula

\mathrm{sim}(OR, TR_{i}) = \prod_{j = 1, 2, \ldots, m} \mathrm{sim}(nw_{j}, TR_{i}) = \prod_{j = 1, 2, \ldots, m} (\max_{l = 1, 2, \ldots, k} (\max_{r = 1, 2, \ldots, p} (L / (\mathrm{dis}(nw_{jl}, TR_{i}nw_{r}) + L))))

obtain the similarity between the original sentence OR and the translated sentence TR_{i}.
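The product in step S50 can be illustrated with a few lines of Python. The per-word similarities below are invented values, and `sentence_similarity` is an illustrative name.

```python
import math

def sentence_similarity(word_similarities):
    """Step S50: multiply the similarities of all m notional words of one original sentence.

    math.prod returns 1 for an empty list, so a sentence without notional
    words may need separate handling by the caller.
    """
    return math.prod(word_similarities)

# Hypothetical similarities of three notional words to one translated sentence.
score = sentence_similarity([1.0, 0.5, 0.6])
```

Because the score is a product, a single poorly matched notional word lowers the similarity of the whole sentence pair.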

Step S60: Match and align each original sentence with the translated sentence that has the highest similarity to it.

Specifically, according to

\max_{i = 1, 2, \ldots, n} \mathrm{sim}(OR, TR_{i}) = \max_{i = 1, 2, \ldots, n} \prod_{j = 1, 2, \ldots, m} \mathrm{sim}(nw_{j}, TR_{i})

obtain the translated sentence that has the highest similarity to the original sentence OR.

Then match and align that translated sentence with the original sentence OR.
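Selecting the maximum in step S60 might look like the sketch below. It returns every sentence number that attains the maximum, since more than one candidate can share the highest score. The function name and the similarity values are hypothetical.

```python
def most_similar_translations(sentence_similarities):
    """Return the 1-based numbers of all translated sentences that attain the highest similarity."""
    best = max(sentence_similarities)
    return [number for number, value in enumerate(sentence_similarities, start=1)
            if value == best]

# Hypothetical similarities of one original sentence to four translated sentences.
candidates = most_similar_translations([0.30, 0.72, 0.15, 0.72])
```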

In step S60, the same translated sentence may have an identical, and highest, similarity to several original sentences. In that case, step S60 further includes the following comparison process:

(1) Number the original sentences in the converted original text in order.

For example, the first sentence of the original text is numbered 1, the second is numbered 2, and so on. Preferably, the numbered original sentences are collected into a set of original sentences.

(2) Number the translated sentences in the converted translated text in order.

For example, the first sentence of the translated text is numbered 1, the second is numbered 2, and so on. Preferably, the numbered translated sentences are collected into a set of translated sentences.

(3) Obtain the numbers of the several original sentences in the original text and the number of the translated sentence in the translated text.

Specifically, this process is handled as follows:

1) If one of the several original sentences has the number closest to the number of the translated sentence, match and align that original sentence with the translated sentence.

For example, four original sentences are numbered 1, 2, 3, and 4 in the original text, and the translated sentence is numbered 5 in the translated text. All four original sentences have the highest similarity to this translated sentence. The number of original sentence 4 is closest to the number of the translated sentence, so original sentence 4 is matched and aligned with translated sentence 5.

2) If two of the several original sentences have numbers equally close to the number of the translated sentence, match and align the original sentence with the smaller number with the translated sentence. Then compare the similarities between the original sentence with the larger number and the remaining translated sentences, and match and align the larger-numbered original sentence with whichever remaining translated sentence is most similar to it.

For example, four original sentences are numbered 1, 3, 5, and 7 in the original text, and the translated sentence is numbered 4 in the translated text. All four original sentences have the highest similarity to this translated sentence. The numbers of original sentence 3 and original sentence 5 are both closest to the number of the translated sentence. Because 3 is smaller than 5, original sentence 3 is matched and aligned with translated sentence 4. Among the remaining translated sentences other than translated sentence 4, the one with the highest similarity to original sentence 5 is then matched with original sentence 5.

3) Repeat 1) and 2) until every original sentence has been matched and aligned with a translated sentence.
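One possible reading of this comparison process is sketched below as a greedy loop. It is an interpretation under stated assumptions, not the definitive procedure: the function `align`, the round-based structure, and the similarity matrix are all hypothetical, and a sentence that loses a contested match simply competes again for the remaining translated sentences.

```python
def align(sim):
    """Greedy alignment with number-based tie-breaking.

    sim[o][t] is the similarity between original sentence o + 1 and
    translated sentence t + 1. Returns {original number: translated number}.
    """
    unmatched = set(range(len(sim)))
    free = set(range(len(sim[0]))) if sim else set()
    alignment = {}
    while unmatched and free:
        # Each unmatched original sentence claims its most similar free translated sentence.
        claims = {}
        for o in sorted(unmatched):
            best_t = max(sorted(free), key=lambda t: sim[o][t])
            claims.setdefault(best_t, []).append(o)
        # A contested translated sentence goes to the claimant with the closest number;
        # when two claimants are equally close, the smaller number wins.
        for t, claimants in claims.items():
            winner = min(claimants, key=lambda o: (abs(o - t), o))
            alignment[winner + 1] = t + 1
            unmatched.discard(winner)
            free.discard(t)
    return alignment

# Hypothetical similarities: original sentences 1 and 3 both score highest on translated sentence 2.
sim = [
    [0.2, 0.9, 0.1],
    [0.7, 0.3, 0.2],
    [0.1, 0.9, 0.6],
]
result = align(sim)
```

In this example original sentences 1 and 3 are equally close to translated sentence 2, so sentence 1 takes it, and sentence 3 is then matched with the most similar translated sentence still available.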

The method of the present invention, based on the similarity of notional words, solves the problem of aligning the original text and the translation in translation post-processing. The method can be carried out by machine without manual processing, which saves time and improves efficiency.

An embodiment of the present invention also provides a device for parallel corpus alignment. As shown in Fig. 2, the device comprises:

A first unit 101, configured to convert all original sentences in the original text and all translated sentences in the translated text into characters of the same encoding.

A second unit 102, configured to segment all original sentences in the converted original text into words and remove stop words to obtain notional words.

A third unit 103, configured to obtain all translated items of each notional word of the original sentences.

A fourth unit 104, configured to match all translated items of each notional word of each original sentence against all translated sentences in the labelled translated text, to obtain the similarity between each notional word of each original sentence and the translated sentences.

A fifth unit 105, configured to match each original sentence against the translated sentences according to the similarities between all notional words of that original sentence and the translated sentences, to obtain the similarity between each original sentence and the translated sentences.

A sixth unit 106, configured to match and align each original sentence with the translated sentence that has the highest similarity to it.

The parallel corpus alignment device of the present invention implements the parallel corpus alignment method described above and, based on the similarity of notional words, solves the problem of aligning the original text and the translation in translation post-processing. With this device the method no longer has to be carried out manually; the process is automated, which saves time and improves efficiency.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. the method for a parallel corpora alignment, it is characterised in that including:

All original text statements in original text and all translation statements in translation are converted to the character of identical coded system;

To all described original text statement participle in the described original text after conversion, remove stop words therein, it is thus achieved that notional word;

The all of each notional word obtaining described original text statement translate item;

All described translation statement in all described translations translating item after conversion of each notional word of each described original text statement will be mated, it is thus achieved that each notional word of each described original text statement and the similarity of described translation statement;

All notional words according to each described original text statement and the similarity of described translation statement, mate each described original text statement and described translation statement, it is thus achieved that the similarity of each described original text statement and described translation statement;

By the described translation statement the highest with described original text statement similarity and described original text statement matching and align.

2. the method for parallel corpora alignment as claimed in claim 1, it is characterized in that, all items of translating of described each notional word by each described original text statement mate in all described translation statements, it is thus achieved that the process of each notional word of each described original text statement and the similarity of described translation statement includes:

According to sim (nw_jl,TR_inw_r)=L/ (dis (nw_jl,TR_inw_r)+L) l of jth notional word that obtains original text statement OR translates a nw_jlWith i-th translation statement TR_iThe r notional word TR_inw_rSimilarity;

According to

s i m ({nw}_{j l}, {TR}_{i}) = \underset{r = 1, 2, ..., p}{m a x} (s i m ({nw}_{j l}, {TR}_{i} {nw}_{r}))

Translate a nw for l of the jth notional word obtaining described original text statement OR_jlWith described translation statement TR_iSimilarity;

according to

\mathrm{sim}(nw_{j}, TR_{i}) = \max_{l = 1, 2, \ldots, k} \mathrm{sim}(nw_{jl}, TR_{i})

obtaining the similarity between the j-th content word nw_j of the original text sentence OR and the i-th translated text sentence TR_i;

wherein the original text sentence OR has m content words, the translated text contains n translated text sentences in total, the translated text sentence TR_i has p content words, the j-th content word has k translated items, L denotes an adjustment parameter, dis(nw_{jl}, TR_i nw_r) denotes the distance between the codes, in a dictionary, of the l-th translated item nw_{jl} of the j-th content word of the original text sentence OR and of the r-th content word TR_i nw_r of the i-th translated text sentence TR_i, and i = 1, 2, ..., n; j = 1, 2, ..., m; l = 1, 2, ..., k; r = 1, 2, ..., p.
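By way of example only, the three formulas of this claim can be sketched in Python with the distance function passed in as a parameter. The function names, the toy distance table `TOY_DISTANCES`, and the choice of returning 0.0 when a word has no translated items are assumptions made for this sketch, not part of the claim.

```python
from typing import Callable, Sequence

Distance = Callable[[str, str], float]


def sim_item_word(item: str, word: str, dis: Distance, L: float) -> float:
    """sim(nw_jl, TR_i nw_r) = L / (dis(nw_jl, TR_i nw_r) + L)."""
    return L / (dis(item, word) + L)


def sim_item_sentence(item: str, tr_words: Sequence[str],
                      dis: Distance, L: float) -> float:
    """sim(nw_jl, TR_i): maximum over the p content words of TR_i."""
    # default=0.0 for an empty sentence is an assumption of this sketch.
    return max((sim_item_word(item, w, dis, L) for w in tr_words), default=0.0)


def sim_word_sentence(items: Sequence[str], tr_words: Sequence[str],
                      dis: Distance, L: float) -> float:
    """sim(nw_j, TR_i): maximum over the k translated items of nw_j."""
    return max((sim_item_sentence(it, tr_words, dis, L) for it in items),
               default=0.0)


# Hypothetical code distances between dictionary entries (toy data).
TOY_DISTANCES = {("mat", "rug"): 2, ("mat", "dog"): 8,
                 ("pad", "rug"): 4, ("pad", "dog"): 8}


def toy_dis(a: str, b: str) -> float:
    """Look up a toy distance; identical words are at distance 0."""
    if a == b:
        return 0
    return TOY_DISTANCES.get((a, b), TOY_DISTANCES.get((b, a), 10))


# Translated items "mat" and "pad" against a sentence with "dog" and "rug".
score = sim_word_sentence(["mat", "pad"], ["dog", "rug"], toy_dis, L=5)
```

With L = 5, the best pair is ("mat", "rug") at distance 2, so `score` equals 5 / (2 + 5).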

3. The parallel corpus aligning method according to claim 2, characterized in that the process of matching each original text sentence with the translated text sentences according to the similarities between all content words of that original text sentence and the translated text sentences, to obtain the similarity between each original text sentence and the translated text sentences, comprises:

according to

\mathrm{sim}(OR, TR_{i}) = \prod_{j = 1}^{m} \mathrm{sim}(nw_{j}, TR_{i})

obtaining the similarity between the original text sentence OR and the translated text sentence TR_i.

4. The parallel corpus aligning method according to claim 3, characterized in that the process of matching and aligning each original text sentence with the translated text sentence having the highest similarity to it comprises:

according to

\max_{i = 1, 2, \ldots, n} \mathrm{sim}(OR, TR_{i}) = \max_{i = 1, 2, \ldots, n} \prod_{j = 1}^{m} \mathrm{sim}(nw_{j}, TR_{i})

obtaining the translated text sentence having the highest similarity to the original text sentence OR;

matching the translated text sentence having the highest similarity to the original text sentence OR with the original text sentence OR, and aligning the original text sentence OR with that translated text sentence.
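As an illustrative sketch, the sentence-level similarity (the product of the word-level similarities) and the selection of the best translated text sentence could be written as below. The function names and the example matrix are hypothetical; `sim_matrix[i][j]` is assumed to already hold sim(nw_j, TR_i).

```python
import math
from typing import Sequence


def sentence_similarity(word_sims: Sequence[float]) -> float:
    """sim(OR, TR_i): product over j = 1..m of sim(nw_j, TR_i)."""
    return math.prod(word_sims)


def best_translation(sim_matrix: Sequence[Sequence[float]]) -> int:
    """Return the index i of the translated sentence maximizing sim(OR, TR_i).

    If several sentences share the maximum, the first one is returned;
    that tie-breaking rule is an assumption of this sketch.
    """
    scores = [sentence_similarity(row) for row in sim_matrix]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy values: three translated sentences, three content words in OR.
sim_matrix = [
    [1.0, 0.5, 0.5],    # TR_0 -> 0.25
    [1.0, 1.0, 0.625],  # TR_1 -> 0.625
    [0.5, 0.5, 0.5],    # TR_2 -> 0.125
]
best = best_translation(sim_matrix)
```

Because the score is a product, a single content word with a low similarity pulls the whole sentence score down, which penalizes translated sentences that cover only part of the original.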

5. The parallel corpus aligning method according to claim 1, characterized by further comprising:

numbering the original text sentences in the converted original text in order;

numbering the translated text sentences in the converted translated text in order;

if the same translated text sentence has the highest similarity to a plurality of original text sentences, obtaining the numbers of the plurality of original text sentences in the original text and the number of the translated text sentence in the translated text;

if the number of one original text sentence among the plurality of original text sentences is closest to the number of the translated text sentence, matching and aligning that original text sentence with the translated text sentence;

if the numbers of two original text sentences among the plurality of original text sentences are both closest to the number of the translated text sentence, matching and aligning the original text sentence having the smaller number with the translated text sentence;

comparing the similarities between the original text sentence having the larger number of the two original text sentences and the remaining translated text sentences, and matching and aligning the original text sentence having the larger number with the remaining translated text sentence that has the highest similarity to it;

repeating the above process until every original text sentence has been matched and aligned with a translated text sentence.
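One possible reading of this conflict-resolution procedure is sketched below. It is an interpretation, not the claimed implementation: the function `align` is hypothetical, list indices (0-based) stand in for the sentence numbers, and every original text sentence that loses a conflict, not only the larger-numbered one of a tie, is sent back to choose among the remaining translated text sentences.

```python
def align(sim: list[list[float]]) -> dict[int, int]:
    """Resolve conflicts when several originals prefer the same translation.

    sim[s][t] is the similarity between original sentence s and translated
    sentence t. Returns {original number: translated number}.
    """
    unassigned = set(range(len(sim)))
    available = set(range(len(sim[0]))) if sim else set()
    result: dict[int, int] = {}
    while unassigned and available:
        # Each unmatched original picks its best remaining translation.
        claims: dict[int, list[int]] = {}
        for s in sorted(unassigned):
            best_t = max(sorted(available), key=lambda t: sim[s][t])
            claims.setdefault(best_t, []).append(s)
        # Per translation: the closest number wins, ties go to the smaller.
        for t, sources in claims.items():
            winner = min(sources, key=lambda s: (abs(s - t), s))
            result[winner] = t
            unassigned.discard(winner)
            available.discard(t)
    return result


# All three originals prefer translation 1; original 1 is closest in number.
conflict = [
    [0.2, 0.9, 0.1],
    [0.3, 0.8, 0.4],
    [0.1, 0.7, 0.6],
]
# Originals 0 and 2 both prefer translation 1 and are equally close to it.
tie = [
    [0.2, 0.9, 0.1],
    [0.8, 0.3, 0.4],
    [0.1, 0.7, 0.6],
]
conflict_result = align(conflict)
tie_result = align(tie)
```

In the `tie` example, original 0 keeps translation 1 because it has the smaller number, and original 2 falls back to translation 2, its best remaining candidate.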

6. The parallel corpus aligning method according to claim 2, characterized in that the dictionary is a synonym classification dictionary encoded in a tree structure, and each node of the synonym classification dictionary encoded in the tree structure has a unique code.

7. The parallel corpus aligning method according to claim 6, characterized in that the adjustment parameter L is the number of levels of the synonym classification dictionary encoded in the tree structure.
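Purely for illustration, a tree-coded dictionary and a code distance might look as follows. The claims do not state how the distance between two codes is computed, so this sketch assumes that a code is the path of labels from the root to a node and that the distance is the number of tree edges between the two nodes. `HYPOTHETICAL_CODES` and its three-level layout are invented for the example.

```python
# Assumed representation: a code is the tuple of labels on the path from the
# root to the node, so every node has a unique code. Toy data only.
HYPOTHETICAL_CODES = {
    "cat": ("A", "a", "01"),
    "dog": ("A", "a", "02"),
    "mat": ("B", "c", "07"),
}
LEVELS = 3  # adjustment parameter L = number of levels of the toy tree


def code_distance(a: tuple[str, ...], b: tuple[str, ...]) -> int:
    """Number of edges between two nodes (an assumed distance definition)."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Walk up from each node to the deepest shared ancestor.
    return (len(a) - common) + (len(b) - common)


def similarity(word_a: str, word_b: str) -> float:
    """sim = L / (dis + L), with L set to the number of tree levels."""
    dis = code_distance(HYPOTHETICAL_CODES[word_a], HYPOTHETICAL_CODES[word_b])
    return LEVELS / (dis + LEVELS)


near = similarity("cat", "dog")  # siblings: distance 2
far = similarity("cat", "mat")   # different top-level classes: distance 6
```

Under these assumptions the similarity is 1 for identical codes and decreases as the two codes diverge closer to the root.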

8. The parallel corpus aligning method according to claim 1, characterized in that the process of converting all original text sentences in the original text and all translated text sentences in the translated text into characters of the same encoding comprises:

reading the characters or character strings in the original text sentences according to the character encoding of the original text sentences in the original text, and reading the characters or character strings in the translated text sentences according to the character encoding of the translated text sentences in the translated text;

converting the characters or character strings read from the original text sentences and from the translated text sentences into characters or character strings of a target encoding, using the same target encoding for both.
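A minimal sketch of this two-step conversion, assuming the source encodings are already known and choosing UTF-8 as the common target encoding (the claim only requires that both texts use the same target encoding):

```python
def to_target_encoding(raw: bytes, source_encoding: str,
                       target_encoding: str = "utf-8") -> bytes:
    """Read bytes in their own encoding, then re-encode in the target one."""
    text = raw.decode(source_encoding)   # step 1: read with source encoding
    return text.encode(target_encoding)  # step 2: write with target encoding


# Toy inputs: the original is GB18030, the translation is UTF-16.
original_raw = "平行语料".encode("gb18030")
translation_raw = "parallel corpus".encode("utf-16")

original_utf8 = to_target_encoding(original_raw, "gb18030")
translation_utf8 = to_target_encoding(translation_raw, "utf-16")
```

After conversion both byte strings decode with the same codec, so later string matching compares like with like.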

9. The parallel corpus aligning method according to claim 1, characterized in that the content words include nouns, verbs, adjectives and adverbs.

10. A parallel corpus aligning device, characterized by comprising:

a first unit, configured to convert all original text sentences in an original text and all translated text sentences in a translated text into characters of the same encoding;

a second unit, configured to perform word segmentation on all of the original text sentences in the converted original text and remove stop words therein to obtain content words;

a third unit, configured to obtain all translated items of each content word of each original text sentence;

a fourth unit, configured to match all translated items of each content word of each original text sentence against all of the translated text sentences in the marked translated text, to obtain the similarity between each content word of each original text sentence and the translated text sentences;

a fifth unit, configured to match each original text sentence with the translated text sentences according to the similarities between all content words of that original text sentence and the translated text sentences, to obtain the similarity between each original text sentence and the translated text sentences;

a sixth unit, configured to match and align each original text sentence with the translated text sentence having the highest similarity to it.
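As a non-limiting sketch of how the six units might be composed in software, the class below treats each unit as a pluggable callable. The class name `AlignmentDevice`, its field names, and the toy callables in the usage example are all hypothetical and serve only to show the data flow between the units.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class AlignmentDevice:
    convert: Callable[[bytes], str]                      # first unit
    extract_content_words: Callable[[str], list[str]]    # second unit
    lookup_items: Callable[[str], list[str]]             # third unit
    word_similarity: Callable[[list[str], str], float]   # fourth unit
    sentence_similarity: Callable[[Sequence[float]], float]  # fifth unit
    pick_best: Callable[[Sequence[float]], int]          # sixth unit

    def align(self, original: list[bytes], translation: list[bytes]) -> list[int]:
        """For each original sentence, return the index of its best match."""
        src = [self.convert(s) for s in original]
        tgt = [self.convert(t) for t in translation]
        matches = []
        for sentence in src:
            words = self.extract_content_words(sentence)
            scores = [
                self.sentence_similarity(
                    [self.word_similarity(self.lookup_items(w), t) for w in words]
                )
                for t in tgt
            ]
            matches.append(self.pick_best(scores))
        return matches


# Toy stand-ins for each unit (assumptions for illustration only).
toy_dict = {"cat": ["猫"], "dog": ["狗"]}
device = AlignmentDevice(
    convert=lambda raw: raw.decode("utf-8"),
    extract_content_words=lambda s: [w for w in s.split() if w != "the"],
    lookup_items=lambda w: toy_dict.get(w, []),
    word_similarity=lambda items, t: 1.0 if any(i in t for i in items) else 0.5,
    sentence_similarity=math.prod,
    pick_best=lambda scores: max(range(len(scores)), key=scores.__getitem__),
)
matches = device.align([b"the cat", b"the dog"],
                       ["狗".encode("utf-8"), "猫".encode("utf-8")])
```

Keeping each unit behind a callable lets the segmenter, dictionary, or similarity measure be swapped without touching the alignment loop.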