CN105653516A - Parallel corpus aligning method and device - Google Patents


Info

Publication number
CN105653516A
Authority
CN
China
Prior art keywords
statement
original text
translation
similarity
described original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201511022223.9A
Other languages
Chinese (zh)
Other versions
CN105653516B (en)
Inventor
江潮
张芃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Original Assignee
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority to CN201511022223.9A
Publication of CN105653516A
Application granted
Publication of CN105653516B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/189Automatic justification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a parallel corpus aligning method. The method comprises the following steps: converting all the original text sentences in an original text and all the translated text sentences in a translated text into characters with the same encoding; carrying out word segmentation on the converted original text sentences and removing stop words so as to obtain content words; obtaining all the translated items of each content word of each original text sentence; matching all the translated items of each content word of each original text sentence against the converted translated text sentences so as to obtain the similarity between each content word of each original text sentence and the translated text sentences; matching each original text sentence with the translated text sentences according to the similarities of all its content words so as to obtain the similarity between each original text sentence and the translated text sentences; and matching and aligning each original text sentence with the translated text sentence of highest similarity. The invention furthermore discloses a parallel corpus aligning device. The method and device solve the problem of aligning original texts with translated texts.

Description

Method and device for parallel corpus alignment
Technical field
The present invention relates to the field of translation technology, and in particular to a method and a device for aligning parallel corpora.
Background
Parallel corpora play a fundamental role in many fields, including machine translation, computer-aided translation, word sense disambiguation, and dictionary compilation. Aligning a parallel corpus means pairing the original text with its translation at a chosen segmentation granularity to form standardized language pairs. The alignment units, from coarse to fine, are the chapter, the paragraph, the sentence, and the word. The finer the granularity, the richer the linguistic information the corpus provides and the greater its practical value.
In general, when a corpus is aligned by chapter or paragraph, the original text and the translation can simply be aligned in order. Alignment at the sentence level or finer within a paragraph cannot be handled so simply. Because of differences in source-language style, target-language style, and the translator's writing style, and because of content adjustments, aligning the original sentences and translated sentences of a paragraph purely in order often produces many mismatches. Alignment at sentence granularity or finer therefore usually requires manual processing, which is time-consuming, laborious, and inefficient.
Summary of the invention
An object of the embodiments of the present invention is to overcome the above deficiencies of the prior art by providing a parallel corpus alignment method that aligns the original text with the translation on the basis of content-word similarity.
Another object of the embodiments of the present invention is to overcome the above deficiencies of the prior art by providing a parallel corpus alignment device that aligns the original text with the translation on the basis of content-word similarity.
To achieve these objects, the technical solutions of the embodiments of the present invention are as follows:
A parallel corpus alignment method, including: converting all original sentences in an original text and all translated sentences in a translation into characters with the same encoding; segmenting all the original sentences in the converted original text into words and removing the stop words to obtain content words; obtaining all the translated items of each content word of each original sentence; matching all the translated items of each content word of each original sentence against all the translated sentences in the converted translation to obtain the similarity between each content word of each original sentence and each translated sentence; matching each original sentence with the translated sentences according to the similarities between all its content words and the translated sentences, to obtain the similarity between each original sentence and each translated sentence; and matching and aligning each original sentence with the translated sentence that has the highest similarity to it.
Further, the process of matching all the translated items of each content word of each original sentence against all the translated sentences, to obtain the similarity between each content word of each original sentence and each translated sentence, includes: obtaining, according to $\mathrm{sim}(nw_{jl}, TR_i nw_r) = L / (\mathrm{dis}(nw_{jl}, TR_i nw_r) + L)$, the similarity between the $l$-th translated item $nw_{jl}$ of the $j$-th content word of the original sentence $OR$ and the $r$-th content word $TR_i nw_r$ of the $i$-th translated sentence $TR_i$; obtaining, according to $\mathrm{sim}(nw_{jl}, TR_i) = \max_{r=1,2,\dots,p} \mathrm{sim}(nw_{jl}, TR_i nw_r)$, the similarity between the $l$-th translated item $nw_{jl}$ of the $j$-th content word of the original sentence $OR$ and the translated sentence $TR_i$; and obtaining, according to $\mathrm{sim}(nw_j, TR_i) = \max_{l=1,2,\dots,k} \mathrm{sim}(nw_{jl}, TR_i)$, the similarity between the $j$-th content word $nw_j$ of the original sentence $OR$ and the $i$-th translated sentence $TR_i$. Here the original sentence $OR$ has $m$ content words, the translation contains $n$ translated sentences, the translated sentence $TR_i$ has $p$ content words, the $j$-th content word has $k$ translated items, $L$ is an adjusting parameter, $\mathrm{dis}(nw_{jl}, TR_i nw_r)$ is the distance between the dictionary codes of the $l$-th translated item $nw_{jl}$ of the $j$-th content word of the original sentence $OR$ and the $r$-th content word $TR_i nw_r$ of the $i$-th translated sentence $TR_i$, and $i = 1, 2, \dots, n$, $j = 1, 2, \dots, m$, $l = 1, 2, \dots, k$, $r = 1, 2, \dots, p$.
Further, the process of matching each original sentence with the translated sentences according to the similarities between all its content words and the translated sentences, to obtain the similarity between each original sentence and each translated sentence, includes: obtaining the similarity between the original sentence $OR$ and the translated sentence $TR_i$ according to $\mathrm{sim}(OR, TR_i) = \prod_{j=1,2,\dots,m} \mathrm{sim}(nw_j, TR_i)$.
Further, the process of matching and aligning each original sentence with the translated sentence that has the highest similarity to it includes: obtaining the translated sentence with the highest similarity to the original sentence $OR$ according to $\max_{i=1,2,\dots,n} \mathrm{sim}(OR, TR_i) = \max_{i=1,2,\dots,n} \prod_{j=1,2,\dots,m} \mathrm{sim}(nw_j, TR_i)$; and matching that translated sentence with the original sentence $OR$ and aligning the two.
Further, the method also includes: numbering the original sentences in the converted original text in order; numbering the translated sentences in the converted translation in order; if the same translated sentence has the highest similarity with several original sentences, obtaining the numbers of those original sentences in the original text and the number of the translated sentence in the translation; if one of those original sentences has the number closest to the number of the translated sentence, matching and aligning that original sentence with the translated sentence; if two of those original sentences have numbers equally close to the number of the translated sentence, matching and aligning the original sentence with the smaller number with the translated sentence, then comparing the similarities between the original sentence with the larger number and the remaining translated sentences, and matching and aligning that original sentence with the remaining translated sentence of highest similarity; and repeating the above process until every original sentence has been matched and aligned with a translated sentence.
Further, the dictionary is a synonym classification dictionary encoded as a tree structure, in which each node has a unique code.
Further, the adjusting parameter $L$ is the number of layers of the tree-structured synonym classification dictionary.
Further, the process of converting all original sentences in the original text and all translated sentences in the translation into characters with the same encoding includes: reading the characters or character strings of the original sentences according to the character encoding of the original sentences, and reading the characters or character strings of the translated sentences according to the character encoding of the translated sentences; and converting the characters or character strings read from the original sentences and from the translated sentences into target-encoded characters or character strings using the same target encoding.
Further, the content words include nouns, verbs, adjectives, and adverbs.
In addition, a parallel corpus alignment device, including: a first unit for converting all original sentences in an original text and all translated sentences in a translation into characters with the same encoding; a second unit for segmenting all the original sentences in the converted original text into words and removing the stop words to obtain content words; a third unit for obtaining all the translated items of each content word of each original sentence; a fourth unit for matching all the translated items of each content word of each original sentence against all the translated sentences in the converted translation, to obtain the similarity between each content word of each original sentence and each translated sentence; a fifth unit for matching each original sentence with the translated sentences according to the similarities between all its content words and the translated sentences, to obtain the similarity between each original sentence and each translated sentence; and a sixth unit for matching and aligning each original sentence with the translated sentence that has the highest similarity to it.
The beneficial effects of the embodiments of the present invention are as follows:
1. The parallel corpus alignment method of the embodiments, based on content-word similarity, solves the problem of aligning original text and translation in post-translation processing.
2. The parallel corpus alignment method of the embodiments requires no manual processing, which saves time and improves efficiency.
3. The parallel corpus alignment method of the embodiments converts the original sentences and translated sentences into characters with the same encoding. This avoids garbled text caused by differing encodings and, by unifying the character encoding of the original text and the translation, makes the two easier to align.
4. The parallel corpus alignment device of the embodiments, based on content-word similarity, solves the problem of aligning original text and translation in post-translation processing.
5. The parallel corpus alignment device of the embodiments automates the process, which saves time and improves efficiency.
6. The parallel corpus alignment device of the embodiments converts the original sentences and translated sentences into characters with the same encoding. This avoids garbled text caused by differing encodings and, by unifying the character encoding of the original text and the translation, makes the two easier to align.
Brief description of the drawings
Fig. 1 is a flow chart of the parallel corpus alignment method of an embodiment of the present invention;
Fig. 2 is a diagram of the parallel corpus alignment device of an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
An embodiment of the present invention provides a parallel corpus alignment method. Fig. 1 shows a flow chart of the method. The detailed process of the method is as follows:
Step S10: Convert all original sentences in the original text and all translated sentences in the translation into characters with the same encoding.
Step S10 includes the following steps:
Step S101: Read the characters or character strings of the original sentences according to the character encoding of the original sentences, and read the characters or character strings of the translated sentences according to the character encoding of the translated sentences.
Step S102: Convert the characters or character strings read from the original sentences and from the translated sentences into target-encoded characters or character strings using the same target encoding.
Specifically, the above steps may be implemented as follows:
(1) Determine the character encoding of all original sentences in the original text to obtain the encoded character data set of the original sentences, and determine the character encoding of all translated sentences in the translation to obtain the encoded character data set of the translated sentences.
(2) Stream-read the characters or character strings in the encoded character data set of the original sentences according to the encoding of the original sentences, and stream-read the characters or character strings in the encoded character data set of the translated sentences according to the encoding of the translated sentences.
(3) Using the same target encoding, convert the characters or character strings read from the two encoded character data sets into target-encoded characters or character strings of the original sentences and of the translated sentences.
(4) Add the target-encoded characters or character strings of the original sentences to a dynamic target character set for the original sentences, and add the target-encoded characters or character strings of the translated sentences to a dynamic target character set for the translated sentences.
(5) Repeat steps (1) to (4) until all characters or character strings in the encoded character data sets of the original sentences and of the translated sentences have been read.
(6) Take out the dynamic target character set of the original sentences and convert it into the target-encoded character data set of the original sentences, and take out the dynamic target character set of the translated sentences and convert it into the target-encoded character data set of the translated sentences.
Through step S10, the original sentences and translated sentences are converted into characters with the same encoding. This avoids garbled text caused by differing encodings and, by unifying the character encoding of the original text and the translation, makes the two easier to align.
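The streaming conversion of steps (1) to (6) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `convert_stream`, the choice of UTF-8 as the target encoding, and the use of Python's incremental decoders are not taken from the patent, which does not name a target encoding or an implementation.

```python
import codecs
from typing import Iterable


def convert_stream(chunks: Iterable[bytes], source_encoding: str,
                   target_encoding: str = "utf-8") -> bytes:
    """Stream-decode byte chunks from source_encoding and re-encode them.

    Hypothetical helper: the name and the UTF-8 default are assumptions.
    """
    # An incremental decoder buffers multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder(source_encoding)()
    target_chars = []  # plays the role of the "dynamic target character set"
    for chunk in chunks:
        target_chars.append(decoder.decode(chunk))
    # Flush the decoder; raises if the input ended mid-character.
    target_chars.append(decoder.decode(b"", final=True))
    return "".join(target_chars).encode(target_encoding)
```

Calling the same function once for the original text and once for the translation, each with its own source encoding, yields two byte strings in one shared encoding.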
Step S20: Segment all the original sentences in the converted original text into words and remove the stop words to obtain content words.
Specifically, the content words include nouns, verbs, adjectives, and adverbs. Preferably, all the content words are collected into a content-word set.
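A minimal sketch of step S20 is given below. It assumes whitespace-delimited text and a tiny hand-written stop list, both of which are illustrative assumptions; the patent names neither a segmenter nor a stop-word list, and a real system for Chinese would need a proper word segmenter and a part-of-speech filter to keep only nouns, verbs, adjectives, and adverbs.

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}


def content_words(sentence, stop_words=STOP_WORDS):
    """Split a sentence on whitespace, strip punctuation, drop stop words.

    This does not perform part-of-speech filtering; it only removes
    the words listed in stop_words.
    """
    tokens = [t.strip(".,;:!?\"'").lower() for t in sentence.split()]
    return [t for t in tokens if t and t not in stop_words]
```

For example, `content_words("The cat is proud of the garden.")` keeps `cat`, `proud`, and `garden`.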
Step S30: Obtain all the translated items of each content word of each original sentence.
A content word often has several meanings, and each distinct meaning is called a translated item (a sense) of that content word. The translated item is the smallest unit in a synonym dictionary and has a corresponding code in the dictionary. For example, the word "proud" can have two meanings, "pride" and "arrogance", and these two meanings are the two translated items of the word "proud". It is therefore necessary to obtain all the translated items of each content word. For example, a correspondence table between content words and translated items, as shown in Table 1, can be built.
Table 1: Correspondence between content words and translated items
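One way to hold such a correspondence table in memory is a mapping from each content word to its translated items and their dictionary codes. The sketch below is illustrative only: the table name `ITEM_TABLE`, the function `translated_items`, and the three-level codes are hypothetical and do not come from the patent or from any real thesaurus.

```python
# Hypothetical table: content word -> list of (translated item, dictionary code).
# The codes are made-up three-level paths in a tree-structured dictionary.
ITEM_TABLE = {
    "proud": [
        ("pride", ("D", "a", "01")),
        ("arrogance", ("D", "b", "07")),
    ],
}


def translated_items(word, table=ITEM_TABLE):
    """Return all (item, code) pairs for a content word, or [] if unknown."""
    return table.get(word, [])
```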
Step S40: Match all the translated items of each content word of each original sentence against all the translated sentences in the converted translation, to obtain the similarity between each content word of each original sentence and each translated sentence.
Step S40 specifically includes the following process:
Step S401:
According to $\mathrm{sim}(nw_{jl}, TR_i nw_r) = L / (\mathrm{dis}(nw_{jl}, TR_i nw_r) + L)$, obtain the similarity between the $l$-th translated item $nw_{jl}$ of the $j$-th content word of the original sentence $OR$ and the $r$-th content word $TR_i nw_r$ of the $i$-th translated sentence $TR_i$.
In this embodiment, the similarity between two content words is a value in the range [0, 1]. If one content word is itself a translated item of the other, the similarity between the two is 1; if the two content words cannot replace each other in any context, the similarity between them is 0.
Here the original sentence $OR$ has $m$ content words, the translation contains $n$ translated sentences, and the translated sentence $TR_i$ has $p$ content words. $i$ indexes the translated sentences, $i = 1, 2, \dots, n$. The $j$-th content word has $k$ translated items. $j$ indexes the content words of an original sentence, $j = 1, 2, \dots, m$. $l$ indexes the translated items of a content word, $l = 1, 2, \dots, k$. $r$ indexes the content words of a translated sentence, $r = 1, 2, \dots, p$. $\mathrm{dis}(nw_{jl}, TR_i nw_r)$ is the distance in the dictionary between the $l$-th translated item $nw_{jl}$ of the $j$-th content word of the original sentence $OR$ and the $r$-th content word $TR_i nw_r$ of the $i$-th translated sentence $TR_i$. $L$ is an adjusting parameter. The similarity between $nw_{jl}$ and $TR_i nw_r$ is inversely related to their distance in the dictionary. Specifically, the dictionary referred to here is a synonym classification dictionary encoded as a tree structure, for example a synonym ontology tool such as the "Chinese thesaurus" or "WordNet". In this dictionary, each node of the tree has a unique code, and each code corresponds to several translated items. The adjusting parameter $L$ is the number of layers of the tree-structured synonym classification dictionary, that is, the depth of the tree. $\mathrm{dis}(nw_{jl}, TR_i nw_r)$ is specifically the distance between the dictionary codes of $nw_{jl}$ and $TR_i nw_r$, that is, the difference between the two codes.
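The formula of step S401 can be sketched as below. The patent says only that the distance is "the difference between the two codes" and does not say how that difference is computed, so `code_distance` here is an assumption: codes are modelled as tuples of per-level labels, and the distance is the number of levels below the deepest shared ancestor. The function names are hypothetical.

```python
def code_distance(code_a, code_b):
    """Assumed distance: levels remaining below the deepest shared ancestor."""
    shared = 0
    for label_a, label_b in zip(code_a, code_b):
        if label_a != label_b:
            break
        shared += 1
    return max(len(code_a), len(code_b)) - shared


def item_word_similarity(item_code, word_code, levels):
    """sim = L / (dis + L), where levels is the adjusting parameter L."""
    return levels / (code_distance(item_code, word_code) + levels)
```

With identical codes the distance is 0 and the similarity is 1. Under this particular distance the similarity never falls below 0.5, so a different distance definition would be needed to reach values near 0.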
Step S402:
According to the following formula

$$\mathrm{sim}(nw_{jl}, TR_i) = \max_{r=1,2,\dots,p} \mathrm{sim}(nw_{jl}, TR_i nw_r) = \max_{r=1,2,\dots,p} \frac{L}{\mathrm{dis}(nw_{jl}, TR_i nw_r) + L}$$

obtain the similarity between the $l$-th translated item $nw_{jl}$ of the $j$-th content word of the original sentence $OR$ and the translated sentence $TR_i$.
Step S403:
According to the following formula

$$\mathrm{sim}(nw_j, TR_i) = \max_{l=1,2,\dots,k} \mathrm{sim}(nw_{jl}, TR_i) = \max_{l=1,2,\dots,k} \left( \max_{r=1,2,\dots,p} \frac{L}{\mathrm{dis}(nw_{jl}, TR_i nw_r) + L} \right)$$

obtain the similarity between the $j$-th content word $nw_j$ of the original sentence $OR$ and the translated sentence $TR_i$.
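Steps S402 and S403 amount to two nested maxima over a table of code distances. The sketch below assumes the distances have already been computed and are passed in as a nested list, where `distances[l][r]` holds the distance between the $l$-th translated item and the $r$-th content word of the translated sentence; the function name and this input layout are illustrative assumptions.

```python
def content_word_similarity(distances, levels):
    """Similarity between one content word and one translated sentence.

    distances[l][r] is dis(nw_jl, TR_i nw_r); levels is the parameter L.
    The inner max is step S402, the outer max is step S403.
    """
    return max(
        max(levels / (d + levels) for d in row)  # best match for one item
        for row in distances                     # best item of the word
    )
```

Because $L / (\mathrm{dis} + L)$ decreases as the distance grows, the result is determined by the smallest distance in the table.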
Step S50: Match each original sentence with the translated sentences according to the similarities between all its content words and the translated sentences, to obtain the similarity between each original sentence and each translated sentence.
According to the following formula

$$\mathrm{sim}(OR, TR_i) = \prod_{j=1,2,\dots,m} \mathrm{sim}(nw_j, TR_i) = \prod_{j=1,2,\dots,m} \left( \max_{l=1,2,\dots,k} \left( \max_{r=1,2,\dots,p} \frac{L}{\mathrm{dis}(nw_{jl}, TR_i nw_r) + L} \right) \right)$$

obtain the similarity between the original sentence $OR$ and the translated sentence $TR_i$.
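The product in step S50 can be written directly with `math.prod` (available from Python 3.8). The sketch assumes the per-word similarities $\mathrm{sim}(nw_j, TR_i)$ have already been computed; the function name is hypothetical.

```python
import math


def sentence_similarity(word_similarities):
    """sim(OR, TR_i): product of the content-word similarities sim(nw_j, TR_i)."""
    return math.prod(word_similarities)
```

Since every factor lies in [0, 1], a single poorly matched content word lowers the similarity of the whole sentence, and an original sentence with more content words tends to score lower than a shorter one.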
Step S60: Match and align each original sentence with the translated sentence that has the highest similarity to it.
Specifically, according to

$$\max_{i=1,2,\dots,n} \mathrm{sim}(OR, TR_i) = \max_{i=1,2,\dots,n} \prod_{j=1,2,\dots,m} \mathrm{sim}(nw_j, TR_i)$$

obtain the translated sentence with the highest similarity to the original sentence $OR$.
Match the translated sentence with the highest similarity to the original sentence $OR$ with $OR$ and align the two.
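Selecting the best translated sentence for one original sentence is an argmax over the sentence similarities. In the sketch below, `similarities[i - 1]` is assumed to hold $\mathrm{sim}(OR, TR_i)$, and the function name and the 1-based return value are illustrative choices.

```python
def best_translation(similarities):
    """Return (i, score): the 1-based index of the most similar translated
    sentence and its similarity. On equal scores the lowest index wins."""
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    return best + 1, similarities[best]
```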
In step S60, the same translated sentence may have the highest similarity with several original sentences. Step S60 therefore also includes the following comparison process:
(1) Number the original sentences in the converted original text in order.
For example, the first sentence of the original text is numbered 1, the second is numbered 2, and so on. Preferably, the numbered original sentences are collected into an original-sentence set.
(2) Number the translated sentences in the converted translation in order.
For example, the first sentence of the translation is numbered 1, the second is numbered 2, and so on. Preferably, the numbered translated sentences are collected into a translated-sentence set.
(3) Obtain the numbers of the several original sentences in the original text and the number of the translated sentence in the translation.
Specifically, this process is handled in one of the following two ways:
1) If one of the original sentences has the number closest to the number of the translated sentence, match and align that original sentence with the translated sentence.
For example, four original sentences are numbered 1, 2, 3, and 4 in the original text, and the translated sentence is numbered 5 in the translation. All four original sentences have this translated sentence as their most similar one. The original sentence numbered 4 is closest in number to the translated sentence, so the original sentence numbered 4 is matched and aligned with the translated sentence numbered 5.
2) If two of the original sentences have numbers equally close to the number of the translated sentence, match and align the original sentence with the smaller number with the translated sentence. Then compare the similarities between the original sentence with the larger number and the remaining translated sentences, and match and align it with the remaining translated sentence of highest similarity.
For example, four original sentences are numbered 1, 3, 5, and 7 in the original text, and the translated sentence is numbered 4 in the translation. All four original sentences have this translated sentence as their most similar one. The original sentences numbered 3 and 5 are equally close in number to the translated sentence. Because 3 is smaller than 5, the original sentence numbered 3 is matched and aligned with the translated sentence numbered 4. Among the remaining translated sentences other than the one numbered 4, the translated sentence with the highest similarity to the original sentence numbered 5 is matched with the original sentence numbered 5.
3) Repeat processes 1) and 2) until every original sentence has been matched and aligned with a translated sentence.
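The selection rule in 1) and 2) can be sketched as a single key function: prefer the original sentence whose number is closest to the translated sentence's number, and on equal distance prefer the smaller number. The sketch covers only this choice, not the follow-up matching of the remaining sentences, and the function name is hypothetical.

```python
def pick_original(original_numbers, translation_number):
    """Choose which competing original sentence keeps the translated sentence.

    Sort key: (distance between numbers, original number), so the closest
    number wins and a tie goes to the smaller number.
    """
    return min(original_numbers,
               key=lambda n: (abs(n - translation_number), n))
```

Applied to the two examples above, `pick_original([1, 2, 3, 4], 5)` selects 4 and `pick_original([1, 3, 5, 7], 4)` selects 3.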
The method of the present invention, based on content-word similarity, solves the problem of aligning original text and translation in post-translation processing. The method can be carried out by a machine without manual processing, which saves time and improves efficiency.
An embodiment of the present invention also provides a parallel corpus alignment device. As shown in Fig. 2, the device includes:
A first unit 101 for converting all original sentences in the original text and all translated sentences in the translation into characters with the same encoding.
A second unit 102 for segmenting all the original sentences in the converted original text into words and removing the stop words to obtain content words.
A third unit 103 for obtaining all the translated items of each content word of each original sentence.
A fourth unit 104 for matching all the translated items of each content word of each original sentence against all the translated sentences in the converted translation, to obtain the similarity between each content word of each original sentence and each translated sentence.
A fifth unit 105 for matching each original sentence with the translated sentences according to the similarities between all its content words and the translated sentences, to obtain the similarity between each original sentence and each translated sentence.
A sixth unit 106 for matching and aligning each original sentence with the translated sentence that has the highest similarity to it.
The parallel corpus alignment device of the present invention implements the parallel corpus alignment method described above and, based on content-word similarity, solves the problem of aligning original text and translation in post-translation processing. The device allows the method to be carried out without manual work, automating the process, saving time, and improving efficiency.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A method for parallel corpus alignment, characterized by comprising:
converting all original-text sentences in an original text and all translation sentences in a translation into characters of the same encoding;
segmenting all the original-text sentences in the converted original text into words and removing the stop words to obtain notional words;
obtaining all translation items of each notional word of the original-text sentences;
matching all translation items of each notional word of each original-text sentence against all the translation sentences in the converted translation, to obtain the similarity between each notional word of each original-text sentence and each translation sentence;
matching each original-text sentence against the translation sentences according to the similarities between all notional words of that original-text sentence and the translation sentences, to obtain the similarity between each original-text sentence and each translation sentence; and
matching and aligning each original-text sentence with the translation sentence having the highest similarity to it.
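The word segmentation, stop-word removal, and translation-item lookup steps of claim 1 might be sketched as follows. This is only an illustration under stated assumptions: whitespace splitting stands in for a real word segmenter, and `STOP_WORDS`, `LEXICON`, and both function names are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Set

# Hypothetical stop-word list and bilingual dictionary, for illustration only.
STOP_WORDS: Set[str] = {"the", "a", "is", "on"}
LEXICON: Dict[str, List[str]] = {"cat": ["chat", "matou"], "sleeps": ["dort"]}


def notional_words(sentence: str, stop_words: Set[str]) -> List[str]:
    """Segment a sentence (here: naive whitespace split) and drop stop words."""
    return [w for w in sentence.lower().split() if w not in stop_words]


def translation_items(
    sentence: str, lexicon: Dict[str, List[str]], stop_words: Set[str]
) -> Dict[str, List[str]]:
    """Map each notional word of the sentence to all of its translation items.

    Words missing from the lexicon get an empty list (an assumption).
    """
    return {w: lexicon.get(w, []) for w in notional_words(sentence, stop_words)}


items = translation_items("The cat sleeps", LEXICON, STOP_WORDS)
```

For Chinese original text, the whitespace split would have to be replaced by a proper segmenter; the rest of the flow would stay the same.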
2. the method for parallel corpora alignment as claimed in claim 1, it is characterized in that, all items of translating of described each notional word by each described original text statement mate in all described translation statements, it is thus achieved that the process of each notional word of each described original text statement and the similarity of described translation statement includes:
According to sim (nwjl,TRinwr)=L/ (dis (nwjl,TRinwr)+L) l of jth notional word that obtains original text statement OR translates a nwjlWith i-th translation statement TRiThe r notional word TRinwrSimilarity;
According to s i m ( nw j l , TR i ) = m a x r = 1 , 2 , ... , p ( s i m ( nw j l , TR i nw r ) ) Translate a nw for l of the jth notional word obtaining described original text statement ORjlWith described translation statement TRiSimilarity;
According to s i m ( nw j , TR i ) = m a x l = 1 , 2 , ... , k ( s i m ( nw j l , TR i ) ) Obtain the jth notional word nw of described original text statement ORjWith translation statement TR described in i-thiSimilarity;
Wherein, described original text statement OR has m notional word, total n described translation statement, described translation statement TR in described translationiHaving p notional word, jth notional word has k and translates item, and L represents adjustment parameter, dis (nwjl,TRinwr) represent that the l of the jth notional word of described original text statement OR translates a nwjlWith translation statement TR described in i-thiThe r notional word TRinwrThe distance of the code in dictionary, i=1,2 ..., n, j=1,2 ..., m, l=1,2 ..., k, r=1,2 ..., p.
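The three formulas of claim 2 can be sketched in a few lines. Taking the maximum over $r$ and then over $l$ is the same as taking the maximum over all (translation item, translation word) pairs, which is what the sketch below does. The distance function is passed in as a parameter because its exact definition is not fixed here; `toy_dis`, the function names, and the value 0.0 returned for empty inputs are assumptions made for illustration.

```python
from typing import Callable, Sequence


def pair_similarity(distance: float, L: float) -> float:
    """sim(nw_jl, TR_i nw_r) = L / (dis + L)."""
    return L / (distance + L)


def notional_word_similarity(
    items: Sequence[str],
    target_words: Sequence[str],
    dis: Callable[[str, str], float],
    L: float,
) -> float:
    """sim(nw_j, TR_i): best pair similarity over all translation items (l)
    and all notional words of the translation sentence (r).

    Returns 0.0 when there are no items or no target words (an assumption).
    """
    return max(
        (pair_similarity(dis(item, word), L) for item in items for word in target_words),
        default=0.0,
    )


def toy_dis(a: str, b: str) -> float:
    """Hypothetical stand-in for the dictionary-code distance."""
    return 0.0 if a == b else 4.0


best = notional_word_similarity(["chat", "matou"], ["chat", "dort"], toy_dis, L=5)
```

With this toy distance, an exact match gives similarity 1, and any mismatch gives $5 / (4 + 5)$.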
3. the method for parallel corpora alignment as claimed in claim 2, it is characterized in that, the similarity of described all notional words according to each described original text statement and described translation statement, each described original text statement and described translation statement are mated, it is thus achieved that the process of the similarity of each described original text statement and described translation statement includes:
According toObtain described original text statement OR and described translation statement TRiSimilarity.
4. the method for parallel corpora alignment as claimed in claim 3, it is characterised in that described the described translation statement the highest with described original text statement similarity and described original text statement matching the process alignd are included:
According to max i = 1 , 2 , ... , n ( s i m ( O R , TR i ) ) = max i = 1 , 2 , ... , n ( Π j = 1 , 2 , ... , m s i m ( nw j , TR i ) ) Obtain the described translation statement the highest with the similarity of described original text statement OR;
The described translation statement the highest with the similarity of described original text statement OR and described original text statement OR are mated, and align described original text statement OR and described translation statement.
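The sentence-level similarity of claim 3 and the selection step of claim 4 might look as follows in code. The sketch assumes the word-level similarities $\mathrm{sim}(nw_j, TR_i)$ have already been computed and are supplied as a table `sim_rows`, where row $i$ holds the values for translation sentence $TR_i$; the function names and the 0-based indexing are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sentence_similarity(word_sims: Sequence[float]) -> float:
    """sim(OR, TR_i) as the product of sim(nw_j, TR_i) over all notional words j."""
    return math.prod(word_sims)


def best_translation(sim_rows: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index i, score) of the translation sentence with the highest
    sentence-level similarity to the original-text sentence."""
    scores = [sentence_similarity(row) for row in sim_rows]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Three candidate translation sentences, two notional words in OR.
index, score = best_translation([[0.5, 0.5], [1.0, 0.8], [0.9, 0.2]])
```

Because the score is a product, one poorly matched notional word lowers the whole sentence score sharply. For long sentences, summing logarithms instead of multiplying is a common way to avoid numerical underflow; that variant is not part of the claims.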
5. the method for parallel corpora alignment as claimed in claim 1, it is characterised in that also include:
Described original text statement in described original text after conversion is numbered in order;
Described translation statement in described translation after conversion is numbered in order;
If the similarity of same described translation statement and multiple described original text statement is the highest, then obtain multiple described original text statement described numbering in described original text and the described numbering that described translation statement is in described translation;
If the described numbering in described original text of the described original text statement in multiple described original text statements and the described translation statement described numbering in described translation are closest, then by this described original text statement and described translation statement matching and align;
If the described numbering in described original text of two the described original text statements in multiple described original text statements and the described translation statement described numbering in described translation are closest, then by described original text statement less for described numbering and described translation statement matching and align;
Relatively number the height of bigger described original text statement and the similarity remaining described translation statement described in two described original text statements, by described original text statement matching bigger to described translation statement the highest for the similarity remaining described original text statement bigger with described numbering in described translation statement and described numbering and align;
Repeat said process, until each described original text statement all with each described translation statement matching aliging.
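One possible reading of the conflict-resolution procedure of claim 5 is sketched below. It assumes a precomputed table `sim[s][t]` of sentence-level similarities, uses 0-based positions as the sentence numbers, and generalizes slightly: every original-text sentence that loses a contested translation sentence retries against the remaining translation sentences, not only the larger-numbered one of a tied pair. The function name is hypothetical.

```python
from typing import Dict, List


def align_with_tiebreak(sim: List[List[float]]) -> Dict[int, int]:
    """Map original-text sentence numbers to translation sentence numbers.

    When several original-text sentences prefer the same translation sentence,
    the one whose number is closest to the translation's number wins; on equal
    distance the smaller number wins. Losers retry on the remaining sentences.
    """
    unmatched = list(range(len(sim)))
    free = set(range(len(sim[0]))) if sim else set()
    result: Dict[int, int] = {}
    while unmatched and free:
        # Each unmatched original-text sentence claims its best remaining translation.
        claims: Dict[int, List[int]] = {}
        for s in unmatched:
            best = max(sorted(free), key=lambda t: sim[s][t])
            claims.setdefault(best, []).append(s)
        # Resolve each contested translation by number proximity, then smaller number.
        for t, candidates in claims.items():
            winner = min(candidates, key=lambda s: (abs(s - t), s))
            result[winner] = t
            free.discard(t)
        unmatched = [s for s in unmatched if s not in result]
    return result


# Sentences 0 and 1 both prefer translation 0; sentence 0 is closer and wins.
alignment = align_with_tiebreak([[0.9, 0.1, 0.2], [0.8, 0.3, 0.1], [0.1, 0.2, 0.7]])
```

In the example, sentence 1 loses translation 0 and is then matched to translation 1, the only translation sentence still free.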
6. the method for parallel corpora alignment as claimed in claim 2, it is characterised in that: described dictionary is the synonym classified dictionary by tree structure coding, and described have unique described code by each node of the synonym classified dictionary of tree structure coding.
7. the method for parallel corpora alignment as claimed in claim 6, it is characterised in that: described adjustment parameter L is the number of plies of the described synonym classified dictionary encoded by described tree structure.
8. the method for parallel corpora alignment as claimed in claim 1, it is characterised in that the process of the described character that all original text statements in original text and all translation statements in translation are converted to identical coded system includes:
The coded system of the character according to all described original text statement in described original text reads the character in described original text statement or character string, and the coded system according to the character of all described translation statement in described translation reads the character in described translation statement or character string;
The character in the described original text statement read and described translation statement or character string is converted to target code character or character string respectively according to same target coded system.
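The two steps of claim 8, reading each text in its own encoding and then writing both in one target encoding, can be illustrated as follows. The choice of GBK and UTF-16 as source encodings and UTF-8 as the target is an assumption for the example, as is the function name.

```python
def to_target_encoding(
    raw: bytes, source_encoding: str, target_encoding: str = "utf-8"
) -> bytes:
    """Decode bytes using their source encoding, then re-encode in the target encoding."""
    return raw.decode(source_encoding).encode(target_encoding)


# Original text stored as GBK, translation stored as UTF-16 (illustrative choices).
original_raw = "平行语料".encode("gbk")
translation_raw = "parallel corpus".encode("utf-16")

original_utf8 = to_target_encoding(original_raw, "gbk")
translation_utf8 = to_target_encoding(translation_raw, "utf-16")
```

After conversion both texts can be compared character by character, which the later matching steps rely on. Detecting an unknown source encoding is a separate problem that this sketch does not address.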
9. the method for parallel corpora alignment as claimed in claim 1, it is characterised in that described notional word includes: noun, verb, adjective and adverbial word.
10. A device for parallel corpus alignment, characterized by comprising:
a first unit, configured to convert all original-text sentences in an original text and all translation sentences in a translation into characters of the same encoding;
a second unit, configured to segment all the original-text sentences in the converted original text into words and remove the stop words to obtain notional words;
a third unit, configured to obtain all translation items of each notional word of the original-text sentences;
a fourth unit, configured to match all translation items of each notional word of each original-text sentence against all the translation sentences in the labelled translation, to obtain the similarity between each notional word of each original-text sentence and each translation sentence;
a fifth unit, configured to match each original-text sentence against the translation sentences according to the similarities between all notional words of that original-text sentence and the translation sentences, to obtain the similarity between each original-text sentence and each translation sentence; and
a sixth unit, configured to match and align each original-text sentence with the translation sentence having the highest similarity to it.
CN201511022223.9A 2015-12-30 2015-12-30 The method and apparatus of parallel corpora alignment Active CN105653516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511022223.9A CN105653516B (en) 2015-12-30 2015-12-30 The method and apparatus of parallel corpora alignment

Publications (2)

Publication Number Publication Date
CN105653516A true CN105653516A (en) 2016-06-08
CN105653516B CN105653516B (en) 2018-08-10

Family

ID=56490853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511022223.9A Active CN105653516B (en) 2015-12-30 2015-12-30 The method and apparatus of parallel corpora alignment

Country Status (1)

Country Link
CN (1) CN105653516B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697287A (en) * 2018-12-20 2019-04-30 龙马智芯(珠海横琴)科技有限公司 Sentence-level bilingual alignment method and system
CN112446224A (en) * 2020-12-07 2021-03-05 北京彩云环太平洋科技有限公司 Parallel corpus processing method, device and equipment and computer readable storage medium
CN113705158A (en) * 2021-09-26 2021-11-26 上海一者信息科技有限公司 Method for intelligently restoring original text style in document translation

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040260535A1 (en) * 2003-06-05 2004-12-23 International Business Machines Corporation System and method for automatic natural language translation of embedded text regions in images during information transfer
CN101187924A (en) * 2007-11-28 2008-05-28 北京金山软件有限公司 Method and system for obtaining word pair translation from bilingual sentence
CN101308512A (en) * 2008-06-25 2008-11-19 北京金山软件有限公司 Mutual translation pair extraction method and device based on web page
CN101488126A (en) * 2008-12-31 2009-07-22 深圳市点通数据有限公司 Double-language sentence alignment method and device
KR20120077794A (en) * 2010-12-31 2012-07-10 에스케이플래닛 주식회사 Method, translation apparatus and terminal for providing meaning of word of chinese language sentence in automatic translation system, and record medium storing program for executing the same
CN102567293A (en) * 2010-12-13 2012-07-11 汉王科技股份有限公司 Coded format detection method and coded format detection device for text files
CN103678684A (en) * 2013-12-25 2014-03-26 沈阳美行科技有限公司 Chinese word segmentation method based on navigation information retrieval
CN104360996A (en) * 2014-11-27 2015-02-18 武汉传神信息技术有限公司 Sentence alignment method of bilingual text

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697287A (en) * 2018-12-20 2019-04-30 龙马智芯(珠海横琴)科技有限公司 Sentence-level bilingual alignment method and system
CN109697287B (en) * 2018-12-20 2020-01-21 龙马智芯(珠海横琴)科技有限公司 Sentence-level bilingual alignment method and system
CN112446224A (en) * 2020-12-07 2021-03-05 北京彩云环太平洋科技有限公司 Parallel corpus processing method, device and equipment and computer readable storage medium
CN113705158A (en) * 2021-09-26 2021-11-26 上海一者信息科技有限公司 Method for intelligently restoring original text style in document translation
CN113705158B (en) * 2021-09-26 2024-05-24 上海一者信息科技有限公司 Method for intelligently restoring original text style in document translation

Also Published As

Publication number Publication date
CN105653516B (en) 2018-08-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 430070 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant after: Language network (Wuhan) Information Technology Co., Ltd.

Address before: 430073 Hubei, East Lake, Wuhan New Technology Development Zone, software park, No., E City, building E2, building five, building

Applicant before: Wuhan Transn Information Technology Co., Ltd.

GR01 Patent grant