CN103092829B - Paraphrase resource acquisition method and system - Google Patents
- Publication number: CN103092829B (application CN201110332674.8A)
- Authority: CN (China)
- Prior art keywords: sentence, rule, sys, word alignment, translation
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Machine Translation
Abstract
This application discloses a paraphrase resource acquisition method and system. The method comprises: obtaining in advance a parallel corpus (a0, b0) between a first language A and a second language B, together with a translation system sys_AB from A to B and a translation system sys_BA from B to A; translating a0 with sys_AB to obtain b1, and then translating b1 with sys_BA to obtain a2; translating b0 with sys_BA to obtain a1; taking a0 as the reference, evaluating the translation quality of the corresponding sentences in a2 and a1, and collecting the corresponding sentences for which a2 scores higher than a1 into a sentence-pair set (a2', a1'); and obtaining from b1 the sentence set b1' corresponding to a2' and from b0 the sentence set b0' corresponding to a1', which yields the set of second-language paraphrase sentence pairs (b0', b1'). This scheme helps obtain paraphrase resources of higher accuracy, and the resources obtained are also better suited to the translation system.
Description
Technical field
The application relates to the field of computer application technology, and in particular to a paraphrase resource acquisition method and system.
Background
Machine translation (MT), also called automatic translation, is the process of using a computer to convert text in one natural (source) language into another natural (target) language; it generally refers to the translation of whole sentences or full texts between two natural languages. Statistical machine translation (SMT) is one kind of machine translation and is currently among the better-performing methods for open-domain machine translation. Its basic idea is to perform statistical analysis on a sizeable parallel corpus (also called a bilingual corpus), build a statistical translation model by training, and then translate with that model. Machine translation has gradually moved from early word-based translation to phrase-based translation, and incorporates semantic information to further improve the intelligence and accuracy of translation.
In machine translation research, a widely discussed technique is paraphrasing. Paraphrases are, broadly, different expressions of the same meaning, and are a common phenomenon in human language. Research shows that paraphrasing can improve the performance of a translation system in several ways. For example, an uncommon phrase encountered during translation can be paraphrased into a common synonymous phrase, which improves the coverage of the translation system; paraphrasing can also rewrite the sentence pattern of the text to be translated into one that the translation system handles better, which reduces the processing difficulty for the translation system.
Using paraphrases in machine translation requires sufficient paraphrase resources. These include coarse-grained paraphrase sentences as well as finer-grained paraphrase phrases and paraphrase rules. Paraphrase sentences can be used directly as the training corpus for statistical paraphrase generation, and paraphrase phrases and rules can be further extracted from them. In the prior art, the main way to obtain paraphrase resources is to extract them from specific data that potentially contains paraphrases, for example different news reports on the same event. This approach has two drawbacks. First, the amount of available data is small. Second, extraction relies on techniques such as text clustering and similarity computation to find corresponding texts before candidate paraphrases can be formed. This is complex to implement, and the resulting paraphrase resources are affected by factors such as clustering errors, so they often contain much noise, have low accuracy, and rarely meet the practical needs of a machine translation system.
Summary of the invention
To solve the above technical problem, the embodiments of the application provide a paraphrase resource acquisition method and system for obtaining higher-quality paraphrase resources. The technical scheme is as follows:
A paraphrase resource acquisition method, comprising:
obtaining in advance a parallel corpus (a0, b0) between a first language A and a second language B, together with a translation system sys_AB from A to B and a translation system sys_BA from B to A;
translating a0 with sys_AB to obtain b1, and then translating b1 with sys_BA to obtain a2;
translating b0 with sys_BA to obtain a1;
taking a0 as the reference, evaluating the translation quality of the corresponding sentences in a2 and a1, and collecting the corresponding sentences for which a2 scores higher than a1 into a sentence-pair set (a2', a1');
obtaining from b1 the sentence set b1' corresponding to a2' and from b0 the sentence set b0' corresponding to a1', to obtain the set of second-language paraphrase sentence pairs (b0', b1').
In one embodiment of the application, the method further comprises:
extracting paraphrase rules from the paraphrase sentence-pair set (b0', b1').
In one embodiment of the application, extracting paraphrase rules from the paraphrase sentence-pair set (b0', b1') comprises:
establishing word alignments between the sentences of b0' and b1';
filtering the established word alignments;
extracting paraphrase rules from the filtered result.
In one embodiment of the application, establishing word alignments between the sentences of b0' and b1' comprises:
establishing them according to the word alignment of the parallel corpus (a0, b0) and the word alignment between a0 and b1 produced by sys_AB during translation.
In one embodiment of the application, filtering the established word alignments comprises:
filtering them according to preset word alignment rules, wherein the word alignment rules comprise:
if two words in a paraphrase sentence pair have a definite alignment, only that definite alignment is kept and the other crossing alignments of the two words are deleted;
and/or
stop words and punctuation marks may align only with stop words or punctuation marks.
In one embodiment of the application, extracting paraphrase rules from the filtered result comprises:
extracting them according to preset paraphrase rule constraints, wherein the constraints comprise:
each paraphrase rule has a left side and a right side, corresponding to the text form before paraphrasing and the text form after paraphrasing respectively;
the left and right sides of a rule both consist of non-variables and variables, or both contain only non-variables;
at least one non-variable lies between any two variables on the left side of a rule.
In one embodiment of the application, the translation systems sys_AB and sys_BA are obtained by training on the parallel corpus (a0, b0).
A paraphrase resource acquisition system, comprising:
an initial setup unit, configured to obtain in advance a parallel corpus (a0, b0) between a first language A and a second language B, together with a translation system sys_AB from A to B and a translation system sys_BA from B to A;
a first translation unit, configured to translate a0 with sys_AB to obtain b1, and then translate b1 with sys_BA to obtain a2;
a second translation unit, configured to translate b0 with sys_BA to obtain a1;
a translation quality evaluation unit, configured to take a0 as the reference, evaluate the translation quality of the corresponding sentences in a2 and a1, and collect the corresponding sentences for which a2 scores higher than a1 into a sentence-pair set (a2', a1');
a paraphrase sentence-pair acquisition unit, configured to obtain from b1 the sentence set b1' corresponding to a2' and from b0 the sentence set b0' corresponding to a1', to obtain the set of second-language paraphrase sentence pairs (b0', b1').
In one embodiment of the application, the system further comprises:
a paraphrase rule extraction unit, configured to extract paraphrase rules from the paraphrase sentence-pair set (b0', b1').
In one embodiment of the application, the paraphrase rule extraction unit comprises:
a word alignment subunit, configured to establish word alignments between the sentences of b0' and b1';
a word alignment filtering subunit, configured to filter the word alignments established by the word alignment subunit;
a paraphrase rule extraction subunit, configured to extract paraphrase rules from the filtered result of the word alignment filtering subunit.
In one embodiment of the application, the word alignment subunit is specifically configured to:
establish word alignments between the sentences of b0' and b1' according to the word alignment of the parallel corpus (a0, b0) and the word alignment between a0 and b1 produced by sys_AB during translation.
In one embodiment of the application, the word alignment filtering subunit is specifically configured to:
filter the established word alignments according to preset word alignment rules, wherein the word alignment rules comprise:
if two words in a paraphrase sentence pair have a definite alignment, only that definite alignment is kept and the other crossing alignments of the two words are deleted;
and/or
stop words and punctuation marks may align only with stop words or punctuation marks.
In one embodiment of the application, the paraphrase rule extraction subunit is specifically configured to:
extract paraphrase rules from the filtered result according to preset paraphrase rule constraints, wherein the constraints comprise:
each paraphrase rule has a left side and a right side, corresponding to the text form before paraphrasing and the text form after paraphrasing respectively;
the left and right sides of a rule both consist of non-variables and variables, or both contain only non-variables;
the left side of a rule begins and ends with a non-variable;
at least one non-variable lies between any two variables on the left side of a rule.
In one embodiment of the application, the initial setup unit is specifically configured to:
train the translation systems sys_AB and sys_BA on the parallel corpus (a0, b0).
The technical scheme provided by the embodiments of the application obtains paraphrase resources from existing parallel corpora, which greatly increases the amount of usable source data. In addition, compared with data that only "potentially contains" paraphrases, a parallel corpus is itself of higher quality, which helps obtain paraphrase resources of higher accuracy. Moreover, because the scheme derives paraphrase resources from the output of the translation system itself, the resources finally obtained are also better suited to that translation system.
Brief description of the drawings
To illustrate the technical schemes of the embodiments of the application or of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the application; a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flowchart of the paraphrase resource acquisition method of an embodiment of the application;
Fig. 2 is a schematic diagram of the translation process of step S102 of an embodiment of the application;
Fig. 3 is a schematic diagram of the translation process of step S103 of an embodiment of the application;
Fig. 4 is another flowchart of the paraphrase resource acquisition method of an embodiment of the application;
Fig. 5 is a flowchart of the method of extracting paraphrase rules from paraphrase sentence pairs in an embodiment of the application;
Fig. 6 is a schematic diagram of word alignment filtering in an embodiment of the application;
Fig. 7 is a structural diagram of the paraphrase resource acquisition system of an embodiment of the application;
Fig. 8 is another structural diagram of the paraphrase resource acquisition system of an embodiment of the application;
Fig. 9 is a structural diagram of the paraphrase rule extraction unit of an embodiment of the application.
Detailed description
First, a paraphrase resource acquisition method provided by the embodiments of the application is described. The method may comprise the following steps:
obtaining in advance a parallel corpus (a0, b0) between a first language A and a second language B, together with a translation system sys_AB from A to B and a translation system sys_BA from B to A;
translating a0 with sys_AB to obtain b1, and then translating b1 with sys_BA to obtain a2;
translating b0 with sys_BA to obtain a1;
taking a0 as the reference, evaluating the translation quality of the corresponding sentences in a2 and a1, and collecting the corresponding sentences for which a2 scores higher than a1 into a sentence-pair set (a2', a1');
obtaining from b1 the sentence set b1' corresponding to a2' and from b0 the sentence set b0' corresponding to a1', to obtain the set of second-language paraphrase sentence pairs (b0', b1').
The technical scheme provided by the embodiments of the application obtains paraphrase resources from existing parallel corpora, which greatly increases the amount of usable source data. In addition, compared with data that only "potentially contains" paraphrases, a parallel corpus is itself of higher quality, which helps obtain paraphrase resources of higher accuracy. Moreover, because the scheme derives paraphrase resources from the output of the translation system itself, the resources finally obtained are also better suited to that translation system.
To help those skilled in the art better understand the technical scheme of the application, it is described in detail below with reference to the drawings of the embodiments. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments that a person of ordinary skill in the art obtains on the basis of the embodiments of the application fall within the scope of protection of the application.
Fig. 1 shows a flowchart of a paraphrase resource acquisition method of the application. The method may comprise the following steps:
S101, obtaining in advance a parallel corpus (a0, b0) between a first language A and a second language B, together with a translation system sys_AB from A to B and a translation system sys_BA from B to A;
The aim of the scheme is to obtain paraphrase resources for the second language B. First, a parallel corpus between the first language A and the second language B is obtained from an existing corpus. The corpus may be a bilingual corpus, or a multilingual corpus that includes the first and second languages (that is, a corpus containing mutual translations among three or more languages); the embodiments of the application do not limit this.
In addition, the scheme requires two translation systems, one from A to B and one from B to A. Any existing machine translation systems can be used directly, or the two systems can be trained on the parallel corpus between A and B obtained above.
In this embodiment, the first language A is English and the second language B is Chinese by way of example. An English-Chinese bilingual parallel corpus (a0, b0) is obtained first, where a0 and b0 denote the sets of English sentences and Chinese sentences in the parallel corpus respectively, and the sentences of the two sets are mutual translations in one-to-one correspondence.
A Chinese-English translation system sys_BA and an English-Chinese translation system sys_AB are also needed. The systems for the two directions can each be trained on the parallel corpus (a0, b0), or existing translation systems can be used directly.
S102, translating a0 with sys_AB to obtain b1, and then translating b1 with sys_BA to obtain a2;
This step performs two translations in total:
First, sys_AB translates the sentences in a0 in the English-to-Chinese direction, and the result is the set b1 of Chinese sentences; b1 contains the same number of sentences as a0.
Then, sys_BA translates the sentences in b1 in the Chinese-to-English direction, and the result is the set a2 of English sentences; a2 contains the same number of sentences as b1.
This process is illustrated in Fig. 2.
S103, translating b0 with sys_BA to obtain a1;
This step performs one translation:
sys_BA translates the sentences in b0 in the Chinese-to-English direction, and the result is the set a1 of English sentences; a1 contains the same number of sentences as b0.
This process is illustrated in Fig. 3.
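Steps S102 and S103 can be sketched as follows. This is a minimal illustration, not the implementation of the application: the translators are modelled as plain callables, and `round_trip`, `toy_sys_ab` and `toy_sys_ba` with their lookup tables (romanized placeholder strings) are hypothetical stand-ins for real translation systems.

```python
from typing import Callable, List, Sequence, Tuple

Translator = Callable[[str], str]  # assumed interface: one sentence in, one sentence out


def round_trip(
    a0: Sequence[str],
    b0: Sequence[str],
    sys_ab: Translator,
    sys_ba: Translator,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (b1, a2, a1) for a sentence-aligned parallel corpus (a0, b0)."""
    if len(a0) != len(b0):
        raise ValueError("a0 and b0 must be sentence-aligned and of equal length")
    b1 = [sys_ab(s) for s in a0]  # S102, first pass:  A -> B
    a2 = [sys_ba(s) for s in b1]  # S102, second pass: B -> A
    a1 = [sys_ba(s) for s in b0]  # S103:              B -> A on the reference side
    return b1, a2, a1


# Toy lookup-table translators, purely illustrative.
_AB = {"thank you": "xie xie"}
_BA = {"xie xie": "thanks", "duo xie": "many thanks"}


def toy_sys_ab(sentence: str) -> str:
    return _AB.get(sentence, sentence)


def toy_sys_ba(sentence: str) -> str:
    return _BA.get(sentence, sentence)


b1, a2, a1 = round_trip(["thank you"], ["duo xie"], toy_sys_ab, toy_sys_ba)
```

Because each output list is built by mapping over its input, position i in a0, b0, b1, a1 and a2 always refers to the same source sentence, which is the correspondence the later steps rely on.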
S104, taking a0 as the reference, evaluating the translation quality of the corresponding sentences in a2 and a1, and collecting the corresponding sentences for which a2 scores higher than a1 into a sentence-pair set (a2', a1');
Steps S102 and S103 yield two sets of English sentences, a2 and a1, and the sets a0, b0, a1, b1 and a2 all contain the same number of sentences. Each sentence in a0 therefore has one corresponding sentence in a2 and one in a1. Suppose a0 contains a sentence x and b0 contains its translation y. The result of translating x from English to Chinese and back to English is the sentence corresponding to x in a2, and the result of translating y once from Chinese to English is the sentence corresponding to x in a1.
Taking the original sentences in a0 as the reference, the translation quality of the corresponding sentences in a2 and a1 is evaluated separately. Common automatic evaluation metrics such as BLEU, NIST, WER or PER can be used, and the evaluation score represents translation quality. For metrics such as BLEU and NIST, a higher score means the translation is more similar to the reference, and hence the translation is considered better. Metrics such as WER and PER are based on error rate (ER), and a lower error rate means a better translation; if such a metric is used, a lower error rate should therefore correspond to a higher evaluation score.
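The point about error-rate metrics can be made concrete with a small sketch. It computes a word-level WER by edit distance and converts it into a "higher is better" score as 1 - WER. The whitespace tokenization and the function names `word_error_rate` and `wer_score` are assumptions made for illustration; the text does not prescribe a particular conversion.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Standard dynamic-programming edit distance, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def wer_score(reference: str, hypothesis: str) -> float:
    """Map WER to a score where larger means better (can be negative)."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

Any monotonically decreasing mapping of the error rate would serve the same purpose; 1 - WER is simply the most direct one, and it keeps the comparison "a2 scores higher than a1" meaningful for both families of metrics.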
In this step, for each sentence in a0, if its corresponding sentence in a2 obtains a higher evaluation score than its corresponding sentence in a1, the two corresponding sentences in a2 and a1 are kept. After all sentences have been traversed, the kept sentences form the sentence-pair set (a2', a1'). a2' and a1' contain the same number of sentences, and in general this number is smaller than the number of sentences in a0.
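The selection in S104 amounts to a single pass over the aligned lists. The sketch below returns the indices of the kept sentences, so that a2' and a1' can be read off by position. The scoring function is passed in as a parameter; `overlap_score` is a deliberately crude, hypothetical stand-in for BLEU, NIST or a converted error rate.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]  # (reference, hypothesis) -> higher is better


def overlap_score(reference: str, hypothesis: str) -> float:
    """Fraction of distinct reference words that also occur in the hypothesis."""
    ref = set(reference.split())
    hyp = set(hypothesis.split())
    return len(ref & hyp) / len(ref) if ref else 0.0


def select_better_round_trips(
    a0: Sequence[str],
    a1: Sequence[str],
    a2: Sequence[str],
    score: Scorer = overlap_score,
) -> List[int]:
    """Indices i where a2[i] scores strictly higher than a1[i] against a0[i]."""
    if not (len(a0) == len(a1) == len(a2)):
        raise ValueError("a0, a1 and a2 must have the same length")
    return [
        i
        for i, (ref, hyp1, hyp2) in enumerate(zip(a0, a1, a2))
        if score(ref, hyp2) > score(ref, hyp1)
    ]


kept = select_better_round_trips(
    a0=["i like tea", "where is it"],
    a1=["i love tea", "where is it"],
    a2=["i like tea", "where it"],
)
```

The comparison is strict, so sentences whose two translations score equally are dropped, which matches the requirement that a2 score higher than a1.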
S105, obtaining from b1 the sentence set b1' corresponding to a2' and from b0 the sentence set b0' corresponding to a1', to obtain the set of second-language paraphrase sentence pairs (b0', b1').
Because a2' is a subset of a2, and a2 is obtained by translating b1 sentence by sentence, b1 must contain a subset b1' in one-to-one correspondence with a2'. Likewise, b0 must contain a subset b0' in one-to-one correspondence with a1'. Once b1' and b0' have been obtained, they are combined into the sentence-pair set (b0', b1'), which is the required set of paraphrase sentence pairs.
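Since every set keeps the sentence order of the parallel corpus, the subsets b0' and b1' can be read off by index once the kept positions are known. A minimal sketch, in which `build_paraphrase_pairs` and its `kept` argument (the indices of the sentences retained in S104) are illustrative names:

```python
from typing import Iterable, List, Sequence, Tuple


def build_paraphrase_pairs(
    b0: Sequence[str],
    b1: Sequence[str],
    kept: Iterable[int],
) -> List[Tuple[str, str]]:
    """Return the pairs (b0'[k], b1'[k]) for the retained sentence indices."""
    if len(b0) != len(b1):
        raise ValueError("b0 and b1 must have the same length")
    # Each pair reads "less translatable form -> more translatable form".
    return [(b0[i], b1[i]) for i in kept]
```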
In the above scheme, the corresponding sentences in (b0, b1) already form paraphrase sentence pairs. To obtain high-quality pairs, and in particular pairs that can improve the translation quality of the machine translation system, the scheme filters (b0, b1) further. The English texts a2 and a1 are obtained along different translation paths and are both compared with the original English text a0. Because the translation b1 → a2 and the translation b0 → a1 use the same Chinese-English system sys_BA, the difference between the results can be attributed entirely to the difference between the inputs. The sentences in a2' score higher than those in a1' because the input sentences b1' are better suited to translation by sys_BA than b0'. The pairs formed from b0' and b1' are therefore used as the paraphrase sentence-pair set. In later B → A (Chinese-English) translation, the paraphrase relation b0' → b1' can be used to rewrite Chinese forms that the translation system handles poorly into forms it handles well, which improves the coverage of the translation system and reduces its processing difficulty.
Paraphrase rules can be further extracted from the paraphrase sentence pairs obtained. Referring to Fig. 4, in another embodiment of the application, the paraphrase rule extraction method comprises the following steps:
S101, obtaining in advance a parallel corpus (a0, b0) between a first language A and a second language B, together with a translation system sys_AB from A to B and a translation system sys_BA from B to A;
S102, translating a0 with sys_AB to obtain b1, and then translating b1 with sys_BA to obtain a2;
S103, translating b0 with sys_BA to obtain a1;
S104, taking a0 as the reference, evaluating the translation quality of the corresponding sentences in a2 and a1, and collecting the corresponding sentences for which a2 scores higher than a1 into a sentence-pair set (a2', a1');
S105, obtaining from b1 the sentence set b1' corresponding to a2' and from b0 the sentence set b0' corresponding to a1', to obtain the set of second-language paraphrase sentence pairs (b0', b1').
S106, extracting paraphrase rules from the paraphrase sentence-pair set (b0', b1').
Compared with the previous embodiment, this embodiment adds step S106, which extracts paraphrase rules from the paraphrase sentence-pair set. Referring to Fig. 5, this step may comprise the following sub-steps:
S106a, establishing word alignments between the sentences of b0' and b1';
In this step, the words of b0' and b1' can be aligned directly by automatic alignment. Alternatively, the existing word alignment between a0 and b0 and the word alignment between a0 and b1 can be used to derive the word alignment between b0 and b1; because b0' and b1' are subsets of b0 and b1 respectively, the word alignment between b0' and b1' then follows.
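The second option, deriving the b0–b1 alignment from the two alignments that share a0, is a composition through the common a0 word. The sketch below assumes each alignment for one sentence is a set of index pairs, `(a0_index, b0_index)` and `(a0_index, b1_index)`; this representation and the name `compose_alignments` are illustrative and not taken from the application.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[int, int]


def compose_alignments(a0_b0: Iterable[Link], a0_b1: Iterable[Link]) -> Set[Link]:
    """Link b0 word i to b1 word j whenever both align to the same a0 word."""
    b1_by_a0: Dict[int, Set[int]] = defaultdict(set)
    for a_idx, b1_idx in a0_b1:
        b1_by_a0[a_idx].add(b1_idx)
    return {
        (b0_idx, b1_idx)
        for a_idx, b0_idx in a0_b0
        for b1_idx in b1_by_a0.get(a_idx, ())
    }


# a0 word 1 aligns to b0 word 2 and to b1 word 1, so b0[2] links to b1[1], and so on.
links = compose_alignments(
    a0_b0={(0, 0), (1, 2), (2, 1)},
    a0_b1={(0, 0), (1, 1), (2, 2)},
)
```

A composed alignment inherits the errors of both inputs and can be many-to-many, which is one reason a filtering pass over the result is useful.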
S106b, filtering the established word alignments;
Some of the alignments established automatically in S106a are not suitable for the subsequent extraction of paraphrase rules. The established alignments can therefore be filtered further according to word alignment rules, which may include one or more of the following:
1) if two words in a paraphrase sentence pair have a definite alignment, only that definite alignment is kept and the other crossing alignments of the two words are deleted;
As shown on the left of Fig. 6, the automatic alignment of a paraphrase sentence pair produces crossing alignments between "I like" in one sentence and "I like" in the other. Because both sentences of a paraphrase pair are in the same language, the two occurrences of "I" have a definite alignment, so the other alignments of the two "I" tokens can be deleted. The filtered result is shown on the right of Fig. 6.
2) stop words and punctuation marks may align only with stop words or punctuation marks.
If the automatic alignment contains links between a stop word or punctuation mark and a word that is neither, these links are deleted; only links among stop words and punctuation marks are kept.
Of course, besides these two rules, those skilled in the art may filter the word alignments with other rules; the application does not limit this.
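The two filtering rules can be sketched as a single pass over the alignment links of one sentence pair. Two assumptions are made here that the text does not spell out: a "definite alignment" is taken to be a link between two identical tokens, and the stop list `STOP_OR_PUNCT` is a hypothetical example.

```python
from typing import FrozenSet, Iterable, Sequence, Set, Tuple

Link = Tuple[int, int]

# Hypothetical stop words and punctuation marks, for illustration only.
STOP_OR_PUNCT: FrozenSet[str] = frozenset({"the", "a", "of", ",", ".", "!", "?"})


def filter_alignment(
    src: Sequence[str],
    tgt: Sequence[str],
    links: Iterable[Link],
    stop: FrozenSet[str] = STOP_OR_PUNCT,
) -> Set[Link]:
    links = set(links)
    # Rule 1: identical tokens give a definite link; drop their other links.
    definite = {(i, j) for i, j in links if src[i] == tgt[j]}
    fixed_src = {i for i, _ in definite}
    fixed_tgt = {j for _, j in definite}
    kept: Set[Link] = set()
    for i, j in links:
        if (i, j) not in definite and (i in fixed_src or j in fixed_tgt):
            continue
        # Rule 2: stop words and punctuation align only with each other.
        if (src[i] in stop) != (tgt[j] in stop):
            continue
        kept.add((i, j))
    return kept


filtered = filter_alignment(
    src=["i", "like", "tea", "."],
    tgt=["i", "enjoy", "tea", "!"],
    links={(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (3, 3), (1, 3)},
)
```

In the example, the crossing links (0, 1) and (1, 0) fall to rule 1 because the two "i" tokens are definitely aligned, and the link (1, 3) between "like" and "!" falls to rule 2.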
S106c, extracting paraphrase rules from the filtered result.
In this step, paraphrase rules can be extracted from the filtered result according to preset paraphrase rule constraints, which include the following:
1) each paraphrase rule has a left side and a right side, corresponding to the text form before paraphrasing and the text form after paraphrasing respectively;
2) the left and right sides of a rule both consist of non-variables and variables, or both contain only non-variables; non-variables include ordinary words and punctuation marks.
3) at least one non-variable lies between any two variables on the left side of a rule; that is, variables may not be adjacent in the sentence before paraphrasing, otherwise the rule is deleted.
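The three constraints above can be checked mechanically for a candidate rule. In this sketch each side of a rule is a token list, and a variable is any token of the form `X` or `X` followed by digits; that naming convention and the function names are assumptions for illustration.

```python
import re
from typing import Sequence

_VARIABLE = re.compile(r"X\d*$")  # assumed convention: X, X1, X2, ...


def is_variable(token: str) -> bool:
    return bool(_VARIABLE.match(token))


def satisfies_constraints(left: Sequence[str], right: Sequence[str]) -> bool:
    """Check constraints 1) to 3) for a candidate rule left -> right."""
    # 1) a rule has a non-empty left side and a non-empty right side
    if not left or not right:
        return False

    def has_variable(side: Sequence[str]) -> bool:
        return any(is_variable(t) for t in side)

    def has_non_variable(side: Sequence[str]) -> bool:
        return any(not is_variable(t) for t in side)

    # 2) both sides mix non-variables and variables, or both are variable-free
    if not (has_non_variable(left) and has_non_variable(right)):
        return False
    if has_variable(left) != has_variable(right):
        return False
    # 3) no two variables are adjacent on the left side
    return not any(
        is_variable(x) and is_variable(y) for x, y in zip(left, left[1:])
    )
```

For instance, a left side `X1 X2 you` is rejected because its two variables are adjacent, so the boundary between the spans they would bind is ambiguous.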
With the paraphrase rules obtained by the above method, various problems that the translation system meets during translation can be addressed. A few examples follow:
1) Text before paraphrasing: for all me, usually all go on business.
Paraphrase rule: for all me → at me
Text after paraphrasing: at me, usually all go on business.
Text after translation: In my case, it is usually on business.
This example is a phrase-level paraphrase: "for all me" is paraphrased as "at me". The former is an uncommon phrase for the translation system, whereas the latter is a phrase the system translates well. The paraphrase raises the coverage of the system and thereby the quality of the translation.
2) Text before paraphrasing: have what sports equipment?
Paraphrase rule: have what X? → what X have?
Text after paraphrasing: what sports equipment have?
Text after translation: What kind of sport facilities do you have?
This example uses paraphrasing for structural reordering; the vocabulary does not change during reordering. X denotes a variable in the paraphrase rule. Moving the question word "what" to the front of the sentence and the verb "have" to the end brings the sentence closer to English word order, which ultimately improves the quality of the translation.
3) Text before paraphrasing: I am his telephone number and live to you.
Paraphrase rule: give you X. → to you X.
Text after paraphrasing: I give you his telephone number and address.
Text after translation: I'll give you his phone number and address.
This example uses paraphrasing to change the sentence pattern. The original pattern "... to you" contains a word that, however the sentence is reordered, never finds a position consistent with English word order. The paraphrase rule transforms "give you X" into "to you X", turning a special sentence pattern into a general one and thereby reducing the processing difficulty for the translation system.
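How a rule with a variable is applied can be sketched with a small token-level matcher, using the English gloss of the rule from example 2). The sketch assumes that a rule must match the whole tokenized sentence and that a variable binds one or more tokens; both are simplifying assumptions, and `match` and `apply_rule` are illustrative names.

```python
from typing import Dict, List, Optional, Sequence


def is_variable(token: str) -> bool:
    # Assumed convention: X, X1, X2, ... denote variables.
    return token == "X" or (token.startswith("X") and token[1:].isdigit())


def match(
    pattern: Sequence[str],
    tokens: Sequence[str],
    bindings: Optional[Dict[str, List[str]]] = None,
) -> Optional[Dict[str, List[str]]]:
    """Return variable bindings if the pattern covers all tokens, else None."""
    bindings = dict(bindings or {})
    if not pattern:
        return bindings if not tokens else None
    head, rest = pattern[0], pattern[1:]
    if is_variable(head):
        # Try every non-empty span for the variable, shortest first.
        for k in range(1, len(tokens) + 1):
            result = match(rest, tokens[k:], {**bindings, head: list(tokens[:k])})
            if result is not None:
                return result
        return None
    if tokens and tokens[0] == head:
        return match(rest, tokens[1:], bindings)
    return None


def apply_rule(
    left: Sequence[str], right: Sequence[str], tokens: Sequence[str]
) -> List[str]:
    """Rewrite tokens with the rule left -> right; return a copy if no match."""
    bound = match(left, tokens)
    if bound is None:
        return list(tokens)
    output: List[str] = []
    for token in right:
        output.extend(bound.get(token, [token]))
    return output


rewritten = apply_rule(
    left=["have", "what", "X", "?"],
    right=["what", "X", "have", "?"],
    tokens=["have", "what", "sports", "equipment", "?"],
)
```

Because constraint 3) forbids adjacent variables on the left side, each variable span is delimited by the next non-variable token, which keeps the backtracking in `match` shallow.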
The above embodiment uses a Chinese-English parallel corpus to obtain Chinese paraphrase resources. It will be understood that the scheme provided by the application can equally use a Chinese-English parallel corpus to obtain English paraphrase resources, and can use parallel corpora of other languages to obtain paraphrase resources for those languages.
Corresponding to the method embodiments above, the application also provides a paraphrase resource acquisition system. Referring to Fig. 7, the system comprises:
an initial setup unit 210, configured to obtain in advance a parallel corpus (a0, b0) between a first language A and a second language B, together with a translation system sys_AB from A to B and a translation system sys_BA from B to A;
First, a parallel corpus between the first language A and the second language B is obtained from an existing corpus. The corpus may be a bilingual corpus, or a multilingual corpus that includes the first and second languages (that is, a corpus containing mutual translations among three or more languages); the embodiments of the application do not limit this.
In addition, the scheme requires two translation systems, one from A to B and one from B to A. Any existing machine translation systems can be used directly, or the two systems can be trained on the parallel corpus between A and B obtained above.
In this embodiment, the first language A is English and the second language B is Chinese by way of example. An English-Chinese bilingual parallel corpus (a0, b0) is obtained first, where a0 and b0 denote the sets of English sentences and Chinese sentences in the parallel corpus respectively, and the sentences of the two sets are mutual translations in one-to-one correspondence.
A Chinese-English translation system sys_BA and an English-Chinese translation system sys_AB are also needed. The systems for the two directions can each be trained on the parallel corpus (a0, b0), or existing translation systems can be used directly.
First translation unit 220, for utilizing sys_AB to translate a0, obtains b1; Utilize sys_BA to translate b1 further, obtain a2;
First translation unit 220 needs to carry out twice translation altogether:
First utilize sys_AB to carry out the translation in English-Chinese direction to the sentence in a0, the translation result obtained is the set b1 of Chinese sentence; Wherein, the sentence quantity in b1 and a0 is identical.
Then utilize sys_BA to carry out the translation in English-Chinese direction to the sentence in b1, the translation result obtained is the set a2 of english sentence.Wherein, the sentence quantity in a2 and b1 is identical.
The schematic diagram of said process can be shown in Figure 2.
The second translation unit 230 is configured to translate b0 with sys_BA to obtain a1.
The second translation unit 230 performs a single translation pass:
sys_BA translates the sentences in b0 in the Chinese-to-English direction, and the result is the set a1 of English sentences. a1 contains the same number of sentences as b0.
This process is illustrated in Figure 3.
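The three translation passes described above can be sketched as follows. This is only an illustrative sketch: the function name `build_translation_sets` and the idea of modelling sys_AB and sys_BA as plain callables from a sentence to a sentence are assumptions made here, not part of the application.

```python
from typing import Callable, List, Tuple

# Assumption: a translation system is modelled as a function str -> str.
Translator = Callable[[str], str]


def build_translation_sets(
    a0: List[str],
    b0: List[str],
    sys_ab: Translator,
    sys_ba: Translator,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (b1, a2, a1), each index-aligned with a0 and b0."""
    if len(a0) != len(b0):
        raise ValueError("a0 and b0 must be in one-to-one correspondence")
    b1 = [sys_ab(s) for s in a0]  # A -> B translation of a0
    a2 = [sys_ba(s) for s in b1]  # B -> A translation of b1 (round trip)
    a1 = [sys_ba(s) for s in b0]  # B -> A translation of the reference b0
    return b1, a2, a1
```

Because every list is produced sentence by sentence, index i in a0, b0, b1, a2, and a1 always refers to the same original sentence pair, which is what the later selection step relies on.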
The translation quality evaluation unit 240 is configured to evaluate, with a0 as the reference, the translation quality of the corresponding sentences in a2 and a1, and to form a sentence pair set (a2', a1') from the sentences for which a2 scores higher than a1.
With the original sentences in a0 as the reference, the translation quality evaluation unit 240 evaluates the translation quality of the corresponding sentences in a2 and in a1. Common automatic evaluation metrics such as BLEU, NIST, WER, or PER may be used; the evaluation score represents translation quality. For metrics such as BLEU and NIST, a higher score indicates greater similarity between the translation and the original, and hence a better translation. Metrics such as WER and PER are based on an error rate (ER), where a lower error rate indicates a better translation; if such a metric is used, a lower error rate should therefore be mapped to a correspondingly higher evaluation score.
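As a minimal sketch of how an error-rate metric can be turned into a "higher is better" score, the following computes word error rate (WER) by word-level edit distance and maps it to a score as 1 - WER. The mapping 1 - WER, the whitespace tokenization, and the function names are assumptions chosen for illustration; the application only requires that a lower error rate yield a higher score.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # delete r
                           cur[j - 1] + 1,           # insert h
                           prev[j - 1] + (r != h)))  # match or substitute
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of hypothesis against reference (lower is better)."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def wer_score(reference: str, hypothesis: str) -> float:
    """Assumed mapping: a lower error rate gives a higher evaluation score."""
    return 1.0 - wer(reference, hypothesis)
```

Note that WER can exceed 1 when the hypothesis is much longer than the reference, so this score can be negative; that does not matter when scores are only compared with each other.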
For any sentence in a0, if its corresponding sentence in a2 receives a higher evaluation score than its corresponding sentence in a1, the translation quality evaluation unit 240 retains the corresponding sentences from a2 and a1. After all sentences have been traversed, the sentences for which a2 scores higher than a1 form the sentence pair set (a2', a1'). a2' and a1' contain the same number of sentences, and in general this number is smaller than the number of sentences in a0.
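The selection step can be sketched as a single pass over index-aligned lists. In this sketch, `overlap_score` is a deliberately crude stand-in for a real metric such as BLEU, and both function names are hypothetical; any scoring function where higher means better could be passed in instead.

```python
from typing import Callable, List


def overlap_score(reference: str, hypothesis: str) -> float:
    """Toy score (assumption): fraction of reference words found in hypothesis."""
    ref = reference.split()
    hyp = set(hypothesis.split())
    return sum(w in hyp for w in ref) / len(ref) if ref else 0.0


def select_indices(
    a0: List[str],
    a1: List[str],
    a2: List[str],
    score: Callable[[str, str], float] = overlap_score,
) -> List[int]:
    """Indices i where a2[i] scores strictly higher than a1[i] against a0[i]."""
    if not (len(a0) == len(a1) == len(a2)):
        raise ValueError("a0, a1 and a2 must have the same length")
    return [
        i
        for i, (ref, h1, h2) in enumerate(zip(a0, a1, a2))
        if score(ref, h2) > score(ref, h1)
    ]
```

The returned indices identify a2' and a1' at the same time, which is why the two subsets always have the same size. Ties are dropped here because the text asks for a strictly higher score.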
The paraphrase ("repetition") sentence pair acquiring unit 250 is configured to obtain from b1 the sentence set b1' corresponding to a2', to obtain from b0 the sentence set b0' corresponding to a1', and thereby to obtain the paraphrase sentence pair set (b0', b1') of the second language.
Because a2' is a subset of a2, and a2 was obtained by translating b1 sentence by sentence, b1 must contain a subset b1' in one-to-one correspondence with a2'. Likewise, b0 must contain a subset b0' in one-to-one correspondence with a1'. After b1' and b0' are obtained, they form the sentence pair set (b0', b1'), which is the paraphrase sentence pair set to be acquired.
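Since all sentence sets are index-aligned, mapping the retained sentences back to the second language amounts to reusing the retained indices. The sketch below assumes those indices are already available as a list; the name `paraphrase_pairs` is hypothetical.

```python
from typing import List, Sequence, Tuple


def paraphrase_pairs(
    b0: Sequence[str], b1: Sequence[str], kept: Sequence[int]
) -> List[Tuple[str, str]]:
    """Build (b0', b1') from the indices whose a2 sentence outscored a1."""
    if len(b0) != len(b1):
        raise ValueError("b0 and b1 must have the same length")
    # b0[i] is the human reference, b1[i] the machine translation of a0[i];
    # both express a0[i] in language B, so they are treated as paraphrases.
    return [(b0[i], b1[i]) for i in kept]
```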
Referring to Figure 8, the paraphrase ("repetition") resource acquisition system provided by the present application may further comprise:
A paraphrase rule extraction unit 260, configured to extract paraphrase rules from the paraphrase sentence pair set (b0', b1').
Referring to Figure 9, the paraphrase rule extraction unit 260 may specifically comprise:
A word alignment subunit 261, configured to establish word alignment relations between the sentences in b0' and b1'.
In one embodiment of the present application, the word alignment subunit 261 may be specifically configured to:
Establish the word alignment relations between the sentences in b0' and b1' according to the word alignment relations of the parallel corpus (a0, b0) and the word alignment relations between a0 and b1 that sys_AB establishes during translation.
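One plausible reading of this step is that a0 acts as a pivot: a word in b0 and a word in b1 are linked whenever both are aligned to the same word of a0. The sketch below implements that reading for a single sentence triple. The pivot composition, the representation of alignments as sets of index pairs, and the name `compose_alignments` are all assumptions; the application does not spell out the exact procedure.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# An alignment is a set of (index in a0 sentence, index in B sentence) links.
Alignment = Set[Tuple[int, int]]


def compose_alignments(a0_b0: Alignment, a0_b1: Alignment) -> Alignment:
    """Link b0 word j0 with b1 word j1 if both align to the same a0 word."""
    b1_by_a0: Dict[int, Set[int]] = defaultdict(set)
    for i, j1 in a0_b1:
        b1_by_a0[i].add(j1)
    return {
        (j0, j1)
        for i, j0 in a0_b0
        for j1 in b1_by_a0.get(i, ())
    }
```

Words of a0 that are aligned on only one side contribute no link, so the composed alignment can be sparser than either input.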
A word alignment filtering subunit 262, configured to filter the word alignment relations established by the word alignment subunit 261.
In one embodiment of the present application, the word alignment filtering subunit 262 may be specifically configured to:
Filter the established word alignment relations according to preset word alignment rules, where the rules comprise:
If two words in a paraphrase sentence pair have a definite alignment relation, only that definite alignment relation is retained, and the other crossing alignment relations of the two words are deleted;
Stop words and punctuation marks may be aligned only with stop words or punctuation marks.
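The two filtering rules can be sketched as below. Several things here are assumptions: the text does not define how a "definite" alignment is determined, so the sketch simply takes the definite links as an input set; stop words and punctuation are merged into one `function_tokens` set; and the function name is hypothetical.

```python
from typing import Sequence, Set, Tuple

Link = Tuple[int, int]  # (index in b0' sentence, index in b1' sentence)


def filter_alignment(
    links: Set[Link],
    definite: Set[Link],
    src: Sequence[str],
    tgt: Sequence[str],
    function_tokens: Set[str],
) -> Set[Link]:
    """Apply the two preset rules to one sentence pair's alignment links."""
    definite_src = {i for i, _ in definite}
    definite_tgt = {j for _, j in definite}
    kept: Set[Link] = set()
    for i, j in links:
        # Rule 1: a word with a definite link keeps only that link.
        if (i, j) not in definite and (i in definite_src or j in definite_tgt):
            continue
        # Rule 2: stop words/punctuation align only with stop words/punctuation.
        if (src[i] in function_tokens) != (tgt[j] in function_tokens):
            continue
        kept.add((i, j))
    return kept
```

Rule 2 is applied symmetrically: a link is dropped whenever exactly one of its two words is a stop word or punctuation mark.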
A paraphrase rule extraction subunit 263, configured to extract paraphrase rules from the filtering result of the word alignment filtering subunit 262.
In one embodiment of the present application, the paraphrase rule extraction subunit 263 is specifically configured to:
Extract paraphrase rules from the filtering result according to preset paraphrase rule constraints, where the constraints comprise:
Each paraphrase rule comprises a left-hand side and a right-hand side, corresponding respectively to the text form before paraphrasing and the text form after paraphrasing;
The left-hand and right-hand sides of a rule either both consist of non-variables and variables, or both contain only non-variables;
There is at least one non-variable between the variables on the left-hand side of a rule.
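The three constraints can be checked mechanically once a rule is represented as two token lists. The following sketch assumes variables are written as tokens of the form `[X1]`, `[X2]`, and so on; that notation, the reading that "consist of non-variables and variables" means each side has at least one of each, and the function names are assumptions for illustration only.

```python
import re
from typing import Sequence

# Assumed variable notation: [X1], [X2], ...
_VAR = re.compile(r"^\[X\d+\]$")


def is_variable(token: str) -> bool:
    return bool(_VAR.match(token))


def satisfies_constraints(left: Sequence[str], right: Sequence[str]) -> bool:
    """Check a candidate rule (left -> right) against the three constraints."""
    # Constraint 1: a rule has both a left-hand and a right-hand side.
    if not left or not right:
        return False
    left_has_var = any(is_variable(t) for t in left)
    right_has_var = any(is_variable(t) for t in right)
    left_has_lex = any(not is_variable(t) for t in left)
    right_has_lex = any(not is_variable(t) for t in right)
    # Constraint 2: both sides mix non-variables and variables,
    # or both sides contain only non-variables.
    if not (left_has_lex and right_has_lex):
        return False
    if left_has_var != right_has_var:
        return False
    # Constraint 3: no two variables are adjacent on the left-hand side.
    return not any(
        is_variable(x) and is_variable(y) for x, y in zip(left, left[1:])
    )
```

For example, a rule with left-hand side `[X1] because of [X2]` and right-hand side `[X1] due to [X2]` would pass, whereas a left-hand side beginning `[X1] [X2]` would be rejected by the third constraint.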
For convenience of description, the device above is described in terms of units divided by function. Of course, when implementing the present application, the functions of the units may be realized in one or more pieces of software and/or hardware.
As the above description of the embodiments shows, those skilled in the art can clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. This computer software product may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
The embodiments in this specification are described progressively; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts may be found in the description of the method embodiments. The device and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
The above is only a specific embodiment of the present application. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the present application, and such improvements and modifications shall also fall within the scope of protection of the present application.
Claims (14)
1. A paraphrase ("repetition") resource acquisition method, characterized by comprising:
Obtaining in advance a parallel corpus (a0, b0) between a first language A and a second language B, a translation system sys_AB from A to B, and a translation system sys_BA from B to A;
Translating a0 with sys_AB to obtain b1, and then translating b1 with sys_BA to obtain a2;
Translating b0 with sys_BA to obtain a1;
With a0 as the reference, evaluating the translation quality of the corresponding sentences in a2 and a1, and forming a sentence pair set (a2', a1') from the sentences for which a2 scores higher than a1;
Obtaining from b1 the sentence set b1' corresponding to a2', and obtaining from b0 the sentence set b0' corresponding to a1', thereby obtaining the paraphrase sentence pair set (b0', b1') of the second language.
2. The method according to claim 1, characterized in that the method further comprises:
Extracting paraphrase rules from the paraphrase sentence pair set (b0', b1').
3. The method according to claim 2, characterized in that extracting paraphrase rules from the paraphrase sentence pair set (b0', b1') comprises:
Establishing word alignment relations between the sentences in b0' and b1';
Filtering the established word alignment relations;
Extracting paraphrase rules from the filtering result.
4. The method according to claim 3, characterized in that establishing the word alignment relations between the sentences in b0' and b1' comprises:
Establishing the word alignment relations between the sentences in b0' and b1' according to the word alignment relations of the parallel corpus (a0, b0) and the word alignment relations between a0 and b1 that sys_AB establishes during translation.
5. The method according to claim 3, characterized in that filtering the established word alignment relations comprises:
Filtering the established word alignment relations according to preset word alignment rules, where the word alignment rules comprise:
If two words in a paraphrase sentence pair have a definite alignment relation, only that definite alignment relation is retained, and the other crossing alignment relations of the two words are deleted;
And/or
Stop words and punctuation marks may be aligned only with stop words or punctuation marks.
6. The method according to claim 3, characterized in that extracting paraphrase rules from the filtering result comprises:
Extracting paraphrase rules from the filtering result according to preset paraphrase rule constraints, where the paraphrase rule constraints comprise:
Each paraphrase rule comprises a left-hand side and a right-hand side, corresponding respectively to the text form before paraphrasing and the text form after paraphrasing;
The left-hand and right-hand sides of a rule either both consist of non-variables and variables, or both contain only non-variables;
There is at least one non-variable between the variables on the left-hand side of a rule.
7. The method according to any one of claims 1 to 6, characterized in that the translation systems sys_AB and sys_BA are obtained by training on the parallel corpus (a0, b0).
8. A paraphrase ("repetition") resource acquisition system, characterized by comprising:
An initial setup unit, configured to obtain in advance a parallel corpus (a0, b0) between a first language A and a second language B, a translation system sys_AB from A to B, and a translation system sys_BA from B to A;
A first translation unit, configured to translate a0 with sys_AB to obtain b1, and then to translate b1 with sys_BA to obtain a2;
A second translation unit, configured to translate b0 with sys_BA to obtain a1;
A translation quality evaluation unit, configured to evaluate, with a0 as the reference, the translation quality of the corresponding sentences in a2 and a1, and to form a sentence pair set (a2', a1') from the sentences for which a2 scores higher than a1;
A paraphrase sentence pair acquiring unit, configured to obtain from b1 the sentence set b1' corresponding to a2', to obtain from b0 the sentence set b0' corresponding to a1', and thereby to obtain the paraphrase sentence pair set (b0', b1') of the second language.
9. The system according to claim 8, characterized in that the system further comprises:
A paraphrase rule extraction unit, configured to extract paraphrase rules from the paraphrase sentence pair set (b0', b1').
10. The system according to claim 9, characterized in that the paraphrase rule extraction unit comprises:
A word alignment subunit, configured to establish word alignment relations between the sentences in b0' and b1';
A word alignment filtering subunit, configured to filter the word alignment relations established by the word alignment subunit;
A paraphrase rule extraction subunit, configured to extract paraphrase rules from the filtering result of the word alignment filtering subunit.
11. The system according to claim 10, characterized in that the word alignment subunit is specifically configured to:
Establish the word alignment relations between the sentences in b0' and b1' according to the word alignment relations of the parallel corpus (a0, b0) and the word alignment relations between a0 and b1 that sys_AB establishes during translation.
12. The system according to claim 10, characterized in that the word alignment filtering subunit is specifically configured to:
Filter the established word alignment relations according to preset word alignment rules, where the rules comprise:
If two words in a paraphrase sentence pair have a definite alignment relation, only that definite alignment relation is retained, and the other crossing alignment relations of the two words are deleted;
And/or
Stop words and punctuation marks may be aligned only with stop words or punctuation marks.
13. The system according to claim 10, characterized in that the paraphrase rule extraction subunit is specifically configured to:
Extract paraphrase rules from the filtering result according to preset paraphrase rule constraints, where the paraphrase rule constraints comprise:
Each paraphrase rule comprises a left-hand side and a right-hand side, corresponding respectively to the text form before paraphrasing and the text form after paraphrasing;
The left-hand and right-hand sides of a rule either both consist of non-variables and variables, or both contain only non-variables;
There is at least one non-variable between the variables on the left-hand side of a rule.
14. The system according to any one of claims 8 to 13, characterized in that the initial setup unit is specifically configured to:
Obtain the translation systems sys_AB and sys_BA by training on the parallel corpus (a0, b0).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110332674.8A CN103092829B (en) | 2011-10-27 | 2011-10-27 | A kind of repetition resource acquiring method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110332674.8A CN103092829B (en) | 2011-10-27 | 2011-10-27 | A kind of repetition resource acquiring method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103092829A CN103092829A (en) | 2013-05-08 |
CN103092829B true CN103092829B (en) | 2015-11-25 |
Family
ID=48205417
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110332674.8A Active CN103092829B (en) | 2011-10-27 | 2011-10-27 | A kind of repetition resource acquiring method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103092829B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2020102131A (en) * | 2018-12-25 | 2020-07-02 | 株式会社日立製作所 | Text generation method, text generation device and trained model |
WO2021092730A1 (en) * | 2019-11-11 | 2021-05-20 | 深圳市欢太科技有限公司 | Digest generation method and apparatus, electronic device, and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004252495A (en) * | 2002-09-19 | 2004-09-09 | Advanced Telecommunication Research Institute International | Method and device for generating training data for training statistical machine translation device, paraphrase device, method for training the same, and data processing system and computer program for the method |
JP2006190072A (en) * | 2005-01-06 | 2006-07-20 | Advanced Telecommunication Research Institute International | Automatic paraphrasing apparatus, automatic paraphrasing method and paraphrasing process program |
CN100371927C (en) * | 2003-11-12 | 2008-02-27 | 微软公司 | System for identifying paraphrases using machine translation techniques |
Non-Patent Citations (3)
Title |
---|
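A Hierarchical Phrase-Based Model for Statistical Machine Translation; David Chiang; 《Proceedings of the 43rd Annual Meeting of the ACL》; 20050630; page 266, left column, Definition 1 * |
A Novel Statistical Pre-Processing Model for Rule-Based Machine Translation System; Yanli Sun et al.; 《In Proceedings of EAMT》; 20100531; main text page 2, right column, (i)-(iii); page 3, left column, paragraph 1, lines 1-4, and last paragraph; page 3, right column, paragraph 1; page 4, right column, (i)-(iii) * |
基于统计的复述获取与生成技术研究 (Research on Statistics-Based Paraphrase Acquisition and Generation); 赵世奇 (Zhao Shiqi); 《China Doctoral Dissertations Full-text Database, Information Science and Technology》; 20110515 (No. 5); main text page 5, section 1.2.1, paragraph 1; page 26, last paragraph; page 40, paragraph 1 * |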
Also Published As
Publication number | Publication date |
---|---|
CN103092829A (en) | 2013-05-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |