CN101989287A - Method and equipment for generating rule for statistics-based machine translation - Google Patents

Method and equipment for generating rule for statistics-based machine translation

Info

Publication number
CN101989287A
Authority
CN
China
Prior art keywords
phrase
rule
extracted
language
source language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910160943XA
Other languages
Chinese (zh)
Other versions
CN101989287B (en
Inventor
何中军
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN200910160943.XA priority Critical patent/CN101989287B/en
Priority claimed from CN200910160943.XA external-priority patent/CN101989287B/en
Publication of CN101989287A publication Critical patent/CN101989287A/en
Application granted granted Critical
Publication of CN101989287B publication Critical patent/CN101989287B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method and equipment for generating rules for statistics-based machine translation. The equipment comprises a rule extraction device and a rule filtering device, wherein the rule extraction device extracts rules from a parallel corpus, and the rule filtering device filters out, from the extracted rules, any rule whose source language phrase or target language phrase is not a predetermined phrase.

Description

Method and equipment for generating rules for statistics-based machine translation
Technical field
The present invention relates to statistics-based machine translation, in which a computer automatically translates one natural language into another, and in particular to a method and equipment for generating rules for statistics-based machine translation.
Background
Machine translation means using a computer to translate one natural language (the source language) into another natural language (the target language). As international exchange and cooperation grow ever closer, there is a pressing need for high-quality, efficient language translation services. Machine translation has broad application prospects, and it is also a difficult and important task in natural language processing.
At present, the mainstream translation technology is statistical machine translation. It models the translation process mathematically and can learn translation knowledge automatically from a parallel corpus; its advantages include strong language independence, a short system development cycle, and high robustness.
An important resource required by machine translation is the rule table. Statistical machine translation uses a rule table obtained from a parallel corpus. The rule table describes the correspondence between the source language and the target language, and its quality and expressive power directly affect the performance of the translation system. However, a rule table learned automatically from a parallel corpus is very large, which on the one hand requires a large amount of computer storage and on the other hand lowers translation efficiency. This makes statistical machine translation difficult to apply on devices with limited storage and computing resources, such as mobile phones and PDAs.
In fact, the rule table is highly redundant during translation. Reference [1] proposes a method that filters the rule table using information from the bilingual corpus, but its complexity is high. Reference [2] proposes a method that filters the rule table using target-language dependency tree information, but it adds an extra model to keep translation quality from dropping. The size of the rule table directly affects translation efficiency and translation quality. How to reduce the size of the rule table and improve translation efficiency, without reducing (or without noticeably reducing) translation quality and without increasing model complexity, is a problem that urgently needs solving in practical applications.
Reference [1]: Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving translation quality by discarding most of the phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 967-975, Prague, Czech Republic, June.
Reference [2]: Libin Shen, Jinxi Xu, and Ralph Weischedel. 2008. A new string-to-dependency machine translation algorithm with a target dependency language model. In Proceedings of ACL-08: HLT, pages 577-585, Columbus, Ohio, June.
Summary of the invention
One object of the present invention is to provide a method and equipment for generating rules for statistics-based machine translation, in which the rule table is filtered to reduce the computing resources required by a statistical machine translation system.
One embodiment of the present invention is equipment for generating rules for statistics-based machine translation, comprising: a rule extraction device that extracts rules from a parallel corpus; and a rule filtering device that filters out, from the extracted rules, any rule whose source language phrase or target language phrase is not a predetermined phrase.
Further, the equipment may also comprise a phrase extraction device that extracts, from a monolingual corpus of the source language or the target language, phrases whose statistical features satisfy a predetermined requirement, as the predetermined phrases.
Further, in the equipment, the predetermined phrases may comprise continuous phrases and non-continuous phrases.
Further, in the equipment, the statistical features may comprise at least one of the following: the number of times the extracted phrase occurs in the corresponding corpus, the information entropy of the extracted phrase, the probability of the extracted phrase, and the C-value of the extracted phrase.
Further, in the equipment, the rule filtering device may be configured to filter out, from the extracted rules, any rule whose source language phrase is not a predetermined phrase.
One embodiment of the present invention is a method for generating rules for statistics-based machine translation, comprising: extracting rules from a parallel corpus; and filtering out, from the extracted rules, any rule whose source language phrase or target language phrase is not a predetermined phrase.
Further, the method may also comprise extracting, from a monolingual corpus of the source language or the target language, phrases whose statistical features satisfy a predetermined requirement, as the predetermined phrases.
Further, in the method, the predetermined phrases may comprise continuous phrases and non-continuous phrases.
Further, in the method, the statistical features may comprise at least one of the following: the number of times the extracted phrase occurs in the corresponding corpus, the information entropy of the extracted phrase, the probability of the extracted phrase, and the C-value of the extracted phrase.
Further, in the method, the filtering may filter out, from the extracted rules, any rule whose source language phrase is not a predetermined phrase.
One embodiment of the present invention is equipment for generating rules for statistics-based machine translation, comprising: a rule extraction device that extracts rules from a parallel corpus; a rule recognition device that identifies monotone compound rules among the extracted rules, where a monotone compound rule contains smaller rules and the order of its source language phrases is the same as the order of the corresponding target language phrases; and a rule filtering device that filters the identified monotone compound rules out of the extracted rules.
Further, in the equipment, the smaller rules are rules among the extracted rules.
According to embodiments of the present invention, the rules obtained by the rule extraction device are filtered using predetermined phrases, thereby reducing the size of the rule table.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention will be more easily understood from the following description of embodiments of the invention with reference to the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals.
Fig. 1 is a block diagram showing an exemplary structure of equipment for generating rules for statistics-based machine translation according to an embodiment of the present invention;
Fig. 2 is a flowchart showing an example process of a method for generating rules for statistics-based machine translation according to an embodiment of the present invention;
Fig. 3 is a block diagram showing an exemplary structure of equipment for generating rules for statistics-based machine translation according to a preferred embodiment of the present invention;
Fig. 4 is a flowchart showing an example process of a method for generating rules for statistics-based machine translation according to a preferred embodiment of the present invention;
Fig. 5 is a block diagram showing an exemplary structure of the phrase extraction device in the equipment according to an embodiment of the present invention;
Fig. 6 is a flowchart showing an example process of the phrase extraction step in the method according to an embodiment of the present invention;
Fig. 7 is a block diagram showing an exemplary structure of equipment for generating rules for statistics-based machine translation according to another embodiment of the present invention;
Fig. 8 is a flowchart showing an example process of a method for generating rules for statistics-based machine translation according to another embodiment of the present invention;
Fig. 9a shows an example of a minimal rule, Fig. 9b shows an example of a compound rule, Fig. 9c shows an example of a monotone compound rule, and Fig. 9d shows an example of a non-monotone compound rule;
Fig. 10 is a block diagram showing an exemplary structure of a computer that implements embodiments of the present invention.
Embodiments
Embodiments of the present invention are described below with reference to the accompanying drawings. It should be noted that, for the sake of clarity, representations and descriptions of components and processing that are unrelated to the present invention or known to those of ordinary skill in the art are omitted from the drawings and the description.
To facilitate the description of the embodiments, definitions of the related terms "phrase", "subphrase", and "phrase length" are given first.
Phrase: any string of words in a sentence. Depending on whether the positions of the phrase's words in the sentence are continuous, phrases fall into two classes:
Continuous phrase: the positions in the sentence of the words inside the phrase are continuous;
Non-continuous phrase: the positions in the sentence of the words inside the phrase are not continuous.
Suppose a sentence of some language (for example Chinese or English) contains J words:
Figure B200910160943XD0000041
where c_j (1 ≤ j ≤ J) is a word of that language. Then
Figure B200910160943XD0000042
is a continuous phrase.
If
Figure B200910160943XD0000043
is a phrase, a span nested inside it (from position i′ to position j′) is a continuous phrase, and i ≤ i′ ≤ j′ ≤ j, then
Figure B200910160943XD0000045
is a non-continuous phrase, where X is a variable and k is the index of the variable, used to distinguish variables at different positions.
Subphrase: any string of words inside a phrase.
Phrase length: the number of words and variables contained in the phrase.
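The definitions above can be illustrated with a minimal sketch. The function names, the `X<k>` spelling of variables, and the example sentence are all assumptions made for illustration; the sentence is not the one shown in the patent's figure.

```python
from typing import List


def continuous_phrase(sentence: List[str], i: int, j: int) -> List[str]:
    """Return words i..j of the sentence (1-based, inclusive)."""
    if not 1 <= i <= j <= len(sentence):
        raise ValueError("expected 1 <= i <= j <= J")
    return sentence[i - 1:j]


def replace_with_variable(phrase: List[str], start: int, end: int, k: int) -> List[str]:
    """Replace phrase[start:end] (0-based, end exclusive) with the variable X<k>.

    The result is a non-continuous phrase: the remaining words are no
    longer adjacent in the original sentence.
    """
    if not 0 <= start < end <= len(phrase):
        raise ValueError("span must lie inside the phrase")
    return phrase[:start] + [f"X{k}"] + phrase[end:]


def phrase_length(phrase: List[str]) -> int:
    """Phrase length counts words and variables alike."""
    return len(phrase)


# Hypothetical example sentence, chosen only for illustration.
sentence = ["the", "economy", "of", "the", "region", "grows", "rapidly"]
whole = continuous_phrase(sentence, 1, 7)
gapped = replace_with_variable(whole, 3, 5, k=1)  # abstract away "the region"
print(gapped)
print(phrase_length(gapped))
```

Here the nested continuous phrase "the region" is replaced by a single variable, so the resulting non-continuous phrase has length 6 (five words plus one variable).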
A concrete example follows.
Suppose a Chinese sentence:
Figure B200910160943XD0000051
where "|" carries no meaning of its own and only separates the words.
By definition,
Figure B200910160943XD0000052
Figure B200910160943XD0000053
are two continuous phrases, so Q1 = "X1 | | economy | development | very | rapidly" is a non-continuous phrase.
In addition,
Figure B200910160943XD0000054
is a continuous phrase, so Q2 = "X1 | | X2 | very | rapidly" is a non-continuous phrase.
In the preceding example,
Figure B200910160943XD0000055
Q1, and Q2 are all subphrases.
Fig. 1 is a block diagram showing an exemplary structure of equipment 100 for generating rules for statistics-based machine translation according to an embodiment of the present invention.
As shown in Fig. 1, equipment 100 comprises rule extraction device 101 and rule filtering device 102.
Rule extraction device 101 extracts rules from parallel corpus 103.
A rule can be a triple <S, T, ~>, where S is a source language phrase, T is a target language phrase, and ~ is the correspondence between the variables or words inside S and T.
Two examples of rules follow.
Rule 1: <economy | development, economic | development, 1-1 | 2-2>. Here "|" carries no meaning of its own and only separates words or correspondences; 1-1 indicates that the 1st variable or word of the source language phrase, "economy", corresponds to the 1st variable or word of the target language phrase, "economic"; and 2-2 indicates that the 2nd variable or word of the source language phrase, "development", corresponds to the 2nd variable or word of the target language phrase, "development".
Rule 2: <X1 | | X2, X2 | of | X1, 1-3 | 2-2 | 3-1>. Here 1-3 indicates that the 1st variable or word of the source language phrase, X1, corresponds to the 3rd variable or word of the target language phrase, X1; 2-2 indicates that the 2nd variable or word of the source language phrase (the word between X1 and X2) corresponds to the 2nd variable or word of the target language phrase, "of"; and 3-1 indicates that the 3rd variable or word of the source language phrase, X2, corresponds to the 1st variable or word of the target language phrase, X2.
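One possible in-memory representation of the triple <S, T, ~> is sketched below. The `Rule` class, `parse_alignment`, and `aligned_elements` are hypothetical names, and the token `DE` is a placeholder for the source word between X1 and X2 in Rule 2, which did not survive in this text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    source: Tuple[str, ...]                  # S: source language phrase
    target: Tuple[str, ...]                  # T: target language phrase
    alignment: Tuple[Tuple[int, int], ...]   # ~: 1-based (source pos, target pos)


def parse_alignment(text: str) -> Tuple[Tuple[int, int], ...]:
    """Parse a correspondence string such as '1-3|2-2|3-1'."""
    pairs = []
    for item in text.split("|"):
        src, tgt = item.strip().split("-")
        pairs.append((int(src), int(tgt)))
    return tuple(pairs)


def aligned_elements(rule: Rule) -> List[Tuple[str, str]]:
    """Pair each source element with the target element it corresponds to."""
    return [(rule.source[s - 1], rule.target[t - 1]) for s, t in rule.alignment]


rule1 = Rule(("economy", "development"),
             ("economic", "development"),
             parse_alignment("1-1|2-2"))
# "DE" is a placeholder token, not taken from the patent text.
rule2 = Rule(("X1", "DE", "X2"),
             ("X2", "of", "X1"),
             parse_alignment("1-3|2-2|3-1"))
print(aligned_elements(rule2))
```

Making the class frozen keeps rules hashable, which is convenient if a rule table is later stored as a set or used as dictionary keys.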
Rule extraction can be performed with known techniques, for example the method described by David Chiang in "A hierarchical phrase-based model for statistical machine translation", in Proceedings of the 43rd Annual Meeting of the ACL, 2005, pages 263-270.
Rule filtering device 102 filters out, from the extracted rules, any rule whose source language phrase or target language phrase is not a predetermined phrase.
The predetermined phrases can be a set, prepared in advance, of the phrases that are desired as objects of translation. They can be prepared according to the characteristics of the specific translation application. For example, for an everyday-expression translation application, the predetermined phrases can mainly comprise phrases common in daily life; for translation applications in specialized domains such as law, military affairs, or aviation, they can mainly comprise phrases common in the corresponding domain. Depending on the specific filtering rule, the predetermined phrases can comprise only source language phrases, only target language phrases, or both.
Taking the case where the predetermined phrases comprise only source language phrases as an example, rule filtering device 102 processes each rule <S, T, ~> extracted by rule extraction device 101 as follows:
If S is identical to one of the predetermined phrases, the rule is kept. If S differs from every predetermined phrase, the rule is filtered out (that is, discarded).
Taking the case where the predetermined phrases comprise only target language phrases as an example, rule filtering device 102 processes each rule <S, T, ~> extracted by rule extraction device 101 as follows:
If T is identical to one of the predetermined phrases, the rule is kept. If T differs from every predetermined phrase, the rule is filtered out (that is, discarded).
Taking the case where the predetermined phrases comprise both source language phrases and target language phrases as an example, rule filtering device 102 processes each rule <S, T, ~> extracted by rule extraction device 101 as follows:
If S is identical to one of the predetermined source language phrases and T is identical to one of the predetermined target language phrases, the rule is kept. Otherwise, the rule is filtered out (that is, discarded).
Additionally or preferably, if S of a rule <S, T, ~> is a single word (phrase length 1), the rule is kept. The purpose is to preserve the coverage of the rules.
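The three filtering cases and the single-word exception can be sketched as one function. This is a minimal illustration under assumptions: rules are reduced to (S, T) pairs, phrases are tuples of tokens, "identical" means exact tuple equality, and `filter_rules` and its parameters are hypothetical names. The sample rules are loosely modeled on entries of the kind a rule table contains.

```python
from typing import Iterable, List, Optional, Set, Tuple

Phrase = Tuple[str, ...]
RulePair = Tuple[Phrase, Phrase]  # (S, T); the correspondence ~ is omitted here


def filter_rules(
    rules: Iterable[RulePair],
    source_phrases: Optional[Set[Phrase]] = None,
    target_phrases: Optional[Set[Phrase]] = None,
    keep_single_words: bool = True,
) -> List[RulePair]:
    """Keep rules whose S and/or T is a predetermined phrase.

    Passing None for a set means that side is not checked, which covers
    the source-only, target-only, and both-sides cases.
    """
    kept = []
    for source, target in rules:
        # Optional exception: single-word source phrases are always kept
        # to preserve coverage.
        if keep_single_words and len(source) == 1:
            kept.append((source, target))
            continue
        source_ok = source_phrases is None or source in source_phrases
        target_ok = target_phrases is None or target in target_phrases
        if source_ok and target_ok:
            kept.append((source, target))
    return kept


rules = [
    (("mistake",), ("error",)),
    (("weapons", "equipment"), ("weapons", "and", "equipment")),
    (("X1", "'s", "strengthen"), ("the", "strengthening", "of", "X1")),
]
predetermined_sources = {("X1", "'s", "strengthen")}
kept = filter_rules(rules, source_phrases=predetermined_sources)
print(len(kept))
```

Storing the predetermined phrases in a set makes each membership test a hash lookup, so filtering is linear in the number of extracted rules.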
For ease of understanding, an example shows how rule filtering is carried out.
Suppose the following 10 rules are extracted from the parallel corpus (the word correspondences between source language phrases and target language phrases are omitted here):
No.  Source language phrase                 Target language phrase
1    mistake                                error
2    mistake                                incorrect
3    carry out | formal | friendly | visit  pay | an | official | friendly | visit
4    carry out | formal | friendly | visit  official | and | goodwill | visit
5    weapon | equipment                     weapons | and | equipment
6    X1 | 's | strengthen                   the | strengthening | of | X1
7    country | X1 | directorate             national | X1 | command | headquarters
8    destroy | | X1 | 's | X2               undermines | the | X2 | of | the | X1
9    area | X1 | | X2                       X2 | of | a | region | X1
10   area | X1 | | X2                       X2 | of | X1 | in | areas
The predetermined phrases comprise only the following source language phrases:
No.  Source language phrase
1    formal | friendly | visit
2    X1 | 's | strengthen
3    destroy | | X1 | 's | X2
Thus the rules numbered 5, 7, 9, and 10 are filtered out, because their source language phrases are not predetermined phrases. The rules numbered 1 and 2 are kept even though their source language phrase is not among the predetermined phrases, because it is a single word and keeping it improves rule coverage. The rules remaining after filtering are as follows:
No.  Source language phrase                 Target language phrase
1    mistake                                error
2    mistake                                incorrect
3    carry out | formal | friendly | visit  pay | an | official | friendly | visit
4    carry out | formal | friendly | visit  official | and | goodwill | visit
6    X1 | 's | strengthen                   the | strengthening | of | X1
8    destroy | | X1 | 's | X2               undermines | the | X2 | of | the | X1
After filtering, only 6 rules remain, 40% fewer than the original 10.
Fig. 2 is a flowchart showing an example process of a method for generating rules for statistics-based machine translation according to an embodiment of the present invention.
As shown in Fig. 2, the method starts at step 201. In step 203, rules are extracted from a parallel corpus. Rule extraction can be performed with known techniques, for example the method described by David Chiang in "A hierarchical phrase-based model for statistical machine translation", in Proceedings of the 43rd Annual Meeting of the ACL, 2005, pages 263-270.
In step 205, any rule whose source language phrase or target language phrase is not a predetermined phrase is filtered out of the extracted rules.
The predetermined phrases can be a set, prepared in advance, of the phrases that are desired as objects of translation. They can be prepared according to the characteristics of the specific translation application. For example, for an everyday-expression translation application, the predetermined phrases can mainly comprise phrases common in daily life; for translation applications in specialized domains such as law, military affairs, or aviation, they can mainly comprise phrases common in the corresponding domain. Depending on the specific filtering rule, the predetermined phrases can comprise only source language phrases, only target language phrases, or both.
Taking the case where the predetermined phrases comprise only source language phrases as an example, each rule <S, T, ~> extracted in step 203 is processed as follows:
If S is identical to one of the predetermined phrases, the rule is kept. If S differs from every predetermined phrase, the rule is filtered out (that is, discarded).
Taking the case where the predetermined phrases comprise only target language phrases as an example, each rule <S, T, ~> extracted in step 203 is processed as follows:
If T is identical to one of the predetermined phrases, the rule is kept. If T differs from every predetermined phrase, the rule is filtered out (that is, discarded).
Taking the case where the predetermined phrases comprise both source language phrases and target language phrases as an example, each rule <S, T, ~> extracted in step 203 is processed as follows:
If S is identical to one of the predetermined source language phrases and T is identical to one of the predetermined target language phrases, the rule is kept. Otherwise, the rule is filtered out (that is, discarded).
Additionally or preferably, if S of a rule <S, T, ~> is a single word (phrase length 1), the rule is kept. The purpose is to preserve the coverage of the rules.
The method ends at step 207.
Fig. 3 is a block diagram showing an exemplary structure of equipment 300 for generating rules for statistics-based machine translation according to a preferred embodiment of the present invention.
As shown in Fig. 3, equipment 300 comprises rule extraction device 301, rule filtering device 302, and phrase extraction device 304. Rule extraction device 301, rule filtering device 302, and parallel corpus 303 shown in Fig. 3 are identical to rule extraction device 101, rule filtering device 102, and parallel corpus 103 shown in Fig. 1, respectively, and are not described again here.
Phrase extraction device 304 extracts, from monolingual corpus 305 of the source language or the target language, phrases whose statistical features satisfy a predetermined requirement, as the predetermined phrases.
The predetermined phrases can comprise only source language phrases, only target language phrases, or both. Correspondingly, monolingual corpus 305 can comprise only source language text, only target language text, or both.
The predetermined phrases can be a set of the phrases desired as objects of translation. Various criteria can determine whether a phrase is a desired object of translation, for example whether the phrase is commonly used in a given translation application, or whether it occurs in sufficiently many sentences. In short, phrases unlikely to be encountered in translation should not be included in the predetermined phrases.
For example, the statistical features can include, but are not limited to, at least one of the following: the number of times the extracted phrase occurs in the corresponding corpus, the information entropy of the extracted phrase, the probability of the extracted phrase, and the C-value of the extracted phrase.
Suppose there are N candidate phrases in the monolingual corpus. For each candidate phrase p, a 4-tuple <l(p), f(p), s(p), n(p)> is set up, with the following meaning:
l(p): the length of candidate phrase p;
f(p): the number of occurrences of candidate phrase p in the monolingual corpus;
s(p): the number of times candidate phrase p occurs inside other candidate phrases as a subphrase;
n(p): the number of candidate phrases that contain candidate phrase p as a subphrase.
The C-value is computed as:
$$
\text{C-value}(p) =
\begin{cases}
(l(p) - 1) \times f(p), & n(p) = 0 \\
(l(p) - 1) \times \left( f(p) - \dfrac{s(p)}{n(p)} \right), & n(p) \neq 0
\end{cases}
$$
For example, the 4-tuple of candidate phrase P1 = "United Nations | security | council" is <3, 100, 0, 0>: the length of this candidate phrase is 3, and it occurs 100 times in the monolingual corpus. All 100 occurrences are independent; none is an occurrence as a subphrase of another candidate phrase. Therefore C-value(P1) = (3 − 1) × 100 = 200.
The 4-tuple of phrase P2 = "United Nations | X1 | council" is <3, 100, 100, 1>: the length of this candidate phrase is 3, and it occurs 100 times in the monolingual corpus. It also occurs 100 times as a subphrase of other candidate phrases, and the number of candidate phrases that contain P2 as a subphrase is 1. Therefore
$$
\text{C-value}(P_2) = (3 - 1) \times \left( 100 - \frac{100}{1} \right) = 0
$$
As the two examples show, candidate phrase P2 is a subphrase of P1. Whenever P1 occurs, P2 necessarily occurs as well: replacing the 2nd word of P1, "security", with "X1" yields P2. Because P2 and P1 occur the same number of times (100 each), P2 can be regarded as having no independent significance, and it is less important than P1. The C-values reflect this: C-value(P1) > C-value(P2).
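The C-value computation and the two worked examples can be checked with a short sketch. The function name `c_value` and its parameter names are assumptions; the arguments follow the 4-tuple <l(p), f(p), s(p), n(p)> defined above.

```python
def c_value(length: int, freq: int, nested_freq: int, container_count: int) -> float:
    """C-value of a candidate phrase from its 4-tuple <l, f, s, n>.

    length          -- l(p), number of words and variables in p
    freq            -- f(p), occurrences of p in the monolingual corpus
    nested_freq     -- s(p), occurrences of p inside other candidate phrases
    container_count -- n(p), number of candidate phrases containing p
    """
    if container_count == 0:
        return (length - 1) * freq
    return (length - 1) * (freq - nested_freq / container_count)


p1 = c_value(3, 100, 0, 0)    # "United Nations | security | council"
p2 = c_value(3, 100, 100, 1)  # "United Nations | X1 | council"
print(p1, p2)
```

Because of the factor l(p) − 1, any single-word candidate scores 0 under this formula regardless of its frequency.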
In a specific implementation, phrase extraction device 304 can enumerate all phrases in monolingual corpus 305 as candidate phrases. Preferably, to control the number of candidate phrases, at least one of the following restrictions can be applied:
The length of a phrase cannot exceed L. L is an integer that can take a value in [5, 10] as required;
The number of occurrences of a phrase in the monolingual corpus cannot be less than F. F is an integer that can take a value in [2, 10] as required;
The number of variables in a non-continuous phrase cannot exceed K. K is an integer that can take a value in [2, 3] as required;
There is at least one word between the variables in a non-continuous phrase; that is, variables cannot be adjacent.
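The four restrictions can be sketched as a single predicate over a candidate phrase. This is an illustration under assumptions: variables are written as `X` followed by digits, the function and parameter names are hypothetical, and the default limits are just sample values from the stated ranges. The sketch applies all four restrictions together, although the text only requires at least one of them.

```python
import re
from typing import Sequence


def is_variable(token: str) -> bool:
    """Assumed convention: a variable is 'X' followed by one or more digits."""
    return re.fullmatch(r"X\d+", token) is not None


def is_valid_candidate(
    phrase: Sequence[str],
    freq: int,
    max_len: int = 7,    # L, sample value from [5, 10]
    min_freq: int = 2,   # F, sample value from [2, 10]
    max_vars: int = 2,   # K, sample value from [2, 3]
) -> bool:
    """Check the length, frequency, variable-count, and adjacency limits."""
    if len(phrase) > max_len:
        return False
    if freq < min_freq:
        return False
    flags = [is_variable(tok) for tok in phrase]
    if sum(flags) > max_vars:
        return False
    # Variables may not be adjacent: at least one word must separate them.
    if any(a and b for a, b in zip(flags, flags[1:])):
        return False
    return True


print(is_valid_candidate(("X1", "'s", "X2"), freq=5))
print(is_valid_candidate(("X1", "X2", "visit"), freq=5))
```

Applying the cheap length and frequency checks first avoids scanning tokens for candidates that would be rejected anyway.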
Phrase extraction device 304 calculates the statistical features of the candidate phrases from monolingual corpus 305, and takes the candidate phrases whose statistical features satisfy a predetermined condition (for example, being higher than a predetermined threshold) as the predetermined phrases. The extracted predetermined phrases are provided to rule filtering device 302.
Fig. 5 is a block diagram showing an exemplary structure of phrase extraction device 500 in the equipment according to an embodiment of the present invention.
As shown in Fig. 5, phrase extraction device 500 comprises phrase extraction unit 501, phrase evaluation unit 502, and phrase filtering unit 504.
Phrase extraction unit 501 extracts candidate phrases from monolingual corpus 503 of the source language or the target language. Phrase evaluation unit 502 calculates the statistical feature of each candidate phrase from monolingual corpus 503. Phrase filtering unit 504 compares the statistical feature of each candidate phrase with a predetermined threshold; if the feature is above the threshold, the candidate phrase is taken as a predetermined phrase, and otherwise the candidate phrase is discarded. The predetermined threshold can be a real number that takes a value in [10, 1000] as required. The larger the predetermined threshold, the fewer predetermined phrases are obtained.
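The threshold comparison performed by the phrase filtering unit can be sketched as follows, assuming the statistical feature has already been computed as one score per candidate. The name `select_predetermined` and the sample scores are hypothetical, and treating a score equal to the threshold as passing is an assumption, since the text only says "above".

```python
from typing import Dict, Set, Tuple

Phrase = Tuple[str, ...]


def select_predetermined(scores: Dict[Phrase, float], threshold: float) -> Set[Phrase]:
    """Keep candidates whose score reaches the threshold; discard the rest.

    Assumption: a score equal to the threshold counts as passing.
    """
    return {phrase for phrase, score in scores.items() if score >= threshold}


# Hypothetical scores, e.g. C-values computed beforehand.
scores = {
    ("United Nations", "security", "council"): 200.0,
    ("United Nations", "X1", "council"): 0.0,
    ("formal", "friendly", "visit"): 50.0,
}
loose = select_predetermined(scores, threshold=10)
strict = select_predetermined(scores, threshold=100)
print(len(loose), len(strict))
```

Raising the threshold from 10 to 100 shrinks the selected set, matching the observation that a larger threshold yields fewer predetermined phrases.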
The process flow diagram of Fig. 4 shows in accordance with a preferred embodiment of the present invention generation and is used for example process based on the method for the rule of the mechanical translation of statistics.
As shown in Figure 4, method is from step 401.In step 403, executing rule extracts, and processing wherein is identical with step 203 shown in Figure 2.
In step 405, from single language corpus of source language or target language, extract statistical nature and satisfy the phrase of pre-provisioning request as predetermined phrase.
Predetermined phrase can only comprise the source language phrase, only comprise target language phrase or both comprised the source language phrase, comprises the target language phrase again.Correspondingly, single language corpus 305 can only comprise the source language language material, only comprise target language language material or both comprised the source language language material, comprises the target language language material again.
Predetermined phrase can be to comprise the set of wishing as the phrase of translation object.Can there be various standards to determine the translation object whether phrase wishes.For example whether whether phrase used always in certain translation is used, occurred or the like in abundant sentence.In a word, the phrase that unlikely runs in translation does not wish to be comprised in the predetermined phrase.
For example, statistical nature can include but not limited to that following feature one of at least: the C-value value of the probability of the information entropy of the number of times that the phrase that is extracted occurs, the phrase that is extracted, the phrase that is extracted and the phrase that is extracted in corresponding corpus.
In a specific example, all phrases can be enumerated from the monolingual corpus and used as candidate phrases. Preferably, in order to control the number of candidate phrases, at least one of the following restrictions can be applied:
The length of a phrase cannot exceed L. L is an integer and, as required, may take a value in [5, 10];
The number of occurrences of a phrase in the monolingual corpus cannot be less than F. F is an integer and, as required, may take a value in [2, 10];
The number of variables in a non-contiguous phrase cannot exceed K. K is an integer and, as required, may take a value in [2, 3];
There must be at least one word between the variables in a non-contiguous phrase; that is, variables cannot be adjacent.
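The four restrictions above can be sketched as a single validity check on a candidate. This is a minimal Python sketch under stated assumptions: a candidate is a list of tokens in which the marker "X" stands for a variable, variables count toward the phrase length, and the default limits are merely picked from the ranges given in the text. The function name and the marker are hypothetical.

```python
VAR = "X"  # hypothetical marker for a variable (gap) in a non-contiguous phrase


def is_valid_candidate(pattern, count, max_len=7, min_count=2, max_vars=2):
    """Apply the restrictions L (max_len), F (min_count) and K (max_vars)."""
    if len(pattern) > max_len:        # length may not exceed L
        return False
    if count < min_count:             # must occur at least F times
        return False
    if sum(tok == VAR for tok in pattern) > max_vars:  # at most K variables
        return False
    # At least one word must separate two variables.
    if any(a == VAR and b == VAR for a, b in zip(pattern, pattern[1:])):
        return False
    return True


print(is_valid_candidate(["with", VAR, "on", VAR], count=5))
print(is_valid_candidate([VAR, VAR, "visit"], count=5))
print(is_valid_candidate(["a", "short", "visit"], count=1))
```

Whether variables should count toward the length limit is not stated in the text; the sketch counts them only to keep the check simple.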
In step 405, the statistical features of the candidate phrases are calculated from the monolingual corpus, and the candidate phrases whose statistical features satisfy a predetermined condition (for example, exceed a predetermined threshold) are taken as predetermined phrases.
Note that although steps 403 and 405 are depicted in Fig. 4 as being performed in sequence, step 405 may also be performed before step 403, or concurrently with it, as long as the predetermined phrases are available before step 407 is executed.
In step 407, rule filtering is performed; the processing is the same as in step 205 shown in Fig. 2. The method ends at step 409.
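The rule filtering step can be sketched as follows, based on the description that a rule is filtered out when its source language phrase or target language phrase is not a predetermined phrase. The `Rule` type, the function name and the convention that `None` means "no predetermined phrases for this side, so that side is not checked" are assumptions made for this illustration.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Rule(NamedTuple):
    source: Tuple[str, ...]   # source language phrase
    target: Tuple[str, ...]   # target language phrase
    alignment: str            # word alignment, e.g. "1-1|1-2"


def filter_rules(rules, source_phrases: Optional[Set] = None,
                 target_phrases: Optional[Set] = None):
    """Keep a rule only if each checked side is a predetermined phrase."""
    kept = []
    for rule in rules:
        if source_phrases is not None and rule.source not in source_phrases:
            continue
        if target_phrases is not None and rule.target not in target_phrases:
            continue
        kept.append(rule)
    return kept


# Rules taken from the examples in the text; the phrase sets are hypothetical.
rules = [
    Rule(("associating",), ("jointly",), "1-1"),
    Rule(("of short duration", "visit"), ("a", "short", "visit"), "1-1|1-2|2-1|2-3"),
]
kept = filter_rules(rules, source_phrases={("associating",)})
print(kept)
```

Passing only `source_phrases` corresponds to the variant in which filtering is based on the source language phrase alone.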
The flowchart of Fig. 6 shows an example process of the phrase extraction step in the method according to an embodiment of the present invention.
As shown in Fig. 6, the phrase extraction step starts at step 601. In step 603, candidate phrases are extracted from a monolingual corpus of the source language or the target language. In step 605, a statistical feature of each candidate phrase is calculated from the monolingual corpus. In step 607, the statistical feature of each candidate phrase is compared with a predetermined threshold: if the feature is at or above the threshold, the candidate phrase is kept as a predetermined phrase; otherwise it is discarded. The predetermined threshold may be a real number and, as required, may take a value in the range [10, 1000]. The larger the threshold, the fewer predetermined phrases are obtained. In step 609, the phrase extraction step ends.
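The comparison in step 607 can be sketched in a few lines of Python. The sketch assumes that the feature values have already been computed (step 605) and are held in a dictionary, and that "at or above the threshold" is the intended comparison; the function name and the values are hypothetical.

```python
def select_predetermined(features, threshold):
    """features maps each candidate phrase to its statistical feature value."""
    return {phrase for phrase, value in features.items() if value >= threshold}


# Hypothetical feature values for three candidate phrases.
features = {
    ("a", "short", "visit"): 120.0,
    ("short", "trip"): 15.0,
    ("visit", "a"): 3.0,
}
low = select_predetermined(features, threshold=10)
high = select_predetermined(features, threshold=100)
print(len(low), len(high))
```

Raising the threshold from 10 to 100 shrinks the selected set, which illustrates the statement that a larger threshold yields fewer predetermined phrases.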
The foregoing has described embodiments in which the extracted rules are filtered according to predetermined phrases. In combination with this or separately, the extracted rules can also be filtered according to other criteria.
For example, a rule whose correspondences do not involve any phrase outside the rule is called a complete rule, and a complete rule from which no smaller complete rule can be split off is called a minimal rule. Fig. 9a shows an example of a minimal rule <of short duration | visit, a | short | visit, 1-1|1-2|2-1|2-3>. A rule that can be composed of two or more minimal rules is called a compound rule. Fig. 9b shows an example of a compound rule, in which the compound rule <associating | hold |, jointly | held | by, 1-1|2-2|2-3|3-3> can be formed by combining the minimal rules <associating, jointly, 1-1> and <hold |, held | by, 1-1|1-2|2-2>.
According to the order in which the source and target language phrases (variables) of the constituent minimal rules are arranged, compound rules can be divided into two classes: monotone compound rules and non-monotone compound rules. A monotone compound rule is composed monotonically of minimal rules: the relative order of the source and target language phrases (variables) of the different minimal rules within the compound rule is the same as the relative order of the corresponding minimal rules (that is, the order of the source language phrases is the same as the order of their corresponding target language phrases). A non-monotone compound rule is composed non-monotonically of minimal rules: the relative order of the source and target language phrases (variables) of the different minimal rules within the compound rule differs from the relative order of the corresponding minimal rules (that is, the order of the source language phrases differs from the order of their corresponding target language phrases). Fig. 9c shows an example of a monotone compound rule <just | X1 | and | X2, about | X1 | and | X2, 1-1|2-2|3-3|4-4>, which can be composed monotonically of the four minimal rules <just, about, 1-1>, <X1, X1, 1-1>, <and, and, 1-1> and <X2, X2, 1-1>. Fig. 9d shows an example of a non-monotone compound rule <just | X1 | with | X2, with the | X2 | on | X1, 1-3|2-4|3-1|4-2>. Although this rule can be composed of the four minimal rules <just, on, 1-1>, <X1, X1, 1-1>, <with, with the, 1-1> and <X2, X2, 1-1>, the relative order of the source language phrases (variables) "just", X1, "with" and X2 and of the target language phrases (variables) "with the", X2, "on" and X1 in this rule differs from the relative order of the four minimal rules.
As can be seen, in terms of translation results, using a monotone compound rule is equivalent to using the minimal rules that compose it. Therefore, monotone compound rules can be considered for filtering out.
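The distinction between minimal, monotone compound and non-monotone compound rules can be illustrated from the word alignment alone. The Python sketch below makes a simplifying assumption that is not stated in the text: each connected group of alignment links is treated as one minimal rule, and a rule is called monotone when these groups appear in the same order on the source and target sides. The function names are hypothetical.

```python
def alignment_blocks(links):
    """Group alignment links (source, target) into connected blocks."""
    blocks = []  # each block is (set of source positions, set of target positions)
    for s, t in links:
        src, tgt = {s}, {t}
        rest = []
        for bs, bt in blocks:
            if s in bs or t in bt:   # the new link touches this block: merge it
                src |= bs
                tgt |= bt
            else:
                rest.append((bs, bt))
        rest.append((src, tgt))
        blocks = rest
    return blocks


def classify(links):
    """Return 'minimal', 'monotone' or 'non-monotone' for an aligned rule."""
    blocks = alignment_blocks(links)
    if len(blocks) < 2:
        return "minimal"
    blocks.sort(key=lambda b: min(b[0]))  # order blocks by source position
    for (s1, t1), (s2, t2) in zip(blocks, blocks[1:]):
        # Each block must lie entirely before the next one on both sides.
        if max(s1) >= min(s2) or max(t1) >= min(t2):
            return "non-monotone"
    return "monotone"


print(classify([(1, 1), (1, 2), (2, 1), (2, 3)]))  # alignment of Fig. 9a
print(classify([(1, 1), (2, 2), (2, 3), (3, 3)]))  # alignment of Fig. 9b
print(classify([(1, 1), (2, 2), (3, 3), (4, 4)]))  # alignment of Fig. 9c
print(classify([(1, 3), (2, 4), (3, 1), (4, 2)]))  # alignment of Fig. 9d
```

Under this assumption the four alignments given for Fig. 9a to Fig. 9d are classified as minimal, monotone, monotone and non-monotone respectively, which matches the way the text describes them.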
The block diagram of Fig. 7 shows an exemplary structure of an equipment 700 for generating rules for statistics-based machine translation according to another embodiment of the present invention.
As shown in Fig. 7, the equipment 700 comprises a rule extraction device 701, a rule filter device 702 and a rule recognition device 704. The rule extraction device 701 extracts rules from a parallel corpus 703. The processing of the rule extraction device 701 is the same as that of the rule extraction device 101 of Fig. 1 and is therefore not described again.
The rule recognition device 704 identifies monotone compound rules among the extracted rules.
The rule filter device 702 filters the identified monotone compound rules out of the extracted rules.
The flowchart of Fig. 8 shows an example process of a method for generating rules for statistics-based machine translation according to another embodiment of the present invention.
As shown in Fig. 8, the method starts at step 801. In step 803, rules are extracted from a parallel corpus. In step 805, monotone compound rules are identified among the extracted rules. In step 807, the identified monotone compound rules are filtered out of the extracted rules. The method ends at step 809.
In the embodiments shown in Fig. 7 and Fig. 8, all extracted rules identified as monotone compound rules are deleted. However, suppose that for an extracted rule identified as a monotone compound rule there is at least one other extracted rule that is a non-monotone compound rule and contains the same source language phrase, that is, the source language phrase can be translated in two ways, monotonically and non-monotonically. Simply deleting the monotone compound rule then increases the probability that the non-monotone compound rule is selected during translation, which makes the probability calculation inaccurate. Therefore, preferably, the rule filter device 702 and step 807 filter out only those identified monotone compound rules whose source language phrase is not the source language phrase of any non-monotone compound rule.
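The preferred, more conservative filtering can be sketched as follows. The Python sketch assumes that each extracted rule has already been labeled as "minimal", "monotone" or "non-monotone" by a recognition step, and it represents a rule as a hypothetical (source, target, kind) tuple. A monotone compound rule is removed only when no non-monotone compound rule shares its source language phrase.

```python
def filter_monotone(rules):
    """Drop monotone compound rules unless their source phrase is protected."""
    # Source phrases that also have a non-monotone translation are protected.
    protected = {source for source, _target, kind in rules if kind == "non-monotone"}
    return [rule for rule in rules
            if rule[2] != "monotone" or rule[0] in protected]


s1 = ("just", "X1", "with", "X2")
s2 = ("just", "X1", "and", "X2")
rules = [
    (s1, ("with the", "X2", "on", "X1"), "non-monotone"),  # rule of Fig. 9d
    (s1, ("about", "X1", "with", "X2"), "monotone"),       # hypothetical rule
    (s2, ("about", "X1", "and", "X2"), "monotone"),        # rule of Fig. 9c
]
kept = filter_monotone(rules)
print(kept)
```

In this example the monotone rule for `s2` is removed, while both rules for `s1` are kept so that the relative probabilities of its two translations are not distorted.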
The beneficial effects of the present invention include, but are not limited to, at least one of the following: reducing the size of the rule table, effectively reducing storage space, and improving translation efficiency. It should be noted that, with the rule table significantly reduced, translation quality not only does not decline but in some cases even improves. The proposed scheme is independent of language and model, can be applied to statistical machine translation models for various language pairs, and is highly extensible. The present invention is suited to environments with high demands on storage space and computational efficiency, for example translation services provided on mobile devices such as mobile phones and PDAs.
The rule table generated according to the scheme of the present invention is independent of the specific statistical machine translation model and can be applied to all statistical machine translation methods, for example phrase-based methods and syntax-based methods, and thus has the advantage of wide applicability.
One embodiment of the present invention is an equipment for generating rules for statistics-based machine translation, comprising: a rule extraction device, which extracts rules from a parallel corpus; and a rule filter device, which filters out, from the extracted rules, any rule whose source language phrase or target language phrase is not a predetermined phrase.
In a further embodiment, the equipment also comprises: a phrase extraction device, which extracts, from a monolingual corpus of the source language or the target language, phrases whose statistical features satisfy a predetermined requirement as the predetermined phrases.
In a further embodiment, the predetermined phrases comprise contiguous phrases and non-contiguous phrases.
In a further embodiment, the statistical feature comprises at least one of the following: the number of times the extracted phrase occurs in the corresponding corpus, the information entropy of the extracted phrase, the probability of the extracted phrase, and the C-value of the extracted phrase.
In a further embodiment, the rule filter device is configured to filter out, from the extracted rules, any rule whose source language phrase is not one of the predetermined phrases.
One embodiment of the present invention is a method for generating rules for statistics-based machine translation, comprising: extracting rules from a parallel corpus; and filtering out, from the extracted rules, any rule whose source language phrase or target language phrase is not a predetermined phrase.
In a further embodiment, the method also comprises: extracting, from a monolingual corpus of the source language or the target language, phrases whose statistical features satisfy a predetermined requirement as the predetermined phrases.
In a further embodiment, the predetermined phrases comprise contiguous phrases and non-contiguous phrases.
In a further embodiment, the statistical feature comprises at least one of the following: the number of times the extracted phrase occurs in the corresponding corpus, the information entropy of the extracted phrase, the probability of the extracted phrase, and the C-value of the extracted phrase.
In a further embodiment, the filtering consists of filtering out, from the extracted rules, any rule whose source language phrase is not one of the predetermined phrases.
One embodiment of the present invention is an equipment for generating rules for statistics-based machine translation, comprising: a rule extraction device, which extracts rules from a parallel corpus; a rule recognition device, which identifies monotone compound rules among the extracted rules, a monotone compound rule being a rule that can comprise smaller rules and in which the order of the source language phrases is the same as the order of the target language phrases corresponding to them; and a rule filter device, which filters the identified monotone compound rules out of the extracted rules.
In a further embodiment, the source language phrase of an identified monotone compound rule is not the source language phrase of any non-monotone compound rule.
In addition, it should be noted that the above series of processes and devices can be implemented by hardware, or by software and firmware. When they are implemented by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer with a dedicated hardware structure, for example the general-purpose computer 1000 shown in Fig. 10, which can perform various functions when various programs are installed.
In Fig. 10, a central processing unit (CPU) 1001 performs various processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage section 1008 into a random-access memory (RAM) 1003. The RAM 1003 also stores, as required, the data needed when the CPU 1001 performs the various processes.
The CPU 1001, the ROM 1002 and the RAM 1003 are connected to one another via a bus 1004. An input/output interface 1005 is also connected to the bus 1004.
The following components are connected to the input/output interface 1005: an input section 1006, including a keyboard, a mouse and the like; an output section 1007, including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker and the like; the storage section 1008, including a hard disk and the like; and a communication section 1009, including a network interface card such as a LAN card, a modem and the like. The communication section 1009 performs communication processing via a network such as the Internet.
A drive 1010 is also connected to the input/output interface 1005 as required. A removable medium 1011 such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory is mounted on the drive 1010 as required, so that a computer program read from it is installed into the storage section 1008 as required.
When the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1011.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 1011 shown in Fig. 10, which stores the program and is distributed separately from the equipment in order to provide the program to the user. Examples of the removable medium 1011 include magnetic disks (including floppy disks (registered trademark)), optical discs (including compact disc read-only memory (CD-ROM) and digital versatile disc (DVD)), magneto-optical discs (including MiniDisc (MD) (registered trademark)) and semiconductor memories. Alternatively, the storage medium may be the ROM 1002, a hard disk contained in the storage section 1008, or the like, which stores the program and is distributed to the user together with the equipment that contains it.
It should also be pointed out that the steps of the above series of processes can naturally be performed chronologically in the order described, but need not necessarily be performed in that order. Some steps can be performed in parallel or independently of one another.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. An equipment for generating rules for statistics-based machine translation, characterized by comprising:
a rule extraction device, which extracts rules from a parallel corpus; and
a rule filter device, which filters out, from the extracted rules, any rule whose source language phrase or target language phrase is not a predetermined phrase.
2. The equipment according to claim 1, characterized by further comprising:
a phrase extraction device, which extracts, from a monolingual corpus of the source language or the target language, phrases whose statistical features satisfy a predetermined requirement as the predetermined phrases.
3. The equipment according to claim 1, characterized in that the statistical feature comprises at least one of the following: the number of times the extracted phrase occurs in the corresponding corpus, the information entropy of the extracted phrase, the probability of the extracted phrase, and the C-value of the extracted phrase.
4. The equipment according to claim 1, characterized in that the rule filter device is configured to filter out, from the extracted rules, any rule whose source language phrase is not one of the predetermined phrases.
5. A method for generating rules for statistics-based machine translation, characterized by comprising:
extracting rules from a parallel corpus; and
filtering out, from the extracted rules, any rule whose source language phrase or target language phrase is not a predetermined phrase.
6. The method according to claim 5, characterized by further comprising:
extracting, from a monolingual corpus of the source language or the target language, phrases whose statistical features satisfy a predetermined requirement as the predetermined phrases.
7. The method according to claim 5, characterized in that the statistical feature comprises at least one of the following: the number of times the extracted phrase occurs in the corresponding corpus, the information entropy of the extracted phrase, the probability of the extracted phrase, and the C-value of the extracted phrase.
8. The method according to claim 5, characterized in that the filtering consists of filtering out, from the extracted rules, any rule whose source language phrase is not one of the predetermined phrases.
9. An equipment for generating rules for statistics-based machine translation, characterized by comprising:
a rule extraction device, which extracts rules from a parallel corpus;
a rule recognition device, which identifies monotone compound rules among the extracted rules, a monotone compound rule being a rule that can comprise smaller rules and in which the order of the source language phrases is the same as the order of the target language phrases corresponding to them; and
a rule filter device, which filters the identified monotone compound rules out of the extracted rules.
10. The equipment according to claim 9, characterized in that the source language phrase of an identified monotone compound rule is not the source language phrase of any non-monotone compound rule.
CN200910160943.XA 2009-07-31 Method and equipment for generating rule for statistics-based machine translation Expired - Fee Related CN101989287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910160943.XA CN101989287B (en) 2009-07-31 Method and equipment for generating rule for statistics-based machine translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910160943.XA CN101989287B (en) 2009-07-31 Method and equipment for generating rule for statistics-based machine translation

Publications (2)

Publication Number Publication Date
CN101989287A (en) 2011-03-23
CN101989287B (en) 2016-12-14





Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161214

Termination date: 20180731
