CN102591857B - Bilingual corpus resource acquisition method and bilingual corpus resource acquisition system - Google Patents

Bilingual corpus resource acquisition method and bilingual corpus resource acquisition system

Info

Publication number
CN102591857B
Authority
CN
China
Prior art keywords
language
word string
corpus
intertranslation
public
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110021725.5A
Other languages
Chinese (zh)
Other versions
CN102591857A (en)
Inventor
郑仲光
何中军
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd

Landscapes

  • Machine Translation (AREA)

Abstract

An embodiment of the invention discloses a bilingual corpus resource acquisition method and a bilingual corpus resource acquisition system. The bilingual corpus resource acquisition method includes the steps: acquiring a matched intermediate language common word string between a first language database and a second language database; and forming a mutually-translated text pair of a first language and a second language, wherein the mutually-translated text pair is used for forming bilingual corpus resources of the first language and the second language. The first language database comprises bilingual corpora of the first language and an intermediate language, and the second language database comprises bilingual corpora of the second language and the intermediate language. By means of applying the scheme provided by the embodiment, the bilingual corpora of the two languages are acquired by the aid of the third-party language, so that the problem of corpus resource scarcity between the languages is solved, and a high-quality translation rule can be acquired to construct a statistical machine translation system.

Description

A parallel corpus resource acquisition method and system
Technical field
The present invention relates generally to the field of computer application technology, and in particular to a parallel corpus resource acquisition method and system.
Background
Machine translation (MT), also called automatic translation, is the process of using a computer to convert one natural language (the source language) into another natural language (the target language); it generally refers to the translation of sentences and full texts between natural languages. Statistical machine translation (SMT) is one kind of machine translation and is currently among the better-performing methods for open-domain machine translation. The basic idea of SMT is to perform statistical analysis on a sufficient quantity of parallel corpora (also called bilingual corpora or bilingual mutually translated corpora), build a statistical translation model through training, and then translate with that model. Machine translation has been moving gradually from early word-based translation to phrase-based translation and is incorporating semantic information to further improve the intelligence and accuracy of translation.
Training an SMT system requires parallel corpora (that is, words or phrases with a determined mutual-translation relation) as reference. Only when a sufficient quantity of parallel corpora is available can a large number of translation rules be extracted from them. In practice, however, many language pairs have no parallel corpus resources, or only few, so it is difficult to obtain translation rules between these languages from parallel corpora in order to build an SMT system.
Summary of the invention
In view of this, embodiments of the present invention provide a parallel corpus resource acquisition method and system. The scheme provided by the embodiments uses a third-party language to obtain parallel corpora between two languages, thereby addressing the scarcity of corpus resources between languages and helping to obtain higher-quality translation rules for building a statistical machine translation system.
An embodiment of the present invention provides a parallel corpus resource acquisition method, comprising:
obtaining common word strings of an intermediate language that match between a first corpus and a second corpus;
forming, according to the obtained common word strings, mutually translated text pairs of a first language and a second language, the mutually translated text pairs being used to form a parallel corpus resource of the first language and the second language;
wherein the first corpus comprises parallel corpora of the first language and the intermediate language; and
the second corpus comprises parallel corpora of the second language and the intermediate language.
According to another aspect of the embodiments of the present invention, a parallel corpus resource acquisition system is provided, comprising:
a common word string acquisition module, configured to obtain common word strings of the intermediate language that match between the first corpus and the second corpus;
a mutually translated text pair forming module, configured to form, according to the common word strings obtained by the common word string acquisition module, mutually translated text pairs of the first language and the second language, the mutually translated text pairs being used to form a parallel corpus resource of the first language and the second language;
wherein the first corpus comprises parallel corpora of the first language and the intermediate language; and
the second corpus comprises parallel corpora of the second language and the intermediate language.
According to yet another aspect of the embodiments of the present invention, a program product storing machine-readable instruction code is also provided; when the instruction code is read and executed by a machine, the above parallel corpus resource acquisition method can be performed.
According to a further aspect of the embodiments of the present invention, a storage medium carrying machine-readable instruction code is provided; when the instruction code is read and executed by a machine, the above parallel corpus resource acquisition method can be performed.
Various specific implementations of the embodiments of the present invention are given in the description below. The detailed description fully discloses preferred embodiments of the invention without imposing limitations on it.
Brief description of the drawings
The above and other objects and advantages of the embodiments of the present invention are further described below in conjunction with specific embodiments and with reference to the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals. In the drawings:
Fig. 1 is a schematic diagram of a translation rule in a translation model;
Fig. 2 is a flowchart of the parallel corpus resource acquisition method according to an embodiment of the present invention;
Fig. 3 is a flowchart of the method for obtaining common word strings of the intermediate language according to an embodiment of the present invention;
Fig. 4(a)-4(b) are schematic diagrams of one constraint condition on the common word string according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of another constraint condition on the common word string according to an embodiment of the present invention;
Fig. 6(a)-6(c) are schematic diagrams of obtaining English-Japanese parallel corpora according to an embodiment of the present invention;
Fig. 7 is a structural diagram of the parallel corpus resource acquisition system according to an embodiment of the present invention;
Fig. 8 is a structural diagram of the common word string acquisition module according to an embodiment of the present invention;
Fig. 9 is a block diagram of an example structure of a personal computer serving as the information processing device used in embodiments of the present invention.
Detailed description
Embodiments of the present invention are described below with reference to the accompanying drawings.
When two languages lack sufficient parallel corpus resources, translation rules between them can be obtained indirectly by merging translation rules through an intermediate language. For example, suppose two translation models M1 and M2 are known, where:
M1 is the translation model between the first language and the intermediate language;
M2 is the translation model between the intermediate language and the second language.
Both M1 and M2 contain a certain number of translation rules. A translation model in statistical machine translation mainly comprises four parts: the first-language rule, the second-language rule, the alignment information, and the rule probability. Fig. 1 shows an example of a translation rule.
By comparing the intermediate-language parts of the rules in the two rule tables and merging the rules whose intermediate-language parts are identical, translation rules between the first language and the second language can be obtained indirectly. However, this way of obtaining translation rules has at least the following problems:
1) If m1 rules in M1 and m2 rules in M2 share the same intermediate-language part, matching produces up to m1 × m2 new rules. The rule table therefore expands and the efficiency of the translation system drops.
2) Because each rule carries a probability, a probability must be re-estimated for every newly matched rule, and that estimate is based on the probabilities of the rules in M1 and M2. Since those probabilities are themselves estimates, the accuracy of the probabilities of the newly matched rules is even harder to guarantee.
3) Because it is not known what sentence contexts the rules in M1 and M2 were extracted from, simple matching produces many ambiguous rules, which degrades the final translation quality.
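The rule-table expansion named in problem 1) can be illustrated with a minimal sketch. The function name `pivot_rules`, the tuple layout of a rule, and the use of a plain product as the merged probability are assumptions made only for illustration; the text does not say how merged probabilities would be estimated.

```python
from collections import defaultdict


def pivot_rules(m1, m2):
    """Merge rules that share the same intermediate-language part.

    m1 -- list of (first_lang, intermediate, probability) rules
    m2 -- list of (intermediate, second_lang, probability) rules
    """
    by_pivot = defaultdict(list)
    for pivot, target, prob in m2:
        by_pivot[pivot].append((target, prob))

    merged = []
    for source, pivot, p1 in m1:
        for target, p2 in by_pivot.get(pivot, []):
            # Assumption: a naive product stands in for the re-estimated probability.
            merged.append((source, target, p1 * p2))
    return merged


# Three M1 rules and two M2 rules share the intermediate part "p".
m1 = [("s1", "p", 0.5), ("s2", "p", 0.3), ("s3", "p", 0.2)]
m2 = [("p", "t1", 0.6), ("p", "t2", 0.4)]
merged = pivot_rules(m1, m2)
print(len(merged))  # m1 x m2 = 3 x 2 merged rules
```

Every merged probability here is computed from two values that were themselves estimated, which is the accuracy concern raised in problem 2).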
It can be seen that translation rules obtained indirectly through an intermediate language achieve good results in neither translation efficiency nor accuracy. To solve the above problems, the technical scheme provided by the embodiments of the present invention uses the intermediate language to obtain parallel corpora between the first language and the second language. Translation rules between the first language and the second language can then still be extracted from parallel corpora, which safeguards the quality of the translation rules.
For example, parallel corpus resources between English and Japanese are scarce, whereas large English-Chinese and Chinese-Japanese parallel corpora currently exist. Chinese can therefore serve as a third-party intermediate language, and more English-Japanese parallel corpora can be obtained from the existing English-Chinese and Chinese-Japanese parallel corpora.
Fig. 2 is a flowchart of a parallel corpus resource acquisition method provided by an embodiment of the present invention, comprising the following steps:
S101: obtain common word strings of the intermediate language that match between the first corpus and the second corpus.
S102: according to the obtained common word strings, form mutually translated text pairs of the first language and the second language.
The first corpus and the second corpus are both existing corpora that record text pairs with a corresponding mutual-translation relation. The first corpus comprises parallel corpora of the first language and the intermediate language; it may be a bilingual corpus, or it may be a multilingual corpus (that is, a corpus containing mutually translated material in three or more languages) that includes the first language and the intermediate language. The embodiments of the present invention do not limit this. Similarly, the second corpus comprises parallel corpora of the second language and the intermediate language; it may be a bilingual corpus, or a multilingual corpus that includes the second language and the intermediate language.
For convenience of description, in the embodiments of the present invention S denotes the first-language text resource in the first corpus and T denotes the second-language text resource in the second corpus. Because the first corpus and the second corpus are two independent corpora, their intermediate-language text resources are generally not identical. To distinguish them, P1 below denotes the intermediate-language text resource in the first corpus and P2 denotes the intermediate-language text resource in the second corpus.
The scheme provided by the embodiments of the present invention works as follows. First, the common word strings $p_i$ that match between P1 and P2 are obtained ($i = 1, 2, 3, \ldots, N$, where N is a natural number denoting the number of common word strings matched between P1 and P2). S and T must contain mutually translated texts $s_i$ and $t_i$ that correspond to $p_i$, so $s_i$ and $t_i$ can in turn be regarded as forming a mutually translated text pair. The resulting set of text pairs $(s_i, t_i)$ can then be used to form a new parallel corpus resource of the first language and the second language. In alternative embodiments, further constraint conditions may be set on $s_i$ and $t_i$ to determine which text pairs qualify (this is described in detail below).
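The pairing step just described can be sketched as a lookup on the common word string. This is a simplified illustration, not the patented procedure: the dictionaries `s_by_p` and `t_by_p`, the function name `form_text_pairs`, and the placeholder strings are all hypothetical, and a real corpus would locate each translation through word alignment rather than a ready-made mapping.

```python
def form_text_pairs(common_strings, s_by_p, t_by_p):
    """Pair the first-language and second-language texts that share p_i.

    common_strings -- iterable of intermediate-language common word strings p_i
    s_by_p         -- maps p_i to its first-language translation s_i
    t_by_p         -- maps p_i to its second-language translation t_i
    """
    pairs = []
    for p in common_strings:
        if p in s_by_p and p in t_by_p:
            pairs.append((s_by_p[p], t_by_p[p]))
    return pairs


# Placeholder data: the actual translations appear only in the figures.
s_by_p = {"p_1": "s_1", "p_2": "s_2"}
t_by_p = {"p_1": "t_1", "p_3": "t_3"}
pairs = form_text_pairs(["p_1", "p_2", "p_3"], s_by_p, t_by_p)
print(pairs)  # only p_1 has a translation on both sides
```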
In one embodiment of the present invention, information retrieval is used to obtain the common word strings of the intermediate language that match between the first corpus and the second corpus. As shown in Fig. 3, this may comprise the following steps:
S101a: select an intermediate-language sentence p′ from the first corpus.
S101b: in the second corpus, retrieve intermediate-language sentences p″ whose similarity to p′ is greater than a preset threshold.
S101c: obtain the common word string that matches between p′ and p″.
An index I can be built over all sentences in P2, and p′ is then submitted as a query against I. A single query may return several qualifying results. Each result carries a score that measures its similarity to the query. Setting a threshold to keep only the results with higher similarity further reduces ambiguous rules and also effectively limits rule-table expansion. The similarity between sentences can be computed as follows:
First, according to the feature words each sentence contains, the sentence is represented as a feature weight vector $(w_1, w_2, \ldots, w_n)$ (n is a natural number), where $w_j$ is the weight of the j-th feature word and can be expressed as:
$$w_j = tf_j \times IDF_j$$
where $tf_j$ is the frequency of occurrence of the j-th word in the whole document (that is, the intermediate-language text resource of the corpus), and $IDF_j$ is the inverse document frequency of the j-th word, which is derived from how often sentences containing the j-th word occur in the whole document.
Further, following the definition of the angle between vectors in multidimensional Euclidean space, the vector angle formula can be used to express the similarity between any two sentences. For example, if the feature weight vector of sentence 1 is $a$ and that of sentence 2 is $b$, the similarity of $a$ and $b$ can be expressed as:
$$\mathrm{Similarity}(a, b) = \cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
that is, the inner product of $a$ and $b$ divided by the product of their norms.
The above is only one way of computing the similarity between sentences. Those skilled in the art may use other methods, and the embodiments of the present invention do not limit this.
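As a rough sketch of the weighting and cosine formulas above, the following code builds sparse weight vectors and compares them. Several choices are assumptions rather than details from the text: term frequency is counted within each sentence, the inverse document frequency uses a smoothed logarithm, and the toy corpus is made up.

```python
import math
from collections import Counter


def build_idf(sentences):
    """Inverse document frequency per word; the smoothed log form is an assumption."""
    n = len(sentences)
    df = Counter()
    for tokens in sentences:
        df.update(set(tokens))  # count each word once per sentence
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def weight_vector(tokens, idf):
    """Sparse vector of w_j = tf_j * IDF_j, with tf counted inside the sentence."""
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}


def cosine(a, b):
    """Inner product of a and b divided by the product of their norms."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


corpus = [
    ["has", "diffraction", "grating", "region"],
    ["optical", "fiber", "has", "diffraction", "grating"],
    ["cobalt", "oxide"],
]
idf = build_idf(corpus)
vecs = [weight_vector(s, idf) for s in corpus]
sim_close = cosine(vecs[0], vecs[1])  # sentences sharing three words
sim_far = cosine(vecs[0], vecs[2])    # sentences sharing no words
print(sim_close > sim_far)
```

Sentences with no words in common score exactly zero, so a positive threshold discards them before any common word string is extracted.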
Using the above method together with the preset similarity threshold, if P2 contains a sentence p″ similar to p′ (that is, the similarity between p′ and p″ is greater than the preset threshold; several p″ may satisfy the condition), the common word string that matches between p′ and p″ can be obtained and denoted $p_1$ (when several p″ qualify, the resulting common word strings are denoted $p_2$, $p_3$, and so on in turn). Another intermediate-language sentence is then selected from the first corpus as p′ and steps S101a-S101c are repeated until all intermediate-language sentences in the first corpus have been traversed. This yields the set $\{p_i\}$ (i is a natural number) of common word strings matched between P1 and P2, which can be used to form a new parallel corpus resource of the first language and the second language.
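The traversal in steps S101a-S101c might be sketched as below. To stay self-contained the sketch swaps in two stand-ins that are not the method of the text: a Jaccard word-overlap score instead of the weighted cosine similarity, and the longest contiguous run of shared words (via the standard `difflib.SequenceMatcher`) instead of constrained matching. A linear scan also replaces the index I. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Stand-in similarity: shared distinct words over all distinct words."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def longest_common_run(a, b):
    """Stand-in for S101c: longest contiguous run of words shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def collect_common_strings(p1_sentences, p2_sentences, threshold):
    found = []
    for p_prime in p1_sentences:                       # S101a: pick p'
        for p_dprime in p2_sentences:                  # S101b: retrieve similar p''
            if jaccard(p_prime, p_dprime) > threshold:
                run = longest_common_run(p_prime, p_dprime)  # S101c
                if run:
                    found.append(tuple(run))
    return found


p1 = [["laser", "has", "diffraction", "grating", "region"], ["cobalt", "oxide"]]
p2 = [["fiber", "has", "diffraction", "grating", "region"], ["weather", "report"]]
common = collect_common_strings(p1, p2, threshold=0.5)
print(common)
```

In a real system the inner loop would be replaced by a query against an inverted index, so that each p′ is compared only with candidate sentences that share words with it.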
In this embodiment, information retrieval is used to obtain the common word strings of the intermediate language in order to find the most similar sentence pairs p′ and p″. Similar sentences are likely to contain identical words or phrases. Because sentences carry context information, common word strings obtained from retrieval results effectively reduce the likelihood of producing ambiguous translation rules.
Of course, because the relation between the first corpus and the second corpus is symmetric, the index can equally be built over all sentences in P1, with the sentences in P2 used as queries against it.
In addition, in practical applications, other known approaches such as text comparison or text screening can be used instead of information retrieval to obtain the common word strings of the intermediate language. The embodiments of the present invention do not limit this.
In another embodiment of the present invention, step S101c may be implemented as follows: obtain the longest matching common word string between p′ and p″ that satisfies preset common word string constraint conditions.
The constraint conditions on the common word string may comprise any combination of one or more of the following conditions 1)-3):
1) The total number of words in the common word string is not less than a preset first word count threshold.
One feature of statistical machine translation is that semantic information can be incorporated into the translation process. Constraint 1) ensures that the common word strings finally used to form the parallel corpus resource contain a certain number of words, so that the extracted translation rules do not consist only of simple word or phrase translations. The more words a common word string contains, the more complete its semantics tend to be and the more practical the translation rules extracted from it.
2) The ratio of the number of stop words in the common word string to its total number of words does not exceed a preset ratio threshold.
Stop words generally fall into two classes. The first class consists of words that are used very widely, even excessively, for example "i", "is", and "what" in English, or the Chinese words for "I" and "is". The second class consists of words that occur very frequently in text but carry little practical meaning. This class mainly includes modal particles, adverbs, prepositions, and conjunctions, which usually have no meaning of their own and only play a role within a complete sentence, such as common particles and conjunctions.
In the embodiments of the present invention, a ratio threshold is preset. If the ratio of the number of stop words in a common word string to its total number of words exceeds the threshold, the common word string is discarded, which prevents common word strings with too many stop words from degrading the semantics of the extracted translation rules.
For example, suppose the preset ratio threshold is 0.5. In the common word string "a cat on the", the words "a", "on", and "the" are stop words, so the ratio of stop words to total words is 3/4, which exceeds the preset threshold and therefore fails the requirement.
Stop words can be set according to the actual situation. For example, they can be defined from a general stop word list for the specific language, or the several words with the highest frequency in the corpus text resource can be taken as stop words. The embodiments of the present invention do not limit this.
3) The translation corresponding to the common word string in the first corpus or the second corpus has alignment relations only with words in the common word string.
The purpose of this constraint is to ensure that the translation corresponding to the common word string does not correspond to any word outside the common word string.
As shown in Fig. 4, wi (i = 1, 2, 3) are intermediate-language words, ti (i = 1, 2) are second-language words, and a line between two words indicates a mutual-translation correspondence. Under constraint 3), the word string "w1 w2" in Fig. 4(a) satisfies the constraint. In Fig. 4(b), the translation t1 also corresponds to the word w3, which lies outside the common word string "w1 w2", so "w1 w2" does not satisfy the constraint.
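The three conditions can be checked with a few lines of code. The sketch below is illustrative only: the function names, the default thresholds, and the alignment links used to mimic Fig. 4 are hypothetical, since the figure itself is not reproduced in the text.

```python
def stop_word_ratio(words, stop_words):
    """Share of the words in a common word string that are stop words."""
    if not words:
        return 0.0
    return sum(1 for w in words if w.lower() in stop_words) / len(words)


def alignment_closed(span, alignment):
    """Condition 3: translations linked to the span link to nothing outside it.

    span      -- set of intermediate-language word indices in the common word string
    alignment -- set of (intermediate_index, translation_index) links
    """
    linked = {t for w, t in alignment if w in span}
    return all(w in span for w, t in alignment if t in linked)


def satisfies_constraints(words, span, alignment, stop_words,
                          min_words=2, max_stop_ratio=0.5):
    return (len(words) >= min_words                                   # condition 1
            and stop_word_ratio(words, stop_words) <= max_stop_ratio  # condition 2
            and alignment_closed(span, alignment))                    # condition 3


STOP_WORDS = {"a", "on", "the"}
ratio = stop_word_ratio(["a", "cat", "on", "the"], STOP_WORDS)  # 3 of 4 words

# Hypothetical links mirroring Fig. 4: indices 0, 1, 2 stand for w1, w2, w3.
fig_4a = {(0, 0), (1, 1)}
fig_4b = {(0, 0), (1, 1), (2, 0)}  # t1 is also linked to w3
span_w1_w2 = {0, 1}
print(ratio, alignment_closed(span_w1_w2, fig_4a), alignment_closed(span_w1_w2, fig_4b))
```

With the 0.5 threshold from the example, "a cat on the" is rejected by condition 2 regardless of its alignment.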
In addition, following general information retrieval principles, an exact match need not be required when obtaining the longest matching common word string between the first corpus and the second corpus. The text of the word string in P1 and that of the word string in P2 may differ by a certain number of words. Once either side exceeds this number, matching stops, and the differing words are not included in the finally determined common word string p. For example:
P1 contains the word string: semiconductor laser has the first diffraction grating region
P2 contains the word string: the optical fiber with a diffraction grating region
Suppose the maximum allowed difference is 2 words. The two word strings are matched starting from "has" (in the original word order, the P2 string begins with this word). P1 contains the two words that make up "first" (the ordinal marker and "one"), which P2 does not contain. The number of differing words does not exceed the maximum allowed difference, so matching continues. After "region" is matched, the following words cannot be matched and the maximum allowed difference is exceeded, so the matching process ends. The resulting p′ and p″ are:
p′: has the first diffraction grating region
p″: has diffraction grating region
After the differing words between p′ and p″ (the two words that make up "first") are removed, the final common word string p is:
has diffraction grating region
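One way to realize this tolerant matching is a greedy two-pointer scan that is allowed to skip a few words on either side before resynchronizing. The sketch below is an interpretation, not the procedure of the patent: it treats the tolerance as a per-gap limit on each side, and the English tokens (with `ORD` standing for the ordinal marker, and the P2 tokens arranged in the original word order) are stand-ins for the original words.

```python
def tolerant_match(a, b, i, j, max_gap):
    """Match a[i:] against b[j:], skipping at most max_gap words per side per gap."""
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
            continue
        # Look for the nearest point where the two strings agree again.
        best = None
        for da in range(max_gap + 1):
            for db in range(max_gap + 1):
                if da == 0 and db == 0:
                    continue
                if i + da < len(a) and j + db < len(b) and a[i + da] == b[j + db]:
                    if best is None or da + db < best[0] + best[1]:
                        best = (da, db)
        if best is None:
            break  # tolerance exceeded: stop matching
        i += best[0]
        j += best[1]
    return common


def longest_common_word_string(a, b, max_gap=2):
    """Try every matching start position and keep the longest result."""
    best = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                candidate = tolerant_match(a, b, i, j, max_gap)
                if len(candidate) > len(best):
                    best = candidate
    return best


p1 = ["semiconductor", "laser", "has", "ORD", "one", "diffraction", "grating", "region"]
p2 = ["has", "diffraction", "grating", "region", "of", "fiber"]
p = longest_common_word_string(p1, p2, max_gap=2)
print(" ".join(p))
```

The skipped words never enter `common`, which matches the rule that differing words are excluded from the final common word string.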
In practical applications, the constraint conditions given above can be used individually or combined arbitrarily, so that the determined common word strings are better suited to the subsequent extraction of high-quality translation rules.
After the common word string p is obtained, S and T must contain mutually translated texts s and t corresponding to p, so s and t can in turn be regarded as forming a mutually translated text pair. In practical applications, however, some constraint conditions can additionally be applied when determining text pairs, so that the resulting pairs are better suited to subsequent rule extraction. In another embodiment of the present invention, step S102 may be implemented as follows:
Determine whether the translation s corresponding to the common word string in the first corpus and the translation t corresponding to the common word string in the second corpus satisfy preset text pair constraint conditions. If so, use s and t to form a mutually translated text pair of the first language and the second language.
The text pair constraint conditions may comprise any combination of one or more of the following conditions 1)-3):
1) In each of s and t, the number of words that have no alignment relation with the common word string does not exceed a preset second word count threshold.
In the Japanese-Chinese sentence pair shown in Fig. 5, the word "The" on the Japanese side has no alignment relation. The number of such words must not exceed the preset threshold on either the first-language side or the second-language side. If it does, the sentence pair cannot be used to form parallel corpora. This reduces the likelihood of ambiguous translation rules.
2) The difference between the numbers of punctuation marks in s and t does not exceed a preset punctuation difference threshold.
For example, consider the Japanese-Chinese sentence pair:
酸化 コバルト 酸化 <-> cobalt oxide, oxidation
The Chinese sentence contains one punctuation mark (an enumeration comma), so the punctuation difference for this sentence pair is 1. If the numbers of punctuation marks in s and t differ too much, the sentence pair cannot be used to form parallel corpora. This reduces the likelihood of ambiguous translation rules.
3) The word count ratio or character count ratio of s and t falls within a preset ratio threshold interval.
The purpose of this condition is to ensure that the word counts or character counts of s and t do not differ too much. For example, if the ratio of the Japanese word count to the Chinese word count is required not to exceed a threshold of 2 in either direction, the left and right endpoints of the ratio threshold interval should be set to 0.5 and 2 respectively.
Taking the Japanese-Chinese sentence pair "酸化 コバルト 酸化 <-> cobalt oxide, oxidation" again, the word count ratios are:
Japanese/Chinese: 3/4 = 0.75 (a punctuation mark counts as one word)
Chinese/Japanese: 4/3 ≈ 1.33
Both the Japanese-Chinese ratio and the Chinese-Japanese ratio of this sentence pair lie within the ratio threshold interval, so the sentence pair satisfies constraint 3).
In other cases, the character count can be compared instead of the word count, or both can be considered together. The embodiments of the present invention do not limit this. For example, when a sentence pair does not satisfy the word count constraint, it can further be checked against the character count constraint, and if it satisfies that constraint the sentence pair can still be regarded as acceptable. It should be noted that the "text" in "text pair" may refer to either a sentence or a phrase.
In practical applications, the constraint conditions given above can be used individually or combined arbitrarily, so that the mutually translated text pairs finally used to form parallel corpora are better suited to the subsequent extraction of high-quality translation rules.
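The three text pair conditions can be combined into a single filter, sketched below under stated assumptions. The function names, the default thresholds, the punctuation set, and the placeholder tokens (three on the Japanese side, four on the Chinese side including one enumeration comma, as in the example above) are all hypothetical.

```python
PUNCTUATION = {",", ".", ";", ":", "、", "。"}


def punctuation_count(words):
    return sum(1 for w in words if w in PUNCTUATION)


def pair_is_acceptable(s_words, t_words, s_aligned, t_aligned,
                       max_unaligned=1, max_punct_diff=1,
                       ratio_low=0.5, ratio_high=2.0):
    """Apply text pair conditions 1)-3).

    s_aligned / t_aligned -- sets of word indices that have an alignment link.
    """
    if not s_words or not t_words:
        return False
    # Condition 1: unaligned words on each side stay within the threshold.
    if len(s_words) - len(s_aligned) > max_unaligned:
        return False
    if len(t_words) - len(t_aligned) > max_unaligned:
        return False
    # Condition 2: punctuation counts must not differ too much.
    if abs(punctuation_count(s_words) - punctuation_count(t_words)) > max_punct_diff:
        return False
    # Condition 3: word count ratio must lie in the threshold interval.
    ratio = len(s_words) / len(t_words)
    return ratio_low <= ratio <= ratio_high


# Placeholder tokens; punctuation counts as one word.
ja = ["ja_1", "ja_2", "ja_3"]
zh = ["zh_1", "zh_2", "、", "zh_3"]
ok = pair_is_acceptable(ja, zh, s_aligned={0, 1, 2}, t_aligned={0, 1, 3})
strict = pair_is_acceptable(ja, zh, {0, 1, 2}, {0, 1, 3}, max_punct_diff=0)
print(ok, strict)
```

Checking the ratio against the interval [0.5, 2] in one direction is equivalent to requiring that neither side has more than twice as many words as the other.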
The method for obtaining parallel corpus resources according to the embodiments of the present invention is described below with a practical example. Suppose an English-Japanese translation system needs to be built and English-Chinese and Chinese-Japanese parallel corpora currently exist. English-Japanese parallel corpora can then be obtained from the existing English-Chinese and Chinese-Japanese parallel corpora. As the preceding embodiments make clear, in this example Chinese is the intermediate language, and English and Japanese correspond to the first language and the second language respectively.
Fig. 6(a) shows a mutually translated text pair in the English-Chinese parallel corpus, and Fig. 6(b) shows a mutually translated text pair in the Chinese-Japanese parallel corpus.
First, information retrieval finds that the sentence "semiconductor laser has the first diffraction grating region" in the English-Chinese parallel corpus has high similarity to the sentence "method of making a diffraction grating on an optical fiber light path and optical fiber having a diffraction grating region" in the Chinese-Japanese parallel corpus. Next, the longest matching common word string in the two sentences is found to be "has diffraction grating region". Finally, from the English translation and the Japanese translation corresponding to "has diffraction grating region" (shown in the dashed boxes in Fig. 6(a) and Fig. 6(b) respectively), the English-Japanese mutually translated text pair shown in Fig. 6(c) is obtained.
It can be seen that the parallel corpus resource acquisition method provided by the embodiments of the present invention uses a third-party language to obtain parallel corpora between two languages, thereby addressing the scarcity of corpus resources between languages and helping to obtain higher-quality translation rules.
The above embodiments provide a method for obtaining parallel corpus resources. Corresponding to the method embodiments above, the embodiments of the present invention also provide a parallel corpus resource acquisition system. As shown in Fig. 7, the system comprises:
a common word string acquisition module 710, configured to obtain common word strings of the intermediate language that match between the first corpus and the second corpus;
a mutually translated text pair forming module 720, configured to form, according to the common word strings obtained by the common word string acquisition module 710, mutually translated text pairs of the first language and the second language, the mutually translated text pairs being used to form a parallel corpus resource of the first language and the second language.
The first corpus and the second corpus mentioned above are both existing corpora in which text pairs having a mutual translation relation are recorded. The first corpus comprises parallel corpora of the first language and the intermediate language; it may be a bilingual corpus, or it may be a multilingual corpus (that is, a corpus containing mutually translated material in three or more languages) that includes the first language and the intermediate language. The embodiment of the present invention does not limit this. Similarly, the second corpus comprises parallel corpora of the second language and the intermediate language; it may be a bilingual corpus, or a multilingual corpus that includes the second language and the intermediate language.
For convenience of description, in the embodiments of the present invention S denotes the first-language text resources in the first corpus and T denotes the second-language text resources in the second corpus. Because the first corpus and the second corpus are two independent corpora, the intermediate-language text resources they contain are generally not identical. To distinguish them, P1 below denotes the intermediate-language text resources in the first corpus and P2 denotes the intermediate-language text resources in the second corpus.
The scheme provided by the embodiment of the present invention works as follows. First, the public word strings p_i that match between P1 and P2 are obtained (i = 1, 2, 3, ..., N, where N is a natural number denoting the number of matching public word strings obtained between P1 and P2). For each p_i there must exist corresponding intertranslation texts s_i in S and t_i in T, so s_i and t_i can in turn be regarded as forming an intertranslation text pair. The resulting set of intertranslation text pairs (s_i, t_i) can then be used to form a new parallel corpora resource of the first language and the second language.
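The pivoting idea described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the translations of intermediate-language word strings have already been extracted into two dictionaries (one per corpus); the function name `pivot_pairs` and the toy data are hypothetical and are not part of the disclosed embodiment.

```python
from typing import Dict, List, Tuple


def pivot_pairs(first_corpus: Dict[str, str],
                second_corpus: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair first- and second-language texts that share an intermediate string.

    first_corpus maps an intermediate-language word string p to its
    first-language translation s; second_corpus maps p to its
    second-language translation t (assumed, simplified representation).
    """
    # Public word strings p_i are the intermediate strings found in both corpora.
    common = sorted(set(first_corpus) & set(second_corpus))
    # Each p_i yields one intertranslation text pair (s_i, t_i).
    return [(first_corpus[p], second_corpus[p]) for p in common]


# Toy data: Chinese is the intermediate language.
zh_to_en = {"你好": "hello", "谢谢": "thank you"}
zh_to_ja = {"你好": "こんにちは", "再见": "さようなら"}
pairs = pivot_pairs(zh_to_en, zh_to_ja)
```

Here only "你好" occurs in both dictionaries, so a single English-Japanese pair is produced. In the embodiment itself the matching is done on word strings inside sentences rather than on whole dictionary keys.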
In another embodiment of the present invention, the public word string acquisition module 710 may use information retrieval to obtain the public word string of the intermediate language that matches between the first corpus and the second corpus. As shown in Fig. 8, the public word string acquisition module 710 may specifically comprise:
a selection submodule 711, configured to select an intermediate-language sentence p' in the first corpus;
a retrieval submodule 712, configured to retrieve, in the second corpus, an intermediate-language sentence p" whose similarity to p' is greater than a preset threshold; and
an obtaining submodule 713, configured to obtain the public word string that matches between p' and p".
The retrieval submodule 712 may build an index I over all sentences in P2 and then search I using p' as a retrieval request. Several results may satisfy the same retrieval request. Each retrieval result carries a score that measures its similarity to the request. By setting a threshold and selecting only the results with higher similarity, ambiguous rules can be further reduced and the growth in the number of rules can be effectively controlled.
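The index-then-threshold retrieval described above can be sketched as follows. This is only an illustrative sketch: the inverted index, the whitespace tokenization (which assumes the intermediate-language sentences are already word-segmented) and the Jaccard score are assumptions chosen for brevity, not the scoring used in the embodiment.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_index(sentences: List[str]) -> Dict[str, Set[int]]:
    """Build an inverted index I: word -> ids of the P2 sentences containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for word in set(sentence.split()):
            index[word].add(sid)
    return index


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap score between two word sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, sentences: List[str], index: Dict[str, Set[int]],
             threshold: float) -> List[Tuple[float, str]]:
    """Return (score, sentence) results whose score exceeds the threshold."""
    query_words = set(query.split())
    # Only sentences sharing at least one word with the request are scored.
    candidates: Set[int] = set()
    for word in query_words:
        candidates |= index.get(word, set())
    results = []
    for sid in candidates:
        score = jaccard(query_words, set(sentences[sid].split()))
        if score > threshold:
            results.append((score, sentences[sid]))
    # Highest-scoring results first; ties broken alphabetically.
    results.sort(key=lambda r: (-r[0], r[1]))
    return results


P2 = ["optical fiber having a diffraction grating region",
      "method of making a lens",
      "laser with a cooling unit"]
I = build_index(P2)
hits = retrieve("semiconductor laser having a diffraction grating region", P2, I, 0.5)
```

Raising the threshold keeps fewer, more similar sentence pairs, which is how the text above describes limiting ambiguous rules.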
The retrieval submodule 712 may be specifically configured to calculate the similarity between sentences as follows:
forming a feature vector for each sentence from the feature words that the sentence contains; and
calculating the similarity between two sentences from their feature vectors using the vector angle (cosine) formula.
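A minimal sketch of the two steps above is given here. It assumes that the feature words are simply the non-stop words of a whitespace-tokenized sentence and that the feature vector holds raw term counts; the embodiment does not specify these details, and the helper names are hypothetical. The vector angle formula is taken to be cos θ = (u · v) / (|u| |v|).

```python
import math
from collections import Counter
from typing import FrozenSet


def feature_vector(sentence: str, stop_words: FrozenSet[str] = frozenset()) -> Counter:
    """Form a sparse feature vector: feature word -> occurrence count."""
    return Counter(w for w in sentence.split() if w not in stop_words)


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse feature vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty sentence is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


sim = cosine_similarity(feature_vector("a b"), feature_vector("a c"))
```

Other weightings, such as TF-IDF, could be substituted for the raw counts without changing the formula.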
Of course, the above is only one specific configuration of the retrieval submodule 712. The retrieval submodule 712 may also be configured to calculate the similarity between sentences by other methods, and the embodiment of the present invention does not limit this.
It will also be understood that, because the relation between the first corpus and the second corpus is symmetric, the retrieval submodule 712 may instead build an index over all sentences in P1 and then search that index using the sentences in P2 as requests.
In this embodiment, the public word string acquisition module 710 uses information retrieval to obtain the public word string of the intermediate language. The purpose is to find the most similar sentence pairs p' and p". Similar sentences tend to contain identical words or phrases, and because a sentence carries contextual information, a public word string obtained from the retrieval results effectively reduces the likelihood of producing ambiguous rules.
In addition, in practical applications the public word string acquisition module 710 may obtain the public word string of the intermediate language by means other than information retrieval, such as text comparison or text screening, and the embodiment of the present invention does not limit this.
In another embodiment of the present invention, the obtaining submodule 713 may be specifically configured to:
obtain the longest matching public word string between p' and p" that satisfies preset public word string constraints.
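As an illustration of finding the longest matching word string between two sentences, the sketch below uses a standard dynamic-programming search for the longest common contiguous run of words. It assumes exact word matching on whitespace-tokenized sentences and does not apply any constraints; the function name is hypothetical and the embodiment does not prescribe this particular algorithm.

```python
from typing import List


def longest_common_word_string(p1: str, p2: str) -> List[str]:
    """Return the longest contiguous word sequence shared by p1 and p2."""
    a, b = p1.split(), p2.split()
    best_len, best_end = 0, 0          # length and end position (in a) of the best run
    prev = [0] * (len(b) + 1)          # prev[j]: run length ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous word of both sentences.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


common = longest_common_word_string(
    "semiconductor laser having a diffraction grating region",
    "optical fiber having a diffraction grating region",
)
```

When several runs share the maximum length, this sketch returns the one that ends first in the first sentence.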
The public word string constraints may include any one of the following conditions 1) to 3), or any combination of them:
1) The total number of words in the public word string is not less than a first preset word count threshold.
2) The ratio of the number of stop words in the public word string to its total number of words does not exceed a preset ratio threshold.
3) The translation corresponding to the public word string in the first corpus or the second corpus has alignment relations only with words in the public word string.
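The three conditions above can be sketched as a single check. The representation is an assumption made for illustration: the public word string is a half-open span of word positions in the intermediate-language sentence, and the word alignment is a set of (intermediate position, translation position) links. Condition 3) is interpreted here as the usual phrase-consistency test, and the threshold defaults are arbitrary.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]            # half-open [start, end) over intermediate words
Alignment = Set[Tuple[int, int]]  # (intermediate index, translation index) links


def translation_aligned_only_to_span(span: Span, alignment: Alignment) -> bool:
    """Condition 3: the corresponding translation aligns only to words in the span."""
    start, end = span
    targets = {t for p, t in alignment if start <= p < end}
    if not targets:
        return False  # no corresponding translation at all
    low, high = min(targets), max(targets)
    # Every link touching the translation span must come from inside the word string.
    return all(start <= p < end for p, t in alignment if low <= t <= high)


def meets_word_string_constraints(words: List[str], span: Span, alignment: Alignment,
                                  stop_words: Set[str], min_words: int = 2,
                                  max_stop_ratio: float = 0.5) -> bool:
    """Apply conditions 1) to 3) to the word string words[start:end]."""
    string = words[span[0]:span[1]]
    if not string or len(string) < min_words:           # condition 1
        return False
    stop_count = sum(1 for w in string if w in stop_words)
    if stop_count / len(string) > max_stop_ratio:       # condition 2
        return False
    return translation_aligned_only_to_span(span, alignment)  # condition 3


sentence = ["laser", "having", "a", "grating", "region"]
stops = {"a", "the", "of"}
monotone = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)}
ok = meets_word_string_constraints(sentence, (1, 5), monotone, stops)
```

Because the conditions may be used individually or combined, a practical version would probably expose a switch for each one.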
In addition, it will be understood that, following general information retrieval principles, an exact match need not be required when the obtaining submodule 713 obtains the longest matching public word string between the first corpus and the second corpus. That is, the word string in P1 and the word string in P2 are allowed to differ by a certain number of words; once either side exceeds this number, matching stops. The differing words, however, are not included in the finally determined public word string p.
In practical applications, the obtaining submodule 713 may use the constraints given above individually or in any combination, so that the determined public word string is more conducive to the subsequent extraction of high-quality translation rules.
In another embodiment of the present invention, the intertranslation text pair composition module 720 may be specifically configured to:
judge whether the translation s corresponding to the public word string in the first corpus and the translation t corresponding to the public word string in the second corpus satisfy preset intertranslation text pair constraints, and if so, use s and t to form an intertranslation text pair of the first language and the second language.
The intertranslation text pair constraints may include any one of the following conditions 1) to 3), or any combination of them:
1) In each of s and t, the number of words that have no alignment relation with the public word string does not exceed a second preset word count threshold.
2) The difference between the numbers of punctuation marks in s and t does not exceed a preset punctuation difference threshold.
3) The ratio of the word counts, or of the character counts, of s and t falls within a preset ratio threshold interval.
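The pair-level conditions above can be sketched in the same way. The sketch assumes that s and t are token lists, that the caller already knows how many words on each side have no alignment to the public word string, and that punctuation is recognised through Unicode general categories beginning with "P". The threshold defaults and placeholder tokens are illustrative only.

```python
import unicodedata
from typing import List, Tuple


def count_punctuation(tokens: List[str]) -> int:
    """Count characters whose Unicode general category is punctuation (P*)."""
    return sum(1 for token in tokens for ch in token
               if unicodedata.category(ch).startswith("P"))


def meets_pair_constraints(s: List[str], t: List[str],
                           s_unaligned: int, t_unaligned: int,
                           max_unaligned: int = 1, max_punct_diff: int = 1,
                           ratio_range: Tuple[float, float] = (0.5, 2.0)) -> bool:
    """Apply conditions 1) to 3) to a candidate intertranslation text pair."""
    if not s or not t:
        return False
    # Condition 1: few words on either side lack alignment to the word string.
    if s_unaligned > max_unaligned or t_unaligned > max_unaligned:
        return False
    # Condition 2: the punctuation counts of s and t are close.
    if abs(count_punctuation(s) - count_punctuation(t)) > max_punct_diff:
        return False
    # Condition 3: the word count ratio lies inside the preset interval.
    low, high = ratio_range
    return low <= len(s) / len(t) <= high


s_tokens = ["has", "a", "diffraction", "grating", "region"]
t_tokens = ["t1", "t2", "t3", "t4"]   # placeholder second-language tokens
accepted = meets_pair_constraints(s_tokens, t_tokens, 0, 1)
```

A character-count ratio could replace the word-count ratio in condition 3) for languages without explicit word boundaries.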
In practical applications, the intertranslation text pair composition module 720 may use the constraints given above individually or in any combination, so that the intertranslation text pairs finally used to form the parallel corpora are more conducive to the subsequent extraction of high-quality translation rules.
It can be seen that the parallel corpora resource acquisition system provided by the embodiment of the present invention uses a third-party language to obtain parallel corpora between two languages, thereby solving the problem of scarce corpus resources between those languages and facilitating the acquisition of higher-quality translation rules.
It should be noted that the embodiments in this specification are described in a progressive manner: identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, for the device or system embodiments, the operations performed by the modules or submodules are substantially similar to the sequence of operations in the method embodiments, so they are described briefly; for the relevant details, refer to the description of the method embodiments. The system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
In addition, it should also be noted that the functions of the devices and systems and the series of processes of the methods according to the above embodiments of the present invention can be implemented by hardware, software and/or firmware. When they are implemented by software and/or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, such as the general-purpose personal computer 900 shown in Fig. 9. When various programs are installed, this computer can perform various functions and processes.
In Fig. 9, a central processing unit (CPU) 901 performs various processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage area 908 into a random access memory (RAM) 903. The RAM 903 also stores, as required, the data needed when the CPU 901 performs the various processes.
CPU 901, ROM 902 and RAM 903 are connected to each other via bus 904.Input/output interface 905 is also connected to bus 904.
The following components are connected to the input/output interface 905: an input section 906, including a keyboard, a mouse and the like; an output section 907, including a display such as a cathode-ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker and the like; the storage area 908, including a hard disk and the like; and a communication section 909, including a network interface card such as a LAN card, a modem and the like. The communication section 909 performs communication processing via a network such as the Internet.
As required, driver 910 is also connected to input/output interface 905.Detachable media 911 such as disk, CD, magneto-optic disk, semiconductor memory etc. are installed on driver 910 as required, and the computer program therefrom read is installed in storage area 908 as required.
When the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the detachable media 911.
Those skilled in the art will understand that this storage medium is not limited to the detachable media 911 shown in Fig. 9, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the detachable media 911 include magnetic disks (including floppy disks (registered trademark)), optical discs (including compact disc read-only memory (CD-ROM) and digital versatile discs (DVD)), magneto-optical disks (including MiniDisc (MD) (registered trademark)) and semiconductor memories. Alternatively, the storage medium may be the ROM 902, a hard disk included in the storage area 908, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
It should also be pointed out that the steps of the above series of processes may naturally be performed chronologically in the order described, but need not necessarily be performed in that order. Some steps may be performed in parallel or independently of one another.
It can be seen that an embodiment of the present invention also discloses a program product storing machine-readable instruction code which, when read and executed by a machine, can perform the parallel corpora resource acquiring method of the foregoing embodiments of the present invention. An embodiment of the present invention also discloses a storage medium carrying machine-readable instruction code which, when read and executed by a machine, can perform the parallel corpora resource acquiring method of the foregoing embodiments of the present invention.
Regarding the implementations including the above embodiments, the following remarks are also disclosed:
Remark 1. A parallel corpora resource acquiring method, comprising:
obtaining the public word string of the intermediate language that matches between a first corpus and a second corpus; and
forming, according to the obtained public word string, an intertranslation text pair of a first language and a second language, the intertranslation text pair being used to form a parallel corpora resource of the first language and the second language;
wherein the first corpus comprises parallel corpora of the first language and the intermediate language; and
the second corpus comprises parallel corpora of the second language and the intermediate language.
Remark 2. The method according to Remark 1, wherein obtaining the public word string of the intermediate language that matches between the first corpus and the second corpus comprises:
selecting an intermediate-language sentence p' in the first corpus;
retrieving, in the second corpus, an intermediate-language sentence p" whose similarity to p' is greater than a preset threshold; and
obtaining the public word string that matches between p' and p".
Remark 3. The method according to Remark 2, wherein the similarity between sentences is calculated by:
forming a feature vector for each sentence from the feature words that the sentence contains; and
calculating the similarity between two sentences from their feature vectors using the vector angle (cosine) formula.
Remark 4. The method according to Remark 2, wherein obtaining the public word string that matches between p' and p" comprises:
obtaining the longest matching public word string between p' and p" that satisfies preset public word string constraints, the public word string constraints comprising:
the total number of words in the public word string is not less than a first preset word count threshold; and/or
the ratio of the number of stop words in the public word string to its total number of words does not exceed a preset ratio threshold; and/or
the translation corresponding to the public word string in the first corpus or the second corpus has alignment relations only with words in the public word string.
Remark 5. The method according to Remark 1, wherein forming, according to the obtained public word string, the intertranslation text pair of the first language and the second language comprises:
judging whether the translation s corresponding to the public word string in the first corpus and the translation t corresponding to the public word string in the second corpus satisfy preset intertranslation text pair constraints, and if so, using s and t to form the intertranslation text pair of the first language and the second language;
the intertranslation text pair constraints comprising:
in each of s and t, the number of words that have no alignment relation with the public word string does not exceed a second preset word count threshold; and/or
the difference between the numbers of punctuation marks in s and t does not exceed a preset punctuation difference threshold; and/or
the ratio of the word counts, or of the character counts, of s and t falls within a preset ratio threshold interval.
Remark 6. A parallel corpora resource acquisition system, comprising:
a public word string acquisition module, configured to obtain the public word string of the intermediate language that matches between a first corpus and a second corpus; and
an intertranslation text pair composition module, configured to form an intertranslation text pair of a first language and a second language according to the public word string obtained by the public word string acquisition module, the intertranslation text pair being used to form a parallel corpora resource of the first language and the second language;
wherein the first corpus comprises parallel corpora of the first language and the intermediate language; and
the second corpus comprises parallel corpora of the second language and the intermediate language.
Remark 7. The system according to Remark 6, wherein the public word string acquisition module comprises:
a selection submodule, configured to select an intermediate-language sentence p' in the first corpus;
a retrieval submodule, configured to retrieve, in the second corpus, an intermediate-language sentence p" whose similarity to p' is greater than a preset threshold; and
an obtaining submodule, configured to obtain the public word string that matches between p' and p".
Remark 8. The system according to Remark 7, wherein the retrieval submodule is specifically configured to calculate the similarity between sentences by:
forming a feature vector for each sentence from the feature words that the sentence contains; and
calculating the similarity between two sentences from their feature vectors using the vector angle (cosine) formula.
Remark 9. The system according to Remark 7, wherein the obtaining submodule is specifically configured to:
obtain the longest matching public word string between p' and p" that satisfies preset public word string constraints, the public word string constraints comprising:
the total number of words in the public word string is not less than a first preset word count threshold; and/or
the ratio of the number of stop words in the public word string to its total number of words does not exceed a preset ratio threshold; and/or
the translation corresponding to the public word string in the first corpus or the second corpus has alignment relations only with words in the public word string.
Remark 10. The system according to Remark 6, wherein the intertranslation text pair composition module is specifically configured to:
judge whether the translation s corresponding to the public word string in the first corpus and the translation t corresponding to the public word string in the second corpus satisfy preset intertranslation text pair constraints, and if so, use s and t to form the intertranslation text pair of the first language and the second language;
the intertranslation text pair constraints comprising:
in each of s and t, the number of words that have no alignment relation with the public word string does not exceed a second preset word count threshold; and/or
the difference between the numbers of punctuation marks in s and t does not exceed a preset punctuation difference threshold; and/or
the ratio of the word counts, or of the character counts, of s and t falls within a preset ratio threshold interval.
Remark 11. A program product storing machine-readable instruction code which, when read and executed by a machine, can perform the method according to any one of Remarks 1 to 5.
Remark 12. A storage medium carrying machine-readable instruction code which, when read and executed by a machine, can perform the method according to any one of Remarks 1 to 5.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the terms "comprise" and "include", and any other variants thereof, as used in the embodiments of the present invention are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.

Claims (8)

1. A parallel corpora resource acquiring method, comprising:
obtaining the public word string of the intermediate language that matches between a first corpus and a second corpus, comprising:
selecting an intermediate-language sentence p' in the first corpus;
retrieving, in the second corpus, an intermediate-language sentence p" whose similarity to p' is greater than a preset threshold; and
obtaining the public word string that matches between p' and p", comprising: obtaining the longest matching public word string between p' and p" that satisfies preset public word string constraints, the public word string constraints comprising:
the translation corresponding to the public word string in the first corpus or the second corpus has alignment relations only with words in the public word string; and
forming, according to the obtained public word string, an intertranslation text pair of a first language and a second language, the intertranslation text pair being used to form a parallel corpora resource of the first language and the second language;
wherein the first corpus comprises parallel corpora of the first language and the intermediate language; and
the second corpus comprises parallel corpora of the second language and the intermediate language.
2. The method according to claim 1, wherein the similarity between sentences is calculated by:
forming a feature vector for each sentence from the feature words that the sentence contains; and
calculating the similarity between two sentences from their feature vectors using the vector angle (cosine) formula.
3. The method according to claim 1, wherein the public word string constraints further comprise:
the total number of words in the public word string is not less than a first preset word count threshold; and/or
the ratio of the number of stop words in the public word string to its total number of words does not exceed a preset ratio threshold.
4. The method according to claim 1, wherein forming, according to the obtained public word string, the intertranslation text pair of the first language and the second language comprises:
judging whether the translation s corresponding to the public word string in the first corpus and the translation t corresponding to the public word string in the second corpus satisfy preset intertranslation text pair constraints, and if so, using s and t to form the intertranslation text pair of the first language and the second language;
the intertranslation text pair constraints comprising:
in each of s and t, the number of words that have no alignment relation with the public word string does not exceed a second preset word count threshold; and/or
the difference between the numbers of punctuation marks in s and t does not exceed a preset punctuation difference threshold; and/or
the ratio of the word counts, or of the character counts, of s and t falls within a preset ratio threshold interval.
5. A parallel corpora resource acquisition system, comprising:
a public word string acquisition module, configured to obtain the public word string of the intermediate language that matches between a first corpus and a second corpus, comprising:
a selection submodule, configured to select an intermediate-language sentence p' in the first corpus;
a retrieval submodule, configured to retrieve, in the second corpus, an intermediate-language sentence p" whose similarity to p' is greater than a preset threshold; and
an obtaining submodule, configured to obtain the public word string that matches between p' and p", and specifically configured to obtain the longest matching public word string between p' and p" that satisfies preset public word string constraints, the public word string constraints comprising:
the translation corresponding to the public word string in the first corpus or the second corpus has alignment relations only with words in the public word string; and
an intertranslation text pair composition module, configured to form an intertranslation text pair of a first language and a second language according to the public word string obtained by the public word string acquisition module, the intertranslation text pair being used to form a parallel corpora resource of the first language and the second language;
wherein the first corpus comprises parallel corpora of the first language and the intermediate language; and
the second corpus comprises parallel corpora of the second language and the intermediate language.
6. The system according to claim 5, wherein the retrieval submodule is specifically configured to calculate the similarity between sentences by:
forming a feature vector for each sentence from the feature words that the sentence contains; and
calculating the similarity between two sentences from their feature vectors using the vector angle (cosine) formula.
7. The system according to claim 5, wherein the public word string constraints further comprise:
the total number of words in the public word string is not less than a first preset word count threshold; and/or
the ratio of the number of stop words in the public word string to its total number of words does not exceed a preset ratio threshold.
8. The system according to claim 5, wherein the intertranslation text pair composition module is specifically configured to:
judge whether the translation s corresponding to the public word string in the first corpus and the translation t corresponding to the public word string in the second corpus satisfy preset intertranslation text pair constraints, and if so, use s and t to form the intertranslation text pair of the first language and the second language;
the intertranslation text pair constraints comprising:
in each of s and t, the number of words that have no alignment relation with the public word string does not exceed a second preset word count threshold; and/or
the difference between the numbers of punctuation marks in s and t does not exceed a preset punctuation difference threshold; and/or
the ratio of the word counts, or of the character counts, of s and t falls within a preset ratio threshold interval.
CN201110021725.5A 2011-01-10 2011-01-10 Bilingual corpus resource acquisition method and bilingual corpus resource acquisition system Expired - Fee Related CN102591857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110021725.5A CN102591857B (en) 2011-01-10 2011-01-10 Bilingual corpus resource acquisition method and bilingual corpus resource acquisition system


Publications (2)

Publication Number Publication Date
CN102591857A CN102591857A (en) 2012-07-18
CN102591857B true CN102591857B (en) 2015-06-24

Family

ID=46480526


Country Status (1)

Country Link
CN (1) CN102591857B (en)




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150624

Termination date: 20190110
