CN104933038A - Machine translation method and machine translation device - Google Patents

Machine translation method and machine translation device

Publication number
CN104933038A
Authority
CN
China
Legal status
Pending
Application number
CN201410104256.7A
Other languages
Chinese (zh)
Inventor
张大鲲
苏韬
郝杰
Current Assignee
Toshiba Corp
Original Assignee
Toshiba Corp
Application filed by Toshiba Corp
Priority to CN201410104256.7A
Publication of CN104933038A

Abstract

The invention provides a dynamic machine translation method and a dynamic machine translation device, which can improve the translation quality. The machine translation device in an embodiment of the invention comprises an input unit, a calculation unit, a selection unit, a training unit and a translation unit, wherein the input unit is used for inputting a sentence to be translated; the calculation unit is used for calculating the similarity between the sentence to be translated and a source language sentence in a bilingual corpus; the selection unit is used for selecting a plurality of sentence pairs from the bilingual corpus based on the similarity and using the sentence pairs as training corpora; the training unit is used for training the translation system by utilizing the training corpora; and the translation unit is used for translating the sentence to be translated by utilizing the translation system.

Description

Machine translation method and machine translation apparatus
Technical field
The present invention relates to natural language processing technology, and in particular to a machine translation method and a machine translation apparatus.
Background art
The general workflow of a statistical machine translation system is as follows: first a model (algorithm) is determined, then the model parameters (translation knowledge) are trained on training data, and finally the trained model parameters are used to translate input sentences.
Training data usually consists of large-scale aligned bilingual sentence pairs. These sentence pairs may come from different domains and differ in form, and even identical source-language sentences may have different target-language translations. Likewise, the same word in a source-language sentence may be translated differently depending on context.
In a typical translation system, the translation model no longer changes once training is complete, and the generated model is then used to translate sentences. However, because the sentences to be translated are diverse, a translation system that no longer changes after it is generated is usually not suited to all of them, which results in poor translation quality.
To address this, several domain adaptation methods have been proposed for building a "dynamic" translation system. Some methods first interpolate in-domain and out-of-domain data and then build a translation model from the interpolated data. Other methods first cluster the training data by domain and train a separate translation sub-model on each cluster; at translation time, the sub-model corresponding to the domain of the sentence to be translated is selected to translate it.
Summary of the invention
After studying the above domain adaptation methods, the present inventors found that, although these methods provide some adaptability, the translation model or sub-models no longer change once they have been generated by training. That is, the trained translation model is still "static", so the adaptability of the translation system is limited and translation quality remains poor.
To solve the above problems in the prior art, embodiments of the present invention provide a dynamic machine translation method and machine translation apparatus that can improve translation quality. Specifically, the following technical solutions are provided.
[1] A machine translation method, comprising the following steps: inputting a sentence to be translated; calculating the similarity between the sentence to be translated and the source-language sentences in a bilingual corpus; selecting, based on the similarity, a plurality of sentence pairs from the bilingual corpus as a training corpus; training a translation system using the training corpus; and translating the sentence to be translated using the translation system.
The machine translation method of this embodiment selects from the bilingual corpus the sentence pairs that are highly similar to the sentence to be translated and builds the translation system in real time from the selected pairs. It can thus construct a dynamic, targeted translation system and thereby improve translation quality.
[2] The machine translation method according to [1], wherein the selecting step comprises: sorting the sentence pairs in the bilingual corpus in descending order of similarity; and selecting the top N sentence pairs after sorting as the training corpus, N being an integer of 1 or more.
By selecting the top N sentence pairs after sorting, the machine translation method of this embodiment can train the translation system on a fixed number of the most similar sentence pairs when the bilingual corpus contains many pairs that are highly similar to the sentence to be translated. This both ensures translation quality and reduces the processing load of training the translation system.
[3] The machine translation method according to [1] or [2], wherein the selecting step comprises: selecting, as the training corpus, the sentence pairs in the bilingual corpus whose similarity is greater than a predetermined threshold.
By training the translation system on sentence pairs whose similarity is greater than a predetermined threshold, the machine translation method of this embodiment excludes low-similarity sentence pairs, avoids their interference with the translation system, and further ensures translation accuracy.
[4] The machine translation method according to any one of [1] to [3], wherein the step of calculating the similarity comprises: calculating the similarity using the edit distance between the sentence to be translated and the source-language sentences in the bilingual corpus.
[5] The machine translation method according to any one of [1] to [4], wherein the step of calculating the similarity comprises: calculating the similarity of syntactic structure between the sentence to be translated and the source-language sentences in the bilingual corpus.
[6] The machine translation method according to any one of [1] to [5], further comprising, after the translating step: saving the sentence to be translated and its translation result in a translation cache.
[7] The machine translation method according to [6], further comprising, after the inputting step: looking up the sentence to be translated in the translation cache.
By saving the sentence to be translated and its translation result in the translation cache, the machine translation method of this embodiment can obtain the translation result directly from the cache the next time the same sentence is translated, which saves computing resources and improves translation efficiency.
[8] The machine translation method according to any one of [1] to [7], further comprising, after the translating step: adding the sentence to be translated and its translation result to the bilingual corpus.
By adding the sentence to be translated and its translation result to the bilingual corpus, the machine translation method of this embodiment expands the data in the bilingual corpus and can thereby improve the quality of subsequent translations.
[9] The machine translation method according to any one of [1] to [8], further comprising, after the translating step: performing word alignment on the sentence to be translated and its translation result; and adding the word alignment result to the bilingual corpus.
By adding the word alignment result to the bilingual corpus, the machine translation method of this embodiment not only expands the data in the bilingual corpus and improves the quality of subsequent translations, but can also improve translation efficiency.
[10] The machine translation method according to any one of [1] to [9], further comprising, before the step of calculating the similarity: adding user-related training data to the bilingual corpus.
By adding user-related training data, such as aligned sentence pairs, context-related data, or word-aligned sentence pairs, the machine translation method of this embodiment can achieve user adaptation even when training data is scarce.
[11] The machine translation method according to any one of [1] to [10], further comprising, after the translating step: calculating the confidence of the translation result using the similarity.
By calculating the confidence of the translation result from the similarity, the machine translation method of this embodiment obtains the confidence together with the translation result. No other method is needed to calculate the confidence, which improves translation efficiency.
[12] A machine translation apparatus, comprising: an input unit that inputs a sentence to be translated; a calculation unit that calculates the similarity between the sentence to be translated and the source-language sentences in a bilingual corpus; a selection unit that selects, based on the similarity, a plurality of sentence pairs from the bilingual corpus as a training corpus; a training unit that trains a translation system using the training corpus; and a translation unit that translates the sentence to be translated using the translation system.
The machine translation apparatus of this embodiment selects from the bilingual corpus the sentence pairs that are highly similar to the sentence to be translated and builds the translation system in real time from the selected pairs. It can thus construct a dynamic, targeted translation system and thereby improve translation quality.
[13] The machine translation apparatus according to [12], wherein the selection unit comprises a sorting unit that sorts the sentence pairs in the bilingual corpus in descending order of similarity, and the selection unit selects the top N sentence pairs after sorting as the training corpus, N being an integer of 1 or more.
By selecting the top N sentence pairs after sorting, the machine translation apparatus of this embodiment can train the translation system on a fixed number of the most similar sentence pairs when the bilingual corpus contains many pairs that are highly similar to the sentence to be translated. This both ensures translation quality and reduces the processing load of training the translation system.
[14] The machine translation apparatus according to [12] or [13], wherein the selection unit selects, as the training corpus, the sentence pairs in the bilingual corpus whose similarity is greater than a predetermined threshold.
By training the translation system on sentence pairs whose similarity is greater than a predetermined threshold, the machine translation apparatus of this embodiment excludes low-similarity sentence pairs, avoids their interference with the translation system, and further ensures translation accuracy.
[15] The machine translation apparatus according to any one of [12] to [14], wherein the calculation unit calculates the similarity using the edit distance between the sentence to be translated and the source-language sentences in the bilingual corpus.
[16] The machine translation apparatus according to any one of [12] to [15], wherein the calculation unit calculates the similarity of syntactic structure between the sentence to be translated and the source-language sentences in the bilingual corpus.
[17] The machine translation apparatus according to any one of [12] to [16], further comprising: a storage unit that saves the sentence to be translated and its translation result in a translation cache.
[18] The machine translation apparatus according to [17], further comprising: a search unit that looks up the sentence to be translated in the translation cache after the input unit inputs the sentence to be translated.
By saving the sentence to be translated and its translation result in the translation cache, the machine translation apparatus of this embodiment can obtain the translation result directly from the cache the next time the same sentence is translated, which saves computing resources and improves translation efficiency.
[19] The machine translation apparatus according to any one of [12] to [18], further comprising: a sentence pair adding unit that adds the sentence to be translated and its translation result to the bilingual corpus.
By adding the sentence to be translated and its translation result to the bilingual corpus, the machine translation apparatus of this embodiment expands the data in the bilingual corpus and can thereby improve the quality of subsequent translations.
[20] The machine translation apparatus according to any one of [12] to [19], further comprising: a word alignment unit that performs word alignment on the sentence to be translated and its translation result; and a word alignment result adding unit that adds the word alignment result to the bilingual corpus.
By adding the word alignment result to the bilingual corpus, the machine translation apparatus of this embodiment not only expands the data in the bilingual corpus and improves the quality of subsequent translations, but can also improve translation efficiency.
[21] The machine translation apparatus according to any one of [12] to [20], further comprising: a corpus adding unit that adds user-related training data to the bilingual corpus.
By adding user-related training data, such as aligned sentence pairs, context-related data, or word-aligned sentence pairs, the machine translation apparatus of this embodiment can achieve user adaptation even when training data is scarce.
[22] The machine translation apparatus according to any one of [12] to [21], further comprising: a confidence calculation unit that calculates the confidence of the translation result using the similarity.
By calculating the confidence of the translation result from the similarity, the machine translation apparatus of this embodiment obtains the confidence together with the translation result. No other method is needed to calculate the confidence, which improves translation efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of a machine translation method according to one embodiment of the invention.
Fig. 2 is a block diagram of a machine translation apparatus according to another embodiment of the invention.
Detailed description of the embodiments
Preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.
Machine translation method
This embodiment provides a machine translation method comprising the following steps: inputting a sentence to be translated; calculating the similarity between the sentence to be translated and the source-language sentences in a bilingual corpus 10; selecting, based on the similarity, a plurality of sentence pairs from the bilingual corpus 10 as a training corpus; training a translation system using the training corpus; and translating the sentence to be translated using the translation system.
The method is described in detail below with reference to Fig. 1. Fig. 1 is a flowchart of a machine translation method according to one embodiment of the invention.
As shown in Fig. 1, first, in step S101, a sentence to be translated is input. In this embodiment, the sentence to be translated may be any sentence requiring translation known to those skilled in the art and may be in any language; the invention places no restriction on this.
Then, in step S105, the similarity between the sentence to be translated and the source-language sentences in the bilingual corpus 10 is calculated.
In this embodiment, the bilingual corpus 10 contains a plurality of sentence pairs in a source language and a target language. It may be any bilingual corpus known to those skilled in the art, such as an English-Chinese corpus, an English-German corpus, or a Japanese-Chinese corpus; this embodiment places no restriction on the bilingual corpus 10. In addition, the bilingual corpus of this embodiment may be a corpus that has only been sentence-aligned or a corpus whose sentence pairs have also been word-aligned; the invention places no restriction on this.
In this embodiment, the similarity is a parameter expressing how similar the sentence to be translated is to a source-language sentence. For example, a string-based similarity or a structural similarity may be used. The similarity may be calculated, for example, from the edit distance between the sentence to be translated and the source-language sentences in the bilingual corpus 10. It may also be calculated from the syntactic structure of the sentence to be translated and of the source-language sentences in the bilingual corpus 10. This embodiment places no restriction on how the similarity is calculated; any method known to those skilled in the art may be used.
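As an illustration of the edit-distance option, the following is a minimal sketch rather than the method prescribed by the description. It assumes whitespace tokenization and normalizes the word-level Levenshtein distance by the longer sentence length so that the score falls in [0, 1]; the function names and the normalization are assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(sentence, source_sentence):
    """Map edit distance to [0, 1]; 1.0 means identical token sequences.

    Normalizing by the longer length is an assumption of this sketch.
    """
    a, b = sentence.split(), source_sentence.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Working on tokens rather than characters keeps the score tied to word-level differences; a character-level variant would only need a different tokenization.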
Then, in step S110, a plurality of sentence pairs are selected from the bilingual corpus 10 as a training corpus based on the similarity.
When selecting sentence pairs, for example, the sentence pairs in the bilingual corpus 10 may be sorted in descending order of similarity according to the calculated similarities, and the top N sentence pairs after sorting may be selected as the training corpus, N being an integer of 1 or more.
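The top-N selection of step S110 can be sketched as below. This is an illustrative sketch only: the corpus is assumed to be a list of `(source, target)` tuples, and `shared_token_ratio` is a toy stand-in for whichever similarity measure is actually used.

```python
import heapq


def shared_token_ratio(a, b):
    """Toy similarity (an assumption): Jaccard overlap of whitespace tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def select_top_n(sentence, corpus, n, similarity=shared_token_ratio):
    """Return the n sentence pairs whose source side is most similar,
    in descending order of similarity."""
    if n < 1:
        raise ValueError("N must be an integer of 1 or more")
    return heapq.nlargest(
        n, corpus, key=lambda pair: similarity(sentence, pair[0]))
```

`heapq.nlargest` avoids fully sorting a large corpus when N is small, while returning the same pairs a full descending sort followed by truncation would.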
Alternatively, the sentence pairs need not be sorted; instead, the sentence pairs in the bilingual corpus 10 whose similarity is greater than a predetermined threshold may be selected as the training corpus. This embodiment places no restriction on how the training corpus is selected based on the similarity; a combination of the above methods or any other method known to those skilled in the art may be used.
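A hedged sketch of the threshold-based selection, optionally combined with a cap on the number of pairs, might look as follows. The input format, a list of precomputed `(similarity, pair)` tuples, and the `max_pairs` parameter are assumptions of this example.

```python
def select_above_threshold(scored_pairs, threshold, max_pairs=None):
    """Keep pairs whose similarity is strictly greater than the threshold.

    scored_pairs: iterable of (similarity, sentence_pair) tuples.
    max_pairs: optional cap, combining the threshold with top-N selection.
    """
    kept = [(score, pair) for score, pair in scored_pairs if score > threshold]
    if max_pairs is not None:
        # Only the combined variant needs a sort; list.sort is stable.
        kept.sort(key=lambda item: item[0], reverse=True)
        kept = kept[:max_pairs]
    return [pair for _, pair in kept]
```

Without `max_pairs` the corpus order is preserved and no sorting is done, which matches the variant described above that skips the sort entirely.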
Then, in step S115, the translation system is trained using the training corpus selected in step S110.
In this embodiment, the translation system may be any translation system known to those skilled in the art, as long as it can be used to translate the sentence to be translated. That is, any algorithm known to those skilled in the art may be used to train the translation system on the training corpus.
Specifically, when the bilingual corpus has only been sentence-aligned, the selected training corpus is first word-aligned using any word alignment method known to those skilled in the art, and the word-aligned training corpus is then used to train the translation system. When the sentence pairs of the bilingual corpus have already been word-aligned, this word alignment is unnecessary and the selected training corpus is used directly to train the translation system.
Preferably, the translation system of this embodiment includes a translation model (TM) and a language model (LM). The translation model is trained on the bilingual sentence pairs and word alignment information in the training corpus and describes how accurately the source language is converted into the target language. The language model is trained on the target-language sentences in the training corpus and describes the fluency of the translation result.
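The description does not fix how the two models are combined at translation time. One common choice, shown here only as an illustration, is the classical noisy-channel formulation of statistical machine translation. The symbols f (the sentence to be translated), e (a candidate target-language sentence), and ê (the chosen translation) are introduced for this sketch.

```latex
\hat{e} = \arg\max_{e} \; P_{\mathrm{TM}}(f \mid e)\, P_{\mathrm{LM}}(e)
```

Here the translation model term scores how well a candidate accounts for the source sentence, and the language model term scores how fluent the candidate is in the target language. Under the method above, both would be estimated only from the training corpus selected for the current sentence.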
Finally, in step S120, the sentence to be translated is translated using the translation system trained in step S115, yielding a translation result 20.
The machine translation method of this embodiment selects from the bilingual corpus 10 the sentence pairs that are highly similar to the sentence to be translated and builds the translation system in real time from the selected pairs. It can thus construct a dynamic, targeted translation system and thereby improve translation quality.
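Steps S101 to S120 can be strung together as in the sketch below. It is a toy under stated assumptions: `token_overlap` stands in for the similarity measure, and `train_toy_system` is a hypothetical stand-in for real training that merely learns a positional word dictionary so that the pipeline is runnable.

```python
def token_overlap(a, b):
    """Toy similarity (an assumption): Jaccard overlap of whitespace tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def train_toy_system(training_corpus):
    """Stand-in for step S115: learn a positional word dictionary.

    Real training (word alignment, translation model and language model
    estimation) is far more involved; this only makes the sketch runnable.
    """
    lexicon = {}
    for source, target in training_corpus:
        src_tokens, tgt_tokens = source.split(), target.split()
        if len(src_tokens) == len(tgt_tokens):
            lexicon.update(zip(src_tokens, tgt_tokens))

    def translate(sentence):
        # Unknown tokens are passed through unchanged.
        return " ".join(lexicon.get(tok, tok) for tok in sentence.split())

    return translate


def translate_dynamically(sentence, corpus, n,
                          similarity=token_overlap, train=train_toy_system):
    # S105: score every source-language sentence against the input.
    scored = [(similarity(sentence, src), (src, tgt)) for src, tgt in corpus]
    # S110: keep the N most similar sentence pairs as the training corpus.
    scored.sort(key=lambda item: item[0], reverse=True)
    training_corpus = [pair for _, pair in scored[:n]]
    # S115: train a translation system on the selected pairs only.
    system = train(training_corpus)
    # S120: translate the input with the freshly trained system.
    return system(sentence)
```

The point of the sketch is the control flow: a new system is trained for every input sentence, so the cost of `train` is paid per sentence, which is why limiting the training corpus to N pairs matters.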
In addition, by selecting the top N sentence pairs after sorting, the machine translation method of this embodiment can train the translation system on a fixed number of the most similar sentence pairs when the bilingual corpus 10 contains many pairs that are highly similar to the sentence to be translated. This both ensures translation quality and reduces the processing load of training the translation system.
Furthermore, by training the translation system on sentence pairs whose similarity is greater than a predetermined threshold, the machine translation method of this embodiment excludes low-similarity sentence pairs, avoids their interference with the translation system, and further ensures translation accuracy.
Preferably, the machine translation method of this embodiment further comprises, after step S120: saving the sentence to be translated and its translation result in a translation cache. The translation cache may be any memory known to those skilled in the art.
Preferably, the machine translation method of this embodiment further comprises, after step S101: looking up the sentence to be translated in the translation cache. If the sentence is found in the translation cache, its translation result is obtained directly; otherwise, step S105 is performed.
By saving the sentence to be translated and its translation result in the translation cache, the machine translation method of this embodiment can obtain the translation result directly from the cache the next time the same sentence is translated, which saves computing resources and improves translation efficiency.
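The lookup-then-save behaviour can be sketched with an in-memory dictionary. The class name `CachedTranslator` and the `misses` counter are hypothetical and exist only for this example; the wrapped `translate` callable stands for steps S105 to S120.

```python
class CachedTranslator:
    """Wrap a translate function with a translation cache."""

    def __init__(self, translate):
        self._translate = translate   # stands for steps S105 to S120
        self._cache = {}              # sentence -> translation result
        self.misses = 0               # how often the full pipeline ran

    def __call__(self, sentence):
        # After S101: look the sentence up before doing any work.
        if sentence in self._cache:
            return self._cache[sentence]
        self.misses += 1
        result = self._translate(sentence)
        # After S120: save the sentence and its translation result.
        self._cache[sentence] = result
        return result
```

Because the method retrains a system per sentence, a cache hit skips similarity calculation, selection, and training altogether, not just decoding.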
Preferably, the machine translation method of this embodiment further comprises, after step S120: adding the sentence to be translated and its translation result to the bilingual corpus 10.
By adding the sentence to be translated and its translation result to the bilingual corpus 10, the machine translation method of this embodiment expands the data in the bilingual corpus 10 and can thereby improve the quality of subsequent translations.
Preferably, the machine translation method of this embodiment further comprises, before adding the sentence to be translated and its translation result: performing word alignment on the sentence to be translated and its translation result; and adding the word alignment result to the bilingual corpus 10.
In this embodiment, any word alignment tool known to those skilled in the art, such as GIZA++, may be used to word-align the sentence to be translated and its translation result. This embodiment places no restriction on the word alignment method.
By adding the word alignment result to the bilingual corpus, the machine translation method of this embodiment not only expands the data in the bilingual corpus and improves the quality of subsequent translations, but can also improve translation efficiency.
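One way to store a new sentence pair together with an optional word alignment is sketched below. The `SentencePair` record and the `(source_index, target_index)` alignment format are assumptions of this example; the alignment itself would come from an external aligner and is simply passed in here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SentencePair:
    source: str
    target: str
    # Word alignment as (source_index, target_index) links; None means
    # the pair has only been sentence-aligned.
    alignment: Optional[List[Tuple[int, int]]] = None


def add_translation(corpus, source, target, alignment=None):
    """Append a translated sentence, with optional alignment, to the corpus."""
    pair = SentencePair(source, target, alignment)
    corpus.append(pair)
    return pair
```

Storing the alignment with the pair means a later training run that selects this pair can skip word alignment for it, which is consistent with the efficiency benefit described above.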
Preferably, the machine translation method of this embodiment further comprises, before step S110: adding user-related training data to the bilingual corpus 10.
By adding user-related training data, such as aligned sentence pairs, context-related data, or word-aligned sentence pairs, the machine translation method of this embodiment can achieve user adaptation even when training data is scarce.
Preferably, the machine translation method of this embodiment further comprises, after step S120: calculating the confidence of the translation result using the similarity.
In this embodiment, confidence refers to the credibility of the translation result. If even the part of the bilingual corpus 10 most similar to the input sentence is insufficient for learning the required translation knowledge, the probability of obtaining the best translation from the whole bilingual corpus 10 is also small. When similar sentences are selected, if the number of sentences whose similarity exceeds a certain value is below a predetermined threshold, the bilingual corpus 10 can be considered not to contain enough knowledge to translate the sentence. Conversely, for example, when the number of sentences whose similarity exceeds that value is at or above the predetermined threshold, the confidence can be considered high.
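The counting rule described above can be sketched as a small function. The parameter names `min_similarity` (the "certain value") and `min_count` (the predetermined threshold) and the two-level `"high"`/`"low"` result are assumptions of this example.

```python
def translation_confidence(similarities, min_similarity, min_count):
    """Judge confidence from how many corpus sentences are similar enough.

    similarities: similarity of each source-language sentence to the input.
    min_similarity: the value a similarity must exceed to count.
    min_count: the number of such sentences needed for high confidence.
    """
    similar = sum(1 for s in similarities if s > min_similarity)
    return "high" if similar >= min_count else "low"
```

Since the similarities were already computed in step S105, this check adds only a single pass over existing values.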
By calculating the confidence of the translation result from the similarity, the machine translation method of this embodiment obtains the confidence together with the translation result. No other method is needed to calculate the confidence, which improves translation efficiency.
Machine translation apparatus
Based on the same inventive concept, Fig. 2 is a block diagram of a machine translation apparatus according to another embodiment of the invention. This embodiment is described below with reference to the figure. Description of the parts identical to the preceding embodiment is omitted where appropriate.
This embodiment provides a machine translation apparatus 200 comprising: an input unit 201 that inputs a sentence to be translated; a calculation unit 205 that calculates the similarity between the sentence to be translated and the source-language sentences in a bilingual corpus 10; a selection unit 210 that selects, based on the similarity, a plurality of sentence pairs from the bilingual corpus 10 as a training corpus; a training unit 215 that trains a translation system using the training corpus; and a translation unit 220 that translates the sentence to be translated using the translation system.
The apparatus is described in detail below with reference to Fig. 2. As shown in Fig. 2, the input unit 201 inputs a sentence to be translated. In this embodiment, the sentence to be translated may be any sentence requiring translation known to those skilled in the art and may be in any language; the invention places no restriction on this.
The calculation unit 205 calculates the similarity between the sentence to be translated and the source-language sentences in the bilingual corpus 10.
In this embodiment, the bilingual corpus 10 contains a plurality of sentence pairs in a source language and a target language. It may be any bilingual corpus known to those skilled in the art, such as an English-Chinese corpus, an English-German corpus, or a Japanese-Chinese corpus; this embodiment places no restriction on the bilingual corpus 10. In addition, the bilingual corpus of this embodiment may be a corpus that has only been sentence-aligned or a corpus whose sentence pairs have also been word-aligned; the invention places no restriction on this.
In this embodiment, the similarity is a parameter expressing how similar the sentence to be translated is to a source-language sentence. For example, a string-based similarity or a structural similarity may be used. The calculation unit 205 may calculate the similarity, for example, from the edit distance between the sentence to be translated and the source-language sentences in the bilingual corpus 10. The calculation unit 205 may also calculate the similarity from the syntactic structure of the sentence to be translated and of the source-language sentences in the bilingual corpus 10. This embodiment places no restriction on how the similarity is calculated; any method known to those skilled in the art may be used.
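The description leaves the syntactic-structure measure open. As one possible illustration, the sketch below compares the sets of grammar productions in two parse trees using Jaccard overlap. It assumes the trees are already available as nested tuples of the form `(label, child, ...)` with words as plain strings; obtaining the parses, and the choice of this particular measure, are outside what the description specifies.

```python
def productions(tree):
    """Collect structural productions such as ('S', ('NP', 'VP')).

    Preterminal nodes such as ('NN', 'cat') are skipped so that only
    the structure is compared, not the words.
    """
    if isinstance(tree, str):
        return set()
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        return set()
    child_labels = tuple(
        child if isinstance(child, str) else child[0] for child in children)
    rules = {(label, child_labels)}
    for child in children:
        rules |= productions(child)
    return rules


def structural_similarity(tree_a, tree_b):
    """Jaccard overlap of the production sets of two parse trees."""
    rules_a, rules_b = productions(tree_a), productions(tree_b)
    union = rules_a | rules_b
    return len(rules_a & rules_b) / len(union) if union else 1.0
```

Two sentences with entirely different words but the same phrase structure score 1.0 under this measure, which is the kind of match a string-based similarity would miss.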
The selection unit 210 selects a plurality of sentence pairs from the bilingual corpus 10 as a training corpus based on the similarity.
The selection unit 210 preferably includes a sorting unit. When sentence pairs are selected, the sorting unit first sorts the sentence pairs in the bilingual corpus 10 in descending order of similarity according to the calculated similarities, and the selection unit 210 then selects the top N sentence pairs after sorting as the training corpus, N being an integer of 1 or more.
Alternatively, the selection unit 210 need not include a sorting unit and may directly select, as the training corpus, the sentence pairs in the bilingual corpus 10 whose similarity is greater than a predetermined threshold. This embodiment places no restriction on how the training corpus is selected based on the similarity; a combination of the above methods or any other method known to those skilled in the art may be used.
Training unit 215 trains the translation system using the training corpus selected by selection unit 210.
In the present embodiment, the translation system may be any translation system known to those skilled in the art, as long as it can be used to translate the sentence to be translated. That is, any algorithm known to those skilled in the art may be used to train the translation system on the training corpus.
Specifically, when the bilingual corpus is a corpus in which only sentence alignment has been performed, training unit 215 further comprises a word alignment unit, which performs word alignment on the selected training corpus using any word alignment method known to those skilled in the art; training unit 215 then trains the translation system using the word-aligned training corpus. On the other hand, when the bilingual corpus is a corpus in which word alignment has already been performed on the sentence pairs, training unit 215 need not comprise a word alignment unit and trains the translation system directly on the selected training corpus.
Preferably, the translation system of the present embodiment comprises a translation model (TM) and a language model (LM). The translation model is trained using the bilingual sentence pairs in the training corpus and their word alignment information, and describes the accuracy of the conversion from the source language to the target language. The language model is trained using the target language sentences in the training corpus, and describes the fluency of the translation result.
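The embodiment gives no formula for how the two models are combined. As a hedged illustration only, one standard formulation in statistical machine translation is the noisy-channel decomposition. Here s denotes the sentence to be translated and t a candidate target language sentence; these symbols are assumptions introduced for this sketch and are not notation from the patent.

```latex
\hat{t} = \arg\max_{t} P(t \mid s)
        = \arg\max_{t} \underbrace{P(s \mid t)}_{\text{translation model}} \; \underbrace{P(t)}_{\text{language model}}
```

The second equality follows from Bayes' rule, since P(s) does not depend on t. Under this reading, the translation model term rewards candidates that convert the source accurately, and the language model term rewards fluent target language sentences.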
Translation unit 220 translates the sentence to be translated using the translation system trained by training unit 215, and obtains translation result 20.
Machine translation apparatus 200 of the present embodiment selects from bilingual corpus 10 the sentence pairs with high similarity to the sentence to be translated and constructs the translation system in real time from the selected pairs. It can therefore construct a dynamic translation system targeted at the sentence to be translated, and thus can improve translation quality.
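The per-sentence flow of calculating similarities, selecting a training corpus, training, and translating can be sketched as below. Because the embodiment leaves the similarity measure and the training algorithm open, both are passed in as callables. All names here are hypothetical, and `toy_train` is only a lookup-table placeholder standing in for a real training procedure.

```python
def translate_dynamically(sentence, corpus, similarity, train, n=2):
    """Build a translation system for this one sentence, then translate it."""
    # Calculation step: score every source language sentence against the input.
    scored = [(similarity(sentence, src), (src, tgt)) for src, tgt in corpus]
    # Selection step: keep the n most similar sentence pairs as the training corpus.
    scored.sort(key=lambda item: item[0], reverse=True)
    training_pairs = [pair for _, pair in scored[:n]]
    # Training step: train a system on the selected pairs only.
    system = train(training_pairs)
    # Translation step: translate the input with the freshly trained system.
    return system(sentence), training_pairs


def word_overlap(a, b):
    """Toy similarity (an assumption): Jaccard overlap of the word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def toy_train(pairs):
    """Placeholder for real training: an exact-match lookup table."""
    table = dict(pairs)
    return lambda s: table.get(s)


corpus = [
    ("good morning", "guten Morgen"),
    ("good night", "gute Nacht"),
    ("red car", "rotes Auto"),
]
result, used = translate_dynamically("good night", corpus, word_overlap, toy_train)
```

In a real system the `train` callable would be where the translation model and language model are estimated; the sketch only shows that training sees the selected pairs rather than the whole corpus.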
In addition, because machine translation apparatus 200 of the present embodiment selects the first N sentence pairs after sorting, when bilingual corpus 10 contains a large number of sentence pairs with high similarity to the sentence to be translated, the translation system can be trained on a fixed number of the most similar pairs. This not only ensures translation quality but also reduces the processing load of training the translation system.
In addition, because machine translation apparatus 200 of the present embodiment trains the translation system on sentence pairs whose similarity is greater than a predetermined threshold, sentence pairs with low similarity are excluded. This prevents low-similarity sentence pairs from interfering with the translation system and further ensures translation accuracy.
Preferably, machine translation apparatus 200 of the present embodiment further comprises a storage unit that stores the sentence to be translated and its translation result in a translation buffer. The translation buffer may be any memory known to those skilled in the art.
Furthermore, machine translation apparatus 200 of the present embodiment preferably further comprises a lookup unit that looks up the sentence to be translated in the translation buffer. If the sentence to be translated is present in the translation buffer, its translation result is obtained directly; if not, the sentence is translated using calculation unit 205, selection unit 210, training unit 215 and translation unit 220.
By storing the sentence to be translated and its translation result in the translation buffer, machine translation apparatus 200 of the present embodiment can obtain the translation result directly from the translation buffer the next time the same sentence is translated, which saves computing resources and improves translation efficiency.
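The lookup-then-translate behavior of the translation buffer can be sketched as follows. The sketch assumes an in-memory dictionary keyed by the exact sentence string; the class name `CachedTranslator` and the `misses` counter are hypothetical and exist only to make the behavior observable.

```python
class CachedTranslator:
    """Wraps a translation function with a translation buffer (exact-match cache)."""

    def __init__(self, translate_fn):
        self._translate = translate_fn  # the full calculate/select/train/translate path
        self._buffer = {}               # sentence -> translation result
        self.misses = 0                 # number of times the full path was executed

    def translate(self, sentence):
        # Lookup step: return the stored result if this sentence was seen before.
        if sentence in self._buffer:
            return self._buffer[sentence]
        # Otherwise run the full translation path and store the result.
        self.misses += 1
        result = self._translate(sentence)
        self._buffer[sentence] = result
        return result


# str.upper stands in for a real translation function in this toy usage.
translator = CachedTranslator(str.upper)
first = translator.translate("hello")
second = translator.translate("hello")
```

Only the first call reaches the wrapped function; the second is served from the buffer, which is where the saving in computing resources comes from.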
Preferably, machine translation apparatus 200 of the present embodiment further comprises a sentence pair adding unit that adds the sentence to be translated and its translation result to the bilingual corpus.
By adding the sentence to be translated and its translation result to bilingual corpus 10, machine translation apparatus 200 of the present embodiment can expand the data of bilingual corpus 10 and thus improve the quality of subsequent translations.
Preferably, machine translation apparatus 200 of the present embodiment further comprises a word alignment unit that performs word alignment on the sentence to be translated and its translation result, and a word alignment result adding unit that adds the word alignment result to bilingual corpus 10.
In the present embodiment, word alignment may be performed on the sentence to be translated and its translation result using any word alignment tool known to those skilled in the art, such as GIZA++. The present embodiment places no restriction on the word alignment method.
By adding the word alignment result to the bilingual corpus, machine translation apparatus 200 of the present embodiment can not only expand the data of the bilingual corpus and improve the quality of subsequent translations, but also improve translation efficiency.
Preferably, machine translation apparatus 200 of the present embodiment further comprises a training data adding unit that adds user-related training data to bilingual corpus 10. Alternatively, the above word alignment result adding unit may also serve as the training data adding unit.
By adding user-related training data, such as aligned sentence pairs, context-related data, and word-aligned sentence pairs, machine translation apparatus 200 of the present embodiment can achieve user adaptation even when training data is scarce.
Preferably, machine translation apparatus 200 of the present embodiment further comprises a confidence calculation unit that calculates the confidence of the translation result using the similarity. Alternatively, calculation unit 205 may also serve as the confidence calculation unit.
In the present embodiment, confidence refers to the credibility of the translation result. If even the part of bilingual corpus 10 most similar to the input sentence is insufficient to provide translation knowledge for learning, it follows that the probability of obtaining the best translation from the whole of bilingual corpus 10 is also small. When similar sentences are selected, if the number of sentences whose similarity is higher than a certain value is below a predetermined threshold, bilingual corpus 10 can be considered not to contain enough knowledge to translate the sentence to be translated. For example, the confidence can be considered high when the number of sentences whose similarity is greater than the certain value is at or above the predetermined threshold.
By calculating the confidence of the translation result from the similarity, machine translation apparatus 200 of the present embodiment can obtain the confidence together with the translation result. No other method is needed to calculate the confidence, which improves translation efficiency.
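The counting rule described above for judging confidence can be sketched as follows. It assumes a binary high/low outcome and reuses similarities that were already calculated during selection; the function name and the parameter names `min_similarity` and `min_count` are hypothetical labels for the "certain value" and the "predetermined threshold".

```python
def is_confident(similarities, min_similarity, min_count):
    """Return True when enough corpus sentences are sufficiently similar to the input."""
    # Count the source language sentences whose similarity exceeds the given value.
    similar_count = sum(1 for s in similarities if s > min_similarity)
    # Confidence is considered high when that count reaches the threshold.
    return similar_count >= min_count


# Toy similarity values for illustration only.
scores = [0.9, 0.8, 0.2, 0.1]
high = is_confident(scores, min_similarity=0.5, min_count=2)
low = is_confident(scores, min_similarity=0.5, min_count=3)
```

Because the similarities are a by-product of corpus selection, this check adds only a single pass over values that already exist.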
Although the machine translation method and machine translation apparatus of the present invention have been described in detail above through some exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art can make various changes and modifications within the spirit and scope of the present invention. Therefore, the present invention is not limited to these embodiments, and its scope is defined only by the appended claims.

Claims (10)

1. A machine translation apparatus, comprising:
an input unit that inputs a sentence to be translated;
a calculation unit that calculates the similarity between the sentence to be translated and source language sentences in a bilingual corpus;
a selection unit that selects multiple sentence pairs from the bilingual corpus based on the similarity, as a training corpus;
a training unit that trains a translation system using the training corpus; and
a translation unit that translates the sentence to be translated using the translation system.
2. The machine translation apparatus according to claim 1, wherein the selection unit comprises:
a sorting unit that sorts the sentence pairs in the bilingual corpus in descending order of the similarity; and
the selection unit selects the first N sentence pairs after sorting as the training corpus, N being an integer of 1 or more.
3. The machine translation apparatus according to claim 1, wherein the selection unit selects, as the training corpus, the sentence pairs in the bilingual corpus whose similarity is greater than a predetermined threshold.
4. The machine translation apparatus according to claim 1, wherein the calculation unit calculates the similarity using the edit distance between the sentence to be translated and the source language sentences in the bilingual corpus.
5. The machine translation apparatus according to claim 1, wherein the calculation unit calculates the similarity of syntactic structure between the sentence to be translated and the source language sentences in the bilingual corpus.
6. The machine translation apparatus according to claim 1, further comprising:
a storage unit that stores the sentence to be translated and its translation result in a translation buffer; and
a lookup unit that looks up the sentence to be translated in the translation buffer after the input unit inputs the sentence to be translated.
7. The machine translation apparatus according to claim 1, further comprising:
a sentence pair adding unit that adds the sentence to be translated and its translation result to the bilingual corpus; or
a word alignment unit that performs word alignment on the sentence to be translated and its translation result, and a word alignment result adding unit that adds the word alignment result to the bilingual corpus.
8. The machine translation apparatus according to claim 1, further comprising:
a training data adding unit that adds user-related training data to the bilingual corpus.
9. The machine translation apparatus according to claim 1, further comprising:
a confidence calculation unit that calculates the confidence of the translation result using the similarity.
10. A machine translation method, comprising the following steps:
inputting a sentence to be translated;
calculating the similarity between the sentence to be translated and source language sentences in a bilingual corpus;
selecting multiple sentence pairs from the bilingual corpus based on the similarity, as a training corpus;
training a translation system using the training corpus; and
translating the sentence to be translated using the translation system.
CN201410104256.7A 2014-03-20 2014-03-20 Machine translation method and machine translation device Pending CN104933038A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410104256.7A CN104933038A (en) 2014-03-20 2014-03-20 Machine translation method and machine translation device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410104256.7A CN104933038A (en) 2014-03-20 2014-03-20 Machine translation method and machine translation device

Publications (1)

Publication Number Publication Date
CN104933038A true CN104933038A (en) 2015-09-23

Family

ID=54120207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410104256.7A Pending CN104933038A (en) 2014-03-20 2014-03-20 Machine translation method and machine translation device

Country Status (1)

Country Link
CN (1) CN104933038A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation
CN101271451A (en) * 2007-03-20 2008-09-24 株式会社东芝 Computer aided translation method and device
CN101393547A (en) * 2007-09-20 2009-03-25 株式会社东芝 Apparatus, method, and system for machine translation
CN101667176A (en) * 2008-09-01 2010-03-10 株式会社东芝 Method and system for counting machine translation based on phrases
CN101714137A (en) * 2008-10-06 2010-05-26 株式会社东芝 Methods for evaluating and selecting example sentence pairs and building universal example sentence library, and machine translation method and device
CN103631773A (en) * 2013-12-16 2014-03-12 哈尔滨工业大学 Statistical machine translation method based on field similarity measurement method

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106708812A (en) * 2016-12-19 2017-05-24 新译信息科技(深圳)有限公司 Machine translation model obtaining method and device
CN106598959A (en) * 2016-12-23 2017-04-26 北京金山办公软件股份有限公司 Method and system for determining intertranslation relationship of bilingual sentence pairs
CN107193809A (en) * 2017-05-18 2017-09-22 广东小天才科技有限公司 A kind of teaching material scenario generation method and device, user equipment
CN107329961A (en) * 2017-07-03 2017-11-07 西安市邦尼翻译有限公司 A kind of method of cloud translation memory library Fast incremental formula fuzzy matching
CN107526727A (en) * 2017-07-31 2017-12-29 苏州大学 language generation method based on statistical machine translation
CN110472251A (en) * 2018-05-10 2019-11-19 腾讯科技(深圳)有限公司 Method, the method for statement translation, equipment and the storage medium of translation model training
CN110472251B (en) * 2018-05-10 2023-05-30 腾讯科技(深圳)有限公司 Translation model training method, sentence translation equipment and storage medium
CN111191469A (en) * 2019-12-17 2020-05-22 语联网(武汉)信息技术有限公司 Large-scale corpus cleaning and aligning method and device
CN111191469B (en) * 2019-12-17 2023-09-19 语联网(武汉)信息技术有限公司 Large-scale corpus cleaning and aligning method and device
CN111221965A (en) * 2019-12-30 2020-06-02 成都信息工程大学 Classification sampling detection method based on bilingual corpus of public identification words

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150923