CN107066455A - Multilingual intelligent-preprocessing real-time statistical machine translation system - Google Patents

Publication number
CN107066455A
Authority
CN
China
Prior art keywords
module
language
translation
word
pretreatment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710203439.8A
Other languages
Chinese (zh)
Other versions
CN107066455B (en
Inventor
张昱琪
唐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201710203439.8A
Publication of CN107066455A
Application granted
Publication of CN107066455B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/45 Example-based machine translation; Alignment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention discloses a multilingual intelligent-preprocessing real-time statistical machine translation system comprising a receiving module, a preprocessing module, a machine translation module and a post-processing module. The receiving module includes a text receiving module and a speech recognition result receiving module. The preprocessing module includes a text preprocessing module and a speech recognition result preprocessing module. The machine translation module learns phrase-to-phrase translations, finds the corresponding translated phrase for each phrase produced by the preprocessing module, and joins the phrases into a complete sentence. The post-processing module applies word-format normalization, case normalization and format normalization to the translation result so that it is closer to the conventions of the target language, and outputs it as the final result. The invention can translate both written text and speech, and improves translation accuracy for low-probability words and phrases.

Description

Multilingual intelligent-preprocessing real-time statistical machine translation system
Technical field
The present invention relates to the technical field of artificial-intelligence machine translation, and in particular to a multilingual intelligent-preprocessing real-time statistical machine translation system.
Background
Machine translation is the technology of using a computer to translate human natural language automatically: the computer converts text in one natural language into another natural language, and the two texts should be equivalent in meaning.
At present, a relatively mature and mainstream approach to machine translation is the statistical approach. Its advantage is that almost no translation rules need to be written by hand, because all translation information is learned automatically from corpora. The approach therefore makes full use of the computer's high-speed computation and greatly reduces labor cost.
Machine translation based on a statistical model learns phrase translations from a language A to another language B from a parallel corpus. To translate a new sentence, the input sentence in language A is decomposed into several phrases, and the sentence is translated into language B according to the learned co-occurrence probabilities of phrase (language A) - phrase (language B) pairs. The entire learning and translation process relies solely on the statistical model.
However, machine translation based on co-occurrence frequency and probability handles low-probability phrases (for example, translations of proper nouns) poorly. In addition, how to add grammatical and semantic information to the statistical model, so that the generated translation better matches how people express themselves, is also a problem that current machine translation technology needs to solve.
Summary of the invention
In view of the above technical problems in the related art, the present invention proposes a multilingual intelligent-preprocessing real-time statistical machine translation system that overcomes the above deficiencies of the prior art.
To achieve the above technical purpose, the technical solution of the present invention is as follows:
A multilingual intelligent-preprocessing real-time statistical machine translation system, comprising:
A receiving module, which performs normalization checks on the system input and includes a text receiving module and a speech recognition result receiving module. The text receiving module performs sentence segmentation and format conversion on text; the speech recognition result receiving module performs segmentation, noise elimination and format conversion on speech;
A preprocessing module, which includes a text preprocessing module and a speech recognition result preprocessing module. The text preprocessing module performs word normalization, class identification and tagging, and chunk reordering on the text input; the speech recognition result preprocessing module performs word normalization and punctuation prediction on speech;
A machine translation module, which learns phrase-to-phrase translations, finds the corresponding translated phrase for each phrase processed by the preprocessing module, and generates a complete sentence;
A post-processing module, which applies word-format normalization, case normalization and format normalization to the translation result so that it is closer to the conventions of the target language, and outputs it as the final result.
Further, the text receiving module includes a sentence segmentation module and a format conversion module. The sentence segmentation module breaks the input text at punctuation marks so that the basic unit translated by the subsequent machine translation module is one sentence; the format conversion module converts the various formats of the input text into a format supported by the machine translation module.
Preferably, the formats supported by the machine translation module are plain text and XML.
Further, the speech recognition result receiving module includes a sentence segmentation module and a noise elimination module. The sentence segmentation module splits the input speech text stream into sentences according to the pauses between words; the noise elimination module removes adjacent repeated fragments from the spoken-language text stream in the input.
Further, the text preprocessing module includes a word normalization module, a class identification and tagging module, and a chunk reordering module. The word normalization module brings the language to be translated closer to the target language at the word level. The class identification and tagging module tags numbers, dates, times and URLs in the text to be translated as $number, $date, $hour and $www respectively, and translates the content of each class into the target language in advance. The chunk reordering module parses the sentences of the language to be translated and then reorders their chunks according to automatically learned rules, so that the word order of the language to be translated is closer to that of the target language.
Further, the speech recognition result preprocessing module includes a word normalization module and a punctuation prediction module. The word normalization module brings the word granularity of the language to be translated closer to that of the target language; the punctuation prediction module determines the positions of full stops in the speech recognition output from the context and the pauses between words. The speech recognition result preprocessing module accepts speech recognition results as plain text or as a confusion network.
Further, the machine translation module includes a training module and a translation module. The training module uses the GIZA++ toolkit to learn phrase-to-phrase translations from a large-scale parallel corpus. The translation module splits each input sentence into phrase fragments and translates each fragment according to the training results of the training module. The translation process is a search: from the candidate combinations formed by the results of the translation sub-models, the optimal combination is found, and this combination is the final translation result.
Preferably, the translation sub-models include a phrase translation model, a language model, a distortion (word-order) model, a part-of-speech language model, a bilingual language model and a domain adaptation model.
Further, the post-processing module includes a word-format normalization module, a case conversion module and a format conversion module. The word-format normalization module normalizes the word format of the machine translation result to the form used in the target language; the case conversion module is used for translation into Western languages; the format conversion module makes the format of the translated output consistent with the format of the input.
Preferably, the case conversion module capitalizes the first letter of a sentence and the letters of proper nouns in the target language.
Beneficial effects of the invention: the machine translation system of the present invention can translate sentences and documents of one language into another language in real time. The system can translate written text whose sentences are complete, correctly expressed and punctuated, and can also translate speech that has no paragraph segmentation, possibly incomplete sentences, no punctuation and noise within sentences. The invention improves translation accuracy for low-probability words and phrases by tagging low-probability items such as numbers, dates, times and URLs separately and translating them first. The preprocessing module normalizes the input sentences, and the post-processing module improves the fluency of the translation result.
Brief description of the drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a translation flow chart of the multilingual intelligent-preprocessing real-time statistical machine translation system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the text receiving module of the system according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the speech recognition result receiving module of the system according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the text preprocessing module of the system according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the speech recognition result preprocessing module of the system according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the machine translation module of the system according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the post-processing module of the system according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention fall within the scope of protection of the invention.
As shown in Figs. 1-7, a multilingual intelligent-preprocessing real-time statistical machine translation system according to an embodiment of the present invention comprises:
A receiving module, which performs normalization checks on the system input and includes a text receiving module and a speech recognition result receiving module. The text receiving module performs sentence segmentation and format conversion on text; the speech recognition result receiving module performs segmentation, noise elimination and format conversion on speech;
A preprocessing module, which includes a text preprocessing module and a speech recognition result preprocessing module. The text preprocessing module performs word normalization, class identification and tagging, and chunk reordering on the text input; the speech recognition result preprocessing module performs word normalization and punctuation prediction on speech;
A machine translation module, which learns phrase-to-phrase translations, finds the corresponding translated phrase for each phrase processed by the preprocessing module, and generates a complete sentence;
A post-processing module, which applies word-format normalization, case normalization and format normalization to the translation result so that it is closer to the conventions of the target language, and outputs it as the final result.
In one embodiment, the text receiving module includes a sentence segmentation module and a format conversion module. The sentence segmentation module breaks the input text at punctuation marks so that the basic unit translated by the subsequent machine translation module is one sentence; the format conversion module converts the various formats of the input text into a format supported by the machine translation module.
In one embodiment, the formats supported by the machine translation module are plain text and XML.
In one embodiment, the speech recognition result receiving module includes a sentence segmentation module and a noise elimination module. The sentence segmentation module splits the input speech text stream into sentences according to the pauses between words; the noise elimination module removes adjacent repeated fragments from the spoken-language text stream in the input.
In one embodiment, the text preprocessing module includes a word normalization module, a class identification and tagging module, and a chunk reordering module. The word normalization module brings the language to be translated closer to the target language at the word level. The class identification and tagging module tags numbers, dates, times and URLs in the text to be translated as $number, $date, $hour and $www respectively, and translates the content of each class into the target language in advance. The chunk reordering module parses the sentences of the language to be translated and then reorders their chunks according to automatically learned rules, so that the word order of the language to be translated is closer to that of the target language.
In one embodiment, the speech recognition result preprocessing module includes a word normalization module and a punctuation prediction module. The word normalization module brings the word granularity of the language to be translated closer to that of the target language; the punctuation prediction module determines the positions of full stops in the speech recognition output from the context and the pauses between words. The speech recognition result preprocessing module accepts speech recognition results as plain text or as a confusion network.
In one embodiment, the machine translation module includes a training module and a translation module. The training module uses the GIZA++ toolkit to learn phrase-to-phrase translations from a large-scale parallel corpus. The translation module splits each input sentence into phrase fragments and translates each fragment according to the training results of the training module. The translation process is a search: from the candidate combinations formed by the results of the translation sub-models, the optimal combination is found, and this combination is the final translation result.
In one embodiment, the translation sub-models include a phrase translation model, a language model, a distortion (word-order) model, a part-of-speech language model, a bilingual language model and a domain adaptation model.
In one embodiment, the post-processing module includes a word-format normalization module, a case conversion module and a format conversion module. The word-format normalization module normalizes the word format of the machine translation result to the form used in the target language; the case conversion module is used for translation into Western languages; the format conversion module makes the format of the translated output consistent with the format of the input.
In one embodiment, the case conversion module capitalizes the first letter of a sentence and the letters of proper nouns in the target language.
To facilitate understanding of the above technical solution, it is described in detail below by way of a specific mode of use.
In specific use, the multilingual intelligent-preprocessing real-time statistical machine translation system according to the present invention includes a receiving module, a preprocessing module, a translation module and a post-processing module.
The receiving module performs normalization checks on the system input and includes a text receiving module and a speech recognition result receiving module. The text receiving module mainly consists of two parts, as shown in Fig. 2: a sentence segmentation module and a format conversion module. A.1 The sentence segmentation module breaks the input text at full stops, question marks and exclamation marks, so that the basic unit translated by the subsequent machine translation module is a sentence. When the input text contains HTML tags, the content between a pair of HTML tags forms a sentence of its own, so that it is translated as a complete sentence rather than as part of the text outside the tags. The downstream modules support translation of plain text and XML text. A.2 When the input is in another format, such as PDF or an image, the format conversion module converts it into plain text or XML. The speech recognition result receiving module also mainly consists of two parts, as shown in Fig. 3: a sentence segmentation module and a noise elimination module. A.3 The sentence segmentation module splits the input text stream according to the pauses between words: when a pause exceeds 0.5 s, a new sentence is considered to start after the pause. A.4 The noise elimination module removes adjacent repeated fragments from the spoken-language text stream in the input, for example simplifying "uh uh" to "uh" and "that is to say, that is to say, we must ..." to "that is to say, we must ...". The downstream modules of the machine translation system accept speech recognition results as plain text or as a confusion network.
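The A.3 and A.4 steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `(token, start, end)` timing tuples and the `max_len` limit on repeated fragments are assumptions introduced here; only the 0.5 s pause threshold comes from the description.

```python
from typing import List, Optional, Tuple

PAUSE_THRESHOLD_S = 0.5  # pause length stated in the description


def split_on_pauses(
    words: List[Tuple[str, float, float]],
    threshold: float = PAUSE_THRESHOLD_S,
) -> List[List[str]]:
    """Split a recognized word stream into sentences.

    Each item is (token, start_time, end_time) in seconds (assumed format).
    A new sentence starts when the gap to the previous word exceeds threshold.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    prev_end: Optional[float] = None
    for token, start, end in words:
        if prev_end is not None and current and start - prev_end > threshold:
            sentences.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        sentences.append(current)
    return sentences


def remove_adjacent_repeats(tokens: List[str], max_len: int = 3) -> List[str]:
    """Collapse immediately repeated fragments of up to max_len tokens."""
    out: List[str] = []
    for tok in tokens:
        out.append(tok)
        collapsed = True
        while collapsed:
            collapsed = False
            for n in range(1, max_len + 1):
                # Compare the last n tokens with the n tokens before them.
                if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                    del out[-n:]
                    collapsed = True
                    break
    return out
```

For instance, `remove_adjacent_repeats(["uh", "uh", "we", "must"])` keeps a single "uh", and a stream whose third word starts 0.7 s after the second word ends is split into two sentences.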
The preprocessing module performs several operations on the input language A to bring it closer to the target language B, so that the subsequent machine translation module achieves better translation quality. It includes a text preprocessing module and a speech recognition result preprocessing module. The text preprocessing module mainly consists of three parts, as shown in Fig. 4. B.1 The word normalization module brings the source language A closer to the target language B at the word level. For example, for Chinese-English translation the Chinese text is word-segmented and spaces are inserted between words; for German-English translation, German compound words are split, which increases the one-to-one correspondence between words in German and English sentences. B.2 The class identification and tagging module tags numbers, dates, times and URLs in source language A with the corresponding classes $number, $date, $hour and $www. The content of each class is translated into target language B in advance by rule, and the subsequent machine translation module does not translate it again. B.3 The chunk reordering module first parses the source sentence, either by automatic phrase identification or by generating a syntax tree. It then reorders the chunks of the source language according to automatically learned (phrase-based) rules, so that the source word order is closer to that of the target language. The reordered sentence can be output as the single best word order, or several good word orders can be output as a word lattice. This module is optional: whether to enable it depends on factors such as whether a well-performing parser is available for the source language. The speech recognition result preprocessing module consists of two parts, as shown in Fig. 5. B.4 The word normalization module is similar to the B.1 word normalization module: at the word level it brings the word granularity of sentences in language A closer to that of target language B. B.5 The punctuation prediction module predicts full-stop positions in the speech recognition output from the context and the pauses between words. This sub-module is optional and is mainly used for speech recognition translation of relatively formal, written-style speech, such as the translation of lectures.
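The B.2 tagging step described above can be illustrated with a small sketch. The regular expressions below are simplified assumptions chosen for the example (the description does not specify how the classes are recognized), and `tag_classes` is a hypothetical name; only the four placeholder labels come from the text.

```python
import re
from typing import List, Tuple

# Order matters: URLs, dates and times contain digits, so they are tried
# before bare numbers. The patterns are illustrative assumptions.
PATTERNS = [
    ("$www", r"(?:https?://|www\.)\S+"),
    ("$date", r"\d{4}-\d{2}-\d{2}"),
    ("$hour", r"\d{1,2}:\d{2}"),
    ("$number", r"\d+(?:\.\d+)?"),
]
COMBINED = re.compile(
    "|".join(f"(?P<c{i}>{pattern})" for i, (_, pattern) in enumerate(PATTERNS))
)


def tag_classes(sentence: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace class content with its placeholder.

    Returns the tagged sentence and the (label, original content) pairs,
    so that the content can be translated separately by rule.
    """
    found: List[Tuple[str, str]] = []

    def replace(match: "re.Match[str]") -> str:
        # The named group that matched tells us which class was recognized.
        label = PATTERNS[int(match.lastgroup[1:])][0]
        found.append((label, match.group(0)))
        return label

    return COMBINED.sub(replace, sentence), found
```

Keeping the original content alongside each placeholder lets a rule-based step translate it ahead of time, so that the statistical decoder only sees the placeholder.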
The B.2 class identification and tagging is based on bilingual semi-automatic class identification and translation. "Semi-automatic" means that the classes to be recognized are defined manually for only one of the two languages; the corresponding classes in the other language, and the translations of the classes, are then learned automatically from a parallel corpus and its word alignment. Taking English-Chinese translation as an example, the classes to be recognized ($number, $date, $hour, $www) are first defined on the English side. The content of each class may include several words. Then all digits are identified on the Chinese side and tagged $bnumber, and words related to the World Wide Web (www, http, .com and so on) are tagged $bwww. Here $bnumber and $bwww are the cores of the Chinese classes. Surrounding words are then added to this core to form the final Chinese class that corresponds to the English class. Which surrounding words to include is extracted automatically from the word alignment: a Chinese word aligned to a boundary word of the English class may also be a boundary word of the Chinese class. Once the boundary words of the Chinese class are determined, the extracted Chinese class content also implies the Chinese translation of the corresponding English class. From this, translation rules between English classes and Chinese classes are learned, for example:
$number{2} → $number{2}
$number{2成} → $number{20%}
$number{第2} → $number{2nd}
Rules extracted in this way fit the actual data better and reduce the errors that manually defined rules produce in practice. Compared with the traditional approach of defining classes and rules separately on both languages, this improves efficiency, does not require the rule writer to know both languages, and greatly reduces the probability of rule mismatches between the two languages, thereby improving machine translation quality.
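The boundary-extension idea described above (start from the Chinese core and grow it to cover every Chinese word aligned to the English class) can be sketched roughly as follows. The function name, the index conventions and the toy sentence pair are assumptions made for illustration; a real system would read the alignment from a word aligner's output.

```python
from typing import List, Set, Tuple


def extract_class_pair(
    en_tokens: List[str],
    zh_tokens: List[str],
    alignment: Set[Tuple[int, int]],
    en_span: Tuple[int, int],
    zh_core: int,
) -> Tuple[str, str]:
    """Return (English class content, Chinese class content).

    en_span   inclusive token indices of the manually defined English class
    zh_core   index of the Chinese core token (for example a $bnumber digit)
    alignment set of (english_index, chinese_index) word-alignment links
    """
    start, end = en_span
    # Chinese words aligned to any English word inside the class.
    linked = {zh for en, zh in alignment if start <= en <= end}
    linked.add(zh_core)
    # The Chinese class spans from the leftmost to the rightmost linked word.
    zh_start, zh_end = min(linked), max(linked)
    en_content = " ".join(en_tokens[start:end + 1])
    zh_content = "".join(zh_tokens[zh_start:zh_end + 1])
    return en_content, zh_content


# Toy example (hypothetical data): "20 %" is aligned to "2 成".
pair = extract_class_pair(
    ["about", "20", "%", "of", "users"],
    ["大约", "2", "成", "用户"],
    {(0, 0), (1, 1), (2, 2), (4, 3)},
    en_span=(1, 2),
    zh_core=1,
)
```

Counting how often each extracted pair occurs across the corpus would then give candidate class translation rules of the kind listed above.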
The B.3 chunk reordering method adds grammatical constraints to word-order adjustment in the statistical translation system. When one language is translated into another, the order in which words are expressed often differs because of differences in grammar and usage. Translation therefore requires not only translating words or phrases into the other language but also placing the translated phrases in suitable positions. In a statistical translation system the basic unit, the phrase, is an arbitrary word string and need not be a grammatical constituent. As a result, moved chunks that are spliced back together often produce very odd translations. In the preprocessing stage, the present invention introduces grammatically well-formed phrases through shallow parsing. In the subsequent phrase-movement step, only phrases that satisfy the grammatical constraints are moved, which improves the correctness and fluency of the translation result. The specific steps are:
Perform shallow parsing on the source language to generate grammatical information such as NP (noun phrase), VP (verb phrase) and PP (prepositional phrase).
Learn word-order adjustment rules, and the probability of each rule, from the word alignment. Examples of learned rules:
DNP NP VP –> DNP NP VP (0.89)
DNP NP VP –> NP DNP VP (0.11)
That is, the probability that the phrase sequence DNP NP VP keeps its order is 0.89, and the probability that it becomes NP DNP VP is 0.11. These rules are then applied to the input source sentence. Applying different combinations of rules produces different reorderings of the phrase sequence, all of which are represented as a word lattice. The probability of each path in the lattice is computed from the rule probabilities. Either the best path or the whole lattice is used as the new input to the subsequent machine translation module.
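As a rough illustration of how rule probabilities yield path probabilities, the sketch below enumerates the reorderings of a chunk-tag sequence. It lists the paths explicitly instead of building a lattice data structure, and it assumes that the longest matching rule is applied at each position and that rule applications are independent, so a path probability is the product of the rule probabilities used. The names `RULES`, `reorderings` and `best_order` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Tags = Tuple[str, ...]

# The two example rules from the description, keyed by left-hand side.
RULES: Dict[Tags, List[Tuple[Tags, float]]] = {
    ("DNP", "NP", "VP"): [
        (("DNP", "NP", "VP"), 0.89),
        (("NP", "DNP", "VP"), 0.11),
    ],
}


def reorderings(tags: Sequence[str], rules: Dict[Tags, List[Tuple[Tags, float]]]) -> List[Tuple[Tags, float]]:
    """Enumerate every reordering of tags with its probability."""
    tags = tuple(tags)
    if not tags:
        return [((), 1.0)]
    matches = [lhs for lhs in rules if tags[:len(lhs)] == lhs]
    if not matches:
        # No rule starts here: keep the first chunk in place and continue.
        return [((tags[0],) + rest, p) for rest, p in reorderings(tags[1:], rules)]
    lhs = max(matches, key=len)  # assumption: prefer the longest match
    return [
        (rhs + rest, p * q)
        for rhs, p in rules[lhs]
        for rest, q in reorderings(tags[len(lhs):], rules)
    ]


def best_order(tags: Sequence[str], rules: Dict[Tags, List[Tuple[Tags, float]]]) -> Tuple[Tags, float]:
    """Return the single most probable reordering (the best path)."""
    return max(reorderings(tags, rules), key=lambda item: item[1])
```

Calling `reorderings(["PP", "DNP", "NP", "VP"], RULES)` gives two paths, one per rule alternative; either the best one or the full list can be handed to the translation step.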
The machine translation module is divided into a training module and a translation module. In the training stage, the training module uses the GIZA++ toolkit to learn phrase-to-phrase translations (with probabilities) from a large-scale parallel corpus. In the translation stage, the translation module splits each input sentence into several phrase fragments (a phrase here is not necessarily a grammatical phrase). For each fragment, the corresponding phrase translations are looked up in the training results and spliced into a complete sentence in the target language. Because a source sentence can be split into phrases in many ways, and each phrase has several possible translations, translation is essentially a search: the optimal combination must be found among the different splicings, and this is the final translation result. During the search, several sub-models help find the best path. The required sub-models are the phrase translation model (Translation Model) and the language model (Language Model). Other sub-models, such as the distortion model (Distortion Model), the part-of-speech language model (POS Language Model), the bilingual language model (Bilingual Language Model) and the domain adaptation model (Adaptation Model), can be enabled as needed.
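To make the "translation as search" idea concrete, here is a toy sketch that scores every segmentation and phrase choice with the two required sub-models. It is deliberately simplified: it enumerates candidates exhaustively instead of using a beam search, translates monotonically (no reordering), and uses a made-up phrase table and bigram table. All names and numbers are assumptions for illustration, not values from the patent or from GIZA++ output.

```python
import math
from typing import Dict, Iterator, List, Optional, Tuple

PhraseTable = Dict[Tuple[str, ...], List[Tuple[str, float]]]

# Hypothetical phrase table: source phrase -> [(target phrase, probability)].
PHRASE_TABLE: PhraseTable = {
    ("我",): [("I", 0.9)],
    ("喜欢",): [("like", 0.7), ("love", 0.3)],
    ("机器", "翻译"): [("machine translation", 0.8)],
    ("机器",): [("machine", 0.9)],
    ("翻译",): [("translate", 0.6), ("translation", 0.4)],
}
# Hypothetical bigram log-probabilities for the language model.
BIGRAMS = {
    ("<s>", "I"): -0.1,
    ("I", "like"): -0.3,
    ("like", "machine"): -0.5,
    ("machine", "translation"): -0.2,
}


def candidates(src: List[str], table: PhraseTable, max_len: int = 3) -> Iterator[Tuple[List[str], float]]:
    """Yield (target phrases, translation-model log score) for every segmentation."""
    if not src:
        yield [], 0.0
        return
    for n in range(1, min(max_len, len(src)) + 1):
        for target, prob in table.get(tuple(src[:n]), []):
            for rest, score in candidates(src[n:], table, max_len):
                yield [target] + rest, math.log(prob) + score


def lm_score(words: List[str], bigrams: Dict[Tuple[str, str], float], unseen: float = -5.0) -> float:
    """Bigram language-model log score with a flat penalty for unseen bigrams."""
    padded = ["<s>"] + words
    return sum(bigrams.get(pair, unseen) for pair in zip(padded, padded[1:]))


def decode(src: List[str], table: PhraseTable, bigrams: Dict[Tuple[str, str], float],
           lm_weight: float = 1.0) -> Optional[Tuple[str, float]]:
    """Return the best-scoring translation, or None if the source cannot be covered."""
    best: Optional[Tuple[str, float]] = None
    for phrases, tm in candidates(src, table):
        words = " ".join(phrases).split()
        total = tm + lm_weight * lm_score(words, bigrams)
        if best is None or total > best[1]:
            best = (" ".join(words), total)
    return best
```

The optional sub-models named in the description would enter the same way, as additional weighted terms in `total`; exhaustive enumeration is only workable for very short sentences, which is why practical decoders prune the search.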
The post-processing module further processes the translation result so that it more closely follows the conventions of the target language, and outputs it as the final result. The further processing is shown in Fig. 7 of the accompanying drawings and mainly includes the following. D.1 The word format normalization module normalizes the word format of the machine translation result into the conventional form of expression of the target language. For example, in English-to-Chinese translation, the spaces between Chinese words in the translation result are removed; in translation results in Western languages, the space between a full stop or comma and the preceding word is removed. D.2 The case conversion module mainly applies to translation with a Western language as the target language. For example, the first letter of an English sentence must be capitalized, and certain specific terms, such as USA, must also be capitalized. This submodule converts the corresponding lower-case letters in the translation result to upper case. D.3 The format conversion module performs the inverse operation of the A.2 format conversion module, ensuring that the format of the output is consistent with that of the input.
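A minimal Python sketch of steps D.1 and D.2 follows. The function names, the `target_lang` codes, and the `SPECIAL_TERMS` table are hypothetical, and treating CJK punctuation like Chinese characters when removing spaces is an added assumption. Step D.3 is omitted because it depends on the input format handled by module A.2.

```python
import re

# Character ranges treated as Chinese text. Including CJK punctuation and
# full-width forms is an assumption made for this sketch.
CJK = r"\u3000-\u303f\u4e00-\u9fff\uff00-\uffef"

# Hypothetical table of terms that are always written in upper case.
SPECIAL_TERMS = {"usa": "USA"}


def normalize_word_format(text, target_lang):
    """D.1: bring spacing in line with target-language conventions."""
    if target_lang == "zh":
        # Remove spaces between adjacent Chinese characters.
        return re.sub(rf"(?<=[{CJK}])\s+(?=[{CJK}])", "", text)
    # Western languages: remove the space before a full stop or comma.
    return re.sub(r"\s+([.,])", r"\1", text)


def convert_case(text):
    """D.2: capitalize special terms and the first letter of the sentence."""
    text = re.sub(
        r"\b\w+\b",
        lambda m: SPECIAL_TERMS.get(m.group(0).lower(), m.group(0)),
        text,
    )
    return text[:1].upper() + text[1:]


def postprocess(text, target_lang):
    """Apply D.1, then D.2 when the target is a Western language."""
    text = normalize_word_format(text, target_lang)
    if target_lang != "zh":
        text = convert_case(text)
    return text
```

For instance, `postprocess("i live in the usa .", "en")` closes up the final full stop and restores the capitals, while `postprocess("我 喜欢 猫 。", "zh")` removes the spaces that the decoder left between Chinese words.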
In summary, the machine translation system of the present invention can translate sentences and documents from one language into another in real time. The system can translate text that has complete sentences, correct expression, and punctuation marks, and can also translate speech that has no paragraph segmentation, may contain incomplete sentences, lacks punctuation marks, and contains noise. The invention improves the translation accuracy of low-probability words and phrases: low-probability items such as numbers, dates, times, and URLs are each tagged and translated first. The preprocessing module of the invention normalizes the input sentences, and the post-processing module of the invention improves the fluency of the translation result.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A multilingual intelligent-preprocessing real-time statistical machine translation system, characterized by comprising:
a receiving module, configured to check that the input to the system is well-formed, the receiving module comprising a text-language receiving module and a speech-recognition-result receiving module, wherein the text-language receiving module performs sentence segmentation and format conversion on text, and the speech-recognition-result receiving module performs segmentation, noise elimination, and format conversion on speech;
a preprocessing module, comprising a text preprocessing module and a speech-recognition-result preprocessing module, wherein the text preprocessing module performs word normalization, class identification and tagging, and chunk reordering on text input, and the speech-recognition-result preprocessing module performs word normalization and punctuation prediction on speech;
a machine translation module, configured to learn phrase-to-phrase translations, to find the corresponding translated phrase for each phrase processed by the preprocessing module, and to join the phrases into a complete sentence;
a post-processing module, configured to apply word format normalization, case normalization, and format normalization to the translation result so that it more closely follows the conventions of the target language, and to output it as the final result.
2. The multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 1, characterized in that the text-language receiving module comprises a sentence segmentation module and a format conversion module; the sentence segmentation module splits the input text at punctuation marks so that the basic unit translated by the subsequent machine translation module is a single sentence; the format conversion module converts the different formats of language text into a format supported by the machine translation module.
3. The multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 2, characterized in that the format supported by the machine translation module is plain text or XML.
4. The multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 1, characterized in that the speech-recognition-result receiving module comprises a sentence segmentation module and a noise elimination module; the sentence segmentation module splits the input speech text stream into sentences according to the pauses between words; the noise elimination module removes adjacent repeated fragments from the input spoken text stream.
5. The multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 1, characterized in that the text preprocessing module comprises a word normalization module, a class identification and tagging module, and a chunk reordering module; the word normalization module brings the language to be translated closer to the target language at the word level; the class identification and tagging module tags numbers, dates, times, and URLs in the text to be translated as $number, $date, $hour, and $www respectively, and translates the content of each class into the target language in advance; the chunk reordering module performs syntactic analysis on the sentence to be translated and then reorders its chunks according to automatically learned rules, so that the word order of the language to be translated is closer to that of the target language.
6. The multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 1, characterized in that the speech-recognition-result preprocessing module comprises a word normalization module and a punctuation prediction module; the word normalization module brings the word granularity of the language to be translated closer to that of the target language; the punctuation prediction module determines the positions of full stops in the speech recognition output according to the context and the pauses between words; the speech-recognition-result preprocessing module accepts speech recognition results in the form of plain text or a confusion network.
7. The multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 1, characterized in that the machine translation module comprises a training module and a translation module; the training module uses the GIZA++ toolkit to learn phrase-to-phrase translations from a large-scale balanced corpus; the translation module divides each input sentence into phrase fragments and translates each phrase fragment according to the training results of the training module; the translation process of the translation module is a search process, namely finding the optimal translation combination among the combinations formed from the translation results of the translation sub-models, the optimal translation combination being the final translation result.
8. The multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 7, characterized in that the translation sub-models comprise a phrase translation model, a language model, a distortion model, a part-of-speech-based language model, a bilingual language model, and a domain adaptation model.
9. The multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 1, characterized in that the post-processing module comprises a word format normalization module, a case conversion module, and a format conversion module; the word format normalization module normalizes the word format of the machine translation result into the form of expression of the target language; the case conversion module is used for translation with a Western language as the target language; the format conversion module keeps the format of the translated target language consistent with the format of the language to be translated.
10. The multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 9, characterized in that the case conversion module changes the first letter of a sentence and the letters of proper nouns in the target language to upper case.
CN201710203439.8A 2017-03-30 2017-03-30 Multi-language intelligent preprocessing real-time statistics machine translation system Expired - Fee Related CN107066455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710203439.8A CN107066455B (en) 2017-03-30 2017-03-30 Multi-language intelligent preprocessing real-time statistics machine translation system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710203439.8A CN107066455B (en) 2017-03-30 2017-03-30 Multi-language intelligent preprocessing real-time statistics machine translation system

Publications (2)

Publication Number Publication Date
CN107066455A true CN107066455A (en) 2017-08-18
CN107066455B CN107066455B (en) 2020-07-28

Family

ID=59601701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710203439.8A Expired - Fee Related CN107066455B (en) 2017-03-30 2017-03-30 Multi-language intelligent preprocessing real-time statistics machine translation system

Country Status (1)

Country Link
CN (1) CN107066455B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101361064A (en) * 2005-12-16 2009-02-04 Emil有限公司 A text editing apparatus and method
CN102650987A (en) * 2011-02-25 2012-08-29 北京百度网讯科技有限公司 Machine translation method and device both based on source language repeat resource
CN103116578A (en) * 2013-02-07 2013-05-22 北京赛迪翻译技术有限公司 Translation method integrating syntactic tree and statistical machine translation technology and translation device
CN103164399A (en) * 2013-02-26 2013-06-19 北京捷通华声语音技术有限公司 Punctuation addition method and device in speech recognition
CN103956162A (en) * 2014-04-04 2014-07-30 上海元趣信息技术有限公司 Voice recognition method and device oriented towards child
CN104391839A (en) * 2014-11-13 2015-03-04 百度在线网络技术(北京)有限公司 Method and device for machine translation
US20160147740A1 (en) * 2014-11-24 2016-05-26 Microsoft Technology Licensing, Llc Adapting machine translation data using damaging channel model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姚树杰: "面向统计机器翻译的语料处理与评价技术研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107783968A (en) * 2017-11-23 2018-03-09 浪潮金融信息技术有限公司 A kind of language transfer method, device, computer-readable recording medium and storage control
CN108519963A (en) * 2018-03-02 2018-09-11 山东科技大学 A method of procedural model is automatically converted to multi-language text
CN108563644A (en) * 2018-03-29 2018-09-21 河南工学院 A kind of English Translation electronic system
CN108647267A (en) * 2018-04-28 2018-10-12 广东金贝贝智能机器人研究院有限公司 One kind being based on internet big data robot Internet of things system
CN109213851A (en) * 2018-07-04 2019-01-15 中国科学院自动化研究所 Across the language transfer method of speech understanding in conversational system
CN110858268A (en) * 2018-08-20 2020-03-03 北京紫冬认知科技有限公司 Method and system for detecting unsmooth phenomenon in voice translation system
CN110858268B (en) * 2018-08-20 2024-03-08 北京紫冬认知科技有限公司 Method and system for detecting unsmooth phenomenon in voice translation system
CN115455988A (en) * 2018-12-29 2022-12-09 苏州七星天专利运营管理有限责任公司 High-risk statement processing method and system
CN110032934A (en) * 2019-03-07 2019-07-19 永德利硅橡胶科技(深圳)有限公司 The implementation method and Related product of Quan Yutong based on picture
WO2021057908A1 (en) * 2019-09-29 2021-04-01 深圳市万普拉斯科技有限公司 Instant translation display method and device, mobile terminal, and computer storage medium
CN111401052A (en) * 2020-04-24 2020-07-10 南京莱科智能工程研究院有限公司 Semantic understanding-based multilingual text matching method and system
CN111654658A (en) * 2020-06-17 2020-09-11 平安科技(深圳)有限公司 Audio and video call processing method and system, coder and decoder and storage device
CN111654658B (en) * 2020-06-17 2022-04-15 平安科技(深圳)有限公司 Audio and video call processing method and system, coder and decoder and storage device
CN113706977A (en) * 2020-08-13 2021-11-26 苏州韵果莘莘影视科技有限公司 Playing method and system based on intelligent sign language translation software
CN112764535A (en) * 2021-01-08 2021-05-07 温州职业技术学院 System for realizing multi-language information exchange
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN116050420A (en) * 2022-11-12 2023-05-02 武汉大学 Chinese and French voice semantic recognition method and device based on preposition sentence
CN116050420B (en) * 2022-11-12 2023-09-22 武汉大学 Chinese and French voice semantic recognition method and device based on preposition sentence
CN116453132A (en) * 2023-06-14 2023-07-18 成都锦城学院 Japanese kana and Chinese character recognition method, equipment and memory based on machine translation
CN116453132B (en) * 2023-06-14 2023-09-05 成都锦城学院 Japanese kana and Chinese character recognition method, equipment and memory based on machine translation
CN116822517A (en) * 2023-08-29 2023-09-29 百舜信息技术有限公司 Multi-language translation term identification method
CN116822517B (en) * 2023-08-29 2023-11-10 百舜信息技术有限公司 Multi-language translation term identification method

Also Published As

Publication number Publication date
CN107066455B (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN107066455A (en) A kind of multilingual intelligence pretreatment real-time statistics machine translation system
CN111382580B (en) Encoder-decoder framework pre-training method for neural machine translation
CN101131691B (en) Domain-adaptive portable machine translation device for translating closed captions using dynamic translation resources and method thereof
CN107038160A (en) The pretreatment module of multilingual intelligence pretreatment real-time statistics machine translation system
KR101762866B1 (en) Statistical translation apparatus by separating syntactic translation model from lexical translation model and statistical translation method
Ueffing et al. Improved models for automatic punctuation prediction for spoken and written text.
CN110264992B (en) Speech synthesis processing method, apparatus, device and storage medium
WO2010046782A2 (en) Hybrid machine translation
Kaur et al. Review of machine transliteration techniques
CN106803422A (en) A kind of language model re-evaluation method based on memory network in short-term long
CN109410949B (en) Text content punctuation adding method based on weighted finite state converter
Xu et al. Do we need Chinese word segmentation for statistical machine translation?
US11907665B2 (en) Method and system for processing user inputs using natural language processing
CN104679735A (en) Pragmatic machine translation method
CN106156013A (en) The two-part machine translation method that a kind of regular collocation type phrase is preferential
Seljan et al. Human Quality Evaluation of Machine-Translated Poetry
Tennage et al. Transliteration and byte pair encoding to improve tamil to sinhala neural machine translation
Saini et al. Disfluency correction using unsupervised and semi-supervised learning
Siahbani et al. Simultaneous translation using optimized segmentation
CN109446537B (en) Translation evaluation method and device for machine translation
CN114861628A (en) System, method, electronic device and storage medium for training machine translation model
Neubarth et al. A hybrid approach to statistical machine translation between standard and dialectal varieties
Manzano English to asl translator for speech2signs
CN107368473B (en) Method for realizing voice interaction
Viet et al. Dependency-based pre-ordering for English-Vietnamese statistical machine translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200728

Termination date: 20210330