CN107066455A - Multilingual real-time statistical machine translation system with intelligent preprocessing - Google Patents
Multilingual real-time statistical machine translation system with intelligent preprocessing
- Publication number
- CN107066455A CN201710203439.8A
- Authority
- CN
- China
- Prior art keywords
- module
- language
- translation
- word
- preprocessing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/45—Example-based machine translation; Alignment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/55—Rule-based translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Abstract
The invention discloses a multilingual real-time statistical machine translation system with intelligent preprocessing, comprising a receiving module, a preprocessing module, a machine translation module and a post-processing module. The receiving module comprises a text receiving module and a speech recognition result receiving module. The preprocessing module comprises a text preprocessing module and a speech recognition result preprocessing module. The machine translation module learns phrase-to-phrase translations, finds the corresponding translated phrase for each phrase produced by the preprocessing module, and joins the translated phrases into a complete sentence. The post-processing module applies word-format normalization, case normalization and format normalization to the translation result, so that it more closely follows the conventions of the target language, and outputs it as the final result. The invention can translate both written text and speech, and improves translation accuracy for low-probability words and phrases.
Description
Technical field
The present invention relates to the technical field of artificial-intelligence machine translation, and in particular to a multilingual real-time statistical machine translation system with intelligent preprocessing.
Background
Machine translation is the technology of using a computer to translate human natural language automatically: a process in which a computer converts one natural language into another, such that the two are equivalent in meaning.
At present, a relatively mature and mainstream approach is statistical machine translation. Its advantage is that almost no translation rules need to be written by hand; all translation knowledge is learned automatically from corpora. The method therefore makes full use of the computer's high-speed computation and greatly reduces labor cost.
Machine translation based on statistical models learns phrase translations from a language A to a language B from a parallel corpus. To translate a new sentence, the input sentence in language A is decomposed into phrases and, according to the learned co-occurrence probabilities of phrase (language A) and phrase (language B) pairs, translated into a sentence in language B. The whole learning and translation process relies entirely on statistical models.
However, machine translation based on co-occurrence frequency and probability handles low-probability phrases (for example, proper nouns) poorly. In addition, how to bring grammatical and semantic information into the statistical model, so that the generated translation better matches the way people express themselves, is also a problem that current machine translation technology needs to solve.
Summary of the invention
In view of the above technical problems in the related art, the present invention proposes a multilingual real-time statistical machine translation system with intelligent preprocessing that overcomes the above deficiencies of the prior art.
To achieve the above technical purpose, the technical solution of the invention is realized as follows:
A multilingual real-time statistical machine translation system with intelligent preprocessing, comprising:
A receiving module, which checks that the system input is well formed. The receiving module comprises a text receiving module and a speech recognition result receiving module. The text receiving module performs sentence segmentation and format conversion on text input; the speech recognition result receiving module performs segmentation, noise elimination and format conversion on speech input;
A preprocessing module, which comprises a text preprocessing module and a speech recognition result preprocessing module. The text preprocessing module performs word normalization, category recognition and tagging, and chunk reordering on text input; the speech recognition result preprocessing module performs word normalization and punctuation prediction on speech input;
A machine translation module, which learns phrase-to-phrase translations, finds the corresponding translated phrase for each phrase produced by the preprocessing module, and generates a complete sentence;
A post-processing module, which applies word-format normalization, case normalization and format normalization to the translation result, so that it more closely follows the conventions of the target language, and outputs it as the final result.
Further, the text receiving module comprises a sentence segmentation module and a format conversion module. The sentence segmentation module breaks the input text at punctuation marks, so that the basic unit translated by the subsequent machine translation module is one sentence. The format conversion module converts the various formats of input text into a format supported by the machine translation module.
Preferably, the formats supported by the machine translation module are plain text and XML.
Further, the speech recognition result receiving module comprises a sentence segmentation module and a noise elimination module. The sentence segmentation module splits the input speech text stream into sentences according to the pauses between words. The noise elimination module removes adjacent repeated fragments from the spoken-language text stream.
Further, the text preprocessing module comprises a word normalization module, a category recognition and tagging module, and a chunk reordering module. The word normalization module brings the language to be translated closer to the target language at the word level. The category recognition and tagging module tags numbers, dates, times and URLs in the text to be translated as $number, $date, $hour and $www respectively, and translates the content of each category into the target language in advance. The chunk reordering module parses the sentence to be translated and then reorders its chunks according to automatically learned rules, so that the word order of the language to be translated is closer to that of the target language.
Further, the speech recognition result preprocessing module comprises a word normalization module and a punctuation prediction module. The word normalization module brings the word granularity of the language to be translated closer to that of the target language. The punctuation prediction module predicts the positions of full stops in the speech recognition output from the pauses between words in context. The speech recognition result preprocessing module accepts speech recognition results as plain text or as a confusion network.
Further, the machine translation module comprises a training module and a translation module. The training module uses the GIZA++ toolkit to learn phrase-to-phrase translations from a large-scale parallel corpus. The translation module divides each input sentence into phrase fragments and translates each fragment according to the results of the training module. Translation is a search process: the best combination is selected from the candidate combinations formed by the outputs of the translation sub-models, and that best combination is the final translation result.
Preferably, the translation sub-models include a phrase translation model, a language model, a distortion (word order change) model, a part-of-speech-based language model, a bilingual language model and a domain adaptation model.
Further, the post-processing module comprises a word-format normalization module, a case conversion module and a format conversion module. The word-format normalization module converts the words and formatting of the machine translation result into the form of expression used in the target language. The case conversion module is used when the target language is a Western language. The format conversion module makes the format of the translated output consistent with the format of the input to be translated.
Preferably, the case conversion module capitalizes the first letter of each sentence and the letters of proper nouns in the target language.
Beneficial effects of the invention: the machine translation system of the invention can translate sentences and documents in one language into another language in real time. The system can translate written text whose sentences are complete, correctly expressed and punctuated, and it can also translate speech that has no paragraph segmentation, may contain incomplete sentences, has no punctuation and contains noise within sentences. The invention improves translation accuracy for low-probability words and phrases by tagging low-probability items such as numbers, dates, times and URLs separately and translating them first. The preprocessing module of the invention normalizes the input sentences, and the post-processing module of the invention improves the fluency of the translation result.
Brief description of the drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the translation flow chart of the multilingual real-time statistical machine translation system with intelligent preprocessing according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the text receiving module of the system according to an embodiment of the invention;
Fig. 3 is a schematic diagram of the speech recognition result receiving module of the system according to an embodiment of the invention;
Fig. 4 is a schematic diagram of the text preprocessing module of the system according to an embodiment of the invention;
Fig. 5 is a schematic diagram of the speech recognition result preprocessing module of the system according to an embodiment of the invention;
Fig. 6 is a schematic diagram of the machine translation module of the system according to an embodiment of the invention;
Fig. 7 is a schematic diagram of the post-processing module of the system according to an embodiment of the invention.
Detailed description of embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention fall within the scope of protection of the invention.
As shown in Figs. 1-7, a multilingual real-time statistical machine translation system with intelligent preprocessing according to an embodiment of the invention comprises:
A receiving module, which checks that the system input is well formed. The receiving module comprises a text receiving module and a speech recognition result receiving module. The text receiving module performs sentence segmentation and format conversion on text input; the speech recognition result receiving module performs segmentation, noise elimination and format conversion on speech input;
A preprocessing module, which comprises a text preprocessing module and a speech recognition result preprocessing module. The text preprocessing module performs word normalization, category recognition and tagging, and chunk reordering on text input; the speech recognition result preprocessing module performs word normalization and punctuation prediction on speech input;
A machine translation module, which learns phrase-to-phrase translations, finds the corresponding translated phrase for each phrase produced by the preprocessing module, and generates a complete sentence;
A post-processing module, which applies word-format normalization, case normalization and format normalization to the translation result, so that it more closely follows the conventions of the target language, and outputs it as the final result.
In one embodiment, the text receiving module comprises a sentence segmentation module and a format conversion module. The sentence segmentation module breaks the input text at punctuation marks, so that the basic unit translated by the subsequent machine translation module is one sentence. The format conversion module converts the various formats of input text into a format supported by the machine translation module.
In one embodiment, the formats supported by the machine translation module are plain text and XML.
In one embodiment, the speech recognition result receiving module comprises a sentence segmentation module and a noise elimination module. The sentence segmentation module splits the input speech text stream into sentences according to the pauses between words. The noise elimination module removes adjacent repeated fragments from the spoken-language text stream.
In one embodiment, the text preprocessing module comprises a word normalization module, a category recognition and tagging module, and a chunk reordering module. The word normalization module brings the language to be translated closer to the target language at the word level. The category recognition and tagging module tags numbers, dates, times and URLs in the text to be translated as $number, $date, $hour and $www respectively, and translates the content of each category into the target language in advance. The chunk reordering module parses the sentence to be translated and then reorders its chunks according to automatically learned rules, so that the word order of the language to be translated is closer to that of the target language.
In one embodiment, the speech recognition result preprocessing module comprises a word normalization module and a punctuation prediction module. The word normalization module brings the word granularity of the language to be translated closer to that of the target language. The punctuation prediction module predicts the positions of full stops in the speech recognition output from the pauses between words in context. The speech recognition result preprocessing module accepts speech recognition results as plain text or as a confusion network.
In one embodiment, the machine translation module comprises a training module and a translation module. The training module uses the GIZA++ toolkit to learn phrase-to-phrase translations from a large-scale parallel corpus. The translation module divides each input sentence into phrase fragments and translates each fragment according to the results of the training module. Translation is a search process: the best combination is selected from the candidate combinations formed by the outputs of the translation sub-models, and that best combination is the final translation result.
In one embodiment, the translation sub-models include a phrase translation model, a language model, a distortion (word order change) model, a part-of-speech-based language model, a bilingual language model and a domain adaptation model.
In one embodiment, the post-processing module comprises a word-format normalization module, a case conversion module and a format conversion module. The word-format normalization module converts the words and formatting of the machine translation result into the form of expression used in the target language. The case conversion module is used when the target language is a Western language. The format conversion module makes the format of the translated output consistent with the format of the input to be translated.
In one embodiment, the case conversion module capitalizes the first letter of each sentence and the letters of proper nouns in the target language.
To make the above technical solution easier to understand, it is described in detail below through a specific mode of use.
In specific use, the multilingual real-time statistical machine translation system with intelligent preprocessing according to the invention comprises a receiving module, a preprocessing module, a translation module and a post-processing module.
The receiving module checks that the system input is well formed, and comprises a text receiving module and a speech recognition result receiving module. The text receiving module consists mainly of two parts, as shown in Fig. 2: a sentence segmentation module and a format conversion module. A.1 The sentence segmentation module breaks the input text at full stops, question marks and exclamation marks, so that the basic unit translated by the subsequent machine translation module is a sentence. When the input text contains HTML tags, the content between a pair of HTML tags forms a sentence on its own; this ensures that it is translated as a complete sentence rather than as part of the text outside the tags. The subsequent modules of the pipeline support translation of plain text and XML text. A.2 When the input text is in another format, such as PDF or an image, the format conversion module converts it into plain text or XML.
The speech recognition result receiving module also consists mainly of two parts, as shown in Fig. 3: a sentence segmentation module and a noise elimination module. A.3 The sentence segmentation module splits the input text stream into sentences according to the pauses between words: when a pause is longer than 0.5 s, a new sentence is considered to start after that pause. A.4 The noise elimination module removes adjacent repeated fragments from the spoken-language text stream; for example, "uh uh" is reduced to "uh", and "that is, that is, we must ..." is reduced to "that is, we must ...". The subsequent modules of the machine translation system accept speech recognition results as plain text or as a confusion network.
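The behaviour of the speech-side sub-modules A.3 and A.4 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the (word, start, end) tuple layout, the function names and the maximum fragment length of four tokens are assumptions made for this example; only the 0.5 s threshold and the "uh uh" example come from the description.

```python
from typing import List, Optional, Tuple

PAUSE_THRESHOLD_S = 0.5  # pause length stated in the description

# (word, start_time, end_time) in seconds; this layout is an assumption.
TimedWord = Tuple[str, float, float]


def split_on_pauses(words: List[TimedWord],
                    threshold: float = PAUSE_THRESHOLD_S) -> List[List[str]]:
    """Start a new sentence after every pause longer than `threshold`."""
    sentences: List[List[str]] = []
    current: List[str] = []
    previous_end: Optional[float] = None
    for word, start, end in words:
        if current and previous_end is not None and start - previous_end > threshold:
            sentences.append(current)
            current = []
        current.append(word)
        previous_end = end
    if current:
        sentences.append(current)
    return sentences


def remove_adjacent_repeats(tokens: List[str], max_len: int = 4) -> List[str]:
    """Collapse immediately repeated fragments of up to `max_len` tokens."""
    out: List[str] = []
    for token in tokens:
        out.append(token)
        changed = True
        while changed:
            changed = False
            for n in range(1, max_len + 1):
                # If the last n tokens repeat the n tokens before them, drop one copy.
                if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                    del out[-n:]
                    changed = True
                    break
    return out
```

Note that this simple rule also collapses deliberate repetitions such as "very very", so a practical system would likely need a more selective criterion.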
The preprocessing module performs several operations on the input language A to bring it closer to the target language B, so that the subsequent machine translation module achieves better translation quality. The preprocessing module comprises a text preprocessing module and a speech recognition result preprocessing module.
The text preprocessing module consists mainly of three parts, as shown in Fig. 4. B.1 The word normalization module brings the source language A closer to the target language B at the word level. For example, in Chinese-English translation, the Chinese text is segmented into words and spaces are inserted between the words. In German-English translation, German compound words are split, which increases the number of one-to-one word correspondences between German and English sentences. B.2 The category recognition and tagging module tags numbers, dates, times and URLs in the source language A with the corresponding categories $number, $date, $hour and $www. The content of each category is translated into the target language B in advance by rule, and the subsequent machine translation module does not translate it again. B.3 The chunk reordering module first parses the source-language sentence, either by recognizing phrases automatically or by generating a syntax tree. It then reorders the chunks of the source sentence according to automatically learned (phrase-based) rules, so that the source word order is closer to that of the target language. The reordered sentence can be output as the single best word order, or several good word orders can be output as a word lattice. This module is optional; whether to enable it depends on factors such as whether a well-performing parser is available for the source language.
The speech recognition result preprocessing module consists of two parts, as shown in Fig. 5. B.4 The word normalization module is similar to the B.1 word normalization module: it likewise works at the word level of the source language to bring the word granularity of sentences in language A closer to that of the target language B. B.5 The punctuation prediction module predicts the positions of full stops in the speech recognition output from the pauses between words in context. This sub-module is optional and is mainly used for translating recognized speech that is close to written language, such as lectures.
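As an illustration of the B.2 category tagging step, the following Python sketch replaces URLs, dates, times and numbers with the category labels $www, $date, $hour and $number and keeps the original content for separate rule-based translation. The regular expressions and the function name are assumptions chosen for the example; the patent does not specify how the categories are recognized on the source side beyond defining them manually.

```python
import re
from typing import List, Tuple

# Illustrative patterns only. Order matters: URLs, dates and times are
# tried before bare numbers so that their digits are not tagged $number.
CATEGORY_PATTERNS = [
    ("$www", r"(?:https?://|www\.)\S+"),
    ("$date", r"\d{4}-\d{2}-\d{2}"),
    ("$hour", r"\d{1,2}:\d{2}"),
    ("$number", r"\d+(?:\.\d+)?"),
]

_COMBINED = re.compile(
    "|".join(f"(?P<g{i}>{pattern})" for i, (_, pattern) in enumerate(CATEGORY_PATTERNS))
)


def tag_categories(sentence: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the tagged sentence and the list of (label, original content)."""
    found: List[Tuple[str, str]] = []

    def replace(match: "re.Match[str]") -> str:
        index = int(match.lastgroup[1:])  # group name "g2" -> pattern index 2
        label = CATEGORY_PATTERNS[index][0]
        found.append((label, match.group(0)))
        return label

    return _COMBINED.sub(replace, sentence), found
```

The tagged sentence is what would be passed to the translation module, while the stored contents would be translated by the category rules and substituted back afterwards.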
B.2 Category recognition and tagging is based on bilingual semi-automatic category recognition and translation. "Semi-automatic" means that the categories to be recognized are defined manually in one of the two languages (the source language); the corresponding categories and category translations in the other language are then learned automatically from a parallel corpus and word alignment. Taking English-Chinese translation as an example, the categories to be recognized, $number, $date, $hour and $www, are first defined on the English side. The content of each category may include several words. Then all numerals are identified on the Chinese side and tagged $bnumber, and words related to the World Wide Web (www, http, .com and so on) are tagged $bwww. Here $bnumber and $bwww are the cores of the categories in Chinese. On the basis of these cores, some preceding and following words must also be included to form the final Chinese category that corresponds to the English category. Which surrounding words to include is extracted automatically through word alignment: the Chinese word aligned to a boundary word of the English category is likely to be a boundary word of the Chinese category. Once the boundary words of the Chinese category are determined, the extracted Chinese category content also implies the Chinese translation of the corresponding English category. From this, translation rules between English categories and Chinese categories are learned, for example:
$number{2} → $number{2}
$number{2 one-tenth} → $number{20%}
$number{the 2nd} → $number{2nd}
Rules extracted in this way fit the actual data better and reduce the errors that manually defined rules produce in practice. Compared with the traditional approach of defining categories and rules separately in both languages, this method is more efficient, does not require the rule writer to be familiar with both languages, and greatly reduces the probability of rule mismatches between the two languages, thereby improving machine translation quality.
B.3 The chunk reordering method adds grammatical constraints to the statistical translation system in the area of word order adjustment. When one language is translated into another, the order in which words are expressed often differs because of differences in grammar and usage. To complete a translation, besides translating words or phrases into the other language, the translated phrases must also be placed in suitable positions. In a statistical translation system the basic unit, the phrase, is any string of words; it is not required to be a grammatical constituent. As a result, when moved chunks are spliced back together they often produce very odd translations. The present invention introduces information about grammatical phrases through shallow parsing in the preprocessing stage. In the subsequent phrase movement step, only phrases that satisfy the grammatical constraints are moved, which improves the correctness and fluency of the translation result. The specific steps are:
Shallow parsing is performed on the source language to generate grammatical information such as NP (noun phrase), VP (verb phrase) and PP (prepositional phrase).
Reordering rules, and the probability of each rule, are learned through word alignment. Learned rules are, for example:
DNP NP VP → DNP NP VP (0.89)
DNP NP VP → NP DNP VP (0.11)
That is, the probability that the phrase sequence DNP NP VP keeps its order is 0.89, and the probability that it becomes NP DNP VP is 0.11. These rules are then applied to the source-language input sentence. Applying different combinations of rules produces different phrase-order variants, and all of these variants are represented as a word lattice. The probability of each path in the lattice is calculated from the rule probabilities. The best path, or the whole lattice, serves as the new input to the subsequent machine translation module.
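The rule application described above can be sketched in Python as follows. This is a simplified illustration rather than the patent's method: it enumerates the alternative chunk orders explicitly instead of building a compact word lattice, applies rules left to right without overlap, and treats chunks that match no rule as fixed with probability 1.0. The rule table holds only the example rule from the text; the data layout and function name are assumptions.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, str]               # (chunk tag, chunk text)
Option = Tuple[Tuple[int, ...], float]  # (permutation of positions, probability)

# Example rule from the description:
#   DNP NP VP -> DNP NP VP (0.89), DNP NP VP -> NP DNP VP (0.11)
RULES: Dict[Tuple[str, ...], List[Option]] = {
    ("DNP", "NP", "VP"): [((0, 1, 2), 0.89), ((1, 0, 2), 0.11)],
}


def reorderings(chunks: List[Chunk],
                rules: Dict[Tuple[str, ...], List[Option]] = RULES
                ) -> List[Tuple[float, List[Chunk]]]:
    """Return every chunk order licensed by the rules, most probable first."""

    def expand(i: int) -> List[Tuple[float, List[Chunk]]]:
        if i == len(chunks):
            return [(1.0, [])]
        paths: List[Tuple[float, List[Chunk]]] = []
        matched = False
        for pattern, options in rules.items():
            n = len(pattern)
            window = chunks[i:i + n]
            if tuple(tag for tag, _ in window) == pattern:
                matched = True
                for order, prob in options:
                    head = [window[j] for j in order]
                    for tail_prob, tail in expand(i + n):
                        # Path probability is the product of the rule probabilities.
                        paths.append((prob * tail_prob, head + tail))
        if not matched:
            # No rule starts here: keep the chunk in place.
            for tail_prob, tail in expand(i + 1):
                paths.append((tail_prob, [chunks[i]] + tail))
        return paths

    return sorted(expand(0), key=lambda path: path[0], reverse=True)
```

The first element of the returned list corresponds to the best path; passing the whole list onward plays the role of handing the full lattice to the translation module.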
The machine translation module is divided into a training module and a translation module. The training module is used in the training stage, in which the GIZA++ toolkit is used to learn phrase-to-phrase translations (with probability values) from a large-scale parallel corpus. The translation module is used in the translation stage, in which each input sentence is divided into phrase fragments (a phrase here is not necessarily a grammatical phrase). For each phrase fragment, the corresponding phrase translation is looked up in the training results, and these phrase translations are spliced into a complete sentence in the target language. Because a source sentence can be split into phrases in many ways, and each phrase in turn has several possible translations, translation is essentially a search process: the best combination must be found among the different splicings, and this is the final translation result. During the search, several sub-models are used to help find the best path. The required sub-models are the phrase translation model (Translation Model) and the language model (Language Model). Other sub-models, such as the word order change model (Distortion Model), the part-of-speech-based language model (POS Language Model), the bilingual language model (Bilingual Language Model) and the domain adaptation model (Adaptation Model), can be enabled or disabled according to actual needs.
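The search over phrase segmentations and phrase translations can be illustrated with a toy Python decoder. It uses only the two required sub-models, a phrase translation model and a language model, and searches exhaustively without reordering, which is feasible only for short sentences. The phrase table, the bigram probabilities, the fallback probability of 0.01 and all function names are invented for this example; a real system would learn the table from a parallel corpus and use a pruned search.

```python
import math
from typing import Dict, Iterator, List, Optional, Tuple

# Toy phrase table: source phrase -> [(target phrase, probability)].
PHRASE_TABLE: Dict[Tuple[str, ...], List[Tuple[str, float]]] = {
    ("我",): [("I", 0.9)],
    ("喜欢",): [("like", 0.7), ("love", 0.3)],
    ("机器", "翻译"): [("machine translation", 0.95)],
    ("机器",): [("machine", 0.8)],
    ("翻译",): [("translate", 0.6), ("translation", 0.4)],
}

# Toy bigram language model; unseen bigrams get a small fallback probability.
BIGRAM_LM: Dict[Tuple[str, str], float] = {
    ("<s>", "I"): 0.5,
    ("I", "like"): 0.4,
    ("like", "machine"): 0.3,
    ("machine", "translation"): 0.6,
}
UNSEEN_BIGRAM_PROB = 0.01


def lm_logprob(words: List[str]) -> float:
    """Log-probability of the target word sequence under the bigram model."""
    pairs = zip(["<s>"] + words, words)
    return sum(math.log(BIGRAM_LM.get(pair, UNSEEN_BIGRAM_PROB)) for pair in pairs)


def candidates(source: List[str], start: int = 0) -> Iterator[Tuple[float, List[str]]]:
    """Yield (translation-model log-score, target words) for every way of
    segmenting source[start:] into known phrases and translating each one."""
    if start == len(source):
        yield 0.0, []
        return
    for end in range(start + 1, len(source) + 1):
        for target, prob in PHRASE_TABLE.get(tuple(source[start:end]), []):
            for tail_score, tail in candidates(source, end):
                yield math.log(prob) + tail_score, target.split() + tail


def decode(source: List[str]) -> Optional[str]:
    """Return the candidate with the best combined model score, if any."""
    best = max(candidates(source),
               key=lambda c: c[0] + lm_logprob(c[1]),
               default=None)
    return " ".join(best[1]) if best is not None else None
```

With these toy values, the two-word phrase pair for "机器 翻译" wins over the word-by-word alternatives because both its translation probability and the language model score are higher.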
The post-processing module further processes the translation result so that it is closer to the conventions of
expression of the target language, and outputs it as the final result. The further processing, shown in Fig. 7 of
the accompanying drawings, mainly includes the following. D.1, the word-format standardization module, converts the
words and formatting in the machine translation result into the conventional form of the target language. For
example, in English-to-Chinese translation, the spaces between Chinese words in the translation result are removed;
in translation results in Western languages, the space between a full stop or comma and the preceding word is
removed. D.2, the case conversion module, mainly applies when a Western language is the target language. For
example, the first letter of an English sentence must be capitalized, and certain specific terms, such as USA, must
also be capitalized. This submodule converts the corresponding lowercase letters in the translation result to
uppercase. D.3, the format conversion module, performs the inverse operation of format conversion module A.2, that
is, it ensures that the output format is consistent with the input format.
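The D.1 and D.2 steps can be sketched with a few regular expressions. This is an illustrative sketch only: the character ranges, the term list `PROPER_TERMS`, and the function names are assumptions, not taken from the patent, and D.3 (format restoration) is omitted because it depends on the input format.

```python
import re

# Assumed character ranges: CJK ideographs, CJK punctuation, full-width forms.
CJK = r"\u4e00-\u9fff\u3000-\u303f\uff00-\uffef"

# Hypothetical list of terms that must always be written in capitals.
PROPER_TERMS = {"usa": "USA"}


def remove_cjk_spaces(text):
    """D.1 for a Chinese target: drop spaces between adjacent Chinese characters."""
    return re.sub(rf"(?<=[{CJK}])\s+(?=[{CJK}])", "", text)


def attach_punctuation(text):
    """D.1 for a Western target: drop the space before a full stop or comma."""
    return re.sub(r"\s+([.,])", r"\1", text)


def restore_case(text):
    """D.2: capitalize listed terms and the first letter of the sentence."""
    words = [PROPER_TERMS.get(word.lower(), word) for word in text.split(" ")]
    text = " ".join(words)
    return text[:1].upper() + text[1:]


def postprocess_western(text):
    """Apply case restoration on the tokenized text, then attach punctuation."""
    return attach_punctuation(restore_case(text))


print(remove_cjk_spaces("我 喜欢 茶 。"))
print(postprocess_western("he lives in the usa , not here ."))
```

Case restoration runs before punctuation is attached so that a term such as `usa` is still a separate token when it is looked up.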
In summary, the machine translation system of the present invention can translate sentences and passages of one
language into another language in real time. The system can translate text with complete sentences, correct
expression, and punctuation marks, and can also translate speech that has no paragraph segmentation, possibly
incomplete sentences, no punctuation marks, and noise within sentences. The present invention improves the
translation accuracy of low-probability words and phrases by marking low-probability items such as numbers, dates,
times, and URLs by class and translating them preferentially. The pre-processing module of the present invention
standardizes the input sentences, and the post-processing module of the present invention improves the fluency of
the translation result.
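The class marking of numbers, dates, times, and URLs can be sketched as a single-pass substitution. The class labels are those named in claim 5; the regular expressions, the accepted date and time layouts, and the function name `tag_classes` are assumptions made for illustration. The saved spans would be translated separately and substituted back after decoding, a step not shown here.

```python
import re

# Order matters: URLs, dates, and times are tried before bare numbers.
# The patterns are illustrative assumptions, not taken from the patent.
CLASS_PATTERNS = [
    ("$www", r"(?:https?://|www\.)\S+"),
    ("$date", r"\d{4}-\d{2}-\d{2}"),
    ("$hour", r"\d{1,2}:\d{2}"),
    ("$number", r"\d+(?:\.\d+)?"),
]

# One alternation with a named group per class, so the text is scanned once
# and a span already consumed by an earlier class is never re-tagged.
COMBINED = re.compile(
    "|".join(f"(?P<g{i}>{pattern})" for i, (_, pattern) in enumerate(CLASS_PATTERNS))
)


def tag_classes(sentence):
    """Replace each matched span with its class label.

    Returns the tagged sentence and the list of (label, original span)
    pairs in order of appearance.
    """
    saved = []

    def replace(match):
        label = CLASS_PATTERNS[int(match.lastgroup[1:])][0]
        saved.append((label, match.group(0)))
        return label

    return COMBINED.sub(replace, sentence), saved


tagged, spans = tag_classes("meet at 10:30 on 2017-03-30 see www.example.com pay 25 dollars")
print(tagged)
print(spans)
```

Replacing rare surface forms with a small set of labels means the translation models only need statistics for the labels, which is one way to read the claimed accuracy gain for low-probability items.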
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the
invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the
present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A multilingual intelligent pre-processing real-time statistical machine translation system, characterized by comprising:
a receiving module, used to perform a normalization check on the system input, the receiving module comprising a
text language receiving module and a speech recognition result receiving module, wherein the text language
receiving module is used to perform sentence segmentation and format conversion on text language, and the speech
recognition result receiving module is used to perform segmentation, noise elimination, and format conversion on speech;
a pre-processing module, comprising a text pre-processing module and a speech recognition result pre-processing
module, wherein the text pre-processing module is used to perform word standardization, class identification and
marking, and chunk word-order adjustment on text input, and the speech recognition result pre-processing module is
used to perform word standardization and punctuation prediction on speech;
a machine translation module, used to learn phrase-to-phrase translations, to find the corresponding translated
phrase for each phrase processed by the pre-processing module, and to connect the phrases into a complete sentence;
a post-processing module, used to perform word-format standardization, case standardization, and format
standardization on the translation result so that it is closer to the conventions of expression of the target
language, and to output it as the final result.
2. The multilingual intelligent pre-processing real-time statistical machine translation system according to claim 1,
characterized in that the text language receiving module comprises a sentence segmentation module and a format
conversion module; the sentence segmentation module is used to break the input text at punctuation marks so that
the basic unit translated by the subsequent machine translation module is a single sentence; and the format
conversion module is used to convert the different formats of language text into a format supported by the machine
translation module during translation.
3. The multilingual intelligent pre-processing real-time statistical machine translation system according to claim 2,
characterized in that the format supported by the machine translation module during translation is plain text
format or XML format.
4. The multilingual intelligent pre-processing real-time statistical machine translation system according to claim 1,
characterized in that the speech recognition result receiving module comprises a sentence segmentation module and a
noise elimination module; the sentence segmentation module is used to split the input speech text stream into
sentences according to the pauses between words; and the noise elimination module is used to remove adjacent
repeated fragments from the input spoken-language text stream.
5. The multilingual intelligent pre-processing real-time statistical machine translation system according to claim 1,
characterized in that the text pre-processing module comprises a word standardization module, a class
identification and marking module, and a chunk word-order adjustment module; the word standardization module is
used to bring the language to be translated closer to the target language at the word level; the class
identification and marking module is used to mark numbers, dates, times, and URLs in the text to be translated as
$number, $date, $hour, and $www respectively, and to translate the content of each class into the target language in
advance; and the chunk word-order adjustment module is used to perform syntactic analysis on sentences of the
language to be translated and then adjust the chunk order of the language to be translated according to
automatically learned rules, so that its word order is closer to the word order of the target language.
6. The multilingual intelligent pre-processing real-time statistical machine translation system according to claim 1,
characterized in that the speech recognition result pre-processing module comprises a word standardization module
and a punctuation prediction module; the word standardization module is used to bring the word granularity of the
language to be translated closer to that of the target language; the punctuation prediction module is used to
determine the positions of full stops in the speech recognition output according to the context and the pauses
between words; and the input formats of speech recognition results that the speech recognition result
pre-processing module can accept are plain text and confusion networks.
7. The multilingual intelligent pre-processing real-time statistical machine translation system according to claim 1,
characterized in that the machine translation module comprises a training module and a translation module; the
training module uses the GIZA++ toolkit to learn phrase-to-phrase translations from a large-scale balanced corpus;
the translation module is used to divide each input sentence into phrase fragments and translate each phrase
fragment according to the training results of the training module; the translation process of the translation
module is a search process, that is, the optimal translation combination is found among the translation
combinations formed from the translation results of the translation submodels, and the optimal translation
combination is the final translation result.
8. The multilingual intelligent pre-processing real-time statistical machine translation system according to claim 7,
characterized in that the translation submodels comprise a phrase translation model, a language model, a word order
change model, a part-of-speech-based language model, a bilingual language model, and a domain-adaptive model.
9. The multilingual intelligent pre-processing real-time statistical machine translation system according to claim 1,
characterized in that the post-processing module comprises a word-format standardization module, a case conversion
module, and a format conversion module; the word-format standardization module is used to convert the words and
formatting in the machine translation result into the form of expression of the target language; the case
conversion module is used for translation with a Western language as the target language; and the format
conversion module is used to keep the format of the translated target language consistent with the format of the
language to be translated.
10. The multilingual intelligent pre-processing real-time statistical machine translation system according to claim 9,
characterized in that the case conversion module is used to convert the initial letter and the letters of proper
nouns in the target language to uppercase.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710203439.8A CN107066455B (en) | 2017-03-30 | 2017-03-30 | Multi-language intelligent preprocessing real-time statistics machine translation system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107066455A (en) | 2017-08-18 |
CN107066455B (en) | 2020-07-28 |
Family
ID=59601701
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710203439.8A Expired - Fee Related CN107066455B (en) | 2017-03-30 | 2017-03-30 | Multi-language intelligent preprocessing real-time statistics machine translation system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107066455B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107783968A (en) * | 2017-11-23 | 2018-03-09 | 浪潮金融信息技术有限公司 | A kind of language transfer method, device, computer-readable recording medium and storage control |
CN108519963A (en) * | 2018-03-02 | 2018-09-11 | 山东科技大学 | A method of procedural model is automatically converted to multi-language text |
CN108563644A (en) * | 2018-03-29 | 2018-09-21 | 河南工学院 | A kind of English Translation electronic system |
CN108647267A (en) * | 2018-04-28 | 2018-10-12 | 广东金贝贝智能机器人研究院有限公司 | One kind being based on internet big data robot Internet of things system |
CN109213851A (en) * | 2018-07-04 | 2019-01-15 | 中国科学院自动化研究所 | Across the language transfer method of speech understanding in conversational system |
CN110032934A (en) * | 2019-03-07 | 2019-07-19 | 永德利硅橡胶科技(深圳)有限公司 | The implementation method and Related product of Quan Yutong based on picture |
CN110858268A (en) * | 2018-08-20 | 2020-03-03 | 北京紫冬认知科技有限公司 | Method and system for detecting unsmooth phenomenon in voice translation system |
CN111401052A (en) * | 2020-04-24 | 2020-07-10 | 南京莱科智能工程研究院有限公司 | Semantic understanding-based multilingual text matching method and system |
CN111654658A (en) * | 2020-06-17 | 2020-09-11 | 平安科技(深圳)有限公司 | Audio and video call processing method and system, coder and decoder and storage device |
WO2021057908A1 (en) * | 2019-09-29 | 2021-04-01 | 深圳市万普拉斯科技有限公司 | Instant translation display method and device, mobile terminal, and computer storage medium |
CN112764535A (en) * | 2021-01-08 | 2021-05-07 | 温州职业技术学院 | System for realizing multi-language information exchange |
CN113158695A (en) * | 2021-05-06 | 2021-07-23 | 上海极链网络科技有限公司 | Semantic auditing method and system for multi-language mixed text |
CN113706977A (en) * | 2020-08-13 | 2021-11-26 | 苏州韵果莘莘影视科技有限公司 | Playing method and system based on intelligent sign language translation software |
CN115455988A (en) * | 2018-12-29 | 2022-12-09 | 苏州七星天专利运营管理有限责任公司 | High-risk statement processing method and system |
CN116050420A (en) * | 2022-11-12 | 2023-05-02 | 武汉大学 | Chinese and French voice semantic recognition method and device based on preposition sentence |
CN116453132A (en) * | 2023-06-14 | 2023-07-18 | 成都锦城学院 | Japanese kana and Chinese character recognition method, equipment and memory based on machine translation |
CN116822517A (en) * | 2023-08-29 | 2023-09-29 | 百舜信息技术有限公司 | Multi-language translation term identification method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101361064A (en) * | 2005-12-16 | 2009-02-04 | Emil有限公司 | A text editing apparatus and method |
CN102650987A (en) * | 2011-02-25 | 2012-08-29 | 北京百度网讯科技有限公司 | Machine translation method and device both based on source language repeat resource |
CN103116578A (en) * | 2013-02-07 | 2013-05-22 | 北京赛迪翻译技术有限公司 | Translation method integrating syntactic tree and statistical machine translation technology and translation device |
CN103164399A (en) * | 2013-02-26 | 2013-06-19 | 北京捷通华声语音技术有限公司 | Punctuation addition method and device in speech recognition |
CN103956162A (en) * | 2014-04-04 | 2014-07-30 | 上海元趣信息技术有限公司 | Voice recognition method and device oriented towards child |
CN104391839A (en) * | 2014-11-13 | 2015-03-04 | 百度在线网络技术(北京)有限公司 | Method and device for machine translation |
US20160147740A1 (en) * | 2014-11-24 | 2016-05-26 | Microsoft Technology Licensing, Llc | Adapting machine translation data using damaging channel model |
Non-Patent Citations (1)
Title |
---|
姚树杰: "面向统计机器翻译的语料处理与评价技术研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107783968A (en) * | 2017-11-23 | 2018-03-09 | 浪潮金融信息技术有限公司 | A kind of language transfer method, device, computer-readable recording medium and storage control |
CN108519963A (en) * | 2018-03-02 | 2018-09-11 | 山东科技大学 | A method of procedural model is automatically converted to multi-language text |
CN108563644A (en) * | 2018-03-29 | 2018-09-21 | 河南工学院 | A kind of English Translation electronic system |
CN108647267A (en) * | 2018-04-28 | 2018-10-12 | 广东金贝贝智能机器人研究院有限公司 | One kind being based on internet big data robot Internet of things system |
CN109213851A (en) * | 2018-07-04 | 2019-01-15 | 中国科学院自动化研究所 | Across the language transfer method of speech understanding in conversational system |
CN110858268A (en) * | 2018-08-20 | 2020-03-03 | 北京紫冬认知科技有限公司 | Method and system for detecting unsmooth phenomenon in voice translation system |
CN110858268B (en) * | 2018-08-20 | 2024-03-08 | 北京紫冬认知科技有限公司 | Method and system for detecting unsmooth phenomenon in voice translation system |
CN115455988A (en) * | 2018-12-29 | 2022-12-09 | 苏州七星天专利运营管理有限责任公司 | High-risk statement processing method and system |
CN110032934A (en) * | 2019-03-07 | 2019-07-19 | 永德利硅橡胶科技(深圳)有限公司 | The implementation method and Related product of Quan Yutong based on picture |
WO2021057908A1 (en) * | 2019-09-29 | 2021-04-01 | 深圳市万普拉斯科技有限公司 | Instant translation display method and device, mobile terminal, and computer storage medium |
CN111401052A (en) * | 2020-04-24 | 2020-07-10 | 南京莱科智能工程研究院有限公司 | Semantic understanding-based multilingual text matching method and system |
CN111654658A (en) * | 2020-06-17 | 2020-09-11 | 平安科技(深圳)有限公司 | Audio and video call processing method and system, coder and decoder and storage device |
CN111654658B (en) * | 2020-06-17 | 2022-04-15 | 平安科技(深圳)有限公司 | Audio and video call processing method and system, coder and decoder and storage device |
CN113706977A (en) * | 2020-08-13 | 2021-11-26 | 苏州韵果莘莘影视科技有限公司 | Playing method and system based on intelligent sign language translation software |
CN112764535A (en) * | 2021-01-08 | 2021-05-07 | 温州职业技术学院 | System for realizing multi-language information exchange |
CN113158695A (en) * | 2021-05-06 | 2021-07-23 | 上海极链网络科技有限公司 | Semantic auditing method and system for multi-language mixed text |
CN116050420A (en) * | 2022-11-12 | 2023-05-02 | 武汉大学 | Chinese and French voice semantic recognition method and device based on preposition sentence |
CN116050420B (en) * | 2022-11-12 | 2023-09-22 | 武汉大学 | Chinese and French voice semantic recognition method and device based on preposition sentence |
CN116453132A (en) * | 2023-06-14 | 2023-07-18 | 成都锦城学院 | Japanese kana and Chinese character recognition method, equipment and memory based on machine translation |
CN116453132B (en) * | 2023-06-14 | 2023-09-05 | 成都锦城学院 | Japanese kana and Chinese character recognition method, equipment and memory based on machine translation |
CN116822517A (en) * | 2023-08-29 | 2023-09-29 | 百舜信息技术有限公司 | Multi-language translation term identification method |
CN116822517B (en) * | 2023-08-29 | 2023-11-10 | 百舜信息技术有限公司 | Multi-language translation term identification method |
Also Published As
Publication number | Publication date |
---|---|
CN107066455B (en) | 2020-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107066455A (en) | A kind of multilingual intelligence pretreatment real-time statistics machine translation system | |
CN111382580B (en) | Encoder-decoder framework pre-training method for neural machine translation | |
CN101131691B (en) | Domain-adaptive portable machine translation device for translating closed captions using dynamic translation resources and method thereof | |
CN107038160A (en) | The pretreatment module of multilingual intelligence pretreatment real-time statistics machine translation system | |
KR101762866B1 (en) | Statistical translation apparatus by separating syntactic translation model from lexical translation model and statistical translation method | |
Ueffing et al. | Improved models for automatic punctuation prediction for spoken and written text. | |
CN110264992B (en) | Speech synthesis processing method, apparatus, device and storage medium | |
WO2010046782A2 (en) | Hybrid machine translation | |
Kaur et al. | Review of machine transliteration techniques | |
CN106803422A (en) | A kind of language model re-evaluation method based on memory network in short-term long | |
CN109410949B (en) | Text content punctuation adding method based on weighted finite state converter | |
Xu et al. | Do we need Chinese word segmentation for statistical machine translation? | |
US11907665B2 (en) | Method and system for processing user inputs using natural language processing | |
CN104679735A (en) | Pragmatic machine translation method | |
CN106156013A (en) | The two-part machine translation method that a kind of regular collocation type phrase is preferential | |
Seljan et al. | Human Quality Evaluation of Machine-Translated Poetry | |
Tennage et al. | Transliteration and byte pair encoding to improve tamil to sinhala neural machine translation | |
Saini et al. | Disfluency correction using unsupervised and semi-supervised learning | |
Siahbani et al. | Simultaneous translation using optimized segmentation | |
CN109446537B (en) | Translation evaluation method and device for machine translation | |
CN114861628A (en) | System, method, electronic device and storage medium for training machine translation model | |
Neubarth et al. | A hybrid approach to statistical machine translation between standard and dialectal varieties | |
Manzano | English to asl translator for speech2signs | |
CN107368473B (en) | Method for realizing voice interaction | |
Viet et al. | Dependency-based pre-ordering for English-Vietnamese statistical machine translation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200728 Termination date: 20210330 |