CN107066456A - Receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system

Publication number
CN107066456A
Authority
CN
China
Legal status
Pending
Application number
CN201710203849.2A
Other languages
Chinese (zh)
Inventor
张昱琪
唐亮
Current Assignee
Individual
Original Assignee
Individual
Priority date
2017-03-30
Filing date
2017-03-30
Publication date
2017-08-18

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/42: Data-driven translation
    • G06F40/47: Machine-assisted translation, e.g. using translation memory

Abstract

The invention discloses a receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system. The receiving module performs normalization checks on the system input and comprises a text language receiving module and a speech recognition result receiving module. The text language receiving module performs sentence segmentation and format conversion on text; the speech recognition result receiving module performs segmentation and noise elimination on speech. The receiving module can therefore receive both text and speech. It also performs basic processing on what it receives: for example, the content between a pair of HTML tags forms a sentence on its own; when a pause in speech exceeds 0.5 s, a new sentence is considered to start after the pause; and adjacently repeated fragments are removed from the spoken-language text stream of the input. This makes it easier for the subsequent machine translation module to translate the input, thereby improving translation efficiency and quality.

Description

Receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system
Technical field
The present invention relates to the field of artificial-intelligence machine translation and, in particular, to a receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system.
Background art
Machine translation is the technology of using a computer to translate human natural language automatically, that is, the process of using a computer to convert one natural language into another such that the two are equivalent in meaning.
At present, a relatively mature and mainstream approach to machine translation is the statistics-based method. Its advantage is that almost no translation rules need to be written by hand: all translation information is learned automatically from corpora. The method therefore takes full advantage of the computer's high-speed computation and significantly reduces labor costs.
Machine translation based on a statistical model learns, from a parallel corpus, phrase translations from one language A to another language B. To translate a new sentence, the input sentence in language A is decomposed into phrases, and the sentence is translated into language B according to the learned co-occurrence probabilities of phrase (language A) and phrase (language B) pairs. Both learning and translation rely entirely on the statistical model.
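The phrase-based procedure described above can be illustrated with a toy sketch. This is not the patent's translation engine: the phrase table, its probabilities, the function name `translate`, and the greedy longest-match strategy are all assumptions made for illustration, and a real statistical system would also use reordering and a language model.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase table: source phrase -> candidate (target phrase, probability) pairs.
# In a real system these probabilities would be learned from a parallel corpus.
PhraseTable = Dict[Tuple[str, ...], List[Tuple[str, float]]]


def translate(words: List[str], table: PhraseTable, max_len: int = 3) -> List[str]:
    """Greedy, monotone phrase translation: match the longest known source
    phrase at each position and emit its most probable target phrase."""
    output: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in table:
                best_target, _ = max(table[phrase], key=lambda cand: cand[1])
                output.append(best_target)
                i += n
                break
        else:
            # No phrase of any length matched: pass the word through unchanged.
            output.append(words[i])
            i += 1
    return output


# Invented example data, for illustration only.
toy_table: PhraseTable = {
    ("guten", "morgen"): [("good morning", 0.9), ("good day", 0.1)],
    ("welt",): [("world", 0.8), ("earth", 0.2)],
}
print(translate(["guten", "morgen", "welt"], toy_table))  # ['good morning', 'world']
```

The sketch only shows why the input must arrive as clean, sentence-sized units: the decomposition into phrases operates on one sentence at a time.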
The receiving module of such a system performs basic normalization checks and processing on the text or speech to be translated (for example, the content of a lecture translated as it is delivered), so that the subsequent machine translation runs more smoothly. Existing receiving modules have poor generality: each can receive only text or only speech. In addition, receiving modules for speech run into various problems when processing it.
Summary of the invention
To address the above technical problems in the related art, the present invention proposes a receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system that overcomes the above deficiencies of the prior art.
To achieve the above technical purpose, the technical solution of the present invention is realized as follows:
A receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system, wherein the receiving module is used to perform normalization checks on the system input and comprises a text language receiving module and a speech recognition result receiving module. The text language receiving module performs sentence segmentation and format conversion on text; the speech recognition result receiving module performs segmentation and noise elimination on speech.
Further, the text language receiving module includes a sentence segmentation submodule, which breaks the input text at punctuation marks so that the basic unit translated by the subsequent machine translation module is a single sentence.
Preferably, when the input text contains HTML tags, the content between a pair of HTML tags forms a sentence on its own and is translated as a complete sentence.
Further, the text language receiving module also includes a format conversion submodule, which converts the input text into the plain text or XML format supported by the machine translation module.
Preferably, the input text includes PDF text and/or image text.
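A format conversion step of this kind might be organized as a small dispatcher, sketched below under stated assumptions. The patent does not specify an XML schema or any extraction method, so the element names `doc` and `seg`, the registry `EXTRACTORS`, and the function names are hypothetical. Only a plain-text extractor is registered here; PDF or image extractors (for example, a PDF parser or an OCR step) would be registered in the same way and are deliberately left out.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List


def to_plain_text(sentences: List[str]) -> str:
    """One sentence per line."""
    return "\n".join(sentences)


def to_xml(sentences: List[str]) -> str:
    """Wrap each sentence in a numbered element (element names are assumed)."""
    root = ET.Element("doc")
    for idx, sentence in enumerate(sentences, start=1):
        seg = ET.SubElement(root, "seg", id=str(idx))
        seg.text = sentence  # ElementTree escapes <, > and & on output
    return ET.tostring(root, encoding="unicode")


def extract_txt(data: bytes) -> str:
    return data.decode("utf-8")


# Hypothetical registry of per-format text extractors.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {"txt": extract_txt}


def convert(data: bytes, source_format: str, target: str = "plain") -> str:
    """Extract text from the input and emit plain text or XML."""
    if source_format not in EXTRACTORS:
        raise ValueError(f"no extractor registered for format: {source_format}")
    text = EXTRACTORS[source_format](data)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return to_xml(lines) if target == "xml" else to_plain_text(lines)


print(convert(b"First line.\n\nSecond line.", "txt", target="xml"))
```

Keeping extraction and output formatting separate means a new input format needs only one new extractor, while the translation module continues to see just the two formats it supports.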
Further, the speech recognition result receiving module includes a sentence segmentation submodule, which segments the input speech text stream into sentences according to the pauses between words.
Preferably, when a pause exceeds 0.5 s, the sentence segmentation submodule considers a new sentence to start after the pause.
Further, the speech recognition result receiving module also includes a noise elimination submodule, which removes adjacently repeated fragments from the spoken-language text stream of the input.
Preferably, the formats of speech recognition results that the subsequent modules of the machine translation system can accept are plain text and confusion networks.
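The patent does not define the confusion network format. As commonly used in speech recognition, a confusion network is a sequence of slots, each holding alternative words with posterior probabilities. The sketch below assumes that representation (the type alias `ConfusionNetwork`, the function `best_path`, and the example values are all invented) and shows how the plain-text form relates to it: plain text is the single best path through the network.

```python
from typing import List, Tuple

# Assumed representation: a list of slots, each a list of (word, posterior) pairs.
# An empty string stands for the "no word here" alternative.
ConfusionNetwork = List[List[Tuple[str, float]]]


def best_path(network: ConfusionNetwork) -> str:
    """Pick the most probable alternative in every slot and join the words."""
    words = []
    for slot in network:
        word, _ = max(slot, key=lambda alt: alt[1])
        if word:  # skip slots whose best alternative is "no word"
            words.append(word)
    return " ".join(words)


# Invented recognizer output, for illustration only.
network: ConfusionNetwork = [
    [("I", 0.9), ("eye", 0.1)],
    [("want", 0.6), ("won't", 0.4)],
    [("", 0.7), ("a", 0.3)],
    [("ticket", 1.0)],
]
print(best_path(network))  # I want ticket
```

Passing the full network rather than only the best path lets a downstream translation module weigh recognition alternatives itself.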
Beneficial effects of the present invention: the receiving module can receive both text and speech. It performs basic processing on what it receives: for example, the content between a pair of HTML tags forms a sentence on its own; when a pause in speech exceeds 0.5 s, a new sentence is considered to start after the pause; and adjacently repeated fragments are removed from the spoken-language text stream of the input. This makes it easier for the subsequent machine translation module to translate the input, thereby improving translation efficiency and quality.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the text language receiving module according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the speech recognition result receiving module according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments fall within the scope of protection of the present invention.
As shown in Figs. 1 and 2, a receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system according to an embodiment of the present invention is used to perform normalization checks on the system input and comprises a text language receiving module and a speech recognition result receiving module. The text language receiving module performs sentence segmentation and format conversion on text, and the speech recognition result receiving module performs segmentation and noise elimination on speech. The text language receiving module includes a sentence segmentation submodule, which breaks the input text at punctuation marks so that the basic unit translated by the subsequent machine translation module is a single sentence. It also includes a format conversion submodule, which converts the input text into the plain text or XML format supported by the machine translation module. The speech recognition result receiving module includes a sentence segmentation submodule, which segments the input speech text stream into sentences according to the pauses between words, and a noise elimination submodule, which removes adjacently repeated fragments from the spoken-language text stream of the input. The formats in which the subsequent modules of the machine translation system can accept the output of the speech recognition result receiving module are plain text and confusion networks.
In one embodiment, when the input text contains HTML tags, the content between a pair of HTML tags forms a sentence on its own and is translated as a complete sentence.
In one embodiment, the input text includes PDF text and/or image text.
In one embodiment, when a pause exceeds 0.5 s, the sentence segmentation submodule considers a new sentence to start after the pause.
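Pause-based sentence segmentation can be sketched as follows. The sketch assumes that the speech recognizer supplies a start and end time for each word, which the patent does not state explicitly; the `Word` tuple layout and the function name `split_on_pauses` are hypothetical. The threshold is a parameter, with 0.5 s (the value given in the detailed description) as the default.

```python
from typing import List, Optional, Tuple

# Assumed recognizer output: (token, start time in seconds, end time in seconds).
Word = Tuple[str, float, float]


def split_on_pauses(words: List[Word], max_pause: float = 0.5) -> List[List[str]]:
    """Start a new sentence whenever the silence between the end of one word
    and the start of the next exceeds max_pause seconds."""
    sentences: List[List[str]] = []
    current: List[str] = []
    prev_end: Optional[float] = None
    for token, start, end in words:
        if prev_end is not None and current and start - prev_end > max_pause:
            sentences.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        sentences.append(current)
    return sentences


# Invented timings, for illustration only.
stream: List[Word] = [
    ("hello", 0.0, 0.4),
    ("everyone", 0.5, 1.0),
    ("today", 1.8, 2.2),  # 0.8 s of silence before this word
    ("we", 2.3, 2.4),
]
print(split_on_pauses(stream))  # [['hello', 'everyone'], ['today', 'we']]
```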
To facilitate understanding of the above technical solution of the present invention, it is described in detail below by way of a specific mode of use.
In a specific use, the text language receiving module of the present invention consists mainly of two parts, as shown in Fig. 1: the sentence segmentation submodule A.1 and the format conversion submodule A.2. The sentence segmentation submodule A.1 breaks the input text at punctuation marks such as periods, question marks, and exclamation marks, so that the basic unit translated by the subsequent machine translation module is a single sentence. When the input text contains HTML tags, the content between a pair of HTML tags forms a sentence on its own, which ensures that it is translated as a complete sentence rather than as part of the text outside the tags. The subsequent modules of the machine translation system support the translation of plain text and XML-format text; therefore, when the input is in another format, such as a PDF or an image, it must first pass through format conversion. The format conversion submodule A.2 converts such other formats into plain text or XML format.
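The behavior of the sentence segmentation submodule for text can be sketched with regular expressions. This is a minimal illustration, not the patent's implementation: the names `TAG_PAIR`, `split_plain`, and `split_sentences` are hypothetical, nested tags are not handled (a real HTML parser would be more robust), and abbreviations or decimal numbers such as "3.14" would be split incorrectly.

```python
import re
from typing import List

# A simple, non-nested pair of tags such as <b>...</b>.
TAG_PAIR = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)
# A run of ordinary characters followed by any sentence-final punctuation
# (Western and Chinese periods, question marks, exclamation marks).
SENTENCE = re.compile(r"[^.?!。？！]+[.?!。？！]*")


def split_plain(text: str) -> List[str]:
    """Break untagged text after sentence-final punctuation."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def split_sentences(text: str) -> List[str]:
    """Split text into sentences; the content of each tag pair is kept
    as one sentence of its own, separate from the text around it."""
    sentences: List[str] = []
    pos = 0
    for match in TAG_PAIR.finditer(text):
        sentences.extend(split_plain(text[pos:match.start()]))
        inner = match.group(2).strip()
        if inner:
            sentences.append(inner)
        pos = match.end()
    sentences.extend(split_plain(text[pos:]))
    return sentences


print(split_sentences("Welcome! <b>Read this first</b> Then continue. Any questions?"))
# ['Welcome!', 'Read this first', 'Then continue.', 'Any questions?']
```

Treating the tagged content as its own unit keeps a heading or emphasized phrase from being merged into a neighboring sentence before translation.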
The speech recognition result receiving module likewise consists mainly of two parts, as shown in Fig. 2: the sentence segmentation submodule A.3 and the noise elimination submodule A.4. The sentence segmentation submodule A.3 segments the input text stream into sentences according to the pauses between words; for example, when a pause exceeds 0.5 s, a new sentence is considered to start after the pause. The noise elimination submodule A.4 removes adjacently repeated fragments from the spoken-language text stream of the input: for example, "uh uh" is reduced to "uh", and "that is, that is, we must ..." is reduced to "that is, we must ...". The formats of speech recognition results that the subsequent modules of the machine translation system can accept are plain text and confusion networks.
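Removal of adjacently repeated fragments can be sketched as collapsing immediately repeated n-grams. The patent does not describe the algorithm, so the approach below, the function name `remove_adjacent_repeats`, and the `max_ngram` limit are assumptions; the input is assumed to be already tokenized into words.

```python
from typing import List


def remove_adjacent_repeats(tokens: List[str], max_ngram: int = 4) -> List[str]:
    """Collapse immediately repeated word sequences of up to max_ngram words,
    keeping the first copy of each repeated sequence."""
    result = list(tokens)
    for n in range(max_ngram, 0, -1):
        i = 0
        while i + 2 * n <= len(result):
            if result[i:i + n] == result[i + n:i + 2 * n]:
                # Drop the second copy and re-check the same position,
                # so that triple repetitions collapse as well.
                del result[i + n:i + 2 * n]
            else:
                i += 1
    return result


tokens = "uh uh that is that is we must".split()
print(" ".join(remove_adjacent_repeats(tokens)))  # uh that is we must
```

A rule this simple also removes intentional repetitions such as "very very", so a practical system would probably restrict it to fillers or to longer fragments.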
In summary, by means of the above technical solution, the receiving module of the present invention can receive both text and speech. It performs basic processing on what it receives: for example, the content between a pair of HTML tags forms a sentence on its own; when a pause in speech exceeds 0.5 s, a new sentence is considered to start after the pause; and adjacently repeated fragments are removed from the spoken-language text stream of the input. This makes it easier for the subsequent machine translation module to translate the input, thereby improving translation efficiency and quality.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (9)

1. A receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system, characterized in that the receiving module is used to perform normalization checks on the system input and comprises a text language receiving module and a speech recognition result receiving module; wherein the text language receiving module is used to perform sentence segmentation and format conversion on text, and the speech recognition result receiving module is used to perform segmentation and noise elimination on speech.
2. The receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 1, characterized in that the text language receiving module includes a sentence segmentation submodule, which is used to break the input text at punctuation marks so that the basic unit translated by the subsequent machine translation module is a single sentence.
3. The receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 2, characterized in that, when the input text contains HTML tags, the content between a pair of HTML tags forms a sentence on its own and is translated as a complete sentence.
4. The receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 2, characterized in that the text language receiving module also includes a format conversion submodule, which is used to convert the input text into the plain text format or XML format supported by the machine translation module.
5. The receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 4, characterized in that the input text includes PDF text and/or image text.
6. The receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 1, characterized in that the speech recognition result receiving module includes a sentence segmentation submodule, which is used to segment the input speech text stream into sentences according to the pauses between words.
7. The receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 6, characterized in that, when a pause exceeds 0.5 s, the sentence segmentation submodule considers a new sentence to start after the pause.
8. The receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 6, characterized in that the speech recognition result receiving module also includes a noise elimination submodule, which is used to remove adjacently repeated fragments from the spoken-language text stream of the input.
9. The receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system according to claim 8, characterized in that the formats of speech recognition results that the subsequent modules of the machine translation system can accept are plain text and confusion networks.
CN201710203849.2A 2017-03-30 2017-03-30 Receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system Pending CN107066456A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710203849.2A CN107066456A (en) 2017-03-30 2017-03-30 Receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710203849.2A CN107066456A (en) 2017-03-30 2017-03-30 Receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system

Publications (1)

Publication Number Publication Date
CN107066456A true CN107066456A (en) 2017-08-18

Family

ID=59602734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710203849.2A Pending CN107066456A (en) 2017-03-30 2017-03-30 Receiving module of a multilingual intelligent-preprocessing real-time statistical machine translation system

Country Status (1)

Country Link
CN (1) CN107066456A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120010873A1 (en) * 2010-07-06 2012-01-12 Electronics And Telecommunications Research Institute Sentence translation apparatus and method
CN102650987A (en) * 2011-02-25 2012-08-29 北京百度网讯科技有限公司 Machine translation method and device both based on source language repeat resource
CN102231278A (en) * 2011-06-10 2011-11-02 安徽科大讯飞信息科技股份有限公司 Method and system for realizing automatic addition of punctuation marks in speech recognition
CN103956162A (en) * 2014-04-04 2014-07-30 上海元趣信息技术有限公司 Voice recognition method and device oriented towards child
CN105117389A (en) * 2015-07-28 2015-12-02 百度在线网络技术(北京)有限公司 Translation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JOHENNES: "Google可翻译Word或PDF文档", 《HTTPS://BLOG.CSDN.NET/JOHENNES/ARTICLE/DETAILS/12968209》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170818