CN107066456A

CN107066456A - Receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system

Info

Publication number: CN107066456A
Application number: CN201710203849.2A
Authority: CN
Inventors: 张昱琪; 唐亮
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-03-30
Filing date: 2017-03-30
Publication date: 2017-08-18

Abstract

The invention discloses a receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system. The receiving module performs a normative check on the system input and comprises a text language receiving module and a speech recognition result receiving module. The text language receiving module performs sentence segmentation and format conversion on text input, and the speech recognition result receiving module performs segmentation and noise elimination on speech input. The receiving module can therefore receive both text and speech. It also performs basic processing on the received language: for example, the content between a pair of HTML tags is treated as a sentence of its own; when a pause in speech exceeds 0.5 s, a new sentence is considered to begin after the pause; and adjacent repeated fragments in the input spoken-language text stream are removed. This makes it easier for the subsequent machine translation module to translate the language to be translated, thereby improving the efficiency and quality of translation.

Description

Receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system

Technical field

The present invention relates to the technical field of artificial-intelligence machine translation, and in particular to a receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system.

Background art

Machine translation is the technology of using a computer to translate human natural language automatically. It is the process of using a computer to convert one natural language into another, such that the two texts are equivalent in meaning.

At present, the statistics-based approach is a relatively mature and mainstream machine translation method. Its advantage is that all translation knowledge is learned automatically from corpora, with almost no need to write translation rules by hand. The method therefore makes the most of the computer's capacity for high-speed computation and significantly reduces labor costs.

Machine translation based on a statistical model learns, from a parallel corpus, phrase translations from one language A into another language B. When a new sentence is translated, the input sentence in language A is decomposed into phrases and then translated into a sentence in language B according to the learned co-occurrence probabilities of phrase (language A) - phrase (language B) pairs. The entire learning and translation process relies solely on the statistical model.
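The phrase-probability idea described above can be illustrated with a minimal sketch. It assumes that phrase pairs have already been extracted from a parallel corpus, and it estimates the probability of a language B phrase given a language A phrase by simple relative frequency. This document does not specify the estimation method, so the relative-frequency estimate, the function names, and the toy data are all assumptions made for illustration; a real statistical system would also use a language model, reordering, and a search-based decoder.

```python
from collections import Counter

def estimate_phrase_table(aligned_phrase_pairs):
    """Estimate P(phrase_b | phrase_a) by relative frequency (an assumed estimator)."""
    pairs = list(aligned_phrase_pairs)
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    table = {}
    for (src, tgt), n in pair_counts.items():
        # Co-occurrence count of the pair divided by the count of the source phrase.
        table.setdefault(src, {})[tgt] = n / source_counts[src]
    return table

def translate_phrases(phrases, table):
    """Replace each known phrase with its most probable translation.

    Unknown phrases are passed through unchanged. No reordering is attempted.
    """
    return [
        max(table[p], key=table[p].get) if p in table else p
        for p in phrases
    ]

# Hypothetical phrase pairs (language A, language B) from a toy parallel corpus.
pairs = [
    ("guten Tag", "good day"),
    ("guten Tag", "hello"),
    ("guten Tag", "good day"),
    ("danke", "thanks"),
]
table = estimate_phrase_table(pairs)
print(translate_phrases(["guten Tag", "danke"], table))
```

Picking the single most probable phrase independently for each position is the simplest possible decoding rule; it is used here only to show how the learned probabilities drive the choice of translation.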

The receiving module of such a system is mainly used to perform a basic normative check and processing on the text or speech to be translated (for example, translating the content of a lecture as it is delivered), so that the subsequent machine translation runs more smoothly. Existing receiving modules, however, have poor versatility: they can receive only text or only speech. In addition, receiving modules that accept speech exhibit various problems when processing it.

Summary of the invention

In view of the above technical problems in the related art, the present invention proposes a receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system that can overcome the above deficiencies of the prior art.

To achieve the above technical purpose, the technical solution of the invention is implemented as follows:

A receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system, wherein the receiving module performs a normative check on the system input and comprises a text language receiving module and a speech recognition result receiving module. The text language receiving module performs sentence segmentation and format conversion on text input, and the speech recognition result receiving module performs segmentation and noise elimination on speech input.

Further, the text language receiving module includes a sentence segmentation submodule, which breaks the input text at punctuation marks so that the basic unit translated by the subsequent machine translation module is a single sentence.

Preferably, when the input text contains HTML tags, the content between a pair of HTML tags forms a sentence of its own and is translated as a complete sentence.

Further, the text language receiving module also includes a format conversion submodule, which converts the language text into the plain text or XML format supported by the machine translation module.

Preferably, the language text includes PDF text and/or image text.

Further, the speech recognition result receiving module includes a sentence segmentation submodule, which breaks the input speech text stream into sentences according to the pauses between words.

Preferably, when a pause between sentences exceeds 5 s, the sentence segmentation submodule considers a new sentence to begin after the pause.

Further, the speech recognition result receiving module also includes a noise elimination submodule, which removes adjacent repeated fragments from the input spoken-language text stream.

Preferably, the formats in which the subsequent modules of the machine translation system can accept the speech recognition result are plain text and confusion network.

Beneficial effects of the invention: the receiving module can be used to receive both text and speech. It can also perform basic processing on the received language: for example, the content between a pair of HTML tags is treated as a sentence of its own; when a pause in speech exceeds 0.5 s, a new sentence is considered to begin after the pause; and adjacent repeated fragments in the input spoken-language text stream are removed. This makes it easier for the subsequent machine translation module to translate the language to be translated, thereby improving the efficiency and quality of translation.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the text language receiving module according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the speech recognition result receiving module according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention fall within the scope of protection of the present invention.

As shown in Figs. 1 and 2, a receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system according to an embodiment of the present invention performs a normative check on the system input. The receiving module comprises a text language receiving module and a speech recognition result receiving module: the text language receiving module performs sentence segmentation and format conversion on text input, and the speech recognition result receiving module performs segmentation and noise elimination on speech input.

The text language receiving module includes a sentence segmentation submodule, which breaks the input text at punctuation marks so that the basic unit translated by the subsequent machine translation module is a single sentence. It also includes a format conversion submodule, which converts the language text into the plain text or XML format supported by the machine translation module.

The speech recognition result receiving module includes a sentence segmentation submodule, which breaks the input speech text stream into sentences according to the pauses between words. It also includes a noise elimination submodule, which removes adjacent repeated fragments from the input spoken-language text stream. The formats in which the subsequent modules of the machine translation system can accept the output of the speech recognition result receiving module are plain text and confusion network.

In one embodiment, when the input text contains HTML tags, the content between a pair of HTML tags forms a sentence of its own and is translated as a complete sentence.

In one embodiment, the language text includes PDF text and/or image text.

In one embodiment, when a pause between sentences exceeds 5 s, the sentence segmentation submodule considers a new sentence to begin after the pause.

To facilitate understanding of the above technical solution, it is described in detail below by way of a specific mode of use.

In specific use, the text language receiving module of the present invention consists mainly of two parts, as shown in Fig. 1: A.1, the sentence segmentation submodule, and A.2, the format conversion submodule. The sentence segmentation submodule A.1 breaks the input text at punctuation marks such as full stops, question marks, and exclamation marks, so that the basic unit translated by the subsequent machine translation module is a single sentence. When the input text contains HTML tags, the content between a pair of HTML tags forms a sentence of its own, which ensures that it is translated as a complete sentence and not as part of the text outside the tags. The subsequent modules of the machine translation system support the translation of plain text and XML-format text. Therefore, when the input text is in another format, such as PDF or an image, it must be converted: the format conversion submodule A.2 converts such formats into plain text and XML format.
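The behavior of the sentence segmentation submodule A.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names are hypothetical, the punctuation set is limited to the full stop, question mark, and exclamation mark in ASCII and full-width forms, and HTML tag pairs are found with a simple regular expression that does not handle nested tags of the same name.

```python
import re

# A run of non-terminator characters followed by any sentence-final punctuation.
# Note: this naive rule also breaks at the dot in numbers such as "3.14".
_SENTENCE = re.compile(r"[^.?!。？！]+[.?!。？！]*")

# An opening tag, its content, and the matching closing tag (no nesting support).
_TAG_PAIR = re.compile(r"<(\w+)[^>]*>(.*?)</\1\s*>", re.DOTALL)

def split_at_punctuation(text):
    """Break plain text into sentences at sentence-final punctuation marks."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

def segment_text(text):
    """Segment input text; content inside a tag pair becomes a sentence of its own."""
    sentences = []
    pos = 0
    for match in _TAG_PAIR.finditer(text):
        # Text before the tag pair is segmented on its own, never merged with the tag content.
        sentences.extend(split_at_punctuation(text[pos:match.start()]))
        inner = match.group(2).strip()
        if inner:
            sentences.append(inner)
        pos = match.end()
    sentences.extend(split_at_punctuation(text[pos:]))
    return sentences

example = "<h1>Release notes</h1>Version 2 is out. Did you update? <li>Faster startup</li>"
print(segment_text(example))
# ['Release notes', 'Version 2 is out.', 'Did you update?', 'Faster startup']
```

In this sketch the content of a tag pair is kept whole even if it contains sentence-final punctuation, which follows the literal wording above; an implementation could equally choose to segment inside the tags as well.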

The speech recognition result receiving module likewise consists mainly of two parts, as shown in Fig. 2: A.3, the sentence segmentation submodule, and A.4, the noise elimination submodule. The sentence segmentation submodule A.3 breaks the input text stream into sentences according to the pauses between words; for example, when a pause exceeds 0.5 s, a new sentence is considered to begin after the pause. The noise elimination submodule A.4 removes adjacent repeated fragments from the input spoken-language text stream; for example, "uh uh" is reduced to "uh", and "that is, that is, we must ..." is reduced to "that is, we must ...". The formats in which the subsequent modules of the machine translation system can accept the speech recognition result are plain text and confusion network.
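The two speech-side steps can be sketched together as follows. The sketch assumes that the speech recognizer delivers each word with a start and end time in seconds; that input format, the function names, and the limit on the length of a repeated fragment are assumptions made for illustration and are not stated in this document.

```python
def segment_by_pause(timed_words, pause_threshold=0.5):
    """Start a new sentence whenever the silence before a word exceeds the threshold.

    timed_words: iterable of (word, start_seconds, end_seconds) tuples (assumed format).
    """
    sentences, current, prev_end = [], [], None
    for word, start, end in timed_words:
        if current and start - prev_end > pause_threshold:
            sentences.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        sentences.append(current)
    return sentences

def collapse_adjacent_repeats(words, max_fragment_len=3):
    """Remove a fragment of up to max_fragment_len words that immediately repeats itself."""
    out = []
    for word in words:
        out.append(word)
        collapsed = True
        while collapsed:
            collapsed = False
            for n in range(1, max_fragment_len + 1):
                # Compare the last n words with the n words directly before them.
                if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                    del out[-n:]
                    collapsed = True
                    break
    return out

stream = [
    ("uh", 0.0, 0.2), ("uh", 0.25, 0.4), ("hello", 0.45, 0.8), ("everyone", 0.85, 1.3),
    ("that", 2.1, 2.3), ("is", 2.3, 2.4), ("that", 2.45, 2.6), ("is", 2.6, 2.7),
    ("we", 2.75, 2.9), ("must", 2.9, 3.2),
]
cleaned = [collapse_adjacent_repeats(s) for s in segment_by_pause(stream)]
print(cleaned)
# [['uh', 'hello', 'everyone'], ['that', 'is', 'we', 'must']]
```

A purely mechanical rule of this kind also collapses repetitions that are intentional, such as "very very", so a practical system would likely need a list of exceptions or a more selective criterion.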

In summary, by means of the above technical solution, the receiving module of the present invention can be used to receive both text and speech. It can also perform basic processing on the received language: for example, the content between a pair of HTML tags is treated as a sentence of its own; when a pause in speech exceeds 0.5 s, a new sentence is considered to begin after the pause; and adjacent repeated fragments in the input spoken-language text stream are removed. This makes it easier for the subsequent machine translation module to translate the language to be translated, thereby improving the efficiency and quality of translation.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system, characterized in that the receiving module performs a normative check on the system input and comprises a text language receiving module and a speech recognition result receiving module; wherein the text language receiving module performs sentence segmentation and format conversion on text input, and the speech recognition result receiving module performs segmentation and noise elimination on speech input.

2. The receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system according to claim 1, characterized in that the text language receiving module includes a sentence segmentation submodule, which breaks the input text at punctuation marks so that the basic unit translated by the subsequent machine translation module is a single sentence.

3. The receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system according to claim 2, characterized in that, when the input text contains HTML tags, the content between a pair of HTML tags forms a sentence of its own and is translated as a complete sentence.

4. The receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system according to claim 2, characterized in that the text language receiving module also includes a format conversion submodule, which converts the language text into the plain text format or XML format supported by the machine translation module.

5. The receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system according to claim 4, characterized in that the language text includes PDF text and/or image text.

6. The receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system according to claim 1, characterized in that the speech recognition result receiving module includes a sentence segmentation submodule, which breaks the input speech text stream into sentences according to the pauses between words.

7. The receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system according to claim 6, characterized in that, when a pause between sentences exceeds 5 s, the sentence segmentation submodule considers a new sentence to begin after the pause.

8. The receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system according to claim 6, characterized in that the speech recognition result receiving module also includes a noise elimination submodule, which removes adjacent repeated fragments from the input spoken-language text stream.

9. The receiving module of a multilingual intelligent preprocessing real-time statistical machine translation system according to claim 8, characterized in that the formats in which the subsequent modules of the machine translation system can accept the speech recognition result are plain text and confusion network.