CN102903361A - Instant call translation system and instant call translation method - Google Patents

Instant call translation system and instant call translation method

Info

Publication number
CN102903361A
CN102903361A
Authority
CN
China
Prior art keywords
language
speech signal
input speech
cutting
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012103909731A
Other languages
Chinese (zh)
Inventor
钟实
刘鹤
袁首鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Itp Innovation Ltd
Original Assignee
Itp Innovation Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Itp Innovation Ltd filed Critical Itp Innovation Ltd
Priority to CN2012103909731A priority Critical patent/CN102903361A/en
Publication of CN102903361A publication Critical patent/CN102903361A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an instant call translation system and an instant call translation method. The system comprises a slicer, a speech recognition device, a translation device, and a speech synthesis device. The slicer is connected to a switch and cuts an input speech signal into one or more audio files; the speech recognition device is connected to the slicer and transcribes the one or more audio files into text in a source language; the translation device is connected to the speech recognition device and translates the source-language text into text in a target language; and the speech synthesis device is connected to the translation device, converts the target-language text into an output speech signal, and outputs it to the switch. With the instant call translation system and method, two parties to a call who face a language barrier can communicate with each other freely and in real time.

Description

Instant call translation system and method
Technical field
The present invention relates to the field of instant translation, and in particular to an instant call translation system and method.
Background art
In the present era, people from different countries frequently need to communicate for political, economic, cultural, entertainment, and other reasons, and networks, telephones, and similar channels allow people in different regions to reach one another conveniently. However, besides convenient transmission media such as networks and telephones, the problem of the language barrier must also be solved. Mastering a foreign language well enough to converse fluently with people of other countries is very difficult, so the language barrier is the greatest obstacle to communication between people of different countries. At present there are many translation programs available on the network or on smart terminals such as mobile phones, but such software generally cannot be used for real-time communication.
Therefore, there is a need for an instant call translation system and method that addresses the above problems.
Summary of the invention
This Summary introduces a selection of concepts in a simplified form that are further described in the Detailed Description. It is not intended to identify key or essential features of the claimed technical solution, nor is it intended to determine the scope of protection of the claimed technical solution.
In order to solve the above problems, the invention discloses an instant call translation system comprising a slicer, a speech recognition device, a translation device, and a speech synthesis device. The slicer is adapted to be connected to a switch and to cut an input speech signal into one or more audio files; the speech recognition device is connected to the slicer and transcribes the one or more audio files into text in a source language; the translation device is connected to the speech recognition device and translates the source-language text into text in a target language; and the speech synthesis device is connected to the translation device, converts the target-language text into an output speech signal, and outputs it to the switch.
In a preferred embodiment of the invention, the system further comprises a memory connected between the slicer and the speech recognition device; the slicer also stores the one or more audio files in the memory, and the one or more audio files transcribed by the speech recognition device come from the memory.
In a preferred embodiment of the invention, the system further comprises a language determination device connected to the slicer for determining the languages used by the two parties to the call; one of the languages used by the two parties serves as the source language and the other as the target language.
In a preferred embodiment of the invention, the system further comprises an input interface for receiving the input speech signal from the switch, and an output interface for outputting the output speech signal to the switch.
In a preferred embodiment of the invention, the slicer further comprises a detection unit for detecting the silent portions of the input speech signal, and a cutting unit for cutting the input speech signal into the one or more audio files based on the detected silent portions.
Preferably, a silent portion comprises a portion whose decibel value remains at or below a noise threshold for a period of 0.6 seconds or longer.
In a preferred embodiment of the invention, the system further comprises an automatic gain controller connected to the slicer for applying gain control to the input speech signal.
In a preferred embodiment of the invention, the automatic gain controller further comprises an amplification unit for amplifying an input speech signal whose decibel value is below a set value up to the set value, and an attenuation unit for reducing an input speech signal whose decibel value is above the set value down to the set value.
In a preferred embodiment of the invention, the system further comprises a filter connected to the slicer for performing noise reduction on the input speech signal.
Preferably, the filter is a Wiener filter.
According to another aspect of the invention, an instant call translation method is also provided, comprising: cutting an input speech signal into one or more audio files; transcribing the one or more audio files into text in a source language; translating the source-language text into text in a target language; and converting the target-language text into an output speech signal.
In a preferred embodiment of the invention, the method further comprises, after the cutting, storing the one or more audio files in a memory; the one or more audio files that are transcribed come from the memory.
In a preferred embodiment of the invention, the method further comprises, before the cutting, determining the languages used by the two parties to the call; one of the languages used by the two parties serves as the source language and the other as the target language.
In a preferred embodiment of the invention, the method further comprises, before the cutting, receiving the input speech signal from a switch, and, after the conversion, outputting the output speech signal to the switch.
In a preferred embodiment of the invention, the cutting further comprises detecting the silent portions of the input speech signal, and cutting the input speech signal into the one or more audio files based on the detected silent portions.
Preferably, a silent portion comprises a portion whose decibel value remains at or below a noise threshold for a period of 0.6 seconds or longer.
In a preferred embodiment of the invention, the method further comprises, before the cutting, applying gain control to the input speech signal.
In a preferred embodiment of the invention, the gain control further comprises amplifying an input speech signal whose decibel value is below a set value up to the set value, and reducing an input speech signal whose decibel value is above the set value down to the set value.
In a preferred embodiment of the invention, the method further comprises, before the cutting, performing noise reduction on the input speech signal.
Preferably, the noise reduction further comprises applying Wiener filtering to the input speech signal.
The instant call translation system and method provided by the present invention enable two parties to a call who face a language barrier to communicate freely with each other in real time.
Brief description of the drawings
The following drawings are included here as a part of the present invention to aid understanding of the invention. The drawings illustrate embodiments of the invention and, together with their description, serve to explain the principles of the invention. In the drawings:
Fig. 1 shows a structural block diagram of an instant call translation system according to a preferred embodiment of the present invention;
Fig. 2 shows a schematic diagram of an input speech signal according to a preferred embodiment of the present invention;
Fig. 3 shows a flowchart of an instant call translation method according to a preferred embodiment of the present invention;
Fig. 4 shows a schematic diagram of a call system that includes the instant call translation system according to a preferred embodiment of the present invention.
Detailed description of the embodiments
In the following description, numerous specific details are provided in order to give a more thorough understanding of the invention. However, it will be apparent to those skilled in the art that the invention can be practiced without one or more of these details. In other instances, certain technical features well known in the art are not described in order to avoid obscuring the invention.
In order to provide a thorough understanding of the invention, detailed structures are set forth in the following description. Obviously, the practice of the invention is not limited to the specific details familiar to those skilled in the art. The preferred embodiments of the invention are described in detail below; however, besides these detailed descriptions, the invention can also have other embodiments.
According to one aspect of the invention, an instant call translation system is provided. Fig. 1 shows a structural block diagram of an instant call translation system 100 according to a preferred embodiment of the present invention. As shown in Fig. 1, the instant call translation system comprises a slicer 104, a speech recognition device 106, a translation device 107, and a speech synthesis device 108. The slicer 104 is adapted to be connected to an external switch and cuts the input speech signal into one or more audio files. The speech recognition device 106 is connected to the slicer 104 and transcribes the one or more audio files cut by the slicer 104 into text in the source language. The translation device 107 is connected to the speech recognition device 106 and translates the source-language text transcribed by the speech recognition device 106 into text in the target language. The speech synthesis device 108 is connected to the translation device 107, converts the target-language text produced by the translation device 107 into an output speech signal, and outputs it to the external switch.
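The following minimal sketch shows, in Python, how the four components just described could be chained together. The component classes and their method names (split, transcribe, translate, synthesize) are assumptions made for illustration only and are not taken from the patent.

```python
class InstantCallTranslator:
    """Sketch of the pipeline: slicer -> speech recognition -> translation -> speech synthesis."""

    def __init__(self, slicer, recognizer, translator, synthesizer):
        self.slicer = slicer            # cuts the input speech signal into audio segments
        self.recognizer = recognizer    # transcribes each segment into source-language text
        self.translator = translator    # translates source-language text into target-language text
        self.synthesizer = synthesizer  # converts target-language text into an output speech signal

    def process(self, input_signal, source_lang, target_lang):
        output_segments = []
        for segment in self.slicer.split(input_signal):
            source_text = self.recognizer.transcribe(segment, language=source_lang)
            target_text = self.translator.translate(source_text, source_lang, target_lang)
            output_segments.append(self.synthesizer.synthesize(target_text, language=target_lang))
        return output_segments  # returned to the switch as the output speech signal
```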
Speech recognition is usually performed on vocabulary, phrases, or relatively short sentences. As shown in Fig. 1, the slicer 104 is connected to the external switch and cuts the input speech signal coming from the external switch into one or more audio files. A long, continuous conversation is thereby cut into short utterances, so the subsequent speech recognition can operate on the segmented data, which greatly improves its accuracy and effectively guarantees the quality of the instant call translation.
According to a preferred embodiment of the invention, the slicer 104 can be divided into a detection unit and a cutting unit, where the detection unit detects the silent portions of the input speech signal and the cutting unit cuts the input speech signal into one or more audio files based on the detected silent portions. Fig. 2 shows a schematic diagram of an input speech signal according to a preferred embodiment of the present invention. As shown in Fig. 2, silent portions can be detected in the input speech signal, and the signal is then cut into one or more audio files at those silent portions. Silence is an indispensable part of a conversation, and cutting the speech signal at silent portions preserves the meaning of the speaker's statements better; sentences are not broken in the middle, which avoids errors in the subsequent processing.
A silent portion of the input speech signal can be a portion whose decibel value remains at or below a noise threshold for a certain length of time. The noise threshold can be chosen according to the environment of the two parties; in a noisy environment, for example, it can be set higher. By increasing the required duration, noise can be treated as silence and removed. Preferably, the duration is 0.6 seconds or longer. A pause of 0.6 seconds roughly matches the interval between sentences when people converse, so silent portions of this length divide the dialogue into audio files that follow natural sentence boundaries fairly accurately, and they also remove noise effectively, making the subsequent processing more accurate.
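The silence-based cutting described above could be sketched as follows. The 0.6-second duration comes from the text; the frame length and the noise-threshold value are illustrative assumptions.

```python
import numpy as np

def split_on_silence(signal, sample_rate=8000, frame_s=0.02,
                     noise_threshold_db=-40.0, min_silence_s=0.6):
    """Cut a mono 16-bit PCM signal (numpy int16 array) into segments wherever its
    level stays at or below noise_threshold_db for at least min_silence_s seconds."""
    frame_len = int(frame_s * sample_rate)           # 160 samples per frame at 8000 Hz
    min_run = int(round(min_silence_s / frame_s))    # 30 frames of 20 ms = 0.6 s
    n_frames = len(signal) // frame_len

    # Classify each frame as silent or not, using its level in dB relative to full scale.
    silent = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len].astype(np.float64)
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
        silent.append(20 * np.log10(rms / 32768.0) <= noise_threshold_db)

    segments, seg_start, run = [], 0, 0
    for i, is_silent in enumerate(silent):
        run = run + 1 if is_silent else 0
        if run == min_run:                           # a qualifying pause closes the current segment
            cut = (i - min_run + 1) * frame_len
            if cut > seg_start:
                segments.append(signal[seg_start:cut])
        if run >= min_run:                           # skip forward while the pause continues
            seg_start = (i + 1) * frame_len
    if seg_start < len(signal):                      # trailing speech after the last pause
        segments.append(signal[seg_start:])
    return segments
```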
The speech recognition device 106 is connected to the slicer 104 and transcribes the one or more audio files cut by the slicer 104 into text in the source language. According to a preferred embodiment of the invention, the transcription performed by the speech recognition device 106 includes the following operations. First, speech features are extracted from the one or more audio files formed after the cutting. Based on the extracted features, the speech signal can be analyzed: redundant information irrelevant to speech recognition can be removed, the important information that affects recognition can be obtained, and the speech signal can be compressed at the same time. Then the speech recognition device 106 performs recognition with a trained acoustic model using the extracted features; specifically, the features of the speech signal are matched and compared against the features of the acoustic model to obtain the best recognition result. The complete transcription turns the one or more audio files cut by the slicer 104 into source-language text.
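A highly simplified sketch of the matching step might look as follows. The dictionary-of-templates acoustic model and the plain Euclidean score are illustrative assumptions only; practical recognizers use statistical models and dynamic alignment.

```python
import numpy as np

def match_acoustic_model(features, acoustic_model):
    """Compare the feature vectors extracted from one audio segment against the
    feature templates of a trained acoustic model and keep the best match.
    features and each template are 2-D arrays of shape (frames, feature_dim)."""
    best_text, best_score = None, float("inf")
    for text, template in acoustic_model.items():    # {candidate text: feature template}
        n = min(len(features), len(template))        # crude length alignment by truncation
        score = float(np.mean(np.linalg.norm(features[:n] - template[:n], axis=1)))
        if score < best_score:
            best_text, best_score = text, score
    return best_text
```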
The translation device 107 is connected to the speech recognition device 106 and translates the source-language text transcribed by the speech recognition device 106 into target-language text. Based on knowledge of the grammar, semantics, syntax, and idioms of the source-language text and of the speaker's culture, the translation device 107 analyzes all the features of the source-language text, decodes its meaning, and then re-encodes the text as target-language text expressing the same meaning.
The speech synthesis device 108 is connected to the translation device 107, converts the target-language text produced after translation by the translation device 107 into an output speech signal in the target language, and outputs it to the external switch. The conversion proceeds as follows: first, the target-language text produced by the translation device 107 is converted into characteristic parameters of the target language, producing the prosodic information corresponding to each syllable of the sentences of the text; then, taking into account the tone, intonation, pauses, and syllable durations of ordinary speech, this prosodic information is converted into corresponding prosodic parameters; finally, the prosodic parameters are combined with acoustic parameters to generate the corresponding output speech signal, which is output to the external switch.
According to a preferred embodiment of the invention, the instant call translation system 100 can also include an input interface and an output interface (not shown in Fig. 1). The input interface can be connected between the external switch and the slicer 104 and receives the input speech signal from the external switch; this signal may be either an analog signal or a digital signal. For a digital signal, the sampling frequency is preferably 8000 Hz and the quantization depth is preferably 16 bits. The output interface can be connected between the speech synthesis device 108 and the external switch and outputs the output speech signal to the external switch.
According to a preferred embodiment of the invention, the instant call translation system 100 can also include a language determination device 101, connected to the slicer 104, for determining the languages used by the two parties to the call. During the call, if one of the languages used by the two parties serves as the source language, the other serves as the target language. As shown in Fig. 1, after the two parties are connected through the external switch, a sentence spoken by each party (for example, the initial greeting) can be fed through the switch into the system 100, and the language determination device 101 then determines the languages used by the two parties. For example, suppose the parties are a Chinese speaker and an American, that is, the languages used by the two parties are Chinese and English, and the initial greetings are "喂" from the Chinese speaker and "hello" from the American; the language determination device 101 receives "喂" and "hello" from the external switch and determines that the languages used by the two parties are Chinese and English. In the subsequent processing, if the input speech signal is in Chinese, the source language is Chinese and the target language is English; conversely, if the input speech signal is in English, the source language is English and the target language is Chinese. A system 100 according to this preferred embodiment can recognize speech signals in various languages and therefore has a wide range of application. Those of ordinary skill in the art will appreciate that the source and target languages of the system 100 can also be set in advance, in which case the language determination device 101 is not needed.
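A possible sketch of this language-determination logic is given below. The identify_language callback is an assumed language-identification routine, since the patent does not prescribe how the language of a greeting is recognized.

```python
def make_direction_resolver(greeting_1, greeting_2, identify_language):
    """Identify each party's language from an initial greeting, then map any later
    utterance to a (source, target) language pair."""
    lang_1 = identify_language(greeting_1)   # e.g. 'zh' for the greeting "喂"
    lang_2 = identify_language(greeting_2)   # e.g. 'en' for the greeting "hello"

    def resolve(utterance):
        source = identify_language(utterance)
        target = lang_2 if source == lang_1 else lang_1
        return source, target

    return resolve
```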
According to a preferred embodiment of the invention, the instant call translation system 100 can also include an automatic gain controller 102, connected to the slicer, for applying gain control to the input speech signal, for example adjusting the decibel value of the received input speech signal to a roughly uniform set level. Applying gain control to the input speech signal through the automatic gain controller 102 prevents sudden swings in the speaker's volume from affecting the subsequent processing and, in turn, the other party's experience.
Preferably, the automatic gain controller 102 can include an amplification unit and an attenuation unit. When the decibel value of a received input speech signal is below the set value, the amplification unit amplifies it up to the set value; conversely, when the decibel value of a received input speech signal is above the set value, the attenuation unit reduces it down to the set value. The set value can be chosen freely according to actual needs.
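A minimal sketch of this gain control follows; the set value of -20 dB relative to full scale is an illustrative assumption, since the patent leaves the set value to be chosen according to actual needs.

```python
import numpy as np

def apply_gain_control(signal, set_value_db=-20.0):
    """Measure the level of a 16-bit PCM segment and scale it toward a single set value."""
    x = signal.astype(np.float64)
    rms = np.sqrt(np.mean(x ** 2)) + 1e-12
    level_db = 20 * np.log10(rms / 32768.0)          # level relative to 16-bit full scale
    gain = 10 ** ((set_value_db - level_db) / 20.0)  # >1 amplifies, <1 attenuates
    return np.clip(x * gain, -32768, 32767).astype(np.int16)
```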
According to a preferred embodiment of the invention, the instant call translation system 100 can also include a filter 103, connected to the slicer 104, for performing noise reduction on the input speech signal. The noise reduction can use filtering, which extracts useful information by removing noise and interference from continuous or discrete input data. Preferably, the filter 103 is a Wiener filter, which gives a good filtering result.
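A minimal sketch of this noise-reduction step, assuming a digital 16-bit input signal and using the Wiener filter available in scipy.signal, is given below; the window length is an illustrative assumption.

```python
import numpy as np
from scipy.signal import wiener

def reduce_noise(signal, window=29):
    """Apply a Wiener filter to a 16-bit PCM signal and return it in the same format."""
    filtered = wiener(signal.astype(np.float64), mysize=window)
    return np.clip(filtered, -32768, 32767).astype(np.int16)
```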
In short, both the automatic gain controller 102 and the filter 103 make the input speech signal easier to recognize and improve the accuracy of recognition and translation.
According to a preferred embodiment of the invention, the instant call translation system 100 can also include a memory 105 connected between the slicer 104 and the speech recognition device 106. In this case, the slicer 104 also stores the one or more audio files it has cut in the memory 105, and the one or more audio files transcribed by the speech recognition device 106 come from the memory 105. The memory 105 temporarily holds the one or more audio files cut by the slicer 104, buffering them before they reach the speech recognition device, so that the subsequent transcription performed by the speech recognition device 106 runs more smoothly.
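The buffering role of the memory 105 could be sketched with a simple queue between the slicer and the speech recognition device, as below; the slicer and recognizer objects are the same illustrative assumptions used in the earlier sketches.

```python
import queue

# Audio files produced by the slicer are queued so the speech recognition
# device can take them out at its own pace.
segment_buffer = queue.Queue()

def slicer_worker(slicer, input_signal):
    for audio_file in slicer.split(input_signal):
        segment_buffer.put(audio_file)
    segment_buffer.put(None)                  # sentinel: no more audio files

def recognizer_worker(recognizer, results):
    while True:
        audio_file = segment_buffer.get()
        if audio_file is None:                # the slicer has finished
            break
        results.append(recognizer.transcribe(audio_file))
```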
In addition, it should be noted that the terms "connected" and "coupled" above can denote either a direct connection or an indirect connection between the devices. Fig. 1 shows only one way of connecting the devices of the instant call translation system 100; other arrangements are possible. For example, the language determination device 101 can be connected directly to the filter 103, with the automatic gain controller 102 connected between the filter 103 and the slicer 104.
According to another aspect of the invention, an instant call translation method is also provided. Fig. 3 shows a flowchart of an instant call translation method 300 according to a preferred embodiment of the present invention. As shown in Fig. 3, the instant call translation method 300 comprises a cutting step 304, a speech-to-text step 306, a translation step 307, and a text-to-speech step 308. The cutting step 304 cuts the input speech signal into one or more audio files; the speech-to-text step 306 transcribes the one or more audio files formed by the cutting step 304 into text in the source language; the translation step 307 translates the source-language text formed by the speech-to-text step 306 into text in the target language; and the text-to-speech step 308 converts the target-language text formed by the translation step 307 into an output speech signal.
In the cutting step 304, cutting the input speech signal further comprises a detection step and a segmentation step: the detection step detects the silent portions of the input speech signal, and the segmentation step then cuts the input speech signal into a plurality of audio files based on the detected silent portions.
According to a preferred embodiment of the invention, a silent portion of the input speech signal is a portion whose decibel value remains at or below a noise threshold for a period of 0.6 seconds or longer.
After the cutting step 304 has cut the input speech signal into one or more audio files, the method proceeds to the speech-to-text step 306, which transcribes the one or more audio files formed by the cutting step 304 into source-language text. In the speech-to-text step 306, speech features are first extracted from the one or more audio files formed by the cutting step 304, and recognition is then performed with a trained acoustic model using the extracted features. Specifically, the features of the speech signal are matched and compared against the features of the acoustic model to obtain the best recognition result.
After the speech-to-text step 306 has transcribed the one or more audio files formed by the cutting step 304 into source-language text, the method proceeds to the translation step 307, which translates that source-language text into target-language text. In the translation step 307, based on knowledge of the grammar, semantics, syntax, and idioms of the source-language text and of the speaker's culture, all the features of the source-language text are analyzed and its meaning is decoded; the text is then re-encoded as target-language text with the same meaning, which completes the translation of the source-language text into target-language text.
After the translation step 307 has translated the source-language text formed in the speech-to-text step 306 into target-language text, the method proceeds to the text-to-speech step 308, which converts the target-language text formed in the translation step 307 into an output speech signal in the target language and outputs it to the external switch. In the text-to-speech step 308, preferably, the target-language text formed in the translation step 307 is first converted into characteristic parameters of the target language, producing the prosodic information corresponding to each syllable of the sentences of the text; then, taking into account the tone, intonation, pauses, and syllable durations of ordinary speech, this prosodic information is converted into corresponding prosodic parameters; finally, the prosodic parameters are combined with acoustic parameters to generate the corresponding output speech signal, which is output to the external switch.
At this point, the entire instant call translation process is complete.
According to a preferred embodiment of the invention, the instant call translation method 300 can also include a receiving step and an output step (not shown in Fig. 3). In the receiving step, before the cutting step 304, the input speech signal is received from the switch; this signal may be either analog or digital. For a digital signal, the sampling frequency is preferably 8000 Hz and the quantization depth is preferably 16 bits. In the output step, after the text-to-speech step 308, the output speech signal is output to the switch.
According to a preferred embodiment of the invention, the instant call translation method 300 can also include a language determination step 301, which determines, before the cutting step 304, the languages used by the two parties to the call. One of the languages used by the two parties serves as the source language and the other as the target language. For example, suppose the parties are a Chinese speaker and an American, that is, the languages used by the two parties are Chinese and English, and the initial greetings are "喂" from the Chinese speaker and "hello" from the American; in the language determination step 301, the "喂" and "hello" sent by the external switch are received and the languages used by the two parties are determined to be Chinese and English. In the subsequent processing, if the input speech signal is in Chinese, the source language is Chinese and the target language is English; conversely, if the input speech signal is in English, the source language is English and the target language is Chinese.
According to a preferred embodiment of the invention, the instant call translation method 300 can also include a gain control step 302, which applies gain control to the input speech signal before the cutting step 304, for example adjusting the decibel value of the received input speech signal to a roughly uniform set level.
Preferably, in the gain control step 302, when the decibel value of a received input speech signal is below a set value, it is amplified up to that set value; conversely, when the decibel value of a received input speech signal is above the set value, it is reduced down to that set value. The set value can be chosen freely according to actual needs.
According to a preferred embodiment of the invention, the instant call translation method 300 can also include a noise reduction step 303, which performs noise reduction on the input speech signal before the cutting step 304. The noise reduction can use filtering; preferably, the noise reduction step 303 comprises applying Wiener filtering to the input speech signal.
In addition, those of ordinary skill in the art will appreciate that Fig. 3 shows only one order of execution for the steps of the instant call translation method according to a preferred embodiment of the present invention; this order can be adjusted. For example, the gain control step 302 can be performed after the noise reduction step 303.
According to a preferred embodiment of the invention, the instant call translation method 300 can also include a storing step 305, which, after the cutting step 304 and before the speech-to-text step 306, stores the one or more audio files formed by the cutting step 304 in a memory. The one or more audio files transcribed in the speech-to-text step 306 come from this memory.
Fig. 4 shows a schematic diagram of a preferred embodiment of a call system that includes the instant call translation system according to a preferred embodiment of the present invention. The call system 400 comprises the telephones 401 and 402 used by the callers, a public switched telephone network (PSTN) 403, a private branch exchange (IP PBX) 404, and the instant call translation system 405 provided by the present invention. The telephones 401 and 402 used by the callers can also be replaced by smart terminals, and correspondingly the PSTN 403 can be replaced by a voice-over-IP (VoIP) network.
As shown in Fig. 4, the two parties to the call are user 1 and user 2. User 1 speaks language A and user 2 speaks language B. The calling party, for example user 1, dials user 2 through the PSTN 403, and the IP PBX 404 sets up the call connection between the two parties. User 1 and user 2 then begin to talk; the speech each of them produces enters the instant call translation system 405 through the IP PBX 404, and the translated speech is sent back through the IP PBX to the corresponding user. The workflow of the call system 400 is described in detail below. First, the call connection between user 1 and user 2 is established. Then the input speech signal S1 in language A from user 1 is sent via the IP PBX 404 to the instant call translation system 405. It is translated by the instant call translation system 405 into the output speech signal S4 expressed in language B. Finally, the IP PBX 404 detects this signal S4 and sends it to user 2. Those of ordinary skill in the art will appreciate that the routine handling of the speech signals by the PSTN and the IP PBX is omitted from this description so as not to obscure the invention. In this way, user 2 hears user 1's speech expressed in user 2's own language, language B. Likewise, when user 2 replies in language B, user 1 hears user 2's speech expressed in language A. Optionally, besides hearing the other party's speech in their own language, user 1 and user 2 can also hear the untranslated speech.
With the instant call translation system and method provided by the invention, two parties to a call who face a language barrier can communicate freely in real time over a conventional public switched telephone network, a VoIP network, or the like.
The present invention has been illustrated by the above embodiments, but it should be understood that the above embodiments are given only for the purposes of illustration and explanation and are not intended to limit the invention to their scope. Those skilled in the art will further appreciate that the invention is not limited to the above embodiments and that further variations and modifications can be made in accordance with the teachings of the invention, all of which fall within the claimed scope of protection. The scope of protection of the invention is defined by the appended claims and their equivalents.

Claims (20)

1. An instant call translation system comprising a slicer, a speech recognition device, a translation device, and a speech synthesis device, wherein:
the slicer is adapted to be connected to a switch and to cut an input speech signal into one or more audio files;
the speech recognition device is connected to the slicer and transcribes the one or more audio files into text in a source language;
the translation device is connected to the speech recognition device and translates the source-language text into text in a target language; and
the speech synthesis device is connected to the translation device, converts the target-language text into an output speech signal, and outputs it to the switch.
2. The system according to claim 1, characterized in that the system further comprises:
a memory connected between the slicer and the speech recognition device;
wherein the slicer also stores the one or more audio files in the memory; and
the one or more audio files transcribed by the speech recognition device come from the memory.
3. The system according to claim 1, characterized in that the system further comprises:
a language determination device connected to the slicer for determining the languages used by the two parties to the call;
wherein one of the languages used by the two parties serves as the source language and the other as the target language.
4. The system according to claim 1, characterized in that the system further comprises:
an input interface for receiving the input speech signal from the switch; and
an output interface for outputting the output speech signal to the switch.
5. The system according to claim 1, characterized in that the slicer further comprises:
a detection unit for detecting the silent portions of the input speech signal; and
a cutting unit for cutting the input speech signal into the one or more audio files based on the detected silent portions.
6. The system according to claim 5, characterized in that a silent portion comprises a portion whose decibel value remains at or below a noise threshold for a period of 0.6 seconds or longer.
7. The system according to claim 1, characterized in that the system further comprises:
an automatic gain controller connected to the slicer for applying gain control to the input speech signal.
8. The system according to claim 7, characterized in that the automatic gain controller further comprises:
an amplification unit for amplifying an input speech signal whose decibel value is below a set value up to the set value; and
an attenuation unit for reducing an input speech signal whose decibel value is above the set value down to the set value.
9. The system according to claim 1, characterized in that the system further comprises:
a filter connected to the slicer for performing noise reduction on the input speech signal.
10. The system according to claim 9, characterized in that the filter is a Wiener filter.
11. An instant call translation method comprising:
cutting an input speech signal into one or more audio files;
transcribing the one or more audio files into text in a source language;
translating the source-language text into text in a target language; and
converting the target-language text into an output speech signal.
12. The method according to claim 11, characterized in that the method further comprises, after the cutting:
storing the one or more audio files in a memory;
wherein the one or more audio files that are transcribed come from the memory.
13. The method according to claim 11, characterized in that the method further comprises, before the cutting:
determining the languages used by the two parties to the call;
wherein one of the languages used by the two parties serves as the source language and the other as the target language.
14. The method according to claim 11, characterized in that:
the method further comprises, before the cutting, receiving the input speech signal from a switch; and
the method further comprises, after the conversion, outputting the output speech signal to the switch.
15. The method according to claim 11, characterized in that the cutting further comprises:
detecting the silent portions of the input speech signal; and
cutting the input speech signal into the one or more audio files based on the detected silent portions.
16. The method according to claim 15, characterized in that a silent portion comprises a portion whose decibel value remains at or below a noise threshold for a period of 0.6 seconds or longer.
17. The method according to claim 11, characterized in that the method further comprises, before the cutting, applying gain control to the input speech signal.
18. The method according to claim 17, characterized in that the gain control further comprises:
amplifying an input speech signal whose decibel value is below a set value up to the set value; and
reducing an input speech signal whose decibel value is above the set value down to the set value.
19. The method according to claim 11, characterized in that the method further comprises, before the cutting, performing noise reduction on the input speech signal.
20. The method according to claim 19, characterized in that the noise reduction further comprises applying Wiener filtering to the input speech signal.
CN2012103909731A 2012-10-15 2012-10-15 Instant call translation system and instant call translation method Pending CN102903361A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012103909731A CN102903361A (en) 2012-10-15 2012-10-15 Instant call translation system and instant call translation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012103909731A CN102903361A (en) 2012-10-15 2012-10-15 Instant call translation system and instant call translation method

Publications (1)

Publication Number Publication Date
CN102903361A true CN102903361A (en) 2013-01-30

Family

ID=47575565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012103909731A Pending CN102903361A (en) 2012-10-15 2012-10-15 Instant call translation system and instant call translation method

Country Status (1)

Country Link
CN (1) CN102903361A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1334532A (en) * 2000-07-13 2002-02-06 白涛 Automatic simultaneous interpretation system between multiple languages for GSM
CN1870728A (en) * 2005-05-23 2006-11-29 北京大学 Method and system for automatic subtilting
US8059641B1 (en) * 2006-07-20 2011-11-15 Avaya Inc. Encapsulation method discovery protocol for network address translation gateway traversal
CN102227767A (en) * 2008-11-12 2011-10-26 Scti控股公司 System and method for automatic speach to text conversion
CN101867632A (en) * 2009-06-12 2010-10-20 刘越 Mobile phone speech instant translation system and method

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103167360A (en) * 2013-02-21 2013-06-19 中国对外翻译出版有限公司 Method for achieving multilingual subtitle translation
CN103226947A (en) * 2013-03-27 2013-07-31 广东欧珀移动通信有限公司 Mobile terminal-based audio processing method and device
CN103226947B (en) * 2013-03-27 2016-08-17 广东欧珀移动通信有限公司 A kind of audio-frequency processing method based on mobile terminal and device
CN104427294A (en) * 2013-08-29 2015-03-18 中兴通讯股份有限公司 Method for supporting video conference simultaneous interpretation and cloud-terminal server thereof
CN103647880B (en) * 2013-12-13 2015-11-18 南京丰泰通信技术股份有限公司 A kind of telephone set with telephone text translation function
CN103647566A (en) * 2013-12-13 2014-03-19 南京丰泰通信技术股份有限公司 Radio set with radio receiving and message translating functions
CN103646664A (en) * 2013-12-13 2014-03-19 南京丰泰通信技术股份有限公司 Recording machine with function of translating records into telegraph text
CN103647567A (en) * 2013-12-13 2014-03-19 南京丰泰通信技术股份有限公司 Radio cassette recorder having functions of radio receiving and recording and text translation
CN103646664B (en) * 2013-12-13 2016-01-13 南京丰泰通信技术股份有限公司 A kind of sound-track engraving apparatus with recording text translation function
CN103647880A (en) * 2013-12-13 2014-03-19 南京丰泰通信技术股份有限公司 Telephone set having function of telephone text translation
CN104754536A (en) * 2013-12-27 2015-07-01 中国移动通信集团公司 Method and system for realizing communication between different languages
CN108810291A (en) * 2014-05-23 2018-11-13 三星电子株式会社 The system and method that " voice-message " calling service is provided
CN108810291B (en) * 2014-05-23 2021-04-20 三星电子株式会社 System and method for providing voice-message call service
CN104252861A (en) * 2014-09-11 2014-12-31 百度在线网络技术(北京)有限公司 Video voice conversion method, video voice conversion device and server
CN105139849A (en) * 2015-07-22 2015-12-09 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN105139849B (en) * 2015-07-22 2017-05-10 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN106598982A (en) * 2015-10-15 2017-04-26 比亚迪股份有限公司 Method and device for creating language databases and language translation method and device
CN105828101A (en) * 2016-03-29 2016-08-03 北京小米移动软件有限公司 Method and device for generation of subtitles files
CN105828101B (en) * 2016-03-29 2019-03-08 北京小米移动软件有限公司 Generate the method and device of subtitle file
WO2018195704A1 (en) * 2017-04-24 2018-11-01 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for real-time transcription of an audio signal into texts
CN107316639A (en) * 2017-05-19 2017-11-03 北京新美互通科技有限公司 A kind of data inputting method and device based on speech recognition, electronic equipment
CN107291704A (en) * 2017-05-26 2017-10-24 北京搜狗科技发展有限公司 Treating method and apparatus, the device for processing
CN107291704B (en) * 2017-05-26 2020-12-11 北京搜狗科技发展有限公司 Processing method and device for processing
WO2019019135A1 (en) * 2017-07-28 2019-01-31 深圳市沃特沃德股份有限公司 Voice translation method and device
CN108281145A (en) * 2018-01-29 2018-07-13 南京地平线机器人技术有限公司 Method of speech processing, voice processing apparatus and electronic equipment
CN108281145B (en) * 2018-01-29 2021-07-02 南京地平线机器人技术有限公司 Voice processing method, voice processing device and electronic equipment
CN110473519A (en) * 2018-05-11 2019-11-19 北京国双科技有限公司 A kind of method of speech processing and device
CN110473519B (en) * 2018-05-11 2022-05-27 北京国双科技有限公司 Voice processing method and device
CN109036451A (en) * 2018-07-13 2018-12-18 深圳市小瑞科技股份有限公司 A kind of simultaneous interpretation terminal and its simultaneous interpretation system based on artificial intelligence
CN108847237A (en) * 2018-07-27 2018-11-20 重庆柚瓣家科技有限公司 continuous speech recognition method and system
CN109102804A (en) * 2018-08-17 2018-12-28 飞救医疗科技(赣州)有限公司 A kind of method and its system of the input of voice case history terminal
CN111046680A (en) * 2018-10-15 2020-04-21 华为技术有限公司 Translation method and electronic equipment
CN111046680B (en) * 2018-10-15 2022-05-24 华为技术有限公司 Translation method and electronic equipment
US11570299B2 (en) 2018-10-15 2023-01-31 Huawei Technologies Co., Ltd. Translation method and electronic device
US11843716B2 (en) 2018-10-15 2023-12-12 Huawei Technologies Co., Ltd. Translation method and electronic device
US11893359B2 (en) 2018-10-15 2024-02-06 Huawei Technologies Co., Ltd. Speech translation method and terminal when translated speech of two users are obtained at the same time
CN109754808A (en) * 2018-12-13 2019-05-14 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of voice conversion text
CN109754808B (en) * 2018-12-13 2024-02-13 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for converting voice into text
CN112037768A (en) * 2019-05-14 2020-12-04 北京三星通信技术研究有限公司 Voice translation method and device, electronic equipment and computer readable storage medium
CN110730360A (en) * 2019-10-25 2020-01-24 北京达佳互联信息技术有限公司 Video uploading and playing methods and devices, client equipment and storage medium
CN112435666A (en) * 2020-09-30 2021-03-02 远传融创(杭州)科技有限公司 Intelligent voice digital communication method based on deep learning model


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1177546

Country of ref document: HK

C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130130