CN107910004A - Speech translation processing method and device - Google Patents

Speech translation processing method and device

Info

Publication number
CN107910004A
Authority
CN
China
Prior art keywords
synthetic speech
voice signal
speech signal
signal
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711107221.9A
Other languages
Chinese (zh)
Inventor
刘俊华
魏思
胡国平
柳林
王建社
方昕
李永超
孟廷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201711107221.9A
Publication of CN107910004A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the present invention provides a speech translation processing method and device, belonging to the field of language processing technology. The method includes: while a first synthesized speech signal is being broadcast, if a mixed speech signal containing part of the first synthesized speech signal is received, stopping the broadcast of the first synthesized speech signal; filtering the partial first synthesized speech signal out of the mixed speech signal to obtain the current round's speech signal to be translated, which serves as the target speech signal; and obtaining a second synthesized speech signal based on the target speech signal and broadcasting it. In the embodiment, broadcasting of the first synthesized speech signal stops, and the second synthesized speech signal is broadcast, as soon as a mixed speech signal containing part of the first synthesized speech signal is received. Because either party in a conversation can interrupt the broadcast at any time in full-duplex fashion, without waiting for each round's broadcast to finish, communication efficiency is improved.

Description

Speech translation processing method and device
Technical field
Embodiments of the present invention relate to the field of language processing technology, and in particular to a speech translation processing method and device.
Background technology
At present, language barriers are an important issue faced by different linguistic communities when they communicate with each other. In a two-person or multi-person conference, for example, speech translation can be provided by an automatic speech translation system. Such a system usually consists of three parts: speech recognition, machine translation, and speech synthesis. A source-language speech signal is converted into source-language text data by speech recognition; the source-language text data is then translated into target-language text data by machine translation; finally, speech synthesis is applied to the target-language text data to obtain a target-language speech signal, which is broadcast. During speech translation, the source-language speech signal of the next round can only be translated after the target-language speech signal of the previous round has finished being broadcast.
In real-time communication, however, the listener may already understand what the speaker means when the previous round's broadcast is only halfway through, yet must still wait for that broadcast to finish before the next round of communication can continue, so communication efficiency is low.
Summary of the invention
To solve the above problem, embodiments of the present invention provide a speech translation processing method and device that overcome, or at least partially solve, the problem described above.
According to a first aspect of the embodiments of the present invention, a speech translation processing method is provided, the method including:
while a first synthesized speech signal is being broadcast, if a mixed speech signal containing part of the first synthesized speech signal is received, stopping the broadcast of the first synthesized speech signal, the first synthesized speech signal being the signal obtained by translation and speech synthesis in the previous round;
filtering the partial first synthesized speech signal out of the mixed speech signal to obtain the current round's speech signal to be translated, which serves as the target speech signal; and
obtaining a second synthesized speech signal based on the target speech signal and broadcasting the second synthesized speech signal, the second synthesized speech signal being the signal obtained by translating the target speech signal and performing speech synthesis.
In the method provided by the embodiment, while the first synthesized speech signal is being broadcast, the broadcast stops as soon as a mixed speech signal containing part of the first synthesized speech signal is received; the partial first synthesized speech signal is filtered out of the mixed speech signal to obtain the current round's speech signal to be translated, which serves as the target speech signal; and a second synthesized speech signal is obtained from the target speech signal and broadcast. Because either party in a conversation can interrupt the broadcast at any time in full-duplex fashion, without waiting for each round's broadcast to finish, communication efficiency is improved, and communication between users of different languages becomes more natural and fluent.
With reference to the first possible implementation of the first aspect, in a second possible implementation, the first synthesized speech signal and the target speech signal are of the same language type, or the first synthesized speech signal and the target speech signal are of different language types.
With reference to the first possible implementation of the first aspect, in a third possible implementation, before broadcasting the second synthesized speech signal, the method further includes:
obtaining the identification text data produced by performing speech recognition on the target speech signal, obtaining the target text data produced by translating the identification text data, and performing speech synthesis on the target text data to obtain the second synthesized speech signal.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation, obtaining the target text data produced by translating the identification text data includes:
determining the source language type corresponding to the identification text data, and determining the target language type corresponding to the source language type according to a preset correspondence; and
inputting the target language type and the identification text data into a translation encoder-decoder model, and outputting the target text data.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation, based on a voiceprint feature in the target speech signal, the preset language type corresponding to the voiceprint feature is determined and used as the target language type corresponding to the source language type.
With reference to the third possible implementation of the first aspect, in a sixth possible implementation, obtaining the target text data produced by translating the identification text data includes:
if it is determined that the information conveyed by the target speech signal and by the first synthesized speech signal is interrelated, translating the identification text data based on the speech signals and/or translation results of the rounds preceding the current round, to obtain the target text data.
With reference to the third possible implementation of the first aspect, in a seventh possible implementation, performing speech synthesis on the target text data to obtain the second synthesized speech signal includes:
obtaining voice broadcast parameters, inputting the target text data and the voice broadcast parameters into a speech synthesis model, and outputting the second synthesized speech signal, the voice broadcast parameters including at least the timbre parameters to be used when broadcasting the second synthesized speech signal.
According to a second aspect of the embodiments of the present invention, a speech translation processing device is provided, the device including:
a stop-broadcast module, configured to stop broadcasting a first synthesized speech signal if, while the first synthesized speech signal is being broadcast, a mixed speech signal containing part of the first synthesized speech signal is received, the first synthesized speech signal being the signal obtained by translation and speech synthesis in the previous round;
a filtering module, configured to filter the partial first synthesized speech signal out of the mixed speech signal to obtain the current round's speech signal to be translated, which serves as the target speech signal; and
a broadcast module, configured to obtain a second synthesized speech signal based on the target speech signal and broadcast it, the second synthesized speech signal being the signal obtained by translating the target speech signal and performing speech synthesis.
According to a third aspect of the embodiments of the present invention, a speech translation processing apparatus is provided, including:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor, by invoking the program instructions, is able to perform the speech translation processing method provided by any possible implementation among the various possible implementations of the first aspect.
According to a fourth aspect of the present invention, a non-transitory computer-readable storage medium is provided, the non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the speech translation processing method provided by any possible implementation among the various possible implementations of the first aspect.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory, and do not limit the embodiments of the present invention.
Brief description of the drawings
Fig. 1 is a flow diagram of a speech translation processing method according to an embodiment of the present invention;
Fig. 2 is a flow diagram of a speech translation processing method according to an embodiment of the present invention;
Fig. 3 is a flow diagram of a speech translation processing method according to an embodiment of the present invention;
Fig. 4 is a block diagram of a speech translation processing device according to an embodiment of the present invention;
Fig. 5 is a block diagram of a speech translation processing apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the embodiments of the present invention, not to limit their scope.
At present, people who speak different languages usually rely on an automatic speech translation system when communicating with each other. Such a system usually consists of three parts: speech recognition, machine translation, and speech synthesis. A source-language speech signal is converted into source-language text data by speech recognition; the source-language text data is then translated into target-language text data by machine translation; finally, speech synthesis is applied to the target-language text data to obtain a target-language speech signal, which is broadcast. During speech translation, the translation, speech synthesis, and broadcast of the next round can only proceed after the target-language speech signal of the previous round has finished being broadcast.
For example, suppose user A and user B are communicating, user A speaking English and user B speaking Chinese. User A says an English sentence, which through translation and speech synthesis becomes a Chinese sentence that is then broadcast. Only after the whole sentence has been broadcast can user A continue with another English sentence, or user B say a Chinese sentence, and the above process of translation, speech synthesis, and broadcast repeat. That is, user A and user B must wait until the system finishes broadcasting before new speech data can be received, translated, synthesized, and broadcast.
After saying a stretch of speech, a user may need to supplement or revise what was just said. Moreover, while a synthesized speech signal is being broadcast, the listening user may understand the speaker's intention without hearing the broadcast to the end. In such situations, following the above flow (finishing the previous round's broadcast in full before translating, synthesizing, and broadcasting the next round) wastes time. To address this problem, an embodiment of the present invention provides a speech translation processing method. The method can be applied to a terminal or system with speech capture, translation, synthesis, and broadcast functions, and to two-person or multi-person communication scenarios; the embodiment of the present invention does not specifically limit this. Referring to Fig. 1, the method includes: 101, while a first synthesized speech signal is being broadcast, if a mixed speech signal containing part of the first synthesized speech signal is received, stopping the broadcast of the first synthesized speech signal, the first synthesized speech signal being the signal obtained by translation and speech synthesis in the previous round; 102, filtering the partial first synthesized speech signal out of the mixed speech signal to obtain the current round's speech signal to be translated, which serves as the target speech signal; 103, obtaining a second synthesized speech signal based on the target speech signal and broadcasting it, the second synthesized speech signal being the signal obtained by translating the target speech signal and performing speech synthesis.
In step 101 above, the first synthesized speech signal is obtained from the previous round's source-language signal after capture, translation, and speech synthesis. While the first synthesized speech signal is being broadcast, the system can simultaneously listen for new source-language speech, that is, monitor whether a user has spoken again and produced speech that needs to be translated and broadcast. Specifically, monitoring can be done by starting a listening thread; the embodiment of the present invention does not specifically limit this. During monitoring, because the previous round's first synthesized speech signal is still being broadcast, the captured speech contains not only the new source-language speech signal (the user's new utterance) but also part of the first synthesized speech signal, so what is received is a mixed speech signal containing part of the first synthesized speech signal. Receiving such a mixed signal indicates that a user has spoken: possibly the previous round's speaker wants to supplement or revise what was said and so interrupts the broadcast of the first synthesized speech signal, or possibly the listening user has understood the previous speaker's intention without hearing the broadcast out and interrupts it to speak.
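The broadcast-with-interruption behavior described above can be sketched as follows. This is a minimal illustration under assumed names, not the patent's implementation: `play` is a stand-in for audio output, and the `threading.Event` stands for the listening thread's detection of user speech mixed into the capture.

```python
import threading

def broadcast(chunks, interrupted, play):
    """Play chunks of the first synthesized signal, stopping as soon as
    the monitor signals that new user speech has been detected."""
    played = []
    for chunk in chunks:
        if interrupted.is_set():  # user spoke: stop the report at once
            break
        play(chunk)
        played.append(chunk)
    return played

interrupted = threading.Event()

def play(chunk):
    # Stand-in for audio output; pretend the monitor detects mixed-in
    # user speech while the 3rd chunk is playing.
    if chunk == 2:
        interrupted.set()

played = broadcast(list(range(100)), interrupted, play)
assert played == [0, 1, 2]  # broadcast was cut short mid-utterance
```

In a real system the event would be set by a separate listening thread running voice-activity detection on the microphone signal while playback proceeds on the main thread.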
Because the mixed speech signal contains, besides part of the first synthesized speech signal, the current round's speech signal to be translated, and that signal must subsequently be translated, synthesized, and broadcast, step 102 above filters the partial first synthesized speech signal out of the mixed speech signal to obtain the current round's speech signal to be translated. The embodiment of the present invention does not specifically limit how the partial first synthesized speech signal is filtered out of the mixed signal; methods include, but are not limited to, echo cancellation. The echo cancellation computation can proceed as follows:
Taking a microphone as the monitoring device, suppose the broadcast portion of the first synthesized speech signal is s(t), the channel transfer function to the m-th microphone is h_m(t), and the new speech signal to be translated input by the user is x_m(t). The observation signal y_m(t) received by the microphone is then:
y_m(t) = s(t) * h_m(t) + x_m(t)
When there is no newly input speech signal x_m(t), the channel transfer function h_m(t) can be estimated in advance. When there is a newly input speech signal x_m(t), echo cancellation can be applied to the mixed speech signal. Since y_m(t), s(t), and h_m(t) are known, the current round's speech signal to be translated can be computed by the following formula and used as the target speech signal:
x'_m(t) = y_m(t) - s(t) * h_m(t)
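The cancellation formula above can be checked numerically. This is a toy sketch under the text's assumption that s(t) and h_m(t) are known; a production canceller would estimate h_m(t) adaptively rather than take it as given.

```python
import numpy as np

def cancel_echo(y_m, s, h_m):
    """Subtract the known broadcast echo s(t) * h_m(t) from the microphone
    observation y_m(t), leaving the user's new speech x_m(t)."""
    echo = np.convolve(s, h_m)[: len(y_m)]  # s(t) * h_m(t), truncated to y_m
    return y_m - echo

# Toy check: build y_m = s * h_m + x_m and recover x_m exactly.
s = np.array([1.0, 0.5, 0.25, 0.0])         # part of the broadcast signal
h_m = np.array([0.8, 0.1])                  # assumed channel impulse response
x_m = np.array([0.0, 0.3, -0.2, 0.7, 0.1])  # the user's new speech
y_m = np.convolve(s, h_m) + x_m             # the observed mixed signal
assert np.allclose(cancel_echo(y_m, s, h_m), x_m)
```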
After the target speech signal is obtained, it can be translated and synthesized to obtain the current round's second synthesized speech signal, which is then broadcast.
In the method provided by the embodiment, while the first synthesized speech signal is being broadcast, the broadcast stops as soon as a mixed speech signal containing part of the first synthesized speech signal is received; the partial first synthesized speech signal is filtered out of the mixed speech signal to obtain the current round's speech signal to be translated, which serves as the target speech signal; and a second synthesized speech signal is obtained from the target speech signal and broadcast. Because either party in a conversation can interrupt the broadcast at any time in full-duplex fashion, without waiting for each round's broadcast to finish, communication efficiency is improved and communication between users of different languages becomes more natural and fluent.
Based on the above embodiment, either party in a conversation can interrupt the broadcast as needed: the speaking user can interrupt the broadcast of their own utterance, and the listening user can interrupt the broadcast of the previous round's utterance. Accordingly, as an optional embodiment, the first synthesized speech signal and the target speech signal are of the same language type, or of different language types.
When the speech signal from which the first synthesized speech signal was translated and the target speech signal are of the same language type, either the previous round's speaker interrupted the broadcast of their own utterance, or the current round's speaker uses the same language as the previous round's speaker and interrupted the broadcast of the previous round's utterance. When the first synthesized speech signal and the target speech signal are of different language types, either the listening user interrupted the broadcast of the previous round's utterance, or the previous round's speaker interrupted the broadcast of their own utterance and then spoke in a language different from the previous round's. On this basis, the embodiment of the present invention is applicable to communication scenarios involving different languages between two or more people, such as two-person or multi-person conferences.
Based on the above embodiment, the second synthesized speech signal must be obtained before it can be broadcast. Accordingly, as an optional embodiment, an embodiment of the present invention provides a method for obtaining the synthesized speech signal. Referring to Fig. 2, the method includes: 201, obtaining the identification text data produced by performing speech recognition on the target speech signal; 202, obtaining the target text data produced by translating the identification text data; 203, performing speech synthesis on the target text data to obtain the second synthesized speech signal.
Based on the above embodiment, as an optional embodiment, an embodiment of the present invention further provides a method for obtaining the target text data. Referring to Fig. 3, the method includes: 2011, determining the source language type corresponding to the identification text data, and determining the target language type corresponding to the source language type according to a preset correspondence; 2012, inputting the target language type and the identification text data into a translation encoder-decoder model and outputting the target text data.
In step 2011 above, the source language type corresponding to the identification text data must first be determined. The embodiment of the present invention does not specifically limit how this is done; methods include, but are not limited to, the following two.
First way: determination based on acoustic features of the target speech signal.
Specifically, acoustic features of the target speech signal can be extracted, such as spectral features: Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients, and so on. The acoustic features are input into a language identification model, which predicts the language of the target speech signal. The model outputs, for each language type, the probability that the target speech signal is of that language; the language with the highest probability is selected as the language of the target speech signal, thereby determining the source language type corresponding to the identification text data. The language identification model is usually a common classification model in pattern recognition; it can be built in advance by collecting a large number of speech signals, extracting the acoustic features of each, and labeling each signal's language type.
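The selection step at the end of this first way reduces to an argmax over the model's per-language posteriors. A minimal sketch, with the posterior values assumed for illustration (the real values would come from the classification model):

```python
def pick_language(posteriors):
    """Select the language type with the highest model probability."""
    return max(posteriors, key=posteriors.get)

# Hypothetical per-language posteriors produced by a language
# identification model from acoustic features (MFCC/PLP).
posteriors = {"zh": 0.86, "en": 0.11, "fr": 0.03}
assert pick_language(posteriors) == "zh"
```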
Second way: determination based on recognition results for the target speech signal.
Specifically, the speech recognition model of each language currently involved is applied in turn to the target speech signal, yielding, for each language, identification text data and a corresponding recognition confidence; the language whose identification text data has the highest recognition confidence is selected as the language of the target speech signal. The speech recognition process is generally: first perform endpoint detection on the target speech signal to obtain the start and end points of the effective speech segments; then extract features from the detected effective speech segments; finally decode the extracted feature data with pre-trained acoustic and language models, obtaining the identification text for the current speech data and the corresponding confidence.
For example, suppose the language of the target speech signal is Chinese and the languages currently involved are Chinese and English. For language identification, the Chinese and English speech recognition models are each applied to the target speech signal, yielding Chinese identification text data with recognition confidence 0.9 and English identification text data with recognition confidence 0.2. The language of the identification text data with the higher recognition confidence, Chinese, is selected as the language of the target speech signal. Furthermore, the recognition confidence and language model score of each language's identification text data can be fused, and the language of the identification text data with the highest fused score selected as the language of the target speech signal. The fusion method can be linear weighting; the embodiment of the present invention does not specifically limit this.
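The linear-weighting fusion mentioned above can be sketched as follows. The weight and the per-language scores are assumptions for illustration only; the text does not fix them.

```python
def fuse_score(recognition_conf, lm_score, w=0.7):
    """Linear weighting of recognition confidence and language model score."""
    return w * recognition_conf + (1.0 - w) * lm_score

# Hypothetical scores per candidate language, as in the Chinese/English example.
candidates = {
    "zh": {"conf": 0.9, "lm": 0.8},
    "en": {"conf": 0.2, "lm": 0.4},
}
best = max(candidates,
           key=lambda k: fuse_score(candidates[k]["conf"], candidates[k]["lm"]))
assert best == "zh"  # fused scores: zh = 0.87, en = 0.26
```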
After the source language type corresponding to the identification text data is determined, the corresponding target language type can be determined according to a preset correspondence. The source language type is the language used by the speaking user; the target language type is generally the other language types involved in the current exchange, besides the source language type. There may be one or more target language types, corresponding respectively to communication between users of two languages and among users of multiple languages.
For example, if the languages involved in the current exchange are Chinese, English, French, Thai, Hindi, German, and so on, it can be predetermined that each Chinese speech signal is translated into both English and French, that is, English and French serve as the target language types.
Besides presetting fixed target language types, the target language type corresponding to the source language type can also be determined from the preference or demand of the speaking user. Accordingly, as an optional embodiment, the embodiment of the present invention does not specifically limit how the target language type is determined from the preset correspondence; methods include, but are not limited to: based on a voiceprint feature in the target speech signal, determining the preset language type corresponding to that voiceprint feature and using it as the target language type corresponding to the source language type.
For example, if a leader's remarks in a meeting must be automatically translated into English, the target language type for the leader's speech can be set to English in advance, and a correspondence established with the leader's voiceprint feature, so that once the leader's voiceprint feature is extracted during the meeting, the target language type can be directly determined to be English.
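The voiceprint-to-target-language lookup can be sketched as a small registry with a fallback to the preset source-to-target map. All identifiers here are illustrative assumptions, not names from the patent.

```python
# Hypothetical registry: enrolled voiceprint IDs mapped to the target
# language preconfigured for that speaker (e.g. a chair whose remarks
# must always be rendered into English).
VOICEPRINT_TO_TARGET = {"speaker_chair": "en"}
# Fallback preset correspondence from source language to target languages.
DEFAULT_TARGETS = {"zh": ["en"], "en": ["zh"]}

def resolve_targets(voiceprint_id, source_lang):
    """Prefer the per-speaker preset; otherwise use the source-language map."""
    if voiceprint_id in VOICEPRINT_TO_TARGET:
        return [VOICEPRINT_TO_TARGET[voiceprint_id]]
    return DEFAULT_TARGETS[source_lang]

assert resolve_targets("speaker_chair", "zh") == ["en"]
assert resolve_targets("unknown_speaker", "en") == ["zh"]
```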
In step 2012 above, when the target language type and the identification text data are input into the translation encoder-decoder model and the target text data is output, a neural-network-based translation encoder-decoder model can be used to translate the identification text data into the corresponding target-language text data; the embodiment of the present invention does not specifically limit this. The translation encoder-decoder model corresponding to each language type can be built in advance from a large amount of training data.
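One common way to feed a target language type into a single encoder-decoder model, borrowed from multilingual NMT practice and not stated in the patent itself, is to prepend a target-language token to the source text before encoding. A minimal sketch of that input-building step (the decode itself is omitted):

```python
def build_model_input(target_lang, identification_text):
    """Prepend a target-language token so one encoder-decoder model can
    serve several target languages (an assumption borrowed from
    multilingual NMT, not the patent's stated design)."""
    return f"<2{target_lang}> {identification_text}"

assert build_model_input("en", "你好") == "<2en> 你好"
```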
Besides translating the identification text data with the translation encoder-decoder model as above, and considering the relationships between contextual content during speech, the embodiment of the present invention further provides a method for obtaining the target text data, including but not limited to: if it is determined that the information conveyed by the target speech signal and by the first synthesized speech signal is interrelated, translating the identification text data based on the speech signals and/or translation results of the rounds preceding the current round, to obtain the target text data.
Specifically, it can be judged whether the information conveyed by the current round's target speech signal and by the previous round's first synthesized speech signal is interrelated. If the two are related (for example, the speaking user interrupts the broadcast to add a supplement), the current round's target speech signal can be merged with the previous round's already-translated speech signal, translated again, and the translation result broadcast; or the translation results of the two rounds' speech signals can be merged after translation and then broadcast; or the previous round's translated speech signal and translation result can be combined with the target speech signal and its translation result, the two rounds' translation results merged, and then broadcast. In addition, when translating the current round's identification text data, besides the previous round's speech signal and/or translation result, the speech signals and/or translation results of the preceding n rounds can serve as reference; the embodiment of the present invention does not specifically limit this. Here n is greater than or equal to 1.
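The first variant (merge the current round's text with preceding rounds before re-translating) can be sketched at the text level as follows. The function names, the relatedness flag, and the example sentences are all illustrative; the relatedness judgment itself is outside this sketch.

```python
def build_translation_input(history, current_text, related, n=1):
    """Merge the current round's recognized text with the previous n
    rounds' source texts when the rounds are judged interrelated.
    history: list of per-round source texts, oldest first."""
    if related and history:
        context = " ".join(history[-n:])  # preceding n rounds as context
        return context + " " + current_text
    return current_text

history = ["We meet tomorrow"]
# Speaker interrupted the broadcast to correct the time: merge, then translate.
merged = build_translation_input(history, "at nine, not ten", related=True)
assert merged == "We meet tomorrow at nine, not ten"
# Unrelated new utterance: translate on its own.
assert build_translation_input(history, "unrelated remark", related=False) == "unrelated remark"
```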
After the target text data is obtained, speech synthesis may be performed on the target text data to obtain the second synthetic speech signal. Accordingly, as an alternative embodiment, the manner of performing speech synthesis on the target text data to obtain the second synthetic speech signal is not specifically limited in the embodiment of the present invention, and includes but is not limited to the following: voice playback parameters are obtained, the target text data and the voice playback parameters are input to a speech synthesis model, and the second synthetic speech signal is output, where the voice playback parameters include at least the timbre parameter to be used when playing back the second synthetic speech signal.
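A minimal sketch of this interface — target text plus playback parameters in, synthetic signal out — might look like the following. The parameter names beyond the required timbre field (`speed`, `volume`) are assumptions for illustration, and the synthesizer is a stub that describes rather than produces audio:

```python
from dataclasses import dataclass

@dataclass
class VoicePlaybackParams:
    """Playback parameters fed to the synthesis model alongside the
    target text; only the timbre parameter is required per the
    description, the other fields are illustrative assumptions."""
    timbre: str          # id of a speaker/timbre model
    speed: float = 1.0   # assumed optional extras
    volume: float = 1.0

def synthesize(target_text, params):
    """Stand-in for the speech synthesis model: returns a description
    of the waveform it would produce instead of real audio."""
    return (f"<audio text={target_text!r} timbre={params.timbre} "
            f"speed={params.speed} volume={params.volume}>")
```

A real system would pass these parameters into a trained synthesis model; the stub only shows the data flow.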
When performing speech synthesis, a fixed speaker model may be selected, for example a synthesis model with a neutral, full voice may be used to synthesize the corresponding second synthetic speech signal. Of course, a personalized speaker model may also be selected. Specifically, the speech translation system may contain voices of a variety of timbres; the user may make the choice, or the system may make it according to the user information of the current user, and the embodiment of the present invention is not specifically limited in this respect. The user information includes but is not limited to the user's gender, age, timbre, and the like. For example, if the user listening to the playback is male, the system may automatically select a female speaker model, so as to synthesize a second synthetic speech signal with a female voice.
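The selection logic above can be sketched in a few lines. The opposite-gender default mirrors the male-listener/female-voice example; the model names and the exact fallback rules are invented for illustration:

```python
def select_speaker_model(user_choice=None, user_info=None):
    """Sketch of speaker-model selection: the user's own choice wins;
    otherwise the system picks from the current user's information
    (gender, age, timbre, ...); otherwise a fixed neutral model."""
    if user_choice is not None:
        return user_choice
    user_info = user_info or {}
    gender = user_info.get("gender")
    if gender == "male":
        return "female_speaker"   # male listener -> female voice
    if gender == "female":
        return "male_speaker"
    return "neutral_speaker"      # fixed, neutral fallback model
```

Any other user-information fields (age, preferred timbre) would slot into the same decision chain.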
Of course, voice conversion may also be used, converting the synthesized voice into a voice close to the user's own timbre before playback. For example, after the to-be-translated speech signal input by user A has been translated, voice conversion is applied so that the timbre of the converted speech signal is close to user A's timbre, and the converted speech signal is played back.
In the method provided by the embodiment of the present invention, during playback of the first synthetic speech signal, if a mixed speech signal containing part of the first synthetic speech signal is received, playback of the first synthetic speech signal is stopped. The part of the first synthetic speech signal is filtered out of the mixed speech signal to obtain the to-be-translated speech signal of the current round, which serves as the target speech signal. Based on the target speech signal, a second synthetic speech signal is obtained and played back. Since either party in the communication can interrupt the playback process at any time in full-duplex fashion, without waiting for the playback of each round to finish, communication efficiency is improved while communication between users of different languages becomes more natural and fluent.
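The barge-in step summarized above — detect that the microphone signal contains part of the announcement, stop playback, and filter the announcement out to recover the user's speech — can be sketched numerically. This toy uses a fixed scalar echo gain and a simple correlation test; real systems would use adaptive acoustic echo cancellation, so treat every constant here as an assumption:

```python
def handle_mic_frame(mic_frame, playback_frame, echo_gain=0.6, threshold=0.1):
    """Sketch of the full-duplex barge-in step: if the microphone frame
    contains part of the synthetic signal being played back, signal a
    stop and subtract the scaled playback reference to recover the
    user's own speech (the target speech signal)."""
    # Crude presence test: correlation between mic input and playback.
    corr = sum(m * p for m, p in zip(mic_frame, playback_frame))
    barge_in = corr > threshold
    if not barge_in:
        return False, list(mic_frame)
    # Filter the first synthetic speech signal out of the mixed signal.
    cleaned = [m - echo_gain * p for m, p in zip(mic_frame, playback_frame)]
    return True, cleaned  # True => stop playback; cleaned => target signal
```

With a mixed frame built as `user + 0.6 * playback`, subtraction recovers the user component to within floating-point error.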
Further, since the recognized text data is translated based on the speech signals and/or translation results of the rounds preceding the current round, the contextual relevance within the speech is fully exploited and translation accuracy is improved.
Finally, since the target language type corresponding to the source language type can be determined according to a preset correspondence, for example based on the voiceprint features in the target speech signal, the user's preferences and needs in translation can be met and personalized customization achieved.
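The voiceprint-based determination of the target language type can be sketched as a two-step lookup: reduce the voiceprint feature to a speaker identity, then look that identity up in the preset correspondence. Everything below — the 1-D "embeddings", the nearest-centroid rule, the default language — is a hypothetical simplification:

```python
def identify_speaker(voiceprint_feature):
    """Placeholder speaker identification: nearest enrolled centroid.
    Real systems would compare high-dimensional voiceprint embeddings;
    the 1-D values here are invented for illustration."""
    enrolled = {"alice": 0.2, "bob": 0.8}
    return min(enrolled, key=lambda s: abs(enrolled[s] - voiceprint_feature))

def determine_target_language(voiceprint_feature, preset_map, default="en"):
    """Map a voiceprint feature from the target speech signal to the
    preset language type registered for that speaker; `preset_map` is
    the preset correspondence."""
    speaker_id = identify_speaker(voiceprint_feature)
    return preset_map.get(speaker_id, default)
```

This is how a user's registered preference can drive the choice of target language without any explicit per-utterance setting.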
It should be noted that all of the above alternative embodiments may be combined in any manner to form alternative embodiments of the present invention, which will not be described again here.
Based on the content of the above embodiments, an embodiment of the present invention provides a speech translation processing apparatus for performing the speech translation processing method in the above method embodiments. Referring to Fig. 4, the apparatus includes:
a playback stopping module 401, configured to, during playback of a first synthetic speech signal, stop playing back the first synthetic speech signal if a mixed speech signal containing part of the first synthetic speech signal is received, the first synthetic speech signal being obtained after the translation and speech synthesis of the previous round;
a filtering module 402, configured to filter the part of the first synthetic speech signal out of the mixed speech signal to obtain the to-be-translated speech signal of the current round, which serves as the target speech signal; and
a playback module 403, configured to obtain a second synthetic speech signal based on the target speech signal and play back the second synthetic speech signal, the second synthetic speech signal being obtained after the target speech signal is translated and speech synthesis is performed.
As an alternative embodiment, the first synthetic speech signal and the target speech signal are of the same language type, or the first synthetic speech signal and the target speech signal are of different language types.
As an alternative embodiment, the apparatus further includes:
a first obtaining module, configured to obtain recognized text data obtained after speech recognition is performed on the target speech signal;
a second obtaining module, configured to obtain target text data obtained after the recognized text data is translated; and
a speech synthesis module, configured to perform speech synthesis on the target text data to obtain the second synthetic speech signal.
As an alternative embodiment, the second obtaining module includes:
a determination unit, configured to determine the source language type corresponding to the recognized text data, and determine, according to a preset correspondence, the target language type corresponding to the source language type; and
a translation unit, configured to input the target language type and the recognized text data to a translation encoding/decoding model and output the target text data.
As an alternative embodiment, the determination unit is configured to determine, based on the voiceprint features in the target speech signal, the preset language type corresponding to the voiceprint features, as the target language type corresponding to the source language type.
As an alternative embodiment, the second obtaining module is configured to, when it is determined that the information conveyed by the target speech signal and by the first synthetic speech signal is interrelated, translate the recognized text data based on the speech signal and/or translation result of the rounds preceding the current round, to obtain the target text data.
As an alternative embodiment, the speech synthesis module is configured to obtain voice playback parameters, input the target text data and the voice playback parameters to a speech synthesis model, and output the second synthetic speech signal, where the voice playback parameters include at least the timbre parameter to be used when playing back the second synthetic speech signal.
In the apparatus provided by the embodiment of the present invention, during playback of the first synthetic speech signal, if a mixed speech signal containing part of the first synthetic speech signal is received, playback of the first synthetic speech signal is stopped. The part of the first synthetic speech signal is filtered out of the mixed speech signal to obtain the to-be-translated speech signal of the current round, which serves as the target speech signal. Based on the target speech signal, a second synthetic speech signal is obtained and played back. Since either party in the communication can interrupt the playback process at any time in full-duplex fashion, without waiting for the playback of each round to finish, communication efficiency is improved while communication between users of different languages becomes more natural and fluent.
Further, since the recognized text data is translated based on the speech signals and/or translation results of the rounds preceding the current round, the contextual relevance within the speech is fully exploited and translation accuracy is improved.
Finally, since the target language type corresponding to the source language type can be determined according to a preset correspondence, for example based on the voiceprint features in the target speech signal, the user's preferences and needs in translation can be met and personalized customization achieved.
An embodiment of the present invention provides a speech translation processing device. Referring to Fig. 5, the device includes: a processor 501, a memory 502, and a bus 503;
wherein the processor 501 and the memory 502 communicate with each other via the bus 503; and
the processor 501 is configured to call program instructions in the memory 502 to perform the speech translation processing method provided by the above embodiments, for example including: during playback of a first synthetic speech signal, if a mixed speech signal containing part of the first synthetic speech signal is received, stopping playback of the first synthetic speech signal, the first synthetic speech signal being obtained after the translation and speech synthesis of the previous round; filtering the part of the first synthetic speech signal out of the mixed speech signal to obtain the to-be-translated speech signal of the current round, which serves as the target speech signal; and obtaining a second synthetic speech signal based on the target speech signal and playing back the second synthetic speech signal, the second synthetic speech signal being obtained after the target speech signal is translated and speech synthesis is performed.
An embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the speech translation processing method provided by the above embodiments, for example including: during playback of a first synthetic speech signal, if a mixed speech signal containing part of the first synthetic speech signal is received, stopping playback of the first synthetic speech signal, the first synthetic speech signal being obtained after the translation and speech synthesis of the previous round; filtering the part of the first synthetic speech signal out of the mixed speech signal to obtain the to-be-translated speech signal of the current round, which serves as the target speech signal; and obtaining a second synthetic speech signal based on the target speech signal and playing back the second synthetic speech signal, the second synthetic speech signal being obtained after the target speech signal is translated and speech synthesis is performed.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be carried out by hardware related to program instructions; the foregoing program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the foregoing storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
The embodiments of devices such as the information interaction device described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the above technical solution, or the part thereof contributing to the prior art, may be embodied in the form of a software product; the computer software product may be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, or optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform some parts of the methods of the embodiments.
Finally, the methods of the present application are only preferred embodiments and are not intended to limit the protection scope of the embodiments of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the embodiments of the present invention shall be included in the protection scope of the embodiments of the present invention.

Claims (10)

  1. A speech translation processing method, characterized by comprising:
    during playback of a first synthetic speech signal, if a mixed speech signal containing part of the first synthetic speech signal is received, stopping playback of the first synthetic speech signal, the first synthetic speech signal being obtained after the translation and speech synthesis of the previous round;
    filtering the part of the first synthetic speech signal out of the mixed speech signal to obtain a to-be-translated speech signal of the current round, as a target speech signal; and
    obtaining a second synthetic speech signal based on the target speech signal, and playing back the second synthetic speech signal, the second synthetic speech signal being obtained after the target speech signal is translated and speech synthesis is performed.
  2. The method according to claim 1, characterized in that the first synthetic speech signal and the target speech signal are of the same language type, or the first synthetic speech signal and the target speech signal are of different language types.
  3. The method according to claim 1, characterized by further comprising, before the playing back of the second synthetic speech signal:
    obtaining recognized text data obtained after speech recognition is performed on the target speech signal, obtaining target text data obtained after the recognized text data is translated, and performing speech synthesis on the target text data to obtain the second synthetic speech signal.
  4. The method according to claim 3, characterized in that the obtaining of the target text data obtained after the recognized text data is translated comprises:
    determining the source language type corresponding to the recognized text data, and determining, according to a preset correspondence, the target language type corresponding to the source language type; and
    inputting the target language type and the recognized text data to a translation encoding/decoding model, and outputting the target text data.
  5. The method according to claim 4, characterized in that the determining, according to the preset correspondence, of the target language type corresponding to the source language type comprises:
    determining, based on the voiceprint features in the target speech signal, the preset language type corresponding to the voiceprint features, as the target language type corresponding to the source language type.
  6. The method according to claim 3, characterized in that the obtaining of the target text data obtained after the recognized text data is translated comprises:
    if it is determined that the information conveyed by the target speech signal and by the first synthetic speech signal is interrelated, translating the recognized text data based on the speech signal and/or translation result of the rounds preceding the current round, to obtain the target text data.
  7. The method according to claim 3, characterized in that the performing of speech synthesis on the target text data to obtain the second synthetic speech signal comprises:
    obtaining voice playback parameters, inputting the target text data and the voice playback parameters to a speech synthesis model, and outputting the second synthetic speech signal, the voice playback parameters including at least the timbre parameter to be used when playing back the second synthetic speech signal.
  8. A speech translation processing apparatus, characterized by comprising:
    a playback stopping module, configured to, during playback of a first synthetic speech signal, stop playing back the first synthetic speech signal if a mixed speech signal containing part of the first synthetic speech signal is received, the first synthetic speech signal being obtained after the translation and speech synthesis of the previous round;
    a filtering module, configured to filter the part of the first synthetic speech signal out of the mixed speech signal to obtain the to-be-translated speech signal of the current round, which serves as a target speech signal; and
    a playback module, configured to obtain a second synthetic speech signal based on the target speech signal and play back the second synthetic speech signal, the second synthetic speech signal being obtained after the target speech signal is translated and speech synthesis is performed.
  9. A speech translation processing device, characterized by comprising:
    at least one processor; and
    at least one memory communicatively connected to the processor, wherein:
    the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method according to any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, the computer instructions causing a computer to perform the method according to any one of claims 1 to 7.
CN201711107221.9A 2017-11-10 2017-11-10 Voiced translation processing method and processing device Pending CN107910004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711107221.9A CN107910004A (en) 2017-11-10 2017-11-10 Voiced translation processing method and processing device


Publications (1)

Publication Number Publication Date
CN107910004A true CN107910004A (en) 2018-04-13

Family

ID=61844975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711107221.9A Pending CN107910004A (en) 2017-11-10 2017-11-10 Voiced translation processing method and processing device

Country Status (1)

Country Link
CN (1) CN107910004A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154220A (en) * 2006-09-25 2008-04-02 株式会社东芝 Machine translation apparatus and method
CN103246643A (en) * 2012-02-10 2013-08-14 株式会社东芝 Speech translation apparatus and speech translation method
CN105512113A (en) * 2015-12-04 2016-04-20 青岛冠一科技有限公司 Communication type voice translation system and translation method
CN106156009A (en) * 2015-04-13 2016-11-23 中兴通讯股份有限公司 Voice translation method and device
CN106486125A (en) * 2016-09-29 2017-03-08 安徽声讯信息技术有限公司 A kind of simultaneous interpretation system based on speech recognition technology


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022103191A (en) * 2018-04-16 2022-07-07 グーグル エルエルシー Automated assistants that accommodate multiple age groups and/or vocabulary levels
US11495217B2 (en) 2018-04-16 2022-11-08 Google Llc Automated assistants that accommodate multiple age groups and/or vocabulary levels
JP7486540B2 (en) 2018-04-16 2024-05-17 グーグル エルエルシー Automated assistants that address multiple age groups and/or vocabulary levels
JP2021513119A (en) * 2018-04-16 2021-05-20 グーグル エルエルシーGoogle LLC Automated assistants dealing with multiple age groups and / or vocabulary levels
US11756537B2 (en) 2018-04-16 2023-09-12 Google Llc Automated assistants that accommodate multiple age groups and/or vocabulary levels
JP7064018B2 (en) 2018-04-16 2022-05-09 グーグル エルエルシー Automated assistant dealing with multiple age groups and / or vocabulary levels
CN109344411A (en) * 2018-09-19 2019-02-15 深圳市合言信息科技有限公司 A kind of interpretation method for listening to formula simultaneous interpretation automatically
CN110970014A (en) * 2019-10-31 2020-04-07 阿里巴巴集团控股有限公司 Voice conversion, file generation, broadcast, voice processing method, device and medium
CN110970014B (en) * 2019-10-31 2023-12-15 阿里巴巴集团控股有限公司 Voice conversion, file generation, broadcasting and voice processing method, equipment and medium
WO2021109000A1 (en) * 2019-12-03 2021-06-10 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
WO2021208531A1 (en) * 2020-04-16 2021-10-21 北京搜狗科技发展有限公司 Speech processing method and apparatus, and electronic device
CN113539233A (en) * 2020-04-16 2021-10-22 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN112652311B (en) * 2020-12-01 2021-09-03 北京百度网讯科技有限公司 Chinese and English mixed speech recognition method and device, electronic equipment and storage medium
US11893977B2 (en) 2020-12-01 2024-02-06 Beijing Baidu Netcom Science Technology Co., Ltd. Method for recognizing Chinese-English mixed speech, electronic device, and storage medium
CN112652311A (en) * 2020-12-01 2021-04-13 北京百度网讯科技有限公司 Chinese and English mixed speech recognition method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107910004A (en) Voiced translation processing method and processing device
CN111128126B (en) Multi-language intelligent voice conversation method and system
CN105096941B (en) Audio recognition method and device
US8914294B2 (en) System and method of providing an automated data-collection in spoken dialog systems
CN102723078B (en) Emotion speech recognition method based on natural language comprehension
Bell et al. Prosodic adaptation in human-computer interaction
Hjalmarsson The additive effect of turn-taking cues in human and synthetic voice
CN107680597A (en) Audio recognition method, device, equipment and computer-readable recording medium
CN103903627A (en) Voice-data transmission method and device
CN108831439A (en) Voice recognition method, device, equipment and system
CN110970018B (en) Speech recognition method and device
CN102982811A (en) Voice endpoint detection method based on real-time decoding
CN103003876A (en) Modification of speech quality in conversations over voice channels
CN102903361A (en) Instant call translation system and instant call translation method
CN103065620A (en) Method with which text input by user is received on mobile phone or webpage and synthetized to personalized voice in real time
CN108986798B (en) Processing method, device and the equipment of voice data
CN104766608A (en) Voice control method and voice control device
CN106875936A (en) Voice recognition method and device
CN112131359A (en) Intention identification method based on graphical arrangement intelligent strategy and electronic equipment
CN1714390B (en) Speech recognition device and method
CN114818649A (en) Service consultation processing method and device based on intelligent voice interaction technology
CN107886940A (en) Voiced translation processing method and processing device
CN102056093A (en) Method for converting text message into voice message
CN108364655A (en) Method of speech processing, medium, device and computing device
CN112216270B (en) Speech phoneme recognition method and system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180413

RJ01 Rejection of invention patent application after publication