CN109979461A - Voice translation method and device - Google Patents
Voice translation method and device
- Publication number
- CN109979461A (application CN201910199082.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- translation
- target speech
- word
- translated text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
This application discloses a speech translation method and device. After the target speech to be translated is obtained, both the recognized text of the target speech and the target speech itself are translated jointly as translation objects to obtain the final translated text of the target speech. Compared with the prior art, in which the target speech is first recognized to obtain a recognized text and only that recognized text is translated, the translation objects in this application are richer: they include two translation objects, the recognized text and the target speech. By translating both objects, a more accurate translated text of the target speech can be determined.
Description
Technical field
This application relates to the technical field of speech translation, and in particular to a speech translation method and device.
Background technique
Existing speech translation methods generally include two steps: speech recognition and text translation. Specifically, a segment of speech is first converted by speech recognition technology into text of the same language, and then the recognized text is translated into text of another language by text translation technology, thereby completing the speech translation process.

However, combining speech recognition with text translation suffers from error accumulation. For example, suppose a word is misrecognized by the speech recognition system; when that word is then translated by the text translation system, a wrong translation will be produced from the wrong word. Thus, errors from the speech recognition stage accumulate into the text translation stage, making the translation result inaccurate.
Summary of the invention
The main purpose of the embodiments of this application is to provide a speech translation method and device capable of improving the accuracy of speech translation results.
An embodiment of this application provides a speech translation method, comprising:

obtaining the target speech to be translated;

translating a first translation object and a second translation object to obtain the final translated text of the target speech, where the first translation object is the recognized text of the target speech, and the second translation object is the target speech itself.
Optionally, translating the first translation object and the second translation object to obtain the final translated text of the target speech comprises:

generating a first probability distribution and a second probability distribution corresponding to the k-th word in the final translated text, where the first probability distribution includes the first decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the recognized text of the target speech, and the second probability distribution includes the second decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the target speech directly;

obtaining the translation result of the k-th word according to the first probability distribution and the second probability distribution.
Optionally, obtaining the translation result of the k-th word according to the first probability distribution and the second probability distribution comprises:

fusing, for each identical candidate word in the first probability distribution and the second probability distribution, its first decoding probability with its second decoding probability, to obtain the fused decoding probability of each candidate word being the k-th word in the final translated text;

selecting the candidate word with the largest fused decoding probability as the translation result of the k-th word.
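The fusion step above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the linear interpolation weight `alpha` and the toy three-word vocabulary are assumptions of this example; the patent only requires that the two decoding probabilities of each identical candidate word be fused and the argmax taken.

```python
def fuse_and_pick(p_text, p_speech, alpha=0.5):
    """p_text / p_speech map each candidate word to its decoding probability
    for the k-th position; returns (best_word, fused distribution)."""
    fused = {w: alpha * p_text.get(w, 0.0) + (1 - alpha) * p_speech.get(w, 0.0)
             for w in set(p_text) | set(p_speech)}
    best_word = max(fused, key=fused.get)  # candidate with largest fused probability
    return best_word, fused

p_from_text   = {"cat": 0.6, "hat": 0.3, "mat": 0.1}  # from decoding the recognized text
p_from_speech = {"cat": 0.2, "hat": 0.7, "mat": 0.1}  # from decoding the speech directly
word, fused = fuse_and_pick(p_from_text, p_from_speech)
```

Here the two sources disagree on the top word, and the fusion resolves the tie using both; with equal weights the fused distribution still sums to one.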
Optionally, translating the first translation object and the second translation object to obtain the final translated text of the target speech comprises:

translating the recognized text of the target speech to obtain a first translated text;

translating the target speech directly to obtain a second translated text;

obtaining the final translated text of the target speech according to the first translated text and the second translated text.
Optionally, obtaining the final translated text of the target speech according to the first translated text and the second translated text comprises:

determining the confidence of the first translated text as the final translated text of the target speech;

determining the confidence of the second translated text as the final translated text of the target speech;

selecting the translated text with the larger confidence as the final translated text of the target speech.
Optionally, determining the confidence of the first translated text as the final translated text of the target speech comprises:

obtaining the decoding probability corresponding to each text unit of the first translated text, where the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;

determining, according to the decoding probability corresponding to each text unit of the first translated text, the confidence of the first translated text as the final translated text of the target speech.
Optionally, determining the confidence of the second translated text as the final translated text of the target speech comprises:

obtaining the decoding probability corresponding to each text unit of the second translated text, where the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;

determining, according to the decoding probability corresponding to each text unit of the second translated text, the confidence of the second translated text as the final translated text of the target speech.
Optionally, translating the first translation object and the second translation object comprises:

recognizing the target speech using a pre-built speech recognition model to obtain the recognized text;

translating the recognized text using a pre-built text translation model;

translating the target speech using a pre-built speech translation model;

where the speech translation model and the speech recognition model share, or do not share, part of their model parameters.
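The three-model arrangement above can be sketched as a small pipeline. The model functions below are stubs for illustration only; the actual models are the neural networks described in the detailed embodiments, and the stubbed strings are placeholder values, not outputs specified by the patent.

```python
def recognize(speech):            # stands in for the pre-built speech recognition model
    return "bonjour le monde"     # recognized text (stubbed)

def translate_text(text):         # stands in for the pre-built text translation model
    return "hello the world"      # first translated text (stubbed)

def translate_speech(speech):     # stands in for the pre-built speech translation model
    return "hello world"          # second translated text (stubbed)

def speech_translation(speech):
    recognized = recognize(speech)        # step A1: speech -> recognized text
    first  = translate_text(recognized)   # step A2: recognized text -> first translation
    second = translate_speech(speech)     # step A3: speech -> second translation, directly
    return first, second                  # later fused or selected by confidence

first, second = speech_translation(b"\x00\x01")  # dummy audio bytes
```

The point of the structure is that both translation paths run from the same input, so a recognition error in step A1 cannot corrupt the direct path of step A3.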
An embodiment of this application also provides a speech translation device, comprising:

a target speech obtaining unit, configured to obtain the target speech to be translated;

a translated text obtaining unit, configured to translate a first translation object and a second translation object to obtain the final translated text of the target speech, where the first translation object is the recognized text of the target speech and the second translation object is the target speech itself.
Optionally, the translated text obtaining unit includes:

a probability distribution generating subunit, configured to generate a first probability distribution and a second probability distribution corresponding to the k-th word in the final translated text, where the first probability distribution includes the first decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the recognized text of the target speech, and the second probability distribution includes the second decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the target speech directly;

a translation result obtaining subunit, configured to obtain the translation result of the k-th word according to the first probability distribution and the second probability distribution.
Optionally, the translation result obtaining subunit includes:

a fused decoding probability obtaining subunit, configured to fuse, in the first probability distribution and the second probability distribution, the first decoding probability and the second decoding probability corresponding to each identical candidate word, to obtain the fused decoding probability of each candidate word being the k-th word in the final translated text;

a first translation result obtaining subunit, configured to select the candidate word with the largest fused decoding probability as the translation result of the k-th word.
Optionally, the translated text obtaining unit includes:

a first translated text obtaining subunit, configured to translate the recognized text of the target speech to obtain a first translated text;

a second translated text obtaining subunit, configured to translate the target speech directly to obtain a second translated text;

a final translated text obtaining subunit, configured to obtain the final translated text of the target speech according to the first translated text and the second translated text.
Optionally, the final translated text obtaining subunit includes:

a first confidence determining subunit, configured to determine the confidence of the first translated text as the final translated text of the target speech;

a second confidence determining subunit, configured to determine the confidence of the second translated text as the final translated text of the target speech;

a second translation result obtaining subunit, configured to select the translated text with the larger confidence as the final translated text of the target speech.
Optionally, the first confidence determining subunit includes:

a first decoding probability obtaining subunit, configured to obtain the decoding probability corresponding to each text unit of the first translated text, where the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;

a first confidence obtaining subunit, configured to determine, according to the decoding probability corresponding to each text unit of the first translated text, the confidence of the first translated text as the final translated text of the target speech.
Optionally, the second confidence determining subunit includes:

a second decoding probability obtaining subunit, configured to obtain the decoding probability corresponding to each text unit of the second translated text, where the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;

a second confidence obtaining subunit, configured to determine, according to the decoding probability corresponding to each text unit of the second translated text, the confidence of the second translated text as the final translated text of the target speech.
Optionally, the translated text obtaining unit includes:

a text recognition subunit, configured to recognize the target speech using a pre-built speech recognition model to obtain the recognized text;

a text translation subunit, configured to translate the recognized text using a pre-built text translation model;

a speech translation subunit, configured to translate the target speech using a pre-built speech translation model;

where the speech translation model and the speech recognition model share, or do not share, part of their model parameters.
An embodiment of this application also provides a speech translation device, comprising a processor, a memory, and a system bus. The processor and the memory are connected by the system bus. The memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to execute any one of the implementations of the above speech translation method.
An embodiment of this application also provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to execute any one of the implementations of the above speech translation method.
An embodiment of this application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute any one of the implementations of the above speech translation method.
With the speech translation method and device provided by the embodiments of this application, after the target speech to be translated is obtained, both the recognized text of the target speech and the target speech itself are translated jointly as translation objects to obtain the final translated text of the target speech. Compared with the prior art, in which the target speech is first recognized to obtain a recognized text and only that recognized text is translated, the translation objects in this application are richer, including two translation objects: the recognized text and the target speech. By translating both objects, a more accurate translated text of the target speech can be determined.
Detailed description of the invention
In order to illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, to embodiment or will show below
There is attached drawing needed in technical description to be briefly described, it should be apparent that, the accompanying drawings in the following description is the application
Some embodiments for those of ordinary skill in the art without creative efforts, can also basis
These attached drawings obtain other attached drawings.
Fig. 1 is a schematic flowchart of a speech translation method provided by an embodiment of this application;
Fig. 2 is a first schematic structural diagram of a speech translation model and a speech recognition model provided by an embodiment of this application;
Fig. 3 is a second schematic structural diagram of the speech translation model and the speech recognition model provided by an embodiment of this application;
Fig. 4 is a schematic structural diagram of the speech translation model, the speech recognition model, and a text translation model provided by an embodiment of this application;
Fig. 5 is a schematic flowchart of obtaining the translation result of the k-th word according to the first probability distribution and the second probability distribution, provided by an embodiment of this application;
Fig. 6 is a schematic flowchart of obtaining the final translated text of the target speech according to the first translated text and the second translated text, provided by an embodiment of this application;
Fig. 7 is a schematic composition diagram of a speech translation device provided by an embodiment of this application.
Specific embodiment
To keep the purposes, technical schemes and advantages of the embodiment of the present application clearer, below in conjunction with the embodiment of the present application
In attached drawing, the technical scheme in the embodiment of the application is clearly and completely described, it is clear that described embodiment is
Some embodiments of the present application, instead of all the embodiments.Based on the embodiment in the application, those of ordinary skill in the art
Every other embodiment obtained without making creative work, shall fall in the protection scope of this application.
First embodiment
Referring to Fig. 1, which is a schematic flowchart of the speech translation method provided in this embodiment, the method includes the following steps:

S101: Obtain the target speech to be translated.

In this embodiment, any speech to be translated by this embodiment is defined as the target speech. This embodiment does not limit the language of the target speech; for example, the target speech may be Chinese speech, English speech, etc. Likewise, this embodiment does not limit the length of the target speech; for example, the target speech may be one sentence or several sentences.

It should be understood that the target speech can be obtained as needed by means such as recording; for example, telephone calls or meeting recordings in daily life can serve as the target speech. After the target speech is obtained, its translation can be realized by this embodiment.
S102: Translate the first translation object and the second translation object to obtain the final translated text of the target speech, where the first translation object is the recognized text of the target speech and the second translation object is the target speech itself.

In this embodiment, to determine a more accurate translated text of the target speech, both the recognized text of the target speech and the target speech itself can be translated as translation objects; since the translation objects are richer, a more accurate translated text can be obtained as the final translated text of the target speech.

Specifically, after the target speech to be translated is obtained in step S101, speech recognition can be performed on it to obtain the corresponding recognized text, which serves as the first translation object; the first translation object (i.e., the recognized text) is then translated to obtain intermediate data of the translation process or the translated text of the first translation object. Similarly, after the target speech is obtained in step S101, the target speech itself can serve as the second translation object; the second translation object (i.e., the target speech) is then translated directly (without speech recognition) to obtain intermediate data of the translation process or the translated text of the second translation object.
In one implementation of this embodiment, step S102 may include: translating the first translation object and the second translation object to obtain intermediate data of the corresponding translation processes, and obtaining the final translated text of the target speech based on this intermediate data.

In this implementation, the intermediate data may be probability distribution data. Specifically, by translating the first translation object (i.e., the recognized text of the target speech) and the second translation object (i.e., the target speech) in step S102, a first probability distribution and a second probability distribution corresponding to the k-th word in the final translated text can be generated, where the first probability distribution includes the first decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the recognized text of the target speech, and the second probability distribution includes the second decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the target speech directly. Note that a detailed introduction of the first and second probability distributions corresponding to the k-th word, as well as the specific implementation of obtaining the translation result of the k-th word on their basis, will be given in the second embodiment.
In another implementation of this embodiment, step S102 may include: translating the first translation object and the second translation object to obtain the corresponding translated texts, and obtaining the final translated text of the target speech based on these two translated texts.

In this implementation, translating the first translation object (i.e., the recognized text of the target speech) in step S102 yields a first translated text, while directly translating the second translation object (i.e., the target speech) yields a second translated text. Note that a detailed introduction of the first and second translated texts, as well as the specific implementation of obtaining the final translated text of the target speech on their basis, will be given in the third embodiment.
Further, step S102 can be realized with three models, specifically through the following steps A1-A3:
Step A1: Recognize the target speech using a pre-built speech recognition model to obtain the recognized text.

In this implementation, after the target speech to be translated is obtained in step S101, the pre-built speech recognition model shown on the right side of Fig. 2 can be used to recognize it and obtain the recognized text. The speech recognition model includes an encoder, an attention layer (Attention), and a recognition decoder; through this model, speech recognition can be performed on the target speech, for example, recognizing Chinese target speech as Chinese recognized text of the same language.
Specifically, in this embodiment, an optional implementation is that the pre-built speech recognition model adopts the network structure shown on the right side of Fig. 3. Next, taking this speech recognition model as an example, the process of recognizing the target speech with it is introduced:
(1) Input the audio features of the target speech

First, audio feature extraction is performed on the target speech to be translated; for example, the Mel filterbank features (Mel Bank Features) of the target speech can be extracted as its audio features. These audio features can be represented as a feature vector, defined here as x_{1...T}, where T denotes the dimension of the audio feature vector of the target speech, i.e., the number of elements it contains. Then, x_{1...T} is fed as input data into the speech recognition model shown on the right side of Fig. 3.
(2) Generate the coding vector corresponding to the audio features of the target speech

As shown in Fig. 3, the encoding part of the speech recognition model includes two layers of convolutional neural networks (Convolutional Neural Networks, CNN) with max pooling layers (MaxPooling), one layer of convolutional long short-term memory network (convolutional Long Short-Term Memory, convolutional LSTM), and three layers of bidirectional long short-term memory network (Bi-directional Long Short-Term Memory, BiLSTM).
After the audio features x_{1...T} of the target speech are input in step (1), they can be encoded by one CNN layer and then down-sampled by MaxPooling; this operation is repeated by another CNN layer and MaxPooling, yielding a coding vector of length L. One layer of convolutional LSTM and three layers of BiLSTM are then used to process this coding vector to obtain the final coding vector, defined as h_{1...L}, where L denotes the dimension of the coding vector obtained by encoding the audio features of the target speech, i.e., the number of elements it contains. The specific calculation formula for h_{1...L} is as follows:

h_{1...L} = enc(W_enc · x_{1...T})  (1)

where enc denotes the entire encoding calculation process of the model's encoding part, and W_enc denotes all the network parameters of each layer of the encoding part.
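The relation between T and L in the encoding part above can be illustrated with a small sketch. This assumes each of the two MaxPooling stages halves the time axis (a common default pool size); the patent does not state the pooling stride, so L = T // 4 here is an assumption of this example.

```python
def pooled_length(t_frames, num_pool_stages=2, pool_size=2):
    """Length of the encoded sequence after repeated max pooling,
    mirroring the two CNN + MaxPooling stages of the encoder."""
    length = t_frames
    for _ in range(num_pool_stages):
        length = length // pool_size  # each MaxPooling shrinks the time axis
    return length

L = pooled_length(400)  # e.g. 400 input feature frames -> 100 encoder steps
```

The down-sampling is what makes L shorter than T before the convolutional LSTM and BiLSTM layers run, reducing their sequence length and cost.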
(3) Generate the decoded vector corresponding to the coding vector

As shown in Fig. 3, the decoding part of the speech recognition model includes four layers of unidirectional long short-term memory network (Long Short-Term Memory, LSTM) and a softmax classifier.

After the audio features of the target speech are encoded into the coding vector by the model's encoding part in step (2), an attention operation can first be performed on the coding vector, so as to attend to the data in the coding vector that is relevant to generating the decoded vector. The result is then decoded by the four LSTM layers and the softmax classifier to obtain the corresponding decoded vector, which is used to generate the recognized text of the target speech, defined as z_{1...N}, where N denotes the number of characters (or words) contained in the recognized text.
The specific calculation formulas of the decoding part are as follows:

c_k = att(s_k, h_{1...L})  (2)
s_k = lstm(z_{k-1}, s_{k-1}, c_{k-1})  (3)
z_k = softmax(W_z[s_k, c_k] + b_z)  (4)

where h_{1...L} denotes the coding vector corresponding to the audio features of the target speech; c_k denotes the k-th attention calculation result; att denotes the attention calculation process; c_{k-1} denotes the (k-1)-th attention calculation result; s_k denotes the k-th hidden-layer vector output by the four-layer LSTM network of the decoding part; lstm denotes the calculation process of that four-layer LSTM network; s_{k-1} denotes the (k-1)-th hidden-layer vector output by the four-layer LSTM network of the decoding part; z_k denotes the k-th character (or word) contained in the recognized text; z_{k-1} denotes the (k-1)-th character (or word) contained in the recognized text; and W_z and b_z denote the model parameters in the softmax classifier.
If W_asr denotes all the network parameters of each layer of the model's decoding part, then the recognized text z_{1...N} of the target speech output by the model is calculated as follows:

z_{1...N} = dec(W_asr · h_{1...L})  (5)

where dec denotes the entire decoding calculation process of the model's decoding part; W_asr denotes all the network parameters of each layer of the decoding part; and h_{1...L} denotes the coding vector corresponding to the audio features of the target speech.
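The attention step of formula (2) can be sketched with toy vectors. The patent does not specify the attention variant, so plain dot-product scoring with a softmax is an assumption of this example; it shows how a context c_k is formed as a weighted sum of the encoder outputs h_1...h_L given a decoder state s_k.

```python
import math

def att(s_k, h):
    """s_k: decoder hidden vector; h: list of encoder output vectors.
    Returns the attention-weighted sum of the encoder vectors (c_k)."""
    scores = [sum(si * hi for si, hi in zip(s_k, h_l)) for h_l in h]  # dot products
    m = max(scores)
    exp = [math.exp(sc - m) for sc in scores]        # numerically stable softmax
    total = sum(exp)
    weights = [e / total for e in exp]
    dim = len(h[0])
    return [sum(w * h_l[d] for w, h_l in zip(weights, h)) for d in range(dim)]

# Decoder state aligned with the first encoder vector attends mostly to it.
c_k = att([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

Because the weights sum to one, c_k stays in the span of the encoder outputs, and the encoder position most similar to s_k dominates the mixture.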
It should be noted that the network structures of the encoder and decoder in the speech recognition model shown on the right side of Fig. 2 are not unique, and the network structure shown on the right side of Fig. 3 is only one example; other network structures or numbers of layers may also be adopted. For example, the model's encoder may instead use a recurrent neural network (Recurrent Neural Network, RNN) for encoding, and the number of network layers may be set according to the actual situation; the embodiments of this application do not limit this. The numbers of CNN, BiLSTM, and other layers mentioned above or below are only examples; this application does not limit them, and they may be the numbers mentioned in the embodiments or other numbers.
Step A2: Translate the recognized text of the target speech using a pre-built text translation model.

In this implementation, after the recognized text of the target speech is obtained in step A1, the pre-built text translation model shown at the top of Fig. 4 can be used to translate it, obtaining intermediate data of the translation process or the translated text corresponding to the recognized text.

The text translation model includes a text encoder, an attention layer (Attention), and a text decoder, with the text encoder connected to the recognition decoder of the speech recognition model, as shown in Fig. 4.
Next, the process of translating the recognized text with the text translation model is introduced:

(1) Input the recognized text of the target speech

As shown in Fig. 4, the recognized text z_{1...N} of the target speech obtained in step A1 (which may be in vector form) is first taken as input data and fed into the text encoder of the text translation model.
(2) Generate the coding vector corresponding to the recognized text of the target speech

In this embodiment, the text encoder of the text translation model may consist of a BiLSTM. After the recognized text z_{1...N} of the target speech is input in step (1), it can be encoded by the BiLSTM to obtain the corresponding coding vector, defined as s_{1...N}, calculated as follows:

s_{1...N} = enc(U_enc · z_{1...N})  (6)

where enc denotes the entire encoding calculation process of the text translation model's encoding part, and U_enc denotes all the network parameters of that encoding part.
(3) Decode to obtain the intermediate data or the first translated text corresponding to the recognized text

In this embodiment, the text decoder of the text translation model may include a unidirectional LSTM and a softmax classifier. After the coding vector s_{1...N} is generated in step (2), an attention operation can first be performed on it, so as to attend to the data in the coding vector that is relevant to generating the decoding result; the result is then decoded by the unidirectional LSTM and softmax classifier to obtain the intermediate data of the translation process or the first translated text corresponding to the recognized text.
It should be noted that the network composition of the encoder and decoder in the text translation model is not unique; the model network structure introduced above is only one example, and other network structures or numbers of layers may also be adopted. For example, the model's encoder may instead use an RNN for encoding, and the number of network layers may be set according to the actual situation; the embodiments of this application do not limit this.
Step A3: using the voiced translation model constructed in advance, target voice is translated.
In the present embodiment, which generates in translation process for directly being translated to target voice
Intermediate data or the corresponding cypher text of the target voice.The voiced translation model can be with the language introduced in above-mentioned steps A1
Sound identification model is shared or does not share department pattern parameter.
When the speech translation model shares part of its model parameters with the speech recognition model introduced in step A1, one optional implementation is that the network structure of the speech translation model can be as shown in the left part of Fig. 2: it shares one encoder with the speech recognition model, and the speech translation model then includes one translation decoder. It should be noted that in Fig. 2 the network structures of the recognition decoder of the speech recognition model and the translation decoder of the speech translation model may be identical or different; their specific composition can be set according to the actual situation, and this embodiment of the present application does not limit this.
In this embodiment, one optional implementation is that the speech translation model and the speech recognition model can use the network structure shown in Fig. 3. Based on this network structure, the detailed process of directly translating the target voice is as follows:
(1) Inputting the audio features of the target voice
Audio feature extraction is first performed on the target voice; for example, the Mel spectrum features of the target voice can be extracted as its audio features. This feature vector is defined as x_{1...T}; x_{1...T} is then used as input data and fed into the encoder shown in Fig. 3.
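To make the framing step concrete, here is a deliberately simplified stand-in for the feature extraction above, assuming 25 ms windows with a 10 ms hop at 16 kHz (common choices, not specified by this application): it computes one log-energy value per frame instead of a full Mel filter-bank output, which a real system would produce.

```python
import math

def frame_log_energy(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    # Split the waveform into overlapping frames and compute a per-frame
    # log energy -- a simplified stand-in for the Mel spectrum features
    # x_{1...T}; a real front end would apply a Mel filter bank to each
    # frame's FFT magnitudes.
    win = int(sample_rate * win_ms / 1000)   # 400 samples per window
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples per hop
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        energy = sum(x * x for x in frame)
        feats.append(math.log(energy + 1e-10))
    return feats

# 0.1 s of a toy 440 Hz tone sampled at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
features = frame_log_energy(tone)
```

With these parameters, 0.1 s of audio yields 8 frames, i.e. T = 8 feature vectors (here scalars) entering the encoder.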
(2) Generating the coding vector corresponding to the audio features of the target voice
After the audio features x_{1...T} input in step (1) above are encoded by the encoder shown in Fig. 3, the final coding vector h_{1...L} is obtained, where L denotes the dimension of the coding vector obtained after encoding the audio features of the target voice, i.e., the number of vector elements the coding vector contains. The specific calculation formula for h_{1...L} is the above formula (1), namely:
h_{1...L} = enc(W_enc · x_{1...T})
where enc denotes the entire encoding calculation process of the encoder in Fig. 3, and W_enc denotes all network parameters of each network layer in the encoder in Fig. 3.
(3) Decoding to obtain intermediate data or the second translated text corresponding to the target voice
As shown in Fig. 3, assume that the recognition decoder of the speech recognition model and the translation decoder of the speech translation model have the same network structure, each comprising 4 LSTM layers and a softmax classifier, but that the training parameters of the two are not shared. After the coding vector h_{1...L} is obtained in step (2) above by encoding the audio features of the target voice with the encoder part of the model, as shown in Fig. 3, an attention operation can first be performed on the coding vector, and the attention result is then decoded by the 4 LSTM layers and the softmax classifier in the translation decoder, yielding either intermediate data of the translation process or the second translated text corresponding to the target voice.
It should be noted that the way the speech translation model and the speech recognition model share encoder parameters, shown in Fig. 2, is not unique; it is merely one example, and other parameter-sharing schemes may also be adopted.
In addition, the speech translation model and the speech recognition model may also share no model parameters at all; in that case they are two separate models, whose network structures may be identical or different and whose specific composition can each be set according to the actual situation; this embodiment of the present application does not limit this.
It should be noted that this embodiment does not limit the execution order of A1-A2 (A2 is executed after A1) and A3: A3 may be executed after A1-A2, A1-A2 may be executed after A3, or A1-A2 and A3 may be executed simultaneously.
Further, since the "integration module" shown in Fig. 4 is connected respectively to the translation decoder of the speech translation model and the text decoder of the text translation model, the "integration module" shown in Fig. 4 can be used to determine the final translated text of the target voice according to the intermediate data of the translation process, or the respective translated texts, obtained by translating the first translation object and the second translation object.
Specifically, in one implementation of this embodiment, the first probability distribution corresponding to the k-th word in the final translated text output by the text decoder and the second probability distribution corresponding to the k-th word in the final translated text output by the translation decoder can be separately input into the "integration module"; the "integration module" fuses the first probability distribution and the second probability distribution and determines the translation result of the k-th word in the final translated text according to the fused probability distribution. For details, refer to the second embodiment.
In another implementation of this embodiment, the first translated text output by the text decoder and the second translated text output by the translation decoder can be separately input into the "integration module"; the "integration module" compares the two translated texts and determines a more accurate translated text of the target voice according to the comparison result. For details, refer to the third embodiment.
In summary, in the voice translation method provided by this embodiment, after the target voice to be translated is obtained, both the identification text of the target voice and the target voice itself are translated as translation objects to obtain the final translated text of the target voice. Compared with the prior-art approach of first recognizing the target voice to obtain an identification text and then translating only that identification text, the translation objects in the present application are richer, comprising two translation objects: the identification text and the target voice. Therefore, by translating these two translation objects, a more accurate translated text of the target voice can be determined.
Second embodiment
In this embodiment, by translating the first translation object (i.e., the identification text of the target voice) in step S102 of the first embodiment above, the first probability distribution corresponding to the k-th word in the final translated text of the target voice can be generated; this first probability distribution can be defined as P_text(y_k), where y_k refers to the k-th word in the final translated text of the target voice.
The first probability distribution P_text(y_k) may include the first decoding probabilities of the k-th word y_k, obtained after decoding the identification text of the target voice, being each candidate word in the vocabulary. The larger the value of a first decoding probability, the more likely it is that the k-th word y_k obtained by decoding the identification text of the target voice is the corresponding candidate word.
With the network structure shown in Fig. 4, the calculation formula of the first probability distribution P_text(y_k) corresponding to the k-th word y_k output by the text translation model after decoding the identification text of the target voice is as follows:
P_text(y_k) = softmax(dec(U_dec · s_{1...N}))    (7)
where dec denotes the entire decoding calculation process of the decoder part of the text translation model; U_dec denotes all network parameters of the decoder part of the text translation model; s_{1...N} denotes the coding vector corresponding to the identification text of the target voice; and P_text(y_k) denotes the first decoding probabilities of the k-th word y_k, obtained after decoding the identification text of the target voice, being each candidate word in the vocabulary.
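Formula (7) amounts to a linear projection of the decoder state onto vocabulary logits followed by a softmax. The sketch below illustrates one such decoding step in pure Python with a toy 3-word vocabulary and made-up weights (all names and values are hypothetical, not parameters of the actual model):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def decode_step(U_dec, state):
    # One decoding step: project the decoder state onto vocabulary
    # logits (dec(U_dec * s) in formula (7)), then normalize with
    # softmax to obtain the probability distribution P_text(y_k).
    logits = [sum(u * x for u, x in zip(row, state)) for row in U_dec]
    return softmax(logits)

vocab = ["system", "table", "box"]             # toy 3-word vocabulary
U_dec = [[2.0, 1.0], [0.5, 0.5], [0.1, 0.2]]   # toy projection weights
state = [1.0, 1.0]                             # toy decoder state
p_text = decode_step(U_dec, state)
best = vocab[p_text.index(max(p_text))]
```

The resulting list is a valid probability distribution over the vocabulary; the highest-probability candidate word here is "system".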
Similarly, in this embodiment, by translating the second translation object (i.e., the target voice) in step S102 of the first embodiment above, the second probability distribution corresponding to the k-th word in the final translated text can be generated; this second probability distribution can be defined as P_trans(y_k), where y_k refers to the k-th word in the final translated text of the target voice.
The second probability distribution P_trans(y_k) may include the second decoding probabilities of the k-th word y_k, obtained after decoding the target voice, being each candidate word in the vocabulary. The larger the value of a second decoding probability, the more likely it is that the k-th word y_k obtained by decoding the target voice is the corresponding candidate word.
With the network structure shown in Fig. 4, the calculation formula of the second probability distribution P_trans(y_k) corresponding to the k-th word y_k output by the speech translation model after decoding the target voice is as follows:
P_trans(y_k) = softmax(dec(W_dec · h_{1...L}))    (8)
where dec denotes the entire decoding calculation process of the decoder part of the speech translation model; W_dec denotes all network parameters of the decoder part of the speech translation model; h_{1...L} denotes the coding vector corresponding to the audio features of the target voice; and P_trans(y_k) denotes the second decoding probabilities of the k-th word y_k, obtained after decoding the target voice, being each candidate word in the vocabulary.
On this basis, the translation result of the k-th word can further be obtained according to the generated first probability distribution P_text(y_k) and second probability distribution P_trans(y_k) corresponding to the k-th word in the final translated text.
Next, this embodiment will introduce the specific implementation process of "obtaining the translation result of the k-th word according to the generated first probability distribution P_text(y_k) and second probability distribution P_trans(y_k) corresponding to the k-th word in the final translated text".
Referring to Fig. 5, it shows a schematic flowchart, provided by this embodiment, of obtaining the translation result of the k-th word according to the first probability distribution and the second probability distribution. The process includes the following steps:
S501: In the first probability distribution and the second probability distribution, fuse the first decoding probability and the second decoding probability corresponding to the same candidate word, to obtain the fused decoding probability of the k-th word in the final translated text being each candidate word.
In this embodiment, suppose that after the first translation object (i.e., the identification text of the target voice) is translated, the first probability distribution P_text(y_k) corresponding to the k-th word y_k in the final translated text has been generated; that is, P_text(y_k) includes the first decoding probabilities of the k-th word y_k, obtained after decoding the identification text of the target voice, being each candidate word in the vocabulary. Suppose also that after the second translation object (i.e., the target voice) is translated, the second probability distribution P_trans(y_k) corresponding to the k-th word y_k in the final translated text has been generated; that is, P_trans(y_k) includes the second decoding probabilities of the k-th word y_k, obtained after decoding the target voice, being each candidate word in the vocabulary.
Then, further, the "integration module" shown in Fig. 4 can be used to perform "decoding probability fusion" on the first probability distribution P_text(y_k) and the second probability distribution P_trans(y_k); that is, the decoding probabilities corresponding to the same candidate word in P_text(y_k) and P_trans(y_k) are fused, to obtain the fused decoding probability of the k-th word in the final translated text being each candidate word. These fused decoding probabilities form the fused probability distribution, defined as P_ensemble(y_k).
For example, assume that the candidate words are the 10000 words included in a certain English vocabulary, and that these 10000 words contain the word "system". Assume the first probability distribution P_text(y_k) includes a first decoding probability of 0.87 for the k-th word y_k, obtained after decoding the identification text of the target voice, being the word "system", and the second probability distribution P_trans(y_k) includes a second decoding probability of 0.76 for the k-th word y_k, obtained after decoding the target voice, being the word "system". Then the two decoding probability values 0.87 and 0.76 corresponding to the word "system" can be fused to obtain the fused decoding probability corresponding to the word "system", which characterizes how likely it is that the k-th word y_k in the final translated text is the word "system".
In this embodiment, the specific calculation formula of the fused probability distribution corresponding to the k-th word y_k in the final translated text is as follows:
P_ensemble(y_k) = α·P_trans(y_k) + (1-α)·P_text(y_k)    (9)
where α denotes the fusion weight of the decoding probabilities, which can be obtained by experiment or experience; P_ensemble(y_k) denotes the fused decoding probabilities of the k-th word y_k in the final translated text being each candidate word in the vocabulary; P_text(y_k) denotes the first decoding probabilities of the k-th word y_k, obtained after decoding the identification text of the target voice, being each candidate word in the vocabulary, i.e., the first probability distribution; and P_trans(y_k) denotes the second decoding probabilities of the k-th word y_k, obtained after decoding the target voice, being each candidate word in the vocabulary, i.e., the second probability distribution.
For example, based on the example above, assume that the first decoding probability of the k-th word y_k obtained after decoding the identification text of the target voice being the word "system" is 0.87, that the second decoding probability of the k-th word y_k obtained after decoding the target voice being the word "system" is 0.76, and that the value of α determined by experiment is 0.6. Then, by the above formula (9), the fused decoding probability of the k-th word y_k in the final translated text being the word "system" can be calculated as 0.804, that is, 0.6*0.76 + (1-0.6)*0.87 = 0.804. In this way, the fused decoding probabilities of the k-th word y_k in the final translated text being the other words can also be calculated, forming the fused probability distribution P_ensemble(y_k) corresponding to the k-th word y_k.
S502: Select the candidate word corresponding to the maximum fused decoding probability as the translation result of the k-th word.
In this embodiment, after the fused decoding probability of the k-th word y_k in the final translated text being each candidate word is obtained through step S501, the candidate word corresponding to the maximum fused decoding probability can be selected from them as the translation result of the k-th word.
For example, assume that the candidate words are the 10000 words included in a certain English vocabulary, such as "system", "table", "box", and so on. Then, for the k-th word in the final translated text, each of these 10000 words corresponds to one fused decoding probability, and the candidate word corresponding to the maximum fused decoding probability can be chosen; for example, the word "system" corresponding to a maximum fused decoding probability of 0.89 can be taken as the translation result of the k-th word.
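Steps S501-S502 can be sketched in a few lines of Python. The distributions below are toy 3-word stand-ins (their non-"system" values are invented for illustration), but the "system" entries and α = 0.6 match the worked example above, so the fused value reproduces 0.804:

```python
def fuse_and_select(p_text, p_trans, alpha=0.6):
    # Formula (9): P_ensemble = alpha * P_trans + (1 - alpha) * P_text,
    # applied per candidate word (S501); the candidate word with the
    # largest fused probability is the translation result for the k-th
    # position (S502).
    fused = {w: alpha * p_trans[w] + (1 - alpha) * p_text[w]
             for w in p_text}
    best = max(fused, key=fused.get)
    return best, fused

# Toy distributions over a 3-word vocabulary; the "system" entries
# are the values from the worked example in the text.
p_text = {"system": 0.87, "table": 0.08, "box": 0.05}
p_trans = {"system": 0.76, "table": 0.14, "box": 0.10}
word, fused = fuse_and_select(p_text, p_trans)
```

Because each input is a probability distribution, the fused values again sum to 1, and "system" is selected with fused probability 0.804.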
In summary, in this embodiment, by means of decoding probability fusion, the first probability distribution P_text(y_k) corresponding to the k-th word y_k obtained after decoding the identification text of the target voice and the second probability distribution P_trans(y_k) corresponding to the k-th word y_k obtained after decoding the target voice are fused, so that a more accurate fused decoding probability distribution corresponding to the k-th word y_k in the final translated text can be obtained, from which the candidate word corresponding to the maximum fused decoding probability is selected as the translation result of the k-th word. In this manner, the translation result of each word in the final translated text can be obtained in turn.
Third embodiment
In this embodiment, by translating the first translation object (i.e., the identification text of the target voice) in step S102 of the first embodiment above, the first translated text can be obtained.
The first translated text is a text in the target translation language, where K_1 denotes the number of characters (or words) included in the first translated text. For example, assume the target voice is Chinese speech and the target translation language is English, i.e., the target voice needs to be translated into an English text; then the first translated text is an English text, and K_1 denotes the number of words included in that English text.
One optional implementation is that, with the network structure shown in Fig. 4, the decoding vector obtained by the text decoder can be used to generate the first translated text of the target voice, where K_1 denotes the number of characters (or words) included in the first translated text.
If U_dec represents all network parameters of the text decoder of the text translation model in Fig. 4, then the first translated text output by the model (written here as y^(1)_{1...K_1}) is calculated as follows:
y^(1)_{1...K_1} = softmax(dec(U_dec · s_{1...N}))
where dec denotes the entire decoding calculation process of the text decoder of the text translation model in Fig. 4; U_dec denotes all network parameters of the text decoder of the text translation model in Fig. 4; and s_{1...N} denotes the coding vector corresponding to the identification text of the target voice.
In addition, in this embodiment, by directly translating the second translation object (i.e., the target voice) in step S102 of the first embodiment above, the second translated text can be obtained.
The second translated text is also a text in the target translation language, where K_2 denotes the number of characters (or words) included in the second translated text. For example, assume the target voice is Chinese speech and the target translation language is still English, i.e., the target voice still needs to be translated into an English text; then the second translated text is an English text, and K_2 denotes the number of words included in that English text.
One optional implementation is that, with the network structure shown in Fig. 4, the decoding vector obtained by the translation decoder can be used to generate the second translated text of the target voice, where K_2 denotes the number of characters (or words) included in the second translated text; for the specific calculation formulas of the decoding part, refer to the above formulas (2), (3), (4), which are not repeated here.
If W_dec represents all network parameters of the translation decoder of the speech translation model in Fig. 4, then the second translated text output by the model (written here as y^(2)_{1...K_2}) is calculated as follows:
y^(2)_{1...K_2} = softmax(dec(W_dec · h_{1...L}))
where dec denotes the entire decoding calculation process of the translation decoder of the speech translation model in Fig. 4; W_dec denotes all network parameters of the translation decoder of the speech translation model in Fig. 4; and h_{1...L} denotes the coding vector corresponding to the audio features of the target voice.
In addition, it should be noted that in this embodiment the numbers of characters (or words) included in the first translated text and the second translated text may be identical or different, i.e., either K_1 = K_2 or K_1 ≠ K_2 is possible, but the first translated text and the second translated text belong to the same language, e.g., both are Chinese texts or both are English texts.
On this basis, the final translated text of the target voice can further be obtained according to the generated first translated text and second translated text.
Next, this embodiment will introduce the specific implementation process of "obtaining the final translated text of the target voice according to the generated first translated text and second translated text".
Referring to Fig. 6, it shows a schematic flowchart, provided by this embodiment, of obtaining the final translated text of the target voice according to the first translated text and the second translated text. The process includes the following steps:
S601: Determine the confidence of the first translated text serving as the final translated text of the target voice.
In this embodiment, if the first translated text has been obtained by translating the first translation object (i.e., the identification text of the target voice), data processing can further be performed on the relevant data of the first translated text to determine the confidence of the first translated text serving as the final translated text of the target voice, and this confidence is defined as score_text.
In one implementation of this embodiment, S601 can specifically include the following steps B1-B2:
Step B1: Obtain the decoding probability corresponding to each text unit of the first translated text.
In this implementation, in order to determine the confidence score_text of the first translated text serving as the final translated text of the target voice, each text unit included in the first translated text first needs to be determined, where a text unit refers to a basic composition unit of the first translated text and differs with the language of the first translated text. For example, if the first translated text is a Chinese text, its text units can be characters and words; if the first translated text is an English text, its text units can be words, etc.
Then, the decoding probability, in its language, corresponding to each text unit included in the first translated text can be obtained, where a decoding probability characterizes how likely the corresponding text unit is to belong to the translation result; specifically, this decoding probability can be one decoding probability in the first probability distribution corresponding to the k-th word obtained after decoding the identification text of the target voice in the second embodiment. It can be understood that the larger the decoding probability, the more likely its corresponding text unit is to be the translation result of the k-th word; conversely, the smaller the decoding probability, the less likely its corresponding text unit is to be the translation result of the k-th word.
Step B2: Determine, according to the decoding probability corresponding to each text unit of the first translated text, the confidence of the first translated text serving as the final translated text of the target voice.
After the decoding probability corresponding to each text unit of the first translated text is obtained through step B1, each decoding probability can be further processed, so as to determine, according to the processing result, the confidence score_text of the first translated text serving as the final translated text of the target voice.
Specifically, one optional implementation is to first sum the decoding probabilities corresponding to the respective text units of the first translated text, and then divide the resulting sum by the total number K_1 of text units included in the first translated text, obtaining the average decoding probability of the text units, which represents the confidence score_text of the first translated text serving as the final translated text of the target voice.
For example, assume the first translated text is an English text comprising 6 words, and the decoding probabilities corresponding to the 1st through 6th words are 0.82, 0.78, 0.91, 0.85, 0.81 and 0.93 respectively. Summing these 6 decoding probabilities gives 5.1, that is, 0.82+0.78+0.91+0.85+0.81+0.93 = 5.1; dividing this sum 5.1 by the total number 6 of words included in the first translated text gives an average decoding probability of 0.85, that is, 5.1/6 = 0.85. This average decoding probability 0.85 can then be used to represent the confidence score_text of the above first translated text serving as the final translated text of the target voice, that is, score_text = 0.85.
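The average-probability confidence of steps B1-B2 is a one-line computation; the sketch below reproduces the worked example with the six per-word probabilities given in the text:

```python
def translation_confidence(decoding_probs):
    # Steps B1-B2: the confidence of a candidate translation is the
    # mean of the decoding probabilities of its text units.
    return sum(decoding_probs) / len(decoding_probs)

# The six per-word decoding probabilities from the worked example.
score_text = translation_confidence([0.82, 0.78, 0.91, 0.85, 0.81, 0.93])
```

For the example values this yields score_text = 0.85, matching the text; the same function applies unchanged to the second translated text in step C2 below.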
S602: Determine the confidence of the second translated text serving as the final translated text of the target voice.
In this embodiment, if the second translated text has been obtained by directly translating the second translation object (i.e., the target voice), data processing can further be performed on the relevant data of the second translated text to determine the confidence of the second translated text serving as the final translated text of the target voice, and this confidence is defined as score_trans.
In one implementation of this embodiment, S602 can specifically include the following steps C1-C2:
Step C1: Obtain the decoding probability corresponding to each text unit of the second translated text.
In this implementation, in order to determine the confidence score_trans of the second translated text serving as the final translated text of the target voice, each text unit included in the second translated text first needs to be determined, where a text unit refers to a basic composition unit of the second translated text and differs with the language of the second translated text. For example, if the second translated text is a Chinese text, its text units can be characters or words; if the second translated text is an English text, its text units can be words, etc.
Then, the decoding probability, in its language, corresponding to each text unit included in the second translated text can be obtained, where a decoding probability characterizes how likely the corresponding text unit is to belong to the translated text; specifically, this decoding probability can be one decoding probability in the second probability distribution corresponding to the k-th word obtained after decoding the target voice in the second embodiment. It can be understood that the larger the decoding probability, the more likely its corresponding text unit is to be the translation result of the k-th word; conversely, the smaller the decoding probability, the less likely its corresponding text unit is to be the translation result of the k-th word.
Step C2: Determine, according to the decoding probability corresponding to each text unit of the second translated text, the confidence of the second translated text serving as the final translated text of the target voice.
After the decoding probability corresponding to each text unit of the second translated text is obtained through step C1, each decoding probability can be further processed, so as to determine, according to the processing result, the confidence score_trans of the second translated text serving as the final translated text of the target voice.
Specifically, one optional implementation is to first sum the decoding probabilities corresponding to the respective text units of the second translated text, and then divide the resulting sum by the total number K_2 of text units included in the second translated text, obtaining the average decoding probability of the text units, which represents the confidence score_trans of the second translated text serving as the final translated text of the target voice.
For example, assume the second translated text is an English text comprising 8 words, and the decoding probabilities corresponding to the 1st through 8th words are 0.76, 0.78, 0.92, 0.72, 0.89, 0.91, 0.75 and 0.83 respectively. Summing these 8 decoding probabilities gives 6.56, that is, 0.76+0.78+0.92+0.72+0.89+0.91+0.75+0.83 = 6.56; dividing this sum 6.56 by the total number 8 of words included in the second translated text gives an average decoding probability of 0.82, that is, 6.56/8 = 0.82. This average decoding probability 0.82 can then be used to represent the confidence score_trans of the above second translated text serving as the final translated text of the target voice, that is, score_trans = 0.82.
S603: Select the translated text corresponding to the larger confidence as the final translated text of the target voice.
In this embodiment, after the confidence score_text of the first translated text serving as the final translated text of the target voice is determined through step S601, and the confidence score_trans of the second translated text serving as the final translated text of the target voice is determined through step S602, the translated text corresponding to the larger of score_text and score_trans can be selected as the final translated text of the target voice.
Specifically, if the value of score_text is greater than the value of score_trans, it indicates that each text unit in the first translated text is more likely to belong to the translation result, so the first translated text corresponding to score_text can be chosen as the final translated text of the target voice; conversely, if the value of score_trans is greater than the value of score_text, it indicates that each text unit in the second translated text is more likely to belong to the translation result, so the second translated text corresponding to score_trans can be chosen as the final translated text of the target voice.
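Step S603 then reduces to a comparison of the two confidence scores. A minimal sketch, using the score values from the two worked examples above (the candidate strings are placeholders for the actual translated texts):

```python
def select_final_translation(candidates):
    # Step S603: keep the candidate translation whose confidence
    # score is the largest.
    return max(candidates, key=lambda c: c[1])[0]

# Confidence scores from the two worked examples above
# (first translated text: 0.85, second translated text: 0.82).
final = select_final_translation([
    ("first translated text", 0.85),
    ("second translated text", 0.82),
])
```

With these scores the first translated text is kept, exactly as the comparison in S603 prescribes.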
In summary, this embodiment compares the confidences of the first translated text and the second translated text each serving as the final translated text of the target voice, so that the translated text whose text units are more likely to belong to the translation result can be selected according to the comparison result as the final translated text of the target voice; a more accurate translated text of the target voice can thereby be determined, improving the accuracy of the voice translation result.
Fourth embodiment
This embodiment will introduce a speech translation apparatus; for related content, refer to the above method embodiments.
Referring to Fig. 7, which is a schematic composition diagram of the speech translation apparatus provided by this embodiment, the apparatus includes:
a target voice acquiring unit 701, configured to obtain a target voice to be translated; and
a translated text obtaining unit 702, configured to translate a first translation object and a second translation object to obtain the final translated text of the target voice, where the first translation object is the identification text of the target voice and the second translation object is the target voice.
In one implementation of this embodiment, the translated text obtaining unit 702 includes:
a probability distribution generating subunit, configured to generate the first probability distribution and the second probability distribution corresponding to the k-th word in the final translated text;
where the first probability distribution includes the first decoding probabilities of the k-th word, obtained after decoding the identification text of the target voice, being each candidate word in the vocabulary, and the second probability distribution includes the second decoding probabilities of the k-th word, obtained after decoding the target voice, being each candidate word in the vocabulary; and
a translation result obtaining subunit, configured to obtain the translation result of the k-th word according to the first probability distribution and the second probability distribution.
In one implementation of this embodiment, the translation result obtaining subunit includes:
a fused decoding probability obtaining subunit, configured to fuse, in the first probability distribution and the second probability distribution, the first decoding probability and the second decoding probability corresponding to the same candidate word, to obtain the fused decoding probability that the k-th word in the final translated text is each candidate word;
a first translation result obtaining subunit, configured to select the candidate word corresponding to the largest fused decoding probability as the translation result of the k-th word.
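As an illustration only, the fusion step above can be sketched as follows. The equal interpolation weight, the helper name `fuse_and_pick`, and the toy vocabulary are assumptions made for this sketch; the embodiment does not fix a specific fusion formula.

```python
# Sketch: fuse the two decoding distributions for the k-th target word,
# then take the argmax as the translation result of that word.
# The 50/50 interpolation weight is an illustrative assumption.

def fuse_and_pick(p_text, p_speech, weight=0.5):
    """p_text / p_speech map each candidate word in the vocabulary to its
    first / second decoding probability for the k-th word."""
    fused = {w: weight * p_text[w] + (1 - weight) * p_speech[w] for w in p_text}
    # The candidate with the largest fused decoding probability becomes
    # the translation result of the k-th word.
    return max(fused, key=fused.get), fused

# First distribution: from decoding the recognized text of the target voice.
p_text = {"hello": 0.6, "hi": 0.3, "hey": 0.1}
# Second distribution: from decoding the target voice directly.
p_speech = {"hello": 0.4, "hi": 0.5, "hey": 0.1}
best, fused = fuse_and_pick(p_text, p_speech)
print(best)  # "hello": fused probabilities are 0.5, 0.4, 0.1
```

Note that even though the direct speech-translation branch alone would prefer "hi", the fusion lets the text-translation branch correct it.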
In one implementation of this embodiment, the translated text obtaining unit 702 includes:
a first translated text obtaining subunit, configured to translate the recognized text of the target voice to obtain a first translated text;
a second translated text obtaining subunit, configured to translate the target voice directly to obtain a second translated text;
a final translated text obtaining subunit, configured to obtain the final translated text of the target voice according to the first translated text and the second translated text.
In one implementation of this embodiment, the final translated text obtaining subunit includes:
a first confidence determining subunit, configured to determine the confidence of the first translated text serving as the final translated text of the target voice;
a second confidence determining subunit, configured to determine the confidence of the second translated text serving as the final translated text of the target voice;
a second translation result obtaining subunit, configured to select the translated text corresponding to the larger confidence as the final translated text of the target voice.
In one implementation of this embodiment, the first confidence determining subunit includes:
a first decoding probability obtaining subunit, configured to obtain the decoding probability corresponding to each text unit of the first translated text, wherein the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;
a first confidence obtaining subunit, configured to determine, according to the decoding probability corresponding to each text unit of the first translated text, the confidence of the first translated text serving as the final translated text of the target voice.
In one implementation of this embodiment, the second confidence determining subunit includes:
a second decoding probability obtaining subunit, configured to obtain the decoding probability corresponding to each text unit of the second translated text, wherein the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;
a second confidence obtaining subunit, configured to determine, according to the decoding probability corresponding to each text unit of the second translated text, the confidence of the second translated text serving as the final translated text of the target voice.
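One plausible way to turn the per-text-unit decoding probabilities into a sentence-level confidence is their geometric mean (a length-normalized product). The embodiment does not prescribe a formula, so the function name and the example probabilities below are hypothetical.

```python
import math

def sentence_confidence(token_probs):
    """Geometric mean of the per-text-unit decoding probabilities --
    one plausible confidence measure; the embodiment does not
    prescribe a specific formula."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

# Hypothetical per-unit decoding probabilities for the two candidates.
conf1 = sentence_confidence([0.9, 0.8, 0.95])   # first translated text
conf2 = sentence_confidence([0.7, 0.85, 0.6])   # second translated text

# The translated text with the larger confidence is selected as the
# final translated text of the target voice.
final = "first" if conf1 >= conf2 else "second"
```

The length normalization keeps longer candidates from being penalized merely for containing more text units.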
In one implementation of this embodiment, the translated text obtaining unit 702 includes:
a text recognition subunit, configured to recognize the target voice using a pre-built speech recognition model to obtain the recognized text;
a text translation subunit, configured to translate the recognized text using a pre-built text translation model;
a speech translation subunit, configured to translate the target voice using a pre-built speech translation model;
wherein the speech translation model and the speech recognition model either share or do not share some model parameters.
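The shared/unshared parameter arrangement can be illustrated structurally as follows. The class names are hypothetical, and the choice of the acoustic encoder as the shared component is an assumption; the embodiment only states that some model parameters may or may not be shared between the two models.

```python
class Encoder:
    """Stand-in for an acoustic encoder whose parameters may be shared."""
    def __init__(self, name):
        self.name = name

class SpeechRecognitionModel:
    def __init__(self, encoder):
        self.encoder = encoder  # speech -> hidden states -> source-language text

class SpeechTranslationModel:
    def __init__(self, encoder):
        self.encoder = encoder  # speech -> hidden states -> target-language text

# Shared configuration: both models reuse the same encoder parameters.
shared_encoder = Encoder("acoustic-encoder")
asr = SpeechRecognitionModel(shared_encoder)
st = SpeechTranslationModel(shared_encoder)
is_shared = asr.encoder is st.encoder  # True

# Unshared configuration: each model owns its own encoder.
asr2 = SpeechRecognitionModel(Encoder("asr-encoder"))
st2 = SpeechTranslationModel(Encoder("st-encoder"))
is_shared2 = asr2.encoder is st2.encoder  # False
```

Sharing the encoder lets recognition data regularize the translation model; keeping the encoders separate lets each be tuned to its own task.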
Further, an embodiment of the present application also provides a speech translation device, including: a processor, a memory, and a system bus;
the processor and the memory are connected through the system bus;
the memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to perform any implementation of the above speech translation method.
Further, an embodiment of the present application also provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to perform any implementation of the above speech translation method.
Further, an embodiment of the present application also provides a computer program product which, when run on a terminal device, causes the terminal device to perform any implementation of the above speech translation method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, or the part that contributes to the prior art, can essentially be embodied in the form of a software product. The computer software product can be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to perform the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may refer to each other. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
It should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (13)
1. A speech translation method, characterized by comprising:
obtaining a target voice to be translated;
translating a first translation object and a second translation object to obtain a final translated text of the target voice, wherein the first translation object is the recognized text of the target voice, and the second translation object is the target voice.
2. The method according to claim 1, wherein translating the first translation object and the second translation object to obtain the final translated text of the target voice comprises:
generating a first probability distribution and a second probability distribution corresponding to a k-th word in the final translated text;
wherein the first probability distribution comprises the first decoding probabilities that the k-th word, obtained after decoding the recognized text of the target voice, is each candidate word in a vocabulary, and the second probability distribution comprises the second decoding probabilities that the k-th word, obtained after decoding the target voice, is each candidate word in the vocabulary;
obtaining a translation result of the k-th word according to the first probability distribution and the second probability distribution.
3. The method according to claim 2, wherein obtaining the translation result of the k-th word according to the first probability distribution and the second probability distribution comprises:
fusing, in the first probability distribution and the second probability distribution, the first decoding probability and the second decoding probability corresponding to the same candidate word, to obtain the fused decoding probability that the k-th word in the final translated text is each candidate word;
selecting the candidate word corresponding to the largest fused decoding probability as the translation result of the k-th word.
4. The method according to claim 1, wherein translating the first translation object and the second translation object to obtain the final translated text of the target voice comprises:
translating the recognized text of the target voice to obtain a first translated text;
translating the target voice directly to obtain a second translated text;
obtaining the final translated text of the target voice according to the first translated text and the second translated text.
5. The method according to claim 4, wherein obtaining the final translated text of the target voice according to the first translated text and the second translated text comprises:
determining the confidence of the first translated text serving as the final translated text of the target voice;
determining the confidence of the second translated text serving as the final translated text of the target voice;
selecting the translated text corresponding to the larger confidence as the final translated text of the target voice.
6. The method according to claim 5, wherein determining the confidence of the first translated text serving as the final translated text of the target voice comprises:
obtaining the decoding probability corresponding to each text unit of the first translated text, wherein the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;
determining, according to the decoding probability corresponding to each text unit of the first translated text, the confidence of the first translated text serving as the final translated text of the target voice.
7. The method according to claim 5, wherein determining the confidence of the second translated text serving as the final translated text of the target voice comprises:
obtaining the decoding probability corresponding to each text unit of the second translated text, wherein the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;
determining, according to the decoding probability corresponding to each text unit of the second translated text, the confidence of the second translated text serving as the final translated text of the target voice.
8. The method according to any one of claims 1 to 7, wherein translating the first translation object and the second translation object comprises:
recognizing the target voice using a pre-built speech recognition model to obtain the recognized text;
translating the recognized text using a pre-built text translation model;
translating the target voice using a pre-built speech translation model;
wherein the speech translation model and the speech recognition model either share or do not share some model parameters.
9. A speech translation apparatus, characterized by comprising:
a target voice acquiring unit, configured to acquire a target voice to be translated;
a translated text obtaining unit, configured to translate a first translation object and a second translation object to obtain a final translated text of the target voice, wherein the first translation object is the recognized text of the target voice, and the second translation object is the target voice.
10. The apparatus according to claim 9, wherein the translated text obtaining unit comprises:
a probability distribution generating subunit, configured to generate a first probability distribution and a second probability distribution corresponding to a k-th word in the final translated text;
wherein the first probability distribution comprises the first decoding probabilities that the k-th word, obtained after decoding the recognized text of the target voice, is each candidate word in a vocabulary, and the second probability distribution comprises the second decoding probabilities that the k-th word, obtained after decoding the target voice, is each candidate word in the vocabulary;
a translation result obtaining subunit, configured to obtain a translation result of the k-th word according to the first probability distribution and the second probability distribution.
11. A speech translation device, characterized by comprising: a processor, a memory, and a system bus;
wherein the processor and the memory are connected through the system bus;
and the memory is configured to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 8.
12. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to perform the method according to any one of claims 1 to 8.
13. A computer program product, characterized in that, when the computer program product is run on a terminal device, the terminal device is caused to perform the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910199082.XA CN109979461B (en) | 2019-03-15 | 2019-03-15 | Voice translation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910199082.XA CN109979461B (en) | 2019-03-15 | 2019-03-15 | Voice translation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109979461A true CN109979461A (en) | 2019-07-05 |
CN109979461B CN109979461B (en) | 2022-02-25 |
Family
ID=67079130
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910199082.XA Active CN109979461B (en) | 2019-03-15 | 2019-03-15 | Voice translation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109979461B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050055217A1 (en) * | 2003-09-09 | 2005-03-10 | Advanced Telecommunications Research Institute International | System that translates by improving a plurality of candidate translations and selecting best translation |
US20090210214A1 (en) * | 2008-02-19 | 2009-08-20 | Jiang Qian | Universal Language Input |
JP2011090100A (en) * | 2009-10-21 | 2011-05-06 | National Institute Of Information & Communication Technology | Speech translation system, controller, speech recognition device, translation device, and speech synthesizer |
CN103678285A (en) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Machine translation method and machine translation system |
CN107170453A (en) * | 2017-05-18 | 2017-09-15 | 百度在线网络技术(北京)有限公司 | Across languages phonetic transcription methods, equipment and computer-readable recording medium based on artificial intelligence |
CN108986793A (en) * | 2018-09-28 | 2018-12-11 | 北京百度网讯科技有限公司 | translation processing method, device and equipment |
Non-Patent Citations (1)
Title |
---|
CHUANDONG XIE ET AL.: "《Web Data Selection Based on Word Embedding for Low-Resource Speech Recognition》", 《INTERSPEECH 2016》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112668346A (en) * | 2020-12-24 | 2021-04-16 | 科大讯飞股份有限公司 | Translation method, device, equipment and storage medium |
WO2022134164A1 (en) * | 2020-12-24 | 2022-06-30 | 科大讯飞股份有限公司 | Translation method, apparatus and device, and storage medium |
CN112668346B (en) * | 2020-12-24 | 2024-04-30 | 中国科学技术大学 | Translation method, device, equipment and storage medium |
CN112818704A (en) * | 2021-01-19 | 2021-05-18 | 传神语联网网络科技股份有限公司 | Multilingual translation system and method based on inter-thread consensus feedback |
CN112818704B (en) * | 2021-01-19 | 2024-04-02 | 传神语联网网络科技股份有限公司 | Multilingual translation system and method based on inter-thread consensus feedback |
WO2023082916A1 (en) * | 2021-11-10 | 2023-05-19 | 北京有竹居网络技术有限公司 | Training method, speech translation method, device and computer-readable medium |
Also Published As
Publication number | Publication date |
---|---|
CN109979461B (en) | 2022-02-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108062388B (en) | Reply generation method and device for man-machine conversation | |
CN108038107B (en) | Sentence emotion classification method, device and equipment based on convolutional neural network | |
CN109785824A (en) | A kind of training method and device of voiced translation model | |
CN111274362B (en) | Dialogue generation method based on transformer architecture | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
CN110516244B (en) | Automatic sentence filling method based on BERT | |
CN112214604A (en) | Training method of text classification model, text classification method, device and equipment | |
CN112000772B (en) | Sentence-to-semantic matching method based on semantic feature cube and oriented to intelligent question and answer | |
CN113051399B (en) | Small sample fine-grained entity classification method based on relational graph convolutional network | |
CN111460833A (en) | Text generation method, device and equipment | |
CN112069328B (en) | Method for establishing entity relation joint extraction model based on multi-label classification | |
CN109979461A (en) | A kind of voice translation method and device | |
CN115438215B (en) | Image-text bidirectional search and matching model training method, device, equipment and medium | |
CN113128206B (en) | Question generation method based on word importance weighting | |
CN112463989A (en) | Knowledge graph-based information acquisition method and system | |
CN108363685B (en) | Self-media data text representation method based on recursive variation self-coding model | |
CN112837669A (en) | Voice synthesis method and device and server | |
CN114281982B (en) | Book propaganda abstract generation method and system adopting multi-mode fusion technology | |
CN114528398A (en) | Emotion prediction method and system based on interactive double-graph convolutional network | |
CN114692605A (en) | Keyword generation method and device fusing syntactic structure information | |
CN110197521B (en) | Visual text embedding method based on semantic structure representation | |
CN112307179A (en) | Text matching method, device, equipment and storage medium | |
CN111797225A (en) | Text abstract generation method and device | |
CN114611529B (en) | Intention recognition method and device, electronic equipment and storage medium | |
CN115526149A (en) | Text summarization method for fusing double attention and generating confrontation network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||