CN109785824A - Training method and device for a speech translation model - Google Patents


Info

Publication number
CN109785824A
CN109785824A · Application CN201910198404.9A
Authority
CN
China
Prior art keywords
model, text, translation, speech recognition, translation model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910198404.9A
Other languages
Chinese (zh)
Other versions
CN109785824B (en)
Inventor
Ma Zhiqiang (马志强)
Liu Junhua (刘俊华)
Wei Si (魏思)
Hu Guoping (胡国平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201910198404.9A
Publication of CN109785824A
Application granted
Publication of CN109785824B
Active legal status
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

This application discloses a training method and device for a speech translation model. The method comprises: first obtaining model training data that includes sample speech; then directly translating the obtained sample speech using the current speech translation model to obtain a predicted translation text, while also recognizing the same sample speech using the current speech recognition model to obtain a predicted recognition text; and then updating the parameters of the speech translation model and the speech recognition model according to the obtained predicted translation text and predicted recognition text. Because the speech translation model and the speech recognition model share some model parameters, updating the parameters of the speech recognition model also updates the shared parameters in the speech translation model, making those shared parameters more accurate and thereby improving the translation performance of the speech translation model when it is used for speech translation.

Description

Training method and device for a speech translation model
Technical field
This application relates to the technical field of speech translation, and in particular to a training method and device for a speech translation model.
Background technique
Existing speech translation methods generally comprise two steps: speech recognition followed by text translation. Specifically, a segment of speech is first converted, via speech recognition technology, into text of the same language; that recognized text is then translated into text of another language using text translation technology, thereby completing the speech translation process.
However, combining speech recognition technology and text translation technology for speech translation suffers from error accumulation. For example, suppose speech recognition misrecognizes some word; when text translation is then applied to that word, a wrong translation result will be produced from the wrong word. Errors from the speech recognition stage thus accumulate into the text translation stage, causing inaccurate translation results. In other words, the translation performance of existing speech translation models still needs improvement.
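The error accumulation described above can be made concrete with a deliberately tiny toy cascade. The word lists and the word-for-word translation table below are invented for illustration and are not part of the patent; they only show how one recognition error survives unchanged into the translation stage:

```python
# Toy two-step pipeline: a tiny "ASR result" fed into a word-for-word
# "translation table". Suppose the speaker said "the mail is here",
# but the ASR stage misheard "mail" as "male":
asr_output = ["the", "male", "is", "here"]

# A hypothetical English-to-French lookup standing in for the MT stage.
mt_table = {"the": "le", "mail": "courrier", "male": "mâle",
            "is": "est", "here": "ici"}

translation = [mt_table[w] for w in asr_output]
print(translation)  # the ASR error surfaces verbatim: 'mâle' appears, 'courrier' never does
```

Had the recognition been correct, "mail" would have yielded "courrier"; the cascaded translation stage has no way to recover from the earlier error.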
Summary of the invention
The main purpose of the embodiments of the present application is to provide a training method and device for a speech translation model that can improve the translation performance of the speech translation model.
An embodiment of the present application provides a training method for a speech translation model, comprising:
obtaining model training data, the model training data including sample speech;
directly translating the sample speech using the current speech translation model to obtain a predicted translation text, wherein the speech translation model shares some model parameters with a speech recognition model;
recognizing the sample speech using the current speech recognition model to obtain a predicted recognition text;
updating the parameters of the current speech translation model and speech recognition model according to the predicted translation text and the predicted recognition text.
Optionally, updating the parameters of the current speech translation model and speech recognition model according to the predicted translation text and the predicted recognition text comprises:
obtaining the true translation text and the true recognition text of the sample speech;
updating the parameters of the current speech translation model and speech recognition model according to translation difference information and recognition difference information;
wherein the translation difference information is the difference between the predicted translation text and the true translation text, and the recognition difference information is the difference between the predicted recognition text and the true recognition text.
Optionally, updating the parameters of the current speech translation model and speech recognition model according to the translation difference information and the recognition difference information comprises:
updating the parameters of the speech translation model according to the translation difference information;
updating the parameters of the speech recognition model according to the recognition difference information.
Optionally, the speech recognition model and the speech translation model share one encoder; the speech recognition model includes a recognition decoder, and the speech translation model includes a translation decoder.
An embodiment of the present application also provides a speech translation method, comprising:
obtaining target speech to be translated;
translating the target speech using a speech translation model trained with the above training method for a speech translation model.
An embodiment of the present application also provides a training device for a speech translation model, comprising:
a training data acquiring unit for obtaining model training data, the model training data including sample speech;
a translation text obtaining unit for directly translating the sample speech using the current speech translation model to obtain a predicted translation text, wherein the speech translation model shares some model parameters with a speech recognition model;
a recognition text obtaining unit for recognizing the sample speech using the current speech recognition model to obtain a predicted recognition text;
a model parameter updating unit for updating the parameters of the current speech translation model and speech recognition model according to the predicted translation text and the predicted recognition text.
Optionally, the model parameter updating unit includes:
a true text obtaining subunit for obtaining the true translation text and the true recognition text of the sample speech;
a model parameter updating subunit for updating the parameters of the current speech translation model and speech recognition model according to translation difference information and recognition difference information;
wherein the translation difference information is the difference between the predicted translation text and the true translation text, and the recognition difference information is the difference between the predicted recognition text and the true recognition text.
Optionally, the model parameter updating subunit includes:
a translation model parameter updating subunit for updating the parameters of the speech translation model according to the translation difference information;
a recognition model parameter updating subunit for updating the parameters of the speech recognition model according to the recognition difference information.
Optionally, the speech recognition model and the speech translation model share one encoder; the speech recognition model includes a recognition decoder, and the speech translation model includes a translation decoder.
An embodiment of the present application also provides a speech translation device, comprising:
a target speech acquiring unit for obtaining target speech to be translated;
a target speech translation unit for translating the target speech using a speech translation model trained with the above training device for a speech translation model.
An embodiment of the present application also provides training equipment for a speech translation model, comprising: a processor, a memory, and a system bus;
the processor and the memory are connected by the system bus;
the memory is configured to store one or more programs, the one or more programs including instructions that, when executed by the processor, cause the processor to perform any implementation of the above training method for a speech translation model.
An embodiment of the present application also provides a computer-readable storage medium storing instructions that, when run on a terminal device, cause the terminal device to perform any implementation of the above training method for a speech translation model.
An embodiment of the present application also provides a computer program product that, when run on a terminal device, causes the terminal device to perform any implementation of the above training method for a speech translation model.
An embodiment of the present application also provides speech translation equipment, comprising: a processor, a memory, and a system bus;
the processor and the memory are connected by the system bus;
the memory is configured to store one or more programs, the one or more programs including instructions that, when executed by the processor, cause the processor to perform any implementation of the above speech translation method.
An embodiment of the present application also provides a computer-readable storage medium storing instructions that, when run on a terminal device, cause the terminal device to perform any implementation of the above speech translation method.
An embodiment of the present application also provides a computer program product that, when run on a terminal device, causes the terminal device to perform any implementation of the above speech translation method.
With the training method and device for a speech translation model provided by the embodiments of the present application, when the speech translation model is trained, model training data including sample speech is obtained first; the obtained sample speech is then directly translated using the current speech translation model to obtain a predicted translation text, and is simultaneously recognized using the current speech recognition model to obtain a predicted recognition text; the parameters of the current speech translation model and speech recognition model can then be updated according to the obtained predicted translation text and predicted recognition text. Because the current speech translation model shares some model parameters with the speech recognition model, updating the parameters of the speech recognition model also updates the shared parameters in the speech translation model, making those shared parameters of the trained speech translation model more accurate and thereby improving the translation performance of the speech translation model when it is used for speech translation.
Detailed description of the invention
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a first structural schematic diagram of the end-to-end speech translation model provided by embodiments of the present application;
Fig. 2 is a second structural schematic diagram of the end-to-end speech translation model provided by embodiments of the present application;
Fig. 3 is a flow diagram of a training method for a speech translation model provided by embodiments of the present application;
Fig. 4 is a first structural schematic diagram of the speech translation model and the speech recognition model provided by embodiments of the present application;
Fig. 5 is a second structural schematic diagram of the speech translation model and the speech recognition model provided by embodiments of the present application;
Fig. 6 is a flow diagram, provided by embodiments of the present application, of updating the parameters of the current speech translation model and speech recognition model according to the predicted translation text and the predicted recognition text;
Fig. 7 is a flow diagram, provided by embodiments of the present application, of updating the parameters of the current speech translation model and speech recognition model according to the translation difference information and the recognition difference information;
Fig. 8 is a flow diagram of a speech translation method provided by embodiments of the present application;
Fig. 9 is a schematic diagram of the composition of a training device for a speech translation model provided by embodiments of the present application;
Fig. 10 is a schematic diagram of the composition of a speech translation device provided by embodiments of the present application.
Specific embodiment
To make the purposes, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort shall fall within the protection scope of this application.
First embodiment
It should be noted that a traditional speech translation method usually first performs speech recognition on the speech, converting it into text of the same language, and then processes that recognized text: text translation technology translates the recognized text into text of another language, completing the speech translation. However, this traditional speech translation method often suffers from error accumulation; that is, if an error is produced during speech recognition, it accumulates into the subsequent text translation process, in turn making the translation result inaccurate.
Therefore, to address the above drawback, speech translation can be performed with an end-to-end speech translation model as shown in Fig. 1, which includes an encoder, an attention layer (Attention), and a decoder. With this speech translation model, the source-language speech can be translated directly into target-language text without performing speech recognition, realizing direct speech translation — for example, translating Chinese speech directly into English text.
In one optional implementation, the end-to-end speech translation model can use the network structure shown in Fig. 2. Taking this speech translation model as an example, the process of performing speech translation with it is introduced next:
(1) Inputting the audio features of the source-language speech
Audio feature extraction is first performed on the source-language speech that needs to be translated. For example, the Mel filter-bank features (Mel Bank Features) of the source-language speech can be extracted as its audio features. These audio features can be represented in the form of a feature vector, defined here as x_{1..T}, where T denotes the dimension of the audio feature vector of the source-language speech, i.e., the number of vector elements it contains. x_{1..T} can then be fed as input data to the end-to-end speech translation model shown in Fig. 2.
(2) Generating the encoding vector corresponding to the audio features of the source-language speech
As shown in Fig. 2, the encoding part of this end-to-end speech translation model includes two layers of convolutional neural networks (Convolutional Neural Networks, CNN) with max-pooling layers (MaxPooling), one convolutional long short-term memory layer (convolutional Long Short-Term Memory, convolutional LSTM), and three bidirectional long short-term memory layers (Bi-directional Long Short-Term Memory, BiLSTM).
After the audio features x_{1..T} of the source-language speech are input in step (1) above, they can be encoded by one CNN layer and then down-sampled by a MaxPooling layer; this operation is then repeated with another CNN layer and MaxPooling layer, yielding an encoding vector of length L. One convolutional LSTM layer and three BiLSTM layers then process this encoding vector to obtain the final encoding vector, defined as h_{1..L}, where L denotes the dimension of the encoding vector obtained by encoding the audio features of the source-language speech, i.e., the number of vector elements it contains. The formula for h_{1..L} is as follows:
h_{1..L} = enc(W_enc · x_{1..T})    (1)
where enc denotes the entire encoding computation of the model's encoding part, and W_enc denotes all the network parameters of every layer in the encoding part.
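As a rough illustration of how the two CNN + MaxPooling stages shorten the input from T audio frames to an encoding of length L, here is a minimal pure-Python sketch. The convolutions, the convolutional LSTM, and the BiLSTM layers are omitted, and the pooling stride of 2 is an assumption for the example, not taken from the patent:

```python
def max_pool_time(frames, stride=2):
    """Down-sample a sequence of feature vectors along time by taking
    the element-wise max over non-overlapping windows of `stride` frames."""
    pooled = []
    for i in range(0, len(frames) - stride + 1, stride):
        window = frames[i:i + stride]
        pooled.append([max(col) for col in zip(*window)])
    return pooled

# Toy input: T = 8 frames of 3-dimensional "audio features" x_{1..T}.
x = [[float(t + d) for d in range(3)] for t in range(8)]

# Two CNN + MaxPooling stages (the convolutions themselves are omitted;
# only the temporal down-sampling they are paired with is shown).
h = max_pool_time(max_pool_time(x))

print(len(x), len(h))  # 8 2 — T = 8 frames reduced to L = 2 encoded vectors
```

Each pooling stage halves the time axis, so two stages give L = T/4 here; the real encoder applies learned filters before each pooling step.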
(3) Generating the decoding vector corresponding to the encoding vector
As shown in Fig. 2, the decoding part of this end-to-end speech translation model includes four unidirectional long short-term memory (Long Short-Term Memory, LSTM) layers and a softmax classifier.
After the encoding part of the model encodes the audio features of the source-language speech into the encoding vector in step (2) above, an attention operation can first be applied to the encoding vector, so as to attend to the data in the encoding vector that is relevant to generating the decoding vector; the result is then decoded by the four LSTM layers and the softmax classifier to obtain the corresponding decoding vector, which is in turn used to generate the translation text of the source-language speech, defined as y_{1..K}, where K denotes the number of characters (or words) in the translation text.
The formulas for the decoding part are as follows:
c_k = att(s_k, h_{1..L})    (2)
s_k = lstm(y_{k-1}, s_{k-1}, c_{k-1})    (3)
y_k = softmax(W_y [s_k, c_k] + b_y)    (4)
where h_{1..L} is the encoding vector corresponding to the audio features of the source-language speech; c_k is the k-th attention result, att denotes the attention computation, and c_{k-1} is the (k-1)-th attention result; s_k is the k-th hidden-layer vector output by the four-layer LSTM network of the decoding part, lstm denotes the computation of that four-layer LSTM network, and s_{k-1} is its (k-1)-th hidden-layer vector; y_k is the k-th character (or word) in the translation text and y_{k-1} is the (k-1)-th; W_y and b_y are the model parameters of the softmax classifier.
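Formulas (2) and (4) can be made concrete with a minimal sketch of a single decoding step. The dot-product attention and the tiny two-word output layer below are illustrative choices standing in for the patent's att operator and softmax classifier — they are not the actual networks — and the recurrent state update of formula (3) is left out:

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(s_k, h):
    """c_k = att(s_k, h_{1..L}): dot-product attention over encoder states."""
    weights = softmax([dot(s_k, h_i) for h_i in h])
    dim = len(h[0])
    return [sum(w * h_i[d] for w, h_i in zip(weights, h)) for d in range(dim)]

# Toy encoder output h_{1..L} (L = 2, dim = 2) and a decoder hidden state s_k
# that is strongly aligned with the first encoder state.
h = [[1.0, 0.0], [0.0, 1.0]]
s_k = [5.0, 0.0]

c_k = attention(s_k, h)          # context vector, formula (2)
logits = [dot(c_k, w) for w in [[1.0, 0.0], [0.0, 1.0]]]
probs = softmax(logits)          # distribution over a 2-word vocabulary, formula (4)
print(probs.index(max(probs)))   # y_k: index of the predicted word → 0
```

Because s_k points at the first encoder state, the attention weight there dominates, the context vector c_k leans toward it, and the classifier picks word 0.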
If W_dec denotes all the network parameters of every layer in the model's decoding part, the translation text y_{1..K} of the source-language speech output by the model is computed as follows:
y_{1..K} = dec(W_dec · h_{1..L})    (5)
where dec denotes the entire decoding computation of the model's decoding part, W_dec denotes all the network parameters of every layer in the decoding part, and h_{1..L} is the encoding vector corresponding to the audio features of the source-language speech.
It should be noted that the network structures of the encoder and the decoder in the end-to-end speech translation model shown in Fig. 1 are not unique, and the network structure shown in Fig. 2 is only one example; other network structures or layer counts can also be adopted. For example, the model's encoder could instead use a recurrent neural network (Recurrent Neural Network, RNN) for encoding, and the number of layers can be set according to the actual situation; the embodiments of the present application do not limit this. The layer counts of the CNN, BiLSTM, etc. introduced above or below are only examples; this application does not limit them — they may be the counts mentioned in the embodiments or other counts.
In this embodiment, on the basis of the end-to-end speech translation model shown in Fig. 1, and in order to improve the model's translation performance, the speech translation model can further be trained in a multitask-training manner.
Multitask training is a machine-learning approach in which multiple related tasks are trained together. During training, the models for these related tasks share some model parameters — for example, the parameters of the lower model layers — so that the tasks share what each of them learns. Concretely, multiple related tasks can be learned simultaneously in parallel, and the shared model parameters are adjusted by back-propagating the gradients for all tasks at the same time, so that the related tasks help each other learn and the generalization of the task models improves. Compared with single-task training, multitask training thus yields better model generalization.
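The shared-parameter mechanism at the heart of multitask training can be sketched with scalars: one "encoder" parameter serves two toy tasks, each with its own private head, and a single gradient step moves the shared parameter using both tasks' errors at once. All values here are invented for illustration:

```python
# Toy multitask step: two scalar "tasks" share one parameter w_shared
# (standing in for the shared encoder), plus one private head each.
w_shared, w_task1, w_task2 = 1.0, 1.0, 1.0
x, y1, y2 = 2.0, 8.0, 4.0   # one input, one target per task
lr = 0.01

# Forward: each task's prediction flows through the shared parameter.
p1 = w_task1 * w_shared * x
p2 = w_task2 * w_shared * x

# Squared-error losses; gradients computed before any update is applied.
g_task1 = 2 * (p1 - y1) * w_shared * x
g_task2 = 2 * (p2 - y2) * w_shared * x
g_shared = 2 * (p1 - y1) * w_task1 * x + 2 * (p2 - y2) * w_task2 * x

w_task1 -= lr * g_task1
w_task2 -= lr * g_task2
# The shared parameter receives gradient from BOTH tasks at once.
w_shared -= lr * g_shared
print(w_shared)  # 1.32
```

Each private head moved only under its own task's error, while the shared parameter's step (+0.32) combines both error signals — exactly the "helping each other learn" described above.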
It should be noted that when multitask training is used in this embodiment, the speech translation model and the speech recognition model are concretely trained at the same time, so that after model training the speech translation model has good translation performance.
Next, the training method for a speech translation model provided by this embodiment is introduced in detail with reference to the drawings.
Referring to Fig. 3, which shows a flow diagram of a training method for a speech translation model provided by this embodiment, the method includes the following steps:
S301: Obtain model training data, where the model training data includes sample speech.
In this embodiment, to train the speech translation model and improve its translation performance, a large amount of preparation is needed in advance. First, a large amount of speech data must be collected as sample speech to constitute the model training data. For example, a large amount of recorded audio can be collected in advance — say, the speech of contestants in a reading-aloud competition, or conversation recordings — to serve as sample speech for training the model.
It should be noted that this embodiment does not limit the language of the sample speech: it may be, for example, Chinese speech or English speech. Likewise, this embodiment does not limit the length of the sample speech: it may be one sentence or several sentences.
It should also be noted that this embodiment uses the sample speech to train the speech translation model and the speech recognition model over multiple rounds. Taking the sample speech used in the current round of training as an example, the current round of model training is realized through the subsequent steps S302-S304, described in detail below.
S302: Directly translate the sample speech using the current speech translation model to obtain a predicted translation text.
In this embodiment, after the sample speech is obtained in step S301, the current speech translation model can directly translate it (without speech recognition) to obtain a predicted translation text. For example, if a sample utterance is Chinese speech, the current speech translation model can translate it directly to obtain a predicted English translation text.
Here, the speech translation model shares some model parameters with a speech recognition model. In one optional implementation, the current speech translation model has the network structure shown in Fig. 4: it shares one encoder with a speech recognition model; the speech recognition model includes a recognition decoder for generating the predicted recognition text, while the current speech translation model includes a translation decoder for generating the predicted translation text. It should be noted that the network structures of the recognition decoder of the speech recognition model and the translation decoder of the speech translation model in Fig. 4 may be the same or different; their concrete structures can be set according to the actual situation, and the embodiments of the present application do not limit this.
For example, suppose the content of a sample utterance is the Chinese sentence "end-to-end speech translation system", and the language of its translation text is English. After the current speech translation model shown in Fig. 4 translates it directly, the predicted translation text "The end-to-end speech translation system" can be obtained.
S303: Recognize the sample speech using the current speech recognition model to obtain a predicted recognition text.
In this embodiment, after the sample speech is obtained in step S301, the current speech recognition model can recognize it to obtain a predicted recognition text. For example, if a sample utterance is Chinese speech, the current speech recognition model can recognize it to obtain a predicted Chinese recognition text in the same language.
Here, this speech recognition model can share some model parameters with the speech translation model introduced in step S302. In one optional implementation, the current speech recognition model has the network structure shown in Fig. 4: it shares one encoder with the speech translation model, and includes a recognition decoder for generating the predicted recognition text.
For example, suppose the content of a sample utterance is still the Chinese sentence "end-to-end speech translation system". After the current speech recognition model shown in Fig. 4 recognizes it, the predicted Chinese recognition text "end-to-end speech translation system" can be obtained.
In this embodiment, one optional implementation is that the speech translation model and the speech recognition model use the network structure shown in Fig. 5. Based on this model structure, the concrete process of generating the predicted translation text and the predicted recognition text is as follows:
(1) Inputting the audio features of the sample speech
Audio feature extraction is first performed on the sample speech; for example, the Mel filter-bank features of the sample speech can be extracted as its audio features. This feature vector is defined as x_{1..T} and fed as input data to the encoder shown in Fig. 5.
(2) Generating the encoding vector corresponding to the audio features of the sample speech
As shown in Fig. 5, the encoder shared by the speech translation model and the speech recognition model has the same structure as the encoding part shown in Fig. 2, which is not repeated here. After this encoder encodes the audio features x_{1..T} of the sample speech input in step (1) above, the final encoding vector h_{1..L} can be obtained, where L denotes the dimension of the encoding vector obtained by encoding the audio features of the sample speech, i.e., the number of vector elements it contains. The formula for h_{1..L} is the above formula (1), i.e.:
h_{1..L} = enc(W_enc · x_{1..T})
where enc denotes the entire encoding computation of the encoder in Fig. 5, and W_enc denotes all the network parameters of every layer of that encoder.
(3) Generating the decoding vectors corresponding to the encoding vector
As shown in Fig. 5, assume that the recognition decoder of the speech recognition model and the translation decoder of the speech translation model have identical network structures, each including four LSTM layers and a softmax classifier, but that the two do not share their trainable parameters.
After the encoder encodes the audio features of the sample speech into the encoding vector h_{1..L} in step (2) above, as shown in Fig. 5, an attention operation can first be applied to the encoding vector for each decoder; the four LSTM layers and softmax classifiers of the translation decoder and the recognition decoder then decode the respective attention results to obtain the corresponding decoding vectors. These two decoding vectors are used to generate, respectively, the predicted translation text y_{1..K} of the sample speech and its predicted recognition text z_{1..N}, where N denotes the number of characters (or words) in the predicted recognition text. The formulas for the decoding part are the above formulas (2), (3), and (4), which are not repeated here.
If W_dec denotes all the network parameters of each layer of the translation decoder of the speech translation model in Fig. 5, the predicted translation text y_1...K output by the model is computed as follows:
y_1...K = dec(W_dec h_1...L)    (6)
where dec denotes the entire decoding computation of the translation decoder of the speech translation model in Fig. 5; W_dec denotes all the network parameters of each layer of that translation decoder; and h_1...L denotes the coding vector corresponding to the audio features of the sample speech.
Similarly, if W_asr denotes all the network parameters of each layer of the recognition decoder of the speech recognition model in Fig. 5, the predicted recognition text z_1...N output by the model is computed as follows:
z_1...N = dec(W_asr h_1...L)    (7)
where dec denotes the entire decoding computation of the recognition decoder of the speech recognition model in Fig. 5; W_asr denotes all the network parameters of each layer of that recognition decoder; and h_1...L denotes the coding vector corresponding to the audio features of the sample speech.
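Equations (6) and (7) can be sketched as two unshared decoders reading the same encoder output. In the hypothetical NumPy sketch below, mean-pooling stands in for the attention operation and a single softmax layer stands in for the 4-layer LSTM stack; only the data flow, not the real architecture, is shown:

```python
import numpy as np

rng = np.random.default_rng(1)
L, d_h, V = 5, 4, 10  # number of coding vectors, hidden size, toy vocabulary

h = rng.standard_normal((L, d_h))  # shared encoder output h_{1..L}

# Separate (unshared) decoder parameters, mirroring equations (6) and (7).
W_dec = rng.standard_normal((V, d_h)) * 0.1  # translation-decoder weights
W_asr = rng.standard_normal((V, d_h)) * 0.1  # recognition-decoder weights

def dec(W, h):
    # Stand-in for a decoder step: attention + 4-layer LSTM + softmax.
    # Mean-pooling replaces the attention operation to keep the sketch short.
    context = h.mean(axis=0)
    logits = W @ context
    p = np.exp(logits - logits.max())
    return p / p.sum()  # softmax distribution over the vocabulary

p_y = dec(W_dec, h)  # distribution used to emit the translation text y_{1..K}
p_z = dec(W_asr, h)  # distribution used to emit the recognition text z_{1..N}
```

Because both calls read the same `h`, any improvement to the encoder benefits both output texts, while `W_dec` and `W_asr` stay independent.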
It should be noted that this embodiment does not limit the execution order of S302 and S303: S302 may be executed before S303, S303 may be executed before S302, or S302 and S303 may be executed simultaneously.
S304: updating parameters of the current speech translation model and speech recognition model according to the predicted translation text and the predicted recognition text of the sample speech.
In this embodiment, sample speech items may be extracted one by one from the model training data mentioned in S301 for model training, and the parameters of the current speech translation model and speech recognition model are updated through multiple rounds of training.
Before model training, the model parameters W_enc, W_dec and W_asr of the speech translation model and the speech recognition model may first be randomly initialized. Then, in the first round of training, the model parameters W_enc, W_dec and W_asr are updated through the above steps S302-S303; in the second round of training, steps S302-S303 are performed again on the basis of the parameters updated in the first round to carry out the second round of parameter updates, and so on, until the training ends.
As an example, the objective function used in this embodiment during training is as follows:
Obj = λ log p(y|x) + (1-λ) log p(z|x)    (8)
where λ denotes a weight whose value may be set between 0 and 1 based on experimental results or experience; y denotes the predicted translation text output by the speech translation model; z denotes the predicted recognition text output by the speech recognition model; and x denotes the audio feature data of the sample speech.
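A minimal numeric reading of formula (8), with made-up probabilities p(y|x) and p(z|x) and an assumed weight λ = 0.7 (all values are illustrative, not from the patent):

```python
import math

# Probabilities the two branches assign to the reference texts for one
# sample (made-up values, for illustration only).
p_y_given_x = 0.42  # p(y | x): translation branch
p_z_given_x = 0.65  # p(z | x): recognition branch

lam = 0.7  # the weight λ of formula (8), chosen in (0, 1) empirically

obj = lam * math.log(p_y_given_x) + (1 - lam) * math.log(p_z_given_x)
# Maximizing obj trades off translation likelihood against recognition
# likelihood; λ = 1 would reduce it to pure translation training.
```

Since log probabilities are negative, Obj is a weighted average of two negative terms and always lies between log p(y|x) and log p(z|x).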
Specifically, in one optional implementation, as shown in Fig. 6, the implementation of this step S304 may include steps S601-S602:
S601: obtaining the true translation text and the true recognition text of the sample speech.
In this implementation, while each sample speech item is obtained as model training data, the true translation text and the true recognition text corresponding to each sample speech item may also be obtained. For example, assuming the content of the sample speech is still the Chinese utterance "端到端语音翻译系统" ("end-to-end speech translation system"), its corresponding true translation text is "The end-to-end speech translation system", and its true recognition text is "端到端语音翻译系统".
S602: updating parameters of the current speech translation model and speech recognition model according to translation difference information and recognition difference information.
In this implementation, the translation difference information refers to the difference between the predicted translation text and the true translation text. For example, assuming the predicted translation text is "The end-to-end speech translate system" and the true translation text is "The end-to-end speech translation system", the translation difference information of the two is "translate" versus "translation".
In this implementation, the recognition difference information refers to the difference between the predicted recognition text and the true recognition text. For example, assuming the predicted recognition text is "端到端语言翻译系统" and the true recognition text is "端到端语音翻译系统", the recognition difference information of the two is the character "言" versus "音".
Thus, after the true translation text and the true recognition text of the sample speech are obtained through step S601, the translation difference information between the true translation text and the predicted translation text of the sample speech, as well as the recognition difference information between the true recognition text and the predicted recognition text of the sample speech, can be further obtained; the parameters of the current speech translation model and speech recognition model can then be updated according to this translation difference information and recognition difference information, respectively.
In one implementation, as shown in Fig. 7, the specific implementation of step S602 may include the following steps S701-S702:
S701: updating parameters of the speech translation model according to the translation difference information.
In this embodiment, after the translation difference information corresponding to the sample speech is obtained, the model parameters W_enc and W_dec corresponding to the encoder and the translation decoder of the speech translation model may be updated by backward gradient propagation according to the translation difference information.
S702: updating parameters of the speech recognition model according to the recognition difference information.
In this embodiment, after the recognition difference information corresponding to the sample speech is obtained, the model parameters W_enc and W_asr corresponding to the encoder and the recognition decoder of the speech recognition model may be updated by backward gradient propagation according to the recognition difference information.
It should be noted that this embodiment does not limit the execution order of S701 and S702: S701 may be executed before S702, S702 may be executed before S701, or S701 and S702 may be executed simultaneously.
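The effect of steps S701-S702 on the shared encoder can be illustrated with a scalar toy (hypothetical squared-error losses, not the patent's objective): each branch back-propagates into its own decoder parameter, while the shared encoder parameter accumulates gradients from both branches:

```python
# Scalar toy of steps S701-S702: the translation loss updates (w_enc, w_dec),
# the recognition loss updates (w_enc, w_asr), and the shared encoder
# parameter w_enc accumulates gradients from both branches.
w_enc, w_dec, w_asr = 1.0, 0.5, 0.3
y_true, z_true = 2.0, 1.0  # stand-ins for the reference outputs
lr = 0.1

# Translation branch: prediction w_dec * w_enc, squared-error "difference".
err_mt = w_dec * w_enc - y_true
g_enc_mt = 2 * err_mt * w_dec   # d(loss_mt)/d(w_enc)
g_dec    = 2 * err_mt * w_enc   # d(loss_mt)/d(w_dec)

# Recognition branch: prediction w_asr * w_enc.
err_asr = w_asr * w_enc - z_true
g_enc_asr = 2 * err_asr * w_asr  # d(loss_asr)/d(w_enc)
g_asr     = 2 * err_asr * w_enc  # d(loss_asr)/d(w_asr)

w_dec -= lr * g_dec                    # S701: translation difference -> W_dec
w_asr -= lr * g_asr                    # S702: recognition difference -> W_asr
w_enc -= lr * (g_enc_mt + g_enc_asr)   # shared encoder updated by both
```

The last line is the key point of the scheme: even a recognition-only training example moves `w_enc`, and that movement is inherited by the translation branch because the encoder is shared.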
As it can be seen that using prediction cypher text and Forecasting recognition text, while updating current voiced translation model and language During the parameter of sound identification model, by the training to speech recognition modeling, the model in meeting real-time update encoder is joined Number, so that coding result is more acurrate, as shown in Figure 4 and Figure 5, since speech recognition modeling and voiced translation model sharing one are compiled Code device, in this way, voiced translation model is translated and decoded device in decoding in the more accurate situation of coding result, it can basis More accurate coding result is decoded, so that more accurate decoding result is obtained, it is thus possible to improve voiced translation is accurate Degree, that is, the translation performance of voiced translation model can be promoted.
To sum up, the training method of a kind of voiced translation model provided in this embodiment, is instructed to voiced translation model When practicing, the model training data including each sample voice are obtained first, then, using current voiced translation model to acquisition To sample voice directly translated, obtain prediction cypher text, meanwhile, using current speech recognition modeling to getting Sample voice identified, obtain Forecasting recognition text, then, can be according to obtained prediction cypher text and Forecasting recognition Text updates the parameter of current voiced translation model and speech recognition modeling.Due to current voiced translation model and voice Identification model shares department pattern parameter, so, it, equally can be to voiced translation model when updating the parameter of speech recognition modeling In share the model parameter of part and be updated so that this department pattern parameter of voiced translation model that training obtains is more It is accurate to add, and then when carrying out voiced translation using the voiced translation model, is able to ascend the translation performance of voiced translation model.
Second Embodiment
The above is a specific implementation of the training method of a speech translation model provided by the first embodiment of the present application. Based on the speech translation model obtained by training in the above embodiment, an embodiment of the present application further provides a speech translation method.
Referring to Fig. 8, which shows a flowchart of a speech translation method provided by an embodiment of the present application, as shown in Fig. 8, the method comprises:
S801: obtaining target speech to be translated.
In this embodiment, any speech to be translated using this embodiment is defined as target speech. The target speech is in the same language as the sample speech in the above first embodiment.
It can be understood that the target speech may be obtained according to actual needs by means such as recording; for example, telephone conversation speech or conference recordings in people's daily life may serve as target speech. After the target speech is obtained, it can be translated through the subsequent step S802.
S802: translating the target speech using the speech translation model obtained by training.
In practical applications, after the target speech to be translated is obtained through step S801, the extracted audio features of the target speech (for example, spectrum features such as Mel spectrum features) may further be input into the speech translation model obtained by training in the first embodiment, so as to obtain the translation text corresponding to the target speech and thereby realize the translation of the target speech.
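The inference path of S801-S802 can be sketched as follows (all helper names are hypothetical; the feature extractor is a crude stand-in for Mel-spectrum extraction and the model is a dummy callable, not the trained network):

```python
import numpy as np

def extract_features(waveform, frame=4):
    # Hypothetical stand-in for Mel-spectrum feature extraction: chop the
    # waveform into fixed-size frames and take per-frame magnitudes.
    n = len(waveform) // frame
    return np.abs(waveform[: n * frame].reshape(n, frame))

def translate(model, waveform):
    # End-to-end path of S801-S802: audio features in, translated text out.
    # `model` is assumed to wrap the trained encoder and translation decoder.
    return model(extract_features(waveform))

# Dummy "trained model" mapping each frame to a token id (illustration only).
dummy_model = lambda feats: [int(f.sum() * 10) % 5 for f in feats]

audio = np.linspace(-1.0, 1.0, 16)  # stand-in for the target speech
tokens = translate(dummy_model, audio)  # 4 frames -> 4 token ids
```

The point of the sketch is the single hop from audio features to translated tokens: no intermediate transcript is produced, which is what removes the recognition-error accumulation discussed below.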
In summary, in the speech translation method provided by this embodiment, after the target speech to be translated is obtained, the target speech is translated using the speech translation model obtained by training in the above first embodiment, so that the target speech can be directly translated into text of the corresponding language without performing any speech recognition operation on it. Therefore, compared with the existing speech translation method that first performs speech recognition and then performs text translation, this embodiment can reduce the error accumulation brought by speech recognition and obtain a more accurate speech translation result.
Third Embodiment
This embodiment will introduce a training apparatus of a speech translation model; for related content, reference is made to the above method embodiments.
Referring to Fig. 9, which is a schematic composition diagram of a training apparatus of a speech translation model provided by this embodiment, the apparatus 900 comprises:
a training data obtaining unit 901, configured to obtain model training data, the model training data including each sample speech item;
a translation text obtaining unit 902, configured to directly translate the sample speech using a current speech translation model to obtain a predicted translation text, wherein the speech translation model and a speech recognition model share part of the model parameters;
a recognition text obtaining unit 903, configured to recognize the sample speech using a current speech recognition model to obtain a predicted recognition text;
a model parameter updating unit 904, configured to update parameters of the current speech translation model and speech recognition model according to the predicted translation text and the predicted recognition text.
In one implementation of this embodiment, the model parameter updating unit 904 includes:
a real text obtaining subunit, configured to obtain the true translation text and the true recognition text of the sample speech;
a model parameter updating subunit, configured to update the parameters of the current speech translation model and speech recognition model according to translation difference information and recognition difference information;
wherein the translation difference information is the difference between the predicted translation text and the true translation text, and the recognition difference information is the difference between the predicted recognition text and the true recognition text.
In one implementation of this embodiment, the model parameter updating subunit includes:
a translation model parameter updating subunit, configured to update the parameters of the speech translation model according to the translation difference information;
a recognition model parameter updating subunit, configured to update the parameters of the speech recognition model according to the recognition difference information.
In one implementation of this embodiment, the speech recognition model and the speech translation model share one encoder, the speech recognition model includes a recognition decoder, and the speech translation model includes a translation decoder.
Further, an embodiment of the present application also provides a training device of a speech translation model, comprising: a processor, a memory and a system bus;
the processor and the memory are connected by the system bus;
the memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to execute any implementation of the above training method of a speech translation model.
Further, an embodiment of the present application also provides a computer-readable storage medium, in which instructions are stored; when the instructions are run on a terminal device, the terminal device is caused to execute any implementation of the above training method of a speech translation model.
Further, an embodiment of the present application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the above training method of a speech translation model.
Fourth Embodiment
This embodiment will introduce a speech translation apparatus; for related content, reference is made to the above method embodiments.
Referring to Fig. 10, which is a schematic composition diagram of a speech translation apparatus provided by this embodiment, the apparatus comprises:
a target speech obtaining unit 1001, configured to obtain target speech to be translated;
a target speech translation unit 1002, configured to translate the target speech using a speech translation model obtained by training with the above training apparatus of a speech translation model.
Further, an embodiment of the present application also provides a speech translation device, comprising: a processor, a memory and a system bus;
the processor and the memory are connected by the system bus;
the memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to execute any implementation of the above speech translation method.
Further, an embodiment of the present application also provides a computer-readable storage medium, in which instructions are stored; when the instructions are run on a terminal device, the terminal device is caused to execute any implementation of the above speech translation method.
Further, an embodiment of the present application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the above speech translation method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product, which can be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to execute the methods described in each embodiment or in certain parts of the embodiments of the present application.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. As for the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and for related points reference is made to the description of the method part.
It should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device including that element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to conform to the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A training method of a speech translation model, characterized by comprising:
obtaining model training data, the model training data including each sample speech item;
directly translating the sample speech using a current speech translation model to obtain a predicted translation text, wherein the speech translation model and a speech recognition model share part of the model parameters;
recognizing the sample speech using a current speech recognition model to obtain a predicted recognition text; and
updating parameters of the current speech translation model and the speech recognition model according to the predicted translation text and the predicted recognition text.
2. The method according to claim 1, wherein the updating parameters of the current speech translation model and the speech recognition model according to the predicted translation text and the predicted recognition text comprises:
obtaining a true translation text and a true recognition text of the sample speech; and
updating the parameters of the current speech translation model and the speech recognition model according to translation difference information and recognition difference information;
wherein the translation difference information is a difference between the predicted translation text and the true translation text, and the recognition difference information is a difference between the predicted recognition text and the true recognition text.
3. The method according to claim 2, wherein the updating the parameters of the current speech translation model and the speech recognition model according to translation difference information and recognition difference information comprises:
updating parameters of the speech translation model according to the translation difference information; and
updating parameters of the speech recognition model according to the recognition difference information.
4. The method according to any one of claims 1 to 3, wherein the speech recognition model and the speech translation model share one encoder, the speech recognition model comprises a recognition decoder, and the speech translation model comprises a translation decoder.
5. A speech translation method, characterized by comprising:
obtaining target speech to be translated; and
translating the target speech using a speech translation model obtained by training with the method according to any one of claims 1 to 4.
6. A training apparatus of a speech translation model, characterized by comprising:
a training data obtaining unit, configured to obtain model training data, the model training data including each sample speech item;
a translation text obtaining unit, configured to directly translate the sample speech using a current speech translation model to obtain a predicted translation text, wherein the speech translation model and a speech recognition model share part of the model parameters;
a recognition text obtaining unit, configured to recognize the sample speech using a current speech recognition model to obtain a predicted recognition text; and
a model parameter updating unit, configured to update parameters of the current speech translation model and the speech recognition model according to the predicted translation text and the predicted recognition text.
7. The apparatus according to claim 6, wherein the model parameter updating unit comprises:
a real text obtaining subunit, configured to obtain a true translation text and a true recognition text of the sample speech; and
a model parameter updating subunit, configured to update the parameters of the current speech translation model and the speech recognition model according to translation difference information and recognition difference information;
wherein the translation difference information is a difference between the predicted translation text and the true translation text, and the recognition difference information is a difference between the predicted recognition text and the true recognition text.
8. The apparatus according to claim 7, wherein the model parameter updating subunit comprises:
a translation model parameter updating subunit, configured to update parameters of the speech translation model according to the translation difference information; and
a recognition model parameter updating subunit, configured to update parameters of the speech recognition model according to the recognition difference information.
9. The apparatus according to any one of claims 6 to 8, wherein the speech recognition model and the speech translation model share one encoder, the speech recognition model comprises a recognition decoder, and the speech translation model comprises a translation decoder.
10. A speech translation apparatus, characterized by comprising:
a target speech obtaining unit, configured to obtain target speech to be translated; and
a target speech translation unit, configured to translate the target speech using a speech translation model obtained by training with the apparatus according to any one of claims 6 to 9.
11. A training device of a speech translation model, characterized by comprising: a processor, a memory, and a system bus;
the processor and the memory being connected by the system bus;
the memory being configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 4.
12. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to perform the method according to any one of claims 1 to 4.
13. A computer program product, characterized in that, when the computer program product is run on a terminal device, the terminal device is caused to perform the method according to any one of claims 1 to 4.
14. A speech translation device, characterized by comprising: a processor, a memory, and a system bus;
the processor and the memory being connected by the system bus;
the memory being configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to perform the method according to claim 5.
15. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to perform the method according to claim 5.
16. A computer program product, characterized in that, when the computer program product is run on a terminal device, the terminal device is caused to perform the method according to claim 5.
CN201910198404.9A 2019-03-15 2019-03-15 Training method and device of voice translation model Active CN109785824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910198404.9A CN109785824B (en) 2019-03-15 2019-03-15 Training method and device of voice translation model


Publications (2)

Publication Number Publication Date
CN109785824A true CN109785824A (en) 2019-05-21
CN109785824B CN109785824B (en) 2021-04-06

Family

ID=66488103


Country Status (1)

Country Link
CN (1) CN109785824B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298046A (en) * 2019-07-03 2019-10-01 科大讯飞股份有限公司 A kind of translation model training method, text interpretation method and relevant apparatus
CN111243576A (en) * 2020-01-16 2020-06-05 腾讯科技(深圳)有限公司 Speech recognition and model training method, device, equipment and storage medium
CN111488486A (en) * 2020-04-20 2020-08-04 武汉大学 Electronic music classification method and system based on multi-sound-source separation
CN111508470A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Training method and device of speech synthesis model
CN111862987A (en) * 2020-07-20 2020-10-30 北京百度网讯科技有限公司 Speech recognition method and device
CN112259083A (en) * 2020-10-16 2021-01-22 北京猿力未来科技有限公司 Audio processing method and device
CN112257471A (en) * 2020-11-12 2021-01-22 腾讯科技(深圳)有限公司 Model training method and device, computer equipment and storage medium
CN112528679A (en) * 2020-12-17 2021-03-19 科大讯飞股份有限公司 Intention understanding model training method and device and intention understanding method and device
CN112633017A (en) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium
CN112699690A (en) * 2020-12-29 2021-04-23 科大讯飞股份有限公司 Translation model training method, translation method, electronic device, and storage medium
CN113362810A (en) * 2021-05-28 2021-09-07 平安科技(深圳)有限公司 Training method, device and equipment of voice processing model and storage medium
CN113420869A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Translation method based on omnidirectional attention and related equipment thereof
CN113674732A (en) * 2021-08-16 2021-11-19 北京百度网讯科技有限公司 Voice confidence detection method and device, electronic equipment and storage medium
CN113763937A (en) * 2021-10-27 2021-12-07 北京百度网讯科技有限公司 Method, device and equipment for generating voice processing model and storage medium
CN116450771A (en) * 2022-12-16 2023-07-18 镁佳(北京)科技有限公司 Multilingual speech translation model construction method and device
CN117275461A (en) * 2023-11-23 2023-12-22 上海蜜度科技股份有限公司 Multitasking audio processing method, system, storage medium and electronic equipment
KR102640315B1 (en) * 2022-10-28 2024-02-22 숭실대학교산학협력단 Deep learning-based voice phishing detection apparatus and method therefor

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100890404B1 (en) * 2007-07-13 2009-03-26 한국전자통신연구원 Method and Apparatus for auto translation using Speech Recognition
US20180075844A1 (en) * 2016-09-09 2018-03-15 Electronics And Telecommunications Research Institute Speech recognition system and method
CN108595443A (en) * 2018-03-30 2018-09-28 浙江吉利控股集团有限公司 Simultaneous interpreting method, device, intelligent vehicle mounted terminal and storage medium
CN108766414A (en) * 2018-06-29 2018-11-06 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for voiced translation
CN108804427A (en) * 2018-06-12 2018-11-13 深圳市译家智能科技有限公司 Speech robot interpretation method and device
US20180342239A1 (en) * 2017-05-26 2018-11-29 International Business Machines Corporation Closed captioning through language detection
US20180365232A1 (en) * 2017-06-14 2018-12-20 Microsoft Technology Licensing, Llc Customized multi-device translated and transcribed conversations
CN109359309A (en) * 2018-12-11 2019-02-19 成都金山互动娱乐科技有限公司 A kind of interpretation method and device, the training method of translation model and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hans Krupakar et al.: "A survey of voice translation methodologies — Acoustic dialect decoder", 2016 International Conference on Information Communication and Embedded Systems (ICICES) *
Cui Lei: "Research on Domain Adaptation for Statistical Machine Translation", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298046B (en) * 2019-07-03 2023-04-07 科大讯飞股份有限公司 Translation model training method, text translation method and related device
CN110298046A (en) * 2019-07-03 2019-10-01 科大讯飞股份有限公司 A kind of translation model training method, text interpretation method and relevant apparatus
CN111243576A (en) * 2020-01-16 2020-06-05 腾讯科技(深圳)有限公司 Speech recognition and model training method, device, equipment and storage medium
CN111243576B (en) * 2020-01-16 2022-06-03 腾讯科技(深圳)有限公司 Speech recognition and model training method, device, equipment and storage medium
CN111488486A (en) * 2020-04-20 2020-08-04 武汉大学 Electronic music classification method and system based on multi-sound-source separation
CN111488486B (en) * 2020-04-20 2021-08-17 武汉大学 Electronic music classification method and system based on multi-sound-source separation
CN111508470A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Training method and device of speech synthesis model
CN111508470B (en) * 2020-04-26 2024-04-12 北京声智科技有限公司 Training method and device for speech synthesis model
US11735168B2 (en) 2020-07-20 2023-08-22 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recognizing voice
CN111862987A (en) * 2020-07-20 2020-10-30 北京百度网讯科技有限公司 Speech recognition method and device
CN111862987B (en) * 2020-07-20 2021-12-28 北京百度网讯科技有限公司 Speech recognition method and device
CN112259083B (en) * 2020-10-16 2024-02-13 北京猿力未来科技有限公司 Audio processing method and device
CN112259083A (en) * 2020-10-16 2021-01-22 北京猿力未来科技有限公司 Audio processing method and device
CN112257471A (en) * 2020-11-12 2021-01-22 腾讯科技(深圳)有限公司 Model training method and device, computer equipment and storage medium
CN112528679A (en) * 2020-12-17 2021-03-19 科大讯飞股份有限公司 Intention understanding model training method and device and intention understanding method and device
CN112528679B (en) * 2020-12-17 2024-02-13 科大讯飞股份有限公司 Method and device for training intention understanding model, and method and device for intention understanding
CN112633017A (en) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium
CN112633017B (en) * 2020-12-24 2023-07-25 北京百度网讯科技有限公司 Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium
CN112699690A (en) * 2020-12-29 2021-04-23 科大讯飞股份有限公司 Translation model training method, translation method, electronic device, and storage medium
CN112699690B (en) * 2020-12-29 2024-02-13 科大讯飞股份有限公司 Translation model training method, translation method, electronic device and storage medium
CN113362810A (en) * 2021-05-28 2021-09-07 平安科技(深圳)有限公司 Training method, device and equipment of voice processing model and storage medium
CN113362810B (en) * 2021-05-28 2024-02-09 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of voice processing model
CN113420869A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Translation method based on omnidirectional attention and related equipment thereof
CN113420869B (en) * 2021-06-30 2024-03-15 平安科技(深圳)有限公司 Translation method based on omnidirectional attention and related equipment thereof
CN113674732A (en) * 2021-08-16 2021-11-19 北京百度网讯科技有限公司 Voice confidence detection method and device, electronic equipment and storage medium
CN113763937A (en) * 2021-10-27 2021-12-07 北京百度网讯科技有限公司 Method, device and equipment for generating voice processing model and storage medium
KR102640315B1 (en) * 2022-10-28 2024-02-22 숭실대학교산학협력단 Deep learning-based voice phishing detection apparatus and method therefor
CN116450771A (en) * 2022-12-16 2023-07-18 镁佳(北京)科技有限公司 Multilingual speech translation model construction method and device
CN117275461A (en) * 2023-11-23 2023-12-22 上海蜜度科技股份有限公司 Multitasking audio processing method, system, storage medium and electronic equipment
CN117275461B (en) * 2023-11-23 2024-03-15 上海蜜度科技股份有限公司 Multitasking audio processing method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN109785824B (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN109785824A (en) A kind of training method and device of voiced translation model
KR102392094B1 (en) Sequence processing using convolutional neural networks
EP3477633A1 (en) Systems and methods for robust speech recognition using generative adversarial networks
CN109065032B (en) External corpus speech recognition method based on deep convolutional neural network
CN108153913B (en) Training method of reply information generation model, reply information generation method and device
CN107729324A (en) Interpretation method and equipment based on parallel processing
CN104598611B (en) The method and system being ranked up to search entry
CN106875940B (en) Machine self-learning construction knowledge graph training method based on neural network
CN107408111A (en) End-to-end speech recognition
CN108780464A (en) Method and system for handling input inquiry
CN108268441A (en) Sentence similarity computational methods and apparatus and system
CN106897265B (en) Word vector training method and device
CN109871542B (en) Text knowledge extraction method, device, equipment and storage medium
CN113641819B (en) Argumentation mining system and method based on multitasking sparse sharing learning
CN107871496A (en) Audio recognition method and device
CN108630198A (en) Method and apparatus for training acoustic model
Zhou et al. ICRC-HIT: A deep learning based comment sequence labeling system for answer selection challenge
CN111785303B (en) Model training method, imitation sound detection device, equipment and storage medium
CN114913938B (en) Small molecule generation method, equipment and medium based on pharmacophore model
CN108363685B (en) Self-media data text representation method based on recursive variation self-coding model
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN108229677B (en) Method and apparatus for performing recognition and training of a cyclic model using the cyclic model
CN109979461A (en) A kind of voice translation method and device
CN111653270A (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN109933773A (en) A kind of multiple semantic sentence analysis system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant