CN109979461A - Voice translation method and device - Google Patents
Voice translation method and device
- Publication number
- CN109979461A (application CN201910199082.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- translation
- target speech
- word
- translated text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
This application discloses a speech translation method and device. After the target speech to be translated is obtained, both the recognized text of the target speech and the target speech itself are translated jointly as translation objects to obtain the final translated text of the target speech. Compared with the prior art, in which the target speech is first recognized to obtain a recognized text and only that recognized text is translated, the translation objects in this application are richer: they include two translation objects, the recognized text and the target speech. By translating both objects, a more accurate translated text of the target speech can be determined.
Description
Technical field
This application relates to the technical field of speech translation, and in particular to a speech translation method and device.
Background technique
Existing speech translation methods generally include two steps: speech recognition and text translation. Specifically, a segment of speech is first converted by speech recognition technology into text of the same language, and then the recognized text is translated into text of another language by text translation technology, thereby completing the speech translation process.

However, combining speech recognition with text translation suffers from error accumulation. For example, suppose a word is misrecognized by the speech recognition system; when that word is then translated by the text translation system, a wrong translation will be produced from the wrong word. Thus, errors from the speech recognition stage accumulate into the text translation stage, making the translation result inaccurate.
Summary of the invention
The main purpose of the embodiments of this application is to provide a speech translation method and device capable of improving the accuracy of speech translation results.
An embodiment of this application provides a speech translation method, comprising:

obtaining the target speech to be translated;

translating a first translation object and a second translation object to obtain the final translated text of the target speech, where the first translation object is the recognized text of the target speech, and the second translation object is the target speech itself.
Optionally, translating the first translation object and the second translation object to obtain the final translated text of the target speech comprises:

generating a first probability distribution and a second probability distribution corresponding to the k-th word in the final translated text, where the first probability distribution includes the first decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the recognized text of the target speech, and the second probability distribution includes the second decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the target speech directly;

obtaining the translation result of the k-th word according to the first probability distribution and the second probability distribution.
Optionally, obtaining the translation result of the k-th word according to the first probability distribution and the second probability distribution comprises:

fusing, for each identical candidate word in the first probability distribution and the second probability distribution, its first decoding probability with its second decoding probability, to obtain the fused decoding probability of each candidate word being the k-th word in the final translated text;

selecting the candidate word with the largest fused decoding probability as the translation result of the k-th word.
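The fusion step above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the linear interpolation weight `alpha` and the toy three-word vocabulary are assumptions of this example; the patent only requires that the two decoding probabilities of each identical candidate word be fused and the argmax taken.

```python
def fuse_and_pick(p_text, p_speech, alpha=0.5):
    """p_text / p_speech map each candidate word to its decoding probability
    for the k-th position; returns (best_word, fused distribution)."""
    fused = {w: alpha * p_text.get(w, 0.0) + (1 - alpha) * p_speech.get(w, 0.0)
             for w in set(p_text) | set(p_speech)}
    best_word = max(fused, key=fused.get)  # candidate with largest fused probability
    return best_word, fused

p_from_text   = {"cat": 0.6, "hat": 0.3, "mat": 0.1}  # from decoding the recognized text
p_from_speech = {"cat": 0.2, "hat": 0.7, "mat": 0.1}  # from decoding the speech directly
word, fused = fuse_and_pick(p_from_text, p_from_speech)
```

Here the two sources disagree on the top word, and the fusion resolves the tie using both; with equal weights the fused distribution still sums to one.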
Optionally, translating the first translation object and the second translation object to obtain the final translated text of the target speech comprises:

translating the recognized text of the target speech to obtain a first translated text;

translating the target speech directly to obtain a second translated text;

obtaining the final translated text of the target speech according to the first translated text and the second translated text.
Optionally, obtaining the final translated text of the target speech according to the first translated text and the second translated text comprises:

determining the confidence of the first translated text as the final translated text of the target speech;

determining the confidence of the second translated text as the final translated text of the target speech;

selecting the translated text with the larger confidence as the final translated text of the target speech.
Optionally, determining the confidence of the first translated text as the final translated text of the target speech comprises:

obtaining the decoding probability corresponding to each text unit of the first translated text, where the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;

determining, according to the decoding probability corresponding to each text unit of the first translated text, the confidence of the first translated text as the final translated text of the target speech.
Optionally, determining the confidence of the second translated text as the final translated text of the target speech comprises:

obtaining the decoding probability corresponding to each text unit of the second translated text, where the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;

determining, according to the decoding probability corresponding to each text unit of the second translated text, the confidence of the second translated text as the final translated text of the target speech.
Optionally, translating the first translation object and the second translation object comprises:

recognizing the target speech using a pre-built speech recognition model to obtain the recognized text;

translating the recognized text using a pre-built text translation model;

translating the target speech using a pre-built speech translation model;

where the speech translation model and the speech recognition model share, or do not share, part of their model parameters.
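The three-model arrangement above can be sketched as a small pipeline. The model functions below are stubs for illustration only; the actual models are the neural networks described in the detailed embodiments, and the stubbed strings are placeholder values, not outputs specified by the patent.

```python
def recognize(speech):            # stands in for the pre-built speech recognition model
    return "bonjour le monde"     # recognized text (stubbed)

def translate_text(text):         # stands in for the pre-built text translation model
    return "hello the world"      # first translated text (stubbed)

def translate_speech(speech):     # stands in for the pre-built speech translation model
    return "hello world"          # second translated text (stubbed)

def speech_translation(speech):
    recognized = recognize(speech)        # step A1: speech -> recognized text
    first  = translate_text(recognized)   # step A2: recognized text -> first translation
    second = translate_speech(speech)     # step A3: speech -> second translation, directly
    return first, second                  # later fused or selected by confidence

first, second = speech_translation(b"\x00\x01")  # dummy audio bytes
```

The point of the structure is that both translation paths run from the same input, so a recognition error in step A1 cannot corrupt the direct path of step A3.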
An embodiment of this application also provides a speech translation device, comprising:

a target speech obtaining unit, configured to obtain the target speech to be translated;

a translated text obtaining unit, configured to translate a first translation object and a second translation object to obtain the final translated text of the target speech, where the first translation object is the recognized text of the target speech and the second translation object is the target speech itself.
Optionally, the translated text obtaining unit includes:

a probability distribution generating subunit, configured to generate a first probability distribution and a second probability distribution corresponding to the k-th word in the final translated text, where the first probability distribution includes the first decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the recognized text of the target speech, and the second probability distribution includes the second decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the target speech directly;

a translation result obtaining subunit, configured to obtain the translation result of the k-th word according to the first probability distribution and the second probability distribution.
Optionally, the translation result obtaining subunit includes:

a fused decoding probability obtaining subunit, configured to fuse, in the first probability distribution and the second probability distribution, the first decoding probability and the second decoding probability corresponding to each identical candidate word, to obtain the fused decoding probability of each candidate word being the k-th word in the final translated text;

a first translation result obtaining subunit, configured to select the candidate word with the largest fused decoding probability as the translation result of the k-th word.
Optionally, the translated text obtaining unit includes:

a first translated text obtaining subunit, configured to translate the recognized text of the target speech to obtain a first translated text;

a second translated text obtaining subunit, configured to translate the target speech directly to obtain a second translated text;

a final translated text obtaining subunit, configured to obtain the final translated text of the target speech according to the first translated text and the second translated text.
Optionally, the final translated text obtaining subunit includes:

a first confidence determining subunit, configured to determine the confidence of the first translated text as the final translated text of the target speech;

a second confidence determining subunit, configured to determine the confidence of the second translated text as the final translated text of the target speech;

a second translation result obtaining subunit, configured to select the translated text with the larger confidence as the final translated text of the target speech.
Optionally, the first confidence determining subunit includes:

a first decoding probability obtaining subunit, configured to obtain the decoding probability corresponding to each text unit of the first translated text, where the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;

a first confidence obtaining subunit, configured to determine, according to the decoding probability corresponding to each text unit of the first translated text, the confidence of the first translated text as the final translated text of the target speech.
Optionally, the second confidence determining subunit includes:

a second decoding probability obtaining subunit, configured to obtain the decoding probability corresponding to each text unit of the second translated text, where the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;

a second confidence obtaining subunit, configured to determine, according to the decoding probability corresponding to each text unit of the second translated text, the confidence of the second translated text as the final translated text of the target speech.
Optionally, the translated text obtaining unit includes:

a text recognition subunit, configured to recognize the target speech using a pre-built speech recognition model to obtain the recognized text;

a text translation subunit, configured to translate the recognized text using a pre-built text translation model;

a speech translation subunit, configured to translate the target speech using a pre-built speech translation model;

where the speech translation model and the speech recognition model share, or do not share, part of their model parameters.
An embodiment of this application also provides a speech translation device, comprising a processor, a memory, and a system bus. The processor and the memory are connected by the system bus. The memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to execute any one of the implementations of the above speech translation method.
An embodiment of this application also provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to execute any one of the implementations of the above speech translation method.
An embodiment of this application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute any one of the implementations of the above speech translation method.
With the speech translation method and device provided by the embodiments of this application, after the target speech to be translated is obtained, both the recognized text of the target speech and the target speech itself are translated jointly as translation objects to obtain the final translated text of the target speech. Compared with the prior art, in which the target speech is first recognized to obtain a recognized text and only that recognized text is translated, the translation objects in this application are richer, including two translation objects: the recognized text and the target speech. By translating both objects, a more accurate translated text of the target speech can be determined.
Detailed description of the invention
In order to illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, to embodiment or will show below
There is attached drawing needed in technical description to be briefly described, it should be apparent that, the accompanying drawings in the following description is the application
Some embodiments for those of ordinary skill in the art without creative efforts, can also basis
These attached drawings obtain other attached drawings.
Fig. 1 is a schematic flowchart of a speech translation method provided by an embodiment of this application;
Fig. 2 is a first schematic structural diagram of a speech translation model and a speech recognition model provided by an embodiment of this application;
Fig. 3 is a second schematic structural diagram of the speech translation model and the speech recognition model provided by an embodiment of this application;
Fig. 4 is a schematic structural diagram of the speech translation model, the speech recognition model, and a text translation model provided by an embodiment of this application;
Fig. 5 is a schematic flowchart of obtaining the translation result of the k-th word according to the first probability distribution and the second probability distribution, provided by an embodiment of this application;
Fig. 6 is a schematic flowchart of obtaining the final translated text of the target speech according to the first translated text and the second translated text, provided by an embodiment of this application;
Fig. 7 is a schematic composition diagram of a speech translation device provided by an embodiment of this application.
Specific embodiment
To keep the purposes, technical schemes and advantages of the embodiment of the present application clearer, below in conjunction with the embodiment of the present application
In attached drawing, the technical scheme in the embodiment of the application is clearly and completely described, it is clear that described embodiment is
Some embodiments of the present application, instead of all the embodiments.Based on the embodiment in the application, those of ordinary skill in the art
Every other embodiment obtained without making creative work, shall fall in the protection scope of this application.
First embodiment
Referring to Fig. 1, which is a schematic flowchart of the speech translation method provided in this embodiment, the method includes the following steps:

S101: Obtain the target speech to be translated.

In this embodiment, any speech to be translated by this embodiment is defined as the target speech. This embodiment does not limit the language of the target speech; for example, the target speech may be Chinese speech, English speech, etc. Likewise, this embodiment does not limit the length of the target speech; for example, the target speech may be one sentence or several sentences.

It should be understood that the target speech can be obtained as needed by means such as recording; for example, telephone calls or meeting recordings in daily life can serve as the target speech. After the target speech is obtained, its translation can be realized by this embodiment.
S102: Translate the first translation object and the second translation object to obtain the final translated text of the target speech, where the first translation object is the recognized text of the target speech and the second translation object is the target speech itself.

In this embodiment, to determine a more accurate translated text of the target speech, both the recognized text of the target speech and the target speech itself can be translated as translation objects; since the translation objects are richer, a more accurate translated text can be obtained as the final translated text of the target speech.

Specifically, after the target speech to be translated is obtained in step S101, speech recognition can be performed on it to obtain the corresponding recognized text, which serves as the first translation object; the first translation object (i.e., the recognized text) is then translated to obtain intermediate data of the translation process or the translated text of the first translation object. Similarly, after the target speech is obtained in step S101, the target speech itself can serve as the second translation object; the second translation object (i.e., the target speech) is then translated directly (without speech recognition) to obtain intermediate data of the translation process or the translated text of the second translation object.
In one implementation of this embodiment, step S102 may include: translating the first translation object and the second translation object to obtain intermediate data of the corresponding translation processes, and obtaining the final translated text of the target speech based on this intermediate data.

In this implementation, the intermediate data may be probability distribution data. Specifically, by translating the first translation object (i.e., the recognized text of the target speech) and the second translation object (i.e., the target speech) in step S102, a first probability distribution and a second probability distribution corresponding to the k-th word in the final translated text can be generated, where the first probability distribution includes the first decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the recognized text of the target speech, and the second probability distribution includes the second decoding probability of each candidate word in the vocabulary being the k-th word obtained by decoding the target speech directly. Note that a detailed introduction of the first and second probability distributions corresponding to the k-th word, as well as the specific implementation of obtaining the translation result of the k-th word on their basis, will be given in the second embodiment.
In another implementation of this embodiment, step S102 may include: translating the first translation object and the second translation object to obtain the corresponding translated texts, and obtaining the final translated text of the target speech based on these two translated texts.

In this implementation, translating the first translation object (i.e., the recognized text of the target speech) in step S102 yields a first translated text, while directly translating the second translation object (i.e., the target speech) yields a second translated text. Note that a detailed introduction of the first and second translated texts, as well as the specific implementation of obtaining the final translated text of the target speech on their basis, will be given in the third embodiment.
Further, step S102 can be realized with three models, specifically through the following steps A1-A3:
Step A1: Recognize the target speech using a pre-built speech recognition model to obtain the recognized text.

In this implementation, after the target speech to be translated is obtained in step S101, the pre-built speech recognition model shown on the right side of Fig. 2 can be used to recognize it and obtain the recognized text. The speech recognition model includes an encoder, an attention layer (Attention), and a recognition decoder; through this model, speech recognition can be performed on the target speech, for example, recognizing Chinese target speech as Chinese recognized text of the same language.
Specifically, in this embodiment, an optional implementation is that the pre-built speech recognition model adopts the network structure shown on the right side of Fig. 3. Next, taking this speech recognition model as an example, the process of recognizing the target speech with it is introduced:
(1) Input the audio features of the target speech

First, audio feature extraction is performed on the target speech to be translated; for example, the Mel filterbank features (Mel Bank Features) of the target speech can be extracted as its audio features. These audio features can be represented as a feature vector, defined here as x_{1...T}, where T denotes the dimension of the audio feature vector of the target speech, i.e., the number of elements it contains. Then, x_{1...T} is fed as input data into the speech recognition model shown on the right side of Fig. 3.
(2) Generate the coding vector corresponding to the audio features of the target speech

As shown in Fig. 3, the encoding part of the speech recognition model includes two layers of convolutional neural networks (Convolutional Neural Networks, CNN) with max pooling layers (MaxPooling), one layer of convolutional long short-term memory network (convolutional Long Short-Term Memory, convolutional LSTM), and three layers of bidirectional long short-term memory network (Bi-directional Long Short-Term Memory, BiLSTM).
After the audio features x_{1...T} of the target speech are input in step (1), they can be encoded by one CNN layer and then down-sampled by MaxPooling; this operation is repeated by another CNN layer and MaxPooling, yielding a coding vector of length L. One layer of convolutional LSTM and three layers of BiLSTM are then used to process this coding vector to obtain the final coding vector, defined as h_{1...L}, where L denotes the dimension of the coding vector obtained by encoding the audio features of the target speech, i.e., the number of elements it contains. The specific calculation formula for h_{1...L} is as follows:

h_{1...L} = enc(W_enc · x_{1...T})  (1)

where enc denotes the entire encoding calculation process of the model's encoding part, and W_enc denotes all the network parameters of each layer of the encoding part.
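The relation between T and L in the encoding part above can be illustrated with a small sketch. This assumes each of the two MaxPooling stages halves the time axis (a common default pool size); the patent does not state the pooling stride, so L = T // 4 here is an assumption of this example.

```python
def pooled_length(t_frames, num_pool_stages=2, pool_size=2):
    """Length of the encoded sequence after repeated max pooling,
    mirroring the two CNN + MaxPooling stages of the encoder."""
    length = t_frames
    for _ in range(num_pool_stages):
        length = length // pool_size  # each MaxPooling shrinks the time axis
    return length

L = pooled_length(400)  # e.g. 400 input feature frames -> 100 encoder steps
```

The down-sampling is what makes L shorter than T before the convolutional LSTM and BiLSTM layers run, reducing their sequence length and cost.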
(3) Generate the decoded vector corresponding to the coding vector

As shown in Fig. 3, the decoding part of the speech recognition model includes four layers of unidirectional long short-term memory network (Long Short-Term Memory, LSTM) and a softmax classifier.

After the audio features of the target speech are encoded into the coding vector by the model's encoding part in step (2), an attention operation can first be performed on the coding vector, so as to attend to the data in the coding vector that is relevant to generating the decoded vector. The result is then decoded by the four LSTM layers and the softmax classifier to obtain the corresponding decoded vector, which is used to generate the recognized text of the target speech, defined as z_{1...N}, where N denotes the number of characters (or words) contained in the recognized text.
The specific calculation formulas of the decoding part are as follows:

c_k = att(s_k, h_{1...L})  (2)
s_k = lstm(z_{k-1}, s_{k-1}, c_{k-1})  (3)
z_k = softmax(W_z[s_k, c_k] + b_z)  (4)

where h_{1...L} denotes the coding vector corresponding to the audio features of the target speech; c_k denotes the k-th attention calculation result; att denotes the attention calculation process; c_{k-1} denotes the (k-1)-th attention calculation result; s_k denotes the k-th hidden-layer vector output by the four-layer LSTM network of the decoding part; lstm denotes the calculation process of that four-layer LSTM network; s_{k-1} denotes the (k-1)-th hidden-layer vector output by the four-layer LSTM network of the decoding part; z_k denotes the k-th character (or word) contained in the recognized text; z_{k-1} denotes the (k-1)-th character (or word) contained in the recognized text; and W_z and b_z denote the model parameters in the softmax classifier.
If W_asr denotes all the network parameters of each layer of the model's decoding part, then the recognized text z_{1...N} of the target speech output by the model is calculated as follows:

z_{1...N} = dec(W_asr · h_{1...L})  (5)

where dec denotes the entire decoding calculation process of the model's decoding part; W_asr denotes all the network parameters of each layer of the decoding part; and h_{1...L} denotes the coding vector corresponding to the audio features of the target speech.
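The attention step of formula (2) can be sketched with toy vectors. The patent does not specify the attention variant, so plain dot-product scoring with a softmax is an assumption of this example; it shows how a context c_k is formed as a weighted sum of the encoder outputs h_1...h_L given a decoder state s_k.

```python
import math

def att(s_k, h):
    """s_k: decoder hidden vector; h: list of encoder output vectors.
    Returns the attention-weighted sum of the encoder vectors (c_k)."""
    scores = [sum(si * hi for si, hi in zip(s_k, h_l)) for h_l in h]  # dot products
    m = max(scores)
    exp = [math.exp(sc - m) for sc in scores]        # numerically stable softmax
    total = sum(exp)
    weights = [e / total for e in exp]
    dim = len(h[0])
    return [sum(w * h_l[d] for w, h_l in zip(weights, h)) for d in range(dim)]

# Decoder state aligned with the first encoder vector attends mostly to it.
c_k = att([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

Because the weights sum to one, c_k stays in the span of the encoder outputs, and the encoder position most similar to s_k dominates the mixture.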
It should be noted that the network structures of the encoder and decoder in the speech recognition model shown on the right side of Fig. 2 are not unique, and the network structure shown on the right side of Fig. 3 is only one example; other network structures or numbers of layers may also be adopted. For example, the model's encoder may instead use a recurrent neural network (Recurrent Neural Network, RNN) for encoding, and the number of network layers may be set according to the actual situation; the embodiments of this application do not limit this. The numbers of CNN, BiLSTM, and other layers mentioned above or below are only examples; this application does not limit them, and they may be the numbers mentioned in the embodiments or other numbers.
Step A2: Translate the recognized text of the target speech using a pre-built text translation model.

In this implementation, after the recognized text of the target speech is obtained in step A1, the pre-built text translation model shown at the top of Fig. 4 can be used to translate it, obtaining intermediate data of the translation process or the translated text corresponding to the recognized text.

The text translation model includes a text encoder, an attention layer (Attention), and a text decoder, with the text encoder connected to the recognition decoder of the speech recognition model, as shown in Fig. 4.
Next, the process of translating the recognized text with the text translation model is introduced:

(1) Input the recognized text of the target speech

As shown in Fig. 4, the recognized text z_{1...N} of the target speech obtained in step A1 (which may be in vector form) is first taken as input data and fed into the text encoder of the text translation model.
(2) Generate the coding vector corresponding to the recognized text of the target speech

In this embodiment, the text encoder of the text translation model may consist of a BiLSTM. After the recognized text z_{1...N} of the target speech is input in step (1), it can be encoded by the BiLSTM to obtain the corresponding coding vector, defined as s_{1...N}, calculated as follows:

s_{1...N} = enc(U_enc · z_{1...N})  (6)

where enc denotes the entire encoding calculation process of the text translation model's encoding part, and U_enc denotes all the network parameters of that encoding part.
(3) Decode to obtain the intermediate data or the first translated text corresponding to the recognized text

In this embodiment, the text decoder of the text translation model may include a unidirectional LSTM and a softmax classifier. After the coding vector s_{1...N} is generated in step (2), an attention operation can first be performed on it, so as to attend to the data in the coding vector that is relevant to generating the decoding result; the result is then decoded by the unidirectional LSTM and softmax classifier to obtain the intermediate data of the translation process or the first translated text corresponding to the recognized text.
It should be noted that the network composition of the encoder and decoder in the text translation model is not unique; the model network structure introduced above is only one example, and other network structures or numbers of layers may also be adopted. For example, the model's encoder may instead use an RNN for encoding, and the number of network layers may be set according to the actual situation; the embodiments of this application do not limit this.
Step A3: using the voiced translation model constructed in advance, target voice is translated.
In the present embodiment, which generates in translation process for directly being translated to target voice
Intermediate data or the corresponding cypher text of the target voice.The voiced translation model can be with the language introduced in above-mentioned steps A1
Sound identification model is shared or does not share department pattern parameter.
When the speech translation model shares part of its model parameters with the speech recognition model introduced in step A1, one optional implementation is that the network structure of the speech translation model can be as shown in the left part of Fig. 2: it shares one encoder with the speech recognition model, and the speech translation model then includes one translation decoder. It should be noted that in Fig. 2 the network structures of the recognition decoder of the speech recognition model and the translation decoder of the speech translation model may be identical or different; their specific composition can be set according to the actual situation, and this embodiment of the present application does not limit this.
In this embodiment, one optional implementation is that the speech translation model and the speech recognition model can use the network structure shown in Fig. 3. Based on this network structure, the detailed process of directly translating the target voice is as follows:
(1) Inputting the audio features of the target voice
Audio feature extraction is first performed on the target voice; for example, the Mel spectrum features of the target voice can be extracted as its audio features. This feature vector is defined as x_{1...T}; x_{1...T} is then used as input data and fed into the encoder shown in Fig. 3.
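To make the framing step concrete, here is a deliberately simplified stand-in for the feature extraction above, assuming 25 ms windows with a 10 ms hop at 16 kHz (common choices, not specified by this application): it computes one log-energy value per frame instead of a full Mel filter-bank output, which a real system would produce.

```python
import math

def frame_log_energy(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    # Split the waveform into overlapping frames and compute a per-frame
    # log energy -- a simplified stand-in for the Mel spectrum features
    # x_{1...T}; a real front end would apply a Mel filter bank to each
    # frame's FFT magnitudes.
    win = int(sample_rate * win_ms / 1000)   # 400 samples per window
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples per hop
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        energy = sum(x * x for x in frame)
        feats.append(math.log(energy + 1e-10))
    return feats

# 0.1 s of a toy 440 Hz tone sampled at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
features = frame_log_energy(tone)
```

With these parameters, 0.1 s of audio yields 8 frames, i.e. T = 8 feature vectors (here scalars) entering the encoder.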
(2) Generating the coding vector corresponding to the audio features of the target voice
After the audio features x_{1...T} input in step (1) above are encoded by the encoder shown in Fig. 3, the final coding vector h_{1...L} is obtained, where L denotes the dimension of the coding vector obtained after encoding the audio features of the target voice, i.e., the number of vector elements the coding vector contains. The specific calculation formula for h_{1...L} is the above formula (1), namely:
h_{1...L} = enc(W_enc · x_{1...T})
where enc denotes the entire encoding calculation process of the encoder in Fig. 3, and W_enc denotes all network parameters of each network layer in the encoder in Fig. 3.
(3) Decoding to obtain intermediate data or the second translated text corresponding to the target voice
As shown in Fig. 3, assume that the recognition decoder of the speech recognition model and the translation decoder of the speech translation model have the same network structure, each comprising 4 LSTM layers and a softmax classifier, but that the training parameters of the two are not shared. After the coding vector h_{1...L} is obtained in step (2) above by encoding the audio features of the target voice with the encoder part of the model, as shown in Fig. 3, an attention operation can first be performed on the coding vector, and the attention result is then decoded by the 4 LSTM layers and the softmax classifier in the translation decoder, yielding either intermediate data of the translation process or the second translated text corresponding to the target voice.
It should be noted that the way the speech translation model and the speech recognition model share encoder parameters, shown in Fig. 2, is not unique; it is merely one example, and other parameter-sharing schemes may also be adopted.
In addition, the speech translation model and the speech recognition model may also share no model parameters at all; in that case they are two separate models, whose network structures may be identical or different and whose specific composition can each be set according to the actual situation; this embodiment of the present application does not limit this.
It should be noted that this embodiment does not limit the execution order of A1-A2 (A2 is executed after A1) and A3: A3 may be executed after A1-A2, A1-A2 may be executed after A3, or A1-A2 and A3 may be executed simultaneously.
Further, since the "integration module" shown in Fig. 4 is connected respectively to the translation decoder of the speech translation model and the text decoder of the text translation model, the "integration module" shown in Fig. 4 can be used to determine the final translated text of the target voice according to the intermediate data of the translation process, or the respective translated texts, obtained by translating the first translation object and the second translation object.
Specifically, in one implementation of this embodiment, the first probability distribution corresponding to the k-th word in the final translated text output by the text decoder and the second probability distribution corresponding to the k-th word in the final translated text output by the translation decoder can be separately input into the "integration module"; the "integration module" fuses the first probability distribution and the second probability distribution and determines the translation result of the k-th word in the final translated text according to the fused probability distribution. For details, refer to the second embodiment.
In another implementation of this embodiment, the first translated text output by the text decoder and the second translated text output by the translation decoder can be separately input into the "integration module"; the "integration module" compares the two translated texts and determines a more accurate translated text of the target voice according to the comparison result. For details, refer to the third embodiment.
In summary, in the voice translation method provided by this embodiment, after the target voice to be translated is obtained, both the identification text of the target voice and the target voice itself are translated as translation objects to obtain the final translated text of the target voice. Compared with the prior-art approach of first recognizing the target voice to obtain an identification text and then translating only that identification text, the translation objects in the present application are richer, comprising two translation objects: the identification text and the target voice. Therefore, by translating these two translation objects, a more accurate translated text of the target voice can be determined.
Second embodiment
In this embodiment, by translating the first translation object (i.e., the identification text of the target voice) in step S102 of the first embodiment above, the first probability distribution corresponding to the k-th word in the final translated text of the target voice can be generated; this first probability distribution can be defined as P_text(y_k), where y_k refers to the k-th word in the final translated text of the target voice.
The first probability distribution P_text(y_k) may include the first decoding probabilities of the k-th word y_k, obtained after decoding the identification text of the target voice, being each candidate word in the vocabulary. The larger the value of a first decoding probability, the more likely it is that the k-th word y_k obtained by decoding the identification text of the target voice is the corresponding candidate word.
With the network structure shown in Fig. 4, the calculation formula of the first probability distribution P_text(y_k) corresponding to the k-th word y_k output by the text translation model after decoding the identification text of the target voice is as follows:
P_text(y_k) = softmax(dec(U_dec · s_{1...N}))    (7)
where dec denotes the entire decoding calculation process of the decoder part of the text translation model; U_dec denotes all network parameters of the decoder part of the text translation model; s_{1...N} denotes the coding vector corresponding to the identification text of the target voice; and P_text(y_k) denotes the first decoding probabilities of the k-th word y_k, obtained after decoding the identification text of the target voice, being each candidate word in the vocabulary.
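Formula (7) amounts to a linear projection of the decoder state onto vocabulary logits followed by a softmax. The sketch below illustrates one such decoding step in pure Python with a toy 3-word vocabulary and made-up weights (all names and values are hypothetical, not parameters of the actual model):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def decode_step(U_dec, state):
    # One decoding step: project the decoder state onto vocabulary
    # logits (dec(U_dec * s) in formula (7)), then normalize with
    # softmax to obtain the probability distribution P_text(y_k).
    logits = [sum(u * x for u, x in zip(row, state)) for row in U_dec]
    return softmax(logits)

vocab = ["system", "table", "box"]             # toy 3-word vocabulary
U_dec = [[2.0, 1.0], [0.5, 0.5], [0.1, 0.2]]   # toy projection weights
state = [1.0, 1.0]                             # toy decoder state
p_text = decode_step(U_dec, state)
best = vocab[p_text.index(max(p_text))]
```

The resulting list is a valid probability distribution over the vocabulary; the highest-probability candidate word here is "system".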
Similarly, in this embodiment, by translating the second translation object (i.e., the target voice) in step S102 of the first embodiment above, the second probability distribution corresponding to the k-th word in the final translated text can be generated; this second probability distribution can be defined as P_trans(y_k), where y_k refers to the k-th word in the final translated text of the target voice.
The second probability distribution P_trans(y_k) may include the second decoding probabilities of the k-th word y_k, obtained after decoding the target voice, being each candidate word in the vocabulary. The larger the value of a second decoding probability, the more likely it is that the k-th word y_k obtained by decoding the target voice is the corresponding candidate word.
With the network structure shown in Fig. 4, the calculation formula of the second probability distribution P_trans(y_k) corresponding to the k-th word y_k output by the speech translation model after decoding the target voice is as follows:
P_trans(y_k) = softmax(dec(W_dec · h_{1...L}))    (8)
where dec denotes the entire decoding calculation process of the decoder part of the speech translation model; W_dec denotes all network parameters of the decoder part of the speech translation model; h_{1...L} denotes the coding vector corresponding to the audio features of the target voice; and P_trans(y_k) denotes the second decoding probabilities of the k-th word y_k, obtained after decoding the target voice, being each candidate word in the vocabulary.
On this basis, the translation result of the k-th word can further be obtained according to the generated first probability distribution P_text(y_k) and second probability distribution P_trans(y_k) corresponding to the k-th word in the final translated text.
Next, this embodiment will introduce the specific implementation process of "obtaining the translation result of the k-th word according to the generated first probability distribution P_text(y_k) and second probability distribution P_trans(y_k) corresponding to the k-th word in the final translated text".
Referring to Fig. 5, it shows a schematic flowchart, provided by this embodiment, of obtaining the translation result of the k-th word according to the first probability distribution and the second probability distribution. The process includes the following steps:
S501: In the first probability distribution and the second probability distribution, fuse the first decoding probability and the second decoding probability corresponding to the same candidate word, to obtain the fused decoding probability of the k-th word in the final translated text being each candidate word.
In this embodiment, suppose that after the first translation object (i.e., the identification text of the target voice) is translated, the first probability distribution P_text(y_k) corresponding to the k-th word y_k in the final translated text has been generated; that is, P_text(y_k) includes the first decoding probabilities of the k-th word y_k, obtained after decoding the identification text of the target voice, being each candidate word in the vocabulary. Suppose also that after the second translation object (i.e., the target voice) is translated, the second probability distribution P_trans(y_k) corresponding to the k-th word y_k in the final translated text has been generated; that is, P_trans(y_k) includes the second decoding probabilities of the k-th word y_k, obtained after decoding the target voice, being each candidate word in the vocabulary.
Then, further, the "integration module" shown in Fig. 4 can be used to perform "decoding probability fusion" on the first probability distribution P_text(y_k) and the second probability distribution P_trans(y_k); that is, the decoding probabilities corresponding to the same candidate word in P_text(y_k) and P_trans(y_k) are fused, to obtain the fused decoding probability of the k-th word in the final translated text being each candidate word. These fused decoding probabilities form the fused probability distribution, defined as P_ensemble(y_k).
For example, assume that the candidate words are the 10000 words included in a certain English vocabulary, and that these 10000 words contain the word "system". Assume the first probability distribution P_text(y_k) includes a first decoding probability of 0.87 for the k-th word y_k, obtained after decoding the identification text of the target voice, being the word "system", and the second probability distribution P_trans(y_k) includes a second decoding probability of 0.76 for the k-th word y_k, obtained after decoding the target voice, being the word "system". Then the two decoding probability values 0.87 and 0.76 corresponding to the word "system" can be fused to obtain the fused decoding probability corresponding to the word "system", which characterizes how likely it is that the k-th word y_k in the final translated text is the word "system".
In this embodiment, the specific calculation formula of the fused probability distribution corresponding to the k-th word y_k in the final translated text is as follows:
P_ensemble(y_k) = α·P_trans(y_k) + (1-α)·P_text(y_k)    (9)
where α denotes the fusion weight of the decoding probabilities, which can be obtained by experiment or experience; P_ensemble(y_k) denotes the fused decoding probabilities of the k-th word y_k in the final translated text being each candidate word in the vocabulary; P_text(y_k) denotes the first decoding probabilities of the k-th word y_k, obtained after decoding the identification text of the target voice, being each candidate word in the vocabulary, i.e., the first probability distribution; and P_trans(y_k) denotes the second decoding probabilities of the k-th word y_k, obtained after decoding the target voice, being each candidate word in the vocabulary, i.e., the second probability distribution.
For example, based on the example above, assume that the first decoding probability of the k-th word y_k obtained after decoding the identification text of the target voice being the word "system" is 0.87, that the second decoding probability of the k-th word y_k obtained after decoding the target voice being the word "system" is 0.76, and that the value of α determined by experiment is 0.6. Then, by the above formula (9), the fused decoding probability of the k-th word y_k in the final translated text being the word "system" can be calculated as 0.804, that is, 0.6*0.76 + (1-0.6)*0.87 = 0.804. In this way, the fused decoding probabilities of the k-th word y_k in the final translated text being the other words can also be calculated, forming the fused probability distribution P_ensemble(y_k) corresponding to the k-th word y_k.
S502: Select the candidate word corresponding to the maximum fused decoding probability as the translation result of the k-th word.
In this embodiment, after the fused decoding probability of the k-th word y_k in the final translated text being each candidate word is obtained through step S501, the candidate word corresponding to the maximum fused decoding probability can be selected from them as the translation result of the k-th word.
For example, assume that the candidate words are the 10000 words included in a certain English vocabulary, such as "system", "table", "box", and so on. Then, for the k-th word in the final translated text, each of these 10000 words corresponds to one fused decoding probability, and the candidate word corresponding to the maximum fused decoding probability can be chosen; for example, the word "system" corresponding to a maximum fused decoding probability of 0.89 can be taken as the translation result of the k-th word.
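Steps S501-S502 can be sketched in a few lines of Python. The distributions below are toy 3-word stand-ins (their non-"system" values are invented for illustration), but the "system" entries and α = 0.6 match the worked example above, so the fused value reproduces 0.804:

```python
def fuse_and_select(p_text, p_trans, alpha=0.6):
    # Formula (9): P_ensemble = alpha * P_trans + (1 - alpha) * P_text,
    # applied per candidate word (S501); the candidate word with the
    # largest fused probability is the translation result for the k-th
    # position (S502).
    fused = {w: alpha * p_trans[w] + (1 - alpha) * p_text[w]
             for w in p_text}
    best = max(fused, key=fused.get)
    return best, fused

# Toy distributions over a 3-word vocabulary; the "system" entries
# are the values from the worked example in the text.
p_text = {"system": 0.87, "table": 0.08, "box": 0.05}
p_trans = {"system": 0.76, "table": 0.14, "box": 0.10}
word, fused = fuse_and_select(p_text, p_trans)
```

Because each input is a probability distribution, the fused values again sum to 1, and "system" is selected with fused probability 0.804.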
In summary, in this embodiment, by means of decoding probability fusion, the first probability distribution P_text(y_k) corresponding to the k-th word y_k obtained after decoding the identification text of the target voice and the second probability distribution P_trans(y_k) corresponding to the k-th word y_k obtained after decoding the target voice are fused, so that a more accurate fused decoding probability distribution corresponding to the k-th word y_k in the final translated text can be obtained, from which the candidate word corresponding to the maximum fused decoding probability is selected as the translation result of the k-th word. In this manner, the translation result of each word in the final translated text can be obtained in turn.
Third embodiment
In this embodiment, by translating the first translation object (i.e., the identification text of the target voice) in step S102 of the first embodiment above, the first translated text can be obtained.
The first translated text is a text in the target translation language, where K_1 denotes the number of characters (or words) included in the first translated text. For example, assume the target voice is Chinese speech and the target translation language is English, i.e., the target voice needs to be translated into an English text; then the first translated text is an English text, and K_1 denotes the number of words included in that English text.
One optional implementation is that, with the network structure shown in Fig. 4, the decoding vector obtained by the text decoder can be used to generate the first translated text of the target voice, where K_1 denotes the number of characters (or words) included in the first translated text.
If U_dec represents all network parameters of the text decoder of the text translation model in Fig. 4, then the first translated text output by the model (written here as y^(1)_{1...K_1}) is calculated as follows:
y^(1)_{1...K_1} = softmax(dec(U_dec · s_{1...N}))
where dec denotes the entire decoding calculation process of the text decoder of the text translation model in Fig. 4; U_dec denotes all network parameters of the text decoder of the text translation model in Fig. 4; and s_{1...N} denotes the coding vector corresponding to the identification text of the target voice.
In addition, in this embodiment, by directly translating the second translation object (i.e., the target voice) in step S102 of the first embodiment above, the second translated text can be obtained.
The second translated text is also a text in the target translation language, where K_2 denotes the number of characters (or words) included in the second translated text. For example, assume the target voice is Chinese speech and the target translation language is still English, i.e., the target voice still needs to be translated into an English text; then the second translated text is an English text, and K_2 denotes the number of words included in that English text.
One optional implementation is that, with the network structure shown in Fig. 4, the decoding vector obtained by the translation decoder can be used to generate the second translated text of the target voice, where K_2 denotes the number of characters (or words) included in the second translated text; for the specific calculation formulas of the decoding part, refer to the above formulas (2), (3), (4), which are not repeated here.
If W_dec represents all network parameters of the translation decoder of the speech translation model in Fig. 4, then the second translated text output by the model (written here as y^(2)_{1...K_2}) is calculated as follows:
y^(2)_{1...K_2} = softmax(dec(W_dec · h_{1...L}))
where dec denotes the entire decoding calculation process of the translation decoder of the speech translation model in Fig. 4; W_dec denotes all network parameters of the translation decoder of the speech translation model in Fig. 4; and h_{1...L} denotes the coding vector corresponding to the audio features of the target voice.
In addition, it should be noted that in this embodiment the numbers of characters (or words) included in the first translated text and the second translated text may be identical or different, i.e., either K_1 = K_2 or K_1 ≠ K_2 is possible, but the first translated text and the second translated text belong to the same language, e.g., both are Chinese texts or both are English texts.
On this basis, the final translated text of the target voice can further be obtained according to the generated first translated text and second translated text.
Next, this embodiment will introduce the specific implementation process of "obtaining the final translated text of the target voice according to the generated first translated text and second translated text".
Referring to Fig. 6, it shows a schematic flowchart, provided by this embodiment, of obtaining the final translated text of the target voice according to the first translated text and the second translated text. The process includes the following steps:
S601: Determine the confidence of the first translated text serving as the final translated text of the target voice.
In this embodiment, if the first translated text has been obtained by translating the first translation object (i.e., the identification text of the target voice), data processing can further be performed on the relevant data of the first translated text to determine the confidence of the first translated text serving as the final translated text of the target voice, and this confidence is defined as score_text.
In one implementation of this embodiment, S601 can specifically include the following steps B1-B2:
Step B1: Obtain the decoding probability corresponding to each text unit of the first translated text.
In this implementation, in order to determine the confidence score_text of the first translated text serving as the final translated text of the target voice, each text unit included in the first translated text first needs to be determined, where a text unit refers to a basic composition unit of the first translated text and differs with the language of the first translated text. For example, if the first translated text is a Chinese text, its text units can be characters and words; if the first translated text is an English text, its text units can be words, etc.
Then, the decoding probability, in its language, corresponding to each text unit included in the first translated text can be obtained, where a decoding probability characterizes how likely the corresponding text unit is to belong to the translation result; specifically, this decoding probability can be one decoding probability in the first probability distribution corresponding to the k-th word obtained after decoding the identification text of the target voice in the second embodiment. It can be understood that the larger the decoding probability, the more likely its corresponding text unit is to be the translation result of the k-th word; conversely, the smaller the decoding probability, the less likely its corresponding text unit is to be the translation result of the k-th word.
Step B2: Determine, according to the decoding probability corresponding to each text unit of the first translated text, the confidence of the first translated text serving as the final translated text of the target voice.
After the decoding probability corresponding to each text unit of the first translated text is obtained through step B1, each decoding probability can be further processed, so as to determine, according to the processing result, the confidence score_text of the first translated text serving as the final translated text of the target voice.
Specifically, one optional implementation is to first sum the decoding probabilities corresponding to the respective text units of the first translated text, and then divide the resulting sum by the total number K_1 of text units included in the first translated text, obtaining the average decoding probability of the text units, which represents the confidence score_text of the first translated text serving as the final translated text of the target voice.
For example, assume the first translated text is an English text comprising 6 words, and the decoding probabilities corresponding to the 1st through 6th words are 0.82, 0.78, 0.91, 0.85, 0.81 and 0.93 respectively. Summing these 6 decoding probabilities gives 5.1, that is, 0.82+0.78+0.91+0.85+0.81+0.93 = 5.1; dividing this sum 5.1 by the total number 6 of words included in the first translated text gives an average decoding probability of 0.85, that is, 5.1/6 = 0.85. This average decoding probability 0.85 can then be used to represent the confidence score_text of the above first translated text serving as the final translated text of the target voice, that is, score_text = 0.85.
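The average-probability confidence of steps B1-B2 is a one-line computation; the sketch below reproduces the worked example with the six per-word probabilities given in the text:

```python
def translation_confidence(decoding_probs):
    # Steps B1-B2: the confidence of a candidate translation is the
    # mean of the decoding probabilities of its text units.
    return sum(decoding_probs) / len(decoding_probs)

# The six per-word decoding probabilities from the worked example.
score_text = translation_confidence([0.82, 0.78, 0.91, 0.85, 0.81, 0.93])
```

For the example values this yields score_text = 0.85, matching the text; the same function applies unchanged to the second translated text in step C2 below.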
S602: Determine the confidence of the second translated text serving as the final translated text of the target voice.
In this embodiment, if the second translated text has been obtained by directly translating the second translation object (i.e., the target voice), data processing can further be performed on the relevant data of the second translated text to determine the confidence of the second translated text serving as the final translated text of the target voice, and this confidence is defined as score_trans.
In one implementation of this embodiment, S602 can specifically include the following steps C1-C2:
Step C1: Obtain the decoding probability corresponding to each text unit of the second translated text.
In this implementation, in order to determine the confidence score_trans of the second translated text serving as the final translated text of the target voice, each text unit included in the second translated text first needs to be determined, where a text unit refers to a basic composition unit of the second translated text and differs with the language of the second translated text. For example, if the second translated text is a Chinese text, its text units can be characters or words; if the second translated text is an English text, its text units can be words, etc.
Then, the decoding probability, in its language, corresponding to each text unit included in the second translated text can be obtained, where a decoding probability characterizes how likely the corresponding text unit is to belong to the translated text; specifically, this decoding probability can be one decoding probability in the second probability distribution corresponding to the k-th word obtained after decoding the target voice in the second embodiment. It can be understood that the larger the decoding probability, the more likely its corresponding text unit is to be the translation result of the k-th word; conversely, the smaller the decoding probability, the less likely its corresponding text unit is to be the translation result of the k-th word.
Step C2: Determine, according to the decoding probability corresponding to each text unit of the second translated text, the confidence of the second translated text serving as the final translated text of the target voice.
After the decoding probability corresponding to each text unit of the second translated text is obtained through step C1, each decoding probability can be further processed, so as to determine, according to the processing result, the confidence score_trans of the second translated text serving as the final translated text of the target voice.
Specifically, one optional implementation is to first sum the decoding probabilities corresponding to the respective text units of the second translated text, and then divide the resulting sum by the total number K_2 of text units included in the second translated text, obtaining the average decoding probability of the text units, which represents the confidence score_trans of the second translated text serving as the final translated text of the target voice.
For example, assume the second translated text is an English text comprising 8 words, and the decoding probabilities corresponding to the 1st through 8th words are 0.76, 0.78, 0.92, 0.72, 0.89, 0.91, 0.75 and 0.83 respectively. Summing these 8 decoding probabilities gives 6.56, that is, 0.76+0.78+0.92+0.72+0.89+0.91+0.75+0.83 = 6.56; dividing this sum 6.56 by the total number 8 of words included in the second translated text gives an average decoding probability of 0.82, that is, 6.56/8 = 0.82. This average decoding probability 0.82 can then be used to represent the confidence score_trans of the above second translated text serving as the final translated text of the target voice, that is, score_trans = 0.82.
S603: Select the translated text corresponding to the larger confidence as the final translated text of the target voice.
In this embodiment, after the confidence score_text of the first translated text serving as the final translated text of the target voice is determined through step S601, and the confidence score_trans of the second translated text serving as the final translated text of the target voice is determined through step S602, the translated text corresponding to the larger of score_text and score_trans can be selected as the final translated text of the target voice.
Specifically, if the value of score_text is greater than the value of score_trans, it indicates that each text unit in the first translated text is more likely to belong to the translation result, so the first translated text corresponding to score_text can be chosen as the final translated text of the target voice; conversely, if the value of score_trans is greater than the value of score_text, it indicates that each text unit in the second translated text is more likely to belong to the translation result, so the second translated text corresponding to score_trans can be chosen as the final translated text of the target voice.
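Step S603 then reduces to a comparison of the two confidence scores. A minimal sketch, using the score values from the two worked examples above (the candidate strings are placeholders for the actual translated texts):

```python
def select_final_translation(candidates):
    # Step S603: keep the candidate translation whose confidence
    # score is the largest.
    return max(candidates, key=lambda c: c[1])[0]

# Confidence scores from the two worked examples above
# (first translated text: 0.85, second translated text: 0.82).
final = select_final_translation([
    ("first translated text", 0.85),
    ("second translated text", 0.82),
])
```

With these scores the first translated text is kept, exactly as the comparison in S603 prescribes.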
In summary, this embodiment compares the confidences of the first translated text and the second translated text each serving as the final translated text of the target voice, so that the translated text whose text units are more likely to belong to the translation result can be selected according to the comparison result as the final translated text of the target voice; a more accurate translated text of the target voice can thereby be determined, improving the accuracy of the voice translation result.
Fourth embodiment
This embodiment will introduce a speech translation apparatus; for related content, refer to the above method embodiments.
Referring to Fig. 7, which is a schematic composition diagram of the speech translation apparatus provided by this embodiment, the apparatus includes:
a target voice acquiring unit 701, configured to obtain a target voice to be translated; and
a translated text obtaining unit 702, configured to translate a first translation object and a second translation object to obtain the final translated text of the target voice, where the first translation object is the identification text of the target voice and the second translation object is the target voice.
In one implementation of this embodiment, the translated text obtaining unit 702 includes:
a probability distribution generating subunit, configured to generate the first probability distribution and the second probability distribution corresponding to the k-th word in the final translated text;
where the first probability distribution includes the first decoding probabilities of the k-th word, obtained after decoding the identification text of the target voice, being each candidate word in the vocabulary, and the second probability distribution includes the second decoding probabilities of the k-th word, obtained after decoding the target voice, being each candidate word in the vocabulary; and
a translation result obtaining subunit, configured to obtain the translation result of the k-th word according to the first probability distribution and the second probability distribution.
In one implementation of this embodiment, the translation result obtaining subunit includes:
a fused decoding probability obtaining subunit, configured to fuse, in the first probability distribution and the second probability distribution, the first decoding probability and the second decoding probability corresponding to the same candidate word, to obtain the fused decoding probability that the k-th word in the final translated text is each candidate word;
a first translation result obtaining subunit, configured to select the candidate word corresponding to the largest fused decoding probability as the translation result of the k-th word.
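As an illustration only, the fusion step above can be sketched as follows. The equal interpolation weight, the helper name `fuse_and_pick`, and the toy vocabulary are assumptions made for this sketch; the embodiment does not fix a specific fusion formula.

```python
# Sketch: fuse the two decoding distributions for the k-th target word,
# then take the argmax as the translation result of that word.
# The 50/50 interpolation weight is an illustrative assumption.

def fuse_and_pick(p_text, p_speech, weight=0.5):
    """p_text / p_speech map each candidate word in the vocabulary to its
    first / second decoding probability for the k-th word."""
    fused = {w: weight * p_text[w] + (1 - weight) * p_speech[w] for w in p_text}
    # The candidate with the largest fused decoding probability becomes
    # the translation result of the k-th word.
    return max(fused, key=fused.get), fused

# First distribution: from decoding the recognized text of the target voice.
p_text = {"hello": 0.6, "hi": 0.3, "hey": 0.1}
# Second distribution: from decoding the target voice directly.
p_speech = {"hello": 0.4, "hi": 0.5, "hey": 0.1}
best, fused = fuse_and_pick(p_text, p_speech)
print(best)  # "hello": fused probabilities are 0.5, 0.4, 0.1
```

Note that even though the direct speech-translation branch alone would prefer "hi", the fusion lets the text-translation branch correct it.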
In one implementation of this embodiment, the translated text obtaining unit 702 includes:
a first translated text obtaining subunit, configured to translate the recognized text of the target voice to obtain a first translated text;
a second translated text obtaining subunit, configured to translate the target voice directly to obtain a second translated text;
a final translated text obtaining subunit, configured to obtain the final translated text of the target voice according to the first translated text and the second translated text.
In one implementation of this embodiment, the final translated text obtaining subunit includes:
a first confidence determining subunit, configured to determine the confidence of the first translated text serving as the final translated text of the target voice;
a second confidence determining subunit, configured to determine the confidence of the second translated text serving as the final translated text of the target voice;
a second translation result obtaining subunit, configured to select the translated text corresponding to the larger confidence as the final translated text of the target voice.
In one implementation of this embodiment, the first confidence determining subunit includes:
a first decoding probability obtaining subunit, configured to obtain the decoding probability corresponding to each text unit of the first translated text, wherein the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;
a first confidence obtaining subunit, configured to determine, according to the decoding probability corresponding to each text unit of the first translated text, the confidence of the first translated text serving as the final translated text of the target voice.
In one implementation of this embodiment, the second confidence determining subunit includes:
a second decoding probability obtaining subunit, configured to obtain the decoding probability corresponding to each text unit of the second translated text, wherein the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;
a second confidence obtaining subunit, configured to determine, according to the decoding probability corresponding to each text unit of the second translated text, the confidence of the second translated text serving as the final translated text of the target voice.
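One plausible way to turn the per-text-unit decoding probabilities into a sentence-level confidence is their geometric mean (a length-normalized product). The embodiment does not prescribe a formula, so the function name and the example probabilities below are hypothetical.

```python
import math

def sentence_confidence(token_probs):
    """Geometric mean of the per-text-unit decoding probabilities --
    one plausible confidence measure; the embodiment does not
    prescribe a specific formula."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

# Hypothetical per-unit decoding probabilities for the two candidates.
conf1 = sentence_confidence([0.9, 0.8, 0.95])   # first translated text
conf2 = sentence_confidence([0.7, 0.85, 0.6])   # second translated text

# The translated text with the larger confidence is selected as the
# final translated text of the target voice.
final = "first" if conf1 >= conf2 else "second"
```

The length normalization keeps longer candidates from being penalized merely for containing more text units.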
In one implementation of this embodiment, the translated text obtaining unit 702 includes:
a text recognition subunit, configured to recognize the target voice using a pre-built speech recognition model to obtain the recognized text;
a text translation subunit, configured to translate the recognized text using a pre-built text translation model;
a speech translation subunit, configured to translate the target voice using a pre-built speech translation model;
wherein the speech translation model and the speech recognition model either share or do not share some model parameters.
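The shared/unshared parameter arrangement can be illustrated structurally as follows. The class names are hypothetical, and the choice of the acoustic encoder as the shared component is an assumption; the embodiment only states that some model parameters may or may not be shared between the two models.

```python
class Encoder:
    """Stand-in for an acoustic encoder whose parameters may be shared."""
    def __init__(self, name):
        self.name = name

class SpeechRecognitionModel:
    def __init__(self, encoder):
        self.encoder = encoder  # speech -> hidden states -> source-language text

class SpeechTranslationModel:
    def __init__(self, encoder):
        self.encoder = encoder  # speech -> hidden states -> target-language text

# Shared configuration: both models reuse the same encoder parameters.
shared_encoder = Encoder("acoustic-encoder")
asr = SpeechRecognitionModel(shared_encoder)
st = SpeechTranslationModel(shared_encoder)
is_shared = asr.encoder is st.encoder  # True

# Unshared configuration: each model owns its own encoder.
asr2 = SpeechRecognitionModel(Encoder("asr-encoder"))
st2 = SpeechTranslationModel(Encoder("st-encoder"))
is_shared2 = asr2.encoder is st2.encoder  # False
```

Sharing the encoder lets recognition data regularize the translation model; keeping the encoders separate lets each be tuned to its own task.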
Further, an embodiment of the present application also provides a speech translation device, including: a processor, a memory, and a system bus;
the processor and the memory are connected through the system bus;
the memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to perform any implementation of the above speech translation method.
Further, an embodiment of the present application also provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to perform any implementation of the above speech translation method.
Further, an embodiment of the present application also provides a computer program product which, when run on a terminal device, causes the terminal device to perform any implementation of the above speech translation method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, or the part that contributes to the prior art, can essentially be embodied in the form of a software product. The computer software product can be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to perform the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may refer to each other. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
It should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (13)
1. A speech translation method, characterized by comprising:
obtaining a target voice to be translated;
translating a first translation object and a second translation object to obtain a final translated text of the target voice, wherein the first translation object is the recognized text of the target voice, and the second translation object is the target voice.
2. The method according to claim 1, wherein translating the first translation object and the second translation object to obtain the final translated text of the target voice comprises:
generating a first probability distribution and a second probability distribution corresponding to a k-th word in the final translated text;
wherein the first probability distribution comprises the first decoding probabilities that the k-th word, obtained after decoding the recognized text of the target voice, is each candidate word in a vocabulary, and the second probability distribution comprises the second decoding probabilities that the k-th word, obtained after decoding the target voice, is each candidate word in the vocabulary;
obtaining a translation result of the k-th word according to the first probability distribution and the second probability distribution.
3. The method according to claim 2, wherein obtaining the translation result of the k-th word according to the first probability distribution and the second probability distribution comprises:
fusing, in the first probability distribution and the second probability distribution, the first decoding probability and the second decoding probability corresponding to the same candidate word, to obtain the fused decoding probability that the k-th word in the final translated text is each candidate word;
selecting the candidate word corresponding to the largest fused decoding probability as the translation result of the k-th word.
4. The method according to claim 1, wherein translating the first translation object and the second translation object to obtain the final translated text of the target voice comprises:
translating the recognized text of the target voice to obtain a first translated text;
translating the target voice directly to obtain a second translated text;
obtaining the final translated text of the target voice according to the first translated text and the second translated text.
5. The method according to claim 4, wherein obtaining the final translated text of the target voice according to the first translated text and the second translated text comprises:
determining the confidence of the first translated text serving as the final translated text of the target voice;
determining the confidence of the second translated text serving as the final translated text of the target voice;
selecting the translated text corresponding to the larger confidence as the final translated text of the target voice.
6. The method according to claim 5, wherein determining the confidence of the first translated text serving as the final translated text of the target voice comprises:
obtaining the decoding probability corresponding to each text unit of the first translated text, wherein the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;
determining, according to the decoding probability corresponding to each text unit of the first translated text, the confidence of the first translated text serving as the final translated text of the target voice.
7. The method according to claim 5, wherein determining the confidence of the second translated text serving as the final translated text of the target voice comprises:
obtaining the decoding probability corresponding to each text unit of the second translated text, wherein the decoding probability characterizes how likely the corresponding text unit is to belong to the translation result;
determining, according to the decoding probability corresponding to each text unit of the second translated text, the confidence of the second translated text serving as the final translated text of the target voice.
8. The method according to any one of claims 1 to 7, wherein translating the first translation object and the second translation object comprises:
recognizing the target voice using a pre-built speech recognition model to obtain the recognized text;
translating the recognized text using a pre-built text translation model;
translating the target voice using a pre-built speech translation model;
wherein the speech translation model and the speech recognition model either share or do not share some model parameters.
9. A speech translation apparatus, characterized by comprising:
a target voice acquiring unit, configured to acquire a target voice to be translated;
a translated text obtaining unit, configured to translate a first translation object and a second translation object to obtain a final translated text of the target voice, wherein the first translation object is the recognized text of the target voice, and the second translation object is the target voice.
10. The apparatus according to claim 9, wherein the translated text obtaining unit comprises:
a probability distribution generating subunit, configured to generate a first probability distribution and a second probability distribution corresponding to a k-th word in the final translated text;
wherein the first probability distribution comprises the first decoding probabilities that the k-th word, obtained after decoding the recognized text of the target voice, is each candidate word in a vocabulary, and the second probability distribution comprises the second decoding probabilities that the k-th word, obtained after decoding the target voice, is each candidate word in the vocabulary;
a translation result obtaining subunit, configured to obtain a translation result of the k-th word according to the first probability distribution and the second probability distribution.
11. A speech translation device, characterized by comprising: a processor, a memory, and a system bus;
wherein the processor and the memory are connected through the system bus;
and the memory is configured to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 8.
12. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to perform the method according to any one of claims 1 to 8.
13. A computer program product, characterized in that, when the computer program product is run on a terminal device, the terminal device is caused to perform the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910199082.XA CN109979461B (en) | 2019-03-15 | 2019-03-15 | Voice translation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910199082.XA CN109979461B (en) | 2019-03-15 | 2019-03-15 | Voice translation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109979461A true CN109979461A (en) | 2019-07-05 |
CN109979461B CN109979461B (en) | 2022-02-25 |
Family
ID=67079130
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910199082.XA Active CN109979461B (en) | 2019-03-15 | 2019-03-15 | Voice translation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109979461B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050055217A1 (en) * | 2003-09-09 | 2005-03-10 | Advanced Telecommunications Research Institute International | System that translates by improving a plurality of candidate translations and selecting best translation |
US20090210214A1 (en) * | 2008-02-19 | 2009-08-20 | Jiang Qian | Universal Language Input |
JP2011090100A (en) * | 2009-10-21 | 2011-05-06 | National Institute Of Information & Communication Technology | Speech translation system, controller, speech recognition device, translation device, and speech synthesizer |
CN103678285A (en) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Machine translation method and machine translation system |
CN107170453A (en) * | 2017-05-18 | 2017-09-15 | 百度在线网络技术(北京)有限公司 | Across languages phonetic transcription methods, equipment and computer-readable recording medium based on artificial intelligence |
CN108986793A (en) * | 2018-09-28 | 2018-12-11 | 北京百度网讯科技有限公司 | translation processing method, device and equipment |
Non-Patent Citations (1)
Title |
---|
CHUANDONG XIE ET AL.: "《Web Data Selection Based on Word Embedding for Low-Resource Speech Recognition》", 《INTERSPEECH 2016》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112668346A (en) * | 2020-12-24 | 2021-04-16 | 科大讯飞股份有限公司 | Translation method, device, equipment and storage medium |
WO2022134164A1 (en) * | 2020-12-24 | 2022-06-30 | 科大讯飞股份有限公司 | Translation method, apparatus and device, and storage medium |
CN112668346B (en) * | 2020-12-24 | 2024-04-30 | 中国科学技术大学 | Translation method, device, equipment and storage medium |
CN112818704A (en) * | 2021-01-19 | 2021-05-18 | 传神语联网网络科技股份有限公司 | Multilingual translation system and method based on inter-thread consensus feedback |
CN112818704B (en) * | 2021-01-19 | 2024-04-02 | 传神语联网网络科技股份有限公司 | Multilingual translation system and method based on inter-thread consensus feedback |
WO2023082916A1 (en) * | 2021-11-10 | 2023-05-19 | 北京有竹居网络技术有限公司 | Training method, speech translation method, device and computer-readable medium |
Also Published As
Publication number | Publication date |
---|---|
CN109979461B (en) | 2022-02-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108062388B (en) | Reply generation method and device for man-machine conversation | |
CN108038107B (en) | Sentence emotion classification method, device and equipment based on convolutional neural network | |
CN109785824A (en) | A kind of training method and device of voiced translation model | |
CN111274362B (en) | Dialogue generation method based on transformer architecture | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
CN110516244B (en) | Automatic sentence filling method based on BERT | |
CN112214604A (en) | Training method of text classification model, text classification method, device and equipment | |
CN112000772B (en) | Sentence-to-semantic matching method based on semantic feature cube and oriented to intelligent question and answer | |
CN113051399B (en) | Small sample fine-grained entity classification method based on relational graph convolutional network | |
CN111460833A (en) | Text generation method, device and equipment | |
CN112069328B (en) | Method for establishing entity relation joint extraction model based on multi-label classification | |
CN109979461A (en) | A kind of voice translation method and device | |
CN115438215B (en) | Image-text bidirectional search and matching model training method, device, equipment and medium | |
CN113128206B (en) | Question generation method based on word importance weighting | |
CN112463989A (en) | Knowledge graph-based information acquisition method and system | |
CN108363685B (en) | Self-media data text representation method based on recursive variation self-coding model | |
CN112837669A (en) | Voice synthesis method and device and server | |
CN114281982B (en) | Book propaganda abstract generation method and system adopting multi-mode fusion technology | |
CN114528398A (en) | Emotion prediction method and system based on interactive double-graph convolutional network | |
CN114692605A (en) | Keyword generation method and device fusing syntactic structure information | |
CN110197521B (en) | Visual text embedding method based on semantic structure representation | |
CN112307179A (en) | Text matching method, device, equipment and storage medium | |
CN111797225A (en) | Text abstract generation method and device | |
CN114611529B (en) | Intention recognition method and device, electronic equipment and storage medium | |
CN115526149A (en) | Text summarization method for fusing double attention and generating confrontation network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||