CN105206257A - Voice conversion method and device - Google Patents

Voice conversion method and device

Info

Publication number
CN105206257A
CN105206257A
Authority
CN
China
Prior art keywords
model
target speaker
speech data
speech synthesis
converted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510673278.XA
Other languages
Chinese (zh)
Other versions
CN105206257B (en)
Inventor
陈凌辉
江源
李栋梁
李啸
张卫庆
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201510673278.XA
Publication of CN105206257A
Application granted
Publication of CN105206257B
Current legal status: Active

Abstract

The invention discloses a voice conversion method and device. The method comprises: receiving speech data to be converted; performing speech recognition on the speech data to be converted to obtain a recognition result and duration information of the recognition result; obtaining a speech synthesis model of a target speaker; generating speech synthesis parameters using the speech synthesis model and the duration information; and performing speech synthesis on the recognition result using the speech synthesis parameters to obtain synthesized speech data in the target speaker's timbre. With the voice conversion method and device, the duration of the converted speech data is kept consistent with that of the speech data to be converted, which improves the naturalness of the synthesized speech.

Description

Voice conversion method and device
Technical field
The present invention relates to the field of speech processing, and in particular to a voice conversion method and device.
Background art
In daily life, a person's voice often serves as an identity card: upon hearing the voice of someone familiar, we can recognize who is speaking. Voice conversion technology, which converts the voice of one speaker into that of another so that the speech sounds as if uttered by the other person, therefore has a wide range of applications. For example, a user may convert his or her own voice into the voice of a favorite celebrity, or into the voice of someone the user knows.
Existing voice conversion methods generally perform speech recognition on the speech data to be converted to obtain a recognized text, and then perform speech synthesis on the recognized text using a synthesis model of the target speaker, thereby obtaining synthesized speech data in the target speaker's timbre. When speech synthesis is performed on the recognized text in this way, the duration of the synthesized speech data is often inconsistent with that of the speech data to be converted, so the synthesized speech sounds mechanical, its prosody is poor, and its naturalness is greatly reduced.
Summary of the invention
The present invention provides a voice conversion method and device, so that the duration of the converted speech data is consistent with the duration of the speech data to be converted, thereby improving the naturalness of the synthesized speech.
To this end, the present invention provides the following technical solution:
A voice conversion method, comprising:
receiving speech data to be converted;
performing speech recognition on the speech data to be converted to obtain a recognition result and duration information of the recognition result;
obtaining a speech synthesis model of a target speaker;
generating speech synthesis parameters using the speech synthesis model and the duration information;
performing speech synthesis on the recognition result using the speech synthesis parameters to obtain synthesized speech data in the target speaker's timbre.
Preferably, performing speech recognition on the speech data to be converted to obtain a recognition result and duration information of the recognition result comprises:
building a decoding network using a pre-trained acoustic model and language model;
extracting characteristic parameters of the speech data to be converted;
decoding the speech data to be converted based on the decoding network and the characteristic parameters to obtain a text sequence corresponding to the optimal decoding path and duration information of each word and/or character in the text sequence.
Preferably, performing speech recognition on the speech data to be converted to obtain a recognition result and duration information of the recognition result comprises:
building a decoding network using a pre-trained acoustic model and language model;
extracting characteristic parameters of the speech data to be converted;
decoding the speech data to be converted based on the decoding network and the characteristic parameters to obtain a syntactic unit sequence corresponding to the optimal decoding path and duration information of each syntactic unit in the syntactic unit sequence.
Preferably, obtaining the speech synthesis model of the target speaker comprises:
presenting selectable target speaker information to a user, determining the target speaker according to the user's selection, and then obtaining the speech synthesis model of the target speaker; or
receiving target speaker speech data provided by a user, and training a speech synthesis model of the target speaker using the target speaker speech data.
Preferably, the target speaker synthesis model comprises a duration synthesis model, a fundamental frequency synthesis model, and a spectrum synthesis model;
generating the speech synthesis parameters using the speech synthesis model and the duration information comprises:
generating duration synthesis parameters for each state of each syntactic unit using the duration information and the duration synthesis model;
generating fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model;
generating spectrum synthesis parameters using the target speaker's spectrum synthesis model.
A voice conversion device, comprising:
a receiving module, configured to receive speech data to be converted;
a speech recognition module, configured to perform speech recognition on the speech data to be converted and obtain a recognition result and duration information of the recognition result;
a model acquisition module, configured to obtain a speech synthesis model of a target speaker;
a synthesis parameter generation module, configured to generate speech synthesis parameters using the speech synthesis model and the duration information;
a speech synthesis module, configured to perform speech synthesis on the recognition result using the speech synthesis parameters and obtain synthesized speech data in the target speaker's timbre.
Preferably, the speech recognition module comprises:
a first decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the speech data to be converted;
a first decoding unit, configured to decode the speech data to be converted based on the decoding network and the characteristic parameters, and obtain a text sequence corresponding to the optimal decoding path and duration information of each word and/or character in the text sequence.
Preferably, the speech recognition module comprises:
a second decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the speech data to be converted;
a second decoding unit, configured to decode the speech data to be converted based on the decoding network and the characteristic parameters, and obtain a syntactic unit sequence corresponding to the optimal decoding path and duration information of each syntactic unit in the syntactic unit sequence.
Preferably, the model acquisition module comprises:
a presentation unit, configured to present selectable target speaker information to a user;
a target speaker determination unit, configured to determine the target speaker according to the user's selection;
a model acquisition unit, configured to obtain the speech synthesis model of the target speaker;
or the model acquisition module comprises:
a receiving unit, configured to receive target speaker speech data provided by a user;
a model training unit, configured to train a speech synthesis model of the target speaker using the target speaker speech data.
Preferably, the target speaker synthesis model comprises a duration synthesis model, a fundamental frequency synthesis model, and a spectrum synthesis model;
the synthesis parameter generation module comprises:
a duration synthesis parameter generation unit, configured to generate duration synthesis parameters for each state of each syntactic unit using the duration information and the duration synthesis model;
a fundamental frequency synthesis parameter generation unit, configured to generate fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model;
a spectrum synthesis parameter generation unit, configured to generate spectrum synthesis parameters using the target speaker's spectrum synthesis model.
With the voice conversion method and device provided by the embodiments of the present invention, speech data to be converted are first received; speech recognition is then performed on the speech data to be converted to obtain a recognition result and its duration information; finally, speech synthesis parameters are generated using the speech synthesis model of the target speaker and the duration information, and speech synthesis is performed on the recognition result using these parameters to obtain synthesized speech data in the target speaker's timbre. When performing speech recognition on the speech data to be converted, the method and device obtain not only the recognition result but also its duration information, and use this duration information to generate the speech synthesis parameters of the target speaker. This effectively ensures that the duration of the synthesized speech data is consistent with that of the speech data to be converted, improving the naturalness of the converted speech.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings required in the embodiments are briefly introduced below. Apparently, the drawings described below show merely some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them.
Fig. 1 is a flowchart of a voice conversion method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a specific application of the voice conversion method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a voice conversion device according to an embodiment of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described below in further detail with reference to the drawings and implementations.
In view of the problem in the prior art that, during voice conversion, the duration of the synthesized speech data is often inconsistent with that of the speech data to be converted, so that the converted speech has poor prosody and low naturalness, the embodiments of the present invention provide a voice conversion method and device: when speech recognition is performed on the speech data to be converted, the duration information corresponding to the recognition result is obtained, and this duration information is used to generate the speech synthesis parameters of the target speaker, so that the duration of the finally obtained synthesized speech data in the target speaker's timbre is consistent with that of the speech data to be converted, improving the naturalness of the converted speech.
Fig. 1 shows a flowchart of a voice conversion method according to an embodiment of the present invention, comprising the following steps:
Step 101: receive speech data to be converted.
Step 102: perform speech recognition on the speech data to be converted to obtain a recognition result and duration information of the recognition result.
The specific speech recognition process is the same as in the prior art: a decoding network is built using a pre-trained acoustic model and language model; characteristic parameters of the speech data are extracted, such as linear prediction cepstral coefficients (LPCC) and/or Mel-frequency cepstral coefficients (MFCC); the speech data are then decoded based on the decoding network and the characteristic parameters to obtain the recognized text corresponding to the optimal decoding path, i.e. a text sequence composed of words and/or characters. The difference is that, in the embodiment of the present invention, not only the recognition result but also the duration information corresponding to the recognition result is obtained, i.e. the duration of each word and/or character in the text sequence. The duration information can be derived from the durations of the speech segments corresponding to the words and characters, which is not described in detail here.
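For illustration only, the feature extraction and duration-aware decoding of step 102 might be sketched as follows. This is a minimal sketch, not part of the original disclosure: `librosa` is used for MFCC extraction, and `decode_with_alignment` is a hypothetical stand-in for decoding over the network built from the pre-trained acoustic and language models.

```python
import librosa  # audio loading and MFCC feature extraction

FRAME_SHIFT_S = 0.010  # 10 ms frame shift assumed for the acoustic features

def extract_features(wav_path):
    """Extract MFCC characteristic parameters from the speech to be converted."""
    audio, sr = librosa.load(wav_path, sr=16000)
    # 13 MFCCs per 10 ms frame; LPCC features could be extracted analogously
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                hop_length=int(sr * FRAME_SHIFT_S))
    return mfcc.T  # shape (num_frames, 13)

def recognize_with_durations(features, decoder):
    """Decode and keep the frame alignment of the optimal path.

    `decoder` is assumed to wrap the decoding network built from the
    pre-trained acoustic and language models; its hypothetical
    `decode_with_alignment` returns (word, start_frame, end_frame) triples.
    """
    best_path = decoder.decode_with_alignment(features)
    text_sequence = [word for word, _, _ in best_path]
    durations = [(word, (end - start) * FRAME_SHIFT_S)  # seconds per word
                 for word, start, end in best_path]
    return text_sequence, durations
```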
Step 103: obtain the speech synthesis model of the target speaker.
The target speaker's speech synthesis model mainly comprises the target speaker's duration synthesis model, fundamental frequency synthesis model, and spectrum synthesis model.
In practical applications, the speech synthesis model of the target speaker can be obtained in various ways.
For example, selectable target speaker information may be presented to the user, the target speaker determined according to the user's selection, and the speech synthesis model of the target speaker then obtained from a model bank. The speaker information may be a target speaker number, a target speaker name, and so on, which is not limited in this embodiment of the present invention. Of course, along with each target speaker, a brief description of the speaker's pronunciation characteristics may also be provided, e.g. speaker: Xiao Ming; pronunciation characteristics: simple, honest and vigorous voice, relatively slow speaking rate. The speech synthesis model of the target speaker can be obtained in advance by training on a large amount of the target speaker's speech data. Of course, the target speaker can also be determined in other ways, for example chosen at random by the system, which are not enumerated here.
As another example, the speech synthesis model of the target speaker can be obtained from target speaker speech data provided by the user. Specifically, target speaker speech data provided by the user are received, and a speech synthesis model of the target speaker is trained using the target speaker speech data; alternatively, model adaptation is performed using the target speaker speech data provided by the user. The specific training or adaptation process is the same as in the prior art and is not described in detail here.
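The two acquisition paths can be illustrated with the following sketch (assumed interfaces only; `ModelBank`, `SynthesisModel` and `train_synthesis_model` are hypothetical names introduced here, not components named by the disclosure):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class SynthesisModel:
    """Per-speaker model set: duration, fundamental frequency and spectrum models."""
    duration_model: object = None
    f0_model: object = None
    spectrum_model: object = None

@dataclass
class ModelBank:
    models: Dict[str, SynthesisModel] = field(default_factory=dict)

    def list_speakers(self) -> List[str]:
        # Presented to the user as the selectable target speaker information
        return sorted(self.models)

    def get(self, speaker_id: str) -> Optional[SynthesisModel]:
        return self.models.get(speaker_id)

def train_synthesis_model(speech_data) -> SynthesisModel:
    """Stub for the prior-art training/adaptation procedure."""
    return SynthesisModel()

def acquire_model(bank: ModelBank, selection: Optional[str] = None,
                  user_speech: Optional[list] = None) -> SynthesisModel:
    """Obtain the target speaker's synthesis model by selection or by training."""
    if selection is not None:                  # path 1: user picks from the bank
        model = bank.get(selection)
        if model is not None:
            return model
    # path 2: train (or adapt) a model from speech data provided by the user
    return train_synthesis_model(user_speech)
```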
Step 104: generate speech synthesis parameters using the speech synthesis model and the duration information.
The speech synthesis parameters comprise duration parameters, fundamental frequency parameters, and spectrum parameters, which are generated as follows:
The recognized text is parsed into a corresponding syntactic unit sequence by the text analyzer of the speech synthesizer, a syntactic unit being the smallest unit used in speech synthesis, such as a phoneme. Each syntactic unit comprises multiple states, e.g. five, and the duration of each state is assumed to follow a single Gaussian distribution:
$$P(d_n^i \mid p_n, i) = \mathcal{N}\bigl(d;\, \mu_n^i, (\sigma_n^i)^2\bigr) \qquad (1)$$
where $p_n$ is the $n$-th syntactic unit, $d_n^i$ is the duration of the $i$-th state of the $n$-th syntactic unit, and $\mu_n^i$ and $(\sigma_n^i)^2$ are the mean and variance of the duration synthesis model for the $i$-th state of the $n$-th syntactic unit.
To ensure that the duration of the synthesized speech data is consistent with that of the speech data to be converted, the embodiment of the present invention constrains the generated duration parameters, i.e. the duration synthesis parameters are generated within the duration range of the speech to be converted, for example by constraining the duration of each word or character. The specific constraint is shown in formula (2):
$$\sum_{n \in C_j} \sum_{i=1}^{S} d_n^i = D_j \qquad (2)$$
where $C_j$ is the set of syntactic units contained in the $j$-th word or character, $D_j$ is the duration of the $j$-th word or character, and $S$ is the number of states of each syntactic unit.
The duration parameter set of each state of each syntactic unit is estimated using the maximum likelihood criterion, as shown in formula (3):
$$\{d_n^{i*}\} = \arg\max_{\{d_n^i\}} \prod_{i=1}^{S} P(d_n^i \mid p_n, i) \qquad (3)$$
where $d_n^{i*}$ is the estimated duration parameter of the $i$-th state of the $n$-th syntactic unit.
Substituting formulas (1) and (2) into formula (3) and solving yields the duration parameter of each state of each syntactic unit, as shown in formula (4):
$$d_n^{i*} = \mu_n^i + \frac{(\sigma_n^i)^2 \bigl( D_j - \sum_{n \in C_j} \sum_{i=1}^{S} \mu_n^i \bigr)}{\sum_{n \in C_j} \sum_{i=1}^{S} (\sigma_n^i)^2} \qquad (4)$$
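As a numerical illustration of formulas (1)-(4), the closed-form constrained duration generation can be sketched as follows (a sketch assuming the per-state means and variances have already been read from the target speaker's duration synthesis model):

```python
import numpy as np

def constrained_state_durations(means, variances, total_duration):
    """Closed-form ML state durations under a total-duration constraint.

    Implements formula (4): each state keeps its model mean, and the
    mismatch between the constrained duration D_j and the sum of the
    means is distributed across states in proportion to their variances.

    means, variances: arrays of shape (num_units, num_states) for the
        syntactic units of one word/character (the set C_j).
    total_duration: D_j, the duration of the word/character in the
        speech to be converted (e.g. in frames).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    residual = total_duration - means.sum()
    return means + variances * residual / variances.sum()

# Example: one word of two phonemes, five states each, constrained to 70 frames
mu = np.full((2, 5), 6.0)   # model means sum to 60 frames
var = np.ones((2, 5))       # equal variances spread the residual evenly
d = constrained_state_durations(mu, var, 70.0)
print(d.sum())              # 70.0 — matches the duration to be converted
```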
The spectrum and fundamental frequency parameters are generated in the same way as in classical methods.
Step 105: perform speech synthesis on the recognition result using the speech synthesis parameters to obtain synthesized speech data in the target speaker's timbre.
With the voice conversion method provided by the embodiment of the present invention, speech data to be converted are first received; speech recognition is then performed on them to obtain a recognition result and its duration information; finally, speech synthesis parameters are generated using the speech synthesis model of the target speaker and the duration information, and speech synthesis is performed on the recognition result using these parameters to obtain synthesized speech data in the target speaker's timbre. When performing speech recognition on the speech data to be converted, the method obtains not only the recognition result but also its duration information, and uses this duration information to generate the speech synthesis parameters of the target speaker, which effectively ensures that the duration of the synthesized speech data is consistent with that of the speech data to be converted and improves the naturalness of the converted speech.
Note that performing speech synthesis directly on the recognized text easily carries errors made during speech recognition into speech synthesis, for example homophone errors, so that the semantics of the synthesized speech data differ from those of the speech data to be converted. For instance, if the speech to be converted is "办一张美国信用卡" ("apply for an American credit card") but it is recognized as "办一张没有信用卡" (the near-homophone 没有, "not have", replacing 美国, "America"), then synthesizing the recognized text with the target speaker's speech synthesis model yields synthesized speech whose semantics have changed, which is an undesirable result. Therefore, in practical applications, the syntactic unit sequence obtained from the acoustic model may be used as the recognition result, and the duration information of each syntactic unit in the sequence obtained at the same time. In this way, speech synthesis is performed directly on the syntactic unit sequence corresponding to the speech data to be converted, so that errors made during speech recognition are not carried into speech synthesis, and the semantics of the synthesized speech data remain consistent with those of the speech data to be converted.
The above voice conversion method is described in further detail below with reference to the flow shown in Fig. 2.
Fig. 2 shows a flowchart of a specific application of the voice conversion method according to an embodiment of the present invention, comprising the following steps:
Step 201: receive speech data to be converted.
Step 202: build a decoding network using a pre-trained acoustic model and language model.
Step 203: extract characteristic parameters of the speech data to be converted.
The characteristic parameters may be LPCC and/or MFCC features.
Step 204: decode the speech data based on the decoding network and the characteristic parameters to obtain a syntactic unit sequence corresponding to the optimal decoding path and duration information of each syntactic unit in the sequence.
A syntactic unit here is the smallest unit used in speech recognition, such as a phoneme.
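A unit-level variant of the earlier recognition sketch might look as follows; `decode_with_unit_alignment` is again a hypothetical decoder call, shown only to make the semantic-preservation idea concrete:

```python
def recognize_units_with_durations(features, decoder):
    """Unit-level variant: keep the acoustic model's unit alignment.

    The hypothetical `decode_with_unit_alignment` returns
    (unit, start_frame, end_frame) triples for the optimal path, so the
    recognized word-level text is bypassed and word recognition errors
    cannot change the units that are later synthesized.
    """
    path = decoder.decode_with_unit_alignment(features)
    units = [u for u, _, _ in path]
    unit_durations = [end - start for _, start, end in path]  # in frames
    return units, unit_durations
```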
Step 205: obtain the speech synthesis model of the target speaker.
Step 206: generate speech synthesis parameters using the speech synthesis model and the duration information.
The speech synthesis parameters comprise duration parameters, fundamental frequency parameters, and spectrum parameters, which are generated as follows:
1) Generating duration synthesis parameters using the duration information of the syntactic unit sequence and the target speaker's duration synthesis model
Each syntactic unit is represented by multiple states, e.g. five states; the duration model of each state is assumed to follow a single Gaussian distribution, as shown in formula (5):
$$P(d_n^i \mid p_n, i) = \mathcal{N}\bigl(d;\, \mu_n^i, (\sigma_n^i)^2\bigr) \qquad (5)$$
where $p_n$ is the $n$-th syntactic unit, $d_n^i$ is the duration of the $i$-th state of the $n$-th syntactic unit, and $\mu_n^i$ and $(\sigma_n^i)^2$ are the mean and variance of the duration synthesis model for the $i$-th state of the $n$-th syntactic unit.
To ensure that the duration of the synthesized speech data is consistent with that of the speech data to be converted, the embodiment of the present invention constrains the generated duration parameters, i.e. the duration synthesis parameters are generated within the duration range of the speech to be converted. The specific constraint is shown in formula (6):
$$\sum_{i=1}^{S} d_n^i = d_{p_n} \qquad (6)$$
where $d_{p_n}$ is the duration of the $n$-th syntactic unit in the speech to be converted, and $S$ is the total number of states of a syntactic unit.
Based on the duration constraint of each syntactic unit of the speech data to be converted and the target speaker's duration synthesis model, the duration synthesis parameters of each state of each syntactic unit are estimated using the maximum likelihood criterion, as shown in formula (7):
$$\{d_n^{i*}\} = \arg\max_{\{d_n^i\}} \prod_{i=1}^{S} P(d_n^i \mid p_n, i) \qquad (7)$$
where $d_n^{i*}$ is the estimated duration parameter of the $i$-th state of the $n$-th syntactic unit.
Substituting formulas (5) and (6) into formula (7) yields the duration of each state of each syntactic unit, as shown in formula (8):
$$d_n^{i*} = \mu_n^i + \frac{(\sigma_n^i)^2 \bigl( d_{p_n} - \sum_{i=1}^{S} \mu_n^i \bigr)}{\sum_{i=1}^{S} (\sigma_n^i)^2} \qquad (8)$$
2) Generating fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model
The fundamental frequency synthesis parameters are generated as follows:
First, the syntactic unit sequence obtained from recognition is expanded into a context-dependent syntactic unit sequence. For example, the syntactic unit sequence "xx-y-u-y-in-h-e-ch-eng-xx" may be expanded into the context-dependent sequence "xx-y+u:/A, y-u+y:/A, u-y+in:/A, y-in+h:/A, in-h+e:/A, h-e+ch:/A, e-ch+eng:/A, ch-eng+xx:/A", where the unit between "-" and "+" is the current syntactic unit, and ":/A" carries the context information of the current unit, such as tone information. Of course, the representation of the context-dependent syntactic unit sequence is not limited to the above.
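The expansion step can be sketched directly from the example above (a sketch; the ":/A" suffix is kept as an opaque context field, where a real system would fill in tone or other context information per unit):

```python
def expand_context(units, context="/A"):
    """Expand a syntactic unit sequence into context-dependent labels.

    For ["xx", "y", "u", ...] this produces labels of the form
    "left-current+right:/A", matching the example in the text.
    """
    labels = []
    for i in range(1, len(units) - 1):  # interior units only
        left, cur, right = units[i - 1], units[i], units[i + 1]
        labels.append(f"{left}-{cur}+{right}:{context}")
    return labels

units = "xx y u y in h e ch eng xx".split()
print(expand_context(units))
# ['xx-y+u:/A', 'y-u+y:/A', 'u-y+in:/A', 'y-in+h:/A',
#  'in-h+e:/A', 'h-e+ch:/A', 'e-ch+eng:/A', 'ch-eng+xx:/A']
```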
Then, the fundamental frequency synthesis model is used to predict the fundamental frequency model of each state of the current syntactic unit; the specific prediction method is the same as in the prior art and is not described in detail here.
Next, the states of each syntactic unit are duplicated according to the state duration information of the syntactic unit sequence, and the fundamental frequency distribution of the duplicated syntactic unit sequence, i.e. the fundamental frequency model predicted for the sequence, is obtained from the fundamental frequency models predicted for the states of each syntactic unit.
Finally, the fundamental frequency synthesis parameters are generated from the fundamental frequency distribution of the syntactic unit sequence, as shown in formula (9):
$$c = (W^{\mathsf T} U^{-1} W)^{-1} W^{\mathsf T} U^{-1} M \qquad (9)$$
where $W$ is the window-function matrix used to compute the dynamic parameters of the syntactic unit sequence, $c$ is the fundamental frequency synthesis parameter vector to be generated, and $M$ and $U$ are, respectively, the mean vector and covariance matrix of all the state fundamental frequency models predicted for the syntactic unit sequence.
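A minimal sketch of the generation in formula (9), assuming static-plus-delta fundamental frequency features, a window matrix W built from simple finite differences, and a diagonal covariance U (all simplifying assumptions not fixed by the disclosure):

```python
import numpy as np

def mlpg(means, variances):
    """Generate a smooth parameter track c = (W'U^-1 W)^-1 W'U^-1 M.

    means, variances: (T, 2) arrays of per-frame [static, delta] means
        and variances from the duplicated state sequence (M and U of
        formula (9), with U taken as diagonal for simplicity).
    Returns the T static parameters (e.g. log-F0 per frame).
    """
    T = means.shape[0]
    # Window matrix: identity rows for statics, central differences for deltas
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                   # static window
        W[2 * t + 1, max(t - 1, 0)] -= 0.5  # delta window (central difference)
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    M = means.reshape(-1)                   # interleaved static/delta means
    U_inv = 1.0 / variances.reshape(-1)     # diagonal precision
    A = W.T @ (U_inv[:, None] * W)
    b = W.T @ (U_inv * M)
    return np.linalg.solve(A, b)

# Example: 4 frames with rising static means and small deltas
means = np.array([[5.0, 0.0], [5.5, 0.2], [6.0, 0.2], [6.2, 0.0]])
variances = np.full((4, 2), 0.1)
print(mlpg(means, variances))  # smooth track close to the static means
```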
3) Generating spectrum synthesis parameters using the target speaker's spectrum synthesis model
The spectrum synthesis parameters are generated in a similar way to the fundamental frequency synthesis parameters described above, which is not repeated here.
Step 207: perform speech synthesis on the recognition result using the speech synthesis parameters to obtain synthesized speech data in the target speaker's timbre.
The specific implementation of speech synthesis is the same as in the prior art and is not repeated here.
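As one concrete possibility, not mandated by the disclosure, a WORLD-style parametric vocoder could render the generated fundamental frequency and spectrum parameters into a waveform; the sketch below assumes the `pyworld` package and a spectral-envelope representation of the spectrum parameters:

```python
import numpy as np
import pyworld  # WORLD vocoder bindings; one possible back end, not mandated here

def render_waveform(f0_track, spectral_envelope, fs=16000, frame_period=5.0):
    """Synthesize speech from generated F0 and spectrum parameters.

    f0_track: (T,) generated fundamental frequency per frame (Hz, 0 = unvoiced)
    spectral_envelope: (T, fft_size // 2 + 1) generated spectrum parameters
    """
    T, bins = spectral_envelope.shape
    # Nearly periodic excitation assumed where voiced; a real system would
    # also generate (or default) the aperiodicity from a model
    aperiodicity = np.full((T, bins), 0.001)
    return pyworld.synthesize(
        np.ascontiguousarray(f0_track, dtype=np.float64),
        np.ascontiguousarray(spectral_envelope, dtype=np.float64),
        np.ascontiguousarray(aperiodicity, dtype=np.float64),
        fs, frame_period)
```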
The voice conversion method provided by the embodiment of the present invention not only effectively ensures that the duration of the synthesized speech data is consistent with that of the speech data to be converted, improving the naturalness of the converted speech, but furthermore uses the syntactic unit sequence obtained from the acoustic model as the recognition result, so that speech synthesis is performed directly on the syntactic unit sequence corresponding to the speech data to be converted. Errors made during speech recognition are thus not carried into speech synthesis, and the semantics of the synthesized speech data remain consistent with those of the speech data to be converted.
Correspondingly, an embodiment of the present invention further provides a voice conversion device. Fig. 3 shows a schematic structural diagram of the voice conversion device according to the embodiment of the present invention.
In this embodiment, the device comprises:
a receiving module 301, configured to receive speech data to be converted;
a speech recognition module 302, configured to perform speech recognition on the speech data to be converted and obtain a recognition result and duration information of the recognition result;
a model acquisition module 303, configured to obtain a speech synthesis model of a target speaker;
a synthesis parameter generation module 304, configured to generate speech synthesis parameters using the speech synthesis model and the duration information;
a speech synthesis module 305, configured to perform speech synthesis on the recognition result using the speech synthesis parameters and obtain synthesized speech data in the target speaker's timbre.
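The module structure of Fig. 3 might map onto code roughly as follows (a structural sketch only; the injected components stand in for the prior-art recognizer, model bank, parameter generator and synthesizer described above):

```python
class VoiceConverter:
    """Pipeline mirroring modules 301-305 of Fig. 3."""

    def __init__(self, recognizer, model_bank, param_generator, synthesizer):
        self.recognizer = recognizer            # speech recognition module 302
        self.model_bank = model_bank            # model acquisition module 303
        self.param_generator = param_generator  # synthesis parameter module 304
        self.synthesizer = synthesizer          # speech synthesis module 305

    def convert(self, speech_data, target_speaker):
        # 301: receive the speech data to be converted (the argument itself)
        # 302: recognition result plus its duration information
        result, durations = self.recognizer.recognize(speech_data)
        # 303: target speaker's duration/F0/spectrum synthesis models
        model = self.model_bank.get(target_speaker)
        # 304: synthesis parameters constrained by the source durations
        params = self.param_generator.generate(model, result, durations)
        # 305: synthesized speech in the target speaker's timbre
        return self.synthesizer.synthesize(result, params)
```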
In practical applications, the speech recognition module 302 can perform speech recognition on the speech data to be recognized, and obtain the text sequence corresponding to the speech data and the duration information of each word and/or character in the text sequence. Correspondingly, one specific structure of the speech recognition module 302 comprises the following units:
a first decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the speech data to be converted;
a first decoding unit, configured to decode the speech data to be converted based on the decoding network and the characteristic parameters, and obtain a text sequence corresponding to the optimal decoding path and duration information of each word and/or character in the text sequence.
As discussed above, performing speech synthesis directly on the recognized text easily carries errors made during speech recognition, such as homophone errors, into speech synthesis, so that the semantics of the synthesized speech data differ from those of the speech data to be converted. Therefore, in practical applications, the speech recognition module 302 may also use the syntactic unit sequence obtained from the acoustic model as the recognition result and obtain the duration information of each syntactic unit in the sequence at the same time. In this way, speech synthesis is performed directly on the syntactic unit sequence corresponding to the speech data to be converted, so that recognition errors are not carried into synthesis, and the semantics of the synthesized speech data remain consistent with those of the speech data to be converted. Correspondingly, another specific structure of the speech recognition module 302 comprises the following units:
a second decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the speech data to be converted;
a second decoding unit, configured to decode the speech data to be converted based on the decoding network and the characteristic parameters, and obtain a syntactic unit sequence corresponding to the optimal decoding path and duration information of each syntactic unit in the sequence.
In addition, the model acquisition module 303 can be implemented in various ways.
For example, one specific structure of the model acquisition module 303 may comprise a presentation unit, a target speaker determination unit, and a model acquisition unit, where the presentation unit is configured to present selectable target speaker information to a user, the target speaker determination unit is configured to determine the target speaker according to the user's selection, and the model acquisition unit is configured to obtain the speech synthesis model of the target speaker.
As another example, another specific structure of the model acquisition module 303 may comprise a receiving unit and a model training unit, where the receiving unit is configured to receive target speaker speech data provided by the user, and the model training unit is configured to train a speech synthesis model of the target speaker using the target speaker speech data.
The target speaker synthesis model comprises a duration synthesis model, a fundamental frequency synthesis model, and a spectrum synthesis model.
Correspondingly, the synthesis parameter generation module 304 comprises:
a duration synthesis parameter generation unit, configured to generate duration synthesis parameters for each state of each syntactic unit using the duration information and the duration synthesis model;
a fundamental frequency synthesis parameter generation unit, configured to generate fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model;
a spectrum synthesis parameter generation unit, configured to generate spectrum synthesis parameters using the target speaker's spectrum synthesis model.
With the voice conversion device provided by the embodiment of the present invention, speech data to be converted are first received; speech recognition is then performed on them to obtain a recognition result and its duration information; finally, speech synthesis parameters are generated using the speech synthesis model of the target speaker and the duration information, and speech synthesis is performed on the recognition result using these parameters to obtain synthesized speech data in the target speaker's timbre. When performing speech recognition on the speech data to be converted, the device obtains not only the recognition result but also its duration information, and uses this duration information to generate the speech synthesis parameters of the target speaker, effectively ensuring that the duration of the synthesized speech data is consistent with that of the speech data to be converted and improving the naturalness of the converted speech. Further, the syntactic unit sequence obtained from the acoustic model can be used as the recognition result, so that speech synthesis is performed directly on the syntactic unit sequence corresponding to the speech data to be converted, errors made during speech recognition are not carried into speech synthesis, and the semantics of the synthesized speech data remain consistent with those of the speech data to be converted.
The embodiments in this specification are described in a progressive manner; for identical or similar parts the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the device embodiment is described relatively simply because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment. The device embodiment described above is merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to illustrate the present invention; the above description of the embodiments is only intended to help understand the method and device of the present invention. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementations and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A voice conversion method, characterized by comprising:
receiving speech data to be converted;
performing speech recognition on the speech data to be converted to obtain a recognition result and duration information of the recognition result;
obtaining a speech synthesis model of a target speaker;
generating speech synthesis parameters using the speech synthesis model and the duration information;
performing speech synthesis on the recognition result using the speech synthesis parameters to obtain synthesized speech data in the target speaker's timbre.
2. The method according to claim 1, characterized in that performing speech recognition on the speech data to be converted to obtain a recognition result and duration information of the recognition result comprises:
building a decoding network using a pre-trained acoustic model and language model;
extracting characteristic parameters of the speech data to be converted;
decoding the speech data to be converted based on the decoding network and the characteristic parameters to obtain a text sequence corresponding to the optimal decoding path and duration information of each word and/or character in the text sequence.
3. The method according to claim 1, characterized in that performing speech recognition on the speech data to be converted to obtain a recognition result and duration information of the recognition result comprises:
building a decoding network using a pre-trained acoustic model and language model;
extracting characteristic parameters of the speech data to be converted;
decoding the speech data to be converted based on the decoding network and the characteristic parameters to obtain a syntactic unit sequence corresponding to the optimal decoding path and duration information of each syntactic unit in the syntactic unit sequence.
4. The method according to claim 1, characterized in that obtaining the speech synthesis model of the target speaker comprises:
presenting selectable target speaker information to a user, determining the target speaker according to the user's selection, and then obtaining the speech synthesis model of the target speaker; or
receiving target speaker speech data provided by a user, and training a speech synthesis model of the target speaker using the target speaker speech data.
5. The method according to any one of claims 1 to 4, characterized in that the target speaker synthesis model comprises a duration synthesis model, a fundamental frequency synthesis model, and a spectrum synthesis model;
generating the speech synthesis parameters using the speech synthesis model and the duration information comprises:
generating duration synthesis parameters for each state of each syntactic unit using the duration information and the duration synthesis model;
generating fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model;
generating spectrum synthesis parameters using the target speaker's spectrum synthesis model.
6. A voice conversion device, characterized by comprising:
a receiving module, configured to receive speech data to be converted;
a speech recognition module, configured to perform speech recognition on the speech data to be converted and obtain a recognition result and duration information of the recognition result;
a model acquisition module, configured to obtain a speech synthesis model of a target speaker;
a synthesis parameter generation module, configured to generate speech synthesis parameters using the speech synthesis model and the duration information;
a speech synthesis module, configured to perform speech synthesis on the recognition result using the speech synthesis parameters and obtain synthesized speech data in the target speaker's timbre.
7. The device according to claim 6, characterized in that the speech recognition module comprises:
a first decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the speech data to be converted;
a first decoding unit, configured to decode the speech data to be converted based on the decoding network and the characteristic parameters, and obtain a text sequence corresponding to the optimal decoding path and duration information of each word and/or character in the text sequence.
8. The device according to claim 6, characterized in that the speech recognition module comprises:
a second decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the speech data to be converted;
a second decoding unit, configured to decode the speech data to be converted based on the decoding network and the characteristic parameters, and obtain a syntactic unit sequence corresponding to the optimal decoding path and duration information of each syntactic unit in the syntactic unit sequence.
9. The device according to claim 6, characterized in that
the model acquisition module comprises:
a presentation unit, configured to present selectable target speaker information to a user;
a target speaker determination unit, configured to determine the target speaker according to the user's selection;
a model acquisition unit, configured to obtain the speech synthesis model of the target speaker;
or the model acquisition module comprises:
a receiving unit, configured to receive target speaker speech data provided by a user;
a model training unit, configured to train a speech synthesis model of the target speaker using the target speaker speech data.
10. The device according to any one of claims 6 to 9, characterized in that the target speaker synthesis model comprises a duration synthesis model, a fundamental frequency synthesis model, and a spectrum synthesis model;
the synthesis parameter generation module comprises:
a duration synthesis parameter generation unit, configured to generate duration synthesis parameters for each state of each syntactic unit using the duration information and the duration synthesis model;
a fundamental frequency synthesis parameter generation unit, configured to generate fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model;
a spectrum synthesis parameter generation unit, configured to generate spectrum synthesis parameters using the target speaker's spectrum synthesis model.
CN201510673278.XA 2015-10-14 2015-10-14 Voice conversion method and device Active CN105206257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510673278.XA CN105206257B (en) 2015-10-14 2015-10-14 Voice conversion method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510673278.XA CN105206257B (en) 2015-10-14 2015-10-14 Voice conversion method and device

Publications (2)

Publication Number Publication Date
CN105206257A true CN105206257A (en) 2015-12-30
CN105206257B (en) 2019-01-18

Family

ID=54953887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510673278.XA Active CN105206257B (en) 2015-10-14 2015-10-14 A kind of sound converting method and device

Country Status (1)

Country Link
CN (1) CN105206257B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4692941A (en) * 1984-04-10 1987-09-08 First Byte Real-time text-to-speech conversion system
CN1534595A (en) * 2003-03-28 2004-10-06 中颖电子(上海)有限公司 Speech sound change over synthesis device and its method
US20070168189A1 (en) * 2006-01-19 2007-07-19 Kabushiki Kaisha Toshiba Apparatus and method of processing speech
CN101359473A (en) * 2007-07-30 2009-02-04 国际商业机器公司 Auto speech conversion method and apparatus
CN101894547A (en) * 2010-06-30 2010-11-24 北京捷通华声语音技术有限公司 Speech synthesis method and system
CN102306492A (en) * 2011-09-09 2012-01-04 中国人民解放军理工大学 Voice conversion method based on convolutive nonnegative matrix factorization
CN103035235A (en) * 2011-09-30 2013-04-10 西门子公司 Method and device for transforming voice into melody
CN103295574A (en) * 2012-03-02 2013-09-11 盛乐信息技术(上海)有限公司 Singing voice conversion device and method thereof
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN103065619A (en) * 2012-12-26 2013-04-24 安徽科大讯飞信息科技股份有限公司 Speech synthesis method and speech synthesis system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Li Bo et al., "A Survey of Voice Conversion and Related Technologies" (《语音转换及相关技术综述》), Journal on Communications (《通信学报》) *
Guo Weitong et al., "Prosody Conversion from Mandarin to Xi'an Dialect" (《普通话到西安话的韵律转换》), Computer Engineering and Applications (《计算机工程与应用》) *
Chen Linghui et al., "Speaker Conversion Method Based on Speaker-Independent Model" (《基于话者无关模型的说话人转换方法》), Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106302134A (en) * 2016-09-29 2017-01-04 努比亚技术有限公司 A kind of message playing device and method
CN106920547A (en) * 2017-02-21 2017-07-04 腾讯科技(上海)有限公司 Phonetics transfer method and device
CN107705802A (en) * 2017-09-11 2018-02-16 厦门美图之家科技有限公司 Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing
CN107818794A (en) * 2017-10-25 2018-03-20 北京奇虎科技有限公司 audio conversion method and device based on rhythm
CN107833572A (en) * 2017-11-06 2018-03-23 芋头科技(杭州)有限公司 The phoneme synthesizing method and system that a kind of analog subscriber is spoken
CN109147758A (en) * 2018-09-12 2019-01-04 科大讯飞股份有限公司 A kind of speaker's sound converting method and device
CN110164413A (en) * 2019-05-13 2019-08-23 北京百度网讯科技有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110164413B (en) * 2019-05-13 2021-06-04 北京百度网讯科技有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN110600045A (en) * 2019-08-14 2019-12-20 科大讯飞股份有限公司 Sound conversion method and related product
WO2022141126A1 (en) * 2020-12-29 2022-07-07 深圳市优必选科技股份有限公司 Personalized speech conversion training method, computer device, and storage medium
CN112786018A (en) * 2020-12-31 2021-05-11 科大讯飞股份有限公司 Speech conversion and related model training method, electronic equipment and storage device
CN113160794A (en) * 2021-04-30 2021-07-23 京东数字科技控股股份有限公司 Voice synthesis method and device based on timbre clone and related equipment
CN113160794B (en) * 2021-04-30 2022-12-27 京东科技控股股份有限公司 Voice synthesis method and device based on timbre clone and related equipment

Also Published As

Publication number Publication date
CN105206257B (en) 2019-01-18

Similar Documents

Publication Publication Date Title
CN105206257A (en) Voice conversion method and device
CN107195296B (en) Voice recognition method, device, terminal and system
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
CN105869624A (en) Method and apparatus for constructing speech decoding network in digital speech recognition
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
CN105593936B (en) System and method for text-to-speech performance evaluation
US20220013106A1 (en) Multi-speaker neural text-to-speech synthesis
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN115516552A (en) Speech recognition using synthesis of unexplained text and speech
KR102311922B1 (en) Apparatus and method for controlling outputting target information to voice using characteristic of user voice
US10255903B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN106057192A (en) Real-time voice conversion method and apparatus
CN104217713A (en) Tibetan-Chinese speech synthesis method and device
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN109767778A (en) A kind of phonetics transfer method merging Bi-LSTM and WaveNet
CN108922521A (en) A kind of voice keyword retrieval method, apparatus, equipment and storage medium
Yin et al. Modeling F0 trajectories in hierarchically structured deep neural networks
CN106653002A (en) Literal live broadcasting method and platform
KR102272554B1 (en) Method and system of text to multiple speech
Nidhyananthan et al. Language and text-independent speaker identification system using GMM
CN109300339A (en) A kind of exercising method and system of Oral English Practice
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
CN115101046A (en) Method and device for synthesizing voice of specific speaker
AU2015411306A1 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
Sultana et al. A survey on Bengali speech-to-text recognition techniques

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant