CN105206257A - Voice conversion method and device - Google Patents
- Publication number
- CN105206257A CN105206257A CN201510673278.XA CN201510673278A CN105206257A CN 105206257 A CN105206257 A CN 105206257A CN 201510673278 A CN201510673278 A CN 201510673278A CN 105206257 A CN105206257 A CN 105206257A
- Authority
- CN
- China
- Prior art keywords
- model
- target speaker
- speech data
- phonetic synthesis
- converted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a voice conversion method and device. The method includes: receiving voice data to be converted; performing speech recognition on the voice data to be converted to obtain a recognition result and duration information of the recognition result; obtaining a speech synthesis model of a target speaker; generating speech synthesis parameters using the speech synthesis model and the duration information; and performing speech synthesis on the recognition result with the speech synthesis parameters to obtain synthesized voice data in the target speaker's timbre. With this method and device, the duration of the converted voice data is kept consistent with that of the voice data to be converted, which improves the naturalness of the synthesized speech.
Description
Technical field
The present invention relates to the field of speech processing technology, and in particular to a voice conversion method and device.
Background technology
In daily communication, a person's voice often serves as an identity card: upon hearing the voice of a familiar person, we can recognize who it is. Voice conversion technology, which converts one speaker's voice into that of another so that the speech sounds as if it were uttered by the other person, has a wide range of applications. For example, a user may convert his or her own voice into the voice of a favorite star, or into the voice of a familiar person.
Existing voice conversion methods generally perform speech recognition on the voice data to be converted to obtain recognized text, and then perform speech synthesis on the recognized text using a synthesis model of the target speaker, thereby obtaining synthesized voice data in the target speaker's timbre. When synthesizing from the recognized text in this way, the duration of the synthesized voice data easily becomes inconsistent with that of the voice data to be converted, which makes the synthesized speech sound mechanical and poorly rhythmed, greatly reducing its naturalness.
Summary of the invention
The present invention provides a voice conversion method and device that keep the duration of the converted voice data consistent with the duration of the voice data to be converted, thereby improving the naturalness of the synthesized speech.
To this end, the invention provides the following technical solution:
A voice conversion method, comprising:
receiving voice data to be converted;
performing speech recognition on the voice data to be converted to obtain a recognition result and duration information of the recognition result;
obtaining a speech synthesis model of a target speaker;
generating speech synthesis parameters using the speech synthesis model and the duration information;
performing speech synthesis on the recognition result using the speech synthesis parameters to obtain synthesized voice data in the target speaker's timbre.
Preferably, performing speech recognition on the voice data to be converted to obtain a recognition result and duration information of the recognition result comprises:
building a decoding network using a pre-trained acoustic model and language model;
extracting characteristic parameters of the voice data to be converted;
decoding the voice data to be converted based on the decoding network and the characteristic parameters to obtain a text sequence corresponding to the optimal decoding path and duration information of each word and/or character in the text sequence.
Preferably, performing speech recognition on the voice data to be converted to obtain a recognition result and duration information of the recognition result comprises:
building a decoding network using a pre-trained acoustic model and language model;
extracting characteristic parameters of the voice data to be converted;
decoding the voice data to be converted based on the decoding network and the characteristic parameters to obtain a syntactic unit sequence corresponding to the optimal decoding path and duration information of each syntactic unit in the syntactic unit sequence.
Preferably, obtaining the speech synthesis model of the target speaker comprises:
presenting selectable target speaker information to a user, determining the target speaker according to the user's selection, and then obtaining the speech synthesis model of the target speaker; or
receiving target speaker voice data provided by the user, and training a speech synthesis model of the target speaker using the target speaker voice data.
Preferably, the target speaker synthesis model comprises a duration synthesis model, a fundamental frequency synthesis model, and a spectrum synthesis model;
and generating the speech synthesis parameters using the speech synthesis model and the duration information comprises:
generating duration synthesis parameters for each state of each syntactic unit using the duration information and the duration synthesis model;
generating fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model;
generating spectrum synthesis parameters using the target speaker's spectrum synthesis model.
A voice conversion device, comprising:
a receiving module, configured to receive voice data to be converted;
a speech recognition module, configured to perform speech recognition on the voice data to be converted and obtain a recognition result and duration information of the recognition result;
a model acquisition module, configured to obtain a speech synthesis model of a target speaker;
a synthesis parameter generation module, configured to generate speech synthesis parameters using the speech synthesis model and the duration information;
a speech synthesis module, configured to perform speech synthesis on the recognition result using the speech synthesis parameters and obtain synthesized voice data in the target speaker's timbre.
Preferably, the speech recognition module comprises:
a first decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the voice data to be converted;
a first decoding unit, configured to decode the voice data to be converted based on the decoding network and the characteristic parameters, and obtain a text sequence corresponding to the optimal decoding path and duration information of each word and/or character in the text sequence.
Preferably, the speech recognition module comprises:
a second decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the voice data to be converted;
a second decoding unit, configured to decode the voice data to be converted based on the decoding network and the characteristic parameters, and obtain a syntactic unit sequence corresponding to the optimal decoding path and duration information of each syntactic unit in the syntactic unit sequence.
Preferably, the model acquisition module comprises:
a presentation unit, configured to present selectable target speaker information to a user;
a target speaker determining unit, configured to determine the target speaker according to the user's selection;
a model acquiring unit, configured to obtain the speech synthesis model of the target speaker;
or the model acquisition module comprises:
a receiving unit, configured to receive target speaker voice data provided by the user;
a model training unit, configured to train a speech synthesis model of the target speaker using the target speaker voice data.
Preferably, the target speaker synthesis model comprises a duration synthesis model, a fundamental frequency synthesis model, and a spectrum synthesis model;
and the synthesis parameter generation module comprises:
a duration synthesis parameter generation unit, configured to generate duration synthesis parameters for each state of each syntactic unit using the duration information and the duration synthesis model;
a fundamental frequency synthesis parameter generation unit, configured to generate fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model;
a spectrum synthesis parameter generation unit, configured to generate spectrum synthesis parameters using the target speaker's spectrum synthesis model.
The voice conversion method and device provided by the embodiments of the present invention first receive voice data to be converted, then perform speech recognition on the voice data to obtain a recognition result and its duration information, and finally generate speech synthesis parameters using the speech synthesis model of the target speaker together with the duration information, performing speech synthesis on the recognition result with these parameters to obtain synthesized voice data in the target speaker's timbre. When performing speech recognition on the voice data to be converted, the method and device obtain not only the recognition result but also its duration information, and use this duration information to generate the speech synthesis parameters of the target speaker. This effectively ensures that the duration of the synthesized voice data is consistent with that of the voice data to be converted, improving the naturalness of the converted speech.
Brief description of the drawings
To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flowchart of a voice conversion method according to an embodiment of the present invention;
Fig. 2 is a specific application flowchart of a voice conversion method according to an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of a voice conversion device according to an embodiment of the present invention.
Embodiment
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings.
When performing voice conversion, the prior art easily produces synthesized voice data whose duration is inconsistent with that of the voice data to be converted, so that the converted speech has poor rhythm and low naturalness. To address this problem, the embodiments of the present invention provide a voice conversion method and device that, when performing speech recognition on the voice data to be converted, obtain the duration information corresponding to the recognition result and use it to generate the speech synthesis parameters of the target speaker, so that the finally obtained synthesized voice data in the target speaker's timbre is consistent in duration with the voice data to be converted, improving the naturalness of the converted speech.
As shown in Fig. 1, a flowchart of the voice conversion method according to an embodiment of the present invention comprises the following steps:
Step 101: receive voice data to be converted.
Step 102: perform speech recognition on the voice data to be converted to obtain a recognition result and duration information of the recognition result.
The speech recognition process itself is the same as in the prior art: a decoding network is built from a pre-trained acoustic model and language model; characteristic parameters of the voice data are extracted, e.g. linear prediction cepstral coefficients (LPCC) and/or Mel-frequency cepstral coefficients (MFCC); the voice data are then decoded based on the decoding network and the characteristic parameters, yielding the recognized text corresponding to the optimal decoding path, i.e. a text sequence composed of words and/or characters. The difference is that, in the embodiment of the present invention, not only the recognition result but also its duration information is obtained, that is, the duration of each word and/or character in the text sequence. This duration information can be derived from the durations of the speech segments corresponding to the words and characters, which is not detailed here.
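As a rough illustration of how the duration information in step 102 could be derived from the decoder's best path, the following sketch (hypothetical data and a hypothetical 10 ms frame shift, not the patent's implementation) collapses a frame-level label alignment into per-unit durations:

```python
FRAME_SHIFT_MS = 10  # assumed frame shift; real systems often use 10 ms

def durations_from_alignment(frame_labels):
    """Collapse a per-frame label sequence into (unit, duration_ms) pairs."""
    runs = []
    for label in frame_labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([label, 1])   # start a new run
    return [(label, count * FRAME_SHIFT_MS) for label, count in runs]

if __name__ == "__main__":
    # hypothetical best-path labels for a short utterance
    frames = ["n"] * 8 + ["i"] * 12 + ["h"] * 6 + ["ao"] * 15
    print(durations_from_alignment(frames))
    # [('n', 80), ('i', 120), ('h', 60), ('ao', 150)]
```

Word- or character-level durations would then be obtained by summing the durations of the units that make up each word or character.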
Step 103: obtain the speech synthesis model of the target speaker.
The target speaker speech synthesis model mainly comprises the target speaker's duration synthesis model, fundamental frequency synthesis model, and spectrum synthesis model.
In practical applications, the speech synthesis model of the target speaker can be obtained in various ways.
For example, selectable target speaker information may be presented to the user, the target speaker determined according to the user's selection, and the corresponding speech synthesis model then retrieved from a model library. The speaker information may be a target speaker number, a target speaker name, and so on, which is not limited in this embodiment of the present invention. Alongside each target speaker, a brief description of his or her pronunciation characteristics may also be provided, e.g. speaker: Xiao Ming; characteristics: simple and vigorous voice, relatively slow speaking rate. The speech synthesis model of each target speaker can be obtained in advance by training on a large amount of that speaker's voice data. Of course, the target speaker may also be determined in other ways, such as being chosen at random by the system, which are not enumerated here.
As another example, the speech synthesis model of the target speaker may be obtained from target speaker voice data provided by the user: the voice data are received and then used either to train a speech synthesis model of the target speaker or to adapt an existing model. The specific training or adaptation process is the same as in the prior art and is not detailed here.
Step 104: generate speech synthesis parameters using the speech synthesis model and the duration information.
The speech synthesis parameters comprise duration parameters, fundamental frequency parameters, and spectrum parameters, which are generated as follows.
The recognized text is first parsed by the text analyzer of the speech synthesizer into a corresponding syntactic unit sequence, where a syntactic unit is the smallest unit used in speech synthesis, such as a phoneme. Each syntactic unit comprises multiple states (e.g. 5), and the duration of each state is assumed to follow a single Gaussian distribution:

P(d_{n,i}) = N(d_{n,i}; μ_{n,i}, σ²_{n,i})   (1)

where p_n is the n-th syntactic unit, d_{n,i} is the duration of the i-th state of the n-th syntactic unit, and μ_{n,i} and σ²_{n,i} are the mean and variance of the duration synthesis model for the i-th state of the n-th syntactic unit.
To ensure that the synthesized voice data and the voice data to be converted have the same duration, the embodiment of the present invention constrains the generated duration parameters, i.e. generates the duration synthesis parameters within the duration of the voice to be converted, for example by constraining the duration of each word or character as shown in formula (2):

Σ_{p_n ∈ C_j} Σ_{i=1}^{S} d_{n,i} = D_j   (2)

where C_j is the set of syntactic units contained in the j-th word or character, D_j is the duration of the j-th word or character, and S is the number of states in each syntactic unit.
The duration parameter set of each state of each syntactic unit, {d̂_{n,i}}, is estimated using the maximum-likelihood criterion, as shown in formula (3):

{d̂_{n,i}} = argmax Π_n Π_{i=1}^{S} P(d_{n,i})   subject to formula (2)   (3)

where d̂_{n,i} is the estimated duration of the i-th state of the n-th syntactic unit.
Substituting formulas (1) and (2) into formula (3) and solving yields the duration parameters of each state of each syntactic unit, as shown in formula (4):

d̂_{n,i} = μ_{n,i} + σ²_{n,i} · (D_j − Σ_{p_k ∈ C_j} Σ_{l=1}^{S} μ_{k,l}) / (Σ_{p_k ∈ C_j} Σ_{l=1}^{S} σ²_{k,l})   (4)

for each syntactic unit p_n ∈ C_j.
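Under the single-Gaussian assumption, the constrained maximum-likelihood solution of formulas (1)–(3) has a simple closed form: each state receives its model mean plus a share of the duration mismatch proportional to its variance. The sketch below illustrates this for the states pooled over one word (function name and data are illustrative, not from the patent):

```python
def constrained_state_durations(means, variances, total_duration):
    """Maximum-likelihood state durations under sum(d) == total_duration:
        d_i = mu_i + rho * sigma_i^2,
        rho = (D - sum(mu)) / sum(sigma^2)
    so the total matches D while staying as close as possible, in the
    Gaussian log-likelihood sense, to the model means."""
    rho = (total_duration - sum(means)) / sum(variances)
    return [m + rho * v for m, v in zip(means, variances)]

if __name__ == "__main__":
    # model means predict 60 frames in total, but the word to convert lasts 72
    d = constrained_state_durations([10.0, 20.0, 30.0], [1.0, 2.0, 3.0], 72.0)
    print(d)        # [12.0, 24.0, 36.0]
    print(sum(d))   # 72.0
```

Note how the extra 12 frames are distributed in proportion to each state's variance, so states whose duration the model is least certain about absorb most of the stretch.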
The spectrum and fundamental frequency parameters are generated in the same way as in classic methods.
Step 105: perform speech synthesis on the recognition result using the speech synthesis parameters to obtain synthesized voice data in the target speaker's timbre.
The voice conversion method provided by this embodiment of the present invention first receives voice data to be converted, then performs speech recognition on the voice data to obtain a recognition result and its duration information, and finally generates speech synthesis parameters using the speech synthesis model of the target speaker together with the duration information, performing speech synthesis on the recognition result with these parameters to obtain synthesized voice data in the target speaker's timbre. When performing speech recognition on the voice data to be converted, the method obtains not only the recognition result but also its duration information, and uses this duration information to generate the speech synthesis parameters of the target speaker, effectively ensuring that the duration of the synthesized voice data is consistent with that of the voice data to be converted and improving the naturalness of the converted speech.
Directly synthesizing speech from the recognized text easily carries errors made during speech recognition into the synthesis, such as homophone errors, so that the semantics of the synthesized voice data differ from those of the voice data to be converted. For example, if the voice data to be converted says "apply for an American credit card" but is recognized as the near-homophone "apply for no credit card", the speech synthesized from the recognized text by the target speaker synthesis model changes the meaning, which is an undesirable result. Therefore, in practical applications, the syntactic unit sequence obtained from the acoustic model may instead be used as the recognition result, together with the duration information of each syntactic unit in the sequence. In this way, speech synthesis is performed directly on the syntactic unit sequence corresponding to the voice data to be converted, so that recognition errors are not carried into the synthesis and the semantics of the synthesized voice data remain consistent with those of the voice data to be converted.
The voice conversion method above is described in further detail below with reference to the flow shown in Fig. 2.
As shown in Fig. 2, a specific application flowchart of the voice conversion method according to an embodiment of the present invention comprises the following steps:
Step 201: receive voice data to be converted.
Step 202: build a decoding network using a pre-trained acoustic model and language model.
Step 203: extract characteristic parameters of the voice data to be converted.
The characteristic parameters may be LPCC and/or MFCC.
Step 204: decode the voice data based on the decoding network and the characteristic parameters to obtain the syntactic unit sequence corresponding to the optimal decoding path and the duration information of each syntactic unit in the sequence.
Here a syntactic unit is the smallest unit used in speech recognition, such as a phoneme.
Step 205: obtain the speech synthesis model of the target speaker.
Step 206: generate speech synthesis parameters using the speech synthesis model and the duration information.
The speech synthesis parameters comprise duration parameters, fundamental frequency parameters, and spectrum parameters, which are generated as follows.
1) Generate duration synthesis parameters using the duration information of the syntactic unit sequence and the target speaker's duration synthesis model.
Each syntactic unit is represented by multiple states, e.g. 5 states; the duration of each state is assumed to follow a single Gaussian distribution, as shown in formula (5):

P(d_{n,i}) = N(d_{n,i}; μ_{n,i}, σ²_{n,i})   (5)

where p_n is the n-th syntactic unit, d_{n,i} is the duration of the i-th state of the n-th syntactic unit, and μ_{n,i} and σ²_{n,i} are the mean and variance of the duration synthesis model for that state.
To ensure that the synthesized voice data and the voice data to be converted have the same duration, the embodiment of the present invention constrains the generated duration parameters, i.e. generates the duration synthesis parameters within the duration of the voice to be converted, as shown in formula (6):

Σ_{i=1}^{S} d_{n,i} = d_{p_n}   (6)

where d_{p_n} is the duration of the n-th syntactic unit in the voice to be converted and S is the total number of states per syntactic unit.
According to the duration constraint corresponding to each syntactic unit of the voice data to be converted and the target speaker's duration synthesis model, the duration synthesis parameters of each state of each syntactic unit, {d̂_{n,i}}, are estimated using the maximum-likelihood criterion, as shown in formula (7):

{d̂_{n,i}} = argmax Π_{i=1}^{S} P(d_{n,i})   subject to formula (6)   (7)

where d̂_{n,i} is the estimated duration of the i-th state of the n-th syntactic unit.
Substituting formulas (5) and (6) into formula (7) yields the duration of each state of each syntactic unit, as shown in formula (8):

d̂_{n,i} = μ_{n,i} + σ²_{n,i} · (d_{p_n} − Σ_{k=1}^{S} μ_{n,k}) / (Σ_{k=1}^{S} σ²_{n,k})   (8)
2) Generate fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model.
The fundamental frequency synthesis parameters are generated as follows.
First, the syntactic unit sequence obtained from recognition is expanded into a context-dependent syntactic unit sequence. For example, the sequence "xx-y-u-y-in-h-e-ch-eng-xx" is expanded into the context-dependent sequence "xx-y+u:/A, y-u+y:/A, u-y+in:/A, y-in+h:/A, in-h+e:/A, h-e+ch:/A, e-ch+eng:/A, ch-eng+xx:/A", where the unit between "-" and "+" is the current syntactic unit and ":/A" carries the context information of the current unit, such as tone. Of course, the representation of the context-dependent syntactic unit sequence is not limited to this form.
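The context expansion described above can be sketched as follows; the label format "prev-cur+next:/A" follows the example in the text, and as noted there it is only one possible representation:

```python
def expand_context(units):
    """Expand a plain syntactic unit sequence into 'prev-cur+next:/A' labels.
    '/A' stands for per-unit context information such as tone; the boundary
    markers 'xx' at both ends never become the current unit themselves."""
    return ["%s-%s+%s:/A" % (units[i - 1], units[i], units[i + 1])
            for i in range(1, len(units) - 1)]

if __name__ == "__main__":
    seq = ["xx", "y", "u", "y", "in", "h", "e", "ch", "eng", "xx"]
    for label in expand_context(seq):
        print(label)
    # prints xx-y+u:/A, y-u+y:/A, ..., ch-eng+xx:/A (8 labels)
```

Real systems typically attach much richer context (position in syllable/word/phrase, stress, tone of neighbors) to each label; the single ":/A" field here only mirrors the document's example.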
Next, the fundamental frequency synthesis model is used to predict a fundamental frequency model for each state of each current syntactic unit; the specific prediction method is the same as in the prior art and is not detailed here.
Then, the states of each syntactic unit are duplicated according to the state duration information of the syntactic unit sequence, and the fundamental frequency distribution of the duplicated sequence is obtained from the fundamental frequency models predicted for each state, i.e. the fundamental frequency model predicted for the whole syntactic unit sequence.
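The state-duplication step just described amounts to repeating each state's predicted model according to its duration, producing the frame-level distribution that enters the parameter generation below. A minimal sketch (the data structures are illustrative, not the patent's):

```python
def duplicate_states(state_models, state_durations):
    """Repeat each state's predicted F0 model for as many frames as the
    state lasts. Each model is an illustrative (mean, variance) pair;
    durations are in frames."""
    frames = []
    for model, n_frames in zip(state_models, state_durations):
        frames.extend([model] * n_frames)
    return frames

if __name__ == "__main__":
    models = [(5.0, 0.1), (5.5, 0.2)]  # toy per-state F0 Gaussians (log-Hz)
    print(duplicate_states(models, [2, 3]))
    # [(5.0, 0.1), (5.0, 0.1), (5.5, 0.2), (5.5, 0.2), (5.5, 0.2)]
```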
Finally, the fundamental frequency synthesis parameters are generated from the fundamental frequency distribution of the syntactic unit sequence, as shown in formula (9):

c = (Wᵀ U⁻¹ W)⁻¹ Wᵀ U⁻¹ M   (9)

where W is the window-function matrix used to compute the dynamic parameters of the syntactic unit sequence, c is the fundamental frequency synthesis parameter vector to be generated, and M and U are respectively the mean vector and covariance matrix of all state fundamental frequency models predicted for the syntactic unit sequence.
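Formula (9) is the standard maximum-likelihood parameter generation step. The sketch below assumes a single stream with one delta window [-0.5, 0, 0.5] and a diagonal covariance (so U⁻¹ reduces to per-element precisions), which turns the solve into a small dense linear system; all data are toy values, not the patent's models:

```python
def mlpg(static_means, delta_means, static_prec, delta_prec):
    """Solve c = (W' U^-1 W)^-1 W' U^-1 M for one feature stream.
    Inputs are per-frame static/delta means and (diagonal) precisions;
    W stacks an identity row (static) and a central-difference delta row
    per frame. A hypothetical simplification of formula (9)."""
    T = len(static_means)
    rows = []  # each row: (weights over c, precision, target mean)
    for t in range(T):
        w = [0.0] * T
        w[t] = 1.0
        rows.append((w, static_prec[t], static_means[t]))
        w = [0.0] * T
        if t > 0:
            w[t - 1] = -0.5
        if t < T - 1:
            w[t + 1] = 0.5
        rows.append((w, delta_prec[t], delta_means[t]))
    # normal equations A c = b with A = W'U^-1 W, b = W'U^-1 M
    A = [[sum(w[i] * p * w[j] for w, p, _ in rows) for j in range(T)]
         for i in range(T)]
    b = [sum(w[i] * p * m for w, p, m in rows) for i in range(T)]
    # naive Gauss-Jordan solve (fine for a toy example)
    for i in range(T):
        piv = A[i][i]
        A[i] = [a / piv for a in A[i]]
        b[i] /= piv
        for r in range(T):
            if r != i and A[r][i] != 0.0:
                f = A[r][i]
                A[r] = [ar - f * ai for ar, ai in zip(A[r], A[i])]
                b[r] -= f * b[i]
    return b

if __name__ == "__main__":
    # static means with an outlier; delta means of 0 favor a smooth track
    c = mlpg([0.0, 10.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0],
             [10.0, 10.0, 10.0])
    print(c)  # middle value is pulled toward its neighbors by the deltas
```

Ignoring the delta rows (zero delta precision) recovers the static means exactly, which is a convenient sanity check on the matrix assembly.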
3) Generate spectrum synthesis parameters using the target speaker's spectrum synthesis model.
The spectrum synthesis parameters are generated in a way similar to the fundamental frequency synthesis parameters above, which is not repeated here.
Step 207: perform speech synthesis on the recognition result using the speech synthesis parameters to obtain synthesized voice data in the target speaker's timbre.
The specific speech synthesis process is the same as in the prior art and is not repeated here.
The voice conversion method provided by this embodiment of the present invention not only effectively ensures that the duration of the synthesized voice data is consistent with that of the voice data to be converted, improving the naturalness of the converted speech, but further uses the syntactic unit sequence obtained from the acoustic model as the recognition result. In this way, speech synthesis is performed directly on the syntactic unit sequence corresponding to the voice data to be converted, so that errors made during speech recognition are not carried into the synthesis, and the semantics of the synthesized voice data remain consistent with those of the voice data to be converted.
Correspondingly, an embodiment of the present invention further provides a voice conversion device; Fig. 3 is a structural schematic diagram of this device.
In this embodiment, the device comprises:
a receiving module 301, configured to receive voice data to be converted;
a speech recognition module 302, configured to perform speech recognition on the voice data to be converted and obtain a recognition result and duration information of the recognition result;
a model acquisition module 303, configured to obtain a speech synthesis model of a target speaker;
a synthesis parameter generation module 304, configured to generate speech synthesis parameters using the speech synthesis model and the duration information;
a speech synthesis module 305, configured to perform speech synthesis on the recognition result using the speech synthesis parameters and obtain synthesized voice data in the target speaker's timbre.
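The data flow through the five modules can be sketched as a thin wrapper; the class and parameter names below are illustrative (the real modules would wrap a recognizer, a model library, a parameter generator, and a synthesizer):

```python
class VoiceConverter:
    """Hypothetical skeleton wiring the five modules listed above."""

    def __init__(self, recognize, get_model, gen_params, synthesize):
        self.recognize = recognize    # module 302: speech -> (result, durations)
        self.get_model = get_model    # module 303: speaker id -> synthesis model
        self.gen_params = gen_params  # module 304: (model, durations) -> parameters
        self.synthesize = synthesize  # module 305: (result, parameters) -> speech

    def convert(self, speech, target_speaker):
        result, durations = self.recognize(speech)
        model = self.get_model(target_speaker)
        params = self.gen_params(model, durations)
        return self.synthesize(result, params)

if __name__ == "__main__":
    # stub components, just to show the data flow
    vc = VoiceConverter(
        recognize=lambda s: (s.upper(), [10] * len(s)),
        get_model=lambda spk: {"speaker": spk},
        gen_params=lambda m, d: {"model": m, "durations": d},
        synthesize=lambda r, p: "synth(%s, %s)" % (r, p["model"]["speaker"]),
    )
    print(vc.convert("nihao", "xiao_ming"))
    # synth(NIHAO, xiao_ming)
```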
In practical applications, the speech recognition module 302 can perform speech recognition on the voice data to be converted and obtain the text sequence corresponding to the voice data and the duration information of each word and/or character in the text sequence. Correspondingly, one specific structure of the speech recognition module 302 comprises the following units:
a first decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the voice data to be converted;
a first decoding unit, configured to decode the voice data to be converted based on the decoding network and the characteristic parameters, and obtain a text sequence corresponding to the optimal decoding path and duration information of each word and/or character in the text sequence.
As discussed above, directly synthesizing speech from the recognized text easily carries recognition errors, such as homophone errors, into the synthesis, so that the semantics of the synthesized voice data differ from those of the voice data to be converted. Therefore, in practical applications, the speech recognition module 302 can instead use the syntactic unit sequence obtained from the acoustic model as the recognition result, while also obtaining the duration information of each syntactic unit in the sequence. In this way, speech synthesis is performed directly on the syntactic unit sequence corresponding to the voice data to be converted, so that recognition errors are not carried into the synthesis and the semantics of the synthesized voice data remain consistent with those of the voice data to be converted. Correspondingly, another specific structure of the speech recognition module 302 comprises the following units:
a second decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the voice data to be converted;
a second decoding unit, configured to decode the voice data to be converted based on the decoding network and the characteristic parameters, and obtain a syntactic unit sequence corresponding to the optimal decoding path and duration information of each syntactic unit in the syntactic unit sequence.
In addition, the model acquisition module 303 can be implemented in multiple ways.
For example, one specific structure of the model acquisition module 303 may comprise a presentation unit, a target speaker determining unit, and a model acquiring unit, where the presentation unit is configured to present selectable target speaker information to the user, the target speaker determining unit is configured to determine the target speaker according to the user's selection, and the model acquiring unit is configured to obtain the speech synthesis model of the target speaker.
As another example, another specific structure of the model acquisition module 303 may comprise a receiving unit and a model training unit, where the receiving unit is configured to receive target speaker voice data provided by the user, and the model training unit is configured to train a speech synthesis model of the target speaker using those voice data.
The target speaker synthesis model comprises a duration synthesis model, a fundamental frequency synthesis model, and a spectrum synthesis model.
Correspondingly, the synthesis parameter generation module 304 comprises:
a duration synthesis parameter generation unit, configured to generate duration synthesis parameters for each state of each syntactic unit using the duration information and the duration synthesis model;
a fundamental frequency synthesis parameter generation unit, configured to generate fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model;
a spectrum synthesis parameter generation unit, configured to generate spectrum synthesis parameters using the target speaker's spectrum synthesis model.
The voice conversion apparatus provided by this embodiment of the present invention first receives speech data to be converted, then performs speech recognition on it to obtain a recognition result together with its duration information, and finally uses the speech synthesis model of the target speaker and the duration information to generate speech synthesis parameters, with which speech synthesis is performed on the recognition result to obtain speech data in the target speaker's timbre. When recognizing the speech data to be converted, the method and apparatus obtain not only the recognition result but also its duration information, and use that duration information to generate the target speaker's synthesis parameters. This effectively ensures that the duration of the synthesized speech data matches the duration of the speech data to be converted, improving the naturalness of the converted speech. Furthermore, the syntactic-unit sequence obtained from the acoustic model can serve directly as the recognition result, so that speech synthesis is performed directly on the syntactic-unit sequence corresponding to the speech data to be converted. Errors that would otherwise arise during full speech recognition are thus kept out of speech synthesis, ensuring that the semantics of the synthesized speech data remain consistent with those of the speech data to be converted.
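The flow just summarized (receive, recognize with durations, generate parameters, synthesize) can be sketched end to end in a toy form that preserves per-unit durations through a stand-in "conversion". Every class and value here is an illustrative assumption; real recognizers, synthesis models, and vocoders are far more involved than these stubs.

```python
class ToyRecognizer:
    """Stand-in recognizer: treats each run of identical frame labels
    as one syntactic unit and reports its duration in frames."""
    def decode(self, frames):
        units, durations = [], []
        for f in frames:
            if units and units[-1] == f:
                durations[-1] += 1
            else:
                units.append(f)
                durations.append(1)
        return units, durations

class ToySynthesisModel:
    """Stand-in for one speaker's duration/pitch/spectrum models."""
    def generate_parameters(self, units, durations):
        # Keep the source durations: this is the key property the
        # method guarantees (converted length == input length).
        return list(zip(units, durations))

def convert_voice(frames, model):
    units, durations = ToyRecognizer().decode(frames)
    params = model.generate_parameters(units, durations)
    # "Synthesis": expand each unit back out to its duration in frames.
    out = []
    for unit, dur in params:
        out.extend([unit] * dur)
    return out

frames = ["n", "n", "i", "h", "h", "h", "ao"]
converted = convert_voice(frames, ToySynthesisModel())
assert len(converted) == len(frames)  # duration is preserved
```

In a real system the output frames would carry the target speaker's timbre while only the durations are carried over; the toy necessarily returns the input labels unchanged.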
The embodiments in this specification are described progressively; for the identical or similar parts of the embodiments, reference may be made between them, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiment is described relatively briefly because it is substantially similar to the method embodiment; for the relevant details, refer to the description of the method embodiment. The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the objectives of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to elaborate the invention; the description of the above embodiments is intended only to aid understanding of the method and apparatus of the present invention. Meanwhile, those of ordinary skill in the art may, following the ideas of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (10)
1. A voice conversion method, characterized by comprising:
receiving speech data to be converted;
performing speech recognition on the speech data to be converted, to obtain a recognition result and duration information of the recognition result;
obtaining a speech synthesis model of a target speaker;
generating speech synthesis parameters using the speech synthesis model and the duration information; and
performing speech synthesis on the recognition result using the speech synthesis parameters, to obtain speech data synthesized in the target speaker's timbre.
2. The method according to claim 1, characterized in that performing speech recognition on the speech data to be converted and obtaining the recognition result and its duration information comprises:
building a decoding network using a pre-trained acoustic model and language model;
extracting characteristic parameters of the speech data to be converted; and
decoding the speech data to be converted based on the decoding network and the characteristic parameters, to obtain a text sequence corresponding to an optimal decoding path and the duration information of each word and/or character in the text sequence.
3. The method according to claim 1, characterized in that performing speech recognition on the speech data to be converted and obtaining the recognition result and its duration information comprises:
building a decoding network using a pre-trained acoustic model and language model;
extracting characteristic parameters of the speech data to be converted; and
decoding the speech data to be converted based on the decoding network and the characteristic parameters, to obtain a syntactic-unit sequence corresponding to an optimal decoding path and the duration information of each syntactic unit in the sequence.
4. The method according to claim 1, characterized in that obtaining the speech synthesis model of the target speaker comprises:
presenting selectable target speaker information to a user, determining the target speaker according to the user's selection, and then obtaining the speech synthesis model of that target speaker; or
receiving target speaker speech data provided by the user, and training a speech synthesis model of the target speaker using that speech data.
5. The method according to any one of claims 1 to 4, characterized in that the target speaker synthesis model comprises a duration synthesis model, a fundamental frequency synthesis model, and a spectrum synthesis model; and
generating the speech synthesis parameters using the speech synthesis model and the duration information comprises:
generating duration synthesis parameters for each state of each syntactic unit using the duration information and the duration synthesis model;
generating fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model; and
generating spectrum synthesis parameters using the target speaker's spectrum synthesis model.
6. A voice conversion apparatus, characterized by comprising:
a receiving module, configured to receive speech data to be converted;
a speech recognition module, configured to perform speech recognition on the speech data to be converted and obtain a recognition result and duration information of the recognition result;
a model acquisition module, configured to obtain a speech synthesis model of a target speaker;
a synthesis parameter generation module, configured to generate speech synthesis parameters using the speech synthesis model and the duration information; and
a speech synthesis module, configured to perform speech synthesis on the recognition result using the speech synthesis parameters and obtain speech data synthesized in the target speaker's timbre.
7. The apparatus according to claim 6, characterized in that the speech recognition module comprises:
a first decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the speech data to be converted; and
a first decoding unit, configured to decode the speech data to be converted based on the decoding network and the characteristic parameters, to obtain a text sequence corresponding to an optimal decoding path and the duration information of each word and/or character in the text sequence.
8. The apparatus according to claim 6, characterized in that the speech recognition module comprises:
a second decoding network construction unit, configured to build a decoding network using a pre-trained acoustic model and language model;
a feature extraction unit, configured to extract characteristic parameters of the speech data to be converted; and
a second decoding unit, configured to decode the speech data to be converted based on the decoding network and the characteristic parameters, to obtain a syntactic-unit sequence corresponding to an optimal decoding path and the duration information of each syntactic unit in the sequence.
9. The apparatus according to claim 6, characterized in that
the model acquisition module comprises:
a presentation unit, configured to present selectable target speaker information to a user;
a target speaker determination unit, configured to determine the target speaker according to the user's selection; and
a model acquisition unit, configured to obtain the speech synthesis model of that target speaker;
or the model acquisition module comprises:
a receiving unit, configured to receive target speaker speech data provided by the user; and
a model training unit, configured to train a speech synthesis model of the target speaker using that speech data.
10. The apparatus according to any one of claims 6 to 9, characterized in that the target speaker synthesis model comprises a duration synthesis model, a fundamental frequency synthesis model, and a spectrum synthesis model; and
the synthesis parameter generation module comprises:
a duration synthesis parameter generation unit, configured to generate duration synthesis parameters for each state of each syntactic unit using the duration information and the duration synthesis model;
a fundamental frequency synthesis parameter generation unit, configured to generate fundamental frequency synthesis parameters using the target speaker's fundamental frequency synthesis model; and
a spectrum synthesis parameter generation unit, configured to generate spectrum synthesis parameters using the target speaker's spectrum synthesis model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510673278.XA CN105206257B (en) | 2015-10-14 | 2015-10-14 | Voice conversion method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510673278.XA CN105206257B (en) | 2015-10-14 | 2015-10-14 | Voice conversion method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105206257A true CN105206257A (en) | 2015-12-30 |
CN105206257B CN105206257B (en) | 2019-01-18 |
Family
ID=54953887
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510673278.XA Active CN105206257B (en) | Voice conversion method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105206257B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106302134A (en) * | 2016-09-29 | 2017-01-04 | 努比亚技术有限公司 | A kind of message playing device and method |
CN106920547A (en) * | 2017-02-21 | 2017-07-04 | 腾讯科技(上海)有限公司 | Phonetics transfer method and device |
CN107705802A (en) * | 2017-09-11 | 2018-02-16 | 厦门美图之家科技有限公司 | Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing |
CN107818794A (en) * | 2017-10-25 | 2018-03-20 | 北京奇虎科技有限公司 | audio conversion method and device based on rhythm |
CN107833572A (en) * | 2017-11-06 | 2018-03-23 | 芋头科技(杭州)有限公司 | The phoneme synthesizing method and system that a kind of analog subscriber is spoken |
CN109147758A (en) * | 2018-09-12 | 2019-01-04 | 科大讯飞股份有限公司 | A kind of speaker's sound converting method and device |
CN110164413A (en) * | 2019-05-13 | 2019-08-23 | 北京百度网讯科技有限公司 | Phoneme synthesizing method, device, computer equipment and storage medium |
CN110600045A (en) * | 2019-08-14 | 2019-12-20 | 科大讯飞股份有限公司 | Sound conversion method and related product |
CN112786018A (en) * | 2020-12-31 | 2021-05-11 | 科大讯飞股份有限公司 | Speech conversion and related model training method, electronic equipment and storage device |
CN113160794A (en) * | 2021-04-30 | 2021-07-23 | 京东数字科技控股股份有限公司 | Voice synthesis method and device based on timbre clone and related equipment |
WO2022141126A1 (en) * | 2020-12-29 | 2022-07-07 | 深圳市优必选科技股份有限公司 | Personalized speech conversion training method, computer device, and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4692941A (en) * | 1984-04-10 | 1987-09-08 | First Byte | Real-time text-to-speech conversion system |
CN1534595A (en) * | 2003-03-28 | 2004-10-06 | 中颖电子(上海)有限公司 | Speech sound change over synthesis device and its method |
US20070168189A1 (en) * | 2006-01-19 | 2007-07-19 | Kabushiki Kaisha Toshiba | Apparatus and method of processing speech |
CN101359473A (en) * | 2007-07-30 | 2009-02-04 | 国际商业机器公司 | Auto speech conversion method and apparatus |
CN101894547A (en) * | 2010-06-30 | 2010-11-24 | 北京捷通华声语音技术有限公司 | Speech synthesis method and system |
CN102306492A (en) * | 2011-09-09 | 2012-01-04 | 中国人民解放军理工大学 | Voice conversion method based on convolutive nonnegative matrix factorization |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103035235A (en) * | 2011-09-30 | 2013-04-10 | 西门子公司 | Method and device for transforming voice into melody |
CN103065619A (en) * | 2012-12-26 | 2013-04-24 | 安徽科大讯飞信息科技股份有限公司 | Speech synthesis method and speech synthesis system |
CN103295574A (en) * | 2012-03-02 | 2013-09-11 | 盛乐信息技术(上海)有限公司 | Singing voice conversion device and method thereof |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4692941A (en) * | 1984-04-10 | 1987-09-08 | First Byte | Real-time text-to-speech conversion system |
CN1534595A (en) * | 2003-03-28 | 2004-10-06 | 中颖电子(上海)有限公司 | Speech sound change over synthesis device and its method |
US20070168189A1 (en) * | 2006-01-19 | 2007-07-19 | Kabushiki Kaisha Toshiba | Apparatus and method of processing speech |
CN101359473A (en) * | 2007-07-30 | 2009-02-04 | 国际商业机器公司 | Auto speech conversion method and apparatus |
CN101894547A (en) * | 2010-06-30 | 2010-11-24 | 北京捷通华声语音技术有限公司 | Speech synthesis method and system |
CN102306492A (en) * | 2011-09-09 | 2012-01-04 | 中国人民解放军理工大学 | Voice conversion method based on convolutive nonnegative matrix factorization |
CN103035235A (en) * | 2011-09-30 | 2013-04-10 | 西门子公司 | Method and device for transforming voice into melody |
CN103295574A (en) * | 2012-03-02 | 2013-09-11 | 盛乐信息技术(上海)有限公司 | Singing voice conversion device and method thereof |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103065619A (en) * | 2012-12-26 | 2013-04-24 | 安徽科大讯飞信息科技股份有限公司 | Speech synthesis method and speech synthesis system |
Non-Patent Citations (3)
Title |
---|
LI Bo et al.: "A Survey of Voice Conversion and Related Technologies", Journal on Communications * |
GUO Weitong: "Prosody Conversion from Mandarin to Xi'an Dialect", Computer Engineering and Applications * |
CHEN Linghui et al.: "A Speaker Conversion Method Based on a Speaker-Independent Model", Pattern Recognition and Artificial Intelligence * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106302134A (en) * | 2016-09-29 | 2017-01-04 | 努比亚技术有限公司 | A kind of message playing device and method |
CN106920547A (en) * | 2017-02-21 | 2017-07-04 | 腾讯科技(上海)有限公司 | Phonetics transfer method and device |
CN107705802A (en) * | 2017-09-11 | 2018-02-16 | 厦门美图之家科技有限公司 | Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing |
CN107818794A (en) * | 2017-10-25 | 2018-03-20 | 北京奇虎科技有限公司 | audio conversion method and device based on rhythm |
CN107833572A (en) * | 2017-11-06 | 2018-03-23 | 芋头科技(杭州)有限公司 | The phoneme synthesizing method and system that a kind of analog subscriber is spoken |
CN109147758A (en) * | 2018-09-12 | 2019-01-04 | 科大讯飞股份有限公司 | A kind of speaker's sound converting method and device |
CN110164413A (en) * | 2019-05-13 | 2019-08-23 | 北京百度网讯科技有限公司 | Phoneme synthesizing method, device, computer equipment and storage medium |
CN110164413B (en) * | 2019-05-13 | 2021-06-04 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, computer device and storage medium |
CN110600045A (en) * | 2019-08-14 | 2019-12-20 | 科大讯飞股份有限公司 | Sound conversion method and related product |
WO2022141126A1 (en) * | 2020-12-29 | 2022-07-07 | 深圳市优必选科技股份有限公司 | Personalized speech conversion training method, computer device, and storage medium |
CN112786018A (en) * | 2020-12-31 | 2021-05-11 | 科大讯飞股份有限公司 | Speech conversion and related model training method, electronic equipment and storage device |
CN113160794A (en) * | 2021-04-30 | 2021-07-23 | 京东数字科技控股股份有限公司 | Voice synthesis method and device based on timbre clone and related equipment |
CN113160794B (en) * | 2021-04-30 | 2022-12-27 | 京东科技控股股份有限公司 | Voice synthesis method and device based on timbre clone and related equipment |
Also Published As
Publication number | Publication date |
---|---|
CN105206257B (en) | 2019-01-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105206257A (en) | Voice conversion method and device | |
CN107195296B (en) | Voice recognition method, device, terminal and system | |
CN102779508B (en) | Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof | |
CN105869624A (en) | Method and apparatus for constructing speech decoding network in digital speech recognition | |
CN102231278B (en) | Method and system for realizing automatic addition of punctuation marks in speech recognition | |
CN105593936B (en) | System and method for text-to-speech performance evaluation | |
US20220013106A1 (en) | Multi-speaker neural text-to-speech synthesis | |
CN108305616A (en) | A kind of audio scene recognition method and device based on long feature extraction in short-term | |
CN115516552A (en) | Speech recognition using synthesis of unexplained text and speech | |
KR102311922B1 (en) | Apparatus and method for controlling outputting target information to voice using characteristic of user voice | |
US10255903B2 (en) | Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system | |
CN106057192A (en) | Real-time voice conversion method and apparatus | |
CN104217713A (en) | Tibetan-Chinese speech synthesis method and device | |
CN104123933A (en) | Self-adaptive non-parallel training based voice conversion method | |
CN109767778A (en) | A kind of phonetics transfer method merging Bi-LSTM and WaveNet | |
CN108922521A (en) | A kind of voice keyword retrieval method, apparatus, equipment and storage medium | |
Yin et al. | Modeling F0 trajectories in hierarchically structured deep neural networks | |
CN106653002A (en) | Literal live broadcasting method and platform | |
KR102272554B1 (en) | Method and system of text to multiple speech | |
Nidhyananthan et al. | Language and text-independent speaker identification system using GMM | |
CN109300339A (en) | A kind of exercising method and system of Oral English Practice | |
CN112185342A (en) | Voice conversion and model training method, device and system and storage medium | |
CN115101046A (en) | Method and device for synthesizing voice of specific speaker | |
AU2015411306A1 (en) | Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system | |
Sultana et al. | A survey on Bengali speech-to-text recognition techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |