CN105023574A - Method and system of enhancing TTS - Google Patents

Method and system of enhancing TTS

Info

Publication number
CN105023574A
CN105023574A CN201410182886.6A
Authority
CN
China
Prior art keywords
model
parameter
synthetic
speech
enhancing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410182886.6A
Other languages
Chinese (zh)
Other versions
CN105023574B (en)
Inventor
孙见青
陈凌辉
凌震华
江源
胡国平
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201410182886.6A priority Critical patent/CN105023574B/en
Publication of CN105023574A publication Critical patent/CN105023574A/en
Application granted granted Critical
Publication of CN105023574B publication Critical patent/CN105023574B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

The invention relates to the TTS (Text-To-Speech) technical field, and discloses a method and system of enhancing TTS. The method comprises: constructing an initial TTS model based on training data, wherein the training data comprises text data and speech data corresponding to the text data; establishing an enhancement model used for simulating the mapping relation between the TTS parameters generated by the initial TTS model and natural speech parameters; after receiving a text to be synthesized, generating TTS parameters corresponding to the text to be synthesized according to the initial TTS model and the enhancement model; and utilizing the TTS parameters to generate continuous speech signals. The method and system of enhancing TTS can effectively improve TTS effects.

Description

Method and system for synthetic speech enhancement
Technical field
The present invention relates to the field of speech synthesis technology, and in particular to a method and system for synthetic speech enhancement.
Background technology
Achieving humanized, intelligent and effective human-machine interaction and building an efficient, natural human-machine communication environment have become pressing demands of current information technology applications and development. As an important and practical voice technology, speech synthesis, also known as text-to-speech (TTS), converts text information into a natural speech signal. It enables real-time conversion of text, replaces the cumbersome traditional practice of making a machine "speak" by playing back recordings, saves system storage space, and plays an increasingly important role in the growing flow of information interaction, particularly in dynamic query applications whose information content changes frequently.
Speech synthesis systems based on parametric synthesis are widely used because of their good robustness and generalization. However, the method has a strong smoothing effect: the synthesized speech is flat and its quality is easily degraded, so its naturalness is not ideal and there is room for improvement in practical applications. Improving the naturalness of synthesized speech is therefore an important guarantee for making synthesis systems practical.
To this end, the prior art mainly improves the naturalness of synthesized speech by synthetic speech enhancement. Its main techniques can be summarized as post-filtering of the generated spectral parameters or of the synthesized speech based on empirical knowledge such as human auditory characteristics, for example enhancing the dynamic characteristics of the formants of the synthesized speech or enhancing the generated spectral parameters, thereby improving the quality of the synthesized speech.
In fact, the acoustic characteristics of different speakers differ in their details, and even for the same speaker, the acoustic characteristics of different sounds differ in their details. Enhancement methods based on empirical knowledge can only make the enhanced speech conform to human hearing on the whole, so their enhancement effect is unsatisfactory.
Summary of the invention
The embodiments of the present invention provide a method and system for synthetic speech enhancement, so as to improve the enhancement effect of synthesized speech.
To this end, the embodiments of the present invention provide the following technical solutions:
A method for synthetic speech enhancement, comprising:
building an initial speech synthesis model based on training data, the training data comprising text data and speech data corresponding to the text data;
establishing an enhancement model, the enhancement model being used to model the mapping relationship between the synthetic speech parameters generated by the initial speech synthesis model and natural speech parameters;
after receiving a text to be synthesized, generating the synthetic speech parameters corresponding to the text to be synthesized according to the initial speech synthesis model and the enhancement model;
generating a continuous speech signal using the synthetic speech parameters.
Preferably, establishing the enhancement model comprises:
generating the synthetic speech parameters of all training data according to the initial speech synthesis model;
extracting the natural speech parameters of all training data;
determining the topology of the enhancement model;
taking the paired synthetic speech parameters and natural speech parameters of the training data as a training set, and performing parameter training according to the topology to obtain the enhancement model.
Preferably, the enhancement model is a linear-function mapping model, a GMM model or a DNN model.
Preferably, the mapping relationship between the synthetic speech parameters generated by the initial speech synthesis model and the natural speech parameters is the conditional distribution of the natural speech parameters given the synthetic speech parameters generated by the initial speech synthesis model.
Preferably, the initial speech synthesis model comprises a duration model, a spectral model and a fundamental frequency model;
generating the synthetic speech parameters corresponding to the text to be synthesized according to the initial speech synthesis model and the enhancement model comprises:
enhancing the spectral model and/or fundamental frequency model in the initial speech synthesis model according to the enhancement model to obtain an enhanced spectral model and/or fundamental frequency model;
generating the spectral parameters and/or fundamental frequency parameters corresponding to the text to be synthesized using the enhanced spectral model and/or fundamental frequency model;
generating the other speech parameters, besides those of the spectral model and/or fundamental frequency model, corresponding to the text to be synthesized using the initial speech synthesis model.
Preferably, enhancing the spectral model and/or fundamental frequency model in the initial speech synthesis model according to the enhancement model to obtain the enhanced spectral model and/or fundamental frequency model comprises:
obtaining the model parameters of the spectral model and/or fundamental frequency model from the initial speech synthesis model;
enhancing the model parameters using the enhancement model to obtain enhanced model parameters;
substituting the enhanced model parameters for the model parameters of the corresponding spectral model and/or fundamental frequency model to obtain the enhanced spectral model and/or fundamental frequency model.
Preferably, the initial speech synthesis model comprises a duration model, a spectral model and a fundamental frequency model;
generating the synthetic speech parameters corresponding to the text to be synthesized according to the initial speech synthesis model and the enhancement model comprises:
generating the duration parameters, spectral parameters and fundamental frequency parameters corresponding to the text to be synthesized using the initial speech synthesis model;
enhancing the spectral parameters and/or fundamental frequency parameters using the enhancement model to obtain enhanced spectral parameters and/or fundamental frequency parameters, and using the enhanced spectral parameters and/or fundamental frequency parameters as the spectral parameters and/or fundamental frequency parameters of the text to be synthesized when synthesizing speech.
A system for synthetic speech enhancement, comprising:
an initial model building module, configured to build an initial speech synthesis model based on training data, the training data comprising text data and speech data corresponding to the text data;
an enhancement model building module, configured to establish an enhancement model, the enhancement model being used to model the mapping relationship between the synthetic speech parameters generated by the initial speech synthesis model and natural speech parameters;
a receiving module, configured to receive a text to be synthesized;
a parameter generation module, configured to generate the synthetic speech parameters corresponding to the text to be synthesized according to the initial speech synthesis model and the enhancement model;
a synthesis module, configured to generate a continuous speech signal using the synthetic speech parameters.
Preferably, the enhancement model building module comprises:
a synthetic speech parameter generating unit, configured to generate the synthetic speech parameters of all training data according to the initial speech synthesis model;
a natural speech parameter extraction unit, configured to extract the natural speech parameters of all training data;
a topology determining unit, configured to determine the topology of the enhancement model;
a training unit, configured to take the paired synthetic speech parameters and natural speech parameters of the training data as a training set and perform parameter training according to the topology to obtain the enhancement model.
Preferably, the initial speech synthesis model comprises a duration model, a spectral model and a fundamental frequency model; the parameter generation module comprises:
a model enhancement unit, configured to enhance the spectral model and/or fundamental frequency model in the initial speech synthesis model according to the enhancement model to obtain an enhanced spectral model and/or fundamental frequency model;
an enhanced speech parameter generation unit, configured to generate the spectral parameters and/or fundamental frequency parameters corresponding to the text to be synthesized using the enhanced spectral model and/or fundamental frequency model;
an initial speech parameter generation unit, configured to generate the other speech parameters, besides those of the spectral model and/or fundamental frequency model, corresponding to the text to be synthesized using the initial speech synthesis model.
Preferably, the model enhancement unit comprises:
a model parameter acquiring unit, configured to obtain the model parameters of the spectral model and/or fundamental frequency model from the initial speech synthesis model;
a model parameter enhancement unit, configured to enhance the model parameters using the enhancement model to obtain enhanced model parameters;
an enhanced model generation unit, configured to substitute the enhanced model parameters for the model parameters of the corresponding spectral model and/or fundamental frequency model to obtain the enhanced spectral model and/or fundamental frequency model.
Preferably, the initial speech synthesis model comprises a duration model, a spectral model and a fundamental frequency model;
the parameter generation module comprises:
an initial speech parameter generation unit, configured to generate the duration parameters, spectral parameters and fundamental frequency parameters corresponding to the text to be synthesized using the initial speech synthesis model;
a parameter enhancement unit, configured to enhance the spectral parameters and/or fundamental frequency parameters using the enhancement model to obtain enhanced spectral parameters and/or fundamental frequency parameters, and to use the enhanced spectral parameters and/or fundamental frequency parameters as the spectral parameters and/or fundamental frequency parameters of the text to be synthesized when synthesizing speech.
The method and system for synthetic speech enhancement provided by the embodiments of the present invention build, with a statistics-based method, an enhancement model that models the mapping relationship between the synthetic speech parameters generated by a conventional speech synthesis model and natural speech parameters; the enhancement model and the conventional speech synthesis model are then used to generate the synthetic speech parameters corresponding to a text to be synthesized, and those synthetic speech parameters are used to generate a continuous speech signal. Because the solution of the embodiments of the present invention is guided by natural speech parameters, it has a stronger grasp of the fine acoustic details of different speakers, and of the same speaker producing different sounds, and can capture the characteristics of a specific speaker, so the synthetic speech enhancement effect is better.
Brief description of the drawings
In order to explain the embodiments of the present application or the technical solutions of the prior art more clearly, the drawings required by the embodiments are briefly described below. Obviously, the drawings described below are only some of the embodiments recorded in the present invention, and those of ordinary skill in the art may obtain other drawings from them.
Fig. 1 is a flowchart of the method for synthetic speech enhancement according to an embodiment of the present invention;
Fig. 2 is a flowchart of synthesizing speech according to the initial speech synthesis model and the enhancement model in an embodiment of the present invention;
Fig. 3 is another flowchart of synthesizing speech according to the initial speech synthesis model and the enhancement model in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the system for synthetic speech enhancement according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of one implementation of the parameter generation module in an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another implementation of the parameter generation module in an embodiment of the present invention.
Detailed description of the embodiments
In order to enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments of the present invention are described in further detail below with reference to the drawings and the embodiments.
Because the acoustic characteristics of different speakers differ in their details, and even for the same speaker the acoustic characteristics of different sounds differ in their details, while existing synthetic speech enhancement methods only apply post-filtering to the generated spectral parameters or the synthesized speech based on empirical knowledge such as human auditory characteristics and ignore the detailed acoustic characteristics of the speaker, they can only make the enhanced speech conform to human hearing on the whole and cannot achieve an ideal enhancement effect. In view of these problems of the prior art, the embodiments of the present invention provide a method and system for synthetic speech enhancement: a statistics-based method builds an enhancement model that models the mapping relationship between the synthetic speech parameters generated by a conventional speech synthesis model and natural speech parameters; the enhancement model and the conventional speech synthesis model are then used to generate the synthetic speech parameters corresponding to a text to be synthesized, and those synthetic speech parameters are used to generate a continuous speech signal.
As shown in Fig. 1, the method for synthetic speech enhancement according to an embodiment of the present invention comprises the following steps:
Step 101: build an initial speech synthesis model based on training data, the training data comprising text data and speech data corresponding to the text data.
The initial speech synthesis model can be built with a conventional parametric synthesis method and comprises, for each basic synthesis unit, the corresponding binary decision tree, spectral model, fundamental frequency model, duration model, and so on. For example, an HMM-based parametric synthesis method may be adopted; taking the spectral model as an example, a GMM (Gaussian Mixture Model) is used to model the spectral distribution of a leaf node, and its number of Gaussians is usually set to a positive integer with reference to the scale of the training data, for example 1.
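For illustration only, the following sketch (Python/NumPy; all names are hypothetical and not part of the patent) shows one way such a single-Gaussian leaf-node spectral model could be estimated, assuming that HMM training, decision-tree clustering and forced alignment have already assigned feature frames to each leaf:

```python
import numpy as np

def fit_leaf_spectral_model(frames):
    """Estimate a single-Gaussian spectral model for one decision-tree leaf node.

    frames : (N, D) array of spectral feature vectors (e.g. mel-cepstra together
             with their delta and delta-delta features) assigned to this leaf by
             decision-tree clustering and forced alignment.
    Returns the Gaussian parameters (mean vector and diagonal covariance).
    """
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6   # diagonal covariance, floored for numerical stability
    return {"mean": mean, "cov": var}
```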
Step 102: establish an enhancement model, the enhancement model being used to model the mapping relationship between the synthetic speech parameters generated by the initial speech synthesis model and natural speech parameters.
Because the design and optimization of the enhancement model has an important impact on the synthetic speech enhancement effect, the embodiments of the present invention adopt a data-driven way of setting up the enhancement model, guided by natural speech parameters, so that the fine acoustic details of different speakers, and of the same speaker producing different sounds, are truly reflected, thereby improving the synthetic speech enhancement effect.
The enhancement model is built as follows:
(1) generate the synthetic speech parameters of all training data according to the initial speech synthesis model;
(2) extract the natural speech parameters of all training data;
(3) determine the topology of the enhancement model;
(4) take the paired synthetic speech parameters and natural speech parameters of the training data as a training set, and perform parameter training according to the topology to obtain the enhancement model.
It should be noted that, in practical applications, enhancement models for the spectral characteristics and/or the fundamental frequency characteristics can be built separately. For example, the spectral enhancement model is built as follows:
(1) Generate the synthetic spectral parameters of all training data according to the spectral model in the initial speech synthesis model.
According to the spectral model and the forced-alignment result, the spectral model sequence corresponding to the training data can be determined. Specifically, for a single basic speech unit, the selected spectral model is replicated according to the forced-alignment duration information to obtain the spectral feature sequence model of that basic speech unit.
The log-likelihood of the spectral model sequence corresponding to the training data is computed as follows:
\log P(WC_s \mid Q, \lambda) = -\tfrac{1}{2} C_s^T W^T U_s^{-1} W C_s + C_s^T W^T U_s^{-1} M_s + \mathrm{const} \qquad (1)
where W is the window-function matrix that computes the dynamic parameters, C_s is the spectral parameter sequence to be generated, and M_s and U_s are respectively the mean and the covariance matrix of the spectral model sequence. The log-likelihood of the spectral model sequence is clearly a function of the target spectral feature vector.
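As a concrete illustration of how the synthetic spectral parameters can be generated from equation (1), the sketch below maximizes (1) with respect to C_s; setting the gradient to zero gives the linear system W^T U_s^{-1} W C_s = W^T U_s^{-1} M_s, the standard maximum-likelihood parameter-generation solution. The function signature is an assumption made for illustration; the patent itself does not prescribe an implementation.

```python
import numpy as np

def generate_trajectory(W, M_s, U_s_diag):
    """Generate the spectral parameter trajectory C_s that maximizes eq. (1).

    W        : (3T, T) window matrix stacking the static/delta/delta-delta windows
    M_s      : (3T,)   concatenated mean vectors of the spectral model sequence
    U_s_diag : (3T,)   diagonal of the covariance matrix of the model sequence
    """
    U_inv = np.diag(1.0 / U_s_diag)   # U_s^{-1}, assumed diagonal
    A = W.T @ U_inv @ W               # W^T U_s^{-1} W
    b = W.T @ U_inv @ M_s             # W^T U_s^{-1} M_s
    return np.linalg.solve(A, b)      # synthetic spectral trajectory C_s (per dimension)
```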
(2) Extract the natural spectral parameters of all training data.
(3) Determine the topology of the spectral enhancement model.
The spectral enhancement model models the mapping relationship between the spectral parameters generated by the conventional speech synthesis model and the natural spectral parameters. In the embodiments of the present invention, a linear-function mapping model may be adopted, or a statistical model such as a GMM or a DNN may be used. In general, the finer the model, the better its fit when the data are sufficient.
(4) Perform parameter training on the spectral enhancement model according to the topology to obtain the optimized spectral enhancement model, that is, establish the conditional distribution p(y_t | x_t) of the natural spectral parameter y_t given the synthetic spectral parameter x_t.
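As one concrete reading of step (4), the sketch below fits the linear-function variant of the enhancement model mentioned above: p(y_t | x_t) is approximated by a linear mapping y_t ≈ A x_t + b estimated by least squares on paired synthetic/natural spectral frames. The frame pairing and the choice of a purely linear map are assumptions made for illustration; a GMM- or DNN-based mapping would replace this fit.

```python
import numpy as np

def train_linear_enhancement(X_syn, Y_nat):
    """Fit the linear enhancement mapping y ≈ A x + b by least squares.

    X_syn : (N, D) synthetic spectral parameters generated by the initial model
    Y_nat : (N, D) natural spectral parameters extracted from the recordings,
            frame-aligned with X_syn (the alignment is assumed to be available)
    """
    N = X_syn.shape[0]
    X1 = np.hstack([X_syn, np.ones((N, 1))])             # append a bias column
    coeffs, *_ = np.linalg.lstsq(X1, Y_nat, rcond=None)  # solve X1 @ B ≈ Y_nat
    A, b = coeffs[:-1].T, coeffs[-1]
    return A, b

def enhance(x, A, b):
    """Map a synthetic parameter vector toward its natural counterpart."""
    return A @ x + b
```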
The building process of the enhancement model for the fundamental frequency characteristics is similar to the above and is not described in detail here.
Step 103: after receiving a text to be synthesized, generate the synthetic speech parameters corresponding to the text to be synthesized according to the initial speech synthesis model and the enhancement model.
Based on the above enhancement model, in practical applications the initial speech synthesis model or the synthetic speech parameters can be enhanced in various ways, all of which achieve a good enhancement effect; the specific implementations are described in detail later.
Step 104: generate a continuous speech signal using the synthetic speech parameters.
The method for synthetic speech enhancement provided by the embodiments of the present invention builds, with a statistics-based method, an enhancement model that models the mapping relationship between the synthetic speech parameters generated by a conventional speech synthesis model and natural speech parameters; the enhancement model and the conventional speech synthesis model are then used to generate the synthetic speech parameters corresponding to a text to be synthesized, and those synthetic speech parameters are used to generate a continuous speech signal. Because the enhancement model is guided by natural speech parameters, it has a stronger grasp of the fine acoustic details of different speakers, and of the same speaker producing different sounds, and can capture the characteristics of a specific speaker, so the synthetic speech enhancement effect is better. Moreover, the solution of the embodiments of the present invention does not increase the computational load of the actual synthesis task, which is conducive to real-time products.
It should be noted that, in practical applications, there are multiple ways to generate the synthetic speech parameters according to the initial speech synthesis model and the enhancement model. For example, the corresponding enhancement model may be used to enhance the spectral model and/or fundamental frequency model in the initial speech synthesis model; the enhanced spectral model and/or fundamental frequency model is then used to generate the spectral parameters and/or fundamental frequency parameters corresponding to the text to be synthesized, the other speech synthesis parameters are generated by the initial speech synthesis model, and these speech synthesis parameters are used to generate a continuous speech signal. As another example, the initial speech synthesis model may first be used to generate the speech synthesis parameters corresponding to the text to be synthesized (including duration parameters, spectral parameters and fundamental frequency parameters), the corresponding enhancement model is then used to enhance some of those parameters (the spectral parameters and/or fundamental frequency parameters), and finally the enhanced parameters together with the parameters that were not enhanced (mainly the duration parameters) are used to generate a continuous speech signal.
The processes of generating the synthetic speech parameters according to the initial speech synthesis model and the enhancement model in the embodiments of the present invention are described in detail below with examples.
As shown in Fig. 2, one flow of generating the synthetic speech parameters according to the initial speech synthesis model and the enhancement model in an embodiment of the present invention comprises the following steps:
Step 201: generate the duration parameters and fundamental frequency parameters corresponding to the text to be synthesized using the initial speech synthesis model.
Step 202: enhance the spectral model in the initial speech synthesis model according to the enhancement model to obtain an enhanced spectral model.
First, the model parameters, denoted x_t, are obtained from the initial spectral model, for example a GMM-based spectral model; then the pre-trained enhancement model is used to enhance the model parameters x_t, that is, the enhanced model parameters y_t are obtained according to p(y_t | x_t); finally, the model parameters of the spectral model are replaced with the enhanced model parameters y_t, and the resulting new spectral model is the enhanced spectral model.
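A minimal sketch of this model-level enhancement (Fig. 2, step 202), assuming the linear mapping (A, b) from the earlier training sketch, that the enhancement is applied to the Gaussian mean vectors of the leaf-node spectral models (consistent with the mean enhancement mentioned for the model enhancement unit later on), and a hypothetical dictionary layout for the models:

```python
def enhance_spectral_model(leaf_models, A, b):
    """Replace each leaf-node spectral model's parameters with their enhanced values.

    leaf_models : dict mapping leaf id -> {"mean": np.ndarray, "cov": np.ndarray}
                  (hypothetical container; the patent does not prescribe a format)
    A, b        : enhancement mapping trained on vectors of the same dimensionality
                  as the model means.
    """
    enhanced = {}
    for leaf_id, gauss in leaf_models.items():
        enhanced[leaf_id] = {
            "mean": A @ gauss["mean"] + b,   # enhanced model parameters y_t from x_t
            "cov": gauss["cov"],             # covariance left unchanged in this sketch
        }
    return enhanced
```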
Step 203: generate the spectral parameters corresponding to the text to be synthesized using the enhanced spectral model.
Step 204: generate a continuous speech signal using the duration parameters, fundamental frequency parameters and spectral parameters corresponding to the text to be synthesized.
It should be noted that, in practical applications, an enhancement model for the spectral characteristics and an enhancement model for the fundamental frequency characteristics can be generated separately. Therefore, the spectral enhancement model may be used alone to enhance the spectral model in the initial speech synthesis model, or the fundamental frequency enhancement model may be used alone to enhance the fundamental frequency model in the initial speech synthesis model, or the two enhancement models for different characteristics may be used together to enhance the spectral model and the fundamental frequency model respectively. Correspondingly, the enhanced spectral model and/or fundamental frequency model is used to obtain the spectral parameters and/or fundamental frequency parameters corresponding to the text to be synthesized, and a continuous speech signal can be generated from these speech synthesis parameters together with the other speech synthesis parameters obtained from the initial speech synthesis model.
The process of generating a continuous speech signal from these speech synthesis parameters is similar to the prior art and is not repeated here.
It can be seen that the method of the embodiments of the present invention enhances the conventional speech synthesis model; in subsequent synthesis tasks, only the enhanced speech synthesis model needs to be used to obtain the corresponding speech synthesis parameters, so the computational load is not increased while a good enhancement effect is achieved.
As shown in Fig. 3, another flow of synthesizing speech according to the initial speech synthesis model and the enhancement model in an embodiment of the present invention comprises the following steps:
Step 301: generate the duration parameters, spectral parameters and fundamental frequency parameters corresponding to the text to be synthesized using the initial speech synthesis model.
Step 302: enhance the spectral parameters using the enhancement model to obtain enhanced spectral parameters.
Specifically, the spectral parameters C_s of the foregoing formula (1) are substituted for x_t in the previously established enhancement model p(y_t | x_t), yielding the enhanced spectral parameters y_t.
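A minimal sketch of this parameter-level path (Fig. 3, step 302), reusing the linear mapping (A, b) from the earlier sketch and applying it frame by frame to the generated trajectory; the frame-wise application is an assumption made for illustration:

```python
def enhance_trajectory(C_s, A, b):
    """Enhance a generated spectral trajectory frame by frame: y_t = A x_t + b.

    C_s : (T, D) spectral trajectory produced by the initial synthesis model,
          e.g. generate_trajectory() applied per dimension and stacked.
    """
    return C_s @ A.T + b   # enhanced spectral parameters used in place of C_s
```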
Step 303: generate a continuous speech signal using the duration parameters, fundamental frequency parameters and enhanced spectral parameters corresponding to the text to be synthesized.
As can be seen from the above flow, unlike the flow shown in Fig. 2, in this embodiment the fundamental frequency parameters, spectral parameters and duration parameters corresponding to the text to be synthesized are first generated by the initial speech synthesis model, and the spectral parameters among them are then enhanced by the corresponding enhancement model, so that the enhanced speech synthesis parameters better reflect the detailed differences in acoustic characteristics between different speakers and between different sounds of the same speaker. The enhanced speech synthesis parameters are combined with the other speech synthesis parameters obtained from the conventional speech synthesis model, and speech is synthesized by a synthesizer.
It should be noted that, in practical applications, a flow similar to that of Fig. 3 can likewise be used to enhance the fundamental frequency parameters with the corresponding enhancement model to obtain enhanced fundamental frequency parameters; the duration parameters and spectral parameters corresponding to the text to be synthesized are then used together with the enhanced fundamental frequency parameters to generate a continuous speech signal. Alternatively, the spectral enhancement model may be used to enhance the spectral parameters generated by the initial speech synthesis model while the fundamental frequency enhancement model is used to enhance the fundamental frequency parameters generated by the initial speech synthesis model, and the duration parameters generated by the initial speech synthesis model are then used together with the enhanced fundamental frequency parameters and spectral parameters to generate a continuous speech signal.
Correspondingly, an embodiment of the present invention also provides a system for synthetic speech enhancement; Fig. 4 is a schematic structural diagram of this system.
In this embodiment, the system comprises:
an initial model building module 401, configured to build an initial speech synthesis model based on training data, the training data comprising text data and speech data corresponding to the text data;
an enhancement model building module 402, configured to establish an enhancement model, the enhancement model being used to model the mapping relationship between the synthetic speech parameters generated by the initial speech synthesis model and natural speech parameters;
a receiving module 403, configured to receive a text to be synthesized;
a parameter generation module 404, configured to generate the synthetic speech parameters corresponding to the text to be synthesized according to the initial speech synthesis model and the enhancement model;
a synthesis module 405, configured to generate a continuous speech signal using the synthetic speech parameters.
The initial model building module 401 may build the initial speech synthesis model with a conventional parametric synthesis method; the initial speech synthesis model comprises, for each basic synthesis unit, the corresponding binary decision tree, spectral model, fundamental frequency model, duration model, and so on. For example, an HMM-based parametric synthesis method may be adopted; taking the spectral model as an example, a GMM is used to model the spectral distribution of a leaf node, and its number of Gaussians is usually set to a positive integer with reference to the scale of the training data, for example 1.
Because the design and optimization of the enhancement model has an important impact on the synthetic speech enhancement effect, in the embodiments of the present invention the enhancement model building module 402 adopts a data-driven way of setting up the enhancement model, guided by natural speech parameters, so that the fine acoustic details of different speakers, and of the same speaker producing different sounds, are truly reflected, thereby improving the synthetic speech enhancement effect.
The enhancement model building module 402 may specifically comprise the following units:
a synthetic speech parameter generating unit, configured to generate the synthetic speech parameters of all training data according to the initial speech synthesis model;
a natural speech parameter extraction unit, configured to extract the natural speech parameters of all training data;
a topology determining unit, configured to determine the topology of the enhancement model;
a training unit, configured to take the paired synthetic speech parameters and natural speech parameters of the training data as a training set and perform parameter training according to the topology to obtain the enhancement model.
It should be noted that, in practical applications, the enhancement model building module 402 may build enhancement models for the spectral characteristics and/or the fundamental frequency characteristics separately. Correspondingly, when building the spectral enhancement model, the synthetic speech parameter generating unit needs to generate the synthetic spectral parameters of all training data according to the spectral model in the initial speech synthesis model, and the natural speech parameter extraction unit needs to extract the natural spectral parameters of all training data. Similarly, when building the fundamental frequency enhancement model, the synthetic speech parameter generating unit needs to generate the synthetic fundamental frequency parameters of all training data according to the fundamental frequency model in the initial speech synthesis model, and the natural speech parameter extraction unit needs to extract the natural fundamental frequency parameters of all training data.
Based on the enhancement model established by the enhancement model building module 402, the parameter generation module 404 can enhance the initial speech synthesis model or the synthetic speech parameters in various ways, all of which achieve a good enhancement effect; correspondingly, the parameter generation module 404 may have multiple specific implementation structures, which are described in detail later.
The system for synthetic speech enhancement provided by the embodiments of the present invention builds, with a statistics-based method, an enhancement model that models the mapping relationship between the synthetic speech parameters generated by a conventional speech synthesis model and natural speech parameters; the enhancement model and the conventional speech synthesis model are then used to generate the synthetic speech parameters corresponding to a text to be synthesized, and those synthetic speech parameters are used to generate a continuous speech signal. Because the enhancement model is guided by natural speech parameters, it has a stronger grasp of the fine acoustic details of different speakers, and of the same speaker producing different sounds, and can capture the characteristics of a specific speaker, so the synthetic speech enhancement effect is better.
Fig. 5 is a schematic structural diagram of one implementation of the parameter generation module in an embodiment of the present invention.
In this embodiment, the parameter generation module comprises:
a model enhancement unit 501, configured to enhance the spectral model and/or fundamental frequency model in the initial speech synthesis model according to the enhancement model to obtain an enhanced spectral model and/or fundamental frequency model;
an enhanced speech parameter generation unit 502, configured to generate the spectral parameters and/or fundamental frequency parameters corresponding to the text to be synthesized using the enhanced spectral model and/or fundamental frequency model;
an initial speech parameter generation unit 503, configured to generate the other speech parameters, besides those of the spectral model and/or fundamental frequency model, corresponding to the text to be synthesized using the initial speech synthesis model.
It should be noted that, in practical applications, the aforementioned enhancement model building module 402 may generate an enhancement model for the spectral characteristics and an enhancement model for the fundamental frequency characteristics separately. Therefore, in the embodiment shown in Fig. 5, the model enhancement unit 501 may use the spectral enhancement model alone to enhance the spectral model in the initial speech synthesis model, or use the fundamental frequency enhancement model alone to enhance the fundamental frequency model in the initial speech synthesis model, or use the two enhancement models for different characteristics together to enhance the spectral model and the fundamental frequency model respectively. Correspondingly, the enhanced speech parameter generation unit 502 may use the enhanced spectral model and/or fundamental frequency model to obtain the spectral parameters and/or fundamental frequency parameters corresponding to the text to be synthesized, and the synthesis module 405 of Fig. 4 can generate a continuous speech signal from these speech synthesis parameters together with the other speech synthesis parameters obtained by the initial speech parameter generation unit 503.
It can be seen that the system for synthetic speech enhancement of the embodiments of the present invention enhances the conventional speech synthesis model; in subsequent synthesis tasks, only the enhanced speech synthesis model needs to be used to obtain the corresponding speech synthesis parameters, so the computational load is not increased while a good enhancement effect is achieved.
The model enhancement unit 501 may obtain the enhanced spectral model and/or fundamental frequency model by enhancing the model means; one concrete structure of the model enhancement unit 501 may comprise the following units:
a model parameter acquiring unit, configured to obtain the model parameters of the spectral model and/or fundamental frequency model from the initial speech synthesis model;
a model parameter enhancement unit, configured to enhance the model parameters using the enhancement model to obtain enhanced model parameters;
an enhanced model generation unit, configured to substitute the enhanced model parameters for the model parameters of the corresponding spectral model and/or fundamental frequency model to obtain the enhanced spectral model and/or fundamental frequency model.
Fig. 6 is a schematic structural diagram of another implementation of the parameter generation module in an embodiment of the present invention.
In this embodiment, the parameter generation module comprises:
an initial speech parameter generation unit 601, configured to generate the duration parameters, spectral parameters and fundamental frequency parameters corresponding to the text to be synthesized using the initial speech synthesis model;
a parameter enhancement unit 602, configured to enhance the spectral parameters and/or fundamental frequency parameters using the enhancement model to obtain enhanced spectral parameters and/or fundamental frequency parameters, and to use the enhanced spectral parameters and/or fundamental frequency parameters as the spectral parameters and/or fundamental frequency parameters of the text to be synthesized when synthesizing speech.
Unlike the structure shown in Fig. 5, in this embodiment the initial speech parameter generation unit 601 first uses the initial speech synthesis model to generate the fundamental frequency parameters, spectral parameters and duration parameters corresponding to the text to be synthesized, and the parameter enhancement unit 602 then uses the corresponding enhancement model to enhance the spectral parameters among them, so that the enhanced speech synthesis parameters better reflect the detailed differences in acoustic characteristics between different speakers and between different sounds of the same speaker. The synthesis module 405 of Fig. 4 combines the enhanced speech synthesis parameters with the other speech synthesis parameters obtained from the conventional speech synthesis model and synthesizes speech with a synthesizer.
With the system for synthetic speech enhancement of the embodiments of the present invention, the fine acoustic details of different speakers, and of the same speaker producing different sounds, are obtained by statistical methods and then used to enhance the synthesized speech, so that a better enhancement effect can be obtained.
The embodiments in this specification are described in a progressive manner; for identical or similar parts the embodiments may refer to each other, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments, and the relevant parts may refer to the description of the method embodiments. The system embodiments described above are merely schematic; the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment solutions. Those of ordinary skill in the art can understand and implement them without creative effort.
The embodiments of the present invention have been described in detail above, and specific embodiments have been used herein to elaborate the present invention; the description of the above embodiments is only intended to help understand the method and apparatus of the present invention. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementations and application scope according to the idea of the present invention. In summary, this description should not be construed as a limitation of the present invention.

Claims (12)

1. A method for synthetic speech enhancement, characterized by comprising:
building an initial speech synthesis model based on training data, the training data comprising text data and speech data corresponding to the text data;
establishing an enhancement model, the enhancement model being used to model the mapping relationship between the synthetic speech parameters generated by the initial speech synthesis model and natural speech parameters;
after receiving a text to be synthesized, generating the synthetic speech parameters corresponding to the text to be synthesized according to the initial speech synthesis model and the enhancement model;
generating a continuous speech signal using the synthetic speech parameters.
2. The method according to claim 1, characterized in that establishing the enhancement model comprises:
generating the synthetic speech parameters of all training data according to the initial speech synthesis model;
extracting the natural speech parameters of all training data;
determining the topology of the enhancement model;
taking the paired synthetic speech parameters and natural speech parameters of the training data as a training set, and performing parameter training according to the topology to obtain the enhancement model.
3. The method according to claim 2, characterized in that the enhancement model is a linear-function mapping model, a GMM model or a DNN model.
4. The method according to claim 1, characterized in that the mapping relationship between the synthetic speech parameters generated by the initial speech synthesis model and the natural speech parameters is the conditional distribution of the natural speech parameters given the synthetic speech parameters generated by the initial speech synthesis model.
5. The method according to any one of claims 1 to 4, characterized in that the initial speech synthesis model comprises a duration model, a spectral model and a fundamental frequency model;
generating the synthetic speech parameters corresponding to the text to be synthesized according to the initial speech synthesis model and the enhancement model comprises:
enhancing the spectral model and/or fundamental frequency model in the initial speech synthesis model according to the enhancement model to obtain an enhanced spectral model and/or fundamental frequency model;
generating the spectral parameters and/or fundamental frequency parameters corresponding to the text to be synthesized using the enhanced spectral model and/or fundamental frequency model;
generating the other speech parameters, besides those of the spectral model and/or fundamental frequency model, corresponding to the text to be synthesized using the initial speech synthesis model.
6. The method according to claim 5, characterized in that enhancing the spectral model and/or fundamental frequency model in the initial speech synthesis model according to the enhancement model to obtain the enhanced spectral model and/or fundamental frequency model comprises:
obtaining the model parameters of the spectral model and/or fundamental frequency model from the initial speech synthesis model;
enhancing the model parameters using the enhancement model to obtain enhanced model parameters;
substituting the enhanced model parameters for the model parameters of the corresponding spectral model and/or fundamental frequency model to obtain the enhanced spectral model and/or fundamental frequency model.
7. The method according to any one of claims 1 to 4, characterized in that the initial speech synthesis model comprises a duration model, a spectral model and a fundamental frequency model;
generating the synthetic speech parameters corresponding to the text to be synthesized according to the initial speech synthesis model and the enhancement model comprises:
generating the duration parameters, spectral parameters and fundamental frequency parameters corresponding to the text to be synthesized using the initial speech synthesis model;
enhancing the spectral parameters and/or fundamental frequency parameters using the enhancement model to obtain enhanced spectral parameters and/or fundamental frequency parameters, and using the enhanced spectral parameters and/or fundamental frequency parameters as the spectral parameters and/or fundamental frequency parameters of the text to be synthesized when synthesizing speech.
8. A system for synthetic speech enhancement, characterized by comprising:
an initial model building module, configured to build an initial speech synthesis model based on training data, the training data comprising text data and speech data corresponding to the text data;
an enhancement model building module, configured to establish an enhancement model, the enhancement model being used to model the mapping relationship between the synthetic speech parameters generated by the initial speech synthesis model and natural speech parameters;
a receiving module, configured to receive a text to be synthesized;
a parameter generation module, configured to generate the synthetic speech parameters corresponding to the text to be synthesized according to the initial speech synthesis model and the enhancement model;
a synthesis module, configured to generate a continuous speech signal using the synthetic speech parameters.
9. The system according to claim 8, characterized in that the enhancement model building module comprises:
a synthetic speech parameter generating unit, configured to generate the synthetic speech parameters of all training data according to the initial speech synthesis model;
a natural speech parameter extraction unit, configured to extract the natural speech parameters of all training data;
a topology determining unit, configured to determine the topology of the enhancement model;
a training unit, configured to take the paired synthetic speech parameters and natural speech parameters of the training data as a training set and perform parameter training according to the topology to obtain the enhancement model.
10. The system according to claim 8 or 9, characterized in that the initial speech synthesis model comprises a duration model, a spectral model and a fundamental frequency model; the parameter generation module comprises:
a model enhancement unit, configured to enhance the spectral model and/or fundamental frequency model in the initial speech synthesis model according to the enhancement model to obtain an enhanced spectral model and/or fundamental frequency model;
an enhanced speech parameter generation unit, configured to generate the spectral parameters and/or fundamental frequency parameters corresponding to the text to be synthesized using the enhanced spectral model and/or fundamental frequency model;
an initial speech parameter generation unit, configured to generate the other speech parameters, besides those of the spectral model and/or fundamental frequency model, corresponding to the text to be synthesized using the initial speech synthesis model.
11. The system according to claim 10, characterized in that the model enhancement unit comprises:
a model parameter acquiring unit, configured to obtain the model parameters of the spectral model and/or fundamental frequency model from the initial speech synthesis model;
a model parameter enhancement unit, configured to enhance the model parameters using the enhancement model to obtain enhanced model parameters;
an enhanced model generation unit, configured to substitute the enhanced model parameters for the model parameters of the corresponding spectral model and/or fundamental frequency model to obtain the enhanced spectral model and/or fundamental frequency model.
12. The system according to claim 8 or 9, characterized in that the initial speech synthesis model comprises a duration model, a spectral model and a fundamental frequency model;
the parameter generation module comprises:
an initial speech parameter generation unit, configured to generate the duration parameters, spectral parameters and fundamental frequency parameters corresponding to the text to be synthesized using the initial speech synthesis model;
a parameter enhancement unit, configured to enhance the spectral parameters and/or fundamental frequency parameters using the enhancement model to obtain enhanced spectral parameters and/or fundamental frequency parameters, and to use the enhanced spectral parameters and/or fundamental frequency parameters as the spectral parameters and/or fundamental frequency parameters of the text to be synthesized when synthesizing speech.
CN201410182886.6A 2014-04-30 2014-04-30 Method and system for realizing synthetic speech enhancement Active CN105023574B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410182886.6A CN105023574B (en) 2014-04-30 2014-04-30 Method and system for realizing synthetic speech enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410182886.6A CN105023574B (en) 2014-04-30 2014-04-30 Method and system for realizing synthetic speech enhancement

Publications (2)

Publication Number Publication Date
CN105023574A true CN105023574A (en) 2015-11-04
CN105023574B CN105023574B (en) 2018-06-15

Family

ID=54413492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410182886.6A Active CN105023574B (en) 2014-04-30 2014-04-30 A kind of method and system for realizing synthesis speech enhan-cement

Country Status (1)

Country Link
CN (1) CN105023574B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108231086A (en) * 2017-12-24 2018-06-29 航天恒星科技有限公司 A kind of deep learning voice enhancer and method based on FPGA
CN109346058A (en) * 2018-11-29 2019-02-15 西安交通大学 A kind of speech acoustics feature expansion system
CN109427340A (en) * 2017-08-22 2019-03-05 杭州海康威视数字技术股份有限公司 A kind of sound enhancement method, device and electronic equipment
CN111613224A (en) * 2020-04-10 2020-09-01 云知声智能科技股份有限公司 Personalized voice synthesis method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038533A (en) * 1995-07-07 2000-03-14 Lucent Technologies Inc. System and method for selecting training text
CN1835074B (en) * 2006-04-07 2010-05-12 安徽中科大讯飞信息科技有限公司 Speaking person conversion method combined high layer discription information and model self adaption
CN1835075B (en) * 2006-04-07 2011-06-29 安徽中科大讯飞信息科技有限公司 Speech synthetizing method combined natural sample selection and acaustic parameter to build mould
CN102568476B (en) * 2012-02-21 2013-07-03 南京邮电大学 Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN102982809B (en) * 2012-12-11 2014-12-10 中国科学技术大学 Conversion method for sound of speaker
CN103065619B (en) * 2012-12-26 2015-02-04 安徽科大讯飞信息科技股份有限公司 Speech synthesis method and speech synthesis system

Also Published As

Publication number Publication date
CN105023574B (en) 2018-06-15

Similar Documents

Publication Publication Date Title
CN105118498B (en) The training method and device of phonetic synthesis model
CN107146624B (en) A kind of method for identifying speaker and device
CN102664016B (en) Singing evaluation method and system
CN101064104B (en) Emotion voice creating method based on voice conversion
CN109147758A (en) A kind of speaker's sound converting method and device
CN108847249A (en) Sound converts optimization method and system
CN105161092B (en) A kind of audio recognition method and device
CN104272382A (en) Method and system for template-based personalized singing synthesis
CN106057192A (en) Real-time voice conversion method and apparatus
CN106104674A (en) Mixing voice identification
CN105590625A (en) Acoustic model self-adaptive method and system
CN107316638A (en) A kind of poem recites evaluating method and system, a kind of terminal and storage medium
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN103578462A (en) Speech processing system
CN108831437A (en) A kind of song generation method, device, terminal and storage medium
CN102436807A (en) Method and system for automatically generating voice with stressed syllables
CN104916284A (en) Prosody and acoustics joint modeling method and device for voice synthesis system
CN105023570B (en) A kind of method and system for realizing sound conversion
CN111048064A (en) Voice cloning method and device based on single speaker voice synthesis data set
CN110491393A (en) The training method and relevant apparatus of vocal print characterization model
CN103456295B (en) Sing synthetic middle base frequency parameters and generate method and system
CN105023574A (en) Method and system of enhancing TTS
CN113724683B (en) Audio generation method, computer device and computer readable storage medium
CN109119067A (en) Phoneme synthesizing method and device
CN108877835A (en) Evaluate the method and system of voice signal

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 666, Wangjiang Road, High-tech Development Zone, Hefei, Anhui Province, 230088

Applicant after: Iflytek Co., Ltd.

Address before: No. 666, Wangjiang Road, High-tech Development Zone, Hefei, Anhui Province, 230088

Applicant before: Anhui USTC iFLYTEK Co., Ltd.

COR Change of bibliographic data
GR01 Patent grant
GR01 Patent grant