CN105023574B - Method and system for realizing synthesized speech enhancement - Google Patents

Method and system for realizing synthesized speech enhancement

Info

Publication number
CN105023574B
Authority
CN
China
Prior art keywords
model
parameter
enhancing
synthesis
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410182886.6A
Other languages
Chinese (zh)
Other versions
CN105023574A (en)
Inventor
孙见青
陈凌辉
凌震华
江源
胡国平
胡郁
刘庆峰
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201410182886.6A
Publication of CN105023574A
Application granted
Publication of CN105023574B


Abstract

The present invention relates to the field of speech synthesis and discloses a method and system for realizing synthesized speech enhancement. The method includes: building an initial speech synthesis model based on training data, the training data including text data and speech data corresponding to the text data; establishing an enhancement model used to simulate the mapping between the synthesized speech parameters generated by the initial speech synthesis model and natural speech parameters; after text to be synthesized is received, generating the synthesized speech parameters of the text to be synthesized according to the initial speech synthesis model and the enhancement model; and generating a continuous speech signal using the synthesized speech parameters. With the present invention, the enhancement effect of synthesized speech can be effectively improved.

Description

Method and system for realizing synthesized speech enhancement
Technical field
The present invention relates to the field of speech synthesis, and in particular to a method and system for realizing synthesized speech enhancement.
Background art
Realizing humanized, intelligent, and effective human-machine interaction and building an efficient, natural human-machine communication environment have become urgent demands of current information-technology applications. As an important practical speech technology, speech synthesis, also called text-to-speech (TTS), converts text information into a natural speech signal in real time. It replaces the cumbersome traditional approach of making machines "speak" by playing back recordings, saves system storage space, and plays an increasingly important role in today's growing information exchange, particularly in application fields such as dynamic queries where the information content changes frequently.
Speech synthesis systems based on parameter synthesis are widely used because of their good robustness and generalization. However, this method has a strong smoothing effect: the synthesized speech is flat and its sound quality is easily degraded, so its naturalness is not ideal and there is room for improvement in practical applications. Improving the naturalness of synthesized speech is therefore an important guarantee of a synthesis system's practicality.
To this end, the prior art mainly improves the naturalness of synthesized speech through synthesized speech enhancement. The main techniques can be summarized as follows: post-filtering of the generated spectral parameters or of the synthesized speech based on empirical knowledge such as human auditory characteristics, for example enhancing the formants of the synthesized speech or strengthening the dynamic characteristics of the generated spectral parameters, so as to improve the sound quality of the synthesized speech.
In fact, the acoustic characteristics of different speakers differ in their details, and even for the same speaker the acoustic characteristics of different sounds differ in detail. A synthesized speech enhancement method based on empirical knowledge can only make the enhanced synthesized speech conform to human hearing as a whole, and its enhancement effect is unsatisfactory.
Summary of the invention
Embodiments of the present invention provide a method and system for realizing synthesized speech enhancement, so as to improve the enhancement effect of synthesized speech.
To this end, embodiments of the present invention provide the following technical solutions:
A method for realizing synthesized speech enhancement, including:
building an initial speech synthesis model based on training data, the training data including text data and speech data corresponding to the text data, wherein the initial speech synthesis model includes a duration model, a spectral model, and a fundamental-frequency model;
establishing an enhancement model, the enhancement model being used to simulate the mapping between the synthesized speech parameters generated by the initial speech synthesis model and natural speech parameters;
after text to be synthesized is received, generating the synthesized speech parameters of the text to be synthesized according to the initial speech synthesis model and the enhancement model, including: performing enhancement processing on the spectral model and/or fundamental-frequency model in the initial speech synthesis model according to the enhancement model to obtain an enhanced spectral model and/or fundamental-frequency model; generating the spectral parameters and/or fundamental-frequency parameters of the text to be synthesized using the enhanced spectral model and/or fundamental-frequency model; and generating the other speech parameters of the text to be synthesized, apart from those of the spectral model and/or fundamental-frequency model, using the initial speech synthesis model;
generating a continuous speech signal using the synthesized speech parameters.
Preferably, establishing the enhancement model includes:
generating the synthesized speech parameters of all training data according to the initial speech synthesis model;
extracting the natural speech parameters of all training data;
determining the topology of the enhancement model;
taking the paired synthesized speech parameters and natural speech parameters of the training data as the training set, and performing parameter training according to the topology to obtain the enhancement model.
Preferably, the enhancement model is a linear-function mapping model, a GMM model, or a DNN model.
Preferably, the mapping between the synthesized speech parameters generated by the initial speech synthesis model and the natural speech parameters is the conditional distribution of the natural speech parameters given the synthesized speech parameters generated by the initial speech synthesis model.
Preferably, performing enhancement processing on the spectral model and/or fundamental-frequency model in the initial speech synthesis model according to the enhancement model to obtain the enhanced spectral model and/or fundamental-frequency model includes:
obtaining the model parameters of the spectral model and/or fundamental-frequency model from the initial speech synthesis model;
performing enhancement processing on the model parameters using the enhancement model to obtain enhanced model parameters;
substituting the enhanced model parameters for the model parameters of the corresponding spectral model and/or fundamental-frequency model to obtain the enhanced spectral model and/or fundamental-frequency model.
A system for realizing synthesized speech enhancement, including:
an initial model building module, configured to build an initial speech synthesis model based on training data, the training data including text data and speech data corresponding to the text data, wherein the initial speech synthesis model includes a duration model, a spectral model, and a fundamental-frequency model;
an enhancement model building module, configured to establish an enhancement model used to simulate the mapping between the synthesized speech parameters generated by the initial speech synthesis model and natural speech parameters;
a receiving module, configured to receive text to be synthesized;
a parameter generation module, configured to generate the synthesized speech parameters of the text to be synthesized according to the initial speech synthesis model and the enhancement model; the parameter generation module includes: a model enhancement unit, configured to perform enhancement processing on the spectral model and/or fundamental-frequency model in the initial speech synthesis model according to the enhancement model to obtain an enhanced spectral model and/or fundamental-frequency model; an enhanced speech parameter generation unit, configured to generate the spectral parameters and/or fundamental-frequency parameters of the text to be synthesized using the enhanced spectral model and/or fundamental-frequency model; and an initial speech parameter generation unit, configured to generate the other speech parameters of the text to be synthesized, apart from those of the spectral model and/or fundamental-frequency model, using the initial speech synthesis model;
a synthesis module, configured to generate a continuous speech signal using the synthesized speech parameters.
Preferably, the enhancement model building module includes:
a synthesized speech parameter generation unit, configured to generate the synthesized speech parameters of all training data according to the initial speech synthesis model;
a natural speech parameter extraction unit, configured to extract the natural speech parameters of all training data;
a topology determination unit, configured to determine the topology of the enhancement model;
a training unit, configured to take the paired synthesized speech parameters and natural speech parameters of the training data as the training set and perform parameter training according to the topology to obtain the enhancement model.
Preferably, the model enhancement unit includes:
a model parameter acquisition unit, configured to obtain the model parameters of the spectral model and/or fundamental-frequency model from the initial speech synthesis model;
a model parameter enhancement unit, configured to perform enhancement processing on the model parameters using the enhancement model to obtain enhanced model parameters;
an enhanced model generation unit, configured to substitute the enhanced model parameters for the model parameters of the corresponding spectral model and/or fundamental-frequency model to obtain the enhanced spectral model and/or fundamental-frequency model.
In the method and system for realizing synthesized speech enhancement provided by the embodiments of the present invention, a statistics-based method builds an enhancement model that simulates the mapping between the synthesized speech parameters generated by a traditional speech synthesis model and natural speech parameters; the enhancement model and the traditional speech synthesis model are then used to generate the synthesized speech parameters of the text to be synthesized, and a continuous speech signal is generated from those parameters. Because the scheme of the embodiments of the present invention takes natural acoustic parameters as guidance, it has a stronger grasp of the acoustic-parameter details of different speakers and of the same speaker producing different sounds, can capture the characteristics of a specific speaker, and thus achieves a better synthesized speech enhancement effect.
Description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them.
Fig. 1 is a flowchart of a method for realizing synthesized speech enhancement according to an embodiment of the present invention;
Fig. 2 is a flowchart of one way of synthesizing speech according to the initial speech synthesis model and the enhancement model in an embodiment of the present invention;
Fig. 3 is a flowchart of another way of synthesizing speech according to the initial speech synthesis model and the enhancement model in an embodiment of the present invention;
Fig. 4 is a structural diagram of a system for realizing synthesized speech enhancement according to an embodiment of the present invention;
Fig. 5 is a structural diagram of one specific implementation of the parameter generation module in an embodiment of the present invention;
Fig. 6 is a structural diagram of another specific implementation of the parameter generation module in an embodiment of the present invention.
Specific embodiments
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the accompanying drawings and implementations.
The acoustic characteristics of different speakers differ in their details, and even for the same speaker the acoustic characteristics of different sounds differ in detail. Existing synthesized speech enhancement methods apply post-filtering to the generated spectral parameters or to the synthesized speech based on empirical knowledge such as human auditory characteristics; they pay no attention to the detailed characteristics of the speaker's acoustic parameters, can only make the enhanced synthesized speech conform to human hearing as a whole, and cannot achieve an ideal enhancement effect. In view of these problems in the prior art, embodiments of the present invention provide a method and system for realizing synthesized speech enhancement: a statistics-based method builds an enhancement model that simulates the mapping between the synthesized speech parameters generated by a traditional speech synthesis model and natural speech parameters; the enhancement model and the traditional speech synthesis model then generate the synthesized speech parameters of the text to be synthesized, from which a continuous speech signal is generated.
As shown in Fig. 1, a flowchart of the method for realizing synthesized speech enhancement according to an embodiment of the present invention, the method includes the following steps:
Step 101: build an initial speech synthesis model based on training data, the training data including text data and speech data corresponding to the text data.
The initial speech synthesis model can be built using a traditional parameter-synthesis method and includes the binary decision tree, spectral model, fundamental-frequency model, duration model, etc. corresponding to each basic synthesis unit. For example, an HMM-based parameter-synthesis method may be used; for the spectral model, a GMM (Gaussian Mixture Model) is used to simulate the spectral distribution of each leaf node, and the number of Gaussians can usually be set to a positive integer determined with reference to the scale of the training data, for example 1.
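As a hedged illustration of the single-Gaussian leaf-node case mentioned above (all names are invented for this sketch; the patent does not prescribe an implementation), the spectral distribution of one decision-tree leaf can be estimated from the training frames aligned to it:

```python
import numpy as np

def fit_leaf_gaussian(frames):
    """Estimate the mean and diagonal variance of the spectral frames
    assigned to one leaf node (the 1-Gaussian 'GMM' case)."""
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6  # small floor to avoid zero variance
    return mean, var

# toy data: 200 four-dimensional spectral frames for one leaf
rng = np.random.default_rng(0)
frames = rng.normal(loc=2.0, scale=0.5, size=(200, 4))
mean, var = fit_leaf_gaussian(frames)
print(mean.shape, var.shape)  # (4,) (4,)
```

In a full system one such distribution would be fitted per leaf of the spectral decision tree; with more than one Gaussian, an EM fit would replace the closed-form mean/variance estimate.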
Step 102: establish an enhancement model, the enhancement model being used to simulate the mapping between the synthesized speech parameters generated by the initial speech synthesis model and natural speech parameters.
Since the setting and optimization of the enhancement model have an important influence on the speech-enhancement effect, the embodiments of the present invention adopt a data-driven approach to building the enhancement model, using natural acoustic parameters as guidance to truly reflect the acoustic-parameter details of different speakers and of the same speaker producing different sounds, thereby improving the effect of synthesized speech enhancement.
The enhancement model is built as follows:
(1) Generate the synthesized speech parameters of all training data according to the initial speech synthesis model.
(2) Extract the natural speech parameters of all training data.
(3) Determine the topology of the enhancement model.
(4) Take the paired synthesized speech parameters and natural speech parameters of the training data as the training set, and perform parameter training according to the topology to obtain the enhancement model.
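Step (4) above can be sketched in a minimal form. The sketch below fits the simplest topology the text allows, a linear-function mapping y ≈ A·x + b, by least squares on paired synthesized/natural parameters; all function and variable names are illustrative, not from the patent, and a GMM or DNN mapping would replace the fitting step:

```python
import numpy as np

def train_linear_enhancer(X, Y):
    """X: (N, D) synthesized parameters, Y: (N, D) natural parameters.
    Returns (A, b) such that Y is approximated by X @ A + b."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)     # minimize ||X1 W - Y||
    return W[:-1], W[-1]                           # A is W[:-1], b the last row

def enhance(x, A, b):
    return x @ A + b

# toy data: natural parameters are an exact affine transform of synthesized ones
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
A_true = np.array([[1.2, 0.0, 0.1], [0.0, 0.9, 0.0], [0.2, 0.0, 1.1]])
b_true = np.array([0.5, -0.3, 0.0])
Y = X @ A_true + b_true
A, b = train_linear_enhancer(X, Y)
print(np.allclose(enhance(X, A, b), Y, atol=1e-6))  # True
```

On noiseless toy data the least-squares fit recovers the affine map exactly; real synthesized/natural pairs are noisy, which is where richer GMM or DNN topologies pay off.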
It should be noted that, in practical applications, enhancement models for the spectral characteristics and/or the fundamental-frequency characteristics can be built separately. For example, for the spectral enhancement model, the specific building process is as follows:
(1) Generate the synthesized spectral parameters of all training data using the spectral model in the initial speech synthesis model.
According to the spectral model and the forced-alignment result, the spectral-model sequence corresponding to the training data can be determined. Specifically, for a single basic speech unit, the selected spectral model is copied repeatedly according to the forced-alignment duration information to obtain the spectral-feature sequence model of that basic speech unit.
The total log-likelihood of the spectral-model sequence corresponding to the training data is calculated as follows:

log P(W·C_s | M_s, U_s) = -(1/2) (W·C_s - M_s)^T U_s^(-1) (W·C_s - M_s) + const    (1)

where W is the window-function matrix used to compute the dynamic parameters, C_s is the spectral parameter to be generated, and M_s and U_s are respectively the mean and covariance matrix of the spectral model. The total likelihood of the spectral models is clearly a function of the target spectral feature vector.
(2) Extract the natural spectral parameters of all training data.
(3) Determine the topology of the spectral enhancement model.
The spectral enhancement model is used to simulate the mapping between the spectral parameters generated by the traditional speech synthesis model and the natural spectral parameters. In the embodiment of the present invention, a linear-function mapping model may be used, as may statistical models such as a GMM model or a DNN model. In general, the finer the model, the better its simulation effect when the data are sufficient.
(4) Perform parameter training on the spectral enhancement model according to the topology to obtain the optimized spectral enhancement model, establishing the conditional distribution p(y_t|x_t) of the natural spectral parameter y_t given the synthesized spectral parameter x_t.
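For a single-Gaussian joint model of (x_t, y_t), the conditional distribution p(y_t|x_t) established here takes the standard closed form below; this is shown for illustration only and is not spelled out in the patent:

```latex
% Joint Gaussian with mean [\mu_x; \mu_y] and covariance blocks
% \Sigma_{xx}, \Sigma_{xy}, \Sigma_{yx}, \Sigma_{yy}:
p(y_t \mid x_t) = \mathcal{N}\!\left(y_t;\;
  \mu_y + \Sigma_{yx}\Sigma_{xx}^{-1}(x_t - \mu_x),\;
  \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}\right)
```

With a multi-component GMM, the same conditional form holds per component, weighted by the component posteriors given x_t.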
The building process of the enhancement model for the fundamental-frequency characteristics is similar to the above and is not described in detail here.
Step 103: after text to be synthesized is received, generate the synthesized speech parameters of the text to be synthesized according to the initial speech synthesis model and the enhancement model.
Based on the above enhancement model, in practical applications the initial speech synthesis model or the synthesized speech parameters can be enhanced in various ways, all of which can achieve a good enhancement effect; the specific implementation process is described in detail later.
Step 104: generate a continuous speech signal using the synthesized speech parameters.
In the method for realizing synthesized speech enhancement provided by the embodiment of the present invention, a statistics-based method builds an enhancement model that simulates the mapping between the synthesized speech parameters generated by a traditional speech synthesis model and natural speech parameters; the enhancement model and the traditional speech synthesis model then generate the synthesized speech parameters of the text to be synthesized, from which a continuous speech signal is generated. Because the enhancement model takes natural acoustic parameters as guidance, it has a stronger grasp of the acoustic-parameter details of different speakers and of the same speaker producing different sounds, can capture the characteristics of a specific speaker, and makes the synthesized speech enhancement effect better. Moreover, the scheme of the embodiment of the present invention does not increase the amount of computation in the actual synthesis task, which is conducive to real-time products.
It should be noted that, in practical applications, there are many ways to generate the synthesized speech parameters according to the initial speech synthesis model and the enhancement model. For example, the corresponding enhancement model can be used to perform enhancement processing on the spectral model and/or fundamental-frequency model in the initial speech synthesis model; the enhanced spectral model and/or fundamental-frequency model then generates the spectral parameters and/or fundamental-frequency parameters of the text to be synthesized, the other speech-synthesis parameters are generated by the initial speech synthesis model, and a continuous speech signal is generated from all of these parameters. Alternatively, the initial speech synthesis model can first generate the speech-synthesis parameters of the text to be synthesized (including duration parameters, spectral parameters, and fundamental-frequency parameters); the corresponding enhancement model then performs enhancement processing on some of these parameters (the spectral parameters and/or fundamental-frequency parameters), and finally a continuous speech signal is generated from the enhanced parameters together with the unenhanced parameters (mainly the duration parameters).
The process of generating synthesized speech parameters according to the initial speech synthesis model and the enhancement model in the embodiments of the present invention is described in detail below by way of example.
As shown in Fig. 2, a flowchart of one way of generating synthesized speech parameters according to the initial speech synthesis model and the enhancement model in an embodiment of the present invention, the process includes the following steps:
Step 201: generate the duration parameters and fundamental-frequency parameters of the text to be synthesized using the initial speech synthesis model.
Step 202: perform enhancement processing on the spectral model in the initial speech synthesis model according to the enhancement model to obtain an enhanced spectral model.
First, the model parameters, denoted x_t, are obtained from the initial spectral model, for example a GMM-based spectral model. Then the pre-trained enhancement model is used to perform enhancement processing on the model parameters x_t, i.e., the enhanced model parameters y_t are obtained according to p(y_t|x_t). Finally, the enhanced model parameters y_t replace the model parameters of the spectral model to obtain a new spectral model; this model is the enhanced spectral model.
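A minimal sketch of this model-domain enhancement, under the assumption of a linear enhancement model and per-leaf Gaussian means (the names and the dictionary layout are illustrative, not from the patent): each leaf's spectral mean x_t is mapped to an enhanced mean y_t and written back.

```python
import numpy as np

def enhance_spectral_model(leaf_means, A, b):
    """leaf_means: dict node_id -> mean vector of that leaf's Gaussian.
    Returns a copy in which every mean has been passed through the
    (assumed linear) enhancement mapping y = x @ A + b."""
    return {node: mean @ A + b for node, mean in leaf_means.items()}

A = np.eye(2) * 1.1  # toy enhancer: mild uniform sharpening of the spectrum
b = np.zeros(2)
leaf_means = {"leaf_0": np.array([1.0, 2.0]),
              "leaf_1": np.array([0.5, -1.0])}
enhanced = enhance_spectral_model(leaf_means, A, b)
print(enhanced["leaf_0"])  # [1.1 2.2]
```

Because the mapping is applied once to the model, later synthesis requests use the enhanced model directly, which matches the text's point that no extra computation is added at synthesis time.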
Step 203: generate the spectral parameters of the text to be synthesized using the enhanced spectral model.
Step 204: generate a continuous speech signal using the duration parameters, fundamental-frequency parameters, and spectral parameters of the text to be synthesized.
It should be noted that, in practical applications, an enhancement model for the spectral characteristics and an enhancement model for the fundamental-frequency characteristics can be generated separately. Therefore, the enhancement model for the spectral characteristics can be used alone to enhance the spectral model in the initial speech synthesis model, or the enhancement model for the fundamental-frequency characteristics can be used alone to enhance the fundamental-frequency model in the initial speech synthesis model, or the two enhancement models for the different characteristics can be used together to enhance the spectral model and the fundamental-frequency model respectively. Correspondingly, the enhanced spectral model and/or fundamental-frequency model yields the fundamental-frequency parameters and/or spectral parameters of the text to be synthesized; using these speech-synthesis parameters together with the other speech-synthesis parameters obtained from the initial speech synthesis model, a continuous speech signal can be generated.
The process of generating a continuous speech signal using the above speech-synthesis parameters is similar to the prior art and is not repeated here.
It can be seen that the method of the embodiment of the present invention performs enhancement processing on the traditional speech synthesis model; in subsequent synthesis tasks, only the enhanced speech synthesis model needs to be used to obtain the corresponding speech-synthesis parameters, so the amount of computation is not increased and a good enhancement effect can be achieved.
As shown in Fig. 3, a flowchart of another way of synthesizing speech according to the initial speech synthesis model and the enhancement model in an embodiment of the present invention:
Step 301: generate the duration parameters, spectral parameters, and fundamental-frequency parameters of the text to be synthesized using the initial speech synthesis model.
Step 302: perform enhancement processing on the spectral parameters using the enhancement model to obtain enhanced spectral parameters.
Specifically, the spectral parameter C_s in formula (1) above is substituted for x_t in the enhancement model p(y_t|x_t), giving the enhanced spectral parameter y_t.
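The parameter-domain variant of step 302 can be sketched the same way, applying the (here assumed linear) mapping frame by frame to a generated spectral trajectory rather than to model parameters; names are illustrative:

```python
import numpy as np

def enhance_trajectory(C, A, b):
    """C: (T, D) generated spectral frames. Applies the enhancement
    mapping to every frame and returns the enhanced (T, D) trajectory."""
    return C @ A + b

C = np.tile(np.array([1.0, 0.0]), (5, 1))  # 5 identical toy frames
A = np.array([[2.0, 0.0], [0.0, 2.0]])
b = np.array([0.0, 1.0])
out = enhance_trajectory(C, A, b)
print(out[0])  # [2. 1.]
```

Unlike the model-domain variant, this mapping runs once per synthesized utterance, which is the trade-off the text's comparison of Figs. 2 and 3 turns on.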
Step 303: generate a continuous speech signal using the duration parameters, fundamental-frequency parameters, and enhanced spectral parameters of the text to be synthesized.
As can be seen from the above flow, the flow shown in Fig. 3 differs from the flow shown in Fig. 2 in that the fundamental-frequency parameters, spectral parameters, and duration parameters of the text to be synthesized are first generated by the initial speech synthesis model, and the corresponding enhancement model then performs enhancement processing on the spectral parameters among them, so that the enhanced speech-synthesis parameters better reflect the detailed differences in acoustic characteristics between different speakers and between different sounds of the same speaker. These enhanced speech-synthesis parameters are combined with the other speech-synthesis parameters obtained from the traditional speech synthesis model and passed through a synthesizer to synthesize speech.
It should be noted that, in practical applications, a flow similar to Fig. 3 can likewise be used to perform enhancement processing on the fundamental-frequency parameters with the corresponding enhancement model, obtaining enhanced fundamental-frequency parameters; a continuous speech signal is then generated using the duration parameters, spectral parameters, and enhanced fundamental-frequency parameters of the text to be synthesized. Alternatively, the enhancement model for the spectral characteristics can enhance the spectral parameters generated by the initial speech synthesis model while the enhancement model for the fundamental-frequency characteristics enhances the fundamental-frequency parameters generated by the initial speech synthesis model; a continuous speech signal is then generated using the duration parameters generated by the initial speech synthesis model together with the enhanced fundamental-frequency and spectral parameters.
Correspondingly, an embodiment of the present invention also provides a system for realizing synthesized speech enhancement; Fig. 4 is a structural diagram of the system.
In this embodiment, the system comprises:
an initial model building module 401, configured to build an initial speech synthesis model based on training data, the training data including text data and speech data corresponding to the text data;
an enhancement model building module 402, configured to establish an enhancement model used to simulate the mapping between the synthesized speech parameters generated by the initial speech synthesis model and natural speech parameters;
a receiving module 403, configured to receive text to be synthesized;
a parameter generation module 404, configured to generate the synthesized speech parameters of the text to be synthesized according to the initial speech synthesis model and the enhancement model;
a synthesis module 405, configured to generate a continuous speech signal using the synthesized speech parameters.
The above initial model building module 401 can build the initial speech synthesis model using a traditional parameter-synthesis method; the initial speech synthesis model includes the binary decision tree, spectral model, fundamental-frequency model, duration model, etc. corresponding to each basic synthesis unit. For example, an HMM-based parameter-synthesis method may be used; for the spectral model, a GMM is used to simulate the spectral distribution of each leaf node, and the number of Gaussians can usually be set to a positive integer determined with reference to the scale of the training data, for example 1.
Since the setting and optimization of the enhancement model have an important influence on the speech-enhancement effect, in the embodiment of the present invention the enhancement model building module 402 adopts a data-driven approach to building the enhancement model, using natural acoustic parameters as guidance to truly reflect the acoustic-parameter details of different speakers and of the same speaker producing different sounds, thereby improving the effect of synthesized speech enhancement.
The above enhancement model building module 402 may specifically include the following units:
a synthesized speech parameter generation unit, configured to generate the synthesized speech parameters of all training data according to the initial speech synthesis model;
a natural speech parameter extraction unit, configured to extract the natural speech parameters of all training data;
a topology determination unit, configured to determine the topology of the enhancement model;
a training unit, configured to take the paired synthesized speech parameters and natural speech parameters of the training data as the training set and perform parameter training according to the topology to obtain the enhancement model.
It should be noted that in practical applications, above-mentioned enhancing model building module 402 can be respectively built for frequency spectrum The enhancing model of characteristic and/or fundamental frequency characteristic.Correspondingly, in enhancing model of the structure for spectral characteristic, the synthesis language Sound parameter generating unit needs the skilled synthesis frequency spectrum parameter of spectral model generation institute in initial speech synthetic model; Natural-sounding parameter extraction unit needs to extract the natural frequency spectrum parameter of all training datas.Similarly, in structure for fundamental frequency During the enhancing model of characteristic, the synthesis speech parameter generation unit needs the fundamental frequency model in initial speech synthetic model The skilled synthesis base frequency parameters of generation institute;Natural-sounding parameter extraction unit needs to extract the natural fundamental frequency of all training datas Parameter.
Based on the enhancement model established by the enhancement model building module 402, the parameter generation module 404 may enhance either the initial speech synthesis model or the synthesized speech parameters in several ways, all of which yield a good enhancement effect. Accordingly, the parameter generation module 404 admits several specific implementation structures, which are described in detail below.
In the system for enhancing synthesized speech provided by this embodiment of the present invention, a statistical method builds an enhancement model that simulates the mapping between the synthesized speech parameters generated by a conventional speech synthesis model and the natural speech parameters; the enhancement model and the conventional speech synthesis model then jointly generate the synthesized speech parameters of the text to be synthesized, from which the continuous speech signal is generated. Because the enhancement model is guided by the natural acoustic parameters, it has a firmer grasp of the fine acoustic details of different speakers, and of different sounds produced by the same speaker, and can thus capture the characteristics of a specific speaker, making the enhancement of the synthesized speech more effective.
FIG. 5 is a schematic structural diagram of one specific implementation of the parameter generation module in an embodiment of the present invention.
In this embodiment, the parameter generation module includes:
A model enhancement unit 501, configured to enhance the spectral model and/or the F0 model in the initial speech synthesis model according to the enhancement model, obtaining an enhanced spectral model and/or F0 model;
An enhanced speech parameter generation unit 502, configured to generate the spectral parameters and/or F0 parameters of the text to be synthesized using the enhanced spectral model and/or F0 model;
An initial speech parameter generation unit 503, configured to generate, using the initial speech synthesis model, the other speech parameters of the text to be synthesized, i.e. those not produced by the spectral model and/or the F0 model.
It should be noted that in practical applications, aforementioned enhancing model building module 402 can be respectively generated for frequency spectrum The enhancing model of characteristic and the enhancing model for fundamental frequency characteristic, therefore, in the embodiment shown in fig. 5, model enhancement unit 501 Can enhancing processing individually be carried out to the spectral model in initial speech synthetic model using the enhancing model for spectral characteristic, Or enhancing processing is individually carried out to the fundamental frequency model in initial speech synthetic model using the enhancing model for fundamental frequency characteristic, It can also integrate and the enhancing model of different characteristics is directed to respectively to the frequency spectrum mould in initial speech synthetic model using above two Type and fundamental frequency model carry out enhancing processing.Correspondingly, enhancing speech parameter generation unit 502 can utilize enhanced frequency spectrum mould Type and/or fundamental frequency model obtain corresponding to the base frequency parameters and/or frequency spectrum parameter of text to be synthesized, 405 profit of synthesis module in Fig. 4 The other phonetic synthesis parameters obtained with these phonetic synthesis parameters and initial speech parameter generating unit 503, you can generation connects Continuous voice signal.
It can be seen that the system of this embodiment of the present invention enhances the conventional speech synthesis model itself, so that subsequent synthesis tasks only need to obtain the speech synthesis parameters from the enhanced model; a good enhancement effect is therefore achieved without increasing the amount of computation.
The model enhancement unit 501 may obtain the enhanced spectral model and/or F0 model by enhancing the model means. One specific structure of the model enhancement unit 501 includes the following units:
A model parameter acquisition unit, configured to obtain the model parameters of the spectral model and/or the F0 model from the initial speech synthesis model;
A model parameter enhancement unit, configured to enhance these model parameters using the enhancement model, obtaining enhanced model parameters;
An enhanced model generation unit, configured to substitute the enhanced model parameters for the model parameters of the corresponding spectral model and/or F0 model, obtaining the enhanced spectral model and/or F0 model.
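The mean-enhancement flow above can be sketched as follows: a learned affine map is applied to each leaf Gaussian's mean, and the enhanced means are substituted back into the model while the variances are kept. Here W and b are illustrative hand-set values standing in for a trained enhancement model, and the dictionary layout is an assumed simplification of the actual model storage.

```python
import numpy as np

# Illustrative enhancement map (would come from the trained model).
W = np.array([[1.2, 0.0], [0.0, 1.1]])
b = np.array([-0.1, 0.05])

spectral_model = {  # leaf name -> {"mean": ..., "var": ...}
    "leaf_0": {"mean": np.array([1.0, 0.5]), "var": np.array([0.1, 0.1])},
    "leaf_1": {"mean": np.array([0.2, 0.9]), "var": np.array([0.2, 0.1])},
}

def enhance_model(model, W, b):
    """Substitute enhanced means back into the model; variances untouched."""
    enhanced = {}
    for leaf, params in model.items():
        enhanced[leaf] = {"mean": W @ params["mean"] + b,
                          "var": params["var"].copy()}
    return enhanced

enhanced_model = enhance_model(spectral_model, W, b)
print(enhanced_model["leaf_0"]["mean"])  # enhanced mean for leaf_0
```

Because the enhancement is applied once, offline, to the model parameters, synthesis-time cost is unchanged, which matches the point made about the FIG. 5 structure.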
FIG. 6 is a schematic structural diagram of another specific implementation of the parameter generation module in an embodiment of the present invention.
In this embodiment, the parameter generation module includes:
An initial speech parameter generation unit 601, configured to generate the duration parameters, spectral parameters, and F0 parameters of the text to be synthesized using the initial speech synthesis model;
A parameter enhancement unit 602, configured to enhance the spectral parameters and/or F0 parameters using the enhancement model, obtaining enhanced spectral parameters and/or F0 parameters, which serve as the spectral parameters and/or F0 parameters of the text to be synthesized when the speech is synthesized.
Unlike the structure of the block diagram shown in FIG. 5, in this embodiment the initial speech parameter generation unit 601 first generates the F0 parameters, spectral parameters, and duration parameters of the text to be synthesized using the initial speech synthesis model, and the parameter enhancement unit 602 then enhances the spectral and/or F0 parameters using the corresponding enhancement models, so that the enhanced speech synthesis parameters better reflect the fine acoustic differences between different speakers and between different sounds of the same speaker. The synthesis module 405 in FIG. 4 combines these enhanced parameters with the other speech synthesis parameters obtained from the conventional speech synthesis model, and the speech is synthesized by a vocoder.
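The parameter-level flow of FIG. 6 can be sketched as follows: the initial model first generates full parameter trajectories, then the enhancement map is applied frame by frame to the spectral track only, while the durations pass through unchanged. The generation function, the map (W, b), and all dimensions are illustrative stand-ins, not the patent's actual models.

```python
import numpy as np

def generate_initial(n_frames, dim, rng):
    """Stand-in for the initial model's parameter generation step."""
    return rng.normal(size=(n_frames, dim))

rng = np.random.default_rng(1)
spec = generate_initial(n_frames=10, dim=3, rng=rng)   # spectral track
durations = np.array([3, 4, 3])                        # frames per unit

W = np.eye(3) * 1.05                                   # illustrative map
b = np.zeros(3)
spec_enhanced = spec @ W.T + b                         # frame-wise enhancement

# A vocoder would now consume (durations, spec_enhanced, f0, ...).
print(spec_enhanced.shape)  # (10, 3): same shape, enhanced values
```

The trade-off versus the FIG. 5 structure is that this mapping runs at synthesis time for every generated frame, but it leaves the underlying models untouched.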
With the system for enhancing synthesized speech of this embodiment of the present invention, the fine acoustic details of different speakers, and of different sounds produced by the same speaker, are obtained statistically and then used to enhance the synthesized speech, yielding a better enhancement effect.
Each embodiment in this specification is described in a progressive manner; for identical or similar parts the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to illustrate the present invention; the description of the above embodiments is only intended to help understand the method and apparatus of the present invention. Meanwhile, those of ordinary skill in the art may, according to the idea of the present invention, make changes to the specific implementations and application scope. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (8)

  1. A method for enhancing synthesized speech, characterized by comprising:
    Constructing an initial speech synthesis model based on training data, the training data comprising text data and speech data corresponding to the text data; the initial speech synthesis model comprising: a duration model, a spectral model, and a fundamental frequency model;
    Establishing an enhancement model, the enhancement model being used to simulate the mapping between the synthesized speech parameters generated by the initial speech synthesis model and the natural speech parameters;
    After a text to be synthesized is received, generating the synthesized speech parameters of the text to be synthesized according to the initial speech synthesis model and the enhancement model, comprising: enhancing the spectral model and/or the fundamental frequency model in the initial speech synthesis model according to the enhancement model, obtaining an enhanced spectral model and/or fundamental frequency model; generating the spectral parameters and/or fundamental frequency parameters of the text to be synthesized using the enhanced spectral model and/or fundamental frequency model; and generating, using the initial speech synthesis model, the other speech parameters of the text to be synthesized, other than those produced by the spectral model and/or the fundamental frequency model; and
    Generating a continuous speech signal using the synthesized speech parameters.
  2. The method according to claim 1, characterized in that establishing the enhancement model comprises:
    Generating the synthesized speech parameters of all training data according to the initial speech synthesis model;
    Extracting the natural speech parameters of all training data;
    Determining the topology of the enhancement model;
    Taking the paired synthesized and natural speech parameters of the training data as a training set, and performing parameter training according to the topology to obtain the enhancement model.
  3. The method according to claim 2, characterized in that the enhancement model is: a linear-function mapping model, a GMM model, or a DNN model.
  4. The method according to claim 1, characterized in that the mapping between the synthesized speech parameters generated by the initial speech synthesis model and the natural speech parameters is the conditional distribution between the synthesized speech parameters generated by the initial speech synthesis model and the natural speech parameters.
  5. The method according to claim 1, characterized in that enhancing the spectral model and/or the fundamental frequency model in the initial speech synthesis model according to the enhancement model, obtaining an enhanced spectral model and/or fundamental frequency model, comprises:
    Obtaining the model parameters of the spectral model and/or the fundamental frequency model from the initial speech synthesis model;
    Enhancing the model parameters using the enhancement model, obtaining enhanced model parameters;
    Substituting the enhanced model parameters for the model parameters of the corresponding spectral model and/or fundamental frequency model, obtaining the enhanced spectral model and/or fundamental frequency model.
  6. A system for enhancing synthesized speech, characterized by comprising:
    An initial model building module, configured to construct an initial speech synthesis model based on training data, the training data comprising text data and speech data corresponding to the text data; the initial speech synthesis model comprising: a duration model, a spectral model, and a fundamental frequency model;
    An enhancement model building module, configured to establish an enhancement model, the enhancement model being used to simulate the mapping between the synthesized speech parameters generated by the initial speech synthesis model and the natural speech parameters;
    A receiving module, configured to receive a text to be synthesized;
    A parameter generation module, configured to generate the synthesized speech parameters of the text to be synthesized according to the initial speech synthesis model and the enhancement model; the parameter generation module comprising: a model enhancement unit, configured to enhance the spectral model and/or the fundamental frequency model in the initial speech synthesis model according to the enhancement model, obtaining an enhanced spectral model and/or fundamental frequency model; an enhanced speech parameter generation unit, configured to generate the spectral parameters and/or fundamental frequency parameters of the text to be synthesized using the enhanced spectral model and/or fundamental frequency model; and an initial speech parameter generation unit, configured to generate, using the initial speech synthesis model, the other speech parameters of the text to be synthesized, other than those produced by the spectral model and/or the fundamental frequency model;
    A synthesis module, configured to generate a continuous speech signal using the synthesized speech parameters.
  7. The system according to claim 6, characterized in that the enhancement model building module comprises:
    A synthesized speech parameter generation unit, configured to generate the synthesized speech parameters of all training data according to the initial speech synthesis model;
    A natural speech parameter extraction unit, configured to extract the natural speech parameters of all training data;
    A topology determination unit, configured to determine the topology of the enhancement model;
    A training unit, configured to take the paired synthesized and natural speech parameters of the training data as a training set and to perform parameter training according to the topology, obtaining the enhancement model.
  8. The system according to claim 6, characterized in that the model enhancement unit comprises:
    A model parameter acquisition unit, configured to obtain the model parameters of the spectral model and/or the fundamental frequency model from the initial speech synthesis model;
    A model parameter enhancement unit, configured to enhance the model parameters using the enhancement model, obtaining enhanced model parameters;
    An enhanced model generation unit, configured to substitute the enhanced model parameters for the model parameters of the corresponding spectral model and/or fundamental frequency model, obtaining the enhanced spectral model and/or fundamental frequency model.
CN201410182886.6A 2014-04-30 2014-04-30 A kind of method and system for realizing synthesis speech enhancement Active CN105023574B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410182886.6A CN105023574B (en) 2014-04-30 2014-04-30 A kind of method and system for realizing synthesis speech enhancement


Publications (2)

Publication Number Publication Date
CN105023574A CN105023574A (en) 2015-11-04
CN105023574B true CN105023574B (en) 2018-06-15

Family

ID=54413492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410182886.6A Active CN105023574B (en) A kind of method and system for realizing synthesis speech enhancement

Country Status (1)

Country Link
CN (1) CN105023574B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109427340A (en) * 2017-08-22 2019-03-05 杭州海康威视数字技术股份有限公司 A kind of sound enhancement method, device and electronic equipment
CN108231086A (en) * 2017-12-24 2018-06-29 航天恒星科技有限公司 A kind of deep learning voice enhancer and method based on FPGA
CN109346058A (en) * 2018-11-29 2019-02-15 西安交通大学 A kind of speech acoustics feature expansion system
CN111613224A (en) * 2020-04-10 2020-09-01 云知声智能科技股份有限公司 Personalized voice synthesis method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038533A (en) * 1995-07-07 2000-03-14 Lucent Technologies Inc. System and method for selecting training text
CN1835075A (en) * 2006-04-07 2006-09-20 Anhui USTC iFlytek Information Technology Co., Ltd. Speech synthesis method combining natural sample selection and acoustic parameter modeling
CN1835074A (en) * 2006-04-07 2006-09-20 Anhui USTC iFlytek Information Technology Co., Ltd. Speaker conversion method combining high-level description information and model adaptation
CN102568476A (en) * 2012-02-21 2012-07-11 Nanjing University of Posts and Telecommunications Voice conversion method based on self-organizing feature map network clustering and radial basis networks
CN102982809A (en) * 2012-12-11 2013-03-20 University of Science and Technology of China Speaker voice conversion method
CN103065619A (en) * 2012-12-26 2013-04-24 Anhui USTC iFlytek Information Technology Co., Ltd. Speech synthesis method and speech synthesis system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Estimation of Window Coefficients for Dynamic Feature Extraction for HMM-Based Speech Synthesis"; Ling-Hui Chen et al.; INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association; 2011-08-31; pp. 1801-1804 *
"Research on Speech Synthesis Technology Based on Acoustic Statistical Modeling" (in Chinese); Hu Yu et al.; Journal of Chinese Information Processing; 2011-11-30; vol. 25, no. 6, pp. 127-136 *
"Speaker Conversion Method Based on a Speaker-Independent Model" (in Chinese); Chen Linghui et al.; Pattern Recognition and Artificial Intelligence; 2013-03-31; vol. 26, no. 3, pp. 254-259 *

Also Published As

Publication number Publication date
CN105023574A (en) 2015-11-04

Similar Documents

Publication Publication Date Title
CN105118498B (en) The training method and device of phonetic synthesis model
Wu et al. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis
CN104036774B (en) Tibetan dialect recognition methods and system
CN103928023B (en) A kind of speech assessment method and system
CN104766603B (en) Construct the method and device of personalized singing style Spectrum synthesizing model
CN110223705A (en) Phonetics transfer method, device, equipment and readable storage medium storing program for executing
CN101064104B (en) Emotion voice creating method based on voice conversion
CN101490740B (en) Audio combining device
CN1835074B (en) Speaking person conversion method combined high layer discription information and model self adaption
CN104272382A (en) Method and system for template-based personalized singing synthesis
CN105023574B (en) A kind of method and system for realizing synthesis speech enhan-cement
CN105590625A (en) Acoustic model self-adaptive method and system
CN105023570B (en) A kind of method and system for realizing sound conversion
CN106875942A (en) Acoustic model adaptive approach based on accent bottleneck characteristic
CN105551071A (en) Method and system of face animation generation driven by text voice
CN107452379A (en) The identification technology and virtual reality teaching method and system of a kind of dialect language
CN101578659A (en) Voice tone converting device and voice tone converting method
CN109523616A (en) A kind of FA Facial Animation generation method, device, equipment and readable storage medium storing program for executing
CN103456295B (en) Sing synthetic middle base frequency parameters and generate method and system
CN109952609A (en) Speech synthesizing method
CN110491393A (en) The training method and relevant apparatus of vocal print characterization model
CN105206257A (en) Voice conversion method and device
CN108053814A (en) A kind of speech synthesis system and method for analog subscriber song
CN108877835A (en) Evaluate the method and system of voice signal
CN108417207A (en) A kind of depth mixing generation network self-adapting method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui Province, 230088

Applicant after: Iflytek Co., Ltd.

Address before: No. 666 Wangjiang Road, High-tech Development Zone, Hefei, Anhui Province, 230088

Applicant before: Anhui USTC iFLYTEK Co., Ltd.

COR Change of bibliographic data
GR01 Patent grant
GR01 Patent grant