CN106531150A - Emotion synthesis method based on deep neural network model - Google Patents

Emotion synthesis method based on deep neural network model

Info

Publication number
CN106531150A
Authority
CN
China
Prior art keywords
speaker
emotion
neutral
model
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611201686.6A
Other languages
Chinese (zh)
Other versions
CN106531150B (en)
Inventor
王鸣 (Wang Ming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Shanghai Intelligent Technology Co Ltd
Original Assignee
SHANGHAI YUZHIYI INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI YUZHIYI INFORMATION TECHNOLOGY Co Ltd filed Critical SHANGHAI YUZHIYI INFORMATION TECHNOLOGY Co Ltd
Priority to CN201611201686.6A priority Critical patent/CN106531150B/en
Publication of CN106531150A publication Critical patent/CN106531150A/en
Application granted granted Critical
Publication of CN106531150B publication Critical patent/CN106531150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an emotion synthesis method based on a deep neural network model. The method comprises the steps of: obtaining neutral acoustic feature data and emotional acoustic feature data of a first speaker; establishing, through the deep neural network model, an emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker; obtaining neutral speech data of a second speaker and establishing a neutral speech synthesis model of the second speaker; and connecting the neutral speech synthesis model of the second speaker and the emotion conversion model in series through the deep neural network model to obtain an emotional speech synthesis model of the second speaker. With this method, the emotion model of any other speaker can be obtained from the emotion model of a single speaker through that speaker's neutral-to-emotional conversion model; the method also requires little data, builds emotion models quickly, and is low in cost.

Description

An emotion synthesis method based on a deep neural network model
Technical field
The present invention relates to the field of speech recognition, and more particularly to an emotion synthesis method based on a deep neural network model.
Background technology
Speech synthesis, also known as text-to-speech (TTS) technology, is a technique that converts text information into audible speech. It involves multiple disciplines such as acoustics, linguistics, digital signal processing, and computer science, and is a cutting-edge technology in the field of Chinese information processing; the main problem it solves is how to convert text information into audible sound information.
Most speech synthesis systems are built on speech recorded in a neutral reading style. To overcome the monotony of neutral speech, an emotion model is introduced into the speech synthesis system so that the synthesized speech carries affective characteristics and sounds more natural. To meet individual requirements, a speech synthesis system must be able to generate an acoustic model corresponding to a given speaker, which requires recording a large amount of that speaker's speech data and training a model on the corresponding text annotation data. Once an emotion model is added, a large amount of speech data with different emotions, together with the corresponding text annotation data, must be recorded again to train the emotion model. With multiple different speakers, the amount of data becomes enormous, so the development time is long and the research-and-development cost is too high.
Summary of the invention
The technical problem to be solved by the present invention is to provide an emotion synthesis method based on a deep neural network model that addresses the problems of existing emotion model generation, namely that the huge amount of data required leads to a long development time and an excessive research-and-development cost. The aim is to rapidly build a corresponding emotion model for each of multiple different speakers using only a small amount of neutral data.
To achieve the above technical effect, the invention discloses an emotion synthesis method based on a deep neural network model, comprising the steps of:
Obtaining neutral acoustic feature data and emotional acoustic feature data of a first speaker;
Establishing, using a deep neural network model, an emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker;
Obtaining neutral speech data of a second speaker and establishing a neutral speech synthesis model of the second speaker; and
Connecting, using the deep neural network model, the neutral speech synthesis model of the second speaker in series with the emotion conversion model to obtain an emotional speech synthesis model of the second speaker.
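To make the four steps above concrete, the following Python sketch mirrors the flow with simple least-squares linear maps and random stand-in data in place of the deep neural network models and real speech corpora; the helper fit_linear, all dimensions, and the toy data are illustrative assumptions rather than part of the claimed method.

import numpy as np

rng = np.random.default_rng(0)
D_TXT, D_AC, N = 20, 8, 200            # toy text-feature / acoustic dims, number of frames

def fit_linear(X, Y):
    # Least-squares map Y ~ X @ W, standing in for a DNN regression model.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Step 1: speaker A's neutral and emotional acoustic features for the same texts (toy data).
text_feats  = rng.random((N, D_TXT))
neutral_a   = text_feats @ rng.random((D_TXT, D_AC))
emotional_a = neutral_a @ rng.random((D_AC, D_AC)) + 0.1

# Step 2: emotion conversion model, mapping A's neutral features to A's emotional features.
W_conv = fit_linear(neutral_a, emotional_a)

# Step 3: speaker B's neutral synthesis model (text features -> B's neutral acoustic features).
neutral_b = text_feats @ rng.random((D_TXT, D_AC))
W_b_neutral = fit_linear(text_feats, neutral_b)

# Step 4: series connection: B's neutral model followed by the conversion model gives
# B's emotional synthesis, without any emotional recordings from B.
def synthesize_emotional_b(text_features):
    return (text_features @ W_b_neutral) @ W_conv

print(synthesize_emotional_b(text_feats[:3]).shape)   # (3, D_AC)

The point of the sketch is the data flow: the conversion relationship is learned once from speaker A and then reused behind any other speaker's neutral model.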
The emotion synthesis method based on a deep neural network model is further improved in that the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained by the following steps:
Providing a certain number of sentence texts for the first speaker, the sentence texts including neutral sentence texts and emotional sentence texts with identical text content;
Obtaining neutral speech data of the first speaker from the neutral sentence texts, and obtaining emotional speech data of the first speaker from the emotional sentence texts;
Extracting the neutral acoustic feature data of the first speaker from the neutral speech data;
Extracting the emotional acoustic feature data of the first speaker from the emotional speech data.
The emotion synthesis method based on a deep neural network model is further improved in that the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained as follows:
Obtaining neutral speech data and emotional speech data of the first speaker;
Training a deep neural network model with the neutral speech data of the first speaker to obtain a neutral speech synthesis model of the first speaker;
Training a deep neural network model with the emotional speech data of the first speaker to obtain an emotional speech synthesis model of the first speaker;
Providing a certain number of sentence texts and feeding them separately into the neutral speech synthesis model and the emotional speech synthesis model of the first speaker to obtain the corresponding neutral acoustic feature data and emotional acoustic feature data of the first speaker.
The emotion synthesis method based on a deep neural network model is further improved in that, after obtaining the neutral speech data of the second speaker, the neutral speech synthesis model of the second speaker is established as follows:
Using the neutral speech data of the second speaker, retraining the neutral speech synthesis model of the first speaker to obtain the neutral speech synthesis model of the second speaker.
The emotion synthesis method based on a deep neural network model is further improved in that, after obtaining the neutral speech data of the second speaker, the neutral speech synthesis model of the second speaker is established as follows:
Training a deep neural network model with the neutral speech data of the second speaker to obtain the neutral speech synthesis model of the second speaker.
The emotion synthesis method based on a deep neural network model is further improved in that the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker is established using a deep neural network model as follows:
Taking the neutral acoustic feature data of the first speaker as the input data of the deep neural network model;
Taking the emotional acoustic feature data of the first speaker as the output data of the deep neural network model;
Training the deep neural network model to obtain the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker.
The emotion synthesis method based on a deep neural network model is further improved in that the deep neural network model is trained as follows to obtain the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker:
Building a regression model with the neural network of the deep neural network model, the hidden layers using sigmoid activation functions and the output layer using a linear activation function;
Taking randomly initialized network parameters as the initial parameters and performing model training under the minimum mean square error criterion of formula 1:
L(y, z) = ||y - z||^2    (1)
where y is the emotional acoustic feature data and z is the emotional acoustic feature parameter predicted by the deep neural network model; the training objective is to update the deep neural network model so that L(y, z) is minimized.
The emotion synthesis method based on a deep neural network model is further improved in that the neutral speech synthesis model of the second speaker is connected in series with the emotion conversion model as follows to obtain the emotional speech synthesis model of the second speaker:
In the synthesis stage, for the text to be synthesized, analyzing the text with the synthesis front end to obtain the corresponding text features, the text features including phoneme information, prosody information, 0/1 coding information, and the relative position information of the current frame within the current phoneme;
Taking the phoneme information, prosody information, and 0/1 coding information as the input of the deep neural network model and predicting the phoneme duration information;
Taking the phoneme information, prosody information, 0/1 coding information, and the relative position information of the current frame within the current phoneme as the input of the deep neural network model and predicting spectrum information, energy information, and fundamental frequency information;
Taking the predicted spectrum information, energy information, and fundamental frequency information as acoustic parameters and performing parameter generation on the acoustic features according to formula 2 to obtain smooth acoustic features;
log P(WC|Q, λ) = -1/2 C^T W^T U^-1 W C + C^T W^T U^-1 M + const    (2)
where W is the window function matrix for computing the first-order and second-order differences, C is the acoustic feature to be generated, M is the acoustic parameter predicted by the deep neural network model, and U is the global variance statistic obtained from the training speech corpus;
Using the acoustic feature C, synthesizing the emotional speech through a vocoder.
The emotion synthesis method based on a deep neural network model is further improved in that the neutral speech data comprises the acoustic feature sequence of the neutral speech and the corresponding text data information, the acoustic feature sequence of the neutral speech including spectrum, energy, fundamental frequency, and duration.
By adopting the above technical solution, the present invention has the following advantages:
The emotion synthesis method of the present invention obtains the neutral acoustic feature data and emotional acoustic feature data of one speaker and uses a deep neural network model to establish the conversion relationship between that speaker's neutral and emotional acoustic features; thus, given only a small amount of neutral speech data from another speaker, the corresponding emotion model can be obtained;
When obtaining the neutral and emotional acoustic feature data of the speaker, the speaker's neutral and emotional speech models can output the synthesized acoustic features of the same batch of sentences, and these synthesized acoustic features are used to establish the conversion relationship between neutral and emotional acoustic features; alternatively, neutral sentences and emotional sentences with identical text content can be recorded to obtain the speaker's neutral and emotional speech data, from which the neutral and emotional synthesized acoustic features are extracted to establish the conversion relationship between neutral and emotional acoustic features;
With the present invention, the emotion model of any other speaker can be obtained from the emotion model of a single speaker by means of that speaker's neutral-to-emotional conversion model, with the advantages of a small amount of data, fast emotion model construction, and low cost.
Description of the drawings
Fig. 1 is an operational flowchart of the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 2 is a data flow diagram of a first embodiment of the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 3 is a data flow diagram of a second embodiment of the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 4 is a flowchart of happiness-emotion synthesis in the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 5 is a schematic structural diagram of the neutral speech synthesis model of the first speaker in the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 6 is a schematic structural diagram of the emotion conversion model in the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 7 is a schematic structural diagram of the emotional speech synthesis model of the second speaker in the emotion synthesis method based on a deep neural network model of the present invention.
Specific embodiments
The present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
The embodiments of the present invention are illustrated below by way of specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention may also be implemented or applied through other specific embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention.
It should be noted that the structures, proportions, sizes, and the like depicted in the drawings of this specification are only intended to accompany the content disclosed in the specification for the understanding and reading of those skilled in the art, and are not intended to limit the conditions under which the invention can be implemented; they therefore carry no essential technical significance. Any structural modification, change of proportional relationship, or adjustment of size that does not affect the effects the invention can produce or the purposes it can achieve still falls within the scope covered by the technical content disclosed by the invention. Meanwhile, terms such as "upper", "lower", "left", "right", "middle", and "one" cited in this specification are merely for convenience of description and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantial changes to the technical content, shall also be regarded as within the implementable scope of the invention.
The purpose of the present invention is to propose an emotion synthesis method based on a deep neural network model that solves the problem that, when an existing emotion model is generated, the huge amount of data required leads to a long development time and an excessive research-and-development cost; the aim is to rapidly build a corresponding emotion model for each of multiple different speakers using only a small amount of neutral data.
First, referring to Fig. 1, which is an operational flowchart of the emotion synthesis method based on a deep neural network model of the present invention, the method mainly comprises the following steps and realizes the following functions:
S001: Obtain the neutral acoustic feature data and emotional acoustic feature data of a first speaker (speaker A);
S002: Establish, using a deep neural network model, an emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker (speaker A);
S003: Obtain neutral speech data of a second speaker (speaker B) and establish a neutral speech synthesis model of the second speaker (speaker B);
S004: Connect, using the deep neural network model, the neutral speech synthesis model of the second speaker (speaker B) in series with the emotion conversion model to obtain an emotional speech synthesis model of the second speaker (speaker B).
The emotion synthesis method based on a deep neural network model of the present invention obtains the neutral acoustic feature data and emotional acoustic feature data of one speaker and uses a deep neural network model to establish the conversion relationship between that speaker's neutral and emotional acoustic features; thus, given only a small amount of neutral speech data from another speaker, the corresponding emotion model can be obtained. When obtaining the neutral and emotional acoustic feature data of the speaker, either the speaker's neutral and emotional speech models output the synthesized acoustic features of the same batch of sentences, and these synthesized acoustic features are used to establish the conversion relationship between neutral and emotional acoustic features; or neutral sentences and emotional sentences with identical text content are recorded to obtain the speaker's neutral and emotional speech data, from which the neutral and emotional synthesized acoustic features are extracted to establish the conversion relationship between neutral and emotional acoustic features. Therefore, with the present invention, the emotion model of any other speaker can be obtained from the emotion model of a single speaker by means of that speaker's neutral-to-emotional conversion model, with the advantages of a small amount of data, fast emotion model construction, and low cost.
For the above step S001, the invention provides two ways of obtaining the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A), as follows:
Mode one:
Referring to Fig. 2, which is a data flow diagram of the first embodiment of the emotion synthesis method based on a deep neural network model of the present invention, this mode comprises:
Providing a certain number of sentence texts (for example 2000) for the first speaker (speaker A), the sentence texts including neutral sentence texts (for example 2000) and emotional sentence texts (for example 2000) with identical text content;
Obtaining the neutral speech data of the first speaker (speaker A) from the neutral sentence texts, for example by recording the neutral sentence texts and obtaining the neutral speech data of the first speaker (speaker A) from the recordings;
Obtaining the emotional speech data of the first speaker (speaker A) from the emotional sentence texts, for example by recording the emotional sentence texts and obtaining the emotional speech data of the first speaker (speaker A) from the recordings;
Extracting the neutral acoustic feature data of the first speaker from the obtained neutral speech data of the first speaker (speaker A);
Extracting the emotional acoustic feature data of the first speaker from the obtained emotional speech data of the first speaker (speaker A).
Mode two:
Referring again to Fig. 3, which is a data flow diagram of the second embodiment of the emotion synthesis method based on a deep neural network model of the present invention, this mode comprises:
Obtaining the neutral speech data and the emotional speech data of the first speaker (speaker A), for example by recording;
Carrying out deep neural network (Deep Neural Networks, DNN) model training with the neutral speech data of the first speaker (speaker A) to obtain the neutral speech synthesis model of the first speaker (speaker A);
Carrying out deep neural network (DNN) model training with the emotional speech data of the first speaker (speaker A) to obtain the emotional speech synthesis model of the first speaker (speaker A);
Providing a certain number of sentence texts (for example 5000) and feeding them separately into the neutral speech synthesis model and the emotional speech synthesis model of the first speaker (speaker A) to obtain the corresponding neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A).
Either of the above two modes can be used to obtain the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A). Mode one is more direct: the neutral speech data and emotional speech data of the first speaker (speaker A) are obtained directly from a certain number of recorded sentence texts, and the corresponding neutral and emotional acoustic feature data are then extracted from that speech data; however, the recorded sentence texts must include neutral sentence texts and emotional sentence texts with identical text content. Mode two imposes no such requirement: the text content need not be constrained when recording the sentence texts, since a certain number of arbitrary sentence texts are simply fed into the neutral speech synthesis model and the emotional speech synthesis model, and the corresponding neutral and emotional acoustic feature data are obtained from those models; the data obtained through the neutral and emotional speech synthesis models have higher precision and better consistency.
For the above step S003, after obtaining the neutral speech data of the second speaker (speaker B), the present invention can establish the neutral speech synthesis model of the second speaker (speaker B) in either of the following two ways (see Fig. 3), specifically:
Mode one, which must be based on obtaining the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A) with the above second mode:
Using the recorded neutral speech data of the second speaker (speaker B), retraining (retrain) the neutral speech synthesis model of the first speaker (speaker A) to obtain the neutral speech synthesis model of the second speaker (speaker B); this step is realized by model training based on the deep neural network (DNN) model.
Mode two, which can be applied with either of the above two ways of obtaining the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A):
Carrying out deep neural network (DNN) model training with the recorded neutral speech data of the second speaker (speaker B) to obtain the neutral speech synthesis model of the second speaker.
The above step S002 is the innovative point of the emotion synthesis method based on a deep neural network model of the present invention: the obtained neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A) are used to build the acoustic conversion relationship between the two kinds of data, and a deep neural network model (DNN) is then used to obtain the emotion conversion model corresponding to that acoustic conversion relationship. With this emotion conversion model, the emotion model of the corresponding speaker (i.e., the emotional speech synthesis model, emotion model for short) can be obtained based on the deep neural network model (DNN).
The emotion models to which the emotion synthesis method based on a deep neural network model of the present invention applies include emotion models such as happiness, anger, indignation, and sadness.
Based on the emotion model of a single speaker, the present invention can obtain the emotion model of any other speaker by means of that speaker's neutral-to-emotional conversion model, with the advantages of a small amount of data, fast emotion model construction, and low cost.
In the emotion synthesis method based on a deep neural network model of the present invention, the neutral and emotional speech models of one speaker output the synthesized acoustic features of the same batch of sentences, and these synthesized acoustic features are used to establish the conversion relationship between neutral and emotional acoustic features; thus, the corresponding emotion model can be obtained with only a small amount of neutral speech data from another speaker.
The method is illustrated below by taking the construction of a happiness emotion model (i.e., an emotional speech synthesis model for the happiness emotion) as an example. As shown in Fig. 4, which is a flowchart of happiness-emotion synthesis in the emotion synthesis method based on a deep neural network model of the present invention, the method comprises the following steps:
(1) Record the neutral speech data and the happiness speech data of speaker A;
(2) Carry out DNN (deep neural network) model training with the neutral speech data to obtain the neutral speech synthesis model of speaker A, as shown in Fig. 5, which is a schematic structural diagram of the neutral speech synthesis model of speaker A. The neutral speech data comprise the acoustic feature sequence of the neutral speech and the corresponding text data information; the acoustic feature sequence of the neutral speech includes spectrum, energy, fundamental frequency, and duration. The details are as follows:
Step 1: Obtain the input data:
For the corresponding text features, the traditional phoneme and prosody information corresponding to the text is obtained and 0/1 encoded, yielding 1114 binary dimensions; in addition, the relative position of the current frame within the current phoneme (normalized between 0 and 1), including the forward position and the backward position, adds 2 dimensions. The 0/1 coding of the phoneme and prosody information together with the position information thus amounts to 1116 dimensions, which serve as the DNN network input;
Step 2: Obtain the output data:
The output comprises acoustic features such as spectrum, energy, fundamental frequency, and duration. The acoustic features are divided into two classes and modeled separately: 1) spectrum, energy, and fundamental frequency, where the spectrum has 40 dimensions, the energy 1 dimension, the fundamental frequency 1 dimension, and the voiced/unvoiced flag of the fundamental frequency 1 dimension; the fundamental frequency is extended with a frame context of the 4 preceding and 4 following frames, and first-order and second-order difference information is included for the spectrum and energy parameters, for a total of 133 dimensions; 2) duration, i.e., the phoneme duration expressed as the number of frames contained in the phoneme, 1 dimension;
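As a quick check of the dimensionality described above, the following Python sketch assembles a toy 133-dimensional output frame, assuming the layout (spectrum 40 + energy 1) with first- and second-order differences, F0 over a context of the current frame plus the 4 preceding and 4 following frames, and one voiced/unvoiced flag; this layout is an interpretation of the text, not an official specification.

import numpy as np

SPECTRUM, ENERGY = 40, 1
static = SPECTRUM + ENERGY              # 41 static dimensions
with_deltas = static * 3                # statics + delta + delta-delta = 123
f0_context = 1 * (4 + 1 + 4)            # F0 of the current frame and +/-4 frames = 9
vuv_flag = 1                            # voiced/unvoiced flag of the fundamental frequency

total = with_deltas + f0_context + vuv_flag
print(total)                            # 133

# Assembling one toy frame in this assumed layout:
frame = np.concatenate([
    np.zeros(with_deltas),              # spectrum/energy plus their differences
    np.zeros(f0_context),               # F0 trajectory around the current frame
    np.zeros(vuv_flag),                 # V/UV flag
])
assert frame.shape == (133,)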
Step 3: Train the DNN model:
A classical BP (Back Propagation) neural network is used here to build the regression model; the hidden layers use sigmoid activation functions and the output layer uses a linear activation function. The network parameters are first randomly initialized as the initial parameters, and model training is then carried out under the following MMSE (Minimum Mean Square Error) criterion:
L(y, z) = ||y - z||^2
where y is the natural target parameter and z is the parameter predicted by the DNN model; the training objective is to update the DNN network so that L(y, z) is minimized.
The two classes of acoustic features mentioned above are modeled separately here:
1) Spectrum, energy, and fundamental frequency, 133 dimensions in total, with the network structure 1116-1024-1024-133; the resulting neutral speech synthesis model is denoted M_ANS.
2) Duration, 1 dimension in total, where the network input does not include the frame position information within the current phoneme, with the network structure 1114-1024-1024-1; the resulting neutral speech synthesis model is denoted M_AND.
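A minimal PyTorch sketch of the two regression networks described above (M_ANS: 1116-1024-1024-133, M_AND: 1114-1024-1024-1, sigmoid hidden layers, linear output) and one training step under the MMSE criterion; the optimizer, learning rate, and random toy batches are assumptions, not values from the patent.

import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    # Regression network with two sigmoid hidden layers and a linear output layer.
    def __init__(self, in_dim=1116, hidden_dim=1024, out_dim=133):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, out_dim),          # linear output layer
        )

    def forward(self, x):
        return self.net(x)

acoustic_model = AcousticDNN()                       # M_ANS: 1116 -> 1024 -> 1024 -> 133
duration_model = AcousticDNN(in_dim=1114, out_dim=1) # M_AND: 1114 -> 1024 -> 1024 -> 1

criterion = nn.MSELoss()                             # mean-squared error, i.e. L(y, z) = ||y - z||^2 up to averaging
optimizer = torch.optim.SGD(acoustic_model.parameters(), lr=1e-3)

x = torch.rand(32, 1116)    # toy batch of 1116-dim text/position features
y = torch.rand(32, 133)     # toy batch of 133-dim acoustic targets
z = acoustic_model(x)
loss = criterion(z, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()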
(3) Carry out DNN model training with the happiness speech data to obtain the happiness speech synthesis model of speaker A. The happiness speech data comprise the feature sequence of the happiness speech and the corresponding text data information; the feature sequence of the happiness speech includes spectrum, energy, fundamental frequency, and duration, and the specific modeling is similar to that of the neutral speech synthesis model of speaker A. The resulting DNN models of the emotional speech synthesis model of speaker A are denoted M_AES and M_AED.
(4) Provide an arbitrary batch of a certain number of sentence texts (for example 5000), and feed them separately into the neutral speech synthesis model and the happiness speech synthesis model of speaker A to obtain the corresponding neutral synthesized acoustic feature data of A and the happiness synthesized acoustic feature data of A; then build the acoustic conversion relationship between A's neutral synthesized acoustic features and A's happiness synthesized acoustic features, and use a DNN to obtain the emotion conversion model from this neutral-to-happiness conversion relationship, as shown in Fig. 6, which is a schematic structural diagram of the emotion conversion model of the emotion synthesis method based on a deep neural network model of the present invention. The details are as follows:
Step 1: Obtain the input data:
According to the input text, the neutral speech synthesis model of speaker A is used to obtain the corresponding neutral acoustic feature data; specifically, the neutral speech synthesis model M_ANS of speaker A is used to obtain the spectrum, energy, and fundamental frequency features, and the neutral speech synthesis model M_AND of speaker A is used to obtain the phoneme duration features;
Step 2: Obtain the output data:
According to the input text, the emotional speech synthesis model of speaker A is used to obtain the corresponding acoustic features; specifically, the emotional speech synthesis model M_AES of speaker A is used to obtain the spectrum, energy, and fundamental frequency features, and the emotional speech synthesis model M_AED of speaker A is used to obtain the phoneme duration features. These two sets of features serve as the target emotional acoustic feature parameters.
Step 3: Train the DNN model:
A BP (Back Propagation) neural network is used here to build the regression model (a kind of DNN model); the hidden layers use sigmoid activation functions and the output layer uses a linear activation function. The network parameters are first randomly initialized as the initial parameters, and model training is then carried out under the following MMSE criterion:
L(y, z) = ||y - z||^2
where y is the target emotional acoustic feature parameter and z is the emotional acoustic feature parameter predicted by the DNN model; the training objective is to update the DNN network so that L(y, z) is minimized.
The two classes of acoustic features mentioned above are modeled separately here:
1) Spectrum, energy, and fundamental frequency, 133 dimensions in total, with the network structure 133-1024-1024-133; the resulting model is denoted M_CS.
2) Duration, 1 dimension in total, with the network structure 1-1024-1024-1; the resulting model is denoted M_CD.
The models M_CS and M_CD are the emotion conversion models between the neutral acoustic feature data and the emotional acoustic feature data of speaker A.
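A corresponding sketch of the emotion conversion models M_CS (133-1024-1024-133) and M_CD (1-1024-1024-1), trained on paired features where the inputs come from speaker A's neutral synthesis models and the targets from A's happiness synthesis models for the same sentences; the random tensors below stand in for those synthesized feature pairs, and the optimizer settings are illustrative.

import torch
import torch.nn as nn

def make_regressor(in_dim, out_dim, hidden=1024):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Sigmoid(),
        nn.Linear(hidden, hidden), nn.Sigmoid(),
        nn.Linear(hidden, out_dim),          # linear output layer
    )

m_cs = make_regressor(133, 133)              # spectrum/energy/F0 conversion model
m_cd = make_regressor(1, 1)                  # phoneme-duration conversion model (trained analogously)

neutral_feats   = torch.rand(64, 133)        # stand-in for features synthesized by M_ANS
emotional_feats = torch.rand(64, 133)        # stand-in for features synthesized by M_AES

opt = torch.optim.Adam(m_cs.parameters(), lr=1e-4)
loss = nn.functional.mse_loss(m_cs(neutral_feats), emotional_feats)
opt.zero_grad()
loss.backward()
opt.step()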
Then, obtain the neutral speech data of speaker B.
The neutral speech data of speaker B is then used to retrain (retrain) the neutral speech synthesis model of speaker A, yielding the neutral speech synthesis model of speaker B; alternatively, deep neural network (DNN) model training can be carried out directly with the obtained neutral speech data of speaker B, which likewise yields the neutral speech synthesis model of speaker B. The former scheme is adopted in this embodiment. The neutral speech data comprise the feature sequence of the neutral speech and the corresponding text data information; the feature sequence of the neutral speech includes spectrum, energy, fundamental frequency, and duration, and the specific modeling is similar to that of the neutral speech synthesis model of speaker A, except that the network parameters are not randomly initialized here; instead, the neutral speech synthesis model of speaker A provides the initial parameters. The resulting DNN models of the neutral speech synthesis model of speaker B are denoted M_BNS and M_BND.
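A sketch of this retraining step, assuming identical network topology for speakers A and B: speaker B's model is initialized from A's trained weights rather than random values and then updated on B's small neutral corpus (random stand-in data below); the number of update steps and the learning rate are illustrative.

import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(
        nn.Linear(1116, 1024), nn.Sigmoid(),
        nn.Linear(1024, 1024), nn.Sigmoid(),
        nn.Linear(1024, 133),
    )

model_a = make_model()                         # assumed already trained on speaker A's data
model_b = make_model()
model_b.load_state_dict(model_a.state_dict())  # initialize from A's parameters, not randomly

opt = torch.optim.SGD(model_b.parameters(), lr=1e-4)
x_b = torch.rand(16, 1116)                     # speaker B's small neutral corpus (stand-in)
y_b = torch.rand(16, 133)
for _ in range(10):                            # a few retraining updates
    loss = nn.functional.mse_loss(model_b(x_b), y_b)
    opt.zero_grad()
    loss.backward()
    opt.step()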
The neutral speech synthesis models M_BNS and M_BND of speaker B are connected in series with the emotion conversion models M_CS and M_CD, respectively, to obtain the emotional speech synthesis models M_BNS-M_CS and M_BND-M_CD of speaker B, whose structure is shown in Fig. 7, a schematic structural diagram of the emotional speech synthesis model of speaker B.
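A sketch of the series connection M_BNS followed by M_CS: text features are first mapped to speaker B's neutral acoustic features and then converted into emotional acoustic features. The two untrained networks below only illustrate the composition with the dimensions given in the text.

import torch
import torch.nn as nn

m_bns = nn.Sequential(nn.Linear(1116, 1024), nn.Sigmoid(),
                      nn.Linear(1024, 1024), nn.Sigmoid(),
                      nn.Linear(1024, 133))
m_cs  = nn.Sequential(nn.Linear(133, 1024), nn.Sigmoid(),
                      nn.Linear(1024, 1024), nn.Sigmoid(),
                      nn.Linear(1024, 133))

def emotional_synthesis_b(text_features: torch.Tensor) -> torch.Tensor:
    neutral_b = m_bns(text_features)           # speaker B's neutral acoustic features
    return m_cs(neutral_b)                     # converted emotional acoustic features

print(emotional_synthesis_b(torch.rand(4, 1116)).shape)   # torch.Size([4, 133])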
In the synthesis stage, for the text to be synthesized, the synthesis front end is used to analyze the text and obtain the corresponding text features; specifically, the traditional phoneme and prosody information corresponding to the text is obtained and 0/1 encoded, yielding 1114 binary dimensions; in addition, the relative position of the current frame within the current phoneme (normalized between 0 and 1), including the forward position and the backward position, adds 2 dimensions. The 0/1 coding of the phoneme and prosody information together with the position information thus amounts to 1116 dimensions, which serve as the DNN network input.
The prediction steps are as follows:
1. Predict the phoneme duration information: the network input here does not include the frame position information within the current phoneme; the 1114-dimensional 0/1 coding of the phoneme and prosody information is taken as the input, and the phoneme duration is predicted;
2. Predict the spectrum, energy, and fundamental frequency information: the 1116-dimensional information obtained from the front-end analysis above is taken as the input, and the spectrum, energy, and fundamental frequency information, 133 dimensions in total, is predicted;
3. For the predicted acoustic parameters, parameter generation is carried out according to the following formula to obtain smooth acoustic parameters:
log P(WC|Q, λ) = -1/2 C^T W^T U^-1 W C + C^T W^T U^-1 M + const    (2)
where W is the window function matrix for computing the first-order and second-order differences, C is the acoustic feature to be generated, M is the acoustic feature predicted by the DNN network, and U is the global variance statistic obtained from the training speech corpus.
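A minimal numpy sketch of this parameter-generation step for a single acoustic dimension: maximizing log P(WC|Q, λ) with respect to C gives the linear system (W^T U^-1 W) C = W^T U^-1 M, which is solved below; the window coefficients, the boundary handling, and the random M and U are illustrative assumptions rather than values from the patent.

import numpy as np

T = 10                                       # number of frames

def delta_window_matrix(T):
    # Stack static, first-order, and second-order difference windows into a (3T x T) matrix W.
    I = np.eye(T)
    D1 = np.zeros((T, T))                    # first-order difference: 0.5 * (c[t+1] - c[t-1])
    D2 = np.zeros((T, T))                    # second-order difference: c[t-1] - 2*c[t] + c[t+1]
    for t in range(T):
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        D1[t, hi] += 0.5
        D1[t, lo] -= 0.5
        D2[t, lo] += 1.0
        D2[t, t]  -= 2.0
        D2[t, hi] += 1.0
    return np.vstack([I, D1, D2])

W = delta_window_matrix(T)                   # (3T, T)
M = np.random.rand(3 * T)                    # DNN-predicted static/delta/delta-delta means (stand-in)
U_inv = np.diag(1.0 / (np.random.rand(3 * T) + 0.1))   # inverse global variances (stand-in)

# Solve (W^T U^-1 W) C = W^T U^-1 M for the smooth static trajectory C.
A = W.T @ U_inv @ W
b = W.T @ U_inv @ M
C = np.linalg.solve(A, b)
print(C.shape)                               # (T,)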
4. Using the acoustic feature C, synthesize speech through a vocoder, thereby obtaining the emotional synthesized speech of speaker B.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above with reference to a preferred embodiment, this is not intended to limit the present invention. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make slight changes or modifications amounting to equivalent embodiments; any simple amendment, equivalent change, or modification made to the above embodiment according to the technical substance of the present invention, provided it does not depart from the content of the technical solution of the present invention, still falls within the scope of the technical solution of the present invention.

Claims (9)

1. An emotion synthesis method based on a deep neural network model, characterized by comprising the steps of:
obtaining neutral acoustic feature data and emotional acoustic feature data of a first speaker;
establishing, using a deep neural network model, an emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker;
obtaining neutral speech data of a second speaker and establishing a neutral speech synthesis model of the second speaker; and
connecting, using the deep neural network model, the neutral speech synthesis model of the second speaker in series with the emotion conversion model to obtain an emotional speech synthesis model of the second speaker.
2. The emotion synthesis method based on a deep neural network model according to claim 1, characterized in that the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained by the following steps:
providing a certain number of sentence texts for the first speaker, the sentence texts including neutral sentence texts and emotional sentence texts with identical text content;
obtaining neutral speech data of the first speaker from the neutral sentence texts, and obtaining emotional speech data of the first speaker from the emotional sentence texts;
extracting the neutral acoustic feature data of the first speaker from the neutral speech data;
extracting the emotional acoustic feature data of the first speaker from the emotional speech data.
3. The emotion synthesis method based on a deep neural network model according to claim 1, characterized in that the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained as follows:
obtaining neutral speech data and emotional speech data of the first speaker;
training a deep neural network model with the neutral speech data of the first speaker to obtain a neutral speech synthesis model of the first speaker;
training a deep neural network model with the emotional speech data of the first speaker to obtain an emotional speech synthesis model of the first speaker;
providing a certain number of sentence texts and feeding them separately into the neutral speech synthesis model and the emotional speech synthesis model of the first speaker to obtain the corresponding neutral acoustic feature data and emotional acoustic feature data of the first speaker.
4. The emotion synthesis method based on a deep neural network model according to claim 3, characterized in that, after obtaining the neutral speech data of the second speaker, the neutral speech synthesis model of the second speaker is established as follows:
using the neutral speech data of the second speaker, retraining the neutral speech synthesis model of the first speaker to obtain the neutral speech synthesis model of the second speaker.
5. The emotion synthesis method based on a deep neural network model according to claim 1, characterized in that, after obtaining the neutral speech data of the second speaker, the neutral speech synthesis model of the second speaker is established as follows:
training a deep neural network model with the neutral speech data of the second speaker to obtain the neutral speech synthesis model of the second speaker.
6. The emotion synthesis method based on a deep neural network model according to any one of claims 1 to 5, characterized in that the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker is established using a deep neural network model as follows:
taking the neutral acoustic feature data of the first speaker as the input data of the deep neural network model;
taking the emotional acoustic feature data of the first speaker as the output data of the deep neural network model;
training the deep neural network model to obtain the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker.
7. The emotion synthesis method based on a deep neural network model according to claim 6, characterized in that the deep neural network model is trained as follows to obtain the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker:
building a regression model with the neural network of the deep neural network model, the hidden layers using sigmoid activation functions and the output layer using a linear activation function;
taking randomly initialized network parameters as the initial parameters and performing model training under the minimum mean square error criterion of formula 1:
L(y, z) = ||y - z||^2    (1)
wherein y is the emotional acoustic feature data and z is the emotional acoustic feature parameter predicted by the deep neural network model, and the training objective is to update the deep neural network model so that L(y, z) is minimized.
8. The emotion synthesis method based on a deep neural network model according to claim 2, characterized in that the neutral speech synthesis model of the second speaker is connected in series with the emotion conversion model as follows to obtain the emotional speech synthesis model of the second speaker:
in the synthesis stage, for the text to be synthesized, analyzing the text with the synthesis front end to obtain the corresponding text features, the text features including phoneme information, prosody information, 0/1 coding information, and the relative position information of the current frame within the current phoneme;
taking the phoneme information, prosody information, and 0/1 coding information as the input of the deep neural network model and predicting the phoneme duration information;
taking the phoneme information, prosody information, 0/1 coding information, and the relative position information of the current frame within the current phoneme as the input of the deep neural network model and predicting spectrum information, energy information, and fundamental frequency information;
taking the predicted spectrum information, energy information, and fundamental frequency information as acoustic parameters and performing parameter generation on the acoustic features according to formula 2 to obtain smooth acoustic features;
log P(WC|Q, λ) = -1/2 C^T W^T U^-1 W C + C^T W^T U^-1 M + const    (2)
wherein W is the window function matrix for computing the first-order and second-order differences, C is the acoustic feature to be generated, M is the acoustic parameter predicted by the deep neural network model, and U is the global variance statistic obtained from the training speech corpus;
using the acoustic feature C, synthesizing the emotional speech through a vocoder.
9. The emotion synthesis method based on a deep neural network model according to claim 1, characterized in that the neutral speech data comprises the acoustic feature sequence of the neutral speech and the corresponding text data information, and the acoustic feature sequence of the neutral speech includes spectrum, energy, fundamental frequency, and duration.
CN201611201686.6A 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model Active CN106531150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611201686.6A CN106531150B (en) 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611201686.6A CN106531150B (en) 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model

Publications (2)

Publication Number Publication Date
CN106531150A true CN106531150A (en) 2017-03-22
CN106531150B CN106531150B (en) 2020-02-07

Family

ID=58337400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611201686.6A Active CN106531150B (en) 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model

Country Status (1)

Country Link
CN (1) CN106531150B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN108305641A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
CN108364631A (en) * 2017-01-26 2018-08-03 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN108447470A (en) * 2017-12-28 2018-08-24 中南大学 A kind of emotional speech conversion method based on sound channel and prosodic features
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN108763190A (en) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
CN109036370A (en) * 2018-06-06 2018-12-18 安徽继远软件有限公司 A kind of speaker's voice adaptive training method
CN109102796A (en) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 A kind of phoneme synthesizing method and device
WO2019218773A1 (en) * 2018-05-15 2019-11-21 中兴通讯股份有限公司 Voice synthesis method and device, storage medium, and electronic device
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
WO2020073944A1 (en) * 2018-10-10 2020-04-16 华为技术有限公司 Speech synthesis method and device
WO2020098269A1 (en) * 2018-11-15 2020-05-22 华为技术有限公司 Speech synthesis method and speech synthesis device
CN111599338A (en) * 2020-04-09 2020-08-28 云知声智能科技股份有限公司 Stable and controllable end-to-end speech synthesis method and device
CN111613224A (en) * 2020-04-10 2020-09-01 云知声智能科技股份有限公司 Personalized voice synthesis method and device
US11538455B2 (en) 2018-02-16 2022-12-27 Dolby Laboratories Licensing Corporation Speech style transfer

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064104A (en) * 2006-04-24 2007-10-31 中国科学院自动化研究所 Emotion voice creating method based on voice conversion
CN101308652A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 Synthesizing method of personalized singing voice
CN102005205A (en) * 2009-09-03 2011-04-06 株式会社东芝 Emotional speech synthesizing method and device
US20140067397A1 (en) * 2012-08-29 2014-03-06 Nuance Communications, Inc. Using emoticons for contextual text-to-speech expressivity
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
EP3046053A2 (en) * 2015-01-19 2016-07-20 Samsung Electronics Co., Ltd Method and apparatus for training language model, and method and apparatus for recongnizing language

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064104A (en) * 2006-04-24 2007-10-31 中国科学院自动化研究所 Emotion voice creating method based on voice conversion
CN101308652A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 Synthesizing method of personalized singing voice
CN102005205A (en) * 2009-09-03 2011-04-06 株式会社东芝 Emotional speech synthesizing method and device
US20140067397A1 (en) * 2012-08-29 2014-03-06 Nuance Communications, Inc. Using emoticons for contextual text-to-speech expressivity
EP3046053A2 (en) * 2015-01-19 2016-07-20 Samsung Electronics Co., Ltd Method and apparatus for training language model, and method and apparatus for recongnizing language
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364631A (en) * 2017-01-26 2018-08-03 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN108364631B (en) * 2017-01-26 2021-01-22 北京搜狗科技发展有限公司 Speech synthesis method and device
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN108305641A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
CN108305641B (en) * 2017-06-30 2020-04-07 腾讯科技(深圳)有限公司 Method and device for determining emotion information
CN108447470A (en) * 2017-12-28 2018-08-24 中南大学 A kind of emotional speech conversion method based on sound channel and prosodic features
US11538455B2 (en) 2018-02-16 2022-12-27 Dolby Laboratories Licensing Corporation Speech style transfer
CN108763190A (en) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
CN108763190B (en) * 2018-04-12 2019-04-02 平安科技(深圳)有限公司 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
WO2019196306A1 (en) * 2018-04-12 2019-10-17 平安科技(深圳)有限公司 Device and method for speech-based mouth shape animation blending, and readable storage medium
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
WO2019218773A1 (en) * 2018-05-15 2019-11-21 中兴通讯股份有限公司 Voice synthesis method and device, storage medium, and electronic device
CN110556092A (en) * 2018-05-15 2019-12-10 中兴通讯股份有限公司 Speech synthesis method and device, storage medium and electronic device
CN109036370A (en) * 2018-06-06 2018-12-18 安徽继远软件有限公司 A kind of speaker's voice adaptive training method
CN109036370B (en) * 2018-06-06 2021-07-20 安徽继远软件有限公司 Adaptive training method for speaker voice
CN109102796A (en) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 A kind of phoneme synthesizing method and device
WO2020073944A1 (en) * 2018-10-10 2020-04-16 华为技术有限公司 Speech synthesis method and device
US11361751B2 (en) 2018-10-10 2022-06-14 Huawei Technologies Co., Ltd. Speech synthesis method and device
US11282498B2 (en) 2018-11-15 2022-03-22 Huawei Technologies Co., Ltd. Speech synthesis method and speech synthesis apparatus
WO2020098269A1 (en) * 2018-11-15 2020-05-22 华为技术有限公司 Speech synthesis method and speech synthesis device
CN111192568A (en) * 2018-11-15 2020-05-22 华为技术有限公司 Speech synthesis method and speech synthesis device
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
CN111599338A (en) * 2020-04-09 2020-08-28 云知声智能科技股份有限公司 Stable and controllable end-to-end speech synthesis method and device
CN111613224A (en) * 2020-04-10 2020-09-01 云知声智能科技股份有限公司 Personalized voice synthesis method and device

Also Published As

Publication number Publication date
CN106531150B (en) 2020-02-07

Similar Documents

Publication Publication Date Title
CN106531150A (en) Emotion synthesis method based on deep neural network model
CN101578659B (en) Voice tone converting device and voice tone converting method
CN111667812B (en) Speech synthesis method, device, equipment and storage medium
CN105118498B (en) The training method and device of phonetic synthesis model
CN105185372B (en) Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
CN101064104B (en) Emotion voice creating method based on voice conversion
CN107464559A (en) Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN108447486A (en) A kind of voice translation method and device
CN104916284B (en) Prosody and acoustics joint modeling method and device for voice synthesis system
CN107958433A (en) A kind of online education man-machine interaction method and system based on artificial intelligence
CN106653052A (en) Virtual human face animation generation method and device
CN106601228A (en) Sample marking method and device based on artificial intelligence prosody prediction
CN111048062A (en) Speech synthesis method and apparatus
CN106057192A (en) Real-time voice conversion method and apparatus
CN106971709A (en) Statistic parameter model method for building up and device, phoneme synthesizing method and device
CN106128450A (en) The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese
CN105654939A (en) Voice synthesis method based on voice vector textual characteristics
CN107452379A (en) The identification technology and virtual reality teaching method and system of a kind of dialect language
Malcangi Text-driven avatars based on artificial neural networks and fuzzy logic
CN107871496A (en) Audio recognition method and device
CN110010136A (en) The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
Schröder et al. Synthesis of emotional speech
CN101887719A (en) Speech synthesis method, system and mobile terminal equipment with speech synthesis function
CN110459201B (en) Speech synthesis method for generating new tone

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170929

Address after: 200233 Shanghai City, Xuhui District Guangxi 65 No. 1 Jinglu room 702 unit 03

Applicant after: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.

Address before: 200233 Shanghai, Qinzhou, North Road, No. 82, building 2, layer 1198,

Applicant before: SHANGHAI YUZHIYI INFORMATION TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An emotion synthesis method based on deep neural network model

Effective date of registration: 20201201

Granted publication date: 20200207

Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch

Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY Co.,Ltd.

Registration number: Y2020310000047

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20220307

Granted publication date: 20200207

Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch

Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.

Registration number: Y2020310000047

PC01 Cancellation of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An emotion synthesis method based on deep neural network model

Effective date of registration: 20230210

Granted publication date: 20200207

Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch

Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.

Registration number: Y2023310000028

PE01 Entry into force of the registration of the contract for pledge of patent right
PC01 Cancellation of the registration of the contract for pledge of patent right

Granted publication date: 20200207

Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch

Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.

Registration number: Y2023310000028

PC01 Cancellation of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A sentiment synthesis method based on deep neural network models

Granted publication date: 20200207

Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch

Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.

Registration number: Y2024310000165

PE01 Entry into force of the registration of the contract for pledge of patent right