CN106531150A - Emotion synthesis method based on deep neural network model - Google Patents

Emotion synthesis method based on deep neural network model

Info

Publication number
CN106531150A
Authority
CN
China
Prior art keywords
speaker
emotion
neutral
model
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611201686.6A
Other languages
Chinese (zh)
Other versions
CN106531150B (en)
Inventor
王鸣 (Wang Ming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Shanghai Intelligent Technology Co Ltd
Original Assignee
SHANGHAI YUZHIYI INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI YUZHIYI INFORMATION TECHNOLOGY Co Ltd filed Critical SHANGHAI YUZHIYI INFORMATION TECHNOLOGY Co Ltd
Priority to CN201611201686.6A priority Critical patent/CN106531150B/en
Publication of CN106531150A publication Critical patent/CN106531150A/en
Application granted granted Critical
Publication of CN106531150B publication Critical patent/CN106531150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an emotion synthesis method based on a deep neural network model. The method comprises the steps of: obtaining neutral acoustic feature data and emotional acoustic feature data of a first speaker; establishing, through the deep neural network model, an emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker; obtaining neutral speech data of a second speaker and establishing a neutral speech synthesis model of the second speaker; and connecting the neutral speech synthesis model of the second speaker and the emotion conversion model in series through the deep neural network model to obtain an emotional speech synthesis model of the second speaker. With this method, the emotion model of any other speaker can be obtained from the emotion model of a single speaker through that speaker's neutral-to-emotional conversion model; the method also requires little data, builds emotion models quickly, and is low in cost.

Description

An emotion synthesis method based on a deep neural network model
Technical field
The present invention relates to the field of speech recognition, and more particularly to an emotion synthesis method based on a deep neural network model.
Background technology
Speech synthesis, also known as text-to-speech (TTS) technology, is a technique that converts text information into audible speech. It involves multiple disciplines such as acoustics, linguistics, digital signal processing, and computer science, and is a cutting-edge technology in the field of Chinese information processing; the main problem it solves is how to convert text information into audible sound information.
Most speech synthesis systems are built on speech recorded in a neutral reading style. To overcome the monotony of neutral speech, an emotion model is introduced into the speech synthesis system so that the synthesized speech carries affective characteristics and sounds more natural. To meet individual requirements, a speech synthesis system must be able to generate an acoustic model corresponding to a given speaker, which requires recording a large amount of that speaker's speech data and training a model on the corresponding text annotation data. Once an emotion model is added, a large amount of speech data with different emotions, together with the corresponding text annotation data, must be recorded again to train the emotion model. With multiple different speakers, the amount of data becomes enormous, so the development time is long and the research-and-development cost is too high.
Summary of the invention
The technical problem to be solved by the present invention is to provide an emotion synthesis method based on a deep neural network model that addresses the problems of existing emotion model generation, namely that the huge amount of data required leads to a long development time and an excessive research-and-development cost. The aim is to rapidly build a corresponding emotion model for each of multiple different speakers using only a small amount of neutral data.
To achieve the above technical effect, the invention discloses an emotion synthesis method based on a deep neural network model, comprising the steps of:
Obtaining neutral acoustic feature data and emotional acoustic feature data of a first speaker;
Establishing, using a deep neural network model, an emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker;
Obtaining neutral speech data of a second speaker and establishing a neutral speech synthesis model of the second speaker; and
Connecting, using the deep neural network model, the neutral speech synthesis model of the second speaker in series with the emotion conversion model to obtain an emotional speech synthesis model of the second speaker.
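To make the four steps above concrete, the following Python sketch mirrors the flow with simple least-squares linear maps and random stand-in data in place of the deep neural network models and real speech corpora; the helper fit_linear, all dimensions, and the toy data are illustrative assumptions rather than part of the claimed method.

import numpy as np

rng = np.random.default_rng(0)
D_TXT, D_AC, N = 20, 8, 200            # toy text-feature / acoustic dims, number of frames

def fit_linear(X, Y):
    # Least-squares map Y ~ X @ W, standing in for a DNN regression model.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# Step 1: speaker A's neutral and emotional acoustic features for the same texts (toy data).
text_feats  = rng.random((N, D_TXT))
neutral_a   = text_feats @ rng.random((D_TXT, D_AC))
emotional_a = neutral_a @ rng.random((D_AC, D_AC)) + 0.1

# Step 2: emotion conversion model, mapping A's neutral features to A's emotional features.
W_conv = fit_linear(neutral_a, emotional_a)

# Step 3: speaker B's neutral synthesis model (text features -> B's neutral acoustic features).
neutral_b = text_feats @ rng.random((D_TXT, D_AC))
W_b_neutral = fit_linear(text_feats, neutral_b)

# Step 4: series connection: B's neutral model followed by the conversion model gives
# B's emotional synthesis, without any emotional recordings from B.
def synthesize_emotional_b(text_features):
    return (text_features @ W_b_neutral) @ W_conv

print(synthesize_emotional_b(text_feats[:3]).shape)   # (3, D_AC)

The point of the sketch is the data flow: the conversion relationship is learned once from speaker A and then reused behind any other speaker's neutral model.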
The emotion synthesis method based on a deep neural network model is further improved in that the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained by the following steps:
Providing a certain number of sentence texts for the first speaker, the sentence texts including neutral sentence texts and emotional sentence texts with identical text content;
Obtaining neutral speech data of the first speaker from the neutral sentence texts, and obtaining emotional speech data of the first speaker from the emotional sentence texts;
Extracting the neutral acoustic feature data of the first speaker from the neutral speech data;
Extracting the emotional acoustic feature data of the first speaker from the emotional speech data.
The emotion synthesis method based on a deep neural network model is further improved in that the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained as follows:
Obtaining neutral speech data and emotional speech data of the first speaker;
Training a deep neural network model with the neutral speech data of the first speaker to obtain a neutral speech synthesis model of the first speaker;
Training a deep neural network model with the emotional speech data of the first speaker to obtain an emotional speech synthesis model of the first speaker;
Providing a certain number of sentence texts and feeding them separately into the neutral speech synthesis model and the emotional speech synthesis model of the first speaker to obtain the corresponding neutral acoustic feature data and emotional acoustic feature data of the first speaker.
The emotion synthesis method based on a deep neural network model is further improved in that, after obtaining the neutral speech data of the second speaker, the neutral speech synthesis model of the second speaker is established as follows:
Using the neutral speech data of the second speaker, retraining the neutral speech synthesis model of the first speaker to obtain the neutral speech synthesis model of the second speaker.
The emotion synthesis method based on a deep neural network model is further improved in that, after obtaining the neutral speech data of the second speaker, the neutral speech synthesis model of the second speaker is established as follows:
Training a deep neural network model with the neutral speech data of the second speaker to obtain the neutral speech synthesis model of the second speaker.
The emotion synthesis method based on a deep neural network model is further improved in that the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker is established using a deep neural network model as follows:
Taking the neutral acoustic feature data of the first speaker as the input data of the deep neural network model;
Taking the emotional acoustic feature data of the first speaker as the output data of the deep neural network model;
Training the deep neural network model to obtain the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker.
The emotion synthesis method based on a deep neural network model is further improved in that the deep neural network model is trained as follows to obtain the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker:
Building a regression model with the neural network of the deep neural network model, the hidden layers using sigmoid activation functions and the output layer using a linear activation function;
Taking randomly initialized network parameters as the initial parameters and performing model training under the minimum mean square error criterion of formula 1:
L(y, z) = ||y - z||^2    (1)
where y is the emotional acoustic feature data and z is the emotional acoustic feature parameter predicted by the deep neural network model; the training objective is to update the deep neural network model so that L(y, z) is minimized.
The emotion synthesis method based on a deep neural network model is further improved in that the neutral speech synthesis model of the second speaker is connected in series with the emotion conversion model as follows to obtain the emotional speech synthesis model of the second speaker:
In the synthesis stage, for the text to be synthesized, analyzing the text with the synthesis front end to obtain the corresponding text features, the text features including phoneme information, prosody information, 0/1 coding information, and the relative position information of the current frame within the current phoneme;
Taking the phoneme information, prosody information, and 0/1 coding information as the input of the deep neural network model and predicting the phoneme duration information;
Taking the phoneme information, prosody information, 0/1 coding information, and the relative position information of the current frame within the current phoneme as the input of the deep neural network model and predicting spectrum information, energy information, and fundamental frequency information;
Taking the predicted spectrum information, energy information, and fundamental frequency information as acoustic parameters and performing parameter generation on the acoustic features according to formula 2 to obtain smooth acoustic features;
log P(WC|Q, λ) = -1/2 C^T W^T U^-1 W C + C^T W^T U^-1 M + const    (2)
where W is the window function matrix for computing the first-order and second-order differences, C is the acoustic feature to be generated, M is the acoustic parameter predicted by the deep neural network model, and U is the global variance statistic obtained from the training speech corpus;
Using the acoustic feature C, synthesizing the emotional speech through a vocoder.
The emotion synthesis method based on a deep neural network model is further improved in that the neutral speech data comprises the acoustic feature sequence of the neutral speech and the corresponding text data information, the acoustic feature sequence of the neutral speech including spectrum, energy, fundamental frequency, and duration.
By adopting the above technical solution, the present invention has the following advantages:
The emotion synthesis method of the present invention obtains the neutral acoustic feature data and emotional acoustic feature data of one speaker and uses a deep neural network model to establish the conversion relationship between that speaker's neutral and emotional acoustic features; thus, given only a small amount of neutral speech data from another speaker, the corresponding emotion model can be obtained;
When obtaining the neutral and emotional acoustic feature data of the speaker, the speaker's neutral and emotional speech models can output the synthesized acoustic features of the same batch of sentences, and these synthesized acoustic features are used to establish the conversion relationship between neutral and emotional acoustic features; alternatively, neutral sentences and emotional sentences with identical text content can be recorded to obtain the speaker's neutral and emotional speech data, from which the neutral and emotional synthesized acoustic features are extracted to establish the conversion relationship between neutral and emotional acoustic features;
With the present invention, the emotion model of any other speaker can be obtained from the emotion model of a single speaker by means of that speaker's neutral-to-emotional conversion model, with the advantages of a small amount of data, fast emotion model construction, and low cost.
Description of the drawings
Fig. 1 is an operational flowchart of the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 2 is a data flow diagram of a first embodiment of the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 3 is a data flow diagram of a second embodiment of the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 4 is a flowchart of happiness-emotion synthesis in the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 5 is a schematic structural diagram of the neutral speech synthesis model of the first speaker in the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 6 is a schematic structural diagram of the emotion conversion model in the emotion synthesis method based on a deep neural network model of the present invention.
Fig. 7 is a schematic structural diagram of the emotional speech synthesis model of the second speaker in the emotion synthesis method based on a deep neural network model of the present invention.
Specific embodiments
The present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
The embodiments of the present invention are illustrated below by way of specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention may also be implemented or applied through other specific embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention.
It should be noted that the structures, proportions, sizes, and the like depicted in the drawings of this specification are only intended to accompany the content disclosed in the specification for the understanding and reading of those skilled in the art, and are not intended to limit the conditions under which the invention can be implemented; they therefore carry no essential technical significance. Any structural modification, change of proportional relationship, or adjustment of size that does not affect the effects the invention can produce or the purposes it can achieve still falls within the scope covered by the technical content disclosed by the invention. Meanwhile, terms such as "upper", "lower", "left", "right", "middle", and "one" cited in this specification are merely for convenience of description and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantial changes to the technical content, shall also be regarded as within the implementable scope of the invention.
The purpose of the present invention is to propose an emotion synthesis method based on a deep neural network model that solves the problem that, when an existing emotion model is generated, the huge amount of data required leads to a long development time and an excessive research-and-development cost; the aim is to rapidly build a corresponding emotion model for each of multiple different speakers using only a small amount of neutral data.
First, referring to Fig. 1, which is an operational flowchart of the emotion synthesis method based on a deep neural network model of the present invention, the method mainly comprises the following steps and realizes the following functions:
S001: Obtain the neutral acoustic feature data and emotional acoustic feature data of a first speaker (speaker A);
S002: Establish, using a deep neural network model, an emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker (speaker A);
S003: Obtain neutral speech data of a second speaker (speaker B) and establish a neutral speech synthesis model of the second speaker (speaker B);
S004: Connect, using the deep neural network model, the neutral speech synthesis model of the second speaker (speaker B) in series with the emotion conversion model to obtain an emotional speech synthesis model of the second speaker (speaker B).
The emotion synthesis method based on a deep neural network model of the present invention obtains the neutral acoustic feature data and emotional acoustic feature data of one speaker and uses a deep neural network model to establish the conversion relationship between that speaker's neutral and emotional acoustic features; thus, given only a small amount of neutral speech data from another speaker, the corresponding emotion model can be obtained. When obtaining the neutral and emotional acoustic feature data of the speaker, either the speaker's neutral and emotional speech models output the synthesized acoustic features of the same batch of sentences, and these synthesized acoustic features are used to establish the conversion relationship between neutral and emotional acoustic features; or neutral sentences and emotional sentences with identical text content are recorded to obtain the speaker's neutral and emotional speech data, from which the neutral and emotional synthesized acoustic features are extracted to establish the conversion relationship between neutral and emotional acoustic features. Therefore, with the present invention, the emotion model of any other speaker can be obtained from the emotion model of a single speaker by means of that speaker's neutral-to-emotional conversion model, with the advantages of a small amount of data, fast emotion model construction, and low cost.
For the above step S001, the invention provides two ways of obtaining the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A), as follows:
Mode one:
Referring to Fig. 2, which is a data flow diagram of the first embodiment of the emotion synthesis method based on a deep neural network model of the present invention, this mode comprises:
Providing a certain number of sentence texts (for example 2000) for the first speaker (speaker A), the sentence texts including neutral sentence texts (for example 2000) and emotional sentence texts (for example 2000) with identical text content;
Obtaining the neutral speech data of the first speaker (speaker A) from the neutral sentence texts, for example by recording the neutral sentence texts and obtaining the neutral speech data of the first speaker (speaker A) from the recordings;
Obtaining the emotional speech data of the first speaker (speaker A) from the emotional sentence texts, for example by recording the emotional sentence texts and obtaining the emotional speech data of the first speaker (speaker A) from the recordings;
Extracting the neutral acoustic feature data of the first speaker from the obtained neutral speech data of the first speaker (speaker A);
Extracting the emotional acoustic feature data of the first speaker from the obtained emotional speech data of the first speaker (speaker A).
Mode two:
Referring again to Fig. 3, which is a data flow diagram of the second embodiment of the emotion synthesis method based on a deep neural network model of the present invention, this mode comprises:
Obtaining the neutral speech data and the emotional speech data of the first speaker (speaker A), for example by recording;
Carrying out deep neural network (Deep Neural Networks, DNN) model training with the neutral speech data of the first speaker (speaker A) to obtain the neutral speech synthesis model of the first speaker (speaker A);
Carrying out deep neural network (DNN) model training with the emotional speech data of the first speaker (speaker A) to obtain the emotional speech synthesis model of the first speaker (speaker A);
Providing a certain number of sentence texts (for example 5000) and feeding them separately into the neutral speech synthesis model and the emotional speech synthesis model of the first speaker (speaker A) to obtain the corresponding neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A).
Either of the above two modes can be used to obtain the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A). Mode one is more direct: the neutral speech data and emotional speech data of the first speaker (speaker A) are obtained directly from a certain number of recorded sentence texts, and the corresponding neutral and emotional acoustic feature data are then extracted from that speech data; however, the recorded sentence texts must include neutral sentence texts and emotional sentence texts with identical text content. Mode two imposes no such requirement: the text content need not be constrained when recording the sentence texts, since a certain number of arbitrary sentence texts are simply fed into the neutral speech synthesis model and the emotional speech synthesis model, and the corresponding neutral and emotional acoustic feature data are obtained from those models; the data obtained through the neutral and emotional speech synthesis models have higher precision and better consistency.
For the above step S003, after obtaining the neutral speech data of the second speaker (speaker B), the present invention can establish the neutral speech synthesis model of the second speaker (speaker B) in either of the following two ways (see Fig. 3), specifically:
Mode one, which must be based on obtaining the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A) with the above second mode:
Using the recorded neutral speech data of the second speaker (speaker B), retraining (retrain) the neutral speech synthesis model of the first speaker (speaker A) to obtain the neutral speech synthesis model of the second speaker (speaker B); this step is realized by model training based on the deep neural network (DNN) model.
Mode two, which can be applied with either of the above two ways of obtaining the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A):
Carrying out deep neural network (DNN) model training with the recorded neutral speech data of the second speaker (speaker B) to obtain the neutral speech synthesis model of the second speaker.
The above step S002 is the innovative point of the emotion synthesis method based on a deep neural network model of the present invention: the obtained neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A) are used to build the acoustic conversion relationship between the two kinds of data, and a deep neural network model (DNN) is then used to obtain the emotion conversion model corresponding to that acoustic conversion relationship. With this emotion conversion model, the emotion model of the corresponding speaker (i.e., the emotional speech synthesis model, emotion model for short) can be obtained based on the deep neural network model (DNN).
The emotion models to which the emotion synthesis method based on a deep neural network model of the present invention applies include emotion models such as happiness, anger, indignation, and sadness.
Based on the emotion model of a single speaker, the present invention can obtain the emotion model of any other speaker by means of that speaker's neutral-to-emotional conversion model, with the advantages of a small amount of data, fast emotion model construction, and low cost.
In the emotion synthesis method based on a deep neural network model of the present invention, the neutral and emotional speech models of one speaker output the synthesized acoustic features of the same batch of sentences, and these synthesized acoustic features are used to establish the conversion relationship between neutral and emotional acoustic features; thus, the corresponding emotion model can be obtained with only a small amount of neutral speech data from another speaker.
The method is illustrated below by taking the construction of a happiness emotion model (i.e., an emotional speech synthesis model for the happiness emotion) as an example. As shown in Fig. 4, which is a flowchart of happiness-emotion synthesis in the emotion synthesis method based on a deep neural network model of the present invention, the method comprises the following steps:
(1) Record the neutral speech data and the happiness speech data of speaker A;
(2) Carry out DNN (deep neural network) model training with the neutral speech data to obtain the neutral speech synthesis model of speaker A, as shown in Fig. 5, which is a schematic structural diagram of the neutral speech synthesis model of speaker A. The neutral speech data comprise the acoustic feature sequence of the neutral speech and the corresponding text data information; the acoustic feature sequence of the neutral speech includes spectrum, energy, fundamental frequency, and duration. The details are as follows:
Step 1: Obtain the input data:
For the corresponding text features, the traditional phoneme and prosody information corresponding to the text is obtained and 0/1 encoded, yielding 1114 binary dimensions; in addition, the relative position of the current frame within the current phoneme (normalized between 0 and 1), including the forward position and the backward position, adds 2 dimensions. The 0/1 coding of the phoneme and prosody information together with the position information thus amounts to 1116 dimensions, which serve as the DNN network input;
Step 2: Obtain the output data:
The output comprises acoustic features such as spectrum, energy, fundamental frequency, and duration. The acoustic features are divided into two classes and modeled separately: 1) spectrum, energy, and fundamental frequency, where the spectrum has 40 dimensions, the energy 1 dimension, the fundamental frequency 1 dimension, and the voiced/unvoiced flag of the fundamental frequency 1 dimension; the fundamental frequency is extended with a frame context of the 4 preceding and 4 following frames, and first-order and second-order difference information is included for the spectrum and energy parameters, for a total of 133 dimensions; 2) duration, i.e., the phoneme duration expressed as the number of frames contained in the phoneme, 1 dimension;
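As a quick check of the dimensionality described above, the following Python sketch assembles a toy 133-dimensional output frame, assuming the layout (spectrum 40 + energy 1) with first- and second-order differences, F0 over a context of the current frame plus the 4 preceding and 4 following frames, and one voiced/unvoiced flag; this layout is an interpretation of the text, not an official specification.

import numpy as np

SPECTRUM, ENERGY = 40, 1
static = SPECTRUM + ENERGY              # 41 static dimensions
with_deltas = static * 3                # statics + delta + delta-delta = 123
f0_context = 1 * (4 + 1 + 4)            # F0 of the current frame and +/-4 frames = 9
vuv_flag = 1                            # voiced/unvoiced flag of the fundamental frequency

total = with_deltas + f0_context + vuv_flag
print(total)                            # 133

# Assembling one toy frame in this assumed layout:
frame = np.concatenate([
    np.zeros(with_deltas),              # spectrum/energy plus their differences
    np.zeros(f0_context),               # F0 trajectory around the current frame
    np.zeros(vuv_flag),                 # V/UV flag
])
assert frame.shape == (133,)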
Step 3: Train the DNN model:
A classical BP (Back Propagation) neural network is used here to build the regression model; the hidden layers use sigmoid activation functions and the output layer uses a linear activation function. The network parameters are first randomly initialized as the initial parameters, and model training is then carried out under the following MMSE (Minimum Mean Square Error) criterion:
L(y, z) = ||y - z||^2
where y is the natural target parameter and z is the parameter predicted by the DNN model; the training objective is to update the DNN network so that L(y, z) is minimized.
The two classes of acoustic features mentioned above are modeled separately here:
1) Spectrum, energy, and fundamental frequency, 133 dimensions in total, with the network structure 1116-1024-1024-133; the resulting neutral speech synthesis model is denoted M_ANS.
2) Duration, 1 dimension in total, where the network input does not include the frame position information within the current phoneme, with the network structure 1114-1024-1024-1; the resulting neutral speech synthesis model is denoted M_AND.
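A minimal PyTorch sketch of the two regression networks described above (M_ANS: 1116-1024-1024-133, M_AND: 1114-1024-1024-1, sigmoid hidden layers, linear output) and one training step under the MMSE criterion; the optimizer, learning rate, and random toy batches are assumptions, not values from the patent.

import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    # Regression network with two sigmoid hidden layers and a linear output layer.
    def __init__(self, in_dim=1116, hidden_dim=1024, out_dim=133):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, out_dim),          # linear output layer
        )

    def forward(self, x):
        return self.net(x)

acoustic_model = AcousticDNN()                       # M_ANS: 1116 -> 1024 -> 1024 -> 133
duration_model = AcousticDNN(in_dim=1114, out_dim=1) # M_AND: 1114 -> 1024 -> 1024 -> 1

criterion = nn.MSELoss()                             # mean-squared error, i.e. L(y, z) = ||y - z||^2 up to averaging
optimizer = torch.optim.SGD(acoustic_model.parameters(), lr=1e-3)

x = torch.rand(32, 1116)    # toy batch of 1116-dim text/position features
y = torch.rand(32, 133)     # toy batch of 133-dim acoustic targets
z = acoustic_model(x)
loss = criterion(z, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()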
(3) Carry out DNN model training with the happiness speech data to obtain the happiness speech synthesis model of speaker A. The happiness speech data comprise the feature sequence of the happiness speech and the corresponding text data information; the feature sequence of the happiness speech includes spectrum, energy, fundamental frequency, and duration, and the specific modeling is similar to that of the neutral speech synthesis model of speaker A. The resulting DNN models of the emotional speech synthesis model of speaker A are denoted M_AES and M_AED.
(4) Provide an arbitrary batch of a certain number of sentence texts (for example 5000), and feed them separately into the neutral speech synthesis model and the happiness speech synthesis model of speaker A to obtain the corresponding neutral synthesized acoustic feature data of A and the happiness synthesized acoustic feature data of A; then build the acoustic conversion relationship between A's neutral synthesized acoustic features and A's happiness synthesized acoustic features, and use a DNN to obtain the emotion conversion model from this neutral-to-happiness conversion relationship, as shown in Fig. 6, which is a schematic structural diagram of the emotion conversion model of the emotion synthesis method based on a deep neural network model of the present invention. The details are as follows:
Step 1: Obtain the input data:
According to the input text, the neutral speech synthesis model of speaker A is used to obtain the corresponding neutral acoustic feature data; specifically, the neutral speech synthesis model M_ANS of speaker A is used to obtain the spectrum, energy, and fundamental frequency features, and the neutral speech synthesis model M_AND of speaker A is used to obtain the phoneme duration features;
Step 2: Obtain the output data:
According to the input text, the emotional speech synthesis model of speaker A is used to obtain the corresponding acoustic features; specifically, the emotional speech synthesis model M_AES of speaker A is used to obtain the spectrum, energy, and fundamental frequency features, and the emotional speech synthesis model M_AED of speaker A is used to obtain the phoneme duration features. These two sets of features serve as the target emotional acoustic feature parameters.
Step 3: Train the DNN model:
A BP (Back Propagation) neural network is used here to build the regression model (a kind of DNN model); the hidden layers use sigmoid activation functions and the output layer uses a linear activation function. The network parameters are first randomly initialized as the initial parameters, and model training is then carried out under the following MMSE criterion:
L(y, z) = ||y - z||^2
where y is the target emotional acoustic feature parameter and z is the emotional acoustic feature parameter predicted by the DNN model; the training objective is to update the DNN network so that L(y, z) is minimized.
The two classes of acoustic features mentioned above are modeled separately here:
1) Spectrum, energy, and fundamental frequency, 133 dimensions in total, with the network structure 133-1024-1024-133; the resulting model is denoted M_CS.
2) Duration, 1 dimension in total, with the network structure 1-1024-1024-1; the resulting model is denoted M_CD.
The models M_CS and M_CD are the emotion conversion models between the neutral acoustic feature data and the emotional acoustic feature data of speaker A.
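A corresponding sketch of the emotion conversion models M_CS (133-1024-1024-133) and M_CD (1-1024-1024-1), trained on paired features where the inputs come from speaker A's neutral synthesis models and the targets from A's happiness synthesis models for the same sentences; the random tensors below stand in for those synthesized feature pairs, and the optimizer settings are illustrative.

import torch
import torch.nn as nn

def make_regressor(in_dim, out_dim, hidden=1024):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Sigmoid(),
        nn.Linear(hidden, hidden), nn.Sigmoid(),
        nn.Linear(hidden, out_dim),          # linear output layer
    )

m_cs = make_regressor(133, 133)              # spectrum/energy/F0 conversion model
m_cd = make_regressor(1, 1)                  # phoneme-duration conversion model (trained analogously)

neutral_feats   = torch.rand(64, 133)        # stand-in for features synthesized by M_ANS
emotional_feats = torch.rand(64, 133)        # stand-in for features synthesized by M_AES

opt = torch.optim.Adam(m_cs.parameters(), lr=1e-4)
loss = nn.functional.mse_loss(m_cs(neutral_feats), emotional_feats)
opt.zero_grad()
loss.backward()
opt.step()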
Then, obtain the neutral speech data of speaker B.
The neutral speech data of speaker B is then used to retrain (retrain) the neutral speech synthesis model of speaker A, yielding the neutral speech synthesis model of speaker B; alternatively, deep neural network (DNN) model training can be carried out directly with the obtained neutral speech data of speaker B, which likewise yields the neutral speech synthesis model of speaker B. The former scheme is adopted in this embodiment. The neutral speech data comprise the feature sequence of the neutral speech and the corresponding text data information; the feature sequence of the neutral speech includes spectrum, energy, fundamental frequency, and duration, and the specific modeling is similar to that of the neutral speech synthesis model of speaker A, except that the network parameters are not randomly initialized here; instead, the neutral speech synthesis model of speaker A provides the initial parameters. The resulting DNN models of the neutral speech synthesis model of speaker B are denoted M_BNS and M_BND.
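A sketch of this retraining step, assuming identical network topology for speakers A and B: speaker B's model is initialized from A's trained weights rather than random values and then updated on B's small neutral corpus (random stand-in data below); the number of update steps and the learning rate are illustrative.

import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(
        nn.Linear(1116, 1024), nn.Sigmoid(),
        nn.Linear(1024, 1024), nn.Sigmoid(),
        nn.Linear(1024, 133),
    )

model_a = make_model()                         # assumed already trained on speaker A's data
model_b = make_model()
model_b.load_state_dict(model_a.state_dict())  # initialize from A's parameters, not randomly

opt = torch.optim.SGD(model_b.parameters(), lr=1e-4)
x_b = torch.rand(16, 1116)                     # speaker B's small neutral corpus (stand-in)
y_b = torch.rand(16, 133)
for _ in range(10):                            # a few retraining updates
    loss = nn.functional.mse_loss(model_b(x_b), y_b)
    opt.zero_grad()
    loss.backward()
    opt.step()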
The neutral speech synthesis models M_BNS and M_BND of speaker B are connected in series with the emotion conversion models M_CS and M_CD, respectively, to obtain the emotional speech synthesis models M_BNS-M_CS and M_BND-M_CD of speaker B, whose structure is shown in Fig. 7, a schematic structural diagram of the emotional speech synthesis model of speaker B.
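A sketch of the series connection M_BNS followed by M_CS: text features are first mapped to speaker B's neutral acoustic features and then converted into emotional acoustic features. The two untrained networks below only illustrate the composition with the dimensions given in the text.

import torch
import torch.nn as nn

m_bns = nn.Sequential(nn.Linear(1116, 1024), nn.Sigmoid(),
                      nn.Linear(1024, 1024), nn.Sigmoid(),
                      nn.Linear(1024, 133))
m_cs  = nn.Sequential(nn.Linear(133, 1024), nn.Sigmoid(),
                      nn.Linear(1024, 1024), nn.Sigmoid(),
                      nn.Linear(1024, 133))

def emotional_synthesis_b(text_features: torch.Tensor) -> torch.Tensor:
    neutral_b = m_bns(text_features)           # speaker B's neutral acoustic features
    return m_cs(neutral_b)                     # converted emotional acoustic features

print(emotional_synthesis_b(torch.rand(4, 1116)).shape)   # torch.Size([4, 133])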
In the synthesis stage, for the text to be synthesized, the synthesis front end is used to analyze the text and obtain the corresponding text features; specifically, the traditional phoneme and prosody information corresponding to the text is obtained and 0/1 encoded, yielding 1114 binary dimensions; in addition, the relative position of the current frame within the current phoneme (normalized between 0 and 1), including the forward position and the backward position, adds 2 dimensions. The 0/1 coding of the phoneme and prosody information together with the position information thus amounts to 1116 dimensions, which serve as the DNN network input.
The prediction steps are as follows:
1. Predict the phoneme duration information: the network input here does not include the frame position information within the current phoneme; the 1114-dimensional 0/1 coding of the phoneme and prosody information is taken as the input, and the phoneme duration is predicted;
2. Predict the spectrum, energy, and fundamental frequency information: the 1116-dimensional information obtained from the front-end analysis above is taken as the input, and the spectrum, energy, and fundamental frequency information, 133 dimensions in total, is predicted;
3. For the predicted acoustic parameters, parameter generation is carried out according to the following formula to obtain smooth acoustic parameters:
log P(WC|Q, λ) = -1/2 C^T W^T U^-1 W C + C^T W^T U^-1 M + const    (2)
where W is the window function matrix for computing the first-order and second-order differences, C is the acoustic feature to be generated, M is the acoustic feature predicted by the DNN network, and U is the global variance statistic obtained from the training speech corpus.
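A minimal numpy sketch of this parameter-generation step for a single acoustic dimension: maximizing log P(WC|Q, λ) with respect to C gives the linear system (W^T U^-1 W) C = W^T U^-1 M, which is solved below; the window coefficients, the boundary handling, and the random M and U are illustrative assumptions rather than values from the patent.

import numpy as np

T = 10                                       # number of frames

def delta_window_matrix(T):
    # Stack static, first-order, and second-order difference windows into a (3T x T) matrix W.
    I = np.eye(T)
    D1 = np.zeros((T, T))                    # first-order difference: 0.5 * (c[t+1] - c[t-1])
    D2 = np.zeros((T, T))                    # second-order difference: c[t-1] - 2*c[t] + c[t+1]
    for t in range(T):
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        D1[t, hi] += 0.5
        D1[t, lo] -= 0.5
        D2[t, lo] += 1.0
        D2[t, t]  -= 2.0
        D2[t, hi] += 1.0
    return np.vstack([I, D1, D2])

W = delta_window_matrix(T)                   # (3T, T)
M = np.random.rand(3 * T)                    # DNN-predicted static/delta/delta-delta means (stand-in)
U_inv = np.diag(1.0 / (np.random.rand(3 * T) + 0.1))   # inverse global variances (stand-in)

# Solve (W^T U^-1 W) C = W^T U^-1 M for the smooth static trajectory C.
A = W.T @ U_inv @ W
b = W.T @ U_inv @ M
C = np.linalg.solve(A, b)
print(C.shape)                               # (T,)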
4. Using the acoustic feature C, synthesize speech through a vocoder, thereby obtaining the emotional synthesized speech of speaker B.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above with reference to a preferred embodiment, this is not intended to limit the present invention. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make slight changes or modifications amounting to equivalent embodiments; any simple amendment, equivalent change, or modification made to the above embodiment according to the technical substance of the present invention, provided it does not depart from the content of the technical solution of the present invention, still falls within the scope of the technical solution of the present invention.

Claims (9)

1. An emotion synthesis method based on a deep neural network model, characterized by comprising the steps of:
obtaining neutral acoustic feature data and emotional acoustic feature data of a first speaker;
establishing, using a deep neural network model, an emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker;
obtaining neutral speech data of a second speaker and establishing a neutral speech synthesis model of the second speaker; and
connecting, using the deep neural network model, the neutral speech synthesis model of the second speaker in series with the emotion conversion model to obtain an emotional speech synthesis model of the second speaker.
2. The emotion synthesis method based on a deep neural network model according to claim 1, characterized in that the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained by the following steps:
providing a certain number of sentence texts for the first speaker, the sentence texts including neutral sentence texts and emotional sentence texts with identical text content;
obtaining neutral speech data of the first speaker from the neutral sentence texts, and obtaining emotional speech data of the first speaker from the emotional sentence texts;
extracting the neutral acoustic feature data of the first speaker from the neutral speech data;
extracting the emotional acoustic feature data of the first speaker from the emotional speech data.
3. The emotion synthesis method based on a deep neural network model according to claim 1, characterized in that the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained as follows:
obtaining neutral speech data and emotional speech data of the first speaker;
training a deep neural network model with the neutral speech data of the first speaker to obtain a neutral speech synthesis model of the first speaker;
training a deep neural network model with the emotional speech data of the first speaker to obtain an emotional speech synthesis model of the first speaker;
providing a certain number of sentence texts and feeding them separately into the neutral speech synthesis model and the emotional speech synthesis model of the first speaker to obtain the corresponding neutral acoustic feature data and emotional acoustic feature data of the first speaker.
4. The emotion synthesis method based on a deep neural network model according to claim 3, characterized in that, after obtaining the neutral speech data of the second speaker, the neutral speech synthesis model of the second speaker is established as follows:
using the neutral speech data of the second speaker, retraining the neutral speech synthesis model of the first speaker to obtain the neutral speech synthesis model of the second speaker.
5. The emotion synthesis method based on a deep neural network model according to claim 1, characterized in that, after obtaining the neutral speech data of the second speaker, the neutral speech synthesis model of the second speaker is established as follows:
training a deep neural network model with the neutral speech data of the second speaker to obtain the neutral speech synthesis model of the second speaker.
6. The emotion synthesis method based on a deep neural network model according to any one of claims 1 to 5, characterized in that the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker is established using a deep neural network model as follows:
taking the neutral acoustic feature data of the first speaker as the input data of the deep neural network model;
taking the emotional acoustic feature data of the first speaker as the output data of the deep neural network model;
training the deep neural network model to obtain the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker.
7. The emotion synthesis method based on a deep neural network model according to claim 6, characterized in that the deep neural network model is trained as follows to obtain the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of the first speaker:
building a regression model with the neural network of the deep neural network model, the hidden layers using sigmoid activation functions and the output layer using a linear activation function;
taking randomly initialized network parameters as the initial parameters and performing model training under the minimum mean square error criterion of formula 1:
L(y, z) = ||y - z||^2    (1)
wherein y is the emotional acoustic feature data and z is the emotional acoustic feature parameter predicted by the deep neural network model, and the training objective is to update the deep neural network model so that L(y, z) is minimized.
8. The emotion synthesis method based on a deep neural network model according to claim 2, characterized in that the neutral speech synthesis model of the second speaker is connected in series with the emotion conversion model as follows to obtain the emotional speech synthesis model of the second speaker:
in the synthesis stage, for the text to be synthesized, analyzing the text with the synthesis front end to obtain the corresponding text features, the text features including phoneme information, prosody information, 0/1 coding information, and the relative position information of the current frame within the current phoneme;
taking the phoneme information, prosody information, and 0/1 coding information as the input of the deep neural network model and predicting the phoneme duration information;
taking the phoneme information, prosody information, 0/1 coding information, and the relative position information of the current frame within the current phoneme as the input of the deep neural network model and predicting spectrum information, energy information, and fundamental frequency information;
taking the predicted spectrum information, energy information, and fundamental frequency information as acoustic parameters and performing parameter generation on the acoustic features according to formula 2 to obtain smooth acoustic features;
log P(WC|Q, λ) = -1/2 C^T W^T U^-1 W C + C^T W^T U^-1 M + const    (2)
wherein W is the window function matrix for computing the first-order and second-order differences, C is the acoustic feature to be generated, M is the acoustic parameter predicted by the deep neural network model, and U is the global variance statistic obtained from the training speech corpus;
using the acoustic feature C, synthesizing the emotional speech through a vocoder.
9. The emotion synthesis method based on a deep neural network model according to claim 1, characterized in that the neutral speech data comprises the acoustic feature sequence of the neutral speech and the corresponding text data information, and the acoustic feature sequence of the neutral speech includes spectrum, energy, fundamental frequency, and duration.
CN201611201686.6A 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model Active CN106531150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611201686.6A CN106531150B (en) 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611201686.6A CN106531150B (en) 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model

Publications (2)

Publication Number Publication Date
CN106531150A true CN106531150A (en) 2017-03-22
CN106531150B CN106531150B (en) 2020-02-07

Family

ID=58337400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611201686.6A Active CN106531150B (en) 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model

Country Status (1)

Country Link
CN (1) CN106531150B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN108305641A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
CN108364631A (en) * 2017-01-26 2018-08-03 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN108447470A (en) * 2017-12-28 2018-08-24 中南大学 A kind of emotional speech conversion method based on sound channel and prosodic features
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN108763190A (en) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
CN109036370A (en) * 2018-06-06 2018-12-18 安徽继远软件有限公司 A kind of speaker's voice adaptive training method
CN109102796A (en) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 A kind of phoneme synthesizing method and device
WO2019218773A1 (en) * 2018-05-15 2019-11-21 中兴通讯股份有限公司 Voice synthesis method and device, storage medium, and electronic device
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
WO2020073944A1 (en) * 2018-10-10 2020-04-16 华为技术有限公司 Speech synthesis method and device
WO2020098269A1 (en) * 2018-11-15 2020-05-22 华为技术有限公司 Speech synthesis method and speech synthesis device
CN111599338A (en) * 2020-04-09 2020-08-28 云知声智能科技股份有限公司 Stable and controllable end-to-end speech synthesis method and device
CN111613224A (en) * 2020-04-10 2020-09-01 云知声智能科技股份有限公司 Personalized voice synthesis method and device
US11538455B2 (en) 2018-02-16 2022-12-27 Dolby Laboratories Licensing Corporation Speech style transfer

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064104A (en) * 2006-04-24 2007-10-31 中国科学院自动化研究所 Emotion voice creating method based on voice conversion
CN101308652A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 Synthesizing method of personalized singing voice
CN102005205A (en) * 2009-09-03 2011-04-06 株式会社东芝 Emotional speech synthesizing method and device
US20140067397A1 (en) * 2012-08-29 2014-03-06 Nuance Communications, Inc. Using emoticons for contextual text-to-speech expressivity
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
EP3046053A2 (en) * 2015-01-19 2016-07-20 Samsung Electronics Co., Ltd Method and apparatus for training language model, and method and apparatus for recongnizing language

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064104A (en) * 2006-04-24 2007-10-31 中国科学院自动化研究所 Emotion voice creating method based on voice conversion
CN101308652A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 Synthesizing method of personalized singing voice
CN102005205A (en) * 2009-09-03 2011-04-06 株式会社东芝 Emotional speech synthesizing method and device
US20140067397A1 (en) * 2012-08-29 2014-03-06 Nuance Communications, Inc. Using emoticons for contextual text-to-speech expressivity
EP3046053A2 (en) * 2015-01-19 2016-07-20 Samsung Electronics Co., Ltd Method and apparatus for training language model, and method and apparatus for recongnizing language
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364631A (en) * 2017-01-26 2018-08-03 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN108364631B (en) * 2017-01-26 2021-01-22 北京搜狗科技发展有限公司 Speech synthesis method and device
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN108305641A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
CN108305641B (en) * 2017-06-30 2020-04-07 腾讯科技(深圳)有限公司 Method and device for determining emotion information
CN108447470A (en) * 2017-12-28 2018-08-24 中南大学 A kind of emotional speech conversion method based on sound channel and prosodic features
US11538455B2 (en) 2018-02-16 2022-12-27 Dolby Laboratories Licensing Corporation Speech style transfer
CN108763190A (en) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
CN108763190B (en) * 2018-04-12 2019-04-02 平安科技(深圳)有限公司 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
WO2019196306A1 (en) * 2018-04-12 2019-10-17 平安科技(深圳)有限公司 Device and method for speech-based mouth shape animation blending, and readable storage medium
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
WO2019218773A1 (en) * 2018-05-15 2019-11-21 中兴通讯股份有限公司 Voice synthesis method and device, storage medium, and electronic device
CN110556092A (en) * 2018-05-15 2019-12-10 中兴通讯股份有限公司 Speech synthesis method and device, storage medium and electronic device
CN109036370A (en) * 2018-06-06 2018-12-18 安徽继远软件有限公司 A kind of speaker's voice adaptive training method
CN109036370B (en) * 2018-06-06 2021-07-20 安徽继远软件有限公司 Adaptive training method for speaker voice
CN109102796A (en) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 A kind of phoneme synthesizing method and device
WO2020073944A1 (en) * 2018-10-10 2020-04-16 华为技术有限公司 Speech synthesis method and device
US11361751B2 (en) 2018-10-10 2022-06-14 Huawei Technologies Co., Ltd. Speech synthesis method and device
US11282498B2 (en) 2018-11-15 2022-03-22 Huawei Technologies Co., Ltd. Speech synthesis method and speech synthesis apparatus
WO2020098269A1 (en) * 2018-11-15 2020-05-22 华为技术有限公司 Speech synthesis method and speech synthesis device
CN111192568A (en) * 2018-11-15 2020-05-22 华为技术有限公司 Speech synthesis method and speech synthesis device
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
CN111599338A (en) * 2020-04-09 2020-08-28 云知声智能科技股份有限公司 Stable and controllable end-to-end speech synthesis method and device
CN111613224A (en) * 2020-04-10 2020-09-01 云知声智能科技股份有限公司 Personalized voice synthesis method and device

Also Published As

Publication number Publication date
CN106531150B (en) 2020-02-07

Similar Documents

Publication Publication Date Title
CN106531150A (en) Emotion synthesis method based on deep neural network model
CN101578659B (en) Voice tone converting device and voice tone converting method
CN111667812B (en) Speech synthesis method, device, equipment and storage medium
CN105118498B (en) The training method and device of phonetic synthesis model
CN105185372B (en) Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
CN101064104B (en) Emotion voice creating method based on voice conversion
CN107464559A (en) Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN108447486A (en) A kind of voice translation method and device
CN104916284B (en) Prosody and acoustics joint modeling method and device for voice synthesis system
CN107958433A (en) A kind of online education man-machine interaction method and system based on artificial intelligence
CN106653052A (en) Virtual human face animation generation method and device
CN106601228A (en) Sample marking method and device based on artificial intelligence prosody prediction
CN111048062A (en) Speech synthesis method and apparatus
CN106057192A (en) Real-time voice conversion method and apparatus
CN106971709A (en) Statistic parameter model method for building up and device, phoneme synthesizing method and device
CN106128450A (en) The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese
CN105654939A (en) Voice synthesis method based on voice vector textual characteristics
CN107452379A (en) The identification technology and virtual reality teaching method and system of a kind of dialect language
Malcangi Text-driven avatars based on artificial neural networks and fuzzy logic
CN107871496A (en) Audio recognition method and device
CN110010136A (en) The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
Schröder et al. Synthesis of emotional speech
CN101887719A (en) Speech synthesis method, system and mobile terminal equipment with speech synthesis function
CN110459201B (en) Speech synthesis method for generating new tone

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170929

Address after: 200233 Shanghai City, Xuhui District Guangxi 65 No. 1 Jinglu room 702 unit 03

Applicant after: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.

Address before: 200233 Shanghai, Qinzhou, North Road, No. 82, building 2, layer 1198,

Applicant before: SHANGHAI YUZHIYI INFORMATION TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An emotion synthesis method based on deep neural network model

Effective date of registration: 20201201

Granted publication date: 20200207

Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch

Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY Co.,Ltd.

Registration number: Y2020310000047

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20220307

Granted publication date: 20200207

Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch

Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.

Registration number: Y2020310000047

PC01 Cancellation of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An emotion synthesis method based on deep neural network model

Effective date of registration: 20230210

Granted publication date: 20200207

Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch

Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.

Registration number: Y2023310000028

PE01 Entry into force of the registration of the contract for pledge of patent right
PC01 Cancellation of the registration of the contract for pledge of patent right

Granted publication date: 20200207

Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch

Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.

Registration number: Y2023310000028

PC01 Cancellation of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A sentiment synthesis method based on deep neural network models

Granted publication date: 20200207

Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch

Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.

Registration number: Y2024310000165

PE01 Entry into force of the registration of the contract for pledge of patent right