Disclosure of Invention
The invention aims to solve the technical problem of providing an emotion synthesis method based on a deep neural network model, addressing the long development time and excessive research-and-development cost caused by the huge data volume required when existing emotion models are generated, and aiming to quickly construct corresponding emotion models for a number of different speakers using only a small amount of neutral data.
In order to realize this technical effect, the invention discloses an emotion synthesis method based on a deep neural network model, which comprises the following steps:
acquiring neutral acoustic feature data and emotional acoustic feature data of a first speaker;
establishing an emotion conversion model of the neutral acoustic feature data and the emotion acoustic feature data of the first speaker by using a deep neural network model;
acquiring neutral voice data of a second speaker, and establishing a neutral voice synthesis model of the second speaker; and
connecting the neutral voice synthesis model of the second speaker and the emotion conversion model in series using a deep neural network model to obtain an emotion voice synthesis model of the second speaker.
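The four steps above can be sketched as a minimal pipeline. This is an illustrative sketch only: every function name is a hypothetical placeholder introduced for this example, and the toy string-manipulating callables stand in for the actual deep neural network models described in the disclosure.

```python
def extract_features(voice_data):
    # Placeholder for real acoustic analysis (spectrum, energy, F0, duration).
    return voice_data

def train_conversion_model(neutral_feats, emotion_feats):
    # Step 2: learn a neutral -> emotional feature mapping (here a stub).
    return lambda feats: feats + "->emotional"

def train_neutral_synthesis_model(neutral_voice):
    # Steps 1/3: learn a text -> neutral acoustic feature model (a stub).
    return lambda text: f"{neutral_voice}:{text}"

def build_emotional_synthesizer(a_neutral, a_emotion, b_neutral):
    # Step 4: connect speaker B's neutral synthesis model and the
    # conversion model (trained on speaker A's data) in series.
    conversion = train_conversion_model(extract_features(a_neutral),
                                        extract_features(a_emotion))
    synth_b = train_neutral_synthesis_model(b_neutral)
    return lambda text: conversion(synth_b(text))
```

The key point the sketch captures is that the conversion model is trained once on speaker A's parallel data, while only neutral data is needed for speaker B.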
The emotion synthesis method based on the deep neural network model is further improved in that the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained by a method comprising the following steps:
providing a certain number of sentence texts of the first speaker, the sentence texts comprising neutral sentence texts and emotional sentence texts with identical text content;
acquiring neutral voice data of the first speaker from the neutral sentence texts, and acquiring emotional voice data of the first speaker from the emotional sentence texts;
extracting neutral acoustic feature data of a first speaker from the neutral speech data;
and extracting emotional acoustic feature data of the first speaker from the emotional voice data.
The emotion synthesis method based on the deep neural network model is further improved in that the neutral acoustic feature data and emotional acoustic feature data of the first speaker may alternatively be obtained by a method comprising the following steps:
acquiring neutral voice data and emotion voice data of a first speaker;
carrying out deep neural network model training by utilizing the neutral voice data of the first speaker to obtain a neutral voice synthesis model of the first speaker;
carrying out deep neural network model training by utilizing the emotion voice data of the first speaker to obtain an emotion voice synthesis model of the first speaker;
providing a certain number of sentence texts, and inputting the sentence texts respectively into the neutral voice synthesis model and the emotional voice synthesis model of the first speaker to obtain the corresponding neutral acoustic feature data and emotional acoustic feature data of the first speaker.
The emotion synthesis method based on the deep neural network model is further improved in that, after the neutral voice data of the second speaker is acquired, the neutral voice synthesis model of the second speaker is established by a method comprising the following steps:
and retraining the neutral voice synthesis model of the first speaker by using the neutral voice data of the second speaker to obtain the neutral voice synthesis model of the second speaker.
The emotion synthesis method based on the deep neural network model is further improved in that, after the neutral voice data of the second speaker is acquired, the neutral voice synthesis model of the second speaker may alternatively be established by a method comprising the following steps:
and carrying out deep neural network model training by using the neutral voice data of the second speaker to obtain a neutral voice synthesis model of the second speaker.
The emotion synthesis method based on the deep neural network model is further improved in that the emotion conversion model of the neutral acoustic feature data and emotional acoustic feature data of the first speaker is established with the deep neural network model by a method comprising the following steps:
taking neutral acoustic feature data of a first speaker as input data of the deep neural network model;
taking the emotional acoustic feature data of the first speaker as output data of the deep neural network model;
and training the deep neural network model to obtain an emotion conversion model of the neutral acoustic feature data and the emotion acoustic feature data of the first speaker.
The emotion synthesis method based on the deep neural network model is further improved in that the deep neural network model is trained to obtain the emotion conversion model of the neutral acoustic feature data and emotional acoustic feature data of the first speaker by a method comprising the following steps:
constructing a regression model with a neural network in the deep neural network model, wherein the hidden layer uses a sigmoid (S-shaped growth curve) excitation function and the output layer uses a linear excitation function;
taking randomized network parameters as the initial parameters, and carrying out model training based on the minimum mean square error criterion of formula 1;
L(y, z) = ||y - z||^2    (1)
wherein y is the emotional acoustic feature data, z is the emotional acoustic feature parameter predicted by the deep neural network model, and the training objective is to update the deep neural network model so that L(y, z) is minimized.
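Formula 1 is the standard squared-error loss; a minimal sketch (illustrative only, not part of the disclosed implementation):

```python
def mmse_loss(y, z):
    # L(y, z) = ||y - z||^2: squared Euclidean distance between the
    # target emotional acoustic features y and the network prediction z.
    return sum((yi - zi) ** 2 for yi, zi in zip(y, z))
```

Training updates the network weights so that this quantity, summed over the training data, is minimized.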
The emotion synthesis method based on the deep neural network model is further improved in that the neutral speech synthesis model of the second speaker is connected in series with the emotion conversion model to obtain the emotional speech synthesis model of the second speaker by a method comprising the following steps:
in the synthesis stage, analyzing the text to be synthesized with the synthesis front end to obtain corresponding text features, the text features comprising phoneme information, prosodic information, 0/1 coding information, and the relative position information of the current frame in the current phoneme;
using the phoneme information, prosody information and 0/1 coding information as the input of the deep neural network model to predict phoneme duration information;
using the phoneme information, prosody information, 0/1 coding information and the relative position information of the current frame in the current phoneme as the input of the deep neural network model to predict spectrum information, energy information and fundamental frequency information;
taking the predicted spectrum information, energy information and fundamental frequency information as acoustic parameters, and performing parameter generation on the acoustic features through formula 2 to obtain smoothed acoustic features;
C = (W^T U^-1 W)^-1 W^T U^-1 M    (2)
wherein W is a window-function matrix for calculating first-order and second-order differences, C is the acoustic features to be generated, M is the acoustic parameters predicted by the deep neural network model, and U is the global variance obtained statistically from the training sound library;
and synthesizing the emotional voice through the vocoder using the acoustic features C.
The emotion synthesis method based on the deep neural network model is further improved in that the neutral voice data comprises an acoustic feature sequence of the neutral speech and corresponding text data information, the acoustic feature sequence of the neutral speech comprising spectrum, energy, fundamental frequency and duration.
Due to the adoption of the technical scheme, the invention has the following beneficial effects:
the emotion synthesis method acquires the neutral acoustic feature data and emotional acoustic feature data of one speaker and uses a deep neural network model to establish the conversion relationship between that speaker's neutral and emotional acoustic features, so that corresponding emotion models can be obtained when only a small amount of neutral voice data of other speakers is input;
when obtaining the neutral and emotional acoustic feature data of a speaker, the synthesized acoustic features of the same sentences can be output by the speaker's neutral and emotional voice models, and the conversion relationship between the neutral and emotional acoustic features is established from this synthesized feature data; alternatively, neutral and emotional voice data of the speaker can be obtained by recording neutral sentences and emotional sentences with identical text content, and the neutral and emotional acoustic features are then extracted from these recordings to establish the conversion relationship;
by adopting the invention, the emotion model of any other person can be obtained based on the emotion model of one speaker, realized through that speaker's neutral-to-emotion conversion relationship model, with the advantages of small data volume, fast emotion model construction, low cost and the like.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The embodiments of the present invention are described below with reference to specific examples, and those skilled in the art will readily understand other advantages and effects of the present invention from the disclosure of this specification. The invention is capable of other and different embodiments and of being practiced or carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
It should be noted that the structures, ratios, sizes and the like shown in the drawings accompanying this specification are used only to match the content disclosed in the specification, for understanding and reading by those skilled in the art, and are not intended to limit the conditions under which the invention can be implemented; accordingly, they carry no essential technical significance, and any structural modification, change of ratio relationship or adjustment of size shall still fall within the scope of the technical content disclosed by the invention, provided the efficacy and attainable purpose of the invention are not affected. In addition, terms such as "upper", "lower", "left", "right", "middle" and "one" used in this specification are for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantive change of the technical content, shall also be regarded as within the implementable scope of the invention.
The invention aims to provide an emotion synthesis method based on a deep neural network model, solving the problems of long development time and excessive research-and-development cost caused by the huge data volume required when existing emotion models are generated, and aiming to quickly construct corresponding emotion models for a number of different speakers using only a small amount of neutral data.
First, referring to fig. 1, fig. 1 is a flowchart illustrating an operation of the emotion synthesis method based on the deep neural network model according to the present invention, and the emotion synthesis method based on the deep neural network model mainly includes the following steps and realizes the following functions:
S001: acquiring neutral acoustic feature data and emotional acoustic feature data of a first speaker (speaker A);
S002: establishing an emotion conversion model of the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A) using a deep neural network model;
S003: acquiring neutral voice data of a second speaker (speaker B), and establishing a neutral voice synthesis model of the second speaker (speaker B);
S004: connecting the neutral voice synthesis model of the second speaker (speaker B) and the emotion conversion model in series using the deep neural network model to obtain an emotional voice synthesis model of the second speaker (speaker B).
The emotion synthesis method based on the deep neural network model acquires the neutral acoustic feature data and emotional acoustic feature data of one speaker and uses a deep neural network model to establish the conversion relationship between that speaker's neutral and emotional acoustic features, so that corresponding emotion models can be obtained when only a small amount of neutral voice data of other speakers is input. When obtaining the neutral and emotional acoustic feature data of a speaker, the synthesized acoustic features of the same sentences can be output by the speaker's neutral and emotional voice models, and the conversion relationship between the neutral and emotional acoustic features is established from this synthesized feature data; alternatively, neutral and emotional voice data of the speaker can be obtained by recording neutral sentences and emotional sentences with identical text content, and the neutral and emotional acoustic features are then extracted from these recordings to establish the conversion relationship. The invention can therefore obtain the emotion model of any other person based on the emotion model of one speaker, realized through that speaker's neutral-to-emotion conversion relationship model, with the advantages of small data volume, fast emotion model construction, low cost and the like.
In view of the above step S001, the present invention provides two ways to obtain the neutral acoustic feature data and the emotion acoustic feature data of the first speaker (speaker a), specifically as follows:
the first method is as follows:
fig. 2 is a diagram showing data formation of a first embodiment of the emotion synthesis method based on a deep neural network model according to the present invention, and the diagram includes:
providing a certain number of sentence texts (for example 2000 sentences) of the first speaker (speaker A), the sentence texts comprising neutral sentence texts (for example 2000 sentences) and emotional sentence texts (for example 2000 sentences) with identical text content;
acquiring neutral voice data of the first speaker (speaker A) from the neutral sentence texts, for example by recording the neutral sentence texts;
acquiring emotional voice data of the first speaker (speaker A) from the emotional sentence texts, for example by recording the emotional sentence texts;
extracting neutral acoustic feature data of a first speaker (speaker A) from the acquired neutral voice data of the first speaker;
and extracting emotional acoustic feature data of the first speaker from the acquired emotional voice data of the first speaker (speaker A).
The second method comprises the following steps:
referring to fig. 3 again, fig. 3 is a diagram showing data formation of a second embodiment of the emotion synthesis method based on a deep neural network model according to the present invention, which includes:
acquiring neutral voice data of a first speaker (speaker A) and emotional voice data of the first speaker (speaker A), such as acquiring the neutral voice data of the first speaker (speaker A) and the emotional voice data of the first speaker (speaker A) by recording;
carrying out deep neural network (DNN) model training using the neutral voice data of the first speaker (speaker A) to obtain a neutral voice synthesis model of the first speaker (speaker A);
carrying out deep neural network (DNN) model training using the emotional voice data of the first speaker (speaker A) to obtain an emotional voice synthesis model of the first speaker (speaker A);
a certain number of sentence texts (for example, 5000 sentences) are provided, and the sentence texts are respectively input into a neutral speech synthesis model of a first speaker (speaker a) and an emotion speech synthesis model of the first speaker (speaker a), so that neutral acoustic feature data of the corresponding first speaker (speaker a) and emotion acoustic feature data of the first speaker (speaker a) are obtained.
Both of the above two modes can acquire the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A). The first mode is more direct: the neutral and emotional voice data of the first speaker (speaker A) are recorded directly from a certain number of sentence texts, and the corresponding neutral and emotional acoustic feature data are then extracted from them; however, the recorded sentence texts must include neutral sentence texts and emotional sentence texts with identical content. The second mode imposes no such requirement on the text content of the recordings: a certain number of arbitrary sentence texts are input respectively into the neutral and emotional speech synthesis models, so that the corresponding neutral and emotional acoustic feature data can be obtained from the two models, and the data so obtained have higher precision and are more closely matched.
In step S003, after acquiring the neutral speech data of the second speaker (speaker B), the invention can establish a neutral speech synthesis model of the second speaker (speaker B) in the following two ways, as shown in fig. 3, which specifically includes:
the first method is based on the method of acquiring the neutral acoustic feature data and the emotion acoustic feature data of the first speaker (speaker a) in the second method:
the neutral speech synthesis model of the first speaker (speaker A) is retrained using the recorded neutral speech data of the second speaker (speaker B) to obtain the neutral speech synthesis model of the second speaker (speaker B); this step is realized through model training of a deep neural network (DNN) model.
The second method is applicable to both of the above two methods of acquiring the neutral acoustic feature data and the emotion acoustic feature data of the first speaker (speaker a):
and carrying out deep neural network model (DNN) model training by utilizing the recorded neutral voice data of the second speaker (speaker B) to obtain a neutral voice synthesis model of the second speaker.
The step S002 is an innovative point of the emotion synthesis method based on the deep neural network model of the present invention, and the emotion conversion model corresponding to the acoustic conversion relationship between the two data is obtained by using the acquired neutral acoustic feature data and the emotion acoustic feature data of the first speaker (speaker a), and then using the deep neural network model (DNN). By using the emotion conversion model, an emotion model (namely, an emotion voice synthesis model, called an emotion model for short) of a corresponding speaker can be obtained based on a deep neural network model (DNN).
The emotion models to which the emotion synthesis method based on the deep neural network model is applicable include models of happiness, anger, sadness and other emotions.
The invention can obtain the emotion models of any other person based on the emotion model of a speaker, can be realized by using a neutral and emotion conversion relation model of the speaker, and has the advantages of small data volume, high speed of constructing the emotion model, low cost and the like.
The emotion synthesis method based on the deep neural network model outputs the synthesized acoustic features of the same sentence through the neutral and emotion voice models of one speaker, and establishes the conversion relation of the neutral and emotion acoustic features by using the synthesized acoustic feature data, so that the corresponding emotion models can be obtained by inputting a small amount of neutral voice data of other speakers.
Taking the happy emotion model (i.e. the emotion speech synthesis model of happy emotion) as an example, as shown in fig. 4, fig. 4 is a synthesis flow chart of happy emotion of the emotion synthesis method based on the deep neural network model of the present invention, and includes the following steps:
firstly, recording and acquiring the neutral voice data and the happy voice data of speaker A;
secondly, carrying out DNN (deep neural network) model training using the neutral voice data to obtain a neutral voice synthesis model of speaker A, as shown in FIG. 5, which is a schematic structural diagram of the neutral voice synthesis model of speaker A. The neutral voice data includes an acoustic feature sequence of the neutral speech and corresponding text data information, where the acoustic feature sequence of the neutral speech includes spectrum, energy, fundamental frequency and duration, specifically as follows:
the method comprises the following steps: acquiring input data:
the input data corresponds to the text features; specifically, the traditional phoneme, prosody and other information corresponding to the text is acquired and 0/1 coded to obtain 1114-dimensional binary digits; meanwhile, the relative position information of the current frame in the current phoneme (normalized between 0 and 1), including a forward position and a backward position, is added, totalling 2 dimensions; the 0/1 coding of the phoneme/prosody information together with the position information, 1116 dimensions in total, serves as the DNN network input;
step two: acquiring output data:
the acoustic features to be modeled are divided into two types: 1) spectrum, energy and fundamental frequency, comprising a 40-dimensional spectrum, 1-dimensional energy, 1-dimensional fundamental frequency and a 1-dimensional voiced/unvoiced flag of the fundamental frequency; frame expansion over the preceding 4 frames and the following 4 frames is considered for the fundamental frequency, and first-order and second-order difference information of the spectrum and energy parameters is considered, giving 133-dimensional spectrum and energy parameters in total; 2) duration, here the phoneme duration, i.e. the number of frames contained in the phoneme, 1 dimension;
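The dimension counts above can be verified arithmetically. The breakdown below is this editor's interpretation of the described features, stated as an assumption:

```python
# Output side: 40-dim spectrum + 1-dim energy, each with first- and
# second-order differences, plus F0 expanded over +/-4 frames and a
# voiced/unvoiced flag.
static = 40 + 1                  # spectrum + energy
with_deltas = static * 3         # static + delta + delta-delta = 123
f0_frames = 4 + 1 + 4            # preceding 4, current, following 4 = 9
vuv_flag = 1                     # voiced/unvoiced mark
output_dims = with_deltas + f0_frames + vuv_flag   # 133

# Input side: 1114-dim 0/1 phoneme/prosody coding + 2-dim frame position.
input_dims = 1114 + 2            # 1116
```

This is consistent with the 1116-dimensional network input and 133-dimensional output stated elsewhere in the embodiment.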
step three: training the DNN model:
here, a regression model is constructed using a classical BP (back propagation) neural network: the hidden layer uses a sigmoid (S-shaped growth curve) excitation function, the output layer uses a linear excitation function, the network parameters are first randomized as initial parameters, and model training is then performed based on the following MMSE (Minimum Mean Square Error) criterion:
L(y, z) = ||y - z||^2
where y is the natural target parameter, z is the parameter predicted by the DNN model, and the goal of training is to update the DNN network so that L(y, z) is minimized.
Here, the two types of acoustic features mentioned above are modeled separately:
1) spectrum, energy and fundamental frequency, 133 dimensions in total, with network structure 1116-1024-133; the obtained neutral speech synthesis model is denoted M_ANS;
2) duration, 1 dimension, where the network input does not consider the relative position information of the frame in the current phoneme, with network structure 1114-1024-1; the obtained neutral speech synthesis model is denoted M_AND.
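A forward pass through the 1116-1024-133 topology described above can be sketched as follows. This is a NumPy illustration with randomly initialized parameters (as the training procedure prescribes for initialization), not the trained model M_ANS itself:

```python
import numpy as np

def sigmoid(x):
    # S-shaped growth curve excitation function used in the hidden layer.
    return 1.0 / (1.0 + np.exp(-x))

class RegressionDNN:
    def __init__(self, n_in=1116, n_hidden=1024, n_out=133, seed=0):
        rng = np.random.default_rng(seed)
        # Randomized network parameters serve as the initial parameters.
        self.W1 = rng.standard_normal((n_in, n_hidden)) * 0.01
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.standard_normal((n_hidden, n_out)) * 0.01
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = sigmoid(x @ self.W1 + self.b1)  # sigmoid hidden layer
        return h @ self.W2 + self.b2        # linear excitation (output layer)
```

The duration model differs only in its topology (1114-1024-1), since its input omits the 2-dimensional frame-position information.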
Thirdly, carrying out DNN model training using the happy voice data to obtain a happy voice synthesis model of speaker A; the happy voice data comprises a feature sequence of the happy speech and corresponding text data information, the feature sequence comprising spectrum, energy, fundamental frequency and duration; the specific modeling is similar to that of the neutral voice synthesis model of speaker A, and the DNN models of the obtained emotional voice synthesis model of speaker A are denoted M_AES and M_AED.
(IV) providing any batch of a certain number of sentence texts (for example 5000 sentences) and inputting them respectively into the neutral voice synthesis model and the happy voice synthesis model of speaker A, correspondingly obtaining the neutral synthesized acoustic feature data and the happy synthesized acoustic feature data of A; an acoustic conversion relationship between the neutral and happy synthesized acoustic features of A is then constructed, the neutral-to-happy conversion relationship being obtained as an emotion conversion model using DNN, as shown in FIG. 6, which is a schematic structural diagram of the emotion conversion model of the emotion synthesis method based on the deep neural network model of the present invention; the specific contents are as follows:
the method comprises the following steps: acquiring input data:
corresponding neutral acoustic feature data is obtained from the input text using the neutral voice synthesis models of speaker A; specifically, the spectrum, energy and fundamental frequency features are obtained using the neutral voice synthesis model M_ANS, and the phoneme duration features are obtained using the neutral voice synthesis model M_AND;
step two: acquiring output data:
corresponding acoustic features are obtained from the input text using the emotional voice synthesis models of speaker A; specifically, the spectrum, energy and fundamental frequency features are obtained using the emotional voice synthesis model M_AES, and the phoneme duration features are obtained using the emotional voice synthesis model M_AED; these features serve as the target emotional acoustic feature parameters.
Step three: training the DNN model:
here, a BP (back propagation) neural network is used to construct a regression model (one kind of DNN model): the hidden layer uses a sigmoid excitation function, the output layer uses a linear excitation function, the network parameters are first randomized as initial parameters, and model training is then performed based on the following MMSE criterion:
L(y, z) = ||y - z||^2
where y is the target emotional acoustic feature parameter, z is the emotional acoustic feature parameter predicted by the DNN model, and the goal of training is to update the DNN network so that L(y, z) is minimized.
Here, the two types of acoustic features mentioned above are modeled separately:
1) spectrum, energy and fundamental frequency, 133 dimensions in total, with network structure 133-1024-133; the obtained model is denoted M_CS;
2) duration, 1 dimension in total, with network structure 1-1024-1; the obtained model is denoted M_CD.
The model M_CS and the model M_CD together constitute the emotion conversion model of the neutral acoustic feature data and emotional acoustic feature data of speaker A.
Then, neutral voice data of the speaker B is acquired.
Retraining the neutral voice synthesis model of speaker A using the neutral voice data of speaker B yields the neutral voice synthesis model of speaker B; alternatively, deep neural network (DNN) model training may be performed directly on the acquired neutral voice data of speaker B, likewise obtaining a neutral voice synthesis model of speaker B. The former scheme is adopted in this embodiment. The neutral speech data comprises a feature sequence of the neutral speech and corresponding text data information, the feature sequence comprising spectrum, energy, fundamental frequency and duration. The specific modeling is similar to that of the neutral voice synthesis model of speaker A, except that the network parameters are not randomized; instead, the neutral voice synthesis model of speaker A serves as the initial parameters. The DNN models of the obtained neutral voice synthesis model of speaker B are denoted M_BNS and M_BND.
The neutral voice synthesis models M_BNS and M_BND of speaker B are connected in series with the emotion conversion models M_CS and M_CD respectively, obtaining the emotional voice synthesis models M_BNS-M_CS and M_BND-M_CD of speaker B; the structure is shown in fig. 7, which is a schematic structural diagram of the emotional voice synthesis model of speaker B.
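The series connection amounts to function composition: speaker B's neutral synthesis model produces neutral acoustic features, which the conversion model then maps to emotional features. A toy sketch, where plain callables stand in for the trained model pairs M_BNS/M_CS and M_BND/M_CD:

```python
def series(neutral_model, conversion_model):
    # Feed the neutral model's output straight into the conversion model,
    # mirroring the M_BNS-M_CS and M_BND-M_CD chains described above.
    return lambda text_features: conversion_model(neutral_model(text_features))

# Toy stand-ins: "synthesis" doubles the value, "conversion" adds a bias.
emotional_spectrum_model = series(lambda t: t * 2, lambda f: f + 1)
```

In the real system two such chains run in parallel, one for the spectrum/energy/fundamental-frequency stream and one for the duration stream.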
In the synthesis stage, for a text to be synthesized, analyzing the text by using a synthesis front end to obtain corresponding text characteristics, specifically, obtaining information such as traditional phonemes, prosody and the like corresponding to the text, and performing 0\1 coding to obtain 1114-dimensional binary digits; meanwhile, adding relative position information (normalized between 0 and 1) of the current frame in the current phoneme, including a forward position and a backward position, and sharing 2 dimensions; the coding and position information of 0\1 of the phoneme \ rhythm and the like are 1116-dimension as DNN network input;
the prediction steps are as follows:
1. predicting the phoneme duration information: the network input does not consider the relative position information of the frame in the current phoneme; the 1114-dimensional 0/1 coding of the phoneme/prosody information is used as input to predict the phoneme duration;
2. predicting the spectrum, energy and fundamental frequency information: the 1116-dimensional information obtained by the front-end analysis is used as input to predict the spectrum, energy and fundamental frequency information, 133 dimensions in total;
3. for the predicted acoustic parameters, performing parameter generation through the following formula to obtain smoothed acoustic parameters:
C = (W^T U^-1 W)^-1 W^T U^-1 M
where W is a window-function matrix for calculating first-order and second-order differences, C is the acoustic features to be generated, M is the acoustic features predicted by the DNN network, and U is the global variance obtained statistically from the training sound library;
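The parameter-generation step solves the linear system implied by the formula above. A minimal NumPy sketch under the simplifying assumption (this editor's, not the disclosure's) that U is diagonal:

```python
import numpy as np

def generate_smooth_features(M, W, u_diag):
    # Solve (W^T U^-1 W) C = W^T U^-1 M for the smoothed trajectory C,
    # where W stacks the static/delta window functions and u_diag holds
    # the per-dimension global variances (diagonal U assumed).
    u_inv = np.diag(1.0 / np.asarray(u_diag, dtype=float))
    A = W.T @ u_inv @ W
    b = W.T @ u_inv @ np.asarray(M, dtype=float)
    return np.linalg.solve(A, b)
```

As a sanity check, with W equal to the identity (no delta constraints) and unit variances, the generated trajectory C simply equals the predicted parameters M; the smoothing effect comes from the delta rows of W coupling adjacent frames.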
4. synthesizing the voice through the vocoder using the acoustic features C, thereby obtaining the emotional voice of speaker B.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.