CN106531150B - Emotion synthesis method based on deep neural network model - Google Patents


Info

Publication number
CN106531150B
Authority
CN
China
Prior art keywords: speaker, model, neutral, emotion, neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611201686.6A
Other languages
Chinese (zh)
Other versions
CN106531150A (en)
Inventor
王鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Shanghai Intelligent Technology Co Ltd
Original Assignee
Unisound Shanghai Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Shanghai Intelligent Technology Co Ltd filed Critical Unisound Shanghai Intelligent Technology Co Ltd
Priority to CN201611201686.6A
Publication of CN106531150A
Application granted
Publication of CN106531150B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an emotion synthesis method based on a deep neural network model, comprising the following steps: acquiring neutral acoustic feature data and emotional acoustic feature data of a first speaker; establishing, with a deep neural network model, an emotion conversion model between the neutral and emotional acoustic feature data of the first speaker; acquiring neutral speech data of a second speaker and establishing a neutral speech synthesis model of the second speaker; and connecting the neutral speech synthesis model of the second speaker in series with the emotion conversion model, using the deep neural network model, to obtain an emotion speech synthesis model of the second speaker. Based on the emotion model of one speaker, the invention can derive an emotion model for any other speaker by reusing that speaker's neutral-to-emotion conversion model; it therefore requires little data, builds emotion models quickly, and keeps costs low.

Description

Emotion synthesis method based on deep neural network model
Technical Field
The invention relates to the field of voice recognition, in particular to an emotion synthesis method based on a deep neural network model.
Background
Speech synthesis, also known as text-to-speech (TTS) technology, can convert text information into speech and read it aloud. It involves acoustics, linguistics, digital signal processing, computer science and other disciplines, and is a leading-edge technology in the field of Chinese information processing; the main problem it solves is how to convert textual information into audible sound information.
Most speech synthesis systems are built on speech recorded in a neutral reading style. To address the monotony of neutral speech, an emotion model is introduced into the speech synthesis system so that the synthesized speech carries emotional characteristics and sounds more natural. To personalize a speech synthesis system, an acoustic model must be generated adaptively for each speaker, which requires recording a large amount of that speaker's speech data together with the corresponding text annotations for model training. Once an emotion model is added, a large amount of speech data in different emotions, again with corresponding text annotations, must also be recorded for emotion model training. With many different speakers, the required data volume becomes enormous, leading to long development time and high development cost.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an emotion synthesis method based on a deep neural network model that overcomes the long development time and excessive development cost caused by the huge data volume required to generate existing emotion models, the aim being to quickly construct corresponding emotion models for multiple different speakers from a small amount of neutral data.
To achieve this technical effect, the invention discloses an emotion synthesis method based on a deep neural network model, which comprises the following steps (an illustrative sketch of the overall flow is given after the list):
acquiring neutral acoustic feature data and emotional acoustic feature data of a first speaker;
establishing an emotion conversion model of the neutral acoustic feature data and the emotion acoustic feature data of the first speaker by using a deep neural network model;
acquiring neutral voice data of a second speaker, and establishing a neutral voice synthesis model of the second speaker; and
connecting the neutral voice synthesis model of the second speaker in series with the emotion conversion model by using a deep neural network model, to obtain an emotion voice synthesis model of the second speaker.
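As an illustrative aid only (not part of the claimed method), the following Python/NumPy sketch mirrors the four steps, with ordinary least squares standing in for deep neural network training; all data, dimensions and the fit helper are dummies chosen for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    fit = lambda X, Y: np.linalg.lstsq(X, Y, rcond=None)[0]  # linear stand-in for DNN training

    # Step 1: paired neutral / emotional acoustic features of the first speaker (dummy data).
    a_neutral = rng.standard_normal((200, 133))
    a_emotion = rng.standard_normal((200, 133))
    # Step 2: emotion conversion model (neutral acoustics -> emotional acoustics).
    conversion = fit(a_neutral, a_emotion)
    # Step 3: neutral synthesis model of the second speaker (text features -> neutral acoustics).
    b_text = rng.standard_normal((200, 1116))
    b_neutral = rng.standard_normal((200, 133))
    b_synthesis = fit(b_text, b_neutral)
    # Step 4: series connection = emotion synthesis model of the second speaker.
    emotion_synthesis_b = lambda text: (text @ b_synthesis) @ conversion
    emotional_features = emotion_synthesis_b(rng.standard_normal((10, 1116)))  # shape (10, 133)

The essential point is the composition in the last step: the second speaker's neutral synthesis model feeds the first speaker's conversion model.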
In a further refinement of the emotion synthesis method based on the deep neural network model, the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained by the following method, which comprises:
providing a certain amount of sentence texts of a first speaker, wherein the sentence texts comprise neutral sentence texts and emotion sentence texts with consistent text contents;
acquiring neutral voice data of a first speaker from the neutral sentence text; obtaining emotion voice data of a first speaker from the emotion statement text;
extracting neutral acoustic feature data of a first speaker from the neutral speech data;
and extracting emotional acoustic feature data of the first speaker from the emotional voice data.
In another refinement of the emotion synthesis method based on the deep neural network model, the neutral acoustic feature data and emotional acoustic feature data of the first speaker are obtained by the following method, which comprises:
acquiring neutral voice data and emotion voice data of a first speaker;
carrying out deep neural network model training by utilizing the neutral voice data of the first speaker to obtain a neutral voice synthesis model of the first speaker;
carrying out deep neural network model training by utilizing the emotion voice data of the first speaker to obtain an emotion voice synthesis model of the first speaker;
providing a certain amount of sentence texts, and respectively inputting the sentence texts into the neutral voice synthesis model and the emotion voice synthesis model of the first speaker to obtain corresponding neutral acoustic feature data and emotion acoustic feature data of the first speaker.
In a further refinement of the emotion synthesis method based on the deep neural network model, after the neutral voice data of the second speaker is acquired, the neutral voice synthesis model of the second speaker is established by the following method, which comprises:
and retraining the neutral voice synthesis model of the first speaker by using the neutral voice data of the second speaker to obtain the neutral voice synthesis model of the second speaker.
In an alternative refinement of the emotion synthesis method based on the deep neural network model, after the neutral voice data of the second speaker is acquired, the neutral voice synthesis model of the second speaker is established by the following method, which comprises:
and carrying out deep neural network model training by using the neutral voice data of the second speaker to obtain a neutral voice synthesis model of the second speaker.
In a further refinement of the emotion synthesis method based on the deep neural network model, the emotion conversion model of the neutral acoustic feature data and the emotional acoustic feature data of the first speaker is established with the deep neural network model by the following method, which comprises:
taking neutral acoustic feature data of a first speaker as input data of the deep neural network model;
taking the emotional acoustic feature data of the first speaker as output data of the deep neural network model;
and training the deep neural network model to obtain an emotion conversion model of the neutral acoustic feature data and the emotion acoustic feature data of the first speaker.
In a further refinement of the emotion synthesis method based on the deep neural network model, the deep neural network model is trained by the following method to obtain the emotion conversion model of the neutral acoustic feature data and the emotional acoustic feature data of the first speaker, comprising:
constructing a regression model with a neural network in the deep neural network model, using a sigmoid (S-shaped curve) activation function for the hidden layer and a linear activation function for the output layer;
taking randomized network parameters as initial parameters, and training the model based on the minimum mean square error criterion of formula 1:
L(y, z) = ||y - z||²    (1)
wherein y is the emotional acoustic feature data, z is the emotional acoustic feature parameter predicted by the deep neural network model, and the training objective is to update the deep neural network model so that L(y, z) is minimized.
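For illustration, a single gradient step of this minimum mean square error training can be sketched in NumPy as follows; the batch data, learning rate and 133-1024-133 layer sizes (taken from the detailed embodiment later in this description) are assumptions of the example, not part of the method definition.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out, lr = 133, 1024, 133, 1e-4

    W1 = rng.standard_normal((n_in, n_hidden)) * 0.01   # randomized initial parameters
    b1 = np.zeros(n_hidden)
    W2 = rng.standard_normal((n_hidden, n_out)) * 0.01
    b2 = np.zeros(n_out)

    x = rng.standard_normal((64, n_in))    # neutral acoustic features of the first speaker (input)
    y = rng.standard_normal((64, n_out))   # emotional acoustic features of the first speaker (target)

    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # sigmoid hidden layer
    z = h @ W2 + b2                            # linear output layer: predicted emotional features
    loss = np.sum((y - z) ** 2)                # L(y, z) = ||y - z||^2

    # Back-propagate the squared error and take one gradient-descent step.
    dz = 2.0 * (z - y)
    dW2 = h.T @ dz
    db2 = dz.sum(axis=0)
    dh = (dz @ W2.T) * h * (1.0 - h)
    dW1 = x.T @ dh
    db1 = dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2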
In a further refinement of the emotion synthesis method based on the deep neural network model, the neutral voice synthesis model of the second speaker is connected in series with the emotion conversion model to obtain the emotion voice synthesis model of the second speaker by the following method, which comprises:
in the synthesis stage, analyzing a text to be synthesized by using a synthesis front end to obtain corresponding text characteristics, wherein the text characteristics comprise phoneme information, prosodic information, 0/1 encoding information and relative position information of a current frame in a current phoneme;
using the phoneme information, prosody information and 0/1 coding information as the input of a deep neural network model, and predicting phoneme duration information;
using the phoneme information, prosody information, 0/1 coding information and the relative position information of the current frame in the current phoneme as the input of a deep neural network model, and predicting frequency spectrum information, energy information and fundamental frequency information;
taking the predicted frequency spectrum information, the predicted energy information and the predicted fundamental frequency information as acoustic parameters, and performing parameter generation on them through formula 2 to obtain smoothed acoustic features;
C = (WᵀU⁻¹W)⁻¹WᵀU⁻¹M    (2)
wherein W is a window function matrix for computing first-order and second-order differences, C is the acoustic feature to be generated, M is the acoustic parameter predicted by the deep neural network model, and U is the global variance obtained through statistics from the training sound library;
and synthesizing the emotional speech through a vocoder using the acoustic feature C.
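A compact sketch of this synthesis-stage flow is given below; the linear "models" are placeholders for the trained networks, the 1114-, 1116- and 133-dimensional feature sizes are those of the detailed embodiment described later, and parameter generation and the vocoder are only indicated in a comment.

    import numpy as np

    rng = np.random.default_rng(0)
    W_dur = rng.standard_normal((1114, 1)) * 0.01    # stand-in duration model (phoneme-level input)
    W_ac  = rng.standard_normal((1116, 133)) * 0.01  # stand-in acoustic model (frame-level input)

    phones = (rng.random((5, 1114)) > 0.5).astype(float)                    # 5 phonemes, 0/1-coded features
    frames_per_phone = np.maximum(1, np.round(phones @ W_dur)).astype(int).ravel()

    acoustic_frames = []
    for p, n in zip(phones, frames_per_phone):
        pos_fwd = np.arange(n) / n                                          # forward position in [0, 1)
        feats = np.column_stack([np.tile(p, (n, 1)), pos_fwd, 1.0 - pos_fwd])  # 1116-dim frame input
        acoustic_frames.append(feats @ W_ac)                                # spectrum/energy/F0 per frame
    acoustic = np.vstack(acoustic_frames)  # next: parameter generation (formula 2), then the vocoder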
In a further refinement of the emotion synthesis method based on the deep neural network model, the neutral voice data comprises an acoustic feature sequence of the neutral voice and the corresponding text data information, the acoustic feature sequence of the neutral voice comprising spectrum, energy, fundamental frequency and duration.
Due to the adoption of the technical scheme, the invention has the following beneficial effects:
the emotion synthesis method comprises the steps of acquiring neutral acoustic feature data and emotion acoustic feature data of a speaker, and establishing a conversion relation between the neutral and emotion acoustic features of the speaker by using a deep neural network model, so that a corresponding emotion model can be obtained under the condition that a small amount of neutral voice data of other speakers are input;
when neutral acoustic feature data and emotional acoustic feature data of a speaker are obtained, synthesized acoustic features of the same sentence can be output by using neutral and emotional voice models of the speaker, and a conversion relation between the neutral and emotional acoustic features is established by using the synthesized acoustic feature data; neutral voice data and emotional voice data of a speaker can be obtained by recording neutral sentences and emotional sentences with consistent text contents, and then the synthetic acoustic features of the neutral sentences and the emotional sentences are extracted from the neutral voice data and the emotional voice data to establish the conversion relation of the neutral acoustic features and the emotional acoustic features;
by adopting the invention, the emotion model of any other person can be obtained based on the emotion model of a speaker, and the method can be realized by utilizing the neutral and emotion conversion relation model of the speaker, and has the advantages of small data volume, high speed of constructing the emotion model, low cost and the like.
Drawings
FIG. 1 is an operation flow chart of an emotion synthesis method based on a deep neural network model according to the present invention.
FIG. 2 is a diagram showing data formation of a first embodiment of an emotion synthesis method based on a deep neural network model according to the present invention.
FIG. 3 is a diagram showing data formation of a second embodiment of the emotion synthesis method based on a deep neural network model according to the present invention.
FIG. 4 is a flow chart of the synthesis of happy emotion of the emotion synthesis method based on the deep neural network model.
FIG. 5 is a schematic structural diagram of a first speaker's neutral speech synthesis model of the emotion synthesis method based on a deep neural network model according to the present invention.
FIG. 6 is a schematic structural diagram of an emotion conversion model of the emotion synthesis method based on the deep neural network model.
FIG. 7 is a schematic structural diagram of an emotion speech synthesis model of a second speaker in the emotion synthesis method based on a deep neural network model according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
It should be noted that the structures, ratios, sizes and the like shown in the drawings of this specification are only used to match the content disclosed in the specification, so that it can be understood and read by those skilled in the art, and are not intended to limit the conditions under which the invention can be implemented; they therefore have no technical significance in themselves, and any structural modification, change of proportional relationship or adjustment of size that does not affect the efficacy or the achievable purpose of the invention shall still fall within the scope covered by the technical content disclosed by the invention. In addition, terms such as "upper", "lower", "left", "right", "middle" and "one" used in this specification are for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantive changes to the technical content, are also to be regarded as within the implementable scope of the invention.
The invention aims to provide an emotion synthesis method based on a deep neural network model, solves the problems that the development time is long and the research and development cost is overhigh due to the huge data volume when the existing emotion model is generated, and aims to quickly construct a corresponding emotion model by using a small amount of neutral data for a plurality of different speakers.
First, referring to fig. 1, fig. 1 is a flowchart illustrating an operation of the emotion synthesis method based on the deep neural network model according to the present invention, and the emotion synthesis method based on the deep neural network model mainly includes the following steps and realizes the following functions:
S001: acquiring neutral acoustic feature data and emotional acoustic feature data of a first speaker (speaker A);
S002: establishing an emotion conversion model of the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A) by using a deep neural network model;
S003: acquiring neutral voice data of a second speaker (speaker B), and establishing a neutral voice synthesis model of the second speaker (speaker B);
S004: connecting the neutral voice synthesis model of the second speaker (speaker B) and the emotion conversion model in series by using the deep neural network model to obtain an emotion voice synthesis model of the second speaker (speaker B).
The emotion synthesis method based on the deep neural network model acquires the neutral acoustic feature data and the emotional acoustic feature data of one speaker and uses a deep neural network model to establish the conversion relationship between that speaker's neutral and emotional acoustic features, so that a corresponding emotion model can be obtained with only a small amount of neutral speech data from another speaker. When acquiring the neutral and emotional acoustic feature data of the speaker, the synthesized acoustic features of the same sentences can be output with the speaker's neutral and emotion speech models, and the conversion relationship is established from these synthesized features; alternatively, neutral and emotional speech data can be obtained by recording neutral sentences and emotion sentences with identical text content, and the neutral and emotional synthesized acoustic features are then extracted from them to establish the conversion relationship. The invention can therefore obtain an emotion model of any other person from the emotion model of a single speaker, implemented through that speaker's neutral-to-emotion conversion model, with the advantages of a small data volume, fast emotion-model construction and low cost.
In view of the above step S001, the present invention provides two ways to obtain the neutral acoustic feature data and the emotion acoustic feature data of the first speaker (speaker a), specifically as follows:
the first method is as follows:
fig. 2 is a diagram showing data formation of a first embodiment of the emotion synthesis method based on a deep neural network model according to the present invention, and the diagram includes:
providing a certain amount of sentence texts (such as 2000 sentences) of a first speaker (speaker A), wherein the sentence texts comprise neutral sentence texts (such as 2000 sentences) and emotion sentence texts (such as 2000 sentences) with consistent text contents;
acquiring neutral voice data of the first speaker (speaker A) from the neutral sentence texts, for example by recording the neutral sentence texts;
acquiring emotional voice data of the first speaker (speaker A) from the emotion sentence texts, for example by recording the emotion sentence texts;
extracting neutral acoustic feature data of a first speaker (speaker A) from the acquired neutral voice data of the first speaker;
and extracting emotional acoustic feature data of the first speaker from the acquired emotional voice data of the first speaker (speaker A).
The second method comprises the following steps:
Referring to FIG. 3, which shows the data formation of the second embodiment of the emotion synthesis method based on a deep neural network model according to the present invention, the method includes:
acquiring neutral voice data of a first speaker (speaker A) and emotional voice data of the first speaker (speaker A), such as acquiring the neutral voice data of the first speaker (speaker A) and the emotional voice data of the first speaker (speaker A) by recording;
performing deep neural network (DNN) model training with the neutral voice data of the first speaker (speaker A) to obtain a neutral voice synthesis model of the first speaker (speaker A);
performing deep neural network (DNN) model training with the emotional voice data of the first speaker (speaker A) to obtain an emotional voice synthesis model of the first speaker (speaker A);
a certain number of sentence texts (for example, 5000 sentences) are provided and input respectively into the neutral speech synthesis model of the first speaker (speaker A) and the emotion speech synthesis model of the first speaker (speaker A), so that the corresponding neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A) are obtained.
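A minimal sketch of this second mode, with placeholder linear mappings standing in for the trained synthesis models of speaker A and with frame-alignment details omitted, is:

    import numpy as np

    rng = np.random.default_rng(0)
    W_neutral = rng.standard_normal((1116, 133)) * 0.01  # stand-in for speaker A's neutral synthesis model
    W_emotion = rng.standard_normal((1116, 133)) * 0.01  # stand-in for speaker A's emotion synthesis model

    text_feats = rng.standard_normal((5000, 1116))       # features of the arbitrary sentence texts
    neutral_feats = text_feats @ W_neutral               # conversion-model inputs
    emotion_feats = text_feats @ W_emotion               # conversion-model targets (same sentences)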
Both modes acquire the neutral acoustic feature data and the emotional acoustic feature data of the first speaker (speaker A). The first mode is more direct: the neutral and emotional speech data of the first speaker (speaker A) are acquired directly by recording a certain number of sentence texts, and the corresponding neutral and emotional acoustic feature data are then extracted from them; however, the recorded sentence texts must include neutral sentence texts and emotion sentence texts with identical text content. The second mode imposes no such requirement on the recorded text content: a certain number of arbitrary sentence texts are simply input into the neutral speech synthesis model and the emotion speech synthesis model respectively, so that the corresponding neutral and emotional acoustic feature data are obtained from the two models; the data obtained in this way are more precise and more closely matched.
In step S003, after acquiring the neutral speech data of the second speaker (speaker B), the invention can establish a neutral speech synthesis model of the second speaker (speaker B) in the following two ways, as shown in fig. 3, which specifically includes:
the first method is based on the method of acquiring the neutral acoustic feature data and the emotion acoustic feature data of the first speaker (speaker a) in the second method:
the neutral speech synthesis model of the first speaker (speaker a) is retrained (retain) using the recorded neutral speech data of the second speaker (speaker B) to obtain the neutral speech synthesis model of the second speaker (speaker B), and the step is realized based on model training of a deep neural network model (DNN).
The second way is applicable to both of the above methods of acquiring the neutral acoustic feature data and emotional acoustic feature data of the first speaker (speaker A):
deep neural network (DNN) model training is carried out with the recorded neutral voice data of the second speaker (speaker B) to obtain the neutral voice synthesis model of the second speaker.
Step S002 is the key innovation of the emotion synthesis method based on the deep neural network model of the present invention: from the acquired neutral and emotional acoustic feature data of the first speaker (speaker A), a deep neural network model (DNN) is used to learn the emotion conversion model that captures the acoustic conversion relationship between the two kinds of data. With this emotion conversion model, an emotion model (that is, an emotion speech synthesis model, emotion model for short) of a corresponding speaker can then be obtained based on the DNN.
The emotion models applicable to the emotion synthesis method based on the deep neural network model include happiness, anger, sadness and other emotion models.
Based on the emotion model of one speaker, the invention can thus obtain an emotion model of any other person, implemented through that speaker's neutral-to-emotion conversion model, with the advantages of a small data volume, fast emotion-model construction, low cost and the like.
The emotion synthesis method based on the deep neural network model outputs the synthesized acoustic features of the same sentence through the neutral and emotion voice models of one speaker, and establishes the conversion relation of the neutral and emotion acoustic features by using the synthesized acoustic feature data, so that the corresponding emotion models can be obtained by inputting a small amount of neutral voice data of other speakers.
Taking the happy emotion model (i.e., the emotion speech synthesis model for the happy emotion) as an example, FIG. 4 shows the synthesis flow of the happy emotion in the emotion synthesis method based on the deep neural network model of the present invention, which includes the following steps:
(I) Recording and acquiring the neutral voice data and the happy voice data of speaker A.
(II) DNN (deep neural network) model training is performed with the neutral voice data to obtain the neutral voice synthesis model of speaker A; as shown in FIG. 5, FIG. 5 is a schematic structural diagram of the neutral voice synthesis model of speaker A. The neutral voice data includes the acoustic feature sequence of the neutral speech and the corresponding text data information, where the acoustic feature sequence includes spectrum, energy, fundamental frequency and duration. Specifically:
Step one: acquiring input data:
The input data correspond to the text features. Specifically, conventional phoneme and prosody information corresponding to the text is acquired and 0/1-coded, giving a 1114-dimensional binary vector; in addition, the relative position information (normalized to between 0 and 1) of the current frame within the current phoneme is added, comprising a forward position and a backward position, 2 dimensions in total. The 0/1 coding of the phoneme/prosody information together with the position information, 1116 dimensions in total, serves as the DNN input.
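As a worked check of the input dimensionality just described (the feature values below are dummies):

    import numpy as np

    rng = np.random.default_rng(0)
    phoneme_prosody = (rng.random(1114) > 0.5).astype(float)  # 1114-dim 0/1 coding of phoneme/prosody info
    fwd_pos, bwd_pos = 0.25, 0.75                             # relative frame position, normalized to [0, 1]
    dnn_input = np.concatenate([phoneme_prosody, [fwd_pos, bwd_pos]])
    assert dnn_input.shape == (1116,)                         # 1114 + 2 = 1116 dimensions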
Step two: acquiring output data:
The output data are the acoustic features, which are modeled in two groups: 1) spectrum, energy and fundamental frequency: a 40-dimensional spectrum, 1-dimensional energy, a 1-dimensional fundamental frequency and a 1-dimensional voiced/unvoiced flag for the fundamental frequency; frame expansion over the preceding 4 frames and the following 4 frames is applied to the fundamental frequency, and first-order and second-order difference information is added for the spectrum and energy parameters, giving 133-dimensional spectrum/energy/fundamental-frequency parameters in total; 2) duration: the phoneme duration, i.e. the number of frames contained in the phoneme, 1 dimension.
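One consistent reading of the output dimensionality described above is made explicit below; the grouping into statics plus first- and second-order differences follows the text, and the variable names exist only for this check.

    # Spectrum/energy/F0 stream of the output layer:
    spectrum, energy = 40, 1
    with_deltas = (spectrum + energy) * 3      # statics + first- and second-order differences = 123
    f0_with_context = 1 * 9                    # fundamental frequency over current frame +/- 4 frames = 9
    voicing_flag = 1                           # voiced/unvoiced mark of the fundamental frequency
    assert with_deltas + f0_with_context + voicing_flag == 133
    # Duration stream of the output layer:
    duration_dims = 1                          # number of frames contained in the phoneme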
Step three: training the DNN model:
A regression model is constructed with a classical BP (back propagation) neural network; the hidden layer uses a sigmoid (S-shaped curve) activation function and the output layer uses a linear activation function. The network parameters are first randomized as initial parameters, and model training is then performed based on the following MMSE (minimum mean square error) criterion:
L(y, z) = ||y - z||²
where y is the natural target parameter, z is the parameter predicted by the DNN model, and the goal of training is to update the DNN network so that L(y, z) is minimized.
Here, the two types of acoustic features mentioned above are modeled separately:
1) Spectrum, energy and fundamental frequency, 133 dimensions in total; the network structure is 1116-1024-133, and the resulting neutral speech synthesis model is denoted M_ANS.
2) Duration, 1 dimension; here the network input does not include the relative position of the frame within the current phoneme, the network structure is 1114-1024-1, and the resulting neutral speech synthesis model is denoted M_AND.
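A minimal construction of two networks with these shapes (sigmoid hidden layer, linear output layer, random stand-in weights, no training loop) might look like this; the helper name make_dnn is hypothetical.

    import numpy as np

    def make_dnn(n_in, n_hidden, n_out, rng):
        W1 = rng.standard_normal((n_in, n_hidden)) * 0.01
        b1 = np.zeros(n_hidden)
        W2 = rng.standard_normal((n_hidden, n_out)) * 0.01
        b2 = np.zeros(n_out)
        def forward(x):
            h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))  # sigmoid hidden layer
            return h @ W2 + b2                        # linear output layer
        return forward

    rng = np.random.default_rng(0)
    m_ans = make_dnn(1116, 1024, 133, rng)  # spectrum/energy/F0 model of speaker A, neutral (M_ANS)
    m_and = make_dnn(1114, 1024, 1, rng)    # duration model of speaker A, neutral (M_AND)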
(III) DNN model training is performed with the happy speech data to obtain the happy speech synthesis model of speaker A. The happy speech data comprises the feature sequence of the happy speech and the corresponding text data information, the feature sequence comprising spectrum, energy, fundamental frequency and duration. The modeling procedure is the same as for the neutral speech synthesis model of speaker A, and the DNN models of the resulting emotion speech synthesis model of speaker A are denoted M_AES and M_AED.
(IV) A batch of arbitrary sentence texts of a certain size (for example, 5000 sentences) is provided and input into the neutral speech synthesis model of speaker A and the happy speech synthesis model of speaker A respectively, yielding the corresponding neutral synthesized acoustic feature data and happy synthesized acoustic feature data of A. An acoustic conversion relationship between A's neutral synthesized acoustic features and A's happy synthesized acoustic features is then constructed; this neutral-to-happy conversion relationship is learned with a DNN to obtain the emotion conversion model, as shown in FIG. 6, which is a schematic structural diagram of the emotion conversion model of the emotion synthesis method based on the deep neural network model of the present invention. The details are as follows:
Step one: acquiring input data:
The corresponding neutral acoustic feature data are obtained from the input text using the neutral speech synthesis models of speaker A: specifically, the spectrum, energy and fundamental frequency features are obtained with the neutral speech synthesis model M_ANS of speaker A, and the phoneme duration features are obtained with the neutral speech synthesis model M_AND of speaker A;
Step two: acquiring output data:
The corresponding emotional acoustic features are obtained from the same input text using the emotion speech synthesis models of speaker A: specifically, the spectrum, energy and fundamental frequency features are obtained with the emotion speech synthesis model M_AES of speaker A, and the phoneme duration features are obtained with the emotion speech synthesis model M_AED of speaker A. These two groups of features serve as the target emotional acoustic feature parameters.
Step three: training the DNN model:
A regression model (one kind of DNN model) is constructed with a BP (back propagation) neural network; the hidden layer uses a sigmoid activation function and the output layer uses a linear activation function. The network parameters are first randomized as initial parameters, and model training is then performed based on the following MMSE criterion:
L(y, z) = ||y - z||²
where y is the target emotional acoustic feature parameter, z is the emotional acoustic feature parameter predicted by the DNN model, and the goal of training is to update the DNN network so that L(y, z) is minimized.
Here, the two types of acoustic features mentioned above are modeled separately:
1) Spectrum, energy and fundamental frequency, 133 dimensions in total; the network structure is 133-1024-133, and the resulting model is denoted M_CS.
2) Duration, 1 dimension in total; the network structure is 1-1024-1, and the resulting model is denoted M_CD.
The models M_CS and M_CD together constitute the emotion conversion model between the neutral acoustic feature data and the emotional acoustic feature data of speaker A.
Then, neutral voice data of the speaker B is acquired.
The neutral speech synthesis model of speaker A is retrained with the neutral speech data of speaker B to obtain the neutral speech synthesis model of speaker B; alternatively, deep neural network (DNN) model training may be performed directly on the acquired neutral speech data of speaker B, which also yields a neutral speech synthesis model of speaker B. The former scheme is adopted in this embodiment. The neutral speech data comprises the feature sequence of the neutral speech and the corresponding text data information, the feature sequence comprising spectrum, energy, fundamental frequency and duration. The modeling procedure is the same as for the neutral speech synthesis model of speaker A, except that the network parameters are not randomized: the neutral speech synthesis model of speaker A is used as the initial parameters. The DNN models of the resulting neutral speech synthesis model of speaker B are denoted M_BNS and M_BND.
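The retraining step differs from training from scratch only in its initialization, as the short sketch below indicates; the weight matrices are placeholders, and the subsequent updates are the same MMSE gradient updates sketched earlier.

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-ins for speaker A's trained neutral-model parameters.
    a_W1 = rng.standard_normal((1116, 1024)) * 0.01
    a_W2 = rng.standard_normal((1024, 133)) * 0.01
    # Warm start for speaker B: copy A's parameters instead of re-randomizing them,
    # then continue MMSE gradient training on speaker B's (small) neutral data set.
    b_W1, b_W2 = a_W1.copy(), a_W2.copy()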
The neutral speech synthesis models M_BNS and M_BND of speaker B are then connected in series with the emotion conversion models M_CS and M_CD respectively, giving the emotion speech synthesis models of speaker B, M_BNS-M_CS and M_BND-M_CD. The structure is shown in FIG. 7, a schematic structural diagram of the emotion speech synthesis model of speaker B.
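The series connection can be read as plain function composition along two parallel paths, one for the frame-level spectrum/energy/fundamental-frequency features and one for the phoneme durations; the linear mappings below are placeholders whose shapes follow the network structures given above.

    import numpy as np

    rng = np.random.default_rng(0)
    W_bns = rng.standard_normal((1116, 133)) * 0.01  # stand-in for M_BNS (B, neutral, spectrum/energy/F0)
    W_cs  = rng.standard_normal((133, 133)) * 0.01   # stand-in for M_CS  (conversion, spectrum/energy/F0)
    W_bnd = rng.standard_normal((1114, 1)) * 0.01    # stand-in for M_BND (B, neutral, duration)
    W_cd  = rng.standard_normal((1, 1)) * 0.01       # stand-in for M_CD  (conversion, duration)

    def emotional_acoustics(frame_feats):   # M_BNS followed by M_CS
        return (frame_feats @ W_bns) @ W_cs

    def emotional_duration(phone_feats):    # M_BND followed by M_CD
        return (phone_feats @ W_bnd) @ W_cd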
In the synthesis stage, the text to be synthesized is analyzed by the synthesis front end to obtain the corresponding text features: specifically, the conventional phoneme and prosody information corresponding to the text is acquired and 0/1-coded into a 1114-dimensional binary vector; the relative position information (normalized to between 0 and 1) of the current frame within the current phoneme, comprising a forward position and a backward position (2 dimensions), is added; and the 0/1 coding of the phoneme/prosody information together with the position information, 1116 dimensions in total, serves as the DNN input.
the prediction steps are as follows:
1. Predict the phoneme duration: the network input here does not include the relative position of the frame within the current phoneme; the 1114-dimensional 0/1 coding of the phoneme/prosody information is used as input and the phoneme duration is predicted.
2. Predict the spectrum, energy and fundamental frequency: the 1116-dimensional information obtained from the front-end analysis is used as input, and the spectrum, energy and fundamental frequency information, 133 dimensions in total, is predicted.
3. For the predicted acoustic parameters, parameter generation is performed with the following formula to obtain smoothed acoustic parameters:
C = (WᵀU⁻¹W)⁻¹WᵀU⁻¹M
where W is the window function matrix for computing first-order and second-order differences, C is the acoustic feature to be generated, M is the acoustic feature predicted by the DNN network, and U is the global variance obtained through statistics from the training sound bank.
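A small NumPy sketch of this parameter-generation step for a single one-dimensional parameter track is given below; the simple difference windows and the random variances are assumptions of the example, and only the closed-form solution implied by the variable definitions above is illustrated.

    import numpy as np

    T = 100                                   # number of frames in the utterance
    rng = np.random.default_rng(0)

    I  = np.eye(T)
    D1 = np.eye(T) - np.eye(T, k=-1)          # first-order difference window
    D2 = D1 @ D1                              # second-order difference window
    W  = np.vstack([I, D1, D2])               # stacks statics, delta and delta-delta rows, shape (3T, T)

    M = rng.standard_normal(3 * T)            # DNN-predicted statics + differences for one parameter
    U = np.diag(rng.random(3 * T) + 0.5)      # stand-in (global) variances from the training corpus

    Uinv = np.linalg.inv(U)
    C = np.linalg.solve(W.T @ Uinv @ W, W.T @ Uinv @ M)  # smoothed trajectory C = (W'U^-1 W)^-1 W'U^-1 M

In practice the same solve is applied to each parameter stream using the window functions and variances of the training corpus.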
4. The speech is synthesized from the acoustic features C through a vocoder, yielding the emotional speech of speaker B.
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. A deep neural network model-based emotion synthesis method is characterized by comprising the following steps:
acquiring neutral acoustic feature data and emotional acoustic feature data of a first speaker;
establishing an emotion conversion model of the neutral acoustic feature data and the emotion acoustic feature data of the first speaker by using a deep neural network model;
acquiring neutral voice data of a second speaker, and establishing a neutral voice synthesis model of the second speaker; and
connecting the neutral voice synthesis model of the second speaker and the emotion conversion model in series by using a deep neural network model to obtain an emotion voice synthesis model of the second speaker;
the method for acquiring the neutral acoustic feature data and the emotional acoustic feature data of the first speaker comprises the following steps:
acquiring neutral voice data and emotion voice data of a first speaker;
carrying out deep neural network model training by utilizing the neutral voice data of the first speaker to obtain a neutral voice synthesis model of the first speaker;
carrying out deep neural network model training by utilizing the emotion voice data of the first speaker to obtain an emotion voice synthesis model of the first speaker;
providing a certain amount of sentence texts, and respectively inputting the sentence texts into a neutral voice synthesis model and an emotion voice synthesis model of the first speaker to obtain corresponding neutral acoustic feature data and emotion acoustic feature data of the first speaker;
establishing an emotion conversion model of the neutral acoustic feature data and the emotion acoustic feature data of the first speaker by using a deep neural network model through the following method, wherein the method comprises the following steps of:
taking neutral acoustic feature data of a first speaker as input data of the deep neural network model;
taking the emotional acoustic feature data of the first speaker as output data of the deep neural network model;
training the deep neural network model to obtain an emotion conversion model of neutral acoustic feature data and emotion acoustic feature data of a first speaker;
further, the deep neural network model is trained by the following method to obtain the emotion conversion model of the neutral acoustic feature data and the emotional acoustic feature data of the first speaker, the method comprising:
constructing a regression model with a neural network in the deep neural network model, a sigmoid (S-shaped curve) activation function being used for the hidden layer and a linear activation function for the output layer;
taking randomized network parameters as initial parameters, and carrying out model training based on the minimum mean square error criterion of formula 1;
L(y, z) = ||y - z||²    (1)
wherein y is the emotional acoustic feature data, z is the emotional acoustic feature parameter predicted by the deep neural network model, and the training aims to update the deep neural network model to minimize L(y, z);
connecting the neutral voice synthesis model of the second speaker and the emotion conversion model in series to obtain an emotion voice synthesis model of the second speaker by the following method:
in the synthesis stage, analyzing a text to be synthesized by using a synthesis front end to obtain corresponding text characteristics, wherein the text characteristics comprise phoneme information, prosodic information, 0/1 encoding information and relative position information of a current frame in a current phoneme;
using the phoneme information, prosody information and 0/1 coding information as the input of a deep neural network model, and predicting phoneme duration information;
using the phoneme information, prosody information, 0/1 coding information and the relative position information of the current frame in the current phoneme as the input of a deep neural network model, and predicting frequency spectrum information, energy information and fundamental frequency information;
taking the predicted frequency spectrum information, energy information and fundamental frequency information as acoustic parameters, and performing parameter generation through formula 2 to obtain smoothed acoustic features;
C = (WᵀU⁻¹W)⁻¹WᵀU⁻¹M    (2)
wherein W is a window function matrix for computing first-order and second-order differences, C is the acoustic feature to be generated, M is the acoustic parameter predicted by the deep neural network model, and U is the global variance obtained through statistics from the training sound library; and
synthesizing the emotional speech through the vocoder by using the acoustic feature C.
2. The method for emotion synthesis based on deep neural network model as claimed in claim 1, wherein after acquiring the neutral speech data of the second speaker, establishing the neutral speech synthesis model of the second speaker by the following method includes:
and retraining the neutral voice synthesis model of the first speaker by using the neutral voice data of the second speaker to obtain the neutral voice synthesis model of the second speaker.
3. The method for emotion synthesis based on deep neural network model as claimed in claim 1, wherein after acquiring the neutral speech data of the second speaker, establishing the neutral speech synthesis model of the second speaker by the following method includes:
and carrying out deep neural network model training by using the neutral voice data of the second speaker to obtain a neutral voice synthesis model of the second speaker.
4. The emotion synthesis method based on a deep neural network model as claimed in claim 1, wherein: the neutral speech data includes an acoustic feature sequence of the neutral speech including a spectrum, an energy, a fundamental frequency, and a duration, and corresponding text data information.
CN201611201686.6A 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model Active CN106531150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611201686.6A CN106531150B (en) 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611201686.6A CN106531150B (en) 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model

Publications (2)

Publication Number Publication Date
CN106531150A CN106531150A (en) 2017-03-22
CN106531150B true CN106531150B (en) 2020-02-07

Family

ID=58337400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611201686.6A Active CN106531150B (en) 2016-12-23 2016-12-23 Emotion synthesis method based on deep neural network model

Country Status (1)

Country Link
CN (1) CN106531150B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364631B (en) * 2017-01-26 2021-01-22 北京搜狗科技发展有限公司 Speech synthesis method and device
CN107103900B (en) * 2017-06-06 2020-03-31 西北师范大学 Cross-language emotion voice synthesis method and system
CN108305641B (en) * 2017-06-30 2020-04-07 腾讯科技(深圳)有限公司 Method and device for determining emotion information
CN108447470A (en) * 2017-12-28 2018-08-24 中南大学 A kind of emotional speech conversion method based on sound channel and prosodic features
US11538455B2 (en) 2018-02-16 2022-12-27 Dolby Laboratories Licensing Corporation Speech style transfer
CN108763190B (en) * 2018-04-12 2019-04-02 平安科技(深圳)有限公司 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
CN108597492B (en) * 2018-05-02 2019-11-26 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN110556092A (en) * 2018-05-15 2019-12-10 中兴通讯股份有限公司 Speech synthesis method and device, storage medium and electronic device
CN109036370B (en) * 2018-06-06 2021-07-20 安徽继远软件有限公司 Adaptive training method for speaker voice
CN109102796A (en) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 A kind of phoneme synthesizing method and device
CN111048062B (en) * 2018-10-10 2022-10-04 华为技术有限公司 Speech synthesis method and apparatus
CN111192568B (en) * 2018-11-15 2022-12-13 华为技术有限公司 Speech synthesis method and speech synthesis device
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
CN111599338B (en) * 2020-04-09 2023-04-18 云知声智能科技股份有限公司 Stable and controllable end-to-end speech synthesis method and device
CN111613224A (en) * 2020-04-10 2020-09-01 云知声智能科技股份有限公司 Personalized voice synthesis method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064104B (en) * 2006-04-24 2011-02-02 中国科学院自动化研究所 Emotion voice creating method based on voice conversion
CN101308652B (en) * 2008-07-17 2011-06-29 安徽科大讯飞信息科技股份有限公司 Synthesizing method of personalized singing voice
CN102005205B (en) * 2009-09-03 2012-10-03 株式会社东芝 Emotional speech synthesizing method and device
US9767789B2 (en) * 2012-08-29 2017-09-19 Nuance Communications, Inc. Using emoticons for contextual text-to-speech expressivity
KR102305584B1 (en) * 2015-01-19 2021-09-27 삼성전자주식회사 Method and apparatus for training language model, method and apparatus for recognizing language
CN105206258B (en) * 2015-10-19 2018-05-04 百度在线网络技术(北京)有限公司 The generation method and device and phoneme synthesizing method and device of acoustic model

Also Published As

Publication number Publication date
CN106531150A (en) 2017-03-22

Similar Documents

Publication Publication Date Title
CN106531150B (en) Emotion synthesis method based on deep neural network model
JP7106680B2 (en) Text-to-Speech Synthesis in Target Speaker's Voice Using Neural Networks
CN108573693B (en) Text-to-speech system and method, and storage medium therefor
CN111048062B (en) Speech synthesis method and apparatus
US11443733B2 (en) Contextual text-to-speech processing
CN101578659B (en) Voice tone converting device and voice tone converting method
US8571871B1 (en) Methods and systems for adaptation of synthetic speech in an environment
CN106688034A (en) Text-to-speech with emotional content
JP6342428B2 (en) Speech synthesis apparatus, speech synthesis method and program
JP6846237B2 (en) Speech synthesizer and program
US20100057435A1 (en) System and method for speech-to-speech translation
US11763797B2 (en) Text-to-speech (TTS) processing
CN109326280B (en) Singing synthesis method and device and electronic equipment
JP5411845B2 (en) Speech synthesis method, speech synthesizer, and speech synthesis program
US20240087558A1 (en) Methods and systems for modifying speech generated by a text-to-speech synthesiser
EP4266306A1 (en) A speech processing system and a method of processing a speech signal
CN110459201B (en) Speech synthesis method for generating new tone
JP6594251B2 (en) Acoustic model learning device, speech synthesizer, method and program thereof
Mei et al. A particular character speech synthesis system based on deep learning
JP5268731B2 (en) Speech synthesis apparatus, method and program
Le et al. Emotional Vietnamese Speech Synthesis Using Style-Transfer Learning.
CN114446278A (en) Speech synthesis method and apparatus, device and storage medium
JP5722295B2 (en) Acoustic model generation method, speech synthesis method, apparatus and program thereof
Ronanki Prosody generation for text-to-speech synthesis
JP2021099454A (en) Speech synthesis device, speech synthesis program, and speech synthesis method

Legal Events

• C06: Publication
• PB01: Publication
• SE01: Entry into force of request for substantive examination
• TA01: Transfer of patent application right
  Effective date of registration: 20170929
  Address after: 200233 Shanghai City, Xuhui District Guangxi 65 No. 1 Jinglu room 702 unit 03
  Applicant after: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.
  Address before: 200233 Shanghai, Qinzhou, North Road, No. 82, building 2, layer 1198
  Applicant before: SHANGHAI YUZHIYI INFORMATION TECHNOLOGY Co.,Ltd.
• GR01: Patent grant
• PE01: Entry into force of the registration of the contract for pledge of patent right
  Denomination of invention: An emotion synthesis method based on deep neural network model
  Effective date of registration: 20201201
  Granted publication date: 20200207
  Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch
  Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY Co.,Ltd.
  Registration number: Y2020310000047
• PC01: Cancellation of the registration of the contract for pledge of patent right
  Date of cancellation: 20220307
  Granted publication date: 20200207
  Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch
  Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.
  Registration number: Y2020310000047
• PE01: Entry into force of the registration of the contract for pledge of patent right
  Denomination of invention: An emotion synthesis method based on deep neural network model
  Effective date of registration: 20230210
  Granted publication date: 20200207
  Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch
  Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.
  Registration number: Y2023310000028
• PC01: Cancellation of the registration of the contract for pledge of patent right
  Granted publication date: 20200207
  Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch
  Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.
  Registration number: Y2023310000028
• PE01: Entry into force of the registration of the contract for pledge of patent right
  Denomination of invention: A sentiment synthesis method based on deep neural network models
  Granted publication date: 20200207
  Pledgee: Bank of Hangzhou Limited by Share Ltd. Shanghai branch
  Pledgor: YUNZHISHENG (SHANGHAI) INTELLIGENT TECHNOLOGY CO.,LTD.
  Registration number: Y2024310000165