WO2020253509A1 - Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium - Google Patents


Info

Publication number
WO2020253509A1
Authority
WO
WIPO (PCT)
Prior art keywords: speech, emotion, input, emotional state, synthesized
Application number
PCT/CN2020/093564
Other languages
French (fr)
Chinese (zh)
Inventor
彭话易
王健宗
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020253509A1 publication Critical patent/WO2020253509A1/en


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L 2015/223: Execution procedure of a spoken command
    • G10L 2015/225: Feedback of the input speech

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a Chinese speech synthesis method, device and storage medium oriented to situations and emotions.
  • This application provides a context- and emotion-oriented Chinese speech synthesis method, device, and storage medium to solve the problem in the prior art that human-computer interaction is difficult to sustain because speech is always synthesized with a fixed tone and emotion.
  • One aspect of this application provides a context- and emotion-oriented Chinese speech synthesis method, including: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • Another aspect of this application provides an electronic device, which includes a processor and a memory. The memory stores a context- and emotion-oriented Chinese speech synthesis program. When the Chinese speech synthesis program is executed by the processor, the steps of the Chinese speech synthesis method are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • Another aspect of the present application provides a computer device, which includes a memory, a processor, and a context- and emotion-oriented Chinese speech synthesis program stored in the memory and executable on the processor. When the Chinese speech synthesis program is executed by the processor, the following steps are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • Yet another aspect of the present application provides a computer-readable storage medium, which includes a context- and emotion-oriented Chinese speech synthesis program. When the Chinese speech synthesis program is executed by a processor, the steps of the Chinese speech synthesis method are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • This application analyzes the emotional state of the input speech and derives the emotional state of the synthesized speech from it. During speech synthesis, emotional state and context analysis are added so that the tone and emotion of the synthesized speech match the current interaction scene instead of being fixed. In the process of human-computer interaction, the output synthesized speech is therefore more like a real person, enhancing the user experience.
  • Figure 1 is a schematic flow chart of the context- and emotion-oriented Chinese speech synthesis method described in this application;
  • Figure 2 is a schematic structural diagram of the emotion recognition model in this application;
  • Figure 3 is a schematic structural diagram of the convolutional recurrent neural network in this application;
  • Figure 4 is a schematic structural diagram of the sentiment classification model in this application;
  • Figure 5 is a schematic diagram of the modules of the context- and emotion-oriented Chinese speech synthesis program in this application.
  • Figure 1 is a schematic flow chart of the context- and emotion-oriented Chinese speech synthesis method described in this application. As shown in Figure 1, the context- and emotion-oriented Chinese speech synthesis method described in this application includes the following steps:
  • Step S1: Obtain the input speech; the input speech is the speech to be responded to. For example, in a human-computer interaction system with a smart speaker, the input speech is the user's query, and the synthesized speech is the smart speaker's feedback to the user. This application derives the emotional state of the synthesized speech from the emotional state of the user's query, so that the smart speaker's feedback carries a specific tone and emotion that matches the emotion of the user's input speech;
  • Step S2: Input the input speech into an emotion analysis model, and output the emotional state of the input speech through the emotion analysis model;
  • Step S3: Determine the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech. When determining the emotional state of the synthesized speech, the dialogue scene is added as an influencing factor, so that the feedback obtained in human-computer interaction not only responds to the user's emotion but also fits the actual application scene and avoids errors. For example, in a sales scene, even if the emotion expressed by the customer's input speech is impatient, after the dialogue scene is taken into account the emotional state of the synthesized speech should still be happy and positive, so as to serve the customer well;
  • Step S4: Perform speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech, where the text to be synthesized is the feedback text determined by the intelligent system according to the input speech during human-computer interaction.
  • By analyzing the emotional state of the input speech, this application obtains the emotional state of the synthesized speech. During speech synthesis, emotional state and context analysis are added so that the tone and emotion of the synthesized speech match the current interaction scene, and the output synthesized speech in human-computer interaction is more like a real person.
  • Preferably, before the step of inputting the input speech into the emotion analysis model, the Chinese speech synthesis method further includes: judging whether an interaction scene exists according to the input speech and the text to be synthesized. If there is no interaction scene, a preset emotional state of the synthesized speech is used, or the default emotional state of the synthesized speech is adopted, and no emotion analysis is performed on the input speech; if an interaction scene exists, the next step is performed, that is, the input speech is fed into the emotion analysis model for emotional state analysis.
  • The preset emotional state of the synthesized speech can be impassioned or calm and gentle, and can be set according to the role or purpose of the human-computer interaction.
  • For example, for an intelligent question-answering system, the default emotional state of the synthesized feedback speech is calm and gentle. If the input speech only involves a simple inquiry and no interaction scene, it is sufficient to determine the content of the text to be synthesized from the input speech and output it with a calm, gentle emotion. For instance, if a user asks "What is the temperature in Beijing today", the question-answering system only needs to reply "Today's temperature in Beijing is ×× degrees Celsius" in the default tone and emotion, without performing emotion analysis on the input speech.
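  • The following is a minimal, illustrative sketch of this pre-check. The helper function and the default emotion label are hypothetical placeholders for the components described in this application, not part of the original disclosure.

```python
# A minimal sketch of the interaction-scene pre-check, assuming hypothetical helpers.
DEFAULT_EMOTION = "calm and gentle"   # assumed default for a Q&A-style system

def has_interaction_scene(input_speech: str, text_to_synthesize: str) -> bool:
    """Stub: a real system would run the context analysis described below."""
    return False   # e.g. a plain factual query such as a weather question

def respond(input_speech: str, text_to_synthesize: str) -> dict:
    if not has_interaction_scene(input_speech, text_to_synthesize):
        # No interaction scene: skip emotion analysis, use the default emotional state.
        return {"text": text_to_synthesize, "emotion": DEFAULT_EMOTION}
    # Otherwise the input speech is passed on to the emotion analysis model (steps S2 to S4).
    raise NotImplementedError("emotion analysis and synthesis are sketched further below")

print(respond("What is the temperature in Beijing today?",
              "Today's temperature in Beijing is 25 degrees Celsius."))
```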
  • In one embodiment of this application, the emotion analysis model includes a speech-based emotion recognition model. Figure 2 is a schematic structural diagram of the emotion recognition model in this application. As shown in Figure 2, the emotion recognition model is divided into three parts. The first part includes a convolutional recurrent neural network (CRNN) and a first fully connected (FC) layer; its input is the spectrogram, and a first feature vector is output through the CRNN and the first FC layer. The second part includes three successively connected second FC layers; its input is low-level speech feature descriptors (LLDs), including fundamental frequency, energy, zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) and other features, and it outputs a second feature vector. The third part includes a third FC layer and a normalization layer; its input is the fusion of the first feature vector and the second feature vector, and its output is a probability vector representing the emotion classification.
  • Figure 3 is a schematic structural diagram of the convolutional recurrent neural network in this application. As shown in Figure 3, the convolutional recurrent neural network includes: a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a long short-term memory (LSTM) layer, and a third pooling layer. The third pooling layer includes three pooling modules, namely a minimum pooling module, an average pooling module, and a maximum pooling module, and each pooling module is connected to every neuron in the LSTM layer. The convolutional recurrent neural network is trained on an emotional speech data set, which contains about 15 hours of speech data from multiple speakers together with the corresponding emotion labels.
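  • A rough PyTorch sketch of this three-part model is given below. All layer sizes, channel counts and the number of emotion classes are illustrative assumptions; only the overall wiring (CRNN plus first FC layer over the spectrogram, three second FC layers over the LLD statistics, and a third FC layer with softmax over the fused vector) follows the description above.

```python
import torch
import torch.nn as nn

class CRNNBranch(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                # first conv + first pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                # second conv + second pooling layer
        )
        self.lstm = nn.LSTM(input_size=32 * 16, hidden_size=64, batch_first=True)
        self.fc1 = nn.Linear(64 * 3, feat_dim)              # first fully connected layer

    def forward(self, spectrogram):                         # (batch, 1, 64 freq bins, frames)
        x = self.conv(spectrogram)                          # (batch, 32, 16, frames / 4)
        x = x.permute(0, 3, 1, 2).flatten(2)                # (batch, time, 32 * 16)
        x, _ = self.lstm(x)                                 # (batch, time, 64)
        # Third pooling layer: minimum, average and maximum pooling over time, concatenated.
        pooled = torch.cat([x.min(dim=1).values,
                            x.mean(dim=1),
                            x.max(dim=1).values], dim=-1)   # (batch, 64 * 3)
        return self.fc1(pooled)                             # first feature vector

class EmotionRecognitionModel(nn.Module):
    def __init__(self, n_lld_stats=32, n_emotions=5, feat_dim=128):
        super().__init__()
        self.spec_branch = CRNNBranch(feat_dim)
        self.lld_branch = nn.Sequential(                    # three second FC layers
            nn.Linear(n_lld_stats, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        self.head = nn.Sequential(                          # third FC layer + normalization layer
            nn.Linear(feat_dim * 2, n_emotions),
            nn.Softmax(dim=-1),
        )

    def forward(self, spectrogram, lld_stats):
        v1 = self.spec_branch(spectrogram)                  # first feature vector
        v2 = self.lld_branch(lld_stats)                     # second feature vector
        fused = torch.cat([v1, v2], dim=-1)                 # fusion feature vector
        return self.head(fused)                             # probability vector over emotions

probs = EmotionRecognitionModel()(torch.randn(2, 1, 64, 200), torch.randn(2, 32))
```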
  • In one embodiment of this application, the step of outputting the emotional state of the input speech through the emotion analysis model includes: obtaining the spectrogram and speech feature parameters of the input speech; feeding the spectrogram of the input speech into the trained convolutional recurrent neural network (CRNN) of the emotion recognition model and outputting a first feature vector through the CRNN and the first fully connected layer; computing high-level statistics functions (HSF features, i.e., statistics such as the mean or maximum of the feature parameters over the multiple frames of a speech segment) from the speech feature parameters, feeding them into the emotion recognition model, and outputting a second feature vector through the three second fully connected layers of the emotion recognition model; concatenating the first feature vector and the second feature vector to obtain a fusion feature vector; passing the fusion feature vector through the third fully connected layer and the normalization layer of the emotion recognition model to output a first probability vector of the emotion of the input speech; and obtaining the emotional state of the input speech according to the first probability vector.
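  • The sketch below illustrates the kind of feature extraction these steps imply: a spectrogram for the CRNN branch and HSF statistics (mean and maximum of frame-level LLDs) for the fully connected branch. The specific LLD set, the log-mel representation and the use of librosa are assumptions made for illustration; the application itself lists fundamental frequency, energy, zero-crossing rate, MFCC and LPCC as example LLDs.

```python
import numpy as np
import librosa

sr = 16000
y = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # 1 s synthetic tone as stand-in audio

# Spectrogram input for the CRNN branch (log-mel here, 64 bins assumed).
spectrogram = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))

# Frame-level LLDs: fundamental frequency, energy, zero-crossing rate, MFCCs.
f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)
energy = librosa.feature.rms(y=y)[0]
zcr = librosa.feature.zero_crossing_rate(y)[0]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# HSF statistics: collapse each frame-level track to a segment-level mean and maximum.
llds = [f0, energy, zcr] + list(mfcc)
hsf = np.array([stat(track) for track in llds for stat in (np.mean, np.max)])

print(spectrogram.shape, hsf.shape)   # e.g. (64, frames) and (32,)
```

  • The resulting 32-dimensional HSF vector and the spectrogram could, for example, feed the two branches of the model sketched above.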
  • In one embodiment of this application, the sentiment analysis model includes a text-based sentiment classification model. Figure 4 is a schematic structural diagram of the sentiment classification model in this application. As shown in Figure 4, the sentiment classification model includes a feature extraction layer and a classifier. The feature extraction layer is used to extract features of the input text and vectorize it: it includes an input layer for the text to be classified, converts the text to be classified into multiple word vectors, and constructs a sentence vector from the word vectors, for example by using an open-source BERT (Bidirectional Encoder Representations from Transformers) model. The classifier is composed of an LSTM neural network and includes an input layer, a hidden layer, and an output layer, where the input layer contains 256 input nodes for the sentence vector, the hidden layer contains 128 hidden nodes, and the output layer uses a softmax function to output emotion labels and their probabilities.
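  • A hedged sketch of one way to realize this text-based path is shown below. The choice of the 'bert-base-chinese' checkpoint, the emotion label set and the exact wiring between the BERT features and the LSTM classifier are assumptions; the application only specifies an open-source BERT feature extractor, a 256-node input layer, 128 hidden nodes and a softmax output.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

EMOTIONS = ["happy", "sad", "impatient", "excited", "agitated"]   # assumed label set

class TextEmotionClassifier(nn.Module):
    def __init__(self, n_emotions=len(EMOTIONS)):
        super().__init__()
        self.proj = nn.Linear(768, 256)                    # map BERT features to the 256-d input layer
        self.lstm = nn.LSTM(256, 128, batch_first=True)    # 128 hidden nodes
        self.out = nn.Sequential(nn.Linear(128, n_emotions), nn.Softmax(dim=-1))

    def forward(self, word_vectors):                       # (batch, tokens, 768) from BERT
        x = self.proj(word_vectors)
        _, (h, _) = self.lstm(x)                           # final hidden state summarizes the sentence
        return self.out(h[-1])                             # second probability vector over emotions

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed open-source BERT model
bert = BertModel.from_pretrained("bert-base-chinese")
inputs = tokenizer("今天我要穿这件衣服,你觉得好看吗?", return_tensors="pt")
with torch.no_grad():
    word_vectors = bert(**inputs).last_hidden_state               # feature extraction layer
probs = TextEmotionClassifier()(word_vectors)
print(dict(zip(EMOTIONS, probs[0].tolist())))
```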
  • In this embodiment, the step of outputting the emotional state of the input speech through the emotion analysis model includes: converting the input speech into text to be classified through speech recognition; extracting the text feature vector of the text to be classified; inputting the text feature vector into the deep neural network classifier in the sentiment classification model; obtaining a second probability vector of the emotion of the input speech through the classifier; and obtaining the emotional state of the input speech according to the second probability vector.
  • The sentiment analysis model may include only one of the speech-based emotion recognition model and the text-based sentiment classification model, or it may include both. Preferably, it includes both: the probability vectors representing the speech emotion are obtained from the two models, and a comprehensive analysis is performed on the results of both models to improve the accuracy of the sentiment analysis.
  • In this case, the step of outputting the emotional state of the input speech through the emotion analysis model includes: obtaining a first probability vector of the emotion of the input speech through the speech-based emotion recognition model and obtaining first confidence levels for multiple speech emotions from the first probability vector; obtaining a second probability vector of the emotion of the input speech through the text-based sentiment classification model and obtaining second confidence levels for the multiple speech emotions from the second probability vector; adding the first confidence level and the second confidence level of each speech emotion to obtain a confidence vector over the multiple speech emotions; and selecting the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech.
  • For example, suppose the possible emotional states of the speech include happy, sad, impatient, excited, and agitated. If the first confidence levels corresponding to these emotional states are 0.6, 0.2, 0.3, 0.4, and 0.7, and the second confidence levels are 0.8, 0.3, 0.2, 0.5, and 0.5, then adding the corresponding first and second confidence levels gives final confidence levels of 1.4, 0.5, 0.5, 0.9, and 1.2, and the emotional state corresponding to the maximum confidence of 1.4 (happy) is selected as the emotional state of the input speech.
  • In another embodiment, the two results obtained from the speech-based emotion recognition model and the text-based sentiment classification model, which respectively represent the confidence levels of the various emotional states of the input speech, are given different weights, and the weighted confidence levels are added to predict the final speech emotional state. For example, set a weight of 0.6 for the emotion recognition model and a weight of 0.4 for the sentiment classification model. Suppose again that the possible speech emotional states include happy, sad, impatient, excited, and agitated, the first confidence levels are 0.6, 0.2, 0.3, 0.4, and 0.7, and the second confidence levels are 0.8, 0.3, 0.2, 0.5, and 0.5. Adding the corresponding first and second confidence levels according to the set weights gives final confidence levels of 0.68, 0.24, 0.26, 0.44, and 0.62, and the emotional state corresponding to the maximum confidence of 0.68 (happy) is selected as the emotional state of the input speech.
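  • The small sketch below reproduces both fusion rules with the numbers from the examples above (plain addition, and the weighted variant with weights 0.6 and 0.4); the emotion names are the ones assumed earlier.

```python
import numpy as np

emotions = ["happy", "sad", "impatient", "excited", "agitated"]
first  = np.array([0.6, 0.2, 0.3, 0.4, 0.7])   # speech-based emotion recognition model
second = np.array([0.8, 0.3, 0.2, 0.5, 0.5])   # text-based sentiment classification model

summed   = first + second                       # -> [1.4, 0.5, 0.5, 0.9, 1.2]
weighted = 0.6 * first + 0.4 * second           # -> [0.68, 0.24, 0.26, 0.44, 0.62]

print(emotions[int(np.argmax(summed))])         # 'happy' (confidence 1.4)
print(emotions[int(np.argmax(weighted))])       # 'happy' (confidence 0.68)
```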
  • In one embodiment, determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech includes: constructing a scene library, where the scene library includes a variety of dialogue scenes and the emotional state corresponding to each dialogue scene; the dialogue scenes and their corresponding emotional states can be annotated manually, according to the specific situation combined with human cognition, so as to predefine the emotional state required for the synthesized speech in certain specific dialogue scenes, because even for a response to input speech with the same emotion, the emotional state of the synthesized speech to be fed back may differ across dialogue scenes; performing context analysis according to the input speech and the text to be synthesized to obtain the dialogue scene of the text to be synthesized; querying the scene library for the emotional state corresponding to that dialogue scene; and determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
  • For example, in a sales scene, even if the emotion expressed by the customer's input speech is impatient, the emotional state of the synthesized speech should be happy and positive so as to serve the customer well. In this way, the feedback obtained during human-computer interaction better matches the real application scene and the user experience is enhanced.
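  • An illustrative sketch of the scene library and the emotion determination step follows. The scene labels, emotion names and combination rule are hypothetical; the application only specifies that manually annotated scene-to-emotion mappings are combined with the emotional state of the input speech.

```python
SCENE_LIBRARY = {                      # manually annotated dialogue scenes -> target emotion
    "sales": "happy_positive",
    "customer_complaint": "apologetic_calm",
    "casual_chat": "mirror_user",      # special value: follow the user's emotion
}

def determine_synth_emotion(dialogue_scene: str, input_emotion: str,
                            default: str = "calm_gentle") -> str:
    scene_emotion = SCENE_LIBRARY.get(dialogue_scene, default)
    if scene_emotion == "mirror_user":
        return input_emotion           # e.g. reply excitedly to an excited user
    return scene_emotion               # scene requirement wins, e.g. stay upbeat in a sales scene

print(determine_synth_emotion("sales", "impatient"))        # happy_positive
print(determine_synth_emotion("casual_chat", "excited"))    # excited
```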
  • In one embodiment, performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized includes: embedding modal auxiliary words into the text to be synthesized through waveform splicing technology; controlling the tone and prosody of the synthesized speech through end-to-end synthesis technology; and performing speech synthesis according to the embedded modal auxiliary words, tone, and prosody, so that the synthesized speech expresses the corresponding tone and emotion.
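  • Purely as an illustration of how these pieces could be orchestrated, the sketch below maps an emotional state to modal particles and tone/prosody controls. Every function, particle choice and parameter name here is a hypothetical stub; in a full system the particle audio would come from a waveform-splicing voice bank and the controls would drive an end-to-end synthesizer.

```python
MODAL_PARTICLES = {"happy_positive": "呀", "calm_gentle": "哦"}   # assumed particle choices

def embed_modal_particles(text: str, emotion: str) -> str:
    particle = MODAL_PARTICLES.get(emotion, "")
    return text + particle if particle else text

def prosody_controls(emotion: str) -> dict:
    # Hypothetical mapping from emotional state to tone/prosody parameters.
    table = {"happy_positive": {"pitch_shift": +2, "rate": 1.1},
             "calm_gentle":    {"pitch_shift":  0, "rate": 0.95}}
    return table.get(emotion, {"pitch_shift": 0, "rate": 1.0})

def synthesize(text: str, emotion: str) -> dict:
    enriched = embed_modal_particles(text, emotion)
    controls = prosody_controls(emotion)
    # A real system would call an end-to-end TTS model here with these controls.
    return {"text": enriched, **controls}

print(synthesize("我觉得非常好看", "happy_positive"))
```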
  • The context- and emotion-oriented Chinese speech synthesis method described in this application is applied to an electronic device, which may be a terminal device such as a television, a smart phone, a tablet computer, or a computer. The electronic device includes a processor and a memory for storing a context- and emotion-oriented Chinese speech synthesis program. The processor executes the Chinese speech synthesis program to realize the following steps of the context- and emotion-oriented Chinese speech synthesis method: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • The electronic device also includes a network interface, a communication bus, and the like. The network interface may include a standard wired interface and a wireless interface, and the communication bus is used to realize connection and communication between the components.
  • The memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, hard disk, optical disk, or plug-in hard disk, but is not limited thereto; it may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides them to the processor so that the processor can execute them.
  • The software stored in the memory includes a context- and emotion-oriented Chinese speech synthesis program, which can be provided to the processor so that the processor executes it to realize the steps of the Chinese speech synthesis method.
  • the processor can be a central processing unit, a microprocessor, or other data processing chips, etc., and can run stored programs in the memory, for example, the context-oriented and emotional Chinese speech synthesis program in this application.
  • the electronic device may also include a display, and the display may also be called a display screen or a display unit.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, etc.
  • the display is used to display the information processed in the electronic device and to display the visual work interface.
  • The electronic device may further include a user interface, and the user interface may include an input unit (such as a keyboard), a voice output device (such as a speaker or earphones), and the like.
  • the context and emotion-oriented Chinese speech synthesis program can also be divided into one or more modules, and the one or more modules are stored in the memory and executed by the processor to complete the application.
  • the module referred to in this application refers to a series of computer program instruction segments that can complete specific functions.
  • Figure 5 is a schematic diagram of the modules of the context- and emotion-oriented Chinese speech synthesis program in this application. As shown in Figure 5, the Chinese speech synthesis program can be divided into: an acquisition module 1, an emotion analysis module 2, an emotion determination module 3, and a speech synthesis module 4.
  • The functions or operation steps implemented by the above modules are similar to those described above. For example: the acquisition module 1 obtains the input speech; the emotion analysis module 2 inputs the input speech into the emotion analysis model and outputs the emotional state of the input speech through the emotion analysis model; the emotion determination module 3 determines the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and the speech synthesis module 4 performs speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
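  • One possible way to express this module division in code is sketched below; the class and the lambda stubs are hypothetical wrappers around the components sketched in the previous examples, not part of the original disclosure.

```python
class ChineseSpeechSynthesisProgram:
    def __init__(self, acquisition, emotion_analysis, emotion_determination, synthesis):
        self.acquisition = acquisition                        # module 1: obtains the input speech
        self.emotion_analysis = emotion_analysis              # module 2: emotional state of the input
        self.emotion_determination = emotion_determination    # module 3: scene + input emotion
        self.synthesis = synthesis                            # module 4: emotional speech synthesis

    def run(self, text_to_synthesize, dialogue_scene):
        input_speech = self.acquisition()
        input_emotion = self.emotion_analysis(input_speech)
        synth_emotion = self.emotion_determination(dialogue_scene, input_emotion)
        return self.synthesis(text_to_synthesize, synth_emotion)

program = ChineseSpeechSynthesisProgram(
    acquisition=lambda: "user_query.wav",
    emotion_analysis=lambda speech: "excited",
    emotion_determination=lambda scene, emo: "happy_positive" if scene == "sales" else emo,
    synthesis=lambda text, emo: {"text": text, "emotion": emo},
)
print(program.run("我觉得非常好看", "casual_chat"))
```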
  • Preferably, the electronic device further includes a judgment module. Before the step of inputting the input speech into the emotion analysis model, the judgment module judges whether an interaction scene exists according to the input speech and the text to be synthesized. If there is no interaction scene, a preset emotional state of the synthesized speech is used, or the default emotional state of the synthesized speech is adopted, and no emotion analysis is performed on the input speech; if an interaction scene exists, the next step is performed, that is, the input speech is fed into the emotion analysis model for emotional state analysis.
  • The preset emotional state of the synthesized speech can be impassioned or calm and gentle, and can be set according to the role or purpose of the human-computer interaction.
  • For example, for an intelligent question-answering system, the default emotional state of the synthesized feedback speech is calm and gentle. If the input speech only involves a simple inquiry and no interaction scene, it is sufficient to determine the content of the text to be synthesized from the input speech and output it with a calm, gentle emotion. For instance, if a user asks "What is the temperature in Beijing today", the question-answering system only needs to reply "Today's temperature in Beijing is ×× degrees Celsius" in the default tone and emotion, without performing emotion analysis on the input speech.
  • In one embodiment of this application, the emotion analysis model includes a speech-based emotion recognition model. Figure 2 is a schematic structural diagram of the emotion recognition model in this application. As shown in Figure 2, the emotion recognition model is divided into three parts. The first part includes a convolutional recurrent neural network (CRNN) and a first fully connected (FC) layer; its input is the spectrogram, and a first feature vector is output through the CRNN and the first FC layer. The second part includes three successively connected second FC layers; its input is low-level speech feature descriptors (LLDs), including fundamental frequency, energy, zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) and other features, and it outputs a second feature vector. The third part includes a third FC layer and a normalization layer; its input is the fusion of the first feature vector and the second feature vector, and its output is a probability vector representing the emotion classification.
  • Figure 3 is a schematic structural diagram of the convolutional recurrent neural network in this application. As shown in Figure 3, the convolutional recurrent neural network includes: a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a long short-term memory (LSTM) layer, and a third pooling layer. The third pooling layer includes three pooling modules, namely a minimum pooling module, an average pooling module, and a maximum pooling module, and each pooling module is connected to every neuron in the LSTM layer. The convolutional recurrent neural network is trained on an emotional speech data set, which contains about 15 hours of speech data from multiple speakers together with the corresponding emotion labels.
  • In one embodiment, the emotion analysis module includes: a parameter acquisition unit, which acquires the spectrogram and speech feature parameters of the input speech; a first feature vector acquisition unit, which feeds the spectrogram of the input speech into the trained convolutional recurrent neural network of the emotion recognition model and outputs a first feature vector through the convolutional recurrent neural network and the first fully connected layer; a second feature vector acquisition unit, which obtains HSF features from the speech feature parameters (statistics such as the mean or maximum of the feature parameters over the multiple frames of a speech segment), feeds them into the emotion recognition model, and outputs a second feature vector through the three second fully connected layers of the emotion recognition model; a feature fusion unit, which concatenates the first feature vector and the second feature vector to obtain a fusion feature vector; a first probability vector acquisition unit, which passes the fusion feature vector through the third fully connected layer and the normalization layer of the emotion recognition model to output a first probability vector of the emotion of the input speech; and an emotional state output unit, which obtains the emotional state of the input speech according to the first probability vector.
  • In one embodiment of this application, the sentiment analysis model includes a text-based sentiment classification model. Figure 4 is a schematic structural diagram of the sentiment classification model in this application. As shown in Figure 4, the sentiment classification model includes a feature extraction layer and a classifier. The feature extraction layer is used to extract features of the input text and vectorize it: it includes an input layer for the text to be classified, converts the text to be classified into multiple word vectors, and constructs a sentence vector from the word vectors, for example by using an open-source BERT (Bidirectional Encoder Representations from Transformers) model. The classifier is composed of an LSTM neural network and includes an input layer, a hidden layer, and an output layer, where the input layer contains 256 input nodes for the sentence vector, the hidden layer contains 128 hidden nodes, and the output layer uses a softmax function to output emotion labels and their probabilities.
  • In this embodiment, the emotion analysis module includes: a text conversion unit, which converts the input speech into text to be classified through speech recognition; a feature extraction unit, which extracts the text feature vector of the text to be classified and inputs the text feature vector into the deep neural network classifier in the sentiment classification model; a second probability vector acquisition unit, which obtains a second probability vector of the emotion of the input speech through the classifier; and an emotional state output unit, which obtains the emotional state of the input speech according to the second probability vector.
  • The sentiment analysis model may include only one of the speech-based emotion recognition model and the text-based sentiment classification model, or it may include both. Preferably, it includes both: the probability vectors representing the speech emotion are obtained from the two models, and a comprehensive analysis is performed on the results of both models to improve the accuracy of the sentiment analysis.
  • In this case, the emotion analysis module includes: a first confidence acquisition unit, which obtains a first probability vector of the emotion of the input speech through the speech-based emotion recognition model and obtains first confidence levels for multiple speech emotions from the first probability vector; a second confidence acquisition unit, which obtains a second probability vector of the emotion of the input speech through the text-based sentiment classification model and obtains second confidence levels for the multiple speech emotions from the second probability vector; a confidence vector acquisition unit, which adds the first confidence level and the second confidence level of each speech emotion to obtain a confidence vector over the multiple speech emotions; and a selection unit, which selects the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech.
  • For example, suppose the possible emotional states of the speech include happy, sad, impatient, excited, and agitated. If the first confidence levels corresponding to these emotional states are 0.6, 0.2, 0.3, 0.4, and 0.7, and the second confidence levels are 0.8, 0.3, 0.2, 0.5, and 0.5, then adding the corresponding first and second confidence levels gives final confidence levels of 1.4, 0.5, 0.5, 0.9, and 1.2, and the emotional state corresponding to the maximum confidence of 1.4 (happy) is selected as the emotional state of the input speech.
  • In one embodiment, the emotion determination module includes: a construction unit, which constructs a scene library, where the scene library includes a variety of dialogue scenes and the emotional state corresponding to each dialogue scene; the dialogue scenes and their corresponding emotional states can be annotated manually, according to the specific situation combined with human cognition, so as to predefine the emotional state required for the synthesized speech in certain specific dialogue scenes, because even for a response to input speech with the same emotion, the emotional state of the synthesized speech to be fed back may differ across dialogue scenes; a context analysis unit, which performs context analysis according to the input speech and the text to be synthesized and obtains the dialogue scene of the text to be synthesized; a query unit, which obtains the emotional state corresponding to the dialogue scene of the text to be synthesized from the scene library; and an emotional state determining unit, which determines the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
  • For example, in a sales scene, even if the emotion expressed by the customer's input speech is impatient, the emotional state of the synthesized speech should be happy and positive so as to serve the customer well. In this way, the feedback obtained during human-computer interaction better matches the real application scene and the user experience is enhanced.
  • In one embodiment, the speech synthesis module includes: a modal particle embedding unit, which embeds modal auxiliary words into the text to be synthesized through waveform splicing technology; a prosody control unit, which controls the tone and prosody of the synthesized speech through end-to-end synthesis technology; and a speech synthesis unit, which performs speech synthesis according to the embedded modal auxiliary words, tone, and prosody, so that the synthesized speech expresses the corresponding tone and emotion.
  • The computer-readable storage medium may be any tangible medium that contains or stores a program or instructions; the program can be executed, and the stored program instructs the related hardware to realize the corresponding functions. The computer-readable storage medium may be a computer disk, a hard disk, a random access memory, a read-only memory, and the like, but the application is not limited thereto; it may be any device that stores instructions or software and any related data files or data structures in a non-transitory manner and can provide them to a processor so that the processor executes the programs or instructions therein.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium includes a context- and emotion-oriented Chinese speech synthesis program; when the Chinese speech synthesis program is executed by the processor, the Chinese speech synthesis method described above is realized.

Abstract

The present application belongs to the technical field of artificial intelligence, and discloses a situation- and emotion-oriented Chinese speech synthesis method, device, and storage medium, said method comprising: obtaining input speech; inputting said input speech into a sentiment analysis model, and outputting the emotional state of the input speech by means of said sentiment analysis model; determining the emotional state of synthesized speech according to a dialogue situation and the emotional state of the input speech; performing speech synthesis according to the emotional state of the synthesized speech and text to be synthesized which has been determined on the basis of the input speech. In the present application, the emotional state of the input speech is analyzed, and the emotional state of the synthesized speech is obtained according to the emotional state of the input speech; during speech synthesis, the emotional state and situation analysis is added to cause the intonation and emotion of the synthesized speech to match a current interaction situation rather than being of fixed intonation and emotion; during the process of human-computer interaction, the outputted synthesized speech is more similar to the speech of a real person, thus enhancing user experience.

Description

Situation- and emotion-oriented Chinese speech synthesis method, device and storage medium
Under the Paris Convention, this application claims priority to the Chinese patent application No. CN201910531628.7, filed on June 19, 2019 and entitled "Situation- and emotion-oriented Chinese speech synthesis method, device and storage medium"; the entire content of that Chinese patent application is incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a situation- and emotion-oriented Chinese speech synthesis method, device and storage medium.
Background art
With the rapid development of computer technology, people's requirements for speech synthesis systems have become higher and higher, from the initial "being intelligible" to today's "sounding like a real person". There are three main technical solutions for existing speech synthesis systems: parametric synthesis, waveform splicing, and end-to-end synthesis based on deep learning. Speech synthesized by waveform splicing has very high sound quality, but producing the required speech library is very time-consuming and labor-intensive, usually requiring more than 30 hours of recording as well as the related cutting and labeling work. Existing end-to-end speech synthesis can also produce speech with high sound quality and very good prosody, and the required training speech library usually only takes about 15 hours; compared with waveform splicing, its synthesis speed is slightly slower and its implementation requires a GPU, so the cost is relatively high. Although the speech synthesized by existing speech synthesis systems is good in sound quality, there is still a gap compared with real people. The main reason for this gap is that the same speech system always synthesizes speech with the same tone and the same emotion, whereas when humans speak, tone and emotion change constantly and are closely related to the speaking scene and the content of the speech. When the tone and emotion of the synthesized speech do not match the current scene, even if the synthesized speech has good sound quality, it still feels fake, because it does not match our cognition. For example, smart speakers are now widely available in the market, and speech synthesis systems enable smart speakers to communicate with humans. Suppose a girl has the following conversation with a smart speaker:
Girl: I am going to wear this dress today, do you think it looks good? (excited and happy tone)
Smart speaker: I think it looks very good. (very plain, fixed tone)
Dialogues like the one above now often occur in intelligent interaction between humans and machines. The inventor realized that when a human speaks with a certain emotion but the speech synthesis system responds with speech synthesized in its fixed tone and emotion, the experience makes humans feel that the synthesized speech is not like a real person, making it difficult for the human-computer interaction to continue well and affecting the user experience of the machine.
Technical problem
This application provides a situation- and emotion-oriented Chinese speech synthesis method, device and storage medium to solve the problem in the prior art that human-computer interaction is difficult to sustain because speech is always synthesized with a fixed tone and emotion.
Technical solutions
To achieve the above objective, one aspect of this application provides a situation- and emotion-oriented Chinese speech synthesis method, including: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
To achieve the above objective, another aspect of this application provides an electronic device, which includes a processor and a memory. The memory stores a situation- and emotion-oriented Chinese speech synthesis program. When the Chinese speech synthesis program is executed by the processor, the steps of the Chinese speech synthesis method are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
To achieve the above objective, another aspect of this application provides a computer device, which includes a memory, a processor, and a situation- and emotion-oriented Chinese speech synthesis program stored in the memory and executable on the processor. When the Chinese speech synthesis program is executed by the processor, the following steps are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
To achieve the above objective, yet another aspect of this application provides a computer-readable storage medium, which includes a situation- and emotion-oriented Chinese speech synthesis program. When the Chinese speech synthesis program is executed by a processor, the steps of the Chinese speech synthesis method are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
Beneficial effects
Compared with the prior art, this application has the following advantages and beneficial effects:
This application analyzes the emotional state of the input speech and derives the emotional state of the synthesized speech from it. During speech synthesis, emotional state and context analysis are added so that the tone and emotion of the synthesized speech match the current interaction scene instead of being fixed. In the process of human-computer interaction, the output synthesized speech is therefore more like a real person, enhancing the user experience.
Description of the drawings
Figure 1 is a schematic flow chart of the situation- and emotion-oriented Chinese speech synthesis method described in this application;
Figure 2 is a schematic structural diagram of the emotion recognition model in this application;
Figure 3 is a schematic structural diagram of the convolutional recurrent neural network in this application;
Figure 4 is a schematic structural diagram of the sentiment classification model in this application;
Figure 5 is a schematic diagram of the modules of the situation- and emotion-oriented Chinese speech synthesis program in this application.
The realization of the purpose of this application, its functional characteristics and its advantages will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
本发明的实施方式Embodiments of the invention
下面将参考附图来描述本申请所述的实施例。本领域的普通技术人员可以认识到,在不偏离本申请的精神和范围的情况下,可以用各种不同的方式或其组合对所描述的实施例进行修正。因此,附图和描述在本质上是说明性的,仅仅用以解释本申请,而不是用于限制权利要求的保护范围。此外,在本说明书中,附图未按比例画出,并且相同的附图标记表示相同的部分。The embodiments described in this application will be described below with reference to the drawings. A person of ordinary skill in the art may realize that the described embodiments can be modified in various different ways or combinations thereof without departing from the spirit and scope of the present application. Therefore, the drawings and description are illustrative in nature, and are only used to explain the application, rather than to limit the protection scope of the claims. In addition, in this specification, the drawings are not drawn to scale, and the same reference numerals denote the same parts.
图1为本申请所述面向情景及情感的中文语音合成方法的流程示意图,如图1所示,本申请所述面向情景及情感的中文语音合成方法,包括以下步骤:Figure 1 is a schematic flow chart of the context- and emotion-oriented Chinese speech synthesis method described in this application. As shown in Figure 1, the context- and emotion-oriented Chinese speech synthesis method described in this application includes the following steps:
步骤S1、获取输入语音,输入语音为待反馈的语音,例如,在人机交互系统中,对于智能音箱,输入语音就是用户的询问等,合成语音为智能音箱对用户的反馈,本申请即为根据用户的询问语音的情感状态,得出合成语音的情感状态,使得智能音箱的反馈具有特定的语气和情绪,符合用户输入语音的情绪;Step S1: Obtain the input voice, the input voice is the voice to be fed back. For example, in the human-computer interaction system, for the smart speaker, the input voice is the user's inquiry, etc., and the synthesized voice is the feedback of the smart speaker to the user. This application is According to the emotional state of the user's inquiry voice, the emotional state of the synthesized voice is obtained, so that the feedback of the smart speaker has a specific tone and emotion, which is consistent with the emotion of the user's input voice;
步骤S2、将所述输入语音输入情感分析模型,通过所述情感分析模型输出所述输入语音的情感状态;Step S2: Input the input speech into an emotion analysis model, and output the emotion state of the input speech through the emotion analysis model;
步骤S3、根据对话场景以及所述输入语音的情感状态确定合成语音的情感状态,在确定合成语音的情感状态时,增加对话场景的影响因素,使得人机交互得到的反馈不仅满足对用户情感上的反馈,且更加符合实际应用场景,避免出错,例如,对于一个推销场景,即使客户输入的语音表达的情感是不耐烦的,加入对话场景这一影响因素之后,得出合成语音的情感状态也应该是开心积极的,以良好的服务客户;Step S3: Determine the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech. When determining the emotional state of the synthesized speech, increase the influencing factors of the dialogue scene, so that the feedback obtained by human-computer interaction is not only satisfied with the user's emotion The feedback is more in line with actual application scenarios and avoid errors. For example, for a sales scene, even if the emotion expressed by the voice input by the customer is impatient, after adding the influence factor of the dialogue scene, the emotional state of the synthesized speech is also Should be happy and active, serve customers well;
步骤S4、根据所述合成语音的情感状态以及基于所述输入语音确定的待合成文本进行语音合成,其中,待合成文本是在进行人机交互时,智能系统根据输入语音确定的待反馈文本。Step S4: Perform speech synthesis according to the emotional state of the synthesized speech and the to-be-synthesized text determined based on the input speech, where the to-be-synthesized text is the text to be fed back determined by the intelligent system according to the input speech during human-computer interaction.
本申请通过对输入的语音进行情感状态分析,获取合成语音的情感状态,在进行语音合成时,加入情感状态以及情景分析,使得合成语音的语气和情绪符合当前交互场景,在人机交互过程中,输出的合成语音更像真人。This application obtains the emotional state of the synthesized speech by analyzing the emotional state of the input speech. When performing speech synthesis, the emotional state and context analysis are added to make the tone and emotion of the synthesized speech conform to the current interaction scene. In the process of human-computer interaction , The output synthesized speech is more like a real person.
优选地,将所述输入语音输入情感分析模型的步骤之前,所述中文语音合成方法还包括:根据输入语音和待合成文本判断是否存在交互场景,若不存在交互场景,则设定所述合成语音的情感状态或采用合成语音的默认情感状态,不再对输入语音进行情感分析;若存在交互场景,则进行下一步,将所述输入语音输入情感分析模型中,对输入语音进行情感状态分析。其中,设定的合成语音的情感状态可以是慷慨激昂的,也可以是平缓温和的,具体可以根据人机交互所起的作用或目的而设定。例如,对于智能问答系统,反馈的合成语音的默认情感状态为平缓温和的,若输入语音仅涉及到对某一问题的咨询,而不涉及交互场景,则根据输入语音确定待合成文本的内容即可,以平缓温和的情绪输出待合成文本即可满足用户需求。例如,用户询问“今天的北京气温是多少”,问答系统只需以默认的语气情绪回复“今天的北京气温为××摄氏度”即可,而不必对输入语音进行情感分析。Preferably, before the step of inputting the input speech into the emotion analysis model, the Chinese speech synthesis method further includes: judging whether there is an interaction scene according to the input speech and the text to be synthesized, and if there is no interaction scene, setting the synthesis The emotional state of the voice or the default emotional state of the synthesized voice is used, and no emotional analysis is performed on the input voice; if there is an interactive scene, the next step is to input the input voice into the emotional analysis model, and perform emotional state analysis on the input voice . Among them, the set emotional state of the synthesized speech can be impassioned or gentle and gentle, and can be specifically set according to the role or purpose of the human-computer interaction. For example, for an intelligent question answering system, the default emotional state of the feedback synthesized speech is gentle and gentle. If the input speech only involves consultation on a certain question, and does not involve an interactive scene, the content of the text to be synthesized is determined based on the input speech. Yes, outputting the text to be synthesized with a gentle and gentle emotion can meet user needs. For example, if a user asks "What's the temperature in Beijing today", the Q&A system only needs to reply "Today's temperature in Beijing is ××°C" in the default tone and emotion, instead of performing sentiment analysis on the input voice.
In one embodiment of the present application, the emotion analysis model includes a speech-based emotion recognition model. Fig. 2 is a schematic structural diagram of the emotion recognition model in this application. As shown in Fig. 2, the emotion recognition model is divided into three parts. The first part includes a convolutional recurrent neural network (CRNN) and one first fully connected (FC) layer; its input is the spectrogram, and a first feature vector is output through the CRNN and the first FC layer. The second part includes three cascaded second FC layers; its input consists of low-level speech feature parameters (LLDs), including fundamental frequency, energy, zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) and other features, and it outputs a second feature vector. The third part includes one third FC layer and a normalization layer; its input is the fused feature vector of the first and second feature vectors, and its output is a probability vector representing the emotion classification.
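The following PyTorch sketch illustrates this three-part structure; only the overall topology (spectrogram branch with one FC layer, LLD branch with three FC layers, concatenation followed by a final FC layer and normalization) follows the description, while the layer widths, the reduced stand-in for the CRNN, and the number of emotion classes are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class SpeechEmotionRecognizer(nn.Module):
    """Sketch of the three-part speech-based emotion recognition model (Fig. 2)."""
    def __init__(self, n_lld_stats=40, n_emotions=5):
        super().__init__()
        # Part 1: CRNN over the spectrogram plus one first FC layer. A small
        # conv stack stands in for the full CRNN of Fig. 3 to keep this
        # sketch self-contained.
        self.crnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 128),
        )
        self.fc1 = nn.Linear(128, 64)
        # Part 2: three cascaded second FC layers over LLD statistics
        # (F0, energy, zero-crossing rate, MFCC, LPCC, ...).
        self.fc2 = nn.Sequential(
            nn.Linear(n_lld_stats, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        # Part 3: one third FC layer; softmax plays the role of the
        # normalization layer that yields the class probability vector.
        self.fc3 = nn.Linear(64 + 64, n_emotions)

    def forward(self, spectrogram, lld_stats):
        v1 = self.fc1(self.crnn(spectrogram))           # first feature vector
        v2 = self.fc2(lld_stats)                        # second feature vector
        fused = torch.cat([v1, v2], dim=-1)             # fused feature vector
        return torch.softmax(self.fc3(fused), dim=-1)   # emotion probabilities

# Example: a batch of 2 spectrograms (1 x 128 x 200) with 40-dim LLD statistics.
probs = SpeechEmotionRecognizer()(torch.randn(2, 1, 128, 200), torch.randn(2, 40))
```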
Fig. 3 is a schematic structural diagram of the convolutional recurrent neural network in this application. As shown in Fig. 3, the convolutional recurrent neural network includes a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a long short-term memory (LSTM) layer, and a third pooling layer, where the third pooling layer includes three pooling modules, namely a minimum pooling module, an average pooling module, and a maximum pooling module, and each pooling module is connected to every neuron of the LSTM layer. The convolutional recurrent neural network is trained on an emotional speech data set, which contains about 15 hours of speech data from multiple speakers together with the corresponding emotion labels.
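A PyTorch sketch of this backbone, following the layer order of Fig. 3, is shown below; the channel counts, kernel sizes, LSTM width, and the way the pooled spectrogram is reshaped into a frame sequence before the LSTM are assumptions, since the figure only fixes the sequence of layers.

```python
import torch
import torch.nn as nn

class CRNNBackbone(nn.Module):
    """Conv/pool x2 -> LSTM -> {min, average, max} pooling, as in Fig. 3 (sizes assumed)."""
    def __init__(self, lstm_hidden=64):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(2)
        self.lstm = nn.LSTM(input_size=32, hidden_size=lstm_hidden, batch_first=True)

    def forward(self, spectrogram):                   # (batch, 1, freq, time)
        x = self.pool1(torch.relu(self.conv1(spectrogram)))
        x = self.pool2(torch.relu(self.conv2(x)))     # (batch, 32, freq', time')
        # Collapse the frequency axis and treat the time axis as a frame sequence.
        x = x.mean(dim=2).transpose(1, 2)             # (batch, time', 32)
        h, _ = self.lstm(x)                           # (batch, time', lstm_hidden)
        # Third pooling layer: min, average and max statistics taken over all
        # time steps of every LSTM unit, then concatenated.
        pooled = torch.cat([h.min(dim=1).values,
                            h.mean(dim=1),
                            h.max(dim=1).values], dim=-1)
        return pooled                                 # (batch, 3 * lstm_hidden)

# Example: a 128-bin spectrogram with 200 frames yields a 192-dimensional vector.
features = CRNNBackbone()(torch.randn(2, 1, 128, 200))
```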
In one embodiment of the present application, the step of outputting the emotional state of the input speech through the emotion analysis model includes: obtaining the spectrogram and the speech feature parameters of the input speech; inputting the spectrogram of the input speech into the trained convolutional recurrent neural network (CRNN) of the emotion recognition model, and outputting a first feature vector through the CRNN and the first fully connected layer; obtaining high-level statistical features (HSFs, i.e. statistics such as the average or maximum of the feature parameters computed over the multiple frames of a speech segment) from the speech feature parameters, inputting them into the emotion recognition model, and outputting a second feature vector through the three second fully connected layers of the emotion recognition model; concatenating the first feature vector with the second feature vector to obtain a fused feature vector; passing the fused feature vector through the third fully connected layer and the normalization layer of the emotion recognition model to output a first probability vector of the emotion of the input speech; and obtaining the emotional state of the input speech according to the first probability vector.
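As a small illustration of the HSF step, the frame-level parameters can be aggregated over the whole utterance as follows; the mean and maximum statistics follow the description, while the particular set and number of LLDs used here are only an illustrative assumption.

```python
import numpy as np

def hsf_features(llds: np.ndarray) -> np.ndarray:
    """Aggregate frame-level LLDs (shape: n_frames x n_params) into a single
    utterance-level vector of per-parameter statistics (mean and maximum)."""
    return np.concatenate([llds.mean(axis=0), llds.max(axis=0)])

# 300 frames of illustrative LLDs: F0, energy, zero-crossing rate, 13 MFCCs, ...
frames = np.random.rand(300, 16)
stats = hsf_features(frames)   # 32-dimensional HSF vector fed to the FC branch
```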
In one embodiment of the present application, the emotion analysis model includes a text-based emotion classification model. Fig. 4 is a schematic structural diagram of the emotion classification model in this application. As shown in Fig. 4, the emotion classification model includes a feature extraction layer and a classifier. The feature extraction layer is used to extract features of the input text and to represent the input text as vectors; it includes an input layer for inputting the text to be classified, and an embedding layer for converting the text to be classified into multiple word vectors and constructing a sentence vector from those word vectors, for which, for example, the open-source BERT model (Bidirectional Encoder Representations from Transformers) can be used. The classifier is an LSTM neural network consisting of an input layer, a hidden layer, and an output layer; the input layer includes 256 input nodes for the sentence vector, the hidden layer includes 128 hidden nodes, and the output layer uses a softmax function to output emotion labels and their probabilities.
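The sketch below shows one way the classifier side could look in PyTorch; the 256-dimensional input and 128 hidden units follow the description, while the number of emotion classes is assumed, and random vectors stand in for the BERT-style word/sentence embeddings that the feature extraction layer would normally supply.

```python
import torch
import torch.nn as nn

class TextEmotionClassifier(nn.Module):
    """LSTM classifier over 256-dim text vectors: 256 inputs, 128 hidden units,
    softmax output of emotion labels and probabilities (class count assumed)."""
    def __init__(self, n_emotions=5):
        super().__init__()
        self.lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)
        self.out = nn.Linear(128, n_emotions)

    def forward(self, text_vectors):                 # (batch, seq_len, 256)
        _, (h_n, _) = self.lstm(text_vectors)
        return torch.softmax(self.out(h_n[-1]), dim=-1)

# In the described pipeline the 256-dim vectors would come from a BERT-style
# embedding of the recognized text; random vectors stand in for them here.
probs = TextEmotionClassifier()(torch.randn(2, 20, 256))
```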
In one embodiment of the present application, the step of outputting the emotional state of the input speech through the emotion analysis model includes: converting the input speech into text to be classified through speech recognition; extracting a text feature vector of the text to be classified; inputting the text feature vector into the deep neural network classifier of the emotion classification model; obtaining a second probability vector of the emotion of the input speech through the classifier; and obtaining the emotional state of the input speech according to the second probability vector.
In this application, the emotion analysis model may include only one of the speech-based emotion recognition model and the text-based emotion classification model, or it may include both. Preferably, the emotion analysis model includes both the speech-based emotion recognition model and the text-based emotion classification model; the two models each produce a probability vector characterizing the speech emotion, and a combined analysis of the two results improves the accuracy of the emotion analysis.
Preferably, the step of outputting the emotional state of the input speech through the emotion analysis model includes: obtaining a first probability vector of the emotion of the input speech through the speech-based emotion recognition model, and obtaining first confidences of multiple speech emotions from the first probability vector; obtaining a second probability vector of the emotion of the input speech through the text-based emotion classification model, and obtaining second confidences of the multiple speech emotions from the second probability vector; adding the first confidence and the second confidence of each speech emotion to obtain the confidence of that emotion, yielding a confidence vector over the multiple speech emotions; and selecting the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech. For example, for a segment of input speech whose possible emotional states are happy, sad, impatient, excited, and agitated, the first confidences corresponding to these states may be 0.6, 0.2, 0.3, 0.4, and 0.7, and the second confidences 0.8, 0.3, 0.2, 0.5, and 0.5; adding the corresponding first and second confidences gives final confidences of 1.4, 0.5, 0.5, 0.9, and 1.2, so the emotional state corresponding to the maximum confidence of 1.4 (happy) is selected as the emotional state of the input speech.
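The fusion itself amounts to a few lines of Python; the emotion labels and scores below simply reproduce the numerical example in the preceding paragraph.

```python
emotions = ["happy", "sad", "impatient", "excited", "agitated"]
first_conf  = [0.6, 0.2, 0.3, 0.4, 0.7]   # from the speech-based recognition model
second_conf = [0.8, 0.3, 0.2, 0.5, 0.5]   # from the text-based classification model

# Add the two confidences of each emotion and pick the emotion with the largest sum.
fused = [a + b for a, b in zip(first_conf, second_conf)]   # [1.4, 0.5, 0.5, 0.9, 1.2]
state = emotions[fused.index(max(fused))]                  # "happy"
```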
In one embodiment of the present application, for the same input speech, the speech-based emotion recognition model and the text-based emotion classification model produce two results, each characterizing the confidences of the various emotional states of the input speech; different weight values are set for the results of the two models, the confidences are added on the basis of these weights, and the final speech emotional state is predicted. For example, if a weight of 0.6 is set for the emotion recognition model and a weight of 0.4 for the emotion classification model, then for a segment of input speech whose possible emotional states are happy, sad, impatient, excited, and agitated, with first confidences of 0.6, 0.2, 0.3, 0.4, and 0.7 and second confidences of 0.8, 0.3, 0.2, 0.5, and 0.5, the weighted sums of the corresponding first and second confidences give final confidences of 0.68, 0.24, 0.26, 0.44, and 0.62, so the emotional state corresponding to the maximum confidence of 0.68 (happy) is selected as the emotional state of the input speech.
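With model-specific weights the only change is how the two scores are combined; the weights 0.6 and 0.4 and the scores below are taken from the example above.

```python
emotions = ["happy", "sad", "impatient", "excited", "agitated"]
first_conf  = [0.6, 0.2, 0.3, 0.4, 0.7]
second_conf = [0.8, 0.3, 0.2, 0.5, 0.5]
w_speech, w_text = 0.6, 0.4               # weights for the two models

fused = [w_speech * a + w_text * b for a, b in zip(first_conf, second_conf)]
# fused == [0.68, 0.24, 0.26, 0.44, 0.62] -> "happy" is again selected
state = emotions[fused.index(max(fused))]
```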
Further, when two or more speech emotions correspond to the maximum confidence in the confidence vector, one of them is randomly selected as the emotional state of the input speech.
It should be understood that the terms "first", "second", "third", and "fourth" in this application are only used to distinguish identical or similar objects and do not imply any sequence or order of preference.
In an optional embodiment of the present application, determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech includes:
constructing a scene library, where the scene library includes multiple dialogue scenes and the emotional state corresponding to each dialogue scene; the dialogue scenes and the corresponding emotional states may be manually annotated according to the specific scenario in combination with human cognition, thereby predefining the emotional state that the synthesized speech should have in certain specific dialogue scenes, because even for a reply to input speech carrying the same emotion, the emotional state of the synthesized speech to be fed back may differ across different dialogue scenes;
performing scene analysis according to the input speech and the text to be synthesized to obtain the dialogue scene of the text to be synthesized, where the text to be synthesized is the reply text determined by the intelligent system from the input speech during human-computer interaction;
obtaining, from the scene library, the emotional state corresponding to the dialogue scene of the text to be synthesized;
determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
For example, in a sales scenario where the emotion expressed by the customer's input speech is impatient, scene analysis of the text to be synthesized combined with the dialogue scene leads to the conclusion that the emotional state of the synthesized speech should be cheerful and positive, so as to serve the customer well. By adding the analysis of the dialogue scene, the feedback obtained during human-computer interaction better matches the real application scenario and the user experience is improved.
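A minimal sketch of the scene-library lookup follows; the library entries and the rule for combining the scene's prescribed emotion with the user's emotion are assumptions made for illustration (the text fixes only the sales example, not the general combination rule).

```python
# Hypothetical scene library: each dialogue scene is mapped to the emotional
# state that the synthesized reply should carry in that scene (manually labeled).
SCENE_LIBRARY = {
    "sales": "happy",
    "complaint_handling": "apologetic",
    "weather_query": "calm",
}

def reply_emotion(scene: str, user_emotion: str, default: str = "calm") -> str:
    """One possible combination rule (an assumption, not fixed by the text):
    if the scene prescribes an emotion it takes precedence; otherwise mirror a
    non-negative user emotion and fall back to the default for negative ones."""
    if scene in SCENE_LIBRARY:
        return SCENE_LIBRARY[scene]
    return user_emotion if user_emotion in ("happy", "calm") else default

# The sales example from the text: an impatient customer still gets a happy reply.
assert reply_emotion("sales", "impatient") == "happy"
```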
In an optional embodiment of the present application, performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized includes: embedding modal particles into the text to be synthesized through waveform concatenation; controlling the tone and prosody of the synthesized speech through end-to-end synthesis; and performing speech synthesis according to the embedded modal particles, tone, and prosody, so that the synthesized speech can express the corresponding tone and emotion.
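The waveform-concatenation and end-to-end synthesizers themselves are beyond a short sketch, but the front-end step can be illustrated as follows; every mapping in this snippet (which particle goes with which emotion, and the pitch/rate/energy scales handed to a synthesizer) is a hypothetical placeholder rather than values from the disclosure.

```python
# Hypothetical mappings from the chosen emotional state to a sentence-final
# modal particle and to prosody controls for an end-to-end synthesizer.
MODAL_PARTICLES = {"happy": "哦", "calm": "", "apologetic": "呢"}
PROSODY = {
    "happy":      {"pitch_scale": 1.10, "rate_scale": 1.05, "energy_scale": 1.10},
    "calm":       {"pitch_scale": 1.00, "rate_scale": 1.00, "energy_scale": 1.00},
    "apologetic": {"pitch_scale": 0.95, "rate_scale": 0.90, "energy_scale": 0.90},
}

def prepare_synthesis_request(text: str, emotion: str) -> dict:
    """Attach a modal particle to the reply text and select prosody controls."""
    particle = MODAL_PARTICLES.get(emotion, "")
    return {"text": text + particle, **PROSODY.get(emotion, PROSODY["calm"])}

request = prepare_synthesis_request("很高兴为您服务", "happy")
# -> {'text': '很高兴为您服务哦', 'pitch_scale': 1.1, 'rate_scale': 1.05, 'energy_scale': 1.1}
```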
The situation- and emotion-oriented Chinese speech synthesis method described in this application is applied to an electronic device, and the electronic device may be a terminal device such as a television, a smartphone, a tablet computer, or a computer.
The electronic device includes a processor and a memory for storing a situation- and emotion-oriented Chinese speech synthesis program. The processor executes the situation- and emotion-oriented Chinese speech synthesis program to implement the following steps of the situation- and emotion-oriented Chinese speech synthesis method: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
The electronic device further includes a network interface, a communication bus, and the like. The network interface may include a standard wired interface and a wireless interface, and the communication bus is used to realize connection and communication between the components.
The memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, or an optical disc, or a plug-in hard disk, and is not limited thereto; it may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides the instructions or software program to the processor so that the processor can execute them. In this application, the software program stored in the memory includes the situation- and emotion-oriented Chinese speech synthesis program, which can be provided to the processor so that the processor can execute the Chinese speech synthesis program and implement the steps of the Chinese speech synthesis method.
The processor may be a central processing unit, a microprocessor, or another data processing chip, and can run the programs stored in the memory, for example, the situation- and emotion-oriented Chinese speech synthesis program in this application.
The electronic device may further include a display, which may also be called a display screen or a display unit. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch display, or the like. The display is used to display the information processed in the electronic device and to display a visualized working interface.
The electronic device may further include a user interface, which may include an input unit (such as a keyboard), a voice output device (such as speakers or headphones), and the like.
In other embodiments, the situation- and emotion-oriented Chinese speech synthesis program may also be divided into one or more modules, which are stored in the memory and executed by the processor to carry out this application. A module in this application refers to a series of computer program instruction segments capable of completing a specific function. Fig. 5 is a schematic diagram of the modules of the situation- and emotion-oriented Chinese speech synthesis program in this application. As shown in Fig. 5, the Chinese speech synthesis program may be divided into an acquisition module 1, an emotion analysis module 2, an emotion determination module 3, and a speech synthesis module 4. The functions or operation steps implemented by these modules are similar to those described above, for example:
the acquisition module 1 obtains the input speech;
the emotion analysis module 2 inputs the input speech into the emotion analysis model and outputs the emotional state of the input speech through the emotion analysis model;
the emotion determination module 3 determines the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
the speech synthesis module 4 performs speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
Preferably, the electronic device further includes a judgment module which, before the step of inputting the input speech into the emotion analysis model, judges whether an interaction scene exists according to the input speech and the text to be synthesized; if no interaction scene exists, the emotional state of the synthesized speech is set or the default emotional state of the synthesized speech is used, without performing emotion analysis on the input speech; if an interaction scene exists, the next step is performed, in which the input speech is input into the emotion analysis model and its emotional state is analyzed. The preset emotional state of the synthesized speech may be impassioned or calm and gentle, and may be set according to the role or purpose of the human-computer interaction. For example, for an intelligent question-answering system, the default emotional state of the synthesized reply is calm and gentle; if the input speech only involves a query about a certain question and no interaction scene is involved, it is sufficient to determine the content of the text to be synthesized from the input speech and output it in a calm, gentle tone to meet the user's needs. For example, if a user asks "What is the temperature in Beijing today?", the question-answering system only needs to reply "The temperature in Beijing today is XX degrees Celsius" in the default tone, without performing emotion analysis on the input speech.
In one embodiment of the present application, the emotion analysis model includes a speech-based emotion recognition model. Fig. 2 is a schematic structural diagram of the emotion recognition model in this application. As shown in Fig. 2, the emotion recognition model is divided into three parts. The first part includes a convolutional recurrent neural network (CRNN) and one first fully connected (FC) layer; its input is the spectrogram, and a first feature vector is output through the CRNN and the first FC layer. The second part includes three cascaded second FC layers; its input consists of low-level speech feature parameters (LLDs), including fundamental frequency, energy, zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) and other features, and it outputs a second feature vector. The third part includes one third FC layer and a normalization layer; its input is the fused feature vector of the first and second feature vectors, and its output is a probability vector representing the emotion classification.
Fig. 3 is a schematic structural diagram of the convolutional recurrent neural network in this application. As shown in Fig. 3, the convolutional recurrent neural network includes a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a long short-term memory (LSTM) layer, and a third pooling layer, where the third pooling layer includes three pooling modules, namely a minimum pooling module, an average pooling module, and a maximum pooling module, and each pooling module is connected to every neuron of the LSTM layer. The convolutional recurrent neural network is trained on an emotional speech data set, which contains about 15 hours of speech data from multiple speakers together with the corresponding emotion labels.
In one embodiment of the present application, the emotion analysis module includes: a parameter acquisition unit, which obtains the spectrogram and the speech feature parameters of the input speech; a first feature vector acquisition unit, which inputs the spectrogram of the input speech into the trained convolutional recurrent neural network (CRNN) of the emotion recognition model and outputs a first feature vector through the CRNN and the first fully connected layer; a second feature vector acquisition unit, which obtains HSF features (statistics such as the average or maximum of the feature parameters computed over the multiple frames of a speech segment) from the speech feature parameters, inputs them into the emotion recognition model, and outputs a second feature vector through the three second fully connected layers of the emotion recognition model; a feature fusion unit, which concatenates the first feature vector with the second feature vector to obtain a fused feature vector; a first probability vector acquisition unit, which passes the fused feature vector through the third fully connected layer and the normalization layer of the emotion recognition model to output a first probability vector of the emotion of the input speech; and an emotional state output unit, which obtains the emotional state of the input speech according to the first probability vector.
In one embodiment of the present application, the emotion analysis model includes a text-based emotion classification model. Fig. 4 is a schematic structural diagram of the emotion classification model in this application. As shown in Fig. 4, the emotion classification model includes a feature extraction layer and a classifier. The feature extraction layer is used to extract features of the input text and to represent the input text as vectors; it includes an input layer for inputting the text to be classified, and an embedding layer for converting the text to be classified into multiple word vectors and constructing a sentence vector from those word vectors, for which, for example, the open-source BERT model (Bidirectional Encoder Representations from Transformers) can be used. The classifier is an LSTM neural network consisting of an input layer, a hidden layer, and an output layer; the input layer includes 256 input nodes for the sentence vector, the hidden layer includes 128 hidden nodes, and the output layer uses a softmax function to output emotion labels and their probabilities.
In one embodiment of the present application, the emotion analysis module includes: a text conversion unit, which converts the input speech into text to be classified through speech recognition; a feature extraction unit, which extracts a text feature vector of the text to be classified; an input unit, which inputs the text feature vector into the deep neural network classifier of the emotion classification model; a second probability vector acquisition unit, which obtains a second probability vector of the emotion of the input speech through the classifier; and an emotional state output unit, which obtains the emotional state of the input speech according to the second probability vector.
In this application, the emotion analysis model may include only one of the speech-based emotion recognition model and the text-based emotion classification model, or it may include both. Preferably, the emotion analysis model includes both the speech-based emotion recognition model and the text-based emotion classification model; the two models each produce a probability vector characterizing the speech emotion, and a combined analysis of the two results improves the accuracy of the emotion analysis.
Preferably, the emotion analysis module includes: a first confidence acquisition unit, which obtains a first probability vector of the emotion of the input speech through the speech-based emotion recognition model and obtains first confidences of multiple speech emotions from the first probability vector; a second confidence acquisition unit, which obtains a second probability vector of the emotion of the input speech through the text-based emotion classification model and obtains second confidences of the multiple speech emotions from the second probability vector; a confidence vector acquisition unit, which adds the first confidence and the second confidence of each speech emotion to obtain the confidence of that emotion, yielding a confidence vector over the multiple speech emotions; and a selection unit, which selects the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech. For example, for a segment of input speech whose possible emotional states are happy, sad, impatient, excited, and agitated, the first confidences corresponding to these states may be 0.6, 0.2, 0.3, 0.4, and 0.7, and the second confidences 0.8, 0.3, 0.2, 0.5, and 0.5; adding the corresponding first and second confidences gives final confidences of 1.4, 0.5, 0.5, 0.9, and 1.2, so the emotional state corresponding to the maximum confidence of 1.4 (happy) is selected as the emotional state of the input speech.
Further, when two or more speech emotions correspond to the maximum confidence in the confidence vector, one of them is randomly selected as the emotional state of the input speech.
It should be understood that the terms "first", "second", "third", and "fourth" in this application are only used to distinguish identical or similar objects and do not imply any sequence or order of preference.
In an optional embodiment of the present application, the emotion determination module includes:
a construction unit, which constructs a scene library, the scene library including multiple dialogue scenes and the emotional state corresponding to each dialogue scene, where the dialogue scenes and the corresponding emotional states may be manually annotated according to the specific scenario in combination with human cognition, thereby predefining the emotional state that the synthesized speech should have in certain specific dialogue scenes, because even for a reply to input speech carrying the same emotion, the emotional state of the synthesized speech to be fed back may differ across different dialogue scenes;
a scene analysis unit, which performs scene analysis according to the input speech and the text to be synthesized to obtain the dialogue scene of the text to be synthesized;
a query unit, which obtains, from the scene library, the emotional state corresponding to the dialogue scene of the text to be synthesized;
an emotional state determination unit, which determines the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
For example, in a sales scenario where the emotion expressed by the customer's input speech is impatient, scene analysis of the text to be synthesized combined with the dialogue scene leads to the conclusion that the emotional state of the synthesized speech should be cheerful and positive, so as to serve the customer well. By adding the analysis of the dialogue scene, the feedback obtained during human-computer interaction better matches the real application scenario and the user experience is improved.
In an optional embodiment of the present application, the speech synthesis module includes: a modal particle embedding unit, which embeds modal particles into the text to be synthesized through waveform concatenation; a prosody control unit, which controls the tone and prosody of the synthesized speech through end-to-end synthesis; and a speech synthesis unit, which performs speech synthesis according to the embedded modal particles, tone, and prosody, so that the synthesized speech can express the corresponding tone and emotion.
In one embodiment of the present application, the computer-readable storage medium may be any tangible medium that contains or stores a program or instructions; the program can be executed, and the stored program instructs the related hardware to realize the corresponding functions. For example, the computer-readable storage medium may be a computer diskette, a hard disk, a random access memory, a read-only memory, or the like. The application is not limited thereto; the medium may be any device that stores instructions or software and any related data files or data structures in a non-transitory manner and can provide them to a processor so that the processor executes the programs or instructions therein. The computer-readable storage medium may be non-volatile or volatile, and it includes a situation- and emotion-oriented Chinese speech synthesis program which, when executed by a processor, implements the following Chinese speech synthesis method:
obtaining input speech;
inputting the input speech into an emotion analysis model, and outputting the emotional state of the input speech through the emotion analysis model;
determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
The specific implementation of the computer-readable storage medium of the present application is substantially the same as the specific implementations of the situation- and emotion-oriented Chinese speech synthesis method and the electronic device described above, and is not repeated here.
It should be noted that, in this document, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, device, article, or method that includes that element.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments. Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not therefore limit the patent scope of this application; any equivalent structural or process transformation made using the contents of the description and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (20)

  1. A situation- and emotion-oriented Chinese speech synthesis method, applied to an electronic device, comprising:
    obtaining input speech;
    inputting the input speech into an emotion analysis model, and outputting the emotional state of the input speech through the emotion analysis model;
    determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
    performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  2. The situation- and emotion-oriented Chinese speech synthesis method according to claim 1, wherein the emotion analysis model comprises a speech-based emotion recognition model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    obtaining the spectrogram and speech feature parameters of the input speech;
    inputting the spectrogram of the input speech into a trained convolutional recurrent neural network in the emotion recognition model, and outputting a first feature vector through the convolutional recurrent neural network and a first fully connected layer;
    obtaining statistical features from the speech feature parameters, inputting them into the emotion recognition model, and outputting a second feature vector through three second fully connected layers in the emotion recognition model;
    fusing the first feature vector with the second feature vector to obtain a fused feature vector;
    outputting, from the fused feature vector, a first probability vector of the emotion of the input speech through a third fully connected layer and a normalization layer in the emotion recognition model;
    obtaining the emotional state of the input speech according to the first probability vector.
  3. The situation- and emotion-oriented Chinese speech synthesis method according to claim 1, wherein the emotion analysis model comprises a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    converting the input speech into text to be classified through speech recognition;
    extracting a text feature vector of the text to be classified;
    inputting the text feature vector into a deep neural network classifier in the emotion classification model;
    obtaining a second probability vector of the emotion of the input speech through the classifier;
    obtaining the emotional state of the input speech according to the second probability vector.
  4. The situation- and emotion-oriented Chinese speech synthesis method according to claim 1, wherein the emotion analysis model comprises a speech-based emotion recognition model and a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    obtaining a first probability vector of the emotion of the input speech through the emotion recognition model, and obtaining first confidences of multiple speech emotions respectively according to the first probability vector;
    obtaining a second probability vector of the emotion of the input speech through the emotion classification model, and obtaining second confidences of the multiple speech emotions respectively according to the second probability vector;
    adding the first confidence and the second confidence of the same speech emotion to obtain the confidence of that speech emotion, so as to obtain a confidence vector of the multiple speech emotions;
    selecting the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech.
  5. The situation- and emotion-oriented Chinese speech synthesis method according to claim 1, wherein the step of determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech comprises:
    constructing a scene library, the scene library comprising multiple dialogue scenes and the emotional state corresponding to each dialogue scene;
    performing scene analysis according to the input speech and the text to be synthesized, and obtaining the dialogue scene of the text to be synthesized;
    obtaining, according to the scene library, the emotional state corresponding to the dialogue scene of the text to be synthesized;
    determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
  6. The situation- and emotion-oriented Chinese speech synthesis method according to any one of claims 1 to 5, wherein the step of performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized comprises:
    embedding modal particles into the text to be synthesized through waveform concatenation;
    controlling the tone and prosody of the synthesized speech through end-to-end synthesis;
    performing speech synthesis according to the embedded modal particles, tone, and prosody.
  7. The situation- and emotion-oriented Chinese speech synthesis method according to claim 1, wherein before the step of inputting the input speech into the emotion analysis model, the method further comprises:
    judging whether an interaction scene exists according to the input speech and the text to be synthesized; if no interaction scene exists, setting the emotional state of the synthesized speech without performing emotion analysis on the input speech; and if an interaction scene exists, inputting the input speech into the emotion analysis model.
  8. An electronic device, comprising:
    a processor;
    a memory including a situation- and emotion-oriented Chinese speech synthesis program which, when executed by the processor, implements the following steps of the Chinese speech synthesis method:
    obtaining input speech;
    inputting the input speech into an emotion analysis model, and outputting the emotional state of the input speech through the emotion analysis model;
    determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
    performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  9. A computer device, comprising a memory, a processor, and a situation- and emotion-oriented Chinese speech synthesis program stored in the memory and executable on the processor, wherein the Chinese speech synthesis program, when executed by the processor, implements the following steps:
    obtaining input speech;
    inputting the input speech into an emotion analysis model, and outputting the emotional state of the input speech through the emotion analysis model;
    determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
    performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  10. The computer device according to claim 9, wherein the emotion analysis model comprises a speech-based emotion recognition model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    obtaining the spectrogram and speech feature parameters of the input speech;
    inputting the spectrogram of the input speech into a trained convolutional recurrent neural network in the emotion recognition model, and outputting a first feature vector through the convolutional recurrent neural network and a first fully connected layer;
    obtaining statistical features from the speech feature parameters, inputting them into the emotion recognition model, and outputting a second feature vector through three second fully connected layers in the emotion recognition model;
    fusing the first feature vector with the second feature vector to obtain a fused feature vector;
    outputting, from the fused feature vector, a first probability vector of the emotion of the input speech through a third fully connected layer and a normalization layer in the emotion recognition model;
    obtaining the emotional state of the input speech according to the first probability vector.
  11. The computer device according to claim 9, wherein the emotion analysis model comprises a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    converting the input speech into text to be classified through speech recognition;
    extracting a text feature vector of the text to be classified;
    inputting the text feature vector into a deep neural network classifier in the emotion classification model;
    obtaining a second probability vector of the emotion of the input speech through the classifier;
    obtaining the emotional state of the input speech according to the second probability vector.
  12. The computer device according to claim 9, wherein the emotion analysis model comprises a speech-based emotion recognition model and a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    obtaining a first probability vector of the emotion of the input speech through the emotion recognition model, and obtaining first confidences of multiple speech emotions respectively according to the first probability vector;
    obtaining a second probability vector of the emotion of the input speech through the emotion classification model, and obtaining second confidences of the multiple speech emotions respectively according to the second probability vector;
    adding the first confidence and the second confidence of the same speech emotion to obtain the confidence of that speech emotion, so as to obtain a confidence vector of the multiple speech emotions;
    selecting the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech.
  13. The computer device according to claim 9, wherein the step of determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech comprises:
    constructing a scene library, the scene library comprising multiple dialogue scenes and the emotional state corresponding to each dialogue scene;
    performing scene analysis according to the input speech and the text to be synthesized, and obtaining the dialogue scene of the text to be synthesized;
    obtaining, according to the scene library, the emotional state corresponding to the dialogue scene of the text to be synthesized;
    determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
  14. The computer device according to any one of claims 9 to 13, wherein the step of performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized comprises:
    embedding modal particles into the text to be synthesized through waveform concatenation;
    controlling the tone and prosody of the synthesized speech through end-to-end synthesis;
    performing speech synthesis according to the embedded modal particles, tone, and prosody.
  15. The computer device according to claim 9, wherein before the step of inputting the input speech into the emotion analysis model, the method further comprises:
    judging whether an interaction scene exists according to the input speech and the text to be synthesized; if no interaction scene exists, setting the emotional state of the synthesized speech without performing emotion analysis on the input speech; and if an interaction scene exists, inputting the input speech into the emotion analysis model.
  16. A computer-readable storage medium, comprising a situation- and emotion-oriented Chinese speech synthesis program which, when executed by a processor, implements the following steps:
    obtaining input speech;
    inputting the input speech into an emotion analysis model, and outputting the emotional state of the input speech through the emotion analysis model;
    determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
    performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  17. The computer-readable storage medium according to claim 16, wherein the emotion analysis model comprises a speech-based emotion recognition model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    acquiring a spectrogram and speech feature parameters of the input speech;
    inputting the spectrogram of the input speech into a trained convolutional recurrent neural network in the emotion recognition model, and outputting a first feature vector through the convolutional recurrent neural network and a first fully connected layer;
    obtaining statistical features according to the speech feature parameters, inputting the statistical features into the emotion recognition model, and outputting a second feature vector through three second fully connected layers in the emotion recognition model;
    fusing the first feature vector and the second feature vector to obtain a fused feature vector;
    outputting, from the fused feature vector, a first probability vector of the emotion of the input speech through a third fully connected layer and a normalization layer in the emotion recognition model;
    acquiring the emotional state of the input speech according to the first probability vector.
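A hedged PyTorch sketch of the dual-branch layout described in claim 17 (spectrogram branch through a convolutional recurrent network and a first fully connected layer, statistics branch through three fully connected layers, fusion, then a third fully connected layer with softmax normalization); all layer sizes, the 384-dimensional statistics vector, and the six emotion classes are assumed values.

import torch
import torch.nn as nn

class SpeechEmotionRecognizer(nn.Module):
    def __init__(self, n_mels: int = 80, n_stats: int = 384, n_emotions: int = 6):
        super().__init__()
        # Branch 1: convolutional recurrent network over the spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), 128, batch_first=True)
        self.fc1 = nn.Linear(128, 128)                        # first fully connected layer
        # Branch 2: three fully connected layers over utterance-level statistics.
        self.fc2 = nn.Sequential(
            nn.Linear(n_stats, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        # Fusion head: third fully connected layer followed by softmax normalization.
        self.fc3 = nn.Linear(128 + 128, n_emotions)

    def forward(self, spectrogram: torch.Tensor, stats: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, n_mels, frames); stats: (batch, n_stats)
        x = self.conv(spectrogram)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)        # frames as a time sequence
        _, h = self.gru(x)
        v1 = torch.relu(self.fc1(h[-1]))                      # first feature vector
        v2 = self.fc2(stats)                                  # second feature vector
        fused = torch.cat([v1, v2], dim=-1)                   # fused feature vector
        return torch.softmax(self.fc3(fused), dim=-1)         # first probability vector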
  18. The computer-readable storage medium according to claim 16, wherein the emotion analysis model comprises a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    converting the input speech into a text to be classified through speech recognition;
    extracting a text feature vector of the text to be classified;
    inputting the text feature vector into a deep neural network classifier in the emotion classification model;
    obtaining a second probability vector of the emotion of the input speech through the classifier;
    acquiring the emotional state of the input speech according to the second probability vector.
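A compact PyTorch sketch of the deep neural network classifier in claim 18; the 300-dimensional text feature vector is assumed to come from an upstream speech-recognition and text-embedding step that is not shown, and the layer sizes are illustrative.

import torch
import torch.nn as nn

class TextEmotionClassifier(nn.Module):
    def __init__(self, feat_dim: int = 300, n_emotions: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_emotions),
        )

    def forward(self, text_features: torch.Tensor) -> torch.Tensor:
        # Returns the second probability vector over the emotion classes.
        return torch.softmax(self.net(text_features), dim=-1)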
  19. The computer-readable storage medium according to claim 16, wherein the emotion analysis model comprises a speech-based emotion recognition model and a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    obtaining a first probability vector of the emotion of the input speech through the emotion recognition model, and obtaining first confidence levels of a plurality of speech emotions according to the first probability vector;
    obtaining a second probability vector of the emotion of the input speech through the emotion classification model, and obtaining second confidence levels of the plurality of speech emotions according to the second probability vector;
    adding, for each speech emotion, the first confidence level and the second confidence level of that speech emotion to obtain its confidence level, thereby obtaining a confidence vector over the plurality of speech emotions;
    selecting the speech emotion corresponding to the maximum confidence level in the confidence vector as the emotional state of the input speech.
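The confidence fusion of claim 19 reduces to an element-wise sum of the two probability vectors followed by an argmax, as in the NumPy sketch below; the emotion label set is an assumed example.

import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry", "fearful", "surprised"]  # assumed labels

def fuse_emotion(p_speech: np.ndarray, p_text: np.ndarray) -> str:
    """Add the per-emotion confidences from both models and pick the maximum."""
    confidence = p_speech + p_text      # confidence vector over the speech emotions
    return EMOTIONS[int(confidence.argmax())]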
  20. The computer-readable storage medium according to claim 16, wherein the step of determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech comprises:
    constructing a scene library, the scene library comprising a plurality of dialogue scenes and the emotional state corresponding to each dialogue scene;
    performing context analysis according to the input speech and the text to be synthesized, to obtain the dialogue scene of the text to be synthesized;
    obtaining, according to the scene library, the emotional state corresponding to the dialogue scene of the text to be synthesized;
    determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
PCT/CN2020/093564 2019-06-19 2020-05-30 Situation- and emotion-oriented Chinese speech synthesis method, device, and storage medium WO2020253509A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910531628.7A CN110211563A (en) 2019-06-19 2019-06-19 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion
CN201910531628.7 2019-06-19

Publications (1)

Publication Number Publication Date
WO2020253509A1 (en)

Family

ID=67793522

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093564 WO2020253509A1 (en) 2020-12-24 Situation- and emotion-oriented Chinese speech synthesis method, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110211563A (en)
WO (1) WO2020253509A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion
CN110675853B (en) * 2019-09-10 2022-07-05 苏宁云计算有限公司 Emotion voice synthesis method and device based on deep learning
CN110570879A (en) * 2019-09-11 2019-12-13 深圳壹账通智能科技有限公司 Intelligent conversation method and device based on emotion recognition and computer equipment
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
CN112233648A (en) * 2019-12-09 2021-01-15 北京来也网络科技有限公司 Data processing method, device, equipment and storage medium combining RPA and AI
CN111276119B (en) * 2020-01-17 2023-08-22 平安科技(深圳)有限公司 Speech generation method, system and computer equipment
CN111274807B (en) * 2020-02-03 2022-05-10 华为技术有限公司 Text information processing method and device, computer equipment and readable storage medium
CN111445906A (en) * 2020-02-28 2020-07-24 深圳壹账通智能科技有限公司 Big data-based voice generation method, device, equipment and medium
CN111312210B (en) * 2020-03-05 2023-03-21 云知声智能科技股份有限公司 Text-text fused voice synthesis method and device
CN111489734B (en) * 2020-04-03 2023-08-22 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers
CN111627420B (en) * 2020-04-21 2023-12-08 升智信息科技(南京)有限公司 Method and device for synthesizing emotion voice of specific speaker under extremely low resource
CN111862030B (en) 2020-07-15 2024-02-09 北京百度网讯科技有限公司 Face synthetic image detection method and device, electronic equipment and storage medium
CN112148846A (en) * 2020-08-25 2020-12-29 北京来也网络科技有限公司 Reply voice determination method, device, equipment and storage medium combining RPA and AI
CN112423106A (en) * 2020-11-06 2021-02-26 四川长虹电器股份有限公司 Method and system for automatically translating accompanying sound
CN112466314A (en) * 2020-11-27 2021-03-09 平安科技(深圳)有限公司 Emotion voice data conversion method and device, computer equipment and storage medium
CN112837700A (en) * 2021-01-11 2021-05-25 网易(杭州)网络有限公司 Emotional audio generation method and device
CN113066473A (en) * 2021-03-31 2021-07-02 建信金融科技有限责任公司 Voice synthesis method and device, storage medium and electronic equipment
CN113593521B (en) * 2021-07-29 2022-09-20 北京三快在线科技有限公司 Speech synthesis method, device, equipment and readable storage medium
CN114373444B (en) * 2022-03-23 2022-05-27 广东电网有限责任公司佛山供电局 Method, system and equipment for synthesizing voice based on montage

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5860064A (en) * 1993-05-13 1999-01-12 Apple Computer, Inc. Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system
CN108305643A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
CN109256150A (en) * 2018-10-12 2019-01-22 北京创景咨询有限公司 Speech emotion recognition system and method based on machine learning
CN109559760A (en) * 2018-12-29 2019-04-02 北京京蓝宇科技有限公司 A kind of sentiment analysis method and system based on voice messaging
CN109754779A (en) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI221574B (en) * 2000-09-13 2004-10-01 Agi Inc Sentiment sensing method, perception generation method and device thereof and software
CN109817246B (en) * 2019-02-27 2023-04-18 平安科技(深圳)有限公司 Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium

Also Published As

Publication number Publication date
CN110211563A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
WO2020253509A1 (en) Situation- and emotion-oriented Chinese speech synthesis method, device, and storage medium
US11475897B2 (en) Method and apparatus for response using voice matching user category
US20220246149A1 (en) Proactive command framework
US11361753B2 (en) System and method for cross-speaker style transfer in text-to-speech and training data generation
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
US20220180872A1 (en) Electronic apparatus and method for controlling thereof
CN107657017A (en) Method and apparatus for providing voice service
US10224030B1 (en) Dynamic gazetteers for personalized entity recognition
US20220076674A1 (en) Cross-device voiceprint recognition
US11276403B2 (en) Natural language speech processing application selection
JP2018146715A (en) Voice interactive device, processing method of the same and program
KR20200044388A (en) Device and method to recognize voice and device and method to train voice recognition model
CN113330511B (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
WO2022252904A1 (en) Artificial intelligence-based audio processing method and apparatus, device, storage medium, and computer program product
CN114627856A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
US11600261B2 (en) System and method for cross-speaker style transfer in text-to-speech and training data generation
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
US20240029708A1 (en) Visual responses to user inputs
KR102226427B1 (en) Apparatus for determining title of user, system including the same, terminal and method for the same
CN114708849A (en) Voice processing method and device, computer equipment and computer readable storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN112037772A (en) Multi-mode-based response obligation detection method, system and device
US11792365B1 (en) Message data analysis for response recommendations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20825708

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20825708

Country of ref document: EP

Kind code of ref document: A1