WO2020253509A1 - Method, device and storage medium for situation- and emotion-oriented Chinese speech synthesis - Google Patents

Method, device and storage medium for situation- and emotion-oriented Chinese speech synthesis

Info

Publication number
WO2020253509A1
WO2020253509A1 (PCT/CN2020/093564, CN2020093564W)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
emotion
input
emotional state
synthesized
Prior art date
Application number
PCT/CN2020/093564
Other languages
English (en)
Chinese (zh)
Inventor
彭话易
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020253509A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225Feedback of the input speech

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a Chinese speech synthesis method, device and storage medium oriented to situations and emotions.
  • This application provides a context- and emotion-oriented Chinese speech synthesis method, device, and storage medium to solve the problem in the prior art that human-computer interaction is difficult to sustain because synthesized speech has a fixed tone and emotion.
  • One aspect of this application provides a context- and emotion-oriented Chinese speech synthesis method, including: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized that is determined based on the input speech.
  • Another aspect of this application provides an electronic device including a processor and a memory, the memory storing a context- and emotion-oriented Chinese speech synthesis program. When the Chinese speech synthesis program is executed by the processor, the following steps of the Chinese speech synthesis method are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized that is determined based on the input speech.
  • Another aspect of the present application provides a computer device including a memory, a processor, and a context- and emotion-oriented Chinese speech synthesis program stored in the memory and executable on the processor. When the Chinese speech synthesis program is executed by the processor, the following steps are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized that is determined based on the input speech.
  • Another aspect of the present application provides a computer-readable storage medium that includes a context- and emotion-oriented Chinese speech synthesis program. When the Chinese speech synthesis program is executed by a processor, the following steps of the Chinese speech synthesis method are realized: acquiring the input speech; inputting the input speech into the emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • This application analyzes the emotional state of the input speech and derives the emotional state of the synthesized speech from it. During speech synthesis, emotional-state and context analysis are added so that the tone and emotion of the synthesized speech conform to the current interaction scene rather than being fixed. During human-computer interaction, the output synthesized voice therefore sounds more like a real person, enhancing the user experience.
  • Fig. 1 is a schematic flowchart of the context- and emotion-oriented Chinese speech synthesis method described in this application;
  • Fig. 2 is a schematic structural diagram of the emotion recognition model in this application;
  • Fig. 3 is a schematic structural diagram of the convolutional recurrent neural network in this application;
  • Fig. 4 is a schematic structural diagram of the sentiment classification model in this application;
  • Fig. 5 is a schematic diagram of the modules of the context- and emotion-oriented Chinese speech synthesis program in this application.
  • Figure 1 is a schematic flow chart of the context- and emotion-oriented Chinese speech synthesis method described in this application. As shown in Figure 1, the context- and emotion-oriented Chinese speech synthesis method described in this application includes the following steps:
  • Step S1: Obtain the input speech, which is the speech to be responded to. Taking a smart speaker as an example, the input speech is the user's query and the synthesized speech is the smart speaker's feedback to the user. In this application, the emotional state of the synthesized speech is derived from the emotional state of the user's query, so that the smart speaker's feedback carries a specific tone and emotion consistent with the emotion of the user's input speech;
  • Step S2: Input the input speech into an emotion analysis model, and output the emotional state of the input speech through the emotion analysis model;
  • Step S3: Determine the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech. When determining the emotional state of the synthesized speech, the dialogue scene is added as an influencing factor, so that the feedback obtained in human-computer interaction not only matches the user's emotion but is also more in line with the actual application scenario, avoiding errors. For example, in a sales scene, even if the emotion expressed by the customer's input speech is impatient, after the dialogue-scene factor is taken into account the emotional state of the synthesized speech should still be happy and positive, so as to serve the customer well;
  • Step S4: Perform speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech, where the text to be synthesized is the feedback text determined by the intelligent system according to the input speech during human-computer interaction.
  • This application obtains the emotional state of the synthesized speech by analyzing the emotional state of the input speech. During speech synthesis, emotional-state and context analysis are added so that the tone and emotion of the synthesized speech conform to the current interaction scene; as a result, during human-computer interaction the output synthesized speech sounds more like a real person. A minimal end-to-end sketch of steps S1 to S4 is given below.
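  • Purely for illustration, the sketch below strings steps S1 to S4 together; the function names (analyze_emotion, determine_reply_emotion, synthesize) and their trivial placeholder bodies are assumptions, not the patented implementation.

```python
# Hypothetical sketch of steps S1-S4; names and placeholder logic are illustrative only.
from dataclasses import dataclass

@dataclass
class Reply:
    text: str      # text to be synthesized, determined by the dialogue system
    emotion: str   # emotional state chosen for the synthesized speech

def analyze_emotion(input_audio: bytes) -> str:
    # S2: placeholder for the emotion analysis model (speech- and/or text-based)
    return "impatient"

def determine_reply_emotion(dialogue_scene: str, input_emotion: str) -> str:
    # S3: placeholder for the scene-library lookup; a sales scene overrides the
    # input emotion with a positive one, as in the example above
    return {"sales": "happy"}.get(dialogue_scene, input_emotion)

def synthesize(reply: Reply) -> bytes:
    # S4: placeholder for emotional speech synthesis
    return f"<audio emotion={reply.emotion}>{reply.text}</audio>".encode("utf-8")

def run_pipeline(input_audio: bytes, dialogue_scene: str, reply_text: str) -> bytes:
    input_emotion = analyze_emotion(input_audio)                            # step S2
    reply_emotion = determine_reply_emotion(dialogue_scene, input_emotion)  # step S3
    return synthesize(Reply(text=reply_text, emotion=reply_emotion))        # step S4

print(run_pipeline(b"", "sales", "Thank you for waiting!"))
```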
  • Before the step of inputting the input speech into the emotion analysis model, the Chinese speech synthesis method further includes: judging whether there is an interaction scene according to the input speech and the text to be synthesized. If there is no interaction scene, a set emotional state of the synthesized speech or the default emotional state of the synthesized speech is used, and no emotion analysis is performed on the input speech; if there is an interaction scene, the next step is entered, in which the input speech is input into the emotion analysis model and its emotional state is analyzed.
  • The set emotional state of the synthesized speech can be, for example, impassioned or gentle and calm, and can be specifically set according to the role or purpose of the human-computer interaction.
  • For example, the default emotional state of the feedback synthesized speech is gentle and calm. If the input speech only involves consultation on a certain question and does not involve an interaction scene, the content of the text to be synthesized is determined based on the input speech, and outputting the text to be synthesized with a gentle and calm emotion is sufficient to meet the user's needs. For example, if a user asks "What's the temperature in Beijing today", the Q&A system only needs to reply "Today's temperature in Beijing is ⁇ °C" in the default tone and emotion, instead of performing sentiment analysis on the input speech. A minimal sketch of this pre-check follows.
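  • A minimal sketch of this pre-check, assuming scene detection can be reduced to a boolean helper; has_interaction_scene, DEFAULT_EMOTION and the keyword heuristic are hypothetical stand-ins.

```python
# Hypothetical pre-check: skip emotion analysis when there is no interaction scene.
DEFAULT_EMOTION = "gentle"   # default emotional state for plain Q&A replies

def has_interaction_scene(input_text: str, reply_text: str) -> bool:
    # Placeholder: the method judges this from the input speech and the text to be
    # synthesized; a trivial keyword heuristic stands in for that judgment here.
    return any(word in input_text for word in ("help", "complain", "buy"))

def choose_emotion(input_text: str, reply_text: str, analyze) -> str:
    if not has_interaction_scene(input_text, reply_text):
        return DEFAULT_EMOTION      # no interaction scene: use the default, skip analysis
    return analyze(input_text)      # interaction scene: run the emotion analysis model

# A plain weather query keeps the default gentle tone.
print(choose_emotion("What's the temperature in Beijing today",
                     "Today's temperature in Beijing is 25 degrees",
                     analyze=lambda text: "happy"))
```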
  • the emotion analysis model includes a voice-based emotion recognition model.
  • Figure 2 is a schematic structural diagram of the emotion recognition model in this application. As shown in Figure 2, the emotion recognition model is divided into three parts. The first part includes a convolutional recurrent neural network (CRNN) and a first fully connected (FC) layer; its input is the spectrogram, and a first feature vector is output through the CRNN and the first FC layer. The second part includes three successively connected second FC layers; its input is low-level speech feature parameters (Low-Level Descriptors, LLDs), including fundamental frequency, energy, zero-crossing rate, Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC) and other features, and it outputs a second feature vector. The third part includes a third FC layer and a normalization layer; its input is the fusion feature vector of the first feature vector and the second feature vector, and its output is a probability vector representing the emotion classification.
  • Fig. 3 is a schematic structural diagram of the convolutional recurrent neural network in this application. The convolutional recurrent neural network includes: a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a long short-term memory (LSTM) layer and a third pooling layer, where the third pooling layer includes three pooling modules: a minimum pooling module, an average pooling module and a maximum pooling module. Each pooling module is connected to every neuron of the LSTM layer.
  • The convolutional recurrent neural network is trained using an emotional speech data set, which includes about 15 hours of speech data from multiple speakers and the corresponding emotion labels. A minimal sketch of this architecture is given below.
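  • The PyTorch sketch below mirrors the three-part structure just described (spectrogram to CRNN with conv/pool/LSTM/min-avg-max pooling and a first FC layer; LLD statistics through three FC layers; fused vector through a third FC layer and softmax). All layer widths, the 32-dimensional statistics vector and the six emotion classes are assumptions, since the patent does not give these hyperparameters.

```python
# Illustrative PyTorch sketch of the three-part emotion recognition model.
# Layer sizes, 6 emotion classes and the 32-dim HSF vector are assumptions.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Conv -> pool -> conv -> pool -> LSTM -> {min, avg, max} pooling over time."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # first conv + pool
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # second conv + pool
        )
        self.lstm = nn.LSTM(32 * (n_mels // 4), hidden, batch_first=True)

    def forward(self, spec):                      # spec: (batch, 1, frames, n_mels)
        x = self.conv(spec)                       # (batch, 32, frames/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)      # (batch, frames/4, 32 * n_mels/4)
        x, _ = self.lstm(x)                       # (batch, frames/4, hidden)
        # third pooling layer: minimum, average and maximum pooling over time
        return torch.cat([x.min(1).values, x.mean(1), x.max(1).values], dim=1)

class EmotionRecognitionModel(nn.Module):
    def __init__(self, n_hsf=32, n_emotions=6, hidden=128):
        super().__init__()
        self.crnn = CRNN(hidden=hidden)
        self.fc1 = nn.Linear(3 * hidden, 128)                      # first FC layer
        self.fc2 = nn.Sequential(nn.Linear(n_hsf, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, 128))              # three second FC layers
        self.fc3 = nn.Linear(128 + 128, n_emotions)                # third FC layer
        self.norm = nn.Softmax(dim=-1)                             # normalization layer

    def forward(self, spectrogram, hsf_features):
        v1 = self.fc1(self.crnn(spectrogram))      # first feature vector (spectrogram branch)
        v2 = self.fc2(hsf_features)                # second feature vector (LLD statistics branch)
        fused = torch.cat([v1, v2], dim=-1)        # fusion feature vector
        return self.norm(self.fc3(fused))          # first probability vector over emotions

# Dummy forward pass: batch of 2, 200 spectrogram frames, 80 mel bins, 32 HSF statistics.
model = EmotionRecognitionModel()
print(model(torch.randn(2, 1, 200, 80), torch.randn(2, 32)).shape)   # torch.Size([2, 6])
```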
  • The step of outputting the emotional state of the input speech through the emotion analysis model includes: obtaining the spectrogram and speech feature parameters of the input speech; inputting the spectrogram of the input speech into the trained convolutional recurrent neural network (CRNN) in the emotion recognition model, and outputting a first feature vector through the CRNN and the first fully connected layer; obtaining statistical features (High-level Statistical Functions, HSF features: statistics such as the average or maximum value of the feature parameters computed over the multiple frames of a speech segment) from the speech feature parameters, inputting them into the emotion recognition model, and outputting a second feature vector through the three second fully connected layers; concatenating the first feature vector and the second feature vector to obtain a fusion feature vector; passing the fusion feature vector through the third fully connected layer and the normalization layer in the emotion recognition model to output the first probability vector of the emotion of the input speech; and obtaining the emotional state of the input speech according to the first probability vector. A sketch of extracting the LLDs and their HSF statistics is shown below.
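  • A sketch of extracting the low-level descriptors (fundamental frequency, energy, zero-crossing rate, MFCC) and their segment-level HSF statistics with librosa; the exact descriptor set and statistics are assumptions, and LPCC, which the patent also lists, is not computed here because librosa does not provide it directly.

```python
# Illustrative sketch of LLD extraction and HSF statistics using librosa.
# The feature set and statistics (mean/max per descriptor) are assumptions.
import numpy as np
import librosa

def hsf_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)

    # Frame-level low-level descriptors (LLDs)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)   # fundamental frequency
    f0 = np.nan_to_num(f0)                                          # unvoiced frames -> 0
    energy = librosa.feature.rms(y=y)[0]                            # short-time energy
    zcr = librosa.feature.zero_crossing_rate(y)[0]                  # zero-crossing rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)              # 13 MFCCs per frame

    # High-level Statistical Functions (HSF): segment-level statistics of the LLDs
    stats = []
    for lld in [f0, energy, zcr, *mfcc]:
        stats.extend([np.mean(lld), np.max(lld)])
    return np.asarray(stats, dtype=np.float32)       # fixed-length vector for the FC branch

# Usage (assumes a local 16 kHz mono file):
# print(hsf_features("query.wav").shape)             # (32,) = 16 LLDs x 2 statistics
```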
  • the sentiment analysis model includes a text-based sentiment classification model.
  • Fig. 4 is a schematic structural diagram of the sentiment classification model in this application. As shown in Fig. 4, the sentiment classification model includes a feature extraction layer and a classifier. The feature extraction layer is used to extract features of the input text and vectorize it: it includes an input layer for inputting the text to be classified, the text to be classified is transformed into multiple word vectors, and a sentence vector is constructed from the word vectors using an open-source BERT (Bidirectional Encoder Representations from Transformers) model. The classifier is composed of an LSTM neural network and includes an input layer, a hidden layer and an output layer, where the input layer includes 256 input nodes for the input sentence vectors, the hidden layer includes 128 hidden nodes, and the output layer uses a softmax function to output emotion labels and probabilities.
  • The step of outputting the emotional state of the input speech through the emotion analysis model includes: converting the input speech into text to be classified through speech recognition; extracting the text feature vector of the text to be classified; inputting the text feature vector into the deep neural network classifier in the emotion classification model; obtaining the second probability vector of the emotion of the input speech through the classifier; and obtaining the emotional state of the input speech according to the second probability vector. A sketch of such a BERT-plus-LSTM classifier is shown below.
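  • A minimal sketch of the text branch using the stated sizes (256-node sentence-vector input, 128 hidden nodes, softmax output) with the open-source Hugging Face BERT as feature extractor. The 768-to-256 projection, the six emotion classes and feeding the sentence vector to the LSTM as a one-step sequence are assumptions.

```python
# Illustrative sketch of the text-based emotion classifier (BERT features + LSTM + softmax).
# The 768->256 projection, 6 emotion classes and single-step LSTM input are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextEmotionClassifier(nn.Module):
    def __init__(self, n_emotions=6):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")   # open-source BERT
        self.project = nn.Linear(self.bert.config.hidden_size, 256)  # sentence vector -> 256 input nodes
        self.lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)  # 128 hidden nodes
        self.out = nn.Linear(128, n_emotions)

    def forward(self, input_ids, attention_mask):
        enc = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sentence_vec = self.project(enc.pooler_output)         # (batch, 256)
        h, _ = self.lstm(sentence_vec.unsqueeze(1))             # treat as a 1-step sequence
        return torch.softmax(self.out(h[:, -1]), dim=-1)        # second probability vector

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
batch = tokenizer(["今天天气真好", "我已经等了半个小时了"], return_tensors="pt", padding=True)
model = TextEmotionClassifier()
print(model(batch["input_ids"], batch["attention_mask"]).shape)   # torch.Size([2, 6])
```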
  • the sentiment analysis model may include only one of a speech-based emotion recognition model and a text-based emotion classification model, or may include both.
  • When the emotion analysis model includes both the speech-based emotion recognition model and the text-based emotion classification model, probability vectors representing the speech emotion are obtained through the two models, and a comprehensive analysis is performed on the results of both models to improve the accuracy of the emotion analysis.
  • In this case, the step of outputting the emotional state of the input speech through the emotion analysis model includes: obtaining a first probability vector of the emotion of the input speech through the speech-based emotion recognition model, and obtaining the first confidence levels of multiple speech emotions from the first probability vector; obtaining a second probability vector of the emotion of the input speech through the text-based emotion classification model, and obtaining the second confidence levels of the multiple speech emotions from the second probability vector; adding the first confidence level and the second confidence level of each speech emotion to obtain a confidence vector over the multiple speech emotions; and selecting the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech.
  • For example, suppose the possible emotional states of the speech are happy, sad, impatient, excited and agitated, the first confidence levels corresponding to these states are 0.6, 0.2, 0.3, 0.4 and 0.7, and the second confidence levels are 0.8, 0.3, 0.2, 0.5 and 0.5. Adding the corresponding first and second confidence levels gives final confidence levels of 1.4, 0.5, 0.5, 0.9 and 1.2 for the respective speech emotions, and the emotional state (happy) corresponding to the maximum confidence of 1.4 is selected as the emotional state of the input speech.
  • Alternatively, the two results obtained through the speech-based emotion recognition model and the text-based emotion classification model, which respectively represent the confidence levels of the various emotional states of the input speech, can be assigned different weight values, and the confidence levels are added according to these weights to predict the final speech emotional state. For example, set a weight of 0.6 for the emotion recognition model and a weight of 0.4 for the emotion classification model. Suppose again that the possible speech emotional states are happy, sad, impatient, excited and agitated, the first confidence levels are 0.6, 0.2, 0.3, 0.4 and 0.7, and the second confidence levels are 0.8, 0.3, 0.2, 0.5 and 0.5. The weighted sums give final confidence levels of 0.68, 0.24, 0.26, 0.44 and 0.62 (e.g. 0.6 × 0.6 + 0.4 × 0.8 = 0.68), and the emotional state (happy) corresponding to the maximum confidence of 0.68 is selected as the emotional state of the input speech. This fusion is reproduced numerically in the sketch below.
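  • The weighted fusion above can be reproduced in a few lines; the labels, confidences and weights are the example values from the text.

```python
# Weighted fusion of the two models' confidences, using the example values above.
emotions = ["happy", "sad", "impatient", "excited", "agitated"]
first_conf  = [0.6, 0.2, 0.3, 0.4, 0.7]   # speech-based emotion recognition model
second_conf = [0.8, 0.3, 0.2, 0.5, 0.5]   # text-based emotion classification model
w1, w2 = 0.6, 0.4                          # weights for the two models

fused = [w1 * a + w2 * b for a, b in zip(first_conf, second_conf)]
print([round(c, 2) for c in fused])        # [0.68, 0.24, 0.26, 0.44, 0.62]
print(emotions[fused.index(max(fused))])   # 'happy' -> emotional state of the input speech
```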
  • Determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech includes:
  • constructing a scene library, where the scene library includes a variety of dialogue scenes and the emotional state corresponding to each dialogue scene; the dialogue scenes and the corresponding emotional states can be manually labeled according to the specific scenario combined with human cognition, predefining the emotional state required for the synthesized speech in certain specific dialogue scenes, because even in response to input speech with the same emotion, the emotional state of the synthesized speech to be fed back may differ between dialogue scenes;
  • performing context analysis according to the input speech and the text to be synthesized to obtain the dialogue scene of the text to be synthesized, and querying the scene library for the emotional state corresponding to that dialogue scene;
  • determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
  • For example, in a sales scene, even if the emotion expressed by the customer's input speech is impatient, the emotional state of the synthesized speech should be happy and positive so as to serve the customer well. In this way, the feedback obtained during human-computer interaction is more in line with the real application scene and the user experience is enhanced. A minimal scene-library sketch is given below.
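  • A minimal sketch of the scene library as a lookup table; the scene names, their emotion labels and the fallback of mirroring the input emotion are illustrative assumptions rather than the patented rule.

```python
# Hypothetical scene library: manually labeled mapping from dialogue scene to the
# emotional state required of the synthesized speech. Labels are illustrative.
SCENE_LIBRARY = {
    "sales":      "happy",       # always serve the customer positively
    "complaint":  "apologetic",
    "weather_qa": "gentle",      # plain Q&A keeps the default tone
}

def synthesized_emotion(dialogue_scene: str, input_emotion: str) -> str:
    # Scene label takes priority; otherwise mirror the emotion of the input speech.
    return SCENE_LIBRARY.get(dialogue_scene, input_emotion)

print(synthesized_emotion("sales", "impatient"))   # 'happy'
print(synthesized_emotion("chitchat", "sad"))      # 'sad' (no scene entry -> mirror input)
```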
  • Performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized includes: embedding modal particles into the text to be synthesized through waveform-splicing technology; controlling the tone and prosody of the synthesized speech through end-to-end synthesis technology; and performing speech synthesis according to the embedded modal particles, tone and prosody, so that the synthesized speech expresses the corresponding tone and emotion. A heavily simplified sketch of this step is given below.
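  • The text-level sketch below only illustrates the idea of this final step: modal particles are chosen by emotion and the result, together with the emotion label, is handed to a placeholder synthesizer. The particle table and the synthesize_waveform interface are hypothetical; the actual waveform splicing and end-to-end prosody control are not reproduced.

```python
# Hypothetical sketch of step S4: emotion-dependent modal particles plus an emotion
# tag for a TTS back end. The particle table and synthesizer interface are assumptions.
MODAL_PARTICLES = {
    "happy":      "哦",   # bright, upbeat particle
    "gentle":     "呢",
    "apologetic": "啊",
}

def add_modal_particles(text: str, emotion: str) -> str:
    particle = MODAL_PARTICLES.get(emotion, "")
    return text.rstrip("。.!?！？") + particle + "。"

def synthesize_waveform(text: str, emotion: str) -> bytes:
    # Placeholder for the end-to-end synthesizer that conditions tone and prosody
    # on the emotion label (e.g. via an emotion embedding).
    return f"<tts emotion={emotion}>{text}</tts>".encode("utf-8")

reply = add_modal_particles("已经为您加急处理", "happy")
print(synthesize_waveform(reply, "happy"))
```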
  • The context- and emotion-oriented Chinese speech synthesis method described in this application is applied to an electronic device, and the electronic device may be a terminal device such as a TV, a smartphone, a tablet computer, or a computer.
  • The electronic device includes a processor and a memory for storing the context- and emotion-oriented Chinese speech synthesis program. The processor executes the program to realize the following steps of the context- and emotion-oriented Chinese speech synthesis method: obtaining the input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • the electronic device also includes a network interface, a communication bus, and the like.
  • the network interface may include a standard wired interface and a wireless interface
  • the communication bus is used to realize the connection and communication between various components.
  • The memory includes at least one type of readable storage medium, which can be a non-volatile storage medium such as a flash memory, a hard disk, an optical disk, or a plug-in hard disk, but is not limited to these; it can be any device that stores instructions or software and any associated data files in a non-transitory manner and provides them to the processor so that the processor can execute the instructions or software program.
  • The software program stored in the memory includes the context- and emotion-oriented Chinese speech synthesis program, which can be provided to the processor so that the processor can execute it to realize the steps of the Chinese speech synthesis method.
  • the processor can be a central processing unit, a microprocessor, or other data processing chips, etc., and can run stored programs in the memory, for example, the context-oriented and emotional Chinese speech synthesis program in this application.
  • the electronic device may also include a display, and the display may also be called a display screen or a display unit.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, etc.
  • the display is used to display the information processed in the electronic device and to display the visual work interface.
  • the electronic device may further include a user interface, and the user interface may include an input unit (such as a keyboard), a voice output device (such as a stereo, earphone), and the like.
  • the context and emotion-oriented Chinese speech synthesis program can also be divided into one or more modules, and the one or more modules are stored in the memory and executed by the processor to complete the application.
  • the module referred to in this application refers to a series of computer program instruction segments that can complete specific functions.
  • Figure 5 is a schematic diagram of the modules of the context- and emotion-oriented Chinese speech synthesis program in this application. As shown in Fig. 5, the Chinese speech synthesis program can be divided into: an acquisition module 1, an emotion analysis module 2, an emotion determination module 3, and a speech synthesis module 4.
  • The functions or operation steps implemented by the above modules are similar to those described above, for example:
  • the acquisition module 1 obtains the input speech;
  • the emotion analysis module 2 inputs the input speech into the emotion analysis model and outputs the emotional state of the input speech through the emotion analysis model;
  • the emotion determination module 3 determines the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
  • the speech synthesis module 4 performs speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • The electronic device further includes a judgment module which, before the step of inputting the input speech into the emotion analysis model, judges whether there is an interaction scene according to the input speech and the text to be synthesized. If there is no interaction scene, a set emotional state of the synthesized speech or the default emotional state of the synthesized speech is used and no emotion analysis is performed on the input speech; if there is an interaction scene, the next step is entered, in which the input speech is input into the emotion analysis model and its emotional state is analyzed.
  • The set emotional state of the synthesized speech can be, for example, impassioned or gentle and calm, and can be specifically set according to the role or purpose of the human-computer interaction.
  • For example, the default emotional state of the feedback synthesized speech is gentle and calm. If the input speech only involves consultation on a certain question and does not involve an interaction scene, the content of the text to be synthesized is determined based on the input speech, and outputting the text to be synthesized with a gentle and calm emotion is sufficient to meet the user's needs. For example, if a user asks "What's the temperature in Beijing today", the Q&A system only needs to reply "Today's temperature in Beijing is ⁇ °C" in the default tone and emotion, instead of performing sentiment analysis on the input speech.
  • the emotion analysis model includes a voice-based emotion recognition model.
  • Figure 2 is a schematic structural diagram of the emotion recognition model in this application. As shown in Figure 2, the emotion recognition model is divided into three parts. The first part includes a convolutional recurrent neural network (CRNN) and a first fully connected (FC) layer; its input is the spectrogram, and a first feature vector is output through the CRNN and the first FC layer. The second part includes three successively connected second FC layers; its input is low-level speech feature parameters (LLDs), including fundamental frequency, energy, zero-crossing rate, Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC) and other features, and it outputs a second feature vector. The third part includes a third FC layer and a normalization layer; its input is the fusion feature vector of the first feature vector and the second feature vector, and its output is a probability vector representing the emotion classification.
  • Fig. 3 is a schematic structural diagram of the convolutional recurrent neural network in this application. The convolutional recurrent neural network includes: a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a long short-term memory (LSTM) layer and a third pooling layer, where the third pooling layer includes three pooling modules: a minimum pooling module, an average pooling module and a maximum pooling module. Each pooling module is connected to every neuron of the LSTM layer.
  • The convolutional recurrent neural network is trained using an emotional speech data set, which includes about 15 hours of speech data from multiple speakers and the corresponding emotion labels.
  • In this case, the emotion analysis module includes: a parameter acquisition unit, which acquires the spectrogram and speech feature parameters of the input speech; a first feature vector acquisition unit, which inputs the spectrogram of the input speech into the trained convolutional recurrent neural network in the emotion recognition model and outputs the first feature vector through the convolutional recurrent neural network and the first fully connected layer; a second feature vector acquisition unit, which obtains HSF features from the speech feature parameters (statistics such as the average or maximum value of the feature parameters computed over the multiple frames of a speech segment), inputs them into the emotion recognition model, and outputs the second feature vector through the three second fully connected layers of the emotion recognition model; a feature fusion unit, which concatenates the first feature vector and the second feature vector to obtain a fusion feature vector; a first probability vector acquisition unit, which passes the fusion feature vector through the third fully connected layer and the normalization layer in the emotion recognition model to output the first probability vector of the emotion of the input speech; and an emotion state output unit, which obtains the emotional state of the input speech according to the first probability vector.
  • the sentiment analysis model includes a text-based sentiment classification model.
  • Fig. 4 is a schematic structural diagram of the sentiment classification model in this application. As shown in Fig. 4, the sentiment classification model includes a feature extraction layer and a classifier. The feature extraction layer is used to extract features of the input text and vectorize it: it includes an input layer for inputting the text to be classified, the text to be classified is transformed into multiple word vectors, and a sentence vector is constructed from the word vectors using an open-source BERT (Bidirectional Encoder Representations from Transformers) model. The classifier is composed of an LSTM neural network and includes an input layer, a hidden layer and an output layer, where the input layer includes 256 input nodes for the input sentence vectors, the hidden layer includes 128 hidden nodes, and the output layer uses a softmax function to output emotion labels and probabilities.
  • In this case, the emotion analysis module includes: a text conversion unit, which converts the input speech into text to be classified through speech recognition; a feature extraction unit, which extracts the text feature vector of the text to be classified; the text feature vector is input into the deep neural network classifier in the emotion classification model; a second probability vector acquisition unit, which acquires the second probability vector of the emotion of the input speech through the classifier; and an emotion state output unit, which acquires the emotional state of the input speech according to the second probability vector.
  • the sentiment analysis model may include only one of a speech-based emotion recognition model and a text-based emotion classification model, or may include both.
  • When the emotion analysis model includes both the speech-based emotion recognition model and the text-based emotion classification model, probability vectors representing the speech emotion are obtained through the two models, and a comprehensive analysis is performed on the results of both models to improve the accuracy of the emotion analysis.
  • In this case, the emotion analysis module includes: a first confidence obtaining unit, which obtains the first probability vector of the emotion of the input speech through the speech-based emotion recognition model and obtains the first confidence levels of multiple speech emotions from the first probability vector; a second confidence obtaining unit, which obtains the second probability vector of the emotion of the input speech through the text-based emotion classification model and obtains the second confidence levels of the multiple speech emotions from the second probability vector; a confidence vector obtaining unit, which adds the first confidence level and the second confidence level of each speech emotion to obtain a confidence vector over the multiple speech emotions; and a selection unit, which selects the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech.
  • For example, suppose the possible emotional states of the speech are happy, sad, impatient, excited and agitated, the first confidence levels corresponding to these states are 0.6, 0.2, 0.3, 0.4 and 0.7, and the second confidence levels are 0.8, 0.3, 0.2, 0.5 and 0.5. Adding the corresponding first and second confidence levels gives final confidence levels of 1.4, 0.5, 0.5, 0.9 and 1.2 for the respective speech emotions, and the emotional state (happy) corresponding to the maximum confidence of 1.4 is selected as the emotional state of the input speech.
  • The emotion determination module includes:
  • a construction unit, which constructs a scene library, where the scene library includes a variety of dialogue scenes and the emotional state corresponding to each dialogue scene; the dialogue scenes and the corresponding emotional states can be manually labeled according to the specific scenario combined with human cognition, predefining the emotional state required for the synthesized speech in certain specific dialogue scenes, because even in response to input speech with the same emotion, the emotional state of the synthesized speech to be fed back may differ between dialogue scenes;
  • a context analysis unit, which performs context analysis according to the input speech and the text to be synthesized and obtains the dialogue scene of the text to be synthesized;
  • a query unit, which obtains the emotional state corresponding to the dialogue scene of the text to be synthesized from the scene library;
  • an emotional state determining unit, which determines the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
  • For example, in a sales scene, even if the emotion expressed by the customer's input speech is impatient, the emotional state of the synthesized speech should be happy and positive so as to serve the customer well. In this way, the feedback obtained during human-computer interaction is more in line with the real application scene and the user experience is enhanced.
  • The speech synthesis module includes: a modal particle embedding unit, which embeds modal particles into the text to be synthesized through waveform-splicing technology; a prosody control unit, which controls the tone and prosody of the synthesized speech through end-to-end synthesis technology; and a speech synthesis unit, which performs speech synthesis according to the embedded modal particles, tone and prosody, so that the synthesized speech expresses the corresponding tone and emotion.
  • the computer-readable storage medium may be any tangible medium that contains or stores a program or instruction.
  • the program can be executed, and the stored program instructs related hardware to realize the corresponding function.
  • the computer-readable storage medium may be a computer disk, a hard disk, a random access memory, a read-only memory, etc.
  • the application is not limited to this, and it can be any device that stores instructions or software and any related data files or data structures in a non-transitory manner and can be provided to the processor to enable the processor to execute the programs or instructions therein.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium includes a context- and emotion-oriented Chinese speech synthesis program; when the context- and emotion-oriented Chinese speech synthesis program is executed by a processor, the steps of the Chinese speech synthesis method described above are realized.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

This invention belongs to the technical field of artificial intelligence and relates to a situation- and emotion-oriented Chinese speech synthesis method, device and storage medium, the method comprising the steps of: obtaining input speech; inputting the input speech into a sentiment analysis model, and outputting the emotional state of the input speech by means of the sentiment analysis model; determining the emotional state of the synthesized speech according to a dialogue situation and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized that was determined according to the input speech. In the present invention, the emotional state of the input speech is analyzed, and the emotional state of the synthesized speech is obtained according to the emotional state of the input speech; during speech synthesis, emotional-state and situation analysis is added to make the intonation and emotion of the synthesized speech match the current interaction situation rather than being of fixed intonation and emotion; during the human-computer interaction process, the output synthesized speech is more like the speech of a real person, which improves the user experience.
PCT/CN2020/093564 2019-06-19 2020-05-30 Procédé, dispositif et support d'informations de synthèse de la parole chinoise orientée sur la situation et l'émotion WO2020253509A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910531628.7 2019-06-19
CN201910531628.7A CN110211563B (zh) 2019-06-19 2019-06-19 面向情景及情感的中文语音合成方法、装置及存储介质

Publications (1)

Publication Number Publication Date
WO2020253509A1 true WO2020253509A1 (fr) 2020-12-24

Family

ID=67793522

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093564 WO2020253509A1 (fr) 2019-06-19 2020-05-30 Procédé, dispositif et support d'informations de synthèse de la parole chinoise orientée sur la situation et l'émotion

Country Status (2)

Country Link
CN (1) CN110211563B (fr)
WO (1) WO2020253509A1 (fr)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110211563B (zh) * 2019-06-19 2024-05-24 平安科技(深圳)有限公司 面向情景及情感的中文语音合成方法、装置及存储介质
CN110675853B (zh) * 2019-09-10 2022-07-05 苏宁云计算有限公司 一种基于深度学习的情感语音合成方法及装置
CN110570879A (zh) * 2019-09-11 2019-12-13 深圳壹账通智能科技有限公司 基于情绪识别的智能会话方法、装置及计算机设备
CN110853616A (zh) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 一种基于神经网络的语音合成方法、系统与存储介质
CN112233648A (zh) * 2019-12-09 2021-01-15 北京来也网络科技有限公司 结合rpa及ai的数据的处理方法、装置、设备及存储介质
CN111276119B (zh) * 2020-01-17 2023-08-22 平安科技(深圳)有限公司 语音生成方法、系统和计算机设备
CN111274807B (zh) * 2020-02-03 2022-05-10 华为技术有限公司 文本信息的处理方法及装置、计算机设备和可读存储介质
CN111445906A (zh) * 2020-02-28 2020-07-24 深圳壹账通智能科技有限公司 基于大数据的语音生成方法、装置、设备及介质
CN111312210B (zh) * 2020-03-05 2023-03-21 云知声智能科技股份有限公司 一种融合图文的语音合成方法及装置
CN111489734B (zh) * 2020-04-03 2023-08-22 支付宝(杭州)信息技术有限公司 基于多说话人的模型训练方法以及装置
CN111627420B (zh) * 2020-04-21 2023-12-08 升智信息科技(南京)有限公司 极低资源下的特定发音人情感语音合成方法及装置
CN111862030B (zh) * 2020-07-15 2024-02-09 北京百度网讯科技有限公司 一种人脸合成图检测方法、装置、电子设备及存储介质
CN112148846A (zh) * 2020-08-25 2020-12-29 北京来也网络科技有限公司 结合rpa和ai的回复语音确定方法、装置、设备及存储介质
CN112423106A (zh) * 2020-11-06 2021-02-26 四川长虹电器股份有限公司 一种自动翻译伴音的方法及系统
CN112466314A (zh) * 2020-11-27 2021-03-09 平安科技(深圳)有限公司 情感语音数据转换方法、装置、计算机设备及存储介质
CN112837700A (zh) * 2021-01-11 2021-05-25 网易(杭州)网络有限公司 一种情感化的音频生成方法和装置
CN113066473A (zh) * 2021-03-31 2021-07-02 建信金融科技有限责任公司 一种语音合成方法、装置、存储介质及电子设备
CN113593521B (zh) * 2021-07-29 2022-09-20 北京三快在线科技有限公司 语音合成方法、装置、设备及可读存储介质
CN114373444B (zh) * 2022-03-23 2022-05-27 广东电网有限责任公司佛山供电局 一种基于蒙太奇的语音合成方法、系统及设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5860064A (en) * 1993-05-13 1999-01-12 Apple Computer, Inc. Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system
CN108305643A (zh) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 情感信息的确定方法和装置
CN109256150A (zh) * 2018-10-12 2019-01-22 北京创景咨询有限公司 基于机器学习的语音情感识别系统及方法
CN109559760A (zh) * 2018-12-29 2019-04-02 北京京蓝宇科技有限公司 一种基于语音信息的情感分析方法及系统
CN109754779A (zh) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 可控情感语音合成方法、装置、电子设备及可读存储介质
CN110211563A (zh) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 面向情景及情感的中文语音合成方法、装置及存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI221574B (en) * 2000-09-13 2004-10-01 Agi Inc Sentiment sensing method, perception generation method and device thereof and software
CN109817246B (zh) * 2019-02-27 2023-04-18 平安科技(深圳)有限公司 情感识别模型的训练方法、情感识别方法、装置、设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5860064A (en) * 1993-05-13 1999-01-12 Apple Computer, Inc. Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system
CN108305643A (zh) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 情感信息的确定方法和装置
CN109256150A (zh) * 2018-10-12 2019-01-22 北京创景咨询有限公司 基于机器学习的语音情感识别系统及方法
CN109559760A (zh) * 2018-12-29 2019-04-02 北京京蓝宇科技有限公司 一种基于语音信息的情感分析方法及系统
CN109754779A (zh) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 可控情感语音合成方法、装置、电子设备及可读存储介质
CN110211563A (zh) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 面向情景及情感的中文语音合成方法、装置及存储介质

Also Published As

Publication number Publication date
CN110211563A (zh) 2019-09-06
CN110211563B (zh) 2024-05-24

Similar Documents

Publication Publication Date Title
WO2020253509A1 (fr) Procédé, dispositif et support d'informations de synthèse de la parole chinoise orientée sur la situation et l'émotion
US11475897B2 (en) Method and apparatus for response using voice matching user category
US20220246149A1 (en) Proactive command framework
US11361753B2 (en) System and method for cross-speaker style transfer in text-to-speech and training data generation
WO2022178969A1 (fr) Procédé et appareil de traitement de données vocales de conversation, dispositif informatique et support de stockage
CN107657017A (zh) 用于提供语音服务的方法和装置
US20220076674A1 (en) Cross-device voiceprint recognition
US11276403B2 (en) Natural language speech processing application selection
JP2018146715A (ja) 音声対話装置、その処理方法及びプログラム
KR20200044388A (ko) 음성을 인식하는 장치 및 방법, 음성 인식 모델을 트레이닝하는 장치 및 방법
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN114495927A (zh) 多模态交互的虚拟数字人的生成方法及装置、存储介质、终端
CN114627856A (zh) 语音识别方法、装置、存储介质及电子设备
US11600261B2 (en) System and method for cross-speaker style transfer in text-to-speech and training data generation
CN114242093A (zh) 语音音色转换方法、装置、计算机设备和存储介质
US11996081B2 (en) Visual responses to user inputs
CN117219046A (zh) 一种交互语音情感控制方法及系统
KR102226427B1 (ko) 호칭 결정 장치, 이를 포함하는 대화 서비스 제공 시스템, 호칭 결정을 위한 단말 장치 및 호칭 결정 방법
CN114708849A (zh) 语音处理方法、装置、计算机设备及计算机可读存储介质
CN114373443A (zh) 语音合成方法和装置、计算设备、存储介质及程序产品
CN112037772A (zh) 基于多模态的响应义务检测方法、系统及装置
US11792365B1 (en) Message data analysis for response recommendations
CN116959421B (zh) 处理音频数据的方法及装置、音频数据处理设备和介质
KR102574311B1 (ko) 음성 합성 서비스를 제공하는 장치, 단말기 및 방법
KR102426020B1 (ko) 한 화자의 적은 음성 데이터로 감정 운율을 담은 음성 합성 방법 및 장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20825708; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20825708; Country of ref document: EP; Kind code of ref document: A1)