WO2020253509A1 - Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium - Google Patents


Info

Publication number
WO2020253509A1
Authority
WO
WIPO (PCT)
Prior art keywords: speech, emotion, input, emotional state, synthesized
Application number
PCT/CN2020/093564
Other languages
French (fr)
Chinese (zh)
Inventor
彭话易
王健宗
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020253509A1 publication Critical patent/WO2020253509A1/en


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L 2015/223: Execution procedure of a spoken command
    • G10L 2015/225: Feedback of the input speech

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a Chinese speech synthesis method, device and storage medium oriented to situations and emotions.
  • This application provides a context- and emotion-oriented Chinese speech synthesis method, device, and storage medium to solve the problem in the prior art that human-computer interaction is difficult to sustain because speech is always synthesized with a fixed tone and emotion.
  • One aspect of this application provides a context- and emotion-oriented Chinese speech synthesis method, including: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • Another aspect of this application provides an electronic device, which includes a processor and a memory. The memory stores a context- and emotion-oriented Chinese speech synthesis program. When the Chinese speech synthesis program is executed by the processor, the steps of the Chinese speech synthesis method are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • Another aspect of the present application provides a computer device, which includes a memory, a processor, and a context- and emotion-oriented Chinese speech synthesis program stored in the memory and executable on the processor. When the Chinese speech synthesis program is executed by the processor, the following steps are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • Yet another aspect of the present application provides a computer-readable storage medium, which includes a context- and emotion-oriented Chinese speech synthesis program. When the Chinese speech synthesis program is executed by a processor, the steps of the Chinese speech synthesis method are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • This application analyzes the emotional state of the input speech and derives the emotional state of the synthesized speech from it. During speech synthesis, emotional state and context analysis are added so that the tone and emotion of the synthesized speech match the current interaction scene instead of being fixed. In the process of human-computer interaction, the output synthesized speech is therefore more like a real person, enhancing the user experience.
  • Figure 1 is a schematic flow chart of the context- and emotion-oriented Chinese speech synthesis method described in this application;
  • Figure 2 is a schematic structural diagram of the emotion recognition model in this application;
  • Figure 3 is a schematic structural diagram of the convolutional recurrent neural network in this application;
  • Figure 4 is a schematic structural diagram of the sentiment classification model in this application;
  • Figure 5 is a schematic diagram of the modules of the context- and emotion-oriented Chinese speech synthesis program in this application.
  • Figure 1 is a schematic flow chart of the context- and emotion-oriented Chinese speech synthesis method described in this application. As shown in Figure 1, the context- and emotion-oriented Chinese speech synthesis method described in this application includes the following steps:
  • Step S1: Obtain the input speech; the input speech is the speech to be responded to. For example, in a human-computer interaction system with a smart speaker, the input speech is the user's query, and the synthesized speech is the smart speaker's feedback to the user. This application derives the emotional state of the synthesized speech from the emotional state of the user's query, so that the smart speaker's feedback carries a specific tone and emotion that matches the emotion of the user's input speech;
  • Step S2: Input the input speech into an emotion analysis model, and output the emotional state of the input speech through the emotion analysis model;
  • Step S3: Determine the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech. When determining the emotional state of the synthesized speech, the dialogue scene is added as an influencing factor, so that the feedback obtained in human-computer interaction not only responds to the user's emotion but also fits the actual application scene and avoids errors. For example, in a sales scene, even if the emotion expressed by the customer's input speech is impatient, after the dialogue scene is taken into account the emotional state of the synthesized speech should still be happy and positive, so as to serve the customer well;
  • Step S4: Perform speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech, where the text to be synthesized is the feedback text determined by the intelligent system according to the input speech during human-computer interaction.
  • By analyzing the emotional state of the input speech, this application obtains the emotional state of the synthesized speech. During speech synthesis, emotional state and context analysis are added so that the tone and emotion of the synthesized speech match the current interaction scene, and the output synthesized speech in human-computer interaction is more like a real person.
  • Preferably, before the step of inputting the input speech into the emotion analysis model, the Chinese speech synthesis method further includes: judging whether an interaction scene exists according to the input speech and the text to be synthesized. If there is no interaction scene, a preset emotional state of the synthesized speech is used, or the default emotional state of the synthesized speech is adopted, and no emotion analysis is performed on the input speech; if an interaction scene exists, the next step is performed, that is, the input speech is fed into the emotion analysis model for emotional state analysis.
  • The preset emotional state of the synthesized speech can be impassioned or calm and gentle, and can be set according to the role or purpose of the human-computer interaction.
  • For example, for an intelligent question-answering system, the default emotional state of the synthesized feedback speech is calm and gentle. If the input speech only involves a simple inquiry and no interaction scene, it is sufficient to determine the content of the text to be synthesized from the input speech and output it with a calm, gentle emotion. For instance, if a user asks "What is the temperature in Beijing today", the question-answering system only needs to reply "Today's temperature in Beijing is ×× degrees Celsius" in the default tone and emotion, without performing emotion analysis on the input speech.
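  • The following is a minimal, illustrative sketch of this pre-check. The helper function and the default emotion label are hypothetical placeholders for the components described in this application, not part of the original disclosure.

```python
# A minimal sketch of the interaction-scene pre-check, assuming hypothetical helpers.
DEFAULT_EMOTION = "calm and gentle"   # assumed default for a Q&A-style system

def has_interaction_scene(input_speech: str, text_to_synthesize: str) -> bool:
    """Stub: a real system would run the context analysis described below."""
    return False   # e.g. a plain factual query such as a weather question

def respond(input_speech: str, text_to_synthesize: str) -> dict:
    if not has_interaction_scene(input_speech, text_to_synthesize):
        # No interaction scene: skip emotion analysis, use the default emotional state.
        return {"text": text_to_synthesize, "emotion": DEFAULT_EMOTION}
    # Otherwise the input speech is passed on to the emotion analysis model (steps S2 to S4).
    raise NotImplementedError("emotion analysis and synthesis are sketched further below")

print(respond("What is the temperature in Beijing today?",
              "Today's temperature in Beijing is 25 degrees Celsius."))
```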
  • In one embodiment of this application, the emotion analysis model includes a speech-based emotion recognition model. Figure 2 is a schematic structural diagram of the emotion recognition model in this application. As shown in Figure 2, the emotion recognition model is divided into three parts. The first part includes a convolutional recurrent neural network (CRNN) and a first fully connected (FC) layer; its input is the spectrogram, and a first feature vector is output through the CRNN and the first FC layer. The second part includes three successively connected second FC layers; its input is low-level speech feature descriptors (LLDs), including fundamental frequency, energy, zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) and other features, and it outputs a second feature vector. The third part includes a third FC layer and a normalization layer; its input is the fusion of the first feature vector and the second feature vector, and its output is a probability vector representing the emotion classification.
  • Figure 3 is a schematic structural diagram of the convolutional recurrent neural network in this application. As shown in Figure 3, the convolutional recurrent neural network includes: a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a long short-term memory (LSTM) layer, and a third pooling layer. The third pooling layer includes three pooling modules, namely a minimum pooling module, an average pooling module, and a maximum pooling module, and each pooling module is connected to every neuron in the LSTM layer. The convolutional recurrent neural network is trained on an emotional speech data set, which contains about 15 hours of speech data from multiple speakers together with the corresponding emotion labels.
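  • A rough PyTorch sketch of this three-part model is given below. All layer sizes, channel counts and the number of emotion classes are illustrative assumptions; only the overall wiring (CRNN plus first FC layer over the spectrogram, three second FC layers over the LLD statistics, and a third FC layer with softmax over the fused vector) follows the description above.

```python
import torch
import torch.nn as nn

class CRNNBranch(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                # first conv + first pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                # second conv + second pooling layer
        )
        self.lstm = nn.LSTM(input_size=32 * 16, hidden_size=64, batch_first=True)
        self.fc1 = nn.Linear(64 * 3, feat_dim)              # first fully connected layer

    def forward(self, spectrogram):                         # (batch, 1, 64 freq bins, frames)
        x = self.conv(spectrogram)                          # (batch, 32, 16, frames / 4)
        x = x.permute(0, 3, 1, 2).flatten(2)                # (batch, time, 32 * 16)
        x, _ = self.lstm(x)                                 # (batch, time, 64)
        # Third pooling layer: minimum, average and maximum pooling over time, concatenated.
        pooled = torch.cat([x.min(dim=1).values,
                            x.mean(dim=1),
                            x.max(dim=1).values], dim=-1)   # (batch, 64 * 3)
        return self.fc1(pooled)                             # first feature vector

class EmotionRecognitionModel(nn.Module):
    def __init__(self, n_lld_stats=32, n_emotions=5, feat_dim=128):
        super().__init__()
        self.spec_branch = CRNNBranch(feat_dim)
        self.lld_branch = nn.Sequential(                    # three second FC layers
            nn.Linear(n_lld_stats, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )
        self.head = nn.Sequential(                          # third FC layer + normalization layer
            nn.Linear(feat_dim * 2, n_emotions),
            nn.Softmax(dim=-1),
        )

    def forward(self, spectrogram, lld_stats):
        v1 = self.spec_branch(spectrogram)                  # first feature vector
        v2 = self.lld_branch(lld_stats)                     # second feature vector
        fused = torch.cat([v1, v2], dim=-1)                 # fusion feature vector
        return self.head(fused)                             # probability vector over emotions

probs = EmotionRecognitionModel()(torch.randn(2, 1, 64, 200), torch.randn(2, 32))
```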
  • In one embodiment of this application, the step of outputting the emotional state of the input speech through the emotion analysis model includes: obtaining the spectrogram and speech feature parameters of the input speech; feeding the spectrogram of the input speech into the trained convolutional recurrent neural network (CRNN) of the emotion recognition model and outputting a first feature vector through the CRNN and the first fully connected layer; computing high-level statistics functions (HSF features, i.e., statistics such as the mean or maximum of the feature parameters over the multiple frames of a speech segment) from the speech feature parameters, feeding them into the emotion recognition model, and outputting a second feature vector through the three second fully connected layers of the emotion recognition model; concatenating the first feature vector and the second feature vector to obtain a fusion feature vector; passing the fusion feature vector through the third fully connected layer and the normalization layer of the emotion recognition model to output a first probability vector of the emotion of the input speech; and obtaining the emotional state of the input speech according to the first probability vector.
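  • The sketch below illustrates the kind of feature extraction these steps imply: a spectrogram for the CRNN branch and HSF statistics (mean and maximum of frame-level LLDs) for the fully connected branch. The specific LLD set, the log-mel representation and the use of librosa are assumptions made for illustration; the application itself lists fundamental frequency, energy, zero-crossing rate, MFCC and LPCC as example LLDs.

```python
import numpy as np
import librosa

sr = 16000
y = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # 1 s synthetic tone as stand-in audio

# Spectrogram input for the CRNN branch (log-mel here, 64 bins assumed).
spectrogram = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64))

# Frame-level LLDs: fundamental frequency, energy, zero-crossing rate, MFCCs.
f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)
energy = librosa.feature.rms(y=y)[0]
zcr = librosa.feature.zero_crossing_rate(y)[0]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# HSF statistics: collapse each frame-level track to a segment-level mean and maximum.
llds = [f0, energy, zcr] + list(mfcc)
hsf = np.array([stat(track) for track in llds for stat in (np.mean, np.max)])

print(spectrogram.shape, hsf.shape)   # e.g. (64, frames) and (32,)
```

  • The resulting 32-dimensional HSF vector and the spectrogram could, for example, feed the two branches of the model sketched above.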
  • In one embodiment of this application, the sentiment analysis model includes a text-based sentiment classification model. Figure 4 is a schematic structural diagram of the sentiment classification model in this application. As shown in Figure 4, the sentiment classification model includes a feature extraction layer and a classifier. The feature extraction layer is used to extract features of the input text and vectorize it: it includes an input layer for the text to be classified, converts the text to be classified into multiple word vectors, and constructs a sentence vector from the word vectors, for example by using an open-source BERT (Bidirectional Encoder Representations from Transformers) model. The classifier is composed of an LSTM neural network and includes an input layer, a hidden layer, and an output layer, where the input layer contains 256 input nodes for the sentence vector, the hidden layer contains 128 hidden nodes, and the output layer uses a softmax function to output emotion labels and their probabilities.
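  • A hedged sketch of one way to realize this text-based path is shown below. The choice of the 'bert-base-chinese' checkpoint, the emotion label set and the exact wiring between the BERT features and the LSTM classifier are assumptions; the application only specifies an open-source BERT feature extractor, a 256-node input layer, 128 hidden nodes and a softmax output.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

EMOTIONS = ["happy", "sad", "impatient", "excited", "agitated"]   # assumed label set

class TextEmotionClassifier(nn.Module):
    def __init__(self, n_emotions=len(EMOTIONS)):
        super().__init__()
        self.proj = nn.Linear(768, 256)                    # map BERT features to the 256-d input layer
        self.lstm = nn.LSTM(256, 128, batch_first=True)    # 128 hidden nodes
        self.out = nn.Sequential(nn.Linear(128, n_emotions), nn.Softmax(dim=-1))

    def forward(self, word_vectors):                       # (batch, tokens, 768) from BERT
        x = self.proj(word_vectors)
        _, (h, _) = self.lstm(x)                           # final hidden state summarizes the sentence
        return self.out(h[-1])                             # second probability vector over emotions

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed open-source BERT model
bert = BertModel.from_pretrained("bert-base-chinese")
inputs = tokenizer("今天我要穿这件衣服,你觉得好看吗?", return_tensors="pt")
with torch.no_grad():
    word_vectors = bert(**inputs).last_hidden_state               # feature extraction layer
probs = TextEmotionClassifier()(word_vectors)
print(dict(zip(EMOTIONS, probs[0].tolist())))
```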
  • In this embodiment, the step of outputting the emotional state of the input speech through the emotion analysis model includes: converting the input speech into text to be classified through speech recognition; extracting the text feature vector of the text to be classified; inputting the text feature vector into the deep neural network classifier in the sentiment classification model; obtaining a second probability vector of the emotion of the input speech through the classifier; and obtaining the emotional state of the input speech according to the second probability vector.
  • The sentiment analysis model may include only one of the speech-based emotion recognition model and the text-based sentiment classification model, or it may include both. Preferably, it includes both: the probability vectors representing the speech emotion are obtained from the two models, and a comprehensive analysis is performed on the results of both models to improve the accuracy of the sentiment analysis.
  • In this case, the step of outputting the emotional state of the input speech through the emotion analysis model includes: obtaining a first probability vector of the emotion of the input speech through the speech-based emotion recognition model and obtaining first confidence levels for multiple speech emotions from the first probability vector; obtaining a second probability vector of the emotion of the input speech through the text-based sentiment classification model and obtaining second confidence levels for the multiple speech emotions from the second probability vector; adding the first confidence level and the second confidence level of each speech emotion to obtain a confidence vector over the multiple speech emotions; and selecting the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech.
  • For example, suppose the possible emotional states of the speech include happy, sad, impatient, excited, and agitated. If the first confidence levels corresponding to these emotional states are 0.6, 0.2, 0.3, 0.4, and 0.7, and the second confidence levels are 0.8, 0.3, 0.2, 0.5, and 0.5, then adding the corresponding first and second confidence levels gives final confidence levels of 1.4, 0.5, 0.5, 0.9, and 1.2, and the emotional state corresponding to the maximum confidence of 1.4 (happy) is selected as the emotional state of the input speech.
  • In another embodiment, the two results obtained from the speech-based emotion recognition model and the text-based sentiment classification model, which respectively represent the confidence levels of the various emotional states of the input speech, are given different weights, and the weighted confidence levels are added to predict the final speech emotional state. For example, set a weight of 0.6 for the emotion recognition model and a weight of 0.4 for the sentiment classification model. Suppose again that the possible speech emotional states include happy, sad, impatient, excited, and agitated, the first confidence levels are 0.6, 0.2, 0.3, 0.4, and 0.7, and the second confidence levels are 0.8, 0.3, 0.2, 0.5, and 0.5. Adding the corresponding first and second confidence levels according to the set weights gives final confidence levels of 0.68, 0.24, 0.26, 0.44, and 0.62, and the emotional state corresponding to the maximum confidence of 0.68 (happy) is selected as the emotional state of the input speech.
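  • The small sketch below reproduces both fusion rules with the numbers from the examples above (plain addition, and the weighted variant with weights 0.6 and 0.4); the emotion names are the ones assumed earlier.

```python
import numpy as np

emotions = ["happy", "sad", "impatient", "excited", "agitated"]
first  = np.array([0.6, 0.2, 0.3, 0.4, 0.7])   # speech-based emotion recognition model
second = np.array([0.8, 0.3, 0.2, 0.5, 0.5])   # text-based sentiment classification model

summed   = first + second                       # -> [1.4, 0.5, 0.5, 0.9, 1.2]
weighted = 0.6 * first + 0.4 * second           # -> [0.68, 0.24, 0.26, 0.44, 0.62]

print(emotions[int(np.argmax(summed))])         # 'happy' (confidence 1.4)
print(emotions[int(np.argmax(weighted))])       # 'happy' (confidence 0.68)
```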
  • In one embodiment, determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech includes: constructing a scene library, where the scene library includes a variety of dialogue scenes and the emotional state corresponding to each dialogue scene; the dialogue scenes and their corresponding emotional states can be annotated manually, according to the specific situation combined with human cognition, so as to predefine the emotional state required for the synthesized speech in certain specific dialogue scenes, because even for a response to input speech with the same emotion, the emotional state of the synthesized speech to be fed back may differ across dialogue scenes; performing context analysis according to the input speech and the text to be synthesized to obtain the dialogue scene of the text to be synthesized; querying the scene library for the emotional state corresponding to that dialogue scene; and determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
  • For example, in a sales scene, even if the emotion expressed by the customer's input speech is impatient, the emotional state of the synthesized speech should be happy and positive so as to serve the customer well. In this way, the feedback obtained during human-computer interaction better matches the real application scene and the user experience is enhanced.
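  • An illustrative sketch of the scene library and the emotion determination step follows. The scene labels, emotion names and combination rule are hypothetical; the application only specifies that manually annotated scene-to-emotion mappings are combined with the emotional state of the input speech.

```python
SCENE_LIBRARY = {                      # manually annotated dialogue scenes -> target emotion
    "sales": "happy_positive",
    "customer_complaint": "apologetic_calm",
    "casual_chat": "mirror_user",      # special value: follow the user's emotion
}

def determine_synth_emotion(dialogue_scene: str, input_emotion: str,
                            default: str = "calm_gentle") -> str:
    scene_emotion = SCENE_LIBRARY.get(dialogue_scene, default)
    if scene_emotion == "mirror_user":
        return input_emotion           # e.g. reply excitedly to an excited user
    return scene_emotion               # scene requirement wins, e.g. stay upbeat in a sales scene

print(determine_synth_emotion("sales", "impatient"))        # happy_positive
print(determine_synth_emotion("casual_chat", "excited"))    # excited
```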
  • In one embodiment, performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized includes: embedding modal auxiliary words into the text to be synthesized through waveform splicing technology; controlling the tone and prosody of the synthesized speech through end-to-end synthesis technology; and performing speech synthesis according to the embedded modal auxiliary words, tone, and prosody, so that the synthesized speech expresses the corresponding tone and emotion.
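  • Purely as an illustration of how these pieces could be orchestrated, the sketch below maps an emotional state to modal particles and tone/prosody controls. Every function, particle choice and parameter name here is a hypothetical stub; in a full system the particle audio would come from a waveform-splicing voice bank and the controls would drive an end-to-end synthesizer.

```python
MODAL_PARTICLES = {"happy_positive": "呀", "calm_gentle": "哦"}   # assumed particle choices

def embed_modal_particles(text: str, emotion: str) -> str:
    particle = MODAL_PARTICLES.get(emotion, "")
    return text + particle if particle else text

def prosody_controls(emotion: str) -> dict:
    # Hypothetical mapping from emotional state to tone/prosody parameters.
    table = {"happy_positive": {"pitch_shift": +2, "rate": 1.1},
             "calm_gentle":    {"pitch_shift":  0, "rate": 0.95}}
    return table.get(emotion, {"pitch_shift": 0, "rate": 1.0})

def synthesize(text: str, emotion: str) -> dict:
    enriched = embed_modal_particles(text, emotion)
    controls = prosody_controls(emotion)
    # A real system would call an end-to-end TTS model here with these controls.
    return {"text": enriched, **controls}

print(synthesize("我觉得非常好看", "happy_positive"))
```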
  • The context- and emotion-oriented Chinese speech synthesis method described in this application is applied to an electronic device, which may be a terminal device such as a television, a smart phone, a tablet computer, or a computer. The electronic device includes a processor and a memory for storing a context- and emotion-oriented Chinese speech synthesis program. The processor executes the Chinese speech synthesis program to realize the following steps of the context- and emotion-oriented Chinese speech synthesis method: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  • The electronic device also includes a network interface, a communication bus, and the like. The network interface may include a standard wired interface and a wireless interface, and the communication bus is used to realize connection and communication between the components.
  • The memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, hard disk, optical disk, or plug-in hard disk, but is not limited thereto; it may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides them to the processor so that the processor can execute them.
  • The software stored in the memory includes a context- and emotion-oriented Chinese speech synthesis program, which can be provided to the processor so that the processor executes it to realize the steps of the Chinese speech synthesis method.
  • the processor can be a central processing unit, a microprocessor, or other data processing chips, etc., and can run stored programs in the memory, for example, the context-oriented and emotional Chinese speech synthesis program in this application.
  • the electronic device may also include a display, and the display may also be called a display screen or a display unit.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, etc.
  • the display is used to display the information processed in the electronic device and to display the visual work interface.
  • The electronic device may further include a user interface, and the user interface may include an input unit (such as a keyboard), a voice output device (such as a speaker or earphones), and the like.
  • the context and emotion-oriented Chinese speech synthesis program can also be divided into one or more modules, and the one or more modules are stored in the memory and executed by the processor to complete the application.
  • the module referred to in this application refers to a series of computer program instruction segments that can complete specific functions.
  • Figure 5 is a schematic diagram of the modules of the context- and emotion-oriented Chinese speech synthesis program in this application. As shown in Figure 5, the Chinese speech synthesis program can be divided into: an acquisition module 1, an emotion analysis module 2, an emotion determination module 3, and a speech synthesis module 4.
  • The functions or operation steps implemented by the above modules are similar to those described above. For example: the acquisition module 1 obtains the input speech; the emotion analysis module 2 inputs the input speech into the emotion analysis model and outputs the emotional state of the input speech through the emotion analysis model; the emotion determination module 3 determines the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and the speech synthesis module 4 performs speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
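  • One possible way to express this module division in code is sketched below; the class and the lambda stubs are hypothetical wrappers around the components sketched in the previous examples, not part of the original disclosure.

```python
class ChineseSpeechSynthesisProgram:
    def __init__(self, acquisition, emotion_analysis, emotion_determination, synthesis):
        self.acquisition = acquisition                        # module 1: obtains the input speech
        self.emotion_analysis = emotion_analysis              # module 2: emotional state of the input
        self.emotion_determination = emotion_determination    # module 3: scene + input emotion
        self.synthesis = synthesis                            # module 4: emotional speech synthesis

    def run(self, text_to_synthesize, dialogue_scene):
        input_speech = self.acquisition()
        input_emotion = self.emotion_analysis(input_speech)
        synth_emotion = self.emotion_determination(dialogue_scene, input_emotion)
        return self.synthesis(text_to_synthesize, synth_emotion)

program = ChineseSpeechSynthesisProgram(
    acquisition=lambda: "user_query.wav",
    emotion_analysis=lambda speech: "excited",
    emotion_determination=lambda scene, emo: "happy_positive" if scene == "sales" else emo,
    synthesis=lambda text, emo: {"text": text, "emotion": emo},
)
print(program.run("我觉得非常好看", "casual_chat"))
```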
  • Preferably, the electronic device further includes a judgment module. Before the step of inputting the input speech into the emotion analysis model, the judgment module judges whether an interaction scene exists according to the input speech and the text to be synthesized. If there is no interaction scene, a preset emotional state of the synthesized speech is used, or the default emotional state of the synthesized speech is adopted, and no emotion analysis is performed on the input speech; if an interaction scene exists, the next step is performed, that is, the input speech is fed into the emotion analysis model for emotional state analysis.
  • The preset emotional state of the synthesized speech can be impassioned or calm and gentle, and can be set according to the role or purpose of the human-computer interaction.
  • For example, for an intelligent question-answering system, the default emotional state of the synthesized feedback speech is calm and gentle. If the input speech only involves a simple inquiry and no interaction scene, it is sufficient to determine the content of the text to be synthesized from the input speech and output it with a calm, gentle emotion. For instance, if a user asks "What is the temperature in Beijing today", the question-answering system only needs to reply "Today's temperature in Beijing is ×× degrees Celsius" in the default tone and emotion, without performing emotion analysis on the input speech.
  • In one embodiment of this application, the emotion analysis model includes a speech-based emotion recognition model. Figure 2 is a schematic structural diagram of the emotion recognition model in this application. As shown in Figure 2, the emotion recognition model is divided into three parts. The first part includes a convolutional recurrent neural network (CRNN) and a first fully connected (FC) layer; its input is the spectrogram, and a first feature vector is output through the CRNN and the first FC layer. The second part includes three successively connected second FC layers; its input is low-level speech feature descriptors (LLDs), including fundamental frequency, energy, zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) and other features, and it outputs a second feature vector. The third part includes a third FC layer and a normalization layer; its input is the fusion of the first feature vector and the second feature vector, and its output is a probability vector representing the emotion classification.
  • Figure 3 is a schematic structural diagram of the convolutional recurrent neural network in this application. As shown in Figure 3, the convolutional recurrent neural network includes: a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a long short-term memory (LSTM) layer, and a third pooling layer. The third pooling layer includes three pooling modules, namely a minimum pooling module, an average pooling module, and a maximum pooling module, and each pooling module is connected to every neuron in the LSTM layer. The convolutional recurrent neural network is trained on an emotional speech data set, which contains about 15 hours of speech data from multiple speakers together with the corresponding emotion labels.
  • In one embodiment, the emotion analysis module includes: a parameter acquisition unit, which acquires the spectrogram and speech feature parameters of the input speech; a first feature vector acquisition unit, which feeds the spectrogram of the input speech into the trained convolutional recurrent neural network of the emotion recognition model and outputs a first feature vector through the convolutional recurrent neural network and the first fully connected layer; a second feature vector acquisition unit, which obtains HSF features from the speech feature parameters (statistics such as the mean or maximum of the feature parameters over the multiple frames of a speech segment), feeds them into the emotion recognition model, and outputs a second feature vector through the three second fully connected layers of the emotion recognition model; a feature fusion unit, which concatenates the first feature vector and the second feature vector to obtain a fusion feature vector; a first probability vector acquisition unit, which passes the fusion feature vector through the third fully connected layer and the normalization layer of the emotion recognition model to output a first probability vector of the emotion of the input speech; and an emotional state output unit, which obtains the emotional state of the input speech according to the first probability vector.
  • In one embodiment of this application, the sentiment analysis model includes a text-based sentiment classification model. Figure 4 is a schematic structural diagram of the sentiment classification model in this application. As shown in Figure 4, the sentiment classification model includes a feature extraction layer and a classifier. The feature extraction layer is used to extract features of the input text and vectorize it: it includes an input layer for the text to be classified, converts the text to be classified into multiple word vectors, and constructs a sentence vector from the word vectors, for example by using an open-source BERT (Bidirectional Encoder Representations from Transformers) model. The classifier is composed of an LSTM neural network and includes an input layer, a hidden layer, and an output layer, where the input layer contains 256 input nodes for the sentence vector, the hidden layer contains 128 hidden nodes, and the output layer uses a softmax function to output emotion labels and their probabilities.
  • In this embodiment, the emotion analysis module includes: a text conversion unit, which converts the input speech into text to be classified through speech recognition; a feature extraction unit, which extracts the text feature vector of the text to be classified and inputs the text feature vector into the deep neural network classifier in the sentiment classification model; a second probability vector acquisition unit, which obtains a second probability vector of the emotion of the input speech through the classifier; and an emotional state output unit, which obtains the emotional state of the input speech according to the second probability vector.
  • The sentiment analysis model may include only one of the speech-based emotion recognition model and the text-based sentiment classification model, or it may include both. Preferably, it includes both: the probability vectors representing the speech emotion are obtained from the two models, and a comprehensive analysis is performed on the results of both models to improve the accuracy of the sentiment analysis.
  • In this case, the emotion analysis module includes: a first confidence acquisition unit, which obtains a first probability vector of the emotion of the input speech through the speech-based emotion recognition model and obtains first confidence levels for multiple speech emotions from the first probability vector; a second confidence acquisition unit, which obtains a second probability vector of the emotion of the input speech through the text-based sentiment classification model and obtains second confidence levels for the multiple speech emotions from the second probability vector; a confidence vector acquisition unit, which adds the first confidence level and the second confidence level of each speech emotion to obtain a confidence vector over the multiple speech emotions; and a selection unit, which selects the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech.
  • For example, suppose the possible emotional states of the speech include happy, sad, impatient, excited, and agitated. If the first confidence levels corresponding to these emotional states are 0.6, 0.2, 0.3, 0.4, and 0.7, and the second confidence levels are 0.8, 0.3, 0.2, 0.5, and 0.5, then adding the corresponding first and second confidence levels gives final confidence levels of 1.4, 0.5, 0.5, 0.9, and 1.2, and the emotional state corresponding to the maximum confidence of 1.4 (happy) is selected as the emotional state of the input speech.
  • In one embodiment, the emotion determination module includes: a construction unit, which constructs a scene library, where the scene library includes a variety of dialogue scenes and the emotional state corresponding to each dialogue scene; the dialogue scenes and their corresponding emotional states can be annotated manually, according to the specific situation combined with human cognition, so as to predefine the emotional state required for the synthesized speech in certain specific dialogue scenes, because even for a response to input speech with the same emotion, the emotional state of the synthesized speech to be fed back may differ across dialogue scenes; a context analysis unit, which performs context analysis according to the input speech and the text to be synthesized and obtains the dialogue scene of the text to be synthesized; a query unit, which obtains the emotional state corresponding to the dialogue scene of the text to be synthesized from the scene library; and an emotional state determining unit, which determines the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
  • For example, in a sales scene, even if the emotion expressed by the customer's input speech is impatient, the emotional state of the synthesized speech should be happy and positive so as to serve the customer well. In this way, the feedback obtained during human-computer interaction better matches the real application scene and the user experience is enhanced.
  • In one embodiment, the speech synthesis module includes: a modal particle embedding unit, which embeds modal auxiliary words into the text to be synthesized through waveform splicing technology; a prosody control unit, which controls the tone and prosody of the synthesized speech through end-to-end synthesis technology; and a speech synthesis unit, which performs speech synthesis according to the embedded modal auxiliary words, tone, and prosody, so that the synthesized speech expresses the corresponding tone and emotion.
  • The computer-readable storage medium may be any tangible medium that contains or stores a program or instructions; the program can be executed, and the stored program instructs the related hardware to realize the corresponding functions. The computer-readable storage medium may be a computer disk, a hard disk, a random access memory, a read-only memory, and the like, but the application is not limited thereto; it may be any device that stores instructions or software and any related data files or data structures in a non-transitory manner and can provide them to a processor so that the processor executes the programs or instructions therein.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium includes a context- and emotion-oriented Chinese speech synthesis program; when the Chinese speech synthesis program is executed by the processor, the Chinese speech synthesis method described above is realized.

Abstract

The present application belongs to the technical field of artificial intelligence, and discloses a situation- and emotion-oriented Chinese speech synthesis method, device, and storage medium, said method comprising: obtaining input speech; inputting said input speech into a sentiment analysis model, and outputting the emotional state of the input speech by means of said sentiment analysis model; determining the emotional state of synthesized speech according to a dialogue situation and the emotional state of the input speech; performing speech synthesis according to the emotional state of the synthesized speech and text to be synthesized which has been determined on the basis of the input speech. In the present application, the emotional state of the input speech is analyzed, and the emotional state of the synthesized speech is obtained according to the emotional state of the input speech; during speech synthesis, the emotional state and situation analysis is added to cause the intonation and emotion of the synthesized speech to match a current interaction situation rather than being of fixed intonation and emotion; during the process of human-computer interaction, the outputted synthesized speech is more similar to the speech of a real person, thus enhancing user experience.

Description

Situation- and emotion-oriented Chinese speech synthesis method, device and storage medium
Under the Paris Convention, this application claims priority to the Chinese patent application No. CN201910531628.7, filed on June 19, 2019 and entitled "Situation- and emotion-oriented Chinese speech synthesis method, device and storage medium"; the entire content of that Chinese patent application is incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a situation- and emotion-oriented Chinese speech synthesis method, device and storage medium.
Background art
With the rapid development of computer technology, people's requirements for speech synthesis systems have become higher and higher, from the initial "being intelligible" to today's "sounding like a real person". There are three main technical solutions for existing speech synthesis systems: parametric synthesis, waveform splicing, and end-to-end synthesis based on deep learning. Speech synthesized by waveform splicing has very high sound quality, but producing the required speech library is very time-consuming and labor-intensive, usually requiring more than 30 hours of recording as well as the related cutting and labeling work. Existing end-to-end speech synthesis can also produce speech with high sound quality and very good prosody, and the required training speech library usually only takes about 15 hours; compared with waveform splicing, its synthesis speed is slightly slower and its implementation requires a GPU, so the cost is relatively high. Although the speech synthesized by existing speech synthesis systems is good in sound quality, there is still a gap compared with real people. The main reason for this gap is that the same speech system always synthesizes speech with the same tone and the same emotion, whereas when humans speak, tone and emotion change constantly and are closely related to the speaking scene and the content of the speech. When the tone and emotion of the synthesized speech do not match the current scene, even if the synthesized speech has good sound quality, it still feels fake, because it does not match our cognition. For example, smart speakers are now widely available in the market, and speech synthesis systems enable smart speakers to communicate with humans. Suppose a girl has the following conversation with a smart speaker:
Girl: I am going to wear this dress today, do you think it looks good? (excited and happy tone)
Smart speaker: I think it looks very good. (very plain, fixed tone)
Dialogues like the one above now often occur in intelligent interaction between humans and machines. The inventor realized that when a human speaks with a certain emotion but the speech synthesis system responds with speech synthesized in its fixed tone and emotion, the experience makes humans feel that the synthesized speech is not like a real person, making it difficult for the human-computer interaction to continue well and affecting the user experience of the machine.
Technical problem
This application provides a situation- and emotion-oriented Chinese speech synthesis method, device and storage medium to solve the problem in the prior art that human-computer interaction is difficult to sustain because speech is always synthesized with a fixed tone and emotion.
Technical solutions
To achieve the above objective, one aspect of this application provides a situation- and emotion-oriented Chinese speech synthesis method, including: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
To achieve the above objective, another aspect of this application provides an electronic device, which includes a processor and a memory. The memory stores a situation- and emotion-oriented Chinese speech synthesis program. When the Chinese speech synthesis program is executed by the processor, the steps of the Chinese speech synthesis method are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
To achieve the above objective, another aspect of this application provides a computer device, which includes a memory, a processor, and a situation- and emotion-oriented Chinese speech synthesis program stored in the memory and executable on the processor. When the Chinese speech synthesis program is executed by the processor, the following steps are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
To achieve the above objective, yet another aspect of this application provides a computer-readable storage medium, which includes a situation- and emotion-oriented Chinese speech synthesis program. When the Chinese speech synthesis program is executed by a processor, the steps of the Chinese speech synthesis method are realized: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
Beneficial effects
Compared with the prior art, this application has the following advantages and beneficial effects:
This application analyzes the emotional state of the input speech and derives the emotional state of the synthesized speech from it. During speech synthesis, emotional state and context analysis are added so that the tone and emotion of the synthesized speech match the current interaction scene instead of being fixed. In the process of human-computer interaction, the output synthesized speech is therefore more like a real person, enhancing the user experience.
Description of the drawings
Figure 1 is a schematic flow chart of the situation- and emotion-oriented Chinese speech synthesis method described in this application;
Figure 2 is a schematic structural diagram of the emotion recognition model in this application;
Figure 3 is a schematic structural diagram of the convolutional recurrent neural network in this application;
Figure 4 is a schematic structural diagram of the sentiment classification model in this application;
Figure 5 is a schematic diagram of the modules of the situation- and emotion-oriented Chinese speech synthesis program in this application.
The realization of the purpose of this application, its functional characteristics and its advantages will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
本发明的实施方式Embodiments of the invention
下面将参考附图来描述本申请所述的实施例。本领域的普通技术人员可以认识到,在不偏离本申请的精神和范围的情况下,可以用各种不同的方式或其组合对所描述的实施例进行修正。因此,附图和描述在本质上是说明性的,仅仅用以解释本申请,而不是用于限制权利要求的保护范围。此外,在本说明书中,附图未按比例画出,并且相同的附图标记表示相同的部分。The embodiments described in this application will be described below with reference to the drawings. A person of ordinary skill in the art may realize that the described embodiments can be modified in various different ways or combinations thereof without departing from the spirit and scope of the present application. Therefore, the drawings and description are illustrative in nature, and are only used to explain the application, rather than to limit the protection scope of the claims. In addition, in this specification, the drawings are not drawn to scale, and the same reference numerals denote the same parts.
图1为本申请所述面向情景及情感的中文语音合成方法的流程示意图,如图1所示,本申请所述面向情景及情感的中文语音合成方法,包括以下步骤:Figure 1 is a schematic flow chart of the context- and emotion-oriented Chinese speech synthesis method described in this application. As shown in Figure 1, the context- and emotion-oriented Chinese speech synthesis method described in this application includes the following steps:
步骤S1、获取输入语音,输入语音为待反馈的语音,例如,在人机交互系统中,对于智能音箱,输入语音就是用户的询问等,合成语音为智能音箱对用户的反馈,本申请即为根据用户的询问语音的情感状态,得出合成语音的情感状态,使得智能音箱的反馈具有特定的语气和情绪,符合用户输入语音的情绪;Step S1: Obtain the input voice, the input voice is the voice to be fed back. For example, in the human-computer interaction system, for the smart speaker, the input voice is the user's inquiry, etc., and the synthesized voice is the feedback of the smart speaker to the user. This application is According to the emotional state of the user's inquiry voice, the emotional state of the synthesized voice is obtained, so that the feedback of the smart speaker has a specific tone and emotion, which is consistent with the emotion of the user's input voice;
步骤S2、将所述输入语音输入情感分析模型,通过所述情感分析模型输出所述输入语音的情感状态;Step S2: Input the input speech into an emotion analysis model, and output the emotion state of the input speech through the emotion analysis model;
步骤S3、根据对话场景以及所述输入语音的情感状态确定合成语音的情感状态,在确定合成语音的情感状态时,增加对话场景的影响因素,使得人机交互得到的反馈不仅满足对用户情感上的反馈,且更加符合实际应用场景,避免出错,例如,对于一个推销场景,即使客户输入的语音表达的情感是不耐烦的,加入对话场景这一影响因素之后,得出合成语音的情感状态也应该是开心积极的,以良好的服务客户;Step S3: Determine the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech. When determining the emotional state of the synthesized speech, increase the influencing factors of the dialogue scene, so that the feedback obtained by human-computer interaction is not only satisfied with the user's emotion The feedback is more in line with actual application scenarios and avoid errors. For example, for a sales scene, even if the emotion expressed by the voice input by the customer is impatient, after adding the influence factor of the dialogue scene, the emotional state of the synthesized speech is also Should be happy and active, serve customers well;
步骤S4、根据所述合成语音的情感状态以及基于所述输入语音确定的待合成文本进行语音合成,其中,待合成文本是在进行人机交互时,智能系统根据输入语音确定的待反馈文本。Step S4: Perform speech synthesis according to the emotional state of the synthesized speech and the to-be-synthesized text determined based on the input speech, where the to-be-synthesized text is the text to be fed back determined by the intelligent system according to the input speech during human-computer interaction.
本申请通过对输入的语音进行情感状态分析,获取合成语音的情感状态,在进行语音合成时,加入情感状态以及情景分析,使得合成语音的语气和情绪符合当前交互场景,在人机交互过程中,输出的合成语音更像真人。This application obtains the emotional state of the synthesized speech by analyzing the emotional state of the input speech. When performing speech synthesis, the emotional state and context analysis are added to make the tone and emotion of the synthesized speech conform to the current interaction scene. In the process of human-computer interaction , The output synthesized speech is more like a real person.
优选地,将所述输入语音输入情感分析模型的步骤之前,所述中文语音合成方法还包括:根据输入语音和待合成文本判断是否存在交互场景,若不存在交互场景,则设定所述合成语音的情感状态或采用合成语音的默认情感状态,不再对输入语音进行情感分析;若存在交互场景,则进行下一步,将所述输入语音输入情感分析模型中,对输入语音进行情感状态分析。其中,设定的合成语音的情感状态可以是慷慨激昂的,也可以是平缓温和的,具体可以根据人机交互所起的作用或目的而设定。例如,对于智能问答系统,反馈的合成语音的默认情感状态为平缓温和的,若输入语音仅涉及到对某一问题的咨询,而不涉及交互场景,则根据输入语音确定待合成文本的内容即可,以平缓温和的情绪输出待合成文本即可满足用户需求。例如,用户询问“今天的北京气温是多少”,问答系统只需以默认的语气情绪回复“今天的北京气温为××摄氏度”即可,而不必对输入语音进行情感分析。Preferably, before the step of inputting the input speech into the emotion analysis model, the Chinese speech synthesis method further includes: judging whether there is an interaction scene according to the input speech and the text to be synthesized, and if there is no interaction scene, setting the synthesis The emotional state of the voice or the default emotional state of the synthesized voice is used, and no emotional analysis is performed on the input voice; if there is an interactive scene, the next step is to input the input voice into the emotional analysis model, and perform emotional state analysis on the input voice . Among them, the set emotional state of the synthesized speech can be impassioned or gentle and gentle, and can be specifically set according to the role or purpose of the human-computer interaction. For example, for an intelligent question answering system, the default emotional state of the feedback synthesized speech is gentle and gentle. If the input speech only involves consultation on a certain question, and does not involve an interactive scene, the content of the text to be synthesized is determined based on the input speech. Yes, outputting the text to be synthesized with a gentle and gentle emotion can meet user needs. For example, if a user asks "What's the temperature in Beijing today", the Q&A system only needs to reply "Today's temperature in Beijing is ××°C" in the default tone and emotion, instead of performing sentiment analysis on the input voice.
In one embodiment of the present application, the emotion analysis model includes a speech-based emotion recognition model. Fig. 2 is a schematic structural diagram of the emotion recognition model in this application. As shown in Fig. 2, the emotion recognition model is divided into three parts. The first part includes a convolutional recurrent neural network (CRNN) and one first fully connected (FC) layer; its input is the spectrogram, and a first feature vector is output through the CRNN and the first FC layer. The second part includes three cascaded second FC layers; its input consists of low-level speech feature parameters (LLDs), including fundamental frequency, energy, zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) and other features, and it outputs a second feature vector. The third part includes one third FC layer and a normalization layer; its input is the fused feature vector of the first and second feature vectors, and its output is a probability vector representing the emotion classification.
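The following PyTorch sketch illustrates this three-part structure; only the overall topology (spectrogram branch with one FC layer, LLD branch with three FC layers, concatenation followed by a final FC layer and normalization) follows the description, while the layer widths, the reduced stand-in for the CRNN, and the number of emotion classes are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class SpeechEmotionRecognizer(nn.Module):
    """Sketch of the three-part speech-based emotion recognition model (Fig. 2)."""
    def __init__(self, n_lld_stats=40, n_emotions=5):
        super().__init__()
        # Part 1: CRNN over the spectrogram plus one first FC layer. A small
        # conv stack stands in for the full CRNN of Fig. 3 to keep this
        # sketch self-contained.
        self.crnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 128),
        )
        self.fc1 = nn.Linear(128, 64)
        # Part 2: three cascaded second FC layers over LLD statistics
        # (F0, energy, zero-crossing rate, MFCC, LPCC, ...).
        self.fc2 = nn.Sequential(
            nn.Linear(n_lld_stats, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        # Part 3: one third FC layer; softmax plays the role of the
        # normalization layer that yields the class probability vector.
        self.fc3 = nn.Linear(64 + 64, n_emotions)

    def forward(self, spectrogram, lld_stats):
        v1 = self.fc1(self.crnn(spectrogram))           # first feature vector
        v2 = self.fc2(lld_stats)                        # second feature vector
        fused = torch.cat([v1, v2], dim=-1)             # fused feature vector
        return torch.softmax(self.fc3(fused), dim=-1)   # emotion probabilities

# Example: a batch of 2 spectrograms (1 x 128 x 200) with 40-dim LLD statistics.
probs = SpeechEmotionRecognizer()(torch.randn(2, 1, 128, 200), torch.randn(2, 40))
```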
Fig. 3 is a schematic structural diagram of the convolutional recurrent neural network in this application. As shown in Fig. 3, the convolutional recurrent neural network includes a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a long short-term memory (LSTM) layer, and a third pooling layer, where the third pooling layer includes three pooling modules, namely a minimum pooling module, an average pooling module, and a maximum pooling module, and each pooling module is connected to every neuron of the LSTM layer. The convolutional recurrent neural network is trained on an emotional speech data set, which contains about 15 hours of speech data from multiple speakers together with the corresponding emotion labels.
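A PyTorch sketch of this backbone, following the layer order of Fig. 3, is shown below; the channel counts, kernel sizes, LSTM width, and the way the pooled spectrogram is reshaped into a frame sequence before the LSTM are assumptions, since the figure only fixes the sequence of layers.

```python
import torch
import torch.nn as nn

class CRNNBackbone(nn.Module):
    """Conv/pool x2 -> LSTM -> {min, average, max} pooling, as in Fig. 3 (sizes assumed)."""
    def __init__(self, lstm_hidden=64):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(2)
        self.lstm = nn.LSTM(input_size=32, hidden_size=lstm_hidden, batch_first=True)

    def forward(self, spectrogram):                   # (batch, 1, freq, time)
        x = self.pool1(torch.relu(self.conv1(spectrogram)))
        x = self.pool2(torch.relu(self.conv2(x)))     # (batch, 32, freq', time')
        # Collapse the frequency axis and treat the time axis as a frame sequence.
        x = x.mean(dim=2).transpose(1, 2)             # (batch, time', 32)
        h, _ = self.lstm(x)                           # (batch, time', lstm_hidden)
        # Third pooling layer: min, average and max statistics taken over all
        # time steps of every LSTM unit, then concatenated.
        pooled = torch.cat([h.min(dim=1).values,
                            h.mean(dim=1),
                            h.max(dim=1).values], dim=-1)
        return pooled                                 # (batch, 3 * lstm_hidden)

# Example: a 128-bin spectrogram with 200 frames yields a 192-dimensional vector.
features = CRNNBackbone()(torch.randn(2, 1, 128, 200))
```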
In one embodiment of the present application, the step of outputting the emotional state of the input speech through the emotion analysis model includes: obtaining the spectrogram and the speech feature parameters of the input speech; inputting the spectrogram of the input speech into the trained convolutional recurrent neural network (CRNN) of the emotion recognition model, and outputting a first feature vector through the CRNN and the first fully connected layer; obtaining high-level statistical features (HSFs, i.e. statistics such as the average or maximum of the feature parameters computed over the multiple frames of a speech segment) from the speech feature parameters, inputting them into the emotion recognition model, and outputting a second feature vector through the three second fully connected layers of the emotion recognition model; concatenating the first feature vector with the second feature vector to obtain a fused feature vector; passing the fused feature vector through the third fully connected layer and the normalization layer of the emotion recognition model to output a first probability vector of the emotion of the input speech; and obtaining the emotional state of the input speech according to the first probability vector.
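As a small illustration of the HSF step, the frame-level parameters can be aggregated over the whole utterance as follows; the mean and maximum statistics follow the description, while the particular set and number of LLDs used here are only an illustrative assumption.

```python
import numpy as np

def hsf_features(llds: np.ndarray) -> np.ndarray:
    """Aggregate frame-level LLDs (shape: n_frames x n_params) into a single
    utterance-level vector of per-parameter statistics (mean and maximum)."""
    return np.concatenate([llds.mean(axis=0), llds.max(axis=0)])

# 300 frames of illustrative LLDs: F0, energy, zero-crossing rate, 13 MFCCs, ...
frames = np.random.rand(300, 16)
stats = hsf_features(frames)   # 32-dimensional HSF vector fed to the FC branch
```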
In one embodiment of the present application, the emotion analysis model includes a text-based emotion classification model. Fig. 4 is a schematic structural diagram of the emotion classification model in this application. As shown in Fig. 4, the emotion classification model includes a feature extraction layer and a classifier. The feature extraction layer is used to extract features of the input text and to represent the input text as vectors; it includes an input layer for inputting the text to be classified, and an embedding layer for converting the text to be classified into multiple word vectors and constructing a sentence vector from those word vectors, for which, for example, the open-source BERT model (Bidirectional Encoder Representations from Transformers) can be used. The classifier is an LSTM neural network consisting of an input layer, a hidden layer, and an output layer; the input layer includes 256 input nodes for the sentence vector, the hidden layer includes 128 hidden nodes, and the output layer uses a softmax function to output emotion labels and their probabilities.
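The sketch below shows one way the classifier side could look in PyTorch; the 256-dimensional input and 128 hidden units follow the description, while the number of emotion classes is assumed, and random vectors stand in for the BERT-style word/sentence embeddings that the feature extraction layer would normally supply.

```python
import torch
import torch.nn as nn

class TextEmotionClassifier(nn.Module):
    """LSTM classifier over 256-dim text vectors: 256 inputs, 128 hidden units,
    softmax output of emotion labels and probabilities (class count assumed)."""
    def __init__(self, n_emotions=5):
        super().__init__()
        self.lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)
        self.out = nn.Linear(128, n_emotions)

    def forward(self, text_vectors):                 # (batch, seq_len, 256)
        _, (h_n, _) = self.lstm(text_vectors)
        return torch.softmax(self.out(h_n[-1]), dim=-1)

# In the described pipeline the 256-dim vectors would come from a BERT-style
# embedding of the recognized text; random vectors stand in for them here.
probs = TextEmotionClassifier()(torch.randn(2, 20, 256))
```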
In one embodiment of the present application, the step of outputting the emotional state of the input speech through the emotion analysis model includes: converting the input speech into text to be classified through speech recognition; extracting a text feature vector of the text to be classified; inputting the text feature vector into the deep neural network classifier of the emotion classification model; obtaining a second probability vector of the emotion of the input speech through the classifier; and obtaining the emotional state of the input speech according to the second probability vector.
In this application, the emotion analysis model may include only one of the speech-based emotion recognition model and the text-based emotion classification model, or it may include both. Preferably, the emotion analysis model includes both the speech-based emotion recognition model and the text-based emotion classification model; the two models each produce a probability vector characterizing the speech emotion, and a combined analysis of the two results improves the accuracy of the emotion analysis.
Preferably, the step of outputting the emotional state of the input speech through the emotion analysis model includes: obtaining a first probability vector of the emotion of the input speech through the speech-based emotion recognition model, and obtaining first confidences of multiple speech emotions from the first probability vector; obtaining a second probability vector of the emotion of the input speech through the text-based emotion classification model, and obtaining second confidences of the multiple speech emotions from the second probability vector; adding the first confidence and the second confidence of each speech emotion to obtain the confidence of that emotion, yielding a confidence vector over the multiple speech emotions; and selecting the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech. For example, for a segment of input speech whose possible emotional states are happy, sad, impatient, excited, and agitated, the first confidences corresponding to these states may be 0.6, 0.2, 0.3, 0.4, and 0.7, and the second confidences 0.8, 0.3, 0.2, 0.5, and 0.5; adding the corresponding first and second confidences gives final confidences of 1.4, 0.5, 0.5, 0.9, and 1.2, so the emotional state corresponding to the maximum confidence of 1.4 (happy) is selected as the emotional state of the input speech.
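The fusion itself amounts to a few lines of Python; the emotion labels and scores below simply reproduce the numerical example in the preceding paragraph.

```python
emotions = ["happy", "sad", "impatient", "excited", "agitated"]
first_conf  = [0.6, 0.2, 0.3, 0.4, 0.7]   # from the speech-based recognition model
second_conf = [0.8, 0.3, 0.2, 0.5, 0.5]   # from the text-based classification model

# Add the two confidences of each emotion and pick the emotion with the largest sum.
fused = [a + b for a, b in zip(first_conf, second_conf)]   # [1.4, 0.5, 0.5, 0.9, 1.2]
state = emotions[fused.index(max(fused))]                  # "happy"
```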
In one embodiment of the present application, for the same input speech, the speech-based emotion recognition model and the text-based emotion classification model produce two results, each characterizing the confidences of the various emotional states of the input speech; different weight values are set for the results of the two models, the confidences are added on the basis of these weights, and the final speech emotional state is predicted. For example, if a weight of 0.6 is set for the emotion recognition model and a weight of 0.4 for the emotion classification model, then for a segment of input speech whose possible emotional states are happy, sad, impatient, excited, and agitated, with first confidences of 0.6, 0.2, 0.3, 0.4, and 0.7 and second confidences of 0.8, 0.3, 0.2, 0.5, and 0.5, the weighted sums of the corresponding first and second confidences give final confidences of 0.68, 0.24, 0.26, 0.44, and 0.62, so the emotional state corresponding to the maximum confidence of 0.68 (happy) is selected as the emotional state of the input speech.
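With model-specific weights the only change is how the two scores are combined; the weights 0.6 and 0.4 and the scores below are taken from the example above.

```python
emotions = ["happy", "sad", "impatient", "excited", "agitated"]
first_conf  = [0.6, 0.2, 0.3, 0.4, 0.7]
second_conf = [0.8, 0.3, 0.2, 0.5, 0.5]
w_speech, w_text = 0.6, 0.4               # weights for the two models

fused = [w_speech * a + w_text * b for a, b in zip(first_conf, second_conf)]
# fused == [0.68, 0.24, 0.26, 0.44, 0.62] -> "happy" is again selected
state = emotions[fused.index(max(fused))]
```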
Further, when two or more speech emotions correspond to the maximum confidence in the confidence vector, one of them is randomly selected as the emotional state of the input speech.
It should be understood that the terms "first", "second", "third", and "fourth" in this application are only used to distinguish identical or similar objects and do not imply any sequence or order of preference.
In an optional embodiment of the present application, determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech includes:
constructing a scene library, where the scene library includes multiple dialogue scenes and the emotional state corresponding to each dialogue scene; the dialogue scenes and the corresponding emotional states may be manually annotated according to the specific scenario in combination with human cognition, thereby predefining the emotional state that the synthesized speech should have in certain specific dialogue scenes, because even for a reply to input speech carrying the same emotion, the emotional state of the synthesized speech to be fed back may differ across different dialogue scenes;
performing scene analysis according to the input speech and the text to be synthesized to obtain the dialogue scene of the text to be synthesized, where the text to be synthesized is the reply text determined by the intelligent system from the input speech during human-computer interaction;
obtaining, from the scene library, the emotional state corresponding to the dialogue scene of the text to be synthesized;
determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
For example, in a sales scenario where the emotion expressed by the customer's input speech is impatient, scene analysis of the text to be synthesized combined with the dialogue scene leads to the conclusion that the emotional state of the synthesized speech should be cheerful and positive, so as to serve the customer well. By adding the analysis of the dialogue scene, the feedback obtained during human-computer interaction better matches the real application scenario and the user experience is improved.
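A minimal sketch of the scene-library lookup follows; the library entries and the rule for combining the scene's prescribed emotion with the user's emotion are assumptions made for illustration (the text fixes only the sales example, not the general combination rule).

```python
# Hypothetical scene library: each dialogue scene is mapped to the emotional
# state that the synthesized reply should carry in that scene (manually labeled).
SCENE_LIBRARY = {
    "sales": "happy",
    "complaint_handling": "apologetic",
    "weather_query": "calm",
}

def reply_emotion(scene: str, user_emotion: str, default: str = "calm") -> str:
    """One possible combination rule (an assumption, not fixed by the text):
    if the scene prescribes an emotion it takes precedence; otherwise mirror a
    non-negative user emotion and fall back to the default for negative ones."""
    if scene in SCENE_LIBRARY:
        return SCENE_LIBRARY[scene]
    return user_emotion if user_emotion in ("happy", "calm") else default

# The sales example from the text: an impatient customer still gets a happy reply.
assert reply_emotion("sales", "impatient") == "happy"
```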
In an optional embodiment of the present application, performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized includes: embedding modal particles into the text to be synthesized through waveform concatenation; controlling the tone and prosody of the synthesized speech through end-to-end synthesis; and performing speech synthesis according to the embedded modal particles, tone, and prosody, so that the synthesized speech can express the corresponding tone and emotion.
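The waveform-concatenation and end-to-end synthesizers themselves are beyond a short sketch, but the front-end step can be illustrated as follows; every mapping in this snippet (which particle goes with which emotion, and the pitch/rate/energy scales handed to a synthesizer) is a hypothetical placeholder rather than values from the disclosure.

```python
# Hypothetical mappings from the chosen emotional state to a sentence-final
# modal particle and to prosody controls for an end-to-end synthesizer.
MODAL_PARTICLES = {"happy": "哦", "calm": "", "apologetic": "呢"}
PROSODY = {
    "happy":      {"pitch_scale": 1.10, "rate_scale": 1.05, "energy_scale": 1.10},
    "calm":       {"pitch_scale": 1.00, "rate_scale": 1.00, "energy_scale": 1.00},
    "apologetic": {"pitch_scale": 0.95, "rate_scale": 0.90, "energy_scale": 0.90},
}

def prepare_synthesis_request(text: str, emotion: str) -> dict:
    """Attach a modal particle to the reply text and select prosody controls."""
    particle = MODAL_PARTICLES.get(emotion, "")
    return {"text": text + particle, **PROSODY.get(emotion, PROSODY["calm"])}

request = prepare_synthesis_request("很高兴为您服务", "happy")
# -> {'text': '很高兴为您服务哦', 'pitch_scale': 1.1, 'rate_scale': 1.05, 'energy_scale': 1.1}
```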
The situation- and emotion-oriented Chinese speech synthesis method described in this application is applied to an electronic device, and the electronic device may be a terminal device such as a television, a smartphone, a tablet computer, or a computer.
The electronic device includes a processor and a memory for storing a situation- and emotion-oriented Chinese speech synthesis program. The processor executes the situation- and emotion-oriented Chinese speech synthesis program to implement the following steps of the situation- and emotion-oriented Chinese speech synthesis method: obtaining input speech; inputting the input speech into an emotion analysis model and outputting the emotional state of the input speech through the emotion analysis model; determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech; and performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
The electronic device further includes a network interface, a communication bus, and the like. The network interface may include a standard wired interface and a wireless interface, and the communication bus is used to realize connection and communication between the components.
The memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, or an optical disc, or a plug-in hard disk, and is not limited thereto; it may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides the instructions or software program to the processor so that the processor can execute them. In this application, the software program stored in the memory includes the situation- and emotion-oriented Chinese speech synthesis program, which can be provided to the processor so that the processor can execute the Chinese speech synthesis program and implement the steps of the Chinese speech synthesis method.
The processor may be a central processing unit, a microprocessor, or another data processing chip, and can run the programs stored in the memory, for example, the situation- and emotion-oriented Chinese speech synthesis program in this application.
The electronic device may further include a display, which may also be called a display screen or a display unit. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch display, or the like. The display is used to display the information processed in the electronic device and to display a visualized working interface.
The electronic device may further include a user interface, which may include an input unit (such as a keyboard), a voice output device (such as speakers or headphones), and the like.
In other embodiments, the situation- and emotion-oriented Chinese speech synthesis program may also be divided into one or more modules, which are stored in the memory and executed by the processor to carry out this application. A module in this application refers to a series of computer program instruction segments capable of completing a specific function. Fig. 5 is a schematic diagram of the modules of the situation- and emotion-oriented Chinese speech synthesis program in this application. As shown in Fig. 5, the Chinese speech synthesis program may be divided into an acquisition module 1, an emotion analysis module 2, an emotion determination module 3, and a speech synthesis module 4. The functions or operation steps implemented by these modules are similar to those described above, for example:
the acquisition module 1 obtains the input speech;
the emotion analysis module 2 inputs the input speech into the emotion analysis model and outputs the emotional state of the input speech through the emotion analysis model;
the emotion determination module 3 determines the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
the speech synthesis module 4 performs speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
Preferably, the electronic device further includes a judgment module which, before the step of inputting the input speech into the emotion analysis model, judges whether an interaction scene exists according to the input speech and the text to be synthesized; if no interaction scene exists, the emotional state of the synthesized speech is set or the default emotional state of the synthesized speech is used, without performing emotion analysis on the input speech; if an interaction scene exists, the next step is performed, in which the input speech is input into the emotion analysis model and its emotional state is analyzed. The preset emotional state of the synthesized speech may be impassioned or calm and gentle, and may be set according to the role or purpose of the human-computer interaction. For example, for an intelligent question-answering system, the default emotional state of the synthesized reply is calm and gentle; if the input speech only involves a query about a certain question and no interaction scene is involved, it is sufficient to determine the content of the text to be synthesized from the input speech and output it in a calm, gentle tone to meet the user's needs. For example, if a user asks "What is the temperature in Beijing today?", the question-answering system only needs to reply "The temperature in Beijing today is XX degrees Celsius" in the default tone, without performing emotion analysis on the input speech.
In one embodiment of the present application, the emotion analysis model includes a speech-based emotion recognition model. Fig. 2 is a schematic structural diagram of the emotion recognition model in this application. As shown in Fig. 2, the emotion recognition model is divided into three parts. The first part includes a convolutional recurrent neural network (CRNN) and one first fully connected (FC) layer; its input is the spectrogram, and a first feature vector is output through the CRNN and the first FC layer. The second part includes three cascaded second FC layers; its input consists of low-level speech feature parameters (LLDs), including fundamental frequency, energy, zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) and other features, and it outputs a second feature vector. The third part includes one third FC layer and a normalization layer; its input is the fused feature vector of the first and second feature vectors, and its output is a probability vector representing the emotion classification.
Fig. 3 is a schematic structural diagram of the convolutional recurrent neural network in this application. As shown in Fig. 3, the convolutional recurrent neural network includes a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a long short-term memory (LSTM) layer, and a third pooling layer, where the third pooling layer includes three pooling modules, namely a minimum pooling module, an average pooling module, and a maximum pooling module, and each pooling module is connected to every neuron of the LSTM layer. The convolutional recurrent neural network is trained on an emotional speech data set, which contains about 15 hours of speech data from multiple speakers together with the corresponding emotion labels.
In one embodiment of the present application, the emotion analysis module includes: a parameter acquisition unit, which obtains the spectrogram and the speech feature parameters of the input speech; a first feature vector acquisition unit, which inputs the spectrogram of the input speech into the trained convolutional recurrent neural network (CRNN) of the emotion recognition model and outputs a first feature vector through the CRNN and the first fully connected layer; a second feature vector acquisition unit, which obtains HSF features (statistics such as the average or maximum of the feature parameters computed over the multiple frames of a speech segment) from the speech feature parameters, inputs them into the emotion recognition model, and outputs a second feature vector through the three second fully connected layers of the emotion recognition model; a feature fusion unit, which concatenates the first feature vector with the second feature vector to obtain a fused feature vector; a first probability vector acquisition unit, which passes the fused feature vector through the third fully connected layer and the normalization layer of the emotion recognition model to output a first probability vector of the emotion of the input speech; and an emotional state output unit, which obtains the emotional state of the input speech according to the first probability vector.
In one embodiment of the present application, the emotion analysis model includes a text-based emotion classification model. Fig. 4 is a schematic structural diagram of the emotion classification model in this application. As shown in Fig. 4, the emotion classification model includes a feature extraction layer and a classifier. The feature extraction layer is used to extract features of the input text and to represent the input text as vectors; it includes an input layer for inputting the text to be classified, and an embedding layer for converting the text to be classified into multiple word vectors and constructing a sentence vector from those word vectors, for which, for example, the open-source BERT model (Bidirectional Encoder Representations from Transformers) can be used. The classifier is an LSTM neural network consisting of an input layer, a hidden layer, and an output layer; the input layer includes 256 input nodes for the sentence vector, the hidden layer includes 128 hidden nodes, and the output layer uses a softmax function to output emotion labels and their probabilities.
In one embodiment of the present application, the emotion analysis module includes: a text conversion unit, which converts the input speech into text to be classified through speech recognition; a feature extraction unit, which extracts a text feature vector of the text to be classified; an input unit, which inputs the text feature vector into the deep neural network classifier of the emotion classification model; a second probability vector acquisition unit, which obtains a second probability vector of the emotion of the input speech through the classifier; and an emotional state output unit, which obtains the emotional state of the input speech according to the second probability vector.
In this application, the emotion analysis model may include only one of the speech-based emotion recognition model and the text-based emotion classification model, or it may include both. Preferably, the emotion analysis model includes both the speech-based emotion recognition model and the text-based emotion classification model; the two models each produce a probability vector characterizing the speech emotion, and a combined analysis of the two results improves the accuracy of the emotion analysis.
Preferably, the emotion analysis module includes: a first confidence acquisition unit, which obtains a first probability vector of the emotion of the input speech through the speech-based emotion recognition model and obtains first confidences of multiple speech emotions from the first probability vector; a second confidence acquisition unit, which obtains a second probability vector of the emotion of the input speech through the text-based emotion classification model and obtains second confidences of the multiple speech emotions from the second probability vector; a confidence vector acquisition unit, which adds the first confidence and the second confidence of each speech emotion to obtain the confidence of that emotion, yielding a confidence vector over the multiple speech emotions; and a selection unit, which selects the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech. For example, for a segment of input speech whose possible emotional states are happy, sad, impatient, excited, and agitated, the first confidences corresponding to these states may be 0.6, 0.2, 0.3, 0.4, and 0.7, and the second confidences 0.8, 0.3, 0.2, 0.5, and 0.5; adding the corresponding first and second confidences gives final confidences of 1.4, 0.5, 0.5, 0.9, and 1.2, so the emotional state corresponding to the maximum confidence of 1.4 (happy) is selected as the emotional state of the input speech.
Further, when two or more speech emotions correspond to the maximum confidence in the confidence vector, one of them is randomly selected as the emotional state of the input speech.
It should be understood that the terms "first", "second", "third", and "fourth" in this application are only used to distinguish identical or similar objects and do not imply any sequence or order of preference.
In an optional embodiment of the present application, the emotion determination module includes:
a construction unit, which constructs a scene library, the scene library including multiple dialogue scenes and the emotional state corresponding to each dialogue scene, where the dialogue scenes and the corresponding emotional states may be manually annotated according to the specific scenario in combination with human cognition, thereby predefining the emotional state that the synthesized speech should have in certain specific dialogue scenes, because even for a reply to input speech carrying the same emotion, the emotional state of the synthesized speech to be fed back may differ across different dialogue scenes;
a scene analysis unit, which performs scene analysis according to the input speech and the text to be synthesized to obtain the dialogue scene of the text to be synthesized;
a query unit, which obtains, from the scene library, the emotional state corresponding to the dialogue scene of the text to be synthesized;
an emotional state determination unit, which determines the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
For example, in a sales scenario where the emotion expressed by the customer's input speech is impatient, scene analysis of the text to be synthesized combined with the dialogue scene leads to the conclusion that the emotional state of the synthesized speech should be cheerful and positive, so as to serve the customer well. By adding the analysis of the dialogue scene, the feedback obtained during human-computer interaction better matches the real application scenario and the user experience is improved.
In an optional embodiment of the present application, the speech synthesis module includes: a modal particle embedding unit, which embeds modal particles into the text to be synthesized through waveform concatenation; a prosody control unit, which controls the tone and prosody of the synthesized speech through end-to-end synthesis; and a speech synthesis unit, which performs speech synthesis according to the embedded modal particles, tone, and prosody, so that the synthesized speech can express the corresponding tone and emotion.
In one embodiment of the present application, the computer-readable storage medium may be any tangible medium that contains or stores a program or instructions; the program can be executed, and the stored program instructs the related hardware to realize the corresponding functions. For example, the computer-readable storage medium may be a computer diskette, a hard disk, a random access memory, a read-only memory, or the like. The application is not limited thereto; the medium may be any device that stores instructions or software and any related data files or data structures in a non-transitory manner and can provide them to a processor so that the processor executes the programs or instructions therein. The computer-readable storage medium may be non-volatile or volatile, and it includes a situation- and emotion-oriented Chinese speech synthesis program which, when executed by a processor, implements the following Chinese speech synthesis method:
obtaining input speech;
inputting the input speech into an emotion analysis model, and outputting the emotional state of the input speech through the emotion analysis model;
determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
The specific implementation of the computer-readable storage medium of the present application is substantially the same as the specific implementations of the situation- and emotion-oriented Chinese speech synthesis method and the electronic device described above, and is not repeated here.
It should be noted that, in this document, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, device, article, or method that includes that element.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments. Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not therefore limit the patent scope of this application; any equivalent structural or process transformation made using the contents of the description and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (20)

  1. A situation- and emotion-oriented Chinese speech synthesis method, applied to an electronic device, comprising:
    obtaining input speech;
    inputting the input speech into an emotion analysis model, and outputting the emotional state of the input speech through the emotion analysis model;
    determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
    performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  2. The situation- and emotion-oriented Chinese speech synthesis method according to claim 1, wherein the emotion analysis model comprises a speech-based emotion recognition model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    obtaining the spectrogram and speech feature parameters of the input speech;
    inputting the spectrogram of the input speech into a trained convolutional recurrent neural network in the emotion recognition model, and outputting a first feature vector through the convolutional recurrent neural network and a first fully connected layer;
    obtaining statistical features from the speech feature parameters, inputting them into the emotion recognition model, and outputting a second feature vector through three second fully connected layers in the emotion recognition model;
    fusing the first feature vector with the second feature vector to obtain a fused feature vector;
    outputting, from the fused feature vector, a first probability vector of the emotion of the input speech through a third fully connected layer and a normalization layer in the emotion recognition model;
    obtaining the emotional state of the input speech according to the first probability vector.
  3. The situation- and emotion-oriented Chinese speech synthesis method according to claim 1, wherein the emotion analysis model comprises a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    converting the input speech into text to be classified through speech recognition;
    extracting a text feature vector of the text to be classified;
    inputting the text feature vector into a deep neural network classifier in the emotion classification model;
    obtaining a second probability vector of the emotion of the input speech through the classifier;
    obtaining the emotional state of the input speech according to the second probability vector.
  4. The situation- and emotion-oriented Chinese speech synthesis method according to claim 1, wherein the emotion analysis model comprises a speech-based emotion recognition model and a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    obtaining a first probability vector of the emotion of the input speech through the emotion recognition model, and obtaining first confidences of multiple speech emotions respectively according to the first probability vector;
    obtaining a second probability vector of the emotion of the input speech through the emotion classification model, and obtaining second confidences of the multiple speech emotions respectively according to the second probability vector;
    adding the first confidence and the second confidence of the same speech emotion to obtain the confidence of that speech emotion, so as to obtain a confidence vector of the multiple speech emotions;
    selecting the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech.
  5. The situation- and emotion-oriented Chinese speech synthesis method according to claim 1, wherein the step of determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech comprises:
    constructing a scene library, the scene library comprising multiple dialogue scenes and the emotional state corresponding to each dialogue scene;
    performing scene analysis according to the input speech and the text to be synthesized, and obtaining the dialogue scene of the text to be synthesized;
    obtaining, according to the scene library, the emotional state corresponding to the dialogue scene of the text to be synthesized;
    determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
  6. The situation- and emotion-oriented Chinese speech synthesis method according to any one of claims 1 to 5, wherein the step of performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized comprises:
    embedding modal particles into the text to be synthesized through waveform concatenation;
    controlling the tone and prosody of the synthesized speech through end-to-end synthesis;
    performing speech synthesis according to the embedded modal particles, tone, and prosody.
  7. The situation- and emotion-oriented Chinese speech synthesis method according to claim 1, wherein before the step of inputting the input speech into the emotion analysis model, the method further comprises:
    judging whether an interaction scene exists according to the input speech and the text to be synthesized; if no interaction scene exists, setting the emotional state of the synthesized speech without performing emotion analysis on the input speech; and if an interaction scene exists, inputting the input speech into the emotion analysis model.
  8. An electronic device, comprising:
    a processor;
    a memory including a situation- and emotion-oriented Chinese speech synthesis program which, when executed by the processor, implements the following steps of the Chinese speech synthesis method:
    obtaining input speech;
    inputting the input speech into an emotion analysis model, and outputting the emotional state of the input speech through the emotion analysis model;
    determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
    performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  9. A computer device, comprising a memory, a processor, and a situation- and emotion-oriented Chinese speech synthesis program stored in the memory and executable on the processor, wherein the Chinese speech synthesis program, when executed by the processor, implements the following steps:
    obtaining input speech;
    inputting the input speech into an emotion analysis model, and outputting the emotional state of the input speech through the emotion analysis model;
    determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
    performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  10. The computer device according to claim 9, wherein the emotion analysis model comprises a speech-based emotion recognition model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    obtaining the spectrogram and speech feature parameters of the input speech;
    inputting the spectrogram of the input speech into a trained convolutional recurrent neural network in the emotion recognition model, and outputting a first feature vector through the convolutional recurrent neural network and a first fully connected layer;
    obtaining statistical features from the speech feature parameters, inputting them into the emotion recognition model, and outputting a second feature vector through three second fully connected layers in the emotion recognition model;
    fusing the first feature vector with the second feature vector to obtain a fused feature vector;
    outputting, from the fused feature vector, a first probability vector of the emotion of the input speech through a third fully connected layer and a normalization layer in the emotion recognition model;
    obtaining the emotional state of the input speech according to the first probability vector.
  11. The computer device according to claim 9, wherein the emotion analysis model comprises a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    converting the input speech into text to be classified through speech recognition;
    extracting a text feature vector of the text to be classified;
    inputting the text feature vector into a deep neural network classifier in the emotion classification model;
    obtaining a second probability vector of the emotion of the input speech through the classifier;
    obtaining the emotional state of the input speech according to the second probability vector.
  12. The computer device according to claim 9, wherein the emotion analysis model comprises a speech-based emotion recognition model and a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    obtaining a first probability vector of the emotion of the input speech through the emotion recognition model, and obtaining first confidences of multiple speech emotions respectively according to the first probability vector;
    obtaining a second probability vector of the emotion of the input speech through the emotion classification model, and obtaining second confidences of the multiple speech emotions respectively according to the second probability vector;
    adding the first confidence and the second confidence of the same speech emotion to obtain the confidence of that speech emotion, so as to obtain a confidence vector of the multiple speech emotions;
    selecting the speech emotion corresponding to the maximum confidence in the confidence vector as the emotional state of the input speech.
  13. The computer device according to claim 9, wherein the step of determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech comprises:
    constructing a scene library, the scene library comprising multiple dialogue scenes and the emotional state corresponding to each dialogue scene;
    performing scene analysis according to the input speech and the text to be synthesized, and obtaining the dialogue scene of the text to be synthesized;
    obtaining, according to the scene library, the emotional state corresponding to the dialogue scene of the text to be synthesized;
    determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
  14. The computer device according to any one of claims 9 to 13, wherein the step of performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized comprises:
    embedding modal particles into the text to be synthesized through waveform concatenation;
    controlling the tone and prosody of the synthesized speech through end-to-end synthesis;
    performing speech synthesis according to the embedded modal particles, tone, and prosody.
  15. The computer device according to claim 9, wherein before the step of inputting the input speech into the emotion analysis model, the method further comprises:
    judging whether an interaction scene exists according to the input speech and the text to be synthesized; if no interaction scene exists, setting the emotional state of the synthesized speech without performing emotion analysis on the input speech; and if an interaction scene exists, inputting the input speech into the emotion analysis model.
  16. A computer-readable storage medium, comprising a situation- and emotion-oriented Chinese speech synthesis program which, when executed by a processor, implements the following steps:
    obtaining input speech;
    inputting the input speech into an emotion analysis model, and outputting the emotional state of the input speech through the emotion analysis model;
    determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech;
    performing speech synthesis according to the emotional state of the synthesized speech and the text to be synthesized determined based on the input speech.
  17. The computer-readable storage medium according to claim 16, wherein the emotion analysis model comprises a speech-based emotion recognition model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    acquiring a spectrogram and speech feature parameters of the input speech;
    inputting the spectrogram of the input speech into a trained convolutional recurrent neural network in the emotion recognition model, and outputting a first feature vector through the convolutional recurrent neural network and a first fully connected layer;
    obtaining statistical features according to the speech feature parameters, inputting the statistical features into the emotion recognition model, and outputting a second feature vector through three second fully connected layers in the emotion recognition model;
    fusing the first feature vector and the second feature vector to obtain a fused feature vector;
    outputting, from the fused feature vector, a first probability vector of the emotion of the input speech through a third fully connected layer and a normalization layer in the emotion recognition model;
    acquiring the emotional state of the input speech according to the first probability vector.
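A hedged PyTorch sketch of the dual-branch layout described in claim 17 (spectrogram branch through a convolutional recurrent network and a first fully connected layer, statistics branch through three fully connected layers, fusion, then a third fully connected layer with softmax normalization); all layer sizes, the 384-dimensional statistics vector, and the six emotion classes are assumed values.

import torch
import torch.nn as nn

class SpeechEmotionRecognizer(nn.Module):
    def __init__(self, n_mels: int = 80, n_stats: int = 384, n_emotions: int = 6):
        super().__init__()
        # Branch 1: convolutional recurrent network over the spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), 128, batch_first=True)
        self.fc1 = nn.Linear(128, 128)                        # first fully connected layer
        # Branch 2: three fully connected layers over utterance-level statistics.
        self.fc2 = nn.Sequential(
            nn.Linear(n_stats, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        # Fusion head: third fully connected layer followed by softmax normalization.
        self.fc3 = nn.Linear(128 + 128, n_emotions)

    def forward(self, spectrogram: torch.Tensor, stats: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, n_mels, frames); stats: (batch, n_stats)
        x = self.conv(spectrogram)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)        # frames as a time sequence
        _, h = self.gru(x)
        v1 = torch.relu(self.fc1(h[-1]))                      # first feature vector
        v2 = self.fc2(stats)                                  # second feature vector
        fused = torch.cat([v1, v2], dim=-1)                   # fused feature vector
        return torch.softmax(self.fc3(fused), dim=-1)         # first probability vector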
  18. The computer-readable storage medium according to claim 16, wherein the emotion analysis model comprises a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    converting the input speech into a text to be classified through speech recognition;
    extracting a text feature vector of the text to be classified;
    inputting the text feature vector into a deep neural network classifier in the emotion classification model;
    obtaining a second probability vector of the emotion of the input speech through the classifier;
    acquiring the emotional state of the input speech according to the second probability vector.
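A compact PyTorch sketch of the deep neural network classifier in claim 18; the 300-dimensional text feature vector is assumed to come from an upstream speech-recognition and text-embedding step that is not shown, and the layer sizes are illustrative.

import torch
import torch.nn as nn

class TextEmotionClassifier(nn.Module):
    def __init__(self, feat_dim: int = 300, n_emotions: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_emotions),
        )

    def forward(self, text_features: torch.Tensor) -> torch.Tensor:
        # Returns the second probability vector over the emotion classes.
        return torch.softmax(self.net(text_features), dim=-1)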
  19. The computer-readable storage medium according to claim 16, wherein the emotion analysis model comprises a speech-based emotion recognition model and a text-based emotion classification model, and the step of outputting the emotional state of the input speech through the emotion analysis model comprises:
    obtaining a first probability vector of the emotion of the input speech through the emotion recognition model, and obtaining first confidence levels of a plurality of speech emotions according to the first probability vector;
    obtaining a second probability vector of the emotion of the input speech through the emotion classification model, and obtaining second confidence levels of the plurality of speech emotions according to the second probability vector;
    adding, for each speech emotion, the first confidence level and the second confidence level of that speech emotion to obtain its confidence level, thereby obtaining a confidence vector over the plurality of speech emotions;
    selecting the speech emotion corresponding to the maximum confidence level in the confidence vector as the emotional state of the input speech.
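The confidence fusion of claim 19 reduces to an element-wise sum of the two probability vectors followed by an argmax, as in the NumPy sketch below; the emotion label set is an assumed example.

import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry", "fearful", "surprised"]  # assumed labels

def fuse_emotion(p_speech: np.ndarray, p_text: np.ndarray) -> str:
    """Add the per-emotion confidences from both models and pick the maximum."""
    confidence = p_speech + p_text      # confidence vector over the speech emotions
    return EMOTIONS[int(confidence.argmax())]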
  20. The computer-readable storage medium according to claim 16, wherein the step of determining the emotional state of the synthesized speech according to the dialogue scene and the emotional state of the input speech comprises:
    constructing a scene library, the scene library comprising a plurality of dialogue scenes and the emotional state corresponding to each dialogue scene;
    performing context analysis according to the input speech and the text to be synthesized, to obtain the dialogue scene of the text to be synthesized;
    obtaining, according to the scene library, the emotional state corresponding to the dialogue scene of the text to be synthesized;
    determining the emotional state of the synthesized speech according to the emotional state corresponding to the dialogue scene and the emotional state of the input speech.
PCT/CN2020/093564 2019-06-19 2020-05-30 Situation- and emotion-oriented Chinese speech synthesis method, device, and storage medium WO2020253509A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910531628.7A CN110211563A (en) 2019-06-19 2019-06-19 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion
CN201910531628.7 2019-06-19

Publications (1)

Publication Number Publication Date
WO2020253509A1 (en)

Family

ID=67793522

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093564 WO2020253509A1 (en) 2020-12-24 Situation- and emotion-oriented Chinese speech synthesis method, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110211563A (en)
WO (1) WO2020253509A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion
CN110675853B (en) * 2019-09-10 2022-07-05 苏宁云计算有限公司 Emotion voice synthesis method and device based on deep learning
CN110570879A (en) * 2019-09-11 2019-12-13 深圳壹账通智能科技有限公司 Intelligent conversation method and device based on emotion recognition and computer equipment
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
CN112233648A (en) * 2019-12-09 2021-01-15 北京来也网络科技有限公司 Data processing method, device, equipment and storage medium combining RPA and AI
CN111276119B (en) * 2020-01-17 2023-08-22 平安科技(深圳)有限公司 Speech generation method, system and computer equipment
CN111274807B (en) * 2020-02-03 2022-05-10 华为技术有限公司 Text information processing method and device, computer equipment and readable storage medium
CN111445906A (en) * 2020-02-28 2020-07-24 深圳壹账通智能科技有限公司 Big data-based voice generation method, device, equipment and medium
CN111312210B (en) * 2020-03-05 2023-03-21 云知声智能科技股份有限公司 Text-text fused voice synthesis method and device
CN111489734B (en) * 2020-04-03 2023-08-22 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers
CN111627420B (en) * 2020-04-21 2023-12-08 升智信息科技(南京)有限公司 Method and device for synthesizing emotion voice of specific speaker under extremely low resource
CN111862030B (en) 2020-07-15 2024-02-09 北京百度网讯科技有限公司 Face synthetic image detection method and device, electronic equipment and storage medium
CN112148846A (en) * 2020-08-25 2020-12-29 北京来也网络科技有限公司 Reply voice determination method, device, equipment and storage medium combining RPA and AI
CN112423106A (en) * 2020-11-06 2021-02-26 四川长虹电器股份有限公司 Method and system for automatically translating accompanying sound
CN112466314A (en) * 2020-11-27 2021-03-09 平安科技(深圳)有限公司 Emotion voice data conversion method and device, computer equipment and storage medium
CN112837700A (en) * 2021-01-11 2021-05-25 网易(杭州)网络有限公司 Emotional audio generation method and device
CN113066473A (en) * 2021-03-31 2021-07-02 建信金融科技有限责任公司 Voice synthesis method and device, storage medium and electronic equipment
CN113593521B (en) * 2021-07-29 2022-09-20 北京三快在线科技有限公司 Speech synthesis method, device, equipment and readable storage medium
CN114373444B (en) * 2022-03-23 2022-05-27 广东电网有限责任公司佛山供电局 Method, system and equipment for synthesizing voice based on montage

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5860064A (en) * 1993-05-13 1999-01-12 Apple Computer, Inc. Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system
CN108305643A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
CN109256150A (en) * 2018-10-12 2019-01-22 北京创景咨询有限公司 Speech emotion recognition system and method based on machine learning
CN109559760A (en) * 2018-12-29 2019-04-02 北京京蓝宇科技有限公司 A kind of sentiment analysis method and system based on voice messaging
CN109754779A (en) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI221574B (en) * 2000-09-13 2004-10-01 Agi Inc Sentiment sensing method, perception generation method and device thereof and software
CN109817246B (en) * 2019-02-27 2023-04-18 平安科技(深圳)有限公司 Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium

Also Published As

Publication number Publication date
CN110211563A (en) 2019-09-06

Similar Documents

Publication Publication Date Title
WO2020253509A1 (en) Situation- and emotion-oriented Chinese speech synthesis method, device, and storage medium
US11475897B2 (en) Method and apparatus for response using voice matching user category
US20220246149A1 (en) Proactive command framework
US11361753B2 (en) System and method for cross-speaker style transfer in text-to-speech and training data generation
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
US20220180872A1 (en) Electronic apparatus and method for controlling thereof
CN107657017A (en) Method and apparatus for providing voice service
US10224030B1 (en) Dynamic gazetteers for personalized entity recognition
US20220076674A1 (en) Cross-device voiceprint recognition
US11276403B2 (en) Natural language speech processing application selection
JP2018146715A (en) Voice interactive device, processing method of the same and program
KR20200044388A (en) Device and method to recognize voice and device and method to train voice recognition model
CN113330511B (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
WO2022252904A1 (en) Artificial intelligence-based audio processing method and apparatus, device, storage medium, and computer program product
CN114627856A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
US11600261B2 (en) System and method for cross-speaker style transfer in text-to-speech and training data generation
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
US20240029708A1 (en) Visual responses to user inputs
KR102226427B1 (en) Apparatus for determining title of user, system including the same, terminal and method for the same
CN114708849A (en) Voice processing method and device, computer equipment and computer readable storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN112037772A (en) Multi-mode-based response obligation detection method, system and device
US11792365B1 (en) Message data analysis for response recommendations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20825708

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20825708

Country of ref document: EP

Kind code of ref document: A1