CN112185389A - Voice generation method and device, storage medium and electronic equipment - Google Patents

Voice generation method and device, storage medium and electronic equipment

Info

Publication number
CN112185389A
Authority
CN
China
Prior art keywords
text
voice
emotion
model
reply
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011003603.9A
Other languages
Chinese (zh)
Inventor
魏晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202011003603.9A
Publication of CN112185389A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The present disclosure relates to a voice generation method, apparatus, storage medium, and electronic device. The method includes: determining, through a preset trained emotion classification model, a voice emotion label corresponding to an input voice according to the sound spectrum features of the input voice and the semantic text corresponding to the input voice; extracting cognitive information from the semantic text; determining a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text according to a preset emotion association model, a preset text association model, the voice emotion label and the cognitive information; and generating a reply voice for the input voice according to the intonation determined by the response emotion label and according to the reply text. The voice emotion and the semantic text of the input voice can thus both be acquired, and the corresponding reply voice is generated from the response emotion corresponding to the voice emotion and the reply text corresponding to the semantic text, which improves the degree of intelligence of intelligent voice interaction.

Description

Voice generation method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method and an apparatus for generating speech, a storage medium, and an electronic device.
Background
Since Apple's intelligent voice assistant Siri pioneered the category, voice interaction systems and intelligent voice chat systems from various technology companies have sprung up rapidly. Such intelligent voice interaction systems are built into electronic devices such as mobile terminals and smart home appliances. In the related art of intelligent voice interaction, when an input voice of a user is received, the intelligent voice interaction system generally analyzes only the input voice itself and generates a reply voice according to the semantics of the input voice, so as to communicate with the user through the reply voice or to assist the user in controlling the mobile terminal or the smart home appliance.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a voice generation method, apparatus, storage medium, and electronic device.
According to a first aspect of embodiments of the present disclosure, there is provided a speech generation method, the method including:
receiving input voice;
determining, through a preset trained emotion classification model, a voice emotion label corresponding to the input voice according to the sound spectrum features of the input voice and the semantic text corresponding to the input voice;
extracting cognitive information from the semantic text, wherein the cognitive information comprises: at least one of user portrait information, event flow information, and event decision information;
determining a response emotion label corresponding to the voice emotion label and a response text corresponding to the semantic text according to a preset emotion association model, a preset text association model, the voice emotion label and the cognitive information; the emotion association model is used for representing the association relationship among the voice emotion label, the response emotion label and the cognitive information, and the text association model is used for representing the association relationship among the voice emotion label, the response text and the cognitive information;
generating reply voice aiming at the input voice according to the reply emotion tag and the reply text, wherein the semantic text corresponding to the reply voice is the reply text, and the tone characteristic of the reply voice is the tone characteristic determined according to the reply emotion tag;
and outputting the reply voice.
Optionally, the emotion classification model includes: the method comprises a voice decoder, a text decoder, an audio processing model, a voice recognition model and a classification prediction model, wherein the classification prediction model comprises a connecting layer and a Softmax layer, and the emotion classification model after preset training determines a voice emotion label corresponding to input voice according to the sound spectrum characteristics of the input voice and the semantic text corresponding to the input voice, and comprises the following steps:
obtaining a sound spectrum characteristic corresponding to the input voice through the audio processing model, and inputting the sound spectrum characteristic into the voice decoder to obtain a corresponding first characteristic vector;
recognizing a semantic text in the input voice through the voice recognition model, and inputting the semantic text into the text decoder to obtain a corresponding second feature vector;
splicing the first feature vector and the second feature vector into a third feature vector through the connection layer;
and inputting the third feature vector into the Softmax layer, and acquiring an emotion label corresponding to the third feature vector as the voice emotion label.
Optionally, before the emotion classification model after the preset training is used to determine the speech emotion tag corresponding to the input speech according to the sound spectrum feature of the input speech and the semantic text corresponding to the input speech, the method further includes:
training a preset classification prediction model through preset speech emotion training data to obtain a trained classification prediction model;
constructing the emotion classification model through the voice decoder, the text decoder, the audio processing model, the voice recognition model and the trained classification prediction model; wherein,
the output of the audio processing model is the input of the voice decoder, the output of the voice recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the voice decoder and an output of the text decoder.
Optionally, the extracting cognitive information from the semantic text includes:
extracting a first text element for describing personal information and/or interest information from the semantic text to take a text feature corresponding to the first text element as the user portrait information;
extracting a second text element for describing an event processing flow and/or an object development rule from the semantic text, and taking a text feature corresponding to the second text element as the event flow information; and/or,
identifying a third text element in the semantic text for describing an event decision condition;
and determining the event probability of each event result caused by the event decision condition according to the third text element through a preset Bayesian network model, so as to use each event result and the event probability corresponding to each event result as the event decision information.
Optionally, the determining, according to a preset emotion association model, a preset text association model, the voice emotion tag and the cognitive information, a response emotion tag corresponding to the voice emotion tag and a response text corresponding to the semantic text includes:
inputting the voice emotion label and the cognitive information into the emotion association model and the text association model respectively, and acquiring a first probability set output by the emotion association model and a second probability set output by the text association model; wherein the first probability set comprises a plurality of emotion labels and a first probability corresponding to each emotion label, and the second probability set comprises a plurality of texts and a second probability corresponding to each text;
selecting, as the response emotion label, the emotion label of the plurality of emotion labels that corresponds to the highest first probability; and
taking the text corresponding to the highest second probability in the plurality of texts as the reply text.
Optionally, the generating a reply voice for the input voice according to the reply emotion tag and the reply text includes:
inputting the response emotion label into a preset intonation association model, and acquiring the intonation feature corresponding to the response emotion label output by the intonation association model;
and synthesizing the intonation feature and the reply text into the reply voice through a preset text-to-speech (TTS) algorithm.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech generating apparatus, the apparatus comprising:
a voice receiving module configured to receive an input voice;
the label determining module is configured to determine a voice emotion label corresponding to input voice according to the sound spectrum characteristics of the input voice and the semantic text corresponding to the input voice through a preset trained emotion classification model;
an information extraction module configured to extract cognitive information from the semantic text, the cognitive information comprising: at least one of user portrait information, event flow information, and event decision information;
the information determining module is configured to determine a response emotion tag corresponding to the voice emotion tag and a response text corresponding to the semantic text according to a preset emotion association model, a preset text association model, the voice emotion tag and the cognitive information; the emotion association model is used for representing the association relationship among the voice emotion label, the response emotion label and the cognitive information, and the text association model is used for representing the association relationship among the voice emotion label, the response text and the cognitive information;
a voice synthesis module configured to generate a reply voice for the input voice according to the reply emotion tag and the reply text, where a semantic text corresponding to the reply voice is the reply text, and a tone feature of the reply voice is a tone feature determined according to the reply emotion tag;
a voice output module configured to output the reply voice.
Optionally, the emotion classification model includes: a speech decoder, a text decoder, an audio processing model, a speech recognition model, and a classification prediction model, the classification prediction model including a connection layer and a Softmax layer, the tag determination module configured to:
obtaining a sound spectrum characteristic corresponding to the input voice through the audio processing model, and inputting the sound spectrum characteristic into the voice decoder to obtain a corresponding first characteristic vector;
recognizing a semantic text in the input voice through the voice recognition model, and inputting the semantic text into the text decoder to obtain a corresponding second feature vector;
splicing the first feature vector and the second feature vector into a third feature vector through the connection layer;
and inputting the third feature vector into the Softmax layer, and acquiring an emotion label corresponding to the third feature vector as the voice emotion label.
Optionally, the apparatus further comprises:
the model training module is configured to train a preset classification prediction model through preset speech emotion training data to obtain a trained classification prediction model;
a model construction module configured to construct the emotion classification model through the speech decoder, the text decoder, the audio processing model, the speech recognition model, and the trained classification prediction model; wherein,
the output of the audio processing model is the input of the speech decoder, the output of the speech recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the speech decoder and an output of the text decoder.
Optionally, the information extraction module is configured to:
extracting a first text element for describing personal information and/or interest information from the semantic text to take a text feature corresponding to the first text element as the user portrait information;
extracting a second text element for describing an event processing flow and/or an object development rule from the semantic text, and taking a text feature corresponding to the second text element as the event flow information; and/or,
identifying a third text element in the semantic text for describing an event decision condition;
and determining the event probability of each event result caused by the event decision condition according to the third text element through a preset Bayesian network model, so as to use each event result and the event probability corresponding to each event result as the event decision information.
Optionally, the information determining module is configured to:
inputting the voice emotion label and the cognitive information into the emotion association model and the text association model respectively, and acquiring a first probability set output by the emotion association model and a second probability set output by the text association model; wherein the first probability set comprises a plurality of emotion labels and a first probability corresponding to each emotion label, and the second probability set comprises a plurality of texts and a second probability corresponding to each text;
selecting, as the response emotion label, the emotion label of the plurality of emotion labels that corresponds to the highest first probability; and
taking the text corresponding to the highest second probability in the plurality of texts as the reply text.
Optionally, the speech synthesis module is configured to:
inputting the response emotion label into a preset intonation association model, and acquiring the intonation feature corresponding to the response emotion label output by the intonation association model;
and synthesizing the intonation feature and the reply text into the reply voice through a preset text-to-speech (TTS) algorithm.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the speech generation method provided by the first aspect of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, an electronic device is provided, in which a voice interaction system is disposed; the electronic device includes the speech generating apparatus provided by the second aspect of the present disclosure.
According to the technical solution provided by the embodiments of the present disclosure, a trained emotion classification model is preset, and the voice emotion label corresponding to the input voice is determined according to the sound spectrum features of the input voice and the semantic text corresponding to the input voice; cognitive information is extracted from the semantic text, the cognitive information including: at least one of user portrait information, event flow information, and event decision information; a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text are determined according to a preset emotion association model, a preset text association model, the voice emotion label and the cognitive information, wherein the emotion association model is used for representing the association relationship among the voice emotion label, the response emotion label and the cognitive information, and the text association model is used for representing the association relationship among the voice emotion label, the reply text and the cognitive information; and a reply voice for the input voice is generated according to the voice emotion corresponding to the response emotion label and according to the reply text, wherein the semantic text corresponding to the reply voice is the reply text, and the intonation of the reply voice is determined according to the response emotion label. The voice emotion and the semantic text of the input voice can thus both be acquired, and the corresponding reply voice is generated from the response emotion corresponding to the voice emotion and the reply text corresponding to the semantic text, thereby improving the degree of intelligence of intelligent voice interaction.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of speech generation according to an exemplary embodiment;
FIG. 2 is a flowchart of a method of determining a speech emotion tag, based on the method shown in FIG. 1;
FIG. 3 is a flowchart of another speech generation method, based on the method shown in FIG. 2;
FIG. 4 is a flowchart of yet another speech generation method, based on the method shown in FIG. 1;
FIG. 5 is a flowchart of a method of determining a response emotion tag and a reply text, based on the method shown in FIG. 1;
FIG. 6 is a flowchart of a speech synthesis method, based on the method shown in FIG. 1;
FIG. 7 is a block diagram illustrating a speech generating apparatus according to an exemplary embodiment;
FIG. 8 is a block diagram of another speech generating apparatus, based on the apparatus shown in FIG. 7;
FIG. 9 is a block diagram illustrating an apparatus for speech generation according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Before introducing the voice generation method provided by the present disclosure, the target application scenario involved in each embodiment of the present disclosure is first introduced. The target application scenario includes an electronic device provided with audio input and output devices; the electronic device may be, for example, a personal computer, a notebook computer, a smart phone, a tablet computer, a smart television, a smart watch, or a PDA (Personal Digital Assistant). An intelligent voice interaction system based on a brain-like cognitive model is arranged in the electronic device.
Illustratively, such a brain-like cognitive model generally includes a sensing unit, a memory unit, a learning unit, and a decision unit. The sensing unit is used for sensing voice or audio information, image information, and even odor information that is actively input by the user or actively monitored by the electronic device, and for extracting and analyzing the information, so as to simulate human vision, hearing, smell, touch, and the like. In the embodiments of the present disclosure, the sensing unit includes an emotion classification model, which can determine the semantics of the input speech itself and the emotion information contained in the speech according to the audio features of the input speech. The memory unit is used for extracting and memorizing, from the acquired information, user personal information, interest information and the like of different dimensions that characterize the personal features of the user. The learning unit is used for extracting, from the acquired information, event flow information that characterizes the whole flow of the user participating in a certain event (such as buying a train ticket or booking an online car-hailing trip). The decision unit is mainly realized through the construction of a Bayesian network; it extracts, from the information acquired by the sensing unit, different entities used for event decisions and constructs a corresponding Bayesian network according to the causal relationships among the entities. The probabilities of the results caused by different entities are stored in a conditional probability table corresponding to the Bayesian network, and when a decision over several entity conditions is needed, the result caused by those entities is determined according to the trigger probabilities corresponding to the entity conditions in the conditional probability table.
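As a rough illustration only, the four units described above could be organized as follows; every class, field and method name here is an assumption made for the sketch and is not part of the disclosure, and the decision unit's conditional probability table is reduced to a plain dictionary lookup.

```python
# Hypothetical skeleton of the brain-like cognitive model's four units.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PerceptionUnit:
    """Senses input speech and extracts semantics plus emotion information."""
    def perceive(self, audio: bytes) -> Dict[str, str]:
        # In the disclosure this role is played by the emotion classification model.
        return {"semantic_text": "...", "emotion_tag": "..."}


@dataclass
class MemoryUnit:
    """Stores user portrait information of different dimensions."""
    user_portrait: Dict[str, str] = field(default_factory=dict)

    def remember(self, key: str, value: str) -> None:
        self.user_portrait[key] = value


@dataclass
class LearningUnit:
    """Accumulates event flow information, e.g. the steps of a train ticket purchase."""
    event_flows: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class DecisionUnit:
    """Keeps a conditional probability table keyed by tuples of decision conditions."""
    cpt: Dict[Tuple[str, ...], Dict[str, float]] = field(default_factory=dict)

    def decide(self, conditions: Tuple[str, ...]) -> Dict[str, float]:
        # Return the trigger probability of each event result for the observed conditions.
        return self.cpt.get(conditions, {})
```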
Fig. 1 is a flowchart illustrating a speech generation method according to an exemplary embodiment, and the method is applied to the electronic device described in the application scenario, as shown in fig. 1, and includes the following steps:
step 101, receiving an input voice.
And step 102, determining a voice emotion label corresponding to the input voice according to the voice spectrum characteristics of the input voice and the semantic text corresponding to the input voice through a preset trained emotion classification model.
For example, a piece of speech contains semantic text and an intonation (or tone); both are the basis for determining the actual meaning of the speech, since the same semantic text may express completely opposite meanings under different tones, and the tone depends on the emotion the user wants to express when speaking. Based on this, in the embodiments of the present disclosure, the reply voice for the input voice needs to be determined from two kinds of information contained in the input voice: the semantic text of the input voice and the emotion corresponding to the input voice, which may be, for example, pain, excitement, or joy. In step 102, after a piece of input voice is received through the voice collecting unit in the sensing unit, the emotion label corresponding to the input voice is determined through the preset trained emotion classification model. The emotion classification model comprises two parts: one part extracts the feature vector of the audio features of the input voice, and the other part extracts the feature vector of the text features of the semantic text of the input voice. The two feature vectors are then used as the input of the trained classification prediction model to obtain the voice emotion label corresponding to the input voice, where the voice emotion label represents the emotion contained in the input voice. In an actual implementation, the voice emotion label may be recorded and transmitted in the form of a number.
Step 103, extracting cognitive information from the semantic text.
For example, in addition to acquiring the speech emotion tag corresponding to the input speech, the content contained in the semantic text of the input speech also needs to be analyzed. The text features of the semantic text extracted by the emotion classification model in step 102 may be reused in step 103, and cognitive information capable of expressing the semantics may be obtained from these text features. The cognitive information includes: at least one of user portrait information, event flow information, and event decision information.
For example, in an actual execution process, the semantic text may not contain any cognitive information. Therefore, in addition to extracting cognitive information from the semantic text of the received input speech, the cognitive information may also be determined from information acquired by other information acquisition units in the sensing unit. Specifically, these other information acquisition units, for example an image acquisition unit, a date-and-time information acquisition unit and a historical behavior acquisition unit, may be activated while the input voice is received. The image information, date-and-time information, and/or historical behavior information acquired by these units is then converted into recognizable feature vectors, and the cognitive information is determined based on these feature vectors.
And 104, determining a response emotion tag corresponding to the voice emotion tag and a response text corresponding to the semantic text according to a preset emotion association model, a preset text association model, the voice emotion tag and the cognitive information.
The emotion association model is used for representing association relations among the voice emotion label, the response emotion label and the cognitive information, and the text association model is used for representing association relations among the voice emotion label, the response text and the cognitive information.
For example, the emotion association model (or the text association model) may be a classification prediction model trained in advance, the speech emotion tag and the cognitive information are used as input of the classification prediction model, so that a plurality of emotion tags (or a plurality of texts) output by the classification prediction model and a prediction probability corresponding to each emotion tag (or a prediction probability corresponding to each text) can be obtained, and then the response emotion tag (or the reply text) is determined from the plurality of emotion tags (or the plurality of texts) according to the prediction probabilities. Or, the emotion association model (or the text association model) may also be an association relation comparison table, and the comparison table includes association relations among the voice emotion tags, the response emotion tags (or the response texts), and the cognitive information. After determining the speech emotion tag and cognitive information, the look-up table may be queried directly to determine the responding emotion tag (or responding text).
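The two variants described above (a trained classification model that outputs candidates with probabilities, or a direct association look-up table) can be sketched as follows; the function names and the toy probability values are illustrative assumptions, not part of the disclosure.

```python
# A minimal sketch of step 104 under the two variants described above.
from typing import Dict, Optional, Tuple


def select_reply(first_probs: Dict[str, float],
                 second_probs: Dict[str, float]) -> Tuple[str, str]:
    """Pick the response emotion tag and reply text with the highest probabilities."""
    reply_emotion = max(first_probs, key=first_probs.get)   # highest first probability
    reply_text = max(second_probs, key=second_probs.get)    # highest second probability
    return reply_emotion, reply_text


def lookup_reply(table: Dict[Tuple[str, str], Tuple[str, str]],
                 emotion_tag: str, cognitive_key: str) -> Optional[Tuple[str, str]]:
    """Look-up-table variant: the association relation is stored explicitly."""
    return table.get((emotion_tag, cognitive_key))


# Example usage with toy outputs of the two association models.
emotion_set = {"comfort": 0.7, "cheerful": 0.2, "neutral": 0.1}
text_set = {"Don't worry, it will get better.": 0.8, "Okay.": 0.2}
print(select_reply(emotion_set, text_set))
```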
And 105, generating a reply voice aiming at the input voice according to the reply emotion tag and the reply text.
Step 106, outputting the reply voice.
The semantic text corresponding to the reply voice is the reply text, and the intonation feature of the reply voice is the intonation feature determined according to the response emotion label.
Illustratively, the intonation feature and the text feature are both variables of a TTS (Text To Speech) algorithm. In step 105, the intonation feature corresponding to the response emotion tag is determined first, and then the intonation feature and the text feature corresponding to the reply text are input as variables into the TTS algorithm to obtain the synthesized reply voice. After the reply voice is obtained, it may be output in step 106 through the sound output device of the electronic device described in the application scenario, so as to interact with the user.
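A minimal sketch of steps 105 and 106, under the assumption that the intonation feature is a small set of prosody parameters chosen by the response emotion tag; `INTONATION_BY_EMOTION`, `tts_synthesize` and the parameter names are hypothetical placeholders rather than a real TTS API.

```python
# Hypothetical assembly of the reply voice: intonation feature + reply text -> TTS.
from typing import Dict

INTONATION_BY_EMOTION: Dict[str, Dict[str, float]] = {
    # Assumed prosody parameters per response emotion tag; values are illustrative.
    "comfort":  {"pitch_shift": -0.1, "speaking_rate": 0.9},
    "cheerful": {"pitch_shift": 0.2, "speaking_rate": 1.1},
}


def tts_synthesize(text: str, intonation: Dict[str, float]) -> bytes:
    """Stand-in for the deployed TTS engine; returns placeholder audio bytes."""
    return b""  # replace with a call to the actual text-to-speech algorithm


def generate_reply_voice(reply_emotion_tag: str, reply_text: str) -> bytes:
    """Step 105: synthesize the reply voice; step 106 would then play these bytes."""
    intonation = INTONATION_BY_EMOTION.get(reply_emotion_tag, {})
    return tts_synthesize(reply_text, intonation)
```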
In summary, according to the technical solution provided by the embodiments of the present disclosure, a trained emotion classification model can be preset, and the speech emotion tag corresponding to an input speech is determined according to the sound spectrum feature of the input speech and the semantic text corresponding to the input speech; cognitive information is extracted from the semantic text, the cognitive information including: at least one of user portrait information, event flow information, and event decision information; a response emotion tag corresponding to the speech emotion tag and a reply text corresponding to the semantic text are determined according to a preset emotion association model, a preset text association model, the speech emotion tag and the cognitive information, wherein the emotion association model is used for representing the association relationship among the speech emotion tag, the response emotion tag and the cognitive information, and the text association model is used for representing the association relationship among the speech emotion tag, the reply text and the cognitive information; and a reply voice for the input speech is generated according to the voice emotion corresponding to the response emotion tag and according to the reply text, wherein the semantic text corresponding to the reply voice is the reply text, and the intonation of the reply voice is determined according to the response emotion tag. The voice emotion and the semantic text of the input speech can thus both be acquired, and the corresponding reply voice is generated from the response emotion corresponding to the voice emotion and the reply text corresponding to the semantic text, thereby improving the degree of intelligence of intelligent voice interaction.
Fig. 2 is a flow chart of a method of determining a speech emotion tag according to fig. 1, the emotion classification model comprising, as shown in fig. 2: a speech decoder, a text decoder, an audio processing model, a speech recognition model, and a classification prediction model, the classification prediction model being a Softmax logistic regression model including a connection layer and a Softmax layer, the step 102 may include:
in step 1021, the audio processing model obtains the audio spectrum feature corresponding to the input speech, and the audio spectrum feature is input to the speech decoder to obtain the corresponding first feature vector.
Illustratively, the audio processing model is used for preprocessing the speech, and the preprocessing may include: pre-emphasis, framing, windowing, and FFT (Fast Fourier Transform) processing. The pre-emphasis processing is used to emphasize the high-frequency part of the input speech and remove the effect of lip radiation, so as to increase the high-frequency resolution of the speech. In addition, because human speech is short-time stationary, a speech signal can be regarded as stable within a range of 10-30 ms; in the framing processing, the speech signal can therefore be framed with frames of no less than 20 ms and a frame shift of about 1/2 of the frame length. The frame shift is the overlapping area between two adjacent frames and is used to avoid excessive change between the two adjacent frames. Discontinuities appear at the beginning and end of each frame after framing, so the more frames there are, the larger the error with respect to the original signal. The windowing processing reduces this error, so that the framed signal becomes continuous and each frame of the speech signal exhibits the characteristics of a periodic function. The FFT processing is used to transform the time-domain audio signal into a frequency-domain sound spectrum signal. The final output of the audio processing model is the sound spectrum feature corresponding to the input speech. In addition, the speech decoder may include a convolutional neural network comprising a convolutional layer and a pooling layer.
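The preprocessing chain above can be sketched with numpy as follows; the 0.97 pre-emphasis coefficient, the Hamming window, the 16 kHz sampling rate and the 20 ms/10 ms frame settings are common choices assumed for illustration, not values fixed by the disclosure.

```python
# Sketch of the audio processing model's preprocessing: pre-emphasis, framing,
# windowing and FFT, producing the sound spectrum feature of the input speech.
import numpy as np


def spectro_features(signal: np.ndarray, sr: int = 16000, frame_ms: int = 20,
                     shift_ms: int = 10, pre_emph: float = 0.97) -> np.ndarray:
    # Pre-emphasis: boost the high-frequency part of the speech signal.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    frame_len = int(sr * frame_ms / 1000)     # >= 20 ms frames
    frame_shift = int(sr * shift_ms / 1000)   # about 1/2 of the frame length
    assert len(emphasized) >= frame_len, "signal must contain at least one frame"
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift

    # Windowing keeps each frame continuous and quasi-periodic.
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])

    # FFT: transform each time-domain frame into a frequency-domain spectrum.
    return np.abs(np.fft.rfft(frames, axis=1))
```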
In this step 1022, the semantic text in the input speech is recognized by the speech recognition model, and the semantic text is input into the text decoder to obtain the corresponding second feature vector.
Illustratively, the speech recognition model is an end-to-end ASR (Automatic Speech Recognition) model in which the input speech, after encoding and decoding, is converted into a piece of text, i.e., the semantic text of the input speech. The text decoder may comprise two sets of convolutional neural networks, each set comprising a convolutional layer and a pooling layer, with the output of the pooling layer of the preceding set serving as the input of the convolutional layer of the subsequent set.
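A sketch of the two stacked convolution-plus-pooling groups, read here as describing the text decoder; the embedding size, channel counts and kernel sizes are assumptions chosen for illustration.

```python
# Hypothetical text decoder: token embedding followed by two conv + pooling groups,
# where the pooled output of the first group feeds the convolution of the second.
import torch
import torch.nn as nn


class TextDecoder(nn.Module):
    def __init__(self, vocab_size: int = 8000, embed_dim: int = 128, out_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.block1 = nn.Sequential(                      # first conv + pooling group
            nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(), nn.MaxPool1d(2))
        self.block2 = nn.Sequential(                      # second conv + pooling group
            nn.Conv1d(128, out_dim, kernel_size=3, padding=1),
            nn.ReLU(), nn.AdaptiveMaxPool1d(1))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids).transpose(1, 2)         # (batch, embed_dim, seq_len)
        x = self.block2(self.block1(x))
        return x.squeeze(-1)                              # second feature vector


# Example: two token sequences of length 10 -> a (2, 256) batch of second feature vectors.
second_vecs = TextDecoder()(torch.randint(0, 8000, (2, 10)))
```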
In step 1023, the first feature vector and the second feature vector are spliced into a third feature vector through the connection layer.
In step 1024, the third feature vector is input into the Softmax layer, and an emotion tag corresponding to the third feature vector is obtained as the speech emotion tag.
For example, step 1021 may be performed simultaneously with step 1022, so that the first feature vector and the second feature vector are generated at the same time. After the first feature vector and the second feature vector are obtained through steps 1021 and 1022, the two feature vectors may be combined into a single feature vector (i.e., the third feature vector). The third feature vector reflects both the semantic characteristics and the sound-spectrum characteristics of the original audio of the input speech. Then, the third feature vector is input into the previously trained Softmax layer, and the emotion label output by the Softmax layer is acquired as the speech emotion label.
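A minimal sketch of steps 1023 and 1024: the connection layer concatenates the first and second feature vectors and the Softmax layer maps the result to emotion-label probabilities. The feature dimensions and the number of emotion labels are assumptions for illustration.

```python
# Hypothetical classification prediction model: connection layer + Softmax layer.
import torch
import torch.nn as nn


class ClassificationPredictionModel(nn.Module):
    def __init__(self, audio_dim: int = 256, text_dim: int = 256, n_emotions: int = 6):
        super().__init__()
        self.fc = nn.Linear(audio_dim + text_dim, n_emotions)

    def forward(self, first_vec: torch.Tensor, second_vec: torch.Tensor) -> torch.Tensor:
        third_vec = torch.cat([first_vec, second_vec], dim=-1)   # connection layer
        return torch.softmax(self.fc(third_vec), dim=-1)         # Softmax layer


model = ClassificationPredictionModel()
probs = model(torch.randn(1, 256), torch.randn(1, 256))
speech_emotion_tag = int(probs.argmax(dim=-1))   # tag can be recorded as a number
```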
Fig. 3 is a flow chart of another speech generation method according to fig. 2, as shown in fig. 3, before the step 101, the method may further include:
in this step 107, a preset classification prediction model is trained through preset speech emotion training data to obtain a trained classification prediction model.
In step 108, the emotion classification model is constructed by the speech decoder, the text decoder, the audio processing model, the speech recognition model and the trained classification prediction model.
Wherein the output of the audio processing model is the input of the speech decoder, the output of the speech recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the speech decoder and an output of the text decoder.
Illustratively, the construction of the emotion classification model may adopt either of two approaches. The first is the approach described in steps 107 and 108: a classification prediction model is first trained on a preset training data set containing a large amount of speech emotion training data; after the training of the classification prediction model is completed, the emotion classification model is constructed from the speech decoder, the text decoder, the audio processing model, the speech recognition model, and the trained classification prediction model. Each piece of speech emotion training data is a tuple consisting of two speech feature vectors (in the same form as the first feature vector and the second feature vector) and an emotion label.
Illustratively, another way of constructing the emotion classification model includes: step a, first constructing an initial emotion classification model from the speech decoder, the text decoder, the audio processing model, the speech recognition model and a preset classification prediction model; and step b, inputting a large number of tuples consisting of speech audio and emotion labels into the initial emotion classification model as training data to obtain the trained emotion classification model. It can be understood that, in this construction approach, the speech audio in each tuple input into the initial emotion classification model is fed to the audio processing model and the speech recognition model respectively; two feature vectors are then obtained through the speech decoder, the text decoder, the audio processing model and the speech recognition model; and the two feature vectors, together with the emotion label in the tuple, are input into the preset classification prediction model to train it. It can also be understood that, under this construction approach, the completion of the training of the classification prediction model means the completion of the construction of the whole emotion classification model.
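A sketch of step 107 in the first construction approach: the classification prediction model is trained on tuples of pre-extracted feature-vector pairs and emotion labels before being assembled into the full emotion classification model in step 108. The dataset shapes, the number of emotion classes and the optimizer settings are assumptions for illustration.

```python
# Hypothetical training loop for the classification prediction model (step 107).
import torch
import torch.nn as nn

# Toy stand-in for the preset speech emotion training data: each sample is a tuple
# (first feature vector, second feature vector, emotion label index).
first_vecs, second_vecs = torch.randn(100, 256), torch.randn(100, 256)
labels = torch.randint(0, 6, (100,))

model = nn.Linear(512, 6)                       # connection is done by torch.cat below
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                 # applies the softmax internally

for epoch in range(10):
    optimizer.zero_grad()
    logits = model(torch.cat([first_vecs, second_vecs], dim=-1))
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
```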
FIG. 4 is a flow chart of yet another speech generation method according to FIG. 1, as shown in FIG. 4, this step 103 may include: the step 1031, the step 1032 and/or the step 1033 and the step 1034.
The step 1031 is to extract a first text element describing personal information and/or interest information from the semantic text, so as to use a text feature corresponding to the first text element as the user portrait information.
The step 1032 is to extract a second text element for describing the event processing flow and/or the object development rule from the semantic text, so as to use a text feature corresponding to the second text element as the event flow information.
In step 1033, a third text element describing the event decision condition in the semantic text is identified.
In this step 1034, the event probability of each event result caused by the event decision condition is determined according to the third text element through a preset bayesian network model, so as to use each event result and the event probability corresponding to each event result as the event decision information.
Illustratively, the user portrait information may include: age, emotional state, gender, place of birth, occupation, favorite person, most frequently played song, favorite sport, and the like. The event flow information may include: flow information of various social activities engaged in by people, or flow information of the development rules of things in nature, such as the flow of cooking a certain dish, the flow of a train ticket purchase event, or the rotation of the four seasons and the alternation of day and night. Taking a train ticket purchase event as an example, the flow information may include an information tree composed of node information such as the ticket purchase time, the ticket price, the departure station, the destination, the boarding time and the arrival time, where each node of the information tree corresponds to one piece of node information. The event decision information differs from the event flow information in that the event decision information contains causal information about a certain event result being caused by a certain decision condition occurring in the course of human activity. The event decision information may include: decision information on whether to take an umbrella today, whether to hold a sale, whether to watch television, and the like. Taking the decision on whether to take an umbrella today as an example, the first decision condition may be that many people on the road are carrying umbrellas, and the second decision condition may be that it is raining today; the event result corresponding to the decision, namely that an umbrella needs to be taken, may then be determined based on the first decision condition and the second decision condition. The first text element, the second text element and the third text element may each be a word or a piece of text in the semantic text. For the user portrait information, the event flow information and the event decision conditions, the brain-like cognitive model has corresponding corpora. In steps 1031, 1032 and 1033, the first text element, the second text element and the third text element may be recognized through the corresponding corpora and a predetermined text recognition algorithm.
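As a stand-in for the Bayesian network's conditional probability table, the umbrella example above can be sketched with a plain dictionary; the condition strings and probability values are invented purely for illustration.

```python
# Dictionary-based conditional probability table for the umbrella decision example.
from typing import Dict, FrozenSet

CPT: Dict[FrozenSet[str], Dict[str, float]] = {
    frozenset({"many people carry umbrellas", "it is raining today"}):
        {"take an umbrella": 0.95, "do not take an umbrella": 0.05},
    frozenset({"it is raining today"}):
        {"take an umbrella": 0.80, "do not take an umbrella": 0.20},
    frozenset():
        {"take an umbrella": 0.10, "do not take an umbrella": 0.90},
}


def event_decision_info(conditions: FrozenSet[str]) -> Dict[str, float]:
    """Return each event result with its event probability given the decision conditions."""
    return CPT.get(conditions, CPT[frozenset()])


# Both decision conditions observed -> the result "take an umbrella" dominates.
print(event_decision_info(frozenset({"many people carry umbrellas", "it is raining today"})))
```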
Fig. 5 is a flow chart of a method of determining a responsive emotion tag and a responsive text according to fig. 1, as shown in fig. 5, this step 104 may include:
in this step 1041, the speech emotion tag and the cognitive information are respectively input into the emotion association model and the text association model, and a first probability set output by the emotion association model and a second probability set output by the text association model are obtained.
The first probability set comprises a plurality of emotion labels and a first probability corresponding to each emotion label, and the second probability set comprises a plurality of texts and a second probability corresponding to each text.
In step 1042, the emotion label corresponding to the highest first probability in the emotion labels is used as the responding emotion label.
In this step 1043, the text corresponding to the highest second probability in the plurality of texts is used as the reply text.
For example, the emotion association model and the text association model may both be classification prediction models, specifically neural network models with different structures. Taking the emotion association model as an example, the speech emotion tag and the cognitive information can be used as the input-side training data of the neural network model, and the emotion tag bound to that speech emotion tag and cognitive information can be used as the output-side training data to train the neural network model. Each emotion tag corresponds to one class, and the probability output by the trained neural network model (the emotion association model) is the classification prediction probability of the current speech emotion tag and cognitive information for each class (i.e., each emotion tag).
Fig. 6 is a flow chart of a speech synthesis method according to fig. 1, as shown in fig. 6, the step 105 may include:
in step 1051, a preset intonation correlation model is input according to the response emotion tag, and the intonation characteristics corresponding to the response emotion tag output by the intonation correlation model are obtained.
This step 1052 synthesizes the intonation features and the reply text into the reply speech through a predetermined text-to-speech TTS algorithm.
For example, the TTS algorithm uses the intonation feature and the reply text as the basis for synthesizing the reply voice; therefore, before step 1052, the intonation feature corresponding to the response emotion tag needs to be determined through an intonation association model capable of representing the correspondence between emotion tags and intonation features. The intonation association model may also be an association look-up table or a classification prediction model trained in advance. After the response emotion tag is determined, the intonation feature may be determined in step 1051 by querying the look-up table or by inputting the tag into the model for classification prediction.
In summary, according to the technical solution provided by the embodiments of the present disclosure, a trained emotion classification model can be preset, and the speech emotion tag corresponding to an input speech is determined according to the sound spectrum feature of the input speech and the semantic text corresponding to the input speech; cognitive information is extracted from the semantic text, the cognitive information including: at least one of user portrait information, event flow information, and event decision information; a response emotion tag corresponding to the speech emotion tag and a reply text corresponding to the semantic text are determined according to a preset emotion association model, a preset text association model, the speech emotion tag and the cognitive information, wherein the emotion association model is used for representing the association relationship among the speech emotion tag, the response emotion tag and the cognitive information, and the text association model is used for representing the association relationship among the speech emotion tag, the reply text and the cognitive information; and a reply voice for the input speech is generated according to the voice emotion corresponding to the response emotion tag and according to the reply text, wherein the semantic text corresponding to the reply voice is the reply text, and the intonation of the reply voice is determined according to the response emotion tag. The voice emotion and the semantic text of the input speech can thus both be acquired, and the corresponding reply voice is generated from the response emotion corresponding to the voice emotion and the reply text corresponding to the semantic text, thereby improving the degree of intelligence of intelligent voice interaction.
Fig. 7 is a block diagram of a speech generating apparatus according to an exemplary embodiment, and as shown in fig. 7, the apparatus 700 may include:
a voice receiving module 710 configured to receive an input voice;
the label determining module 720 is configured to determine a speech emotion label corresponding to the input speech according to the sound spectrum feature of the input speech and the semantic text corresponding to the input speech through a preset trained emotion classification model;
an information extraction module 730 configured to extract cognitive information from the semantic text, the cognitive information including: at least one of user portrait information, event flow information, and event decision information;
the information determining module 740 is configured to determine, according to a preset emotion association model, a preset text association model, the voice emotion tag and the cognitive information, a response emotion tag corresponding to the voice emotion tag and a reply text corresponding to the semantic text, wherein the emotion association model is used for representing the association relationship among the voice emotion tag, the response emotion tag and the cognitive information, and the text association model is used for representing the association relationship among the voice emotion tag, the reply text and the cognitive information;
a speech synthesis module 750 configured to generate a reply speech for the input speech according to the reply emotion tag and the reply text, where a semantic text corresponding to the reply speech is the reply text, and a tone feature of the reply speech is a tone feature determined according to the reply emotion tag;
a voice output module 760 configured to output the reply voice.
Optionally, the emotion classification model includes: a speech decoder, a text decoder, an audio processing model, a speech recognition model, and a classification prediction model comprising a connection layer and a Softmax layer, the tag determination module 720 configured to:
obtaining a sound frequency spectrum characteristic corresponding to the input voice through the audio processing model, and inputting the sound frequency spectrum characteristic into the voice decoder to obtain a corresponding first characteristic vector;
recognizing a semantic text in the input voice through the voice recognition model, and inputting the semantic text into the text decoder to obtain a corresponding second feature vector;
splicing the first feature vector and the second feature vector into a third feature vector through the connection layer;
and inputting the third feature vector into the Softmax layer, and acquiring an emotion label corresponding to the third feature vector as the voice emotion label.
FIG. 8 is a block diagram of another speech generating apparatus based on the apparatus shown in FIG. 7. As shown in FIG. 8, the apparatus 700 may further comprise:
a model training module 770 configured to train a preset classification prediction model through preset speech emotion training data to obtain a trained classification prediction model;
a model construction module 780 configured to construct the emotion classification model through the speech decoder, the text decoder, the audio processing model, the speech recognition model, and the trained classification prediction model; wherein,
the output of the audio processing model is the input of the speech decoder, the output of the speech recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the speech decoder and an output of the text decoder.
Optionally, the information extracting module 730 is configured to:
extracting a first text element for describing personal information and/or interest information from the semantic text to take a text feature corresponding to the first text element as the user portrait information;
extracting a second text element for describing an event processing flow and/or an object development rule from the semantic text, and taking a text feature corresponding to the second text element as the event flow information; and/or,
identifying a third text element in the semantic text for describing an event decision condition;
and determining the event probability of each event result caused by the event decision condition according to the third text element through a preset Bayesian network model, and taking each event result and the event probability corresponding to each event result as the event decision information.
Optionally, the information determining module 740 is configured to:
respectively inputting the voice emotion label and the cognitive information into the emotion association model and the text association model, and acquiring a first probability set output by the emotion association model and a second probability set output by the text association model; wherein the first probability set comprises a plurality of emotion labels and a first probability corresponding to each emotion label, and the second probability set comprises a plurality of texts and a second probability corresponding to each text;
using the emotion label corresponding to the highest first probability in the plurality of emotion labels as the response emotion label; and
taking the text corresponding to the highest second probability in the plurality of texts as the reply text.
Optionally, the speech synthesis module 750 is configured to:
inputting the response emotion label into a preset intonation association model, and acquiring the intonation feature corresponding to the response emotion label output by the intonation association model;
and synthesizing the intonation feature and the reply text into the reply voice through a preset text-to-speech (TTS) algorithm.
In summary, according to the technical solution provided by the embodiments of the present disclosure, a trained emotion classification model can be preset, and the speech emotion tag corresponding to an input speech is determined according to the sound spectrum feature of the input speech and the semantic text corresponding to the input speech; cognitive information is extracted from the semantic text, the cognitive information including: at least one of user portrait information, event flow information, and event decision information; a response emotion tag corresponding to the speech emotion tag and a reply text corresponding to the semantic text are determined according to a preset emotion association model, a preset text association model, the speech emotion tag and the cognitive information, wherein the emotion association model is used for representing the association relationship among the speech emotion tag, the response emotion tag and the cognitive information, and the text association model is used for representing the association relationship among the speech emotion tag, the reply text and the cognitive information; and a reply voice for the input speech is generated according to the voice emotion corresponding to the response emotion tag and according to the reply text, wherein the semantic text corresponding to the reply voice is the reply text, and the intonation of the reply voice is determined according to the response emotion tag. The voice emotion and the semantic text of the input speech can thus both be acquired, and the corresponding reply voice is generated from the response emotion corresponding to the voice emotion and the reply text corresponding to the semantic text, thereby improving the degree of intelligence of intelligent voice interaction.
FIG. 9 is a block diagram illustrating an apparatus 900 for speech generation according to an example embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 9, apparatus 900 may include one or more of the following components: a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls overall operation of the device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the speech generation method described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the apparatus 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power component 906 provides power to the various components of device 900. The power components 906 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 comprises a screen providing an output interface between the device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when apparatus 900 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessment of various aspects of the apparatus 900. For example, sensor assembly 914 may detect an open/closed state of device 900, the relative positioning of components, such as a display and keypad of device 900, the change in position of device 900 or a component of device 900, the presence or absence of user contact with device 900, the orientation or acceleration/deceleration of device 900, and the change in temperature of device 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communications between the apparatus 900 and other devices in a wired or wireless manner. The apparatus 900 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described speech generation methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the device 900 to perform the speech generation method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described speech generation method when executed by the programmable apparatus.
The device for generating the voice can acquire the voice emotion and the semantic text of the input voice, generate the corresponding reply voice according to the response emotion corresponding to the voice emotion and the reply text corresponding to the semantic text, and improve the intelligent degree of intelligent voice interaction.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A method of speech generation, the method comprising:
receiving input voice;
determining a voice emotion label corresponding to the input voice according to the voice frequency spectrum characteristics of the input voice and the semantic text corresponding to the input voice through a preset trained emotion classification model;
extracting cognitive information from the semantic text, wherein the cognitive information comprises: at least one of user portrait information, event flow information, and event decision information;
determining a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text according to a preset emotion association model, a preset text association model, the voice emotion label and the cognitive information; wherein the emotion association model is used for representing the association relationship among the voice emotion label, the response emotion label and the cognitive information, and the text association model is used for representing the association relationship among the voice emotion label, the reply text and the cognitive information;
generating a reply voice for the input voice according to the response emotion label and the reply text, wherein the semantic text corresponding to the reply voice is the reply text, and the intonation feature of the reply voice is determined according to the response emotion label;
and outputting the reply voice.
2. The method of claim 1, wherein the emotion classification model comprises: a voice decoder, a text decoder, an audio processing model, a voice recognition model and a classification prediction model, the classification prediction model comprising a connection layer and a Softmax layer, and wherein the determining, through the preset trained emotion classification model, the voice emotion label corresponding to the input voice according to the sound spectrum feature of the input voice and the semantic text corresponding to the input voice comprises:
obtaining a sound spectrum feature corresponding to the input voice through the audio processing model, and inputting the sound spectrum feature into the voice decoder to obtain a corresponding first feature vector;
recognizing a semantic text in the input voice through the voice recognition model, and inputting the semantic text into the text decoder to obtain a corresponding second feature vector;
splicing the first feature vector and the second feature vector into a third feature vector through the connection layer;
and inputting the third feature vector into the Softmax layer, and acquiring an emotion label corresponding to the third feature vector as the voice emotion label.
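A compact PyTorch sketch of the classification prediction part of claim 2 is given below: the first and second feature vectors are spliced by a connection layer and mapped to emotion-label probabilities by a Softmax layer. The vector dimensions, the number of emotion labels, and the use of a single linear layer are assumptions for illustration; the actual decoders and layer shapes are not specified by the disclosure.

    import torch
    import torch.nn as nn

    class ClassificationPredictionModel(nn.Module):
        # Assumed dimensions: 128-dim outputs from the voice decoder and the text
        # decoder, and six candidate emotion labels.
        def __init__(self, speech_dim=128, text_dim=128, num_emotion_labels=6):
            super().__init__()
            self.connection_layer = nn.Linear(speech_dim + text_dim, num_emotion_labels)
            self.softmax = nn.Softmax(dim=-1)

        def forward(self, first_feature_vec, second_feature_vec):
            # Splice the first and second feature vectors into a third feature vector.
            third_feature_vec = torch.cat([first_feature_vec, second_feature_vec], dim=-1)
            # The Softmax layer yields a probability for each candidate emotion label.
            probs = self.softmax(self.connection_layer(third_feature_vec))
            return probs.argmax(dim=-1)  # index of the predicted voice emotion label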
3. The method according to claim 1 or 2, wherein before determining, through the preset trained emotion classification model, the voice emotion label corresponding to the input voice according to the sound spectrum feature of the input voice and the semantic text corresponding to the input voice, the method further comprises:
training a preset classification prediction model through preset speech emotion training data to obtain a trained classification prediction model;
constructing the emotion classification model through the voice decoder, the text decoder, the audio processing model, the voice recognition model and the trained classification prediction model; wherein,
the output of the audio processing model is the input of the voice decoder, the output of the voice recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the voice decoder and an output of the text decoder.
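The training step of claim 3 might look roughly as follows; the optimizer, the loss function, and the layout of the preset speech emotion training data (pairs of decoder feature vectors plus a label index) are assumptions, not details taken from the disclosure.

    import torch
    import torch.nn as nn

    def train_classification_prediction_model(head, training_samples, epochs=10, lr=1e-3):
        # Sketch: fit the connection layer on preset speech emotion training data.
        # Each sample is assumed to be (speech_feature_vec, text_feature_vec, label_index).
        optimizer = torch.optim.Adam(head.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for speech_vec, text_vec, label_index in training_samples:
                logits = head(torch.cat([speech_vec, text_vec], dim=-1))
                loss = loss_fn(logits.unsqueeze(0), torch.tensor([label_index]))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return head

    # Hypothetical head: a 256-dim spliced feature vector mapped to six emotion labels.
    trained_head = train_classification_prediction_model(nn.Linear(256, 6), training_samples=[])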
4. The method of claim 1, wherein the extracting cognitive information from the semantic text comprises:
extracting, from the semantic text, a first text element for describing personal information and/or interest information, and taking a text feature corresponding to the first text element as the user portrait information;
extracting, from the semantic text, a second text element for describing an event processing flow and/or an object development rule, and taking a text feature corresponding to the second text element as the event flow information; and/or,
identifying a third text element in the semantic text for describing an event decision condition;
and determining the event probability of each event result caused by the event decision condition according to the third text element through a preset Bayesian network model, so as to use each event result and the event probability corresponding to each event result as the event decision information.
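For the event-decision branch of claim 4, the sketch below replaces the preset Bayesian network model with a small hand-written conditional probability table; it only shows how each event result caused by an event decision condition, together with its event probability, becomes event decision information. The conditions, results, and probabilities are invented for illustration.

    # Stand-in for the preset Bayesian network model: P(event result | decision condition).
    # All conditions, results and probabilities here are illustrative assumptions.
    EVENT_CPT = {
        "heavy_rain": {"flight_delayed": 0.6, "flight_on_time": 0.4},
        "clear_sky":  {"flight_delayed": 0.1, "flight_on_time": 0.9},
    }

    def event_decision_information(third_text_element):
        # Map the event decision condition (third text element) to each event
        # result it may cause and the corresponding event probability.
        distribution = EVENT_CPT.get(third_text_element, {})
        return [(result, probability) for result, probability in distribution.items()]

    print(event_decision_information("heavy_rain"))
    # -> [('flight_delayed', 0.6), ('flight_on_time', 0.4)]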
5. The method according to claim 1, wherein the determining the response emotion label corresponding to the voice emotion label and the reply text corresponding to the semantic text according to the preset emotion association model, the preset text association model, the voice emotion label and the cognitive information comprises:
inputting the voice emotion label and the cognitive information into the emotion association model and the text association model respectively, and acquiring a first probability set output by the emotion association model and a second probability set output by the text association model; wherein the first probability set comprises a plurality of emotion labels and a first probability corresponding to each emotion label, and the second probability set comprises a plurality of texts and a second probability corresponding to each text;
selecting, as the response emotion label, the emotion label corresponding to the highest first probability among the plurality of emotion labels; and,
and taking the text corresponding to the highest second probability in the plurality of texts as the reply text.
6. The method of claim 1, wherein the generating a reply voice for the input voice according to the response emotion label and the reply text comprises:
inputting the response emotion label into a preset intonation correlation model, and acquiring the intonation feature that the intonation correlation model outputs for the response emotion label;
and synthesizing the intonation feature and the reply text into the reply voice through a preset text-to-speech (TTS) algorithm.
7. An apparatus for generating speech, the apparatus comprising:
a voice receiving module configured to receive an input voice;
a label determining module configured to determine a voice emotion label corresponding to the input voice according to the sound spectrum feature of the input voice and the semantic text corresponding to the input voice through a preset trained emotion classification model;
an information extraction module configured to extract cognitive information from the semantic text, the cognitive information comprising: at least one of user portrait information, event flow information, and event decision information;
an information determining module configured to determine a response emotion label corresponding to the voice emotion label and a reply text corresponding to the semantic text according to a preset emotion association model, a preset text association model, the voice emotion label and the cognitive information; wherein the emotion association model is used for representing the association relationship among the voice emotion label, the response emotion label and the cognitive information, and the text association model is used for representing the association relationship among the voice emotion label, the reply text and the cognitive information;
a voice synthesis module configured to generate a reply voice for the input voice according to the response emotion label and the reply text, wherein the semantic text corresponding to the reply voice is the reply text, and the intonation feature of the reply voice is determined according to the response emotion label;
a voice output module configured to output the reply voice.
8. The apparatus of claim 7, wherein the emotion classification model comprises: a voice decoder, a text decoder, an audio processing model, a voice recognition model and a classification prediction model, the classification prediction model comprising a connection layer and a Softmax layer, and the label determining module is configured to:
obtaining a sound spectrum feature corresponding to the input voice through the audio processing model, and inputting the sound spectrum feature into the voice decoder to obtain a corresponding first feature vector;
recognizing a semantic text in the input voice through the voice recognition model, and inputting the semantic text into the text decoder to obtain a corresponding second feature vector;
splicing the first feature vector and the second feature vector into a third feature vector through the connection layer;
and inputting the third feature vector into the Softmax layer, and acquiring an emotion label corresponding to the third feature vector as the voice emotion label.
9. The apparatus of claim 7 or 8, further comprising:
a model training module configured to train a preset classification prediction model through preset speech emotion training data to obtain a trained classification prediction model;
a model construction module configured to construct the emotion classification model through the voice decoder, the text decoder, the audio processing model, the voice recognition model and the trained classification prediction model; wherein,
the output of the audio processing model is the input of the voice decoder, the output of the voice recognition model is the input of the text decoder, and the input of the classification prediction model comprises: an output of the voice decoder and an output of the text decoder.
10. The apparatus of claim 7, wherein the information extraction module is configured to:
extracting, from the semantic text, a first text element for describing personal information and/or interest information, and taking a text feature corresponding to the first text element as the user portrait information;
extracting, from the semantic text, a second text element for describing an event processing flow and/or an object development rule, and taking a text feature corresponding to the second text element as the event flow information; and/or,
identifying a third text element in the semantic text for describing an event decision condition;
and determining the event probability of each event result caused by the event decision condition according to the third text element through a preset Bayesian network model, so as to use each event result and the event probability corresponding to each event result as the event decision information.
11. The apparatus of claim 7, wherein the information determining module is configured to:
inputting the voice emotion label and the cognitive information into the emotion association model and the text association model respectively, and acquiring a first probability set output by the emotion association model and a second probability set output by the text association model; wherein the first probability set comprises a plurality of emotion labels and a first probability corresponding to each emotion label, and the second probability set comprises a plurality of texts and a second probability corresponding to each text;
selecting, as the response emotion label, the emotion label corresponding to the highest first probability among the plurality of emotion labels; and,
and taking the text corresponding to the highest second probability in the plurality of texts as the reply text.
12. The apparatus of claim 7, wherein the voice synthesis module is configured to:
inputting the response emotion label into a preset intonation correlation model, and acquiring the intonation feature that the intonation correlation model outputs for the response emotion label;
and synthesizing the intonation feature and the reply text into the reply voice through a preset text-to-speech (TTS) algorithm.
13. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 6.
14. An electronic device is characterized in that a voice interaction system is arranged in the electronic device;
the electronic device includes: the speech generating apparatus of any of claims 7-12.
CN202011003603.9A 2020-09-22 2020-09-22 Voice generation method and device, storage medium and electronic equipment Pending CN112185389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011003603.9A CN112185389A (en) 2020-09-22 2020-09-22 Voice generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112185389A true CN112185389A (en) 2021-01-05

Family

ID=73955767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011003603.9A Pending CN112185389A (en) 2020-09-22 2020-09-22 Voice generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112185389A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004090109A (en) * 2002-08-29 2004-03-25 Sony Corp Robot device and interactive method for robot device
US20050144002A1 (en) * 2003-12-09 2005-06-30 Hewlett-Packard Development Company, L.P. Text-to-speech conversion with associated mood tag
US20130054244A1 (en) * 2010-08-31 2013-02-28 International Business Machines Corporation Method and system for achieving emotional text to speech
US20140365208A1 (en) * 2013-06-05 2014-12-11 Microsoft Corporation Classification of affective states in social media
US20160329043A1 (en) * 2014-01-21 2016-11-10 Lg Electronics Inc. Emotional-speech synthesizing device, method of operating the same and mobile terminal including the same
CN109784414A (en) * 2019-01-24 2019-05-21 出门问问信息科技有限公司 Customer anger detection method, device and electronic equipment in a kind of phone customer service
WO2019102884A1 (en) * 2017-11-21 2019-05-31 日本電信電話株式会社 Label generation device, model learning device, emotion recognition device, and method, program, and storage medium for said devices
CN109859772A (en) * 2019-03-22 2019-06-07 平安科技(深圳)有限公司 Emotion identification method, apparatus and computer readable storage medium
US20190295533A1 (en) * 2018-01-26 2019-09-26 Shanghai Xiaoi Robot Technology Co., Ltd. Intelligent interactive method and apparatus, computer device and computer readable storage medium
CN110827797A (en) * 2019-11-06 2020-02-21 北京沃东天骏信息技术有限公司 Voice response event classification processing method and device
CN111028827A (en) * 2019-12-10 2020-04-17 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765981A (en) * 2021-02-09 2021-05-07 珠海格力电器股份有限公司 Text information generation method and device
CN112967725A (en) * 2021-02-26 2021-06-15 平安科技(深圳)有限公司 Voice conversation data processing method and device, computer equipment and storage medium
CN112992147A (en) * 2021-02-26 2021-06-18 平安科技(深圳)有限公司 Voice processing method, device, computer equipment and storage medium
CN113506586A (en) * 2021-06-18 2021-10-15 杭州摸象大数据科技有限公司 Method and system for recognizing emotion of user
CN113645364A (en) * 2021-06-21 2021-11-12 国网浙江省电力有限公司金华供电公司 Intelligent voice outbound method facing power dispatching
CN113645364B (en) * 2021-06-21 2023-08-22 国网浙江省电力有限公司金华供电公司 Intelligent voice outbound method for power dispatching
CN113539261A (en) * 2021-06-30 2021-10-22 大众问问(北京)信息科技有限公司 Man-machine voice interaction method and device, computer equipment and storage medium
CN113593521A (en) * 2021-07-29 2021-11-02 北京三快在线科技有限公司 Speech synthesis method, device, equipment and readable storage medium
CN113593521B (en) * 2021-07-29 2022-09-20 北京三快在线科技有限公司 Speech synthesis method, device, equipment and readable storage medium
CN116758908A (en) * 2023-08-18 2023-09-15 中国工业互联网研究院 Interaction method, device, equipment and storage medium based on artificial intelligence
CN116758908B (en) * 2023-08-18 2023-11-07 中国工业互联网研究院 Interaction method, device, equipment and storage medium based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination