CN112992147A - Voice processing method, device, computer equipment and storage medium - Google Patents

Voice processing method, device, computer equipment and storage medium

Info

Publication number
CN112992147A
CN112992147A (application number CN202110217729.4A)
Authority
CN
China
Prior art keywords
emotion
voice
recognition model
voice data
grained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110217729.4A
Other languages
Chinese (zh)
Inventor
顾艳梅
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110217729.4A priority Critical patent/CN112992147A/en
Publication of CN112992147A publication Critical patent/CN112992147A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of voice processing, and more particularly to a voice processing method, apparatus, computer device and storage medium. Voice responses are generated according to the emotional characteristics of a user by fusing a coarse-grained emotion category into the intention recognition process and a fine-grained emotion category into the speech synthesis process, which improves both the accuracy of the response voice signal and the user experience. The method includes: acquiring voice data to be processed; performing voice recognition on the voice data to obtain text information; calling an emotion recognition model and inputting the voice data into the emotion recognition model for emotion recognition to obtain a coarse-grained emotion category and a fine-grained emotion category; determining response text information corresponding to the voice data according to the text information and the coarse-grained emotion category; and performing speech synthesis according to the fine-grained emotion category and the response text information to obtain a response voice signal corresponding to the voice data. The application also relates to blockchain technology; the emotion recognition model may be stored in a blockchain.

Description

Voice processing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech processing method and apparatus, a computer device, and a storage medium.
Background
With the rapid development of artificial intelligence, intelligent voice robots such as outbound robots, chat robots, and intelligent customer service agents have appeared. The intelligent voice robot provides services such as online question answering, consultation, and instruction execution through artificial intelligence technologies such as voice recognition, semantic understanding, and dialogue management. However, in the existing voice interaction process, the intelligent voice robot generally converts the received user voice into text, determines a response text from that text, synthesizes the voice, and finally outputs a flat, emotionless response voice; in this process, the intelligent voice robot does not consider the influence of the actual context, so the response voice is poorly matched to the user and the user experience is degraded.
Therefore, how to improve the accuracy of the response voice of the intelligent voice robot becomes an urgent problem to be solved.
Disclosure of Invention
The application provides a voice processing method, an apparatus, a computer device and a storage medium, which perform voice responses according to the emotional characteristics of a user by fusing a coarse-grained emotion category into the intention recognition process and a fine-grained emotion category into the speech synthesis process, thereby improving the accuracy of the response voice signal and the user experience.
In a first aspect, the present application provides a speech processing method, including:
acquiring voice data to be processed;
performing voice recognition on the voice data to obtain text information corresponding to the voice data;
calling an emotion recognition model, inputting the voice data into the emotion recognition model for emotion recognition, and obtaining a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data;
determining response text information corresponding to the voice data according to the text information and the coarse-grained emotion types;
and performing voice synthesis according to the fine-grained emotion category and the response text information to obtain a response voice signal corresponding to the voice data.
In a second aspect, the present application further provides a speech processing apparatus, comprising:
the voice data acquisition module is used for acquiring voice data to be processed;
the voice recognition module is used for carrying out voice recognition on the voice data to obtain text information corresponding to the voice data;
the emotion recognition module is used for calling an emotion recognition model, inputting the voice data into the emotion recognition model for emotion recognition, and obtaining a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data;
the response text generation module is used for determining response text information corresponding to the voice data according to the text information and the coarse-grained emotion categories;
and the voice synthesis module is used for carrying out voice synthesis according to the fine-grained emotion categories and the response text information to obtain response voice signals corresponding to the voice data.
In a third aspect, the present application further provides a computer device comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to execute the computer program and implement the voice processing method as described above when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the speech processing method as described above.
The application discloses a voice processing method, a voice processing apparatus, a computer device and a storage medium. Text information corresponding to voice data can be obtained by acquiring voice data to be processed and performing voice recognition on the voice data; an emotion recognition model is called and the voice data is input into the emotion recognition model for emotion recognition, so that a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data can be obtained; response text information corresponding to the voice data is determined according to the text information and the coarse-grained emotion category, so that the coarse-grained emotion category is fused into the intention recognition process; and speech synthesis is performed according to the fine-grained emotion category and the response text information, so that the fine-grained emotion category is fused into the speech synthesis process. A voice response is thus made according to the emotional characteristics of the user, improving the accuracy of the response voice signal and the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a schematic flow chart diagram of a speech processing method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of generating a response voice signal according to voice data according to an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a sub-step of obtaining voice data to be processed according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of sub-steps of training an emotion recognition model provided by an embodiment of the present application;
FIG. 5 is a schematic flow chart diagram of a sub-step of determining a response text message provided by an embodiment of the present application;
FIG. 6 is a schematic flow chart of sub-steps of performing speech synthesis provided by an embodiment of the present application;
fig. 7 is a schematic block diagram of a speech processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a voice processing method and device, computer equipment and a storage medium. The voice processing method can be applied to a server or a terminal, and realizes voice response according to the emotional characteristics of the user by fusing the coarse-grained emotion categories into the intention recognition process and fusing the fine-grained emotion categories into the voice synthesis process, so that the accuracy of voice signal response and the experience of the user are improved.
The server may be an independent server or a server cluster. The terminal can be an electronic device such as a smart phone, a tablet computer, a notebook computer, a desktop computer and the like.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
As shown in fig. 1, the voice processing method includes steps S101 to S105.
And step S101, acquiring voice data to be processed.
It should be noted that the voice data to be processed may be a voice signal of a user collected in advance, or may be a voice signal of a user collected in real time. In the embodiment of the present application, a detailed description is given by taking an example of collecting a voice signal of a user in real time.
Referring to fig. 2, fig. 2 is a schematic diagram illustrating how a response voice signal is generated from voice data according to an embodiment of the present application. As shown in fig. 2, the voice data is first input into a speech recognition model and an emotion recognition model respectively; the speech recognition model performs speech recognition on the voice data and outputs text information to an intention recognition model, while the emotion recognition model performs emotion recognition on the voice data, outputs a coarse-grained emotion category to the intention recognition model, and outputs a fine-grained emotion category to a speech synthesis model. The intention recognition model then performs intention recognition according to the text information and the coarse-grained emotion category and outputs response text information to the speech synthesis model. Finally, the speech synthesis model performs speech synthesis according to the response text information and the fine-grained emotion category and outputs a response voice signal.
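A minimal end-to-end sketch of the pipeline in fig. 2 is shown below. The model objects and their method names (recognize, predict_coarse, predict_fine, generate_reply, synthesize) are hypothetical placeholders, not interfaces defined by this application.

```python
def respond_to_voice(voice_data, asr_model, emotion_model, intent_model, tts_model):
    # Speech recognition: voice data -> text information
    text_info = asr_model.recognize(voice_data)

    # Emotion recognition: voice data -> coarse- and fine-grained emotion categories
    coarse_emotion = emotion_model.predict_coarse(voice_data)
    fine_emotion = emotion_model.predict_fine(voice_data, coarse_emotion)

    # Intention recognition fuses the coarse-grained emotion category
    response_text = intent_model.generate_reply(text_info, coarse_emotion)

    # Speech synthesis fuses the fine-grained emotion category
    return tts_model.synthesize(response_text, fine_emotion)
```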
Referring to fig. 3, fig. 3 is a schematic flowchart of sub-steps of acquiring to-be-processed voice data according to an embodiment of the present application, and specifically includes the following steps S1011 to S1013.
Step S1011, acquiring a voice signal acquired by the voice acquisition device.
The voice collecting device may be, for example, an electronic device for collecting voice such as a recorder, a recording pen, or a microphone. The voice collecting device may be installed in the intelligent voice robot.
In some application scenarios, a user may transact business through an intelligent voice robot. For example, when handling business, the intelligent voice robot may collect voice signals input by a user through the voice collecting device. Wherein the speech signal may be a speech signal of the user in different emotions.
Step S1012, extracting a useful speech signal from the speech signal based on a speech detection model preset in the blockchain.
It should be noted that, since the speech signal may include a useless signal, in order to improve the accuracy of the subsequent recognition of the emotion category, it is necessary to extract a useful speech signal from the speech signal. Unwanted signals may include, but are not limited to, footsteps, horns, silence, machine noise, and the like.
For example, the preset voice detection model may include a voice activity endpoint detection model. It should be noted that in voice signal processing, voice activity detection (VAD) is used to detect whether speech is present, so as to separate the speech segments and non-speech segments in a signal. VAD can be used for echo cancellation, noise suppression, speaker recognition, speech recognition, and the like.
In some embodiments, the initial voice activity endpoint detection model may be trained in advance to obtain a trained voice activity endpoint detection model. To further ensure the privacy and security of the trained voice activity endpoint detection model, the trained voice activity endpoint detection model may be stored in a node of a blockchain. When the trained voice activity endpoint detection model needs to be used, it can be obtained from the nodes of the blockchain.
In some embodiments, extracting a useful speech signal from the speech signal based on a preset speech detection model in the blockchain may include: segmenting the voice signal to obtain at least one segmented voice signal corresponding to the voice signal; determining a short-time energy of each segmented speech signal; and splicing the segmented voice signals corresponding to the short-time energy larger than the preset energy amplitude value to obtain the useful voice signals.
The preset energy amplitude value may be set according to an actual situation, and the specific value is not limited herein.
For example, when extracting a useful speech signal from a speech signal based on a speech activity endpoint detection model, in addition to the short-term energy, the determination may be performed according to features of the speech signal, such as spectral energy, zero crossing rate, and the like, and the specific process is not limited herein.
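As an illustration of the short-time-energy criterion described above, the following sketch segments the signal, computes the short-time energy of each segment, and splices together the segments whose energy exceeds a preset amplitude. The frame length and threshold values are assumptions, not values from this application.

```python
import numpy as np

def extract_useful_speech(signal, frame_len=160, energy_threshold=1e-3):
    """Keep only the segments whose short-time energy exceeds a preset amplitude.

    frame_len (in samples) and energy_threshold are illustrative values; in
    practice they would be tuned to the recording setup.
    """
    signal = np.asarray(signal, dtype=np.float64)

    # Segment the signal into fixed-length frames (the trailing remainder is dropped)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Short-time energy of each segmented speech signal
    energy = np.sum(frames ** 2, axis=1) / frame_len

    # Splice the segments whose energy is larger than the preset amplitude
    kept = frames[energy > energy_threshold]
    return kept.reshape(-1)
```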
Step S1013, determining the speech data according to the useful speech signal.
For example, the useful voice signal extracted from the voice signal may be determined as voice data.
By extracting the useful voice signals in the voice signals based on the preset voice detection model, the recognition accuracy of subsequent voice recognition and emotion categories can be improved.
And S102, carrying out voice recognition on the voice data to obtain text information corresponding to the voice data.
For example, a speech recognition model may be invoked to perform speech recognition on speech data, so as to obtain text information corresponding to the speech data.
The speech recognition model may include, but is not limited to, hidden Markov models, convolutional neural networks, restricted Boltzmann machines, recurrent neural networks, long short-term memory networks, and time-delay neural networks, among others.
In the embodiment of the present application, the speech recognition model is described in detail by taking a Time Delay Neural Network (TDNN) as an example. It should be noted that the TDNN is an artificial neural network structure used for classifying phonemes in a speech signal so as to automatically recognize speech; the TDNN recognizes phonemes and their underlying acoustic/phonetic features independently of their position in time, i.e., independently of time shifts.
Illustratively, the speech recognition model is a pre-trained time-delay neural network model. The specific training process is not limited herein.
In some embodiments, the speech data may be input into a speech recognition model for speech recognition based on the GPU cluster, so as to obtain text information corresponding to the speech data. For example, the text message is "do not call me any more".
It should be noted that a GPU (Graphics Processing Unit) cluster is a computer cluster in which each node is equipped with a graphics processing unit. Because general-purpose GPUs have a highly data-parallel architecture, a large number of data points can be processed in parallel, so the GPU cluster can perform fast computation and improve computational throughput.
Based on the GPU cluster, the voice data is input into the voice recognition model for voice recognition, so that the accuracy and efficiency of the voice recognition can be improved.
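The following is a minimal sketch of running batched acoustic features through a pre-trained TDNN on a GPU using PyTorch; the tdnn_model and decoder objects are hypothetical, and the application does not prescribe any particular framework.

```python
import torch

def recognize_on_gpu(feature_batch, tdnn_model, decoder):
    """Run batched acoustic features through a pre-trained TDNN acoustic model.

    tdnn_model is assumed to be a trained torch.nn.Module and decoder a callable
    that turns per-frame outputs into text; both are hypothetical placeholders.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tdnn_model = tdnn_model.to(device).eval()

    with torch.no_grad():
        # feature_batch: (batch, time, feature_dim) acoustic features
        inputs = torch.as_tensor(feature_batch, dtype=torch.float32, device=device)
        frame_outputs = tdnn_model(inputs)  # per-frame phoneme scores
    # Decode each utterance in the batch into text information
    return [decoder(out.cpu()) for out in frame_outputs]
```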
Step S103, calling an emotion recognition model, inputting the voice data into the emotion recognition model for emotion recognition, and obtaining a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data.
In an embodiment of the application, the emotion recognition model may include a first emotion recognition model and a second emotion recognition model. The first emotion recognition model is used for recognizing coarse-grained emotion categories, and the second emotion recognition model is used for recognizing fine-grained emotion categories.
It is noted that coarse grained mood categories may include positive mood, slightly negative mood, and strongly negative mood. The fine-grained emotions are specifically classified under the coarse-grained emotions.
Illustratively, fine-grained emotions corresponding to a positive emotion may include, but are not limited to: happy, optimistic, joyful, and the like; fine-grained emotions corresponding to a slight negative emotion may include, but are not limited to: anxiety, tension, sadness, complaints, blame, and the like; fine-grained emotions corresponding to a strong negative emotion may include, but are not limited to: abuse, anger, complaints, and the like.
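For illustration, the coarse-to-fine mapping listed above can be held in a simple lookup structure; the label strings below are just the examples from this embodiment, and the lists are explicitly "not limited to" these values.

```python
# Coarse-grained emotion categories and the fine-grained emotions listed for
# them in this embodiment (illustrative, not exhaustive).
EMOTION_TAXONOMY = {
    "positive emotion": ["happy", "optimistic", "joyful"],
    "slight negative emotion": ["anxiety", "tension", "sadness", "complaints", "blame"],
    "strong negative emotion": ["abuse", "anger", "complaints"],
}
```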
For example, the first emotion recognition model and the second emotion recognition model may include, but are not limited to, a convolutional neural network, a restricted Boltzmann machine, a recurrent neural network, or the like.
In some embodiments, invoking the emotion recognition model, inputting the voice data into the emotion recognition model for emotion recognition, and before obtaining the coarse-grained emotion category and the fine-grained emotion category corresponding to the voice data, may further include: and respectively carrying out iterative training on the first emotion recognition model and the second emotion recognition model until convergence is reached to obtain the trained first emotion recognition model and the trained second emotion recognition model.
It can be understood that, because the emotion recognition model includes the first emotion recognition model and the second emotion recognition model, during training, the first emotion recognition model and the second emotion recognition model need to be trained respectively until convergence, so as to obtain a trained emotion recognition model.
It is emphasized that, to further ensure the privacy and security of the trained emotion recognition model, the trained emotion recognition model may be stored in a node of a blockchain. When the trained emotion recognition model needs to be used, it can be obtained from the nodes of the blockchain.
Referring to fig. 4, fig. 4 is a schematic flowchart of a sub-step of training an emotion recognition model according to an embodiment of the present application, and specifically includes the following steps S1031 to S1034.
Step S1031, obtaining first training data, wherein the first training data comprises a preset number of text data, a preset number of voice data and labeled coarse-grained emotion category labels.
The ratio of the amount of text data to the amount of speech data in the first training data is not limited. Illustratively, the ratio of the amount of text data to the amount of speech data may be 1:1 or 1:2.
Illustratively, coarse-grained emotion category labeling can be performed on text data and voice data respectively to obtain labeled coarse-grained emotion category labels. Wherein the labeled coarse-grained emotion category labels may include a positive emotion, a slightly negative emotion, and a strongly negative emotion.
By acquiring first training data containing text data and voice data, the first emotion recognition model can learn text features and voice features, so that coarse-grained emotion categories predicted and output by the subsequent first emotion recognition model comprise the text features and the voice features, and the method can be applied to an intention recognition process or a voice synthesis process.
Step S1032, second training data are obtained, wherein the second training data comprise a preset number of text data, a preset number of voice data and labeled fine-grained emotion category labels.
The ratio of the amount of text data to the amount of speech data in the second training data is likewise not limited. Illustratively, the ratio of the amount of text data to the amount of speech data may be 2:1 or 1:2. The text data and the voice data in the second training data are text data and voice data corresponding to a particular coarse-grained emotion category. For example, the text data and the voice data in the second training data may be text data and voice data corresponding to a positive emotion, to a slight negative emotion, or to a strong negative emotion.
Illustratively, fine-grained emotion category labeling can be performed on the text data and the voice data respectively to obtain labeled fine-grained emotion category labels. For example, when the coarse-grained emotion category corresponding to the second training data is a positive emotion, the labeled fine-grained emotion category labels may include happy, optimistic, joyful, and the like; when the coarse-grained emotion category is a slight negative emotion, the labeled fine-grained emotion category labels may include anxiety, tension, sadness, complaints, blame, and the like; when the coarse-grained emotion category is a strong negative emotion, the labeled fine-grained emotion category labels may include abuse, anger, complaints, and the like.
By acquiring second training data containing text data and voice data, the second emotion recognition model can learn text features and voice features, and fine-grained emotion categories output by prediction of the second emotion recognition model subsequently comprise the text features and the voice features, so that the second emotion recognition model can be applied to an intention recognition process or a voice synthesis process.
Step S1033, inputting the first training data into the first emotion recognition model for iterative training until the first emotion recognition model converges.
In some embodiments, inputting the first training data into the first emotion recognition model for iterative training until the first emotion recognition model converges may include: determining the training sample data for each training round from the text data, the voice data and the coarse-grained emotion category labels; inputting the sample data of the current training round into the first emotion recognition model for emotion recognition training to obtain an emotion prediction result; determining a loss function value according to the coarse-grained emotion category label corresponding to the current round's training sample data and the emotion prediction result; and if the loss function value is larger than a preset loss value threshold, adjusting the parameters of the first emotion recognition model and carrying out the next round of training, until the obtained loss function value is smaller than or equal to the loss value threshold, at which point training is finished and the trained first emotion recognition model is obtained.
For example, the preset loss value threshold may be set according to actual conditions, and the specific value is not limited herein.
Illustratively, the loss function value may be calculated using a loss function such as a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a cross entropy loss function, a square loss function, or an exponential loss function.
For example, a convergence algorithm such as the gradient descent algorithm, Newton's method, the conjugate gradient method, or the Gauss-Newton method may be used to adjust the parameters of the first emotion recognition model.
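A minimal PyTorch-style sketch of the round-by-round training described above, assuming a dataloader of labeled speech features; the cross-entropy loss, SGD optimizer, threshold and learning rate are illustrative choices drawn from the lists above, not values fixed by this application.

```python
import torch
import torch.nn as nn

def train_until_converged(model, dataloader, loss_threshold=0.05, lr=1e-3, max_rounds=100):
    """Train round by round and stop once the loss falls to or below a preset threshold."""
    criterion = nn.CrossEntropyLoss()                       # one of the listed loss functions
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent algorithm

    for _ in range(max_rounds):
        for speech_batch, emotion_labels in dataloader:
            logits = model(speech_batch)                    # emotion prediction result
            loss = criterion(logits, emotion_labels)        # compare with labeled categories
            if loss.item() <= loss_threshold:               # converged: finish training
                return model
            optimizer.zero_grad()
            loss.backward()                                 # adjust the model parameters
            optimizer.step()
    return model
```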
Step S1034, inputting the second training data into the second emotion recognition model for iterative training until the second emotion recognition model is converged.
It should be noted that the training process of the second emotion recognition model is similar to the training process of the first emotion recognition model, and the specific process is not described herein again.
In the embodiment of the application, after iterative training is performed on the first emotion recognition model and the second emotion recognition model until convergence, the speech data can be respectively input into the trained first emotion recognition model and the trained second emotion recognition model for emotion recognition.
In some embodiments, invoking the emotion recognition model, inputting the voice data into the emotion recognition model for emotion recognition, and obtaining a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data may include: inputting the voice data into a first emotion recognition model for emotion prediction to obtain a coarse-grained emotion category corresponding to the voice data; and inputting the voice data into a second emotion recognition model corresponding to the coarse-grained emotion category for emotion prediction to obtain a fine-grained emotion category corresponding to the voice data.
For example, the voice data is input into the trained first emotion recognition model for emotion prediction, and the obtained coarse-grained emotion category may be "positive emotion"; the voice data is then input into the second emotion recognition model corresponding to "positive emotion" for emotion prediction, and the obtained fine-grained emotion category may be "happy".
Illustratively, the voice data is input into the trained first emotion recognition model for emotion prediction, and the obtained coarse-grained emotion category may be "strong negative emotion"; the voice data is then input into the second emotion recognition model corresponding to "strong negative emotion" for emotion prediction, and the obtained fine-grained emotion category may be "abuse".
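A minimal sketch of this cascaded prediction, assuming trained classifiers with a simple predict() interface (the interface is hypothetical):

```python
def predict_emotions(voice_features, coarse_model, fine_models):
    """coarse_model predicts the coarse-grained category; fine_models maps each
    coarse category to its own trained second emotion recognition model."""
    coarse_category = coarse_model.predict(voice_features)  # e.g. "strong negative emotion"
    fine_model = fine_models[coarse_category]                # pick the matching second model
    fine_category = fine_model.predict(voice_features)       # e.g. "abuse"
    return coarse_category, fine_category
```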
Parameter updating is carried out on the first emotion recognition model and the second emotion recognition model according to a preset loss function and a convergence algorithm, so that the first emotion recognition model and the second emotion recognition model can be converged quickly, and further training efficiency and accuracy of the emotion recognition models are improved.
And step S104, determining response text information corresponding to the voice data according to the text information and the coarse-grained emotion type.
It should be noted that, in the embodiment of the present application, intention recognition processing may be performed according to the text information and the coarse-grained emotion category to obtain the response text information corresponding to the voice data. The intention recognition processing may include intention recognition and dialog script matching.
Referring to fig. 5, fig. 5 is a schematic flowchart illustrating a sub-step of determining the response text information corresponding to the voice data according to the text information and the coarse-grained emotion category in step S104, and the specific step S104 may include the following steps S1041 to S1044.
Step S1041, performing word segmentation processing on the text information to obtain a plurality of word groups corresponding to the text information.
In some embodiments, word segmentation processing may be performed on the text information based on a preset word segmentation model to obtain a plurality of word groups corresponding to the text information.
For example, the preset word segmentation model may include a BI-LSTM-CRF neural network model, but may also be another neural network model, which is not limited herein. It should be noted that the BI-LSTM-CRF neural network model combines a BI-LSTM (bidirectional long short-term memory) network with a CRF (Conditional Random Field) layer. The BI-LSTM-CRF neural network model can use not only past input features and sentence-level label information but also future input features, and by taking the influence of long-distance context information on Chinese word segmentation into account it can ensure a higher accuracy of Chinese word segmentation.
Illustratively, for the text information "do not call me any more", the resulting word groups may include ["do not", "again", "give me", "make a call"].
Step S1042, inputting each word group into a word vector model for vectorization to obtain a word vector matrix corresponding to the text information.
Illustratively, the word vector model may include, but is not limited to, the Word2Vec (word to vector) model, the GloVe (Global Vectors for Word Representation) model, and the BERT (Bidirectional Encoder Representations from Transformers) model, among others.
For example, each phrase may be input into the BERT model for vectorization, so as to obtain a word vector matrix corresponding to the text information.
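As a sketch of the BERT option, the Hugging Face Transformers library can turn the segmented word groups into a word vector matrix; the "bert-base-chinese" checkpoint is an assumption, since the application only names BERT as one possible word vector model.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

# The segmented word groups for "do not call me any more"
word_groups = ["不要", "再", "给我", "打电话"]
inputs = tokenizer(word_groups, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

# One row per BERT token (including the [CLS]/[SEP] markers); together the
# rows form the word vector matrix for the text information.
word_vector_matrix = outputs.last_hidden_state[0]
print(word_vector_matrix.shape)  # (sequence_length, hidden_size)
```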
And S1043, inputting the word vector matrix into an intention recognition model for intention recognition, and obtaining intention information corresponding to the text information.
Illustratively, the intention recognition model is a trained intention recognition model. In the embodiment of the present application, the intention recognition model may include, but is not limited to, a convolutional neural network, a HAN (hierarchical attention network) model, a recurrent neural network, and the like.
In some embodiments, the initial intent recognition model may be trained to converge according to the training text word vectors and the intent labels, resulting in a trained intent recognition model. The specific training process is not limited herein.
Illustratively, the word vector matrix corresponding to the word groups ["do not", "again", "give me", "make a call"] is input into the trained intention recognition model for intention recognition, and the recognized intention information is obtained. For example, the recognized intention information is "call refusal".
The word vector matrix is input into the trained intention recognition model for intention recognition, so that the prediction accuracy of intention information can be improved.
And S1044, performing dialog script matching according to the intention information and the coarse-grained emotion category to obtain the response text information.
In some embodiments, performing dialog script matching according to the intention information and the coarse-grained emotion category to obtain the response text information may include: determining a target script database according to the coarse-grained emotion category based on a preset correspondence between emotion categories and script databases; and matching the dialog script corresponding to the intention information in the target script database to obtain the response text information.
In the embodiment of the application, dialog scripts can be divided in advance into different script databases according to emotion category. Illustratively, the scripts for the emotion category "positive emotion" are placed in script database A, the scripts for the emotion category "slight negative emotion" are placed in script database B, and the scripts for the emotion category "strong negative emotion" are placed in script database C; each script database is then associated with the corresponding emotion category label.
For example, if the coarse-grained emotion category is "slight negative emotion", the target script database may be determined to be script database B based on the coarse-grained emotion category "slight negative emotion".
In some embodiments, the corresponding dialog script may be matched in the target script database according to keywords in the intention information.
For example, when the intention information is "call refusal" and the target script database is script database A, the corresponding script can be matched in script database A according to the keywords "refusal" and "call answering" in the intention information, and the obtained response text information may be: "OK, sorry to have disturbed you."
Similarly, when the intention information is "call refusal" and the target script database is script database B, the corresponding script can be matched in script database B according to the keywords "refusal" and "call answering" in the intention information, and the obtained response text information may be: "Sorry for the bad experience this has caused you."
For another example, when the intention information is "call refusal" and the target script database is script database C, the corresponding script can be matched in script database C according to the keywords "refusal" and "call answering" in the intention information, and the obtained response text information may be: "Please forgive us; we will not disturb you again."
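The examples above can be summarized in a small sketch that first selects the target script database by coarse-grained emotion category and then matches by keywords; the in-memory dictionaries and reply strings are illustrative stand-ins for a real script database.

```python
# The databases, keywords and reply strings below are the illustrative values
# from the examples above; a production system would query a real script database.
SCRIPT_DATABASES = {
    "positive emotion": {            # script database A
        ("refusal", "call answering"): "OK, sorry to have disturbed you.",
    },
    "slight negative emotion": {     # script database B
        ("refusal", "call answering"): "Sorry for the bad experience this has caused you.",
    },
    "strong negative emotion": {     # script database C
        ("refusal", "call answering"): "Please forgive us; we will not disturb you again.",
    },
}

def match_response_text(intention_keywords, coarse_emotion):
    database = SCRIPT_DATABASES[coarse_emotion]              # target script database
    for keywords, script in database.items():
        if all(k in intention_keywords for k in keywords):   # keyword matching
            return script
    return None

print(match_response_text(["refusal", "call answering"], "slight negative emotion"))
```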
By performing intention recognition according to the text information and the coarse-grained emotion categories, the coarse-grained emotion categories can be fused into the intention recognition process, and response text information can be generated according to the emotion characteristics of the user, so that the response text information can reflect the emotion state of the user.
And S105, carrying out voice synthesis according to the fine-grained emotion category and the response text information to obtain a response voice signal corresponding to the voice data.
It should be noted that, by performing speech synthesis according to the fine-grained emotion categories and the response text information, the fine-grained emotion categories corresponding to the user can be fused into the speech synthesis, so that the response speech signal can reflect the real emotion state of the user.
Referring to fig. 6, fig. 6 is a schematic flowchart of the sub-steps of speech synthesis according to the fine-grained emotion classification and the response text message in step S105, and the specific step S105 may include the following steps S1051 to S1053.
And S1051, determining a target intonation type according to the fine-grained emotion category based on a preset corresponding relation between the emotion category and the intonation type.
For example, the emotion category may be a positive emotion such as happy, optimistic, or joyful; a slight negative emotion such as anxiety, tension, sadness, complaints, or blame; or a strong negative emotion such as abuse, anger, or complaints.
Illustratively, the intonation types include, but are not limited to, calm, mild soothing, and strong soothing, among others.
In the embodiment of the present application, the emotion category may be associated with the intonation type in advance, as shown in table 1.
TABLE 1
Emotion category                               Intonation type
Happy, optimistic, joyful                      Calm
Anxiety, tension, sadness, complaints, blame   Mild soothing
Abuse, anger, complaints                       Strong soothing
And step S1052, performing sound synthesis on the response text information to obtain a corresponding sound spectrogram.
Illustratively, the response text information may be voice-synthesized according to the target intonation through a TTS (Text To Speech) model. It should be noted that the TTS model includes modules such as a text analysis module, an acoustic model module, and an audio synthesis module, and can convert a piece of text into a speech signal.
Illustratively, the TTS model may vectorize the response text information, apply a Fourier transform, and pass the result through a Mel filter bank to obtain a Mel spectrum, i.e., the sound spectrogram.
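Assuming the acoustic model has already produced a waveform, the Mel spectrum step can be sketched with librosa as follows; the sample rate, frame sizes and 80 Mel bands are common defaults rather than values from this application.

```python
import numpy as np
import librosa

def mel_spectrogram(waveform, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Short-time Fourier transform followed by a Mel filter bank."""
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(mel + 1e-6)  # log-Mel "sound spectrogram"
```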
Step S1053, adjusting the tone of the sound spectrogram according to the target tone type, and determining the adjusted sound spectrogram as the response voice signal.
In the embodiment of the application, the intonation of the sound spectrogram can be adjusted according to the target intonation type through a preset automatic script, and the adjusted sound spectrogram is determined as a response voice signal.
Illustratively, the fundamental frequency in the sound spectrogram can be adjusted according to the target intonation type. The fundamental frequency is the frequency of the fundamental tone and determines the pitch of the whole tone. Therefore, by adjusting the fundamental frequency, the adjusted sound spectrogram takes on a different intonation and tone.
For example, a piece of fundamental frequency data may be added to the sound spectrogram at preset time intervals. The preset interval can be 10 ms or another duration; the fundamental frequency data may lie in the frequency band of 0 kHz to 4 kHz.
For example, when the target intonation type is "calm", a piece of fundamental frequency data in a smaller frequency band may be added to the sound spectrogram; when the target intonation type is "mild soothing" or "strong soothing", a piece of fundamental frequency data in a larger frequency band may be added to the sound spectrogram.
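A minimal sketch of nudging a fundamental-frequency contour according to the target intonation type; the per-type offsets are assumptions, since the application only states that "calm" uses a smaller band and the soothing types a larger one, within 0 kHz to 4 kHz.

```python
import numpy as np

# Illustrative fundamental-frequency offsets (Hz) per target intonation type;
# these exact numbers are assumptions, not values from this application.
F0_OFFSETS_HZ = {"calm": 0.0, "mild soothing": 15.0, "strong soothing": 30.0}

def adjust_intonation(f0_contour, target_intonation):
    """Shift an F0 contour (e.g. one value every 10 ms) by the offset for the
    target intonation type, leaving unvoiced frames (F0 == 0) untouched and
    clipping to the 0 kHz to 4 kHz band mentioned above."""
    f0 = np.asarray(f0_contour, dtype=np.float64)
    adjusted = f0.copy()
    voiced = f0 > 0
    adjusted[voiced] = np.clip(f0[voiced] + F0_OFFSETS_HZ[target_intonation], 0.0, 4000.0)
    return adjusted
```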
In some embodiments, after the response voice signal corresponding to the voice data is obtained, the response voice signal may be broadcasted.
Illustratively, the response voice signal can be broadcasted on a broadcast interface of the intelligent voice robot. In addition, after the response voice signal is broadcasted, the user call interface information can be monitored in real time and used for receiving the real-time question and answer voice signal of the user.
By adjusting the intonation of the sound spectrogram according to the target intonation type, the emotional characteristics of the user are integrated into the response voice signal, so that the broadcast intonation and timbre are continually adjusted according to the emotional state of the user during voice broadcasting. The output response voice signal is therefore more natural, more emotional and more realistic, which improves the user experience.
According to the voice processing method provided by this embodiment, extracting the useful voice signal from the voice signal based on the preset voice detection model improves the accuracy of subsequent speech recognition and emotion category recognition; inputting the voice data into the speech recognition model for speech recognition based on the GPU cluster improves the accuracy and efficiency of speech recognition; acquiring first training data containing text data and voice data allows the first emotion recognition model to learn both text features and voice features, so that the coarse-grained emotion category it later predicts reflects both and can be applied to the intention recognition process or the speech synthesis process; updating the parameters of the first emotion recognition model and the second emotion recognition model according to the preset loss function and convergence algorithm allows both models to converge quickly, improving the training efficiency and accuracy of the emotion recognition model; performing intention recognition according to the text information and the coarse-grained emotion category fuses the coarse-grained emotion category into the intention recognition process, so that the response text information is generated according to the emotional characteristics of the user and reflects the user's emotional state; performing speech synthesis according to the fine-grained emotion category and the response text information fuses the fine-grained emotion category corresponding to the user into the speech synthesis, so that the response voice signal reflects the user's real emotional state; and adjusting the intonation of the sound spectrogram according to the target intonation type integrates the emotional characteristics of the user into the response voice signal, so that the broadcast intonation and timbre are continually adjusted according to the emotional state of the user during voice broadcasting, making the output response voice signal more natural, more emotional and more realistic and improving the user experience.
Referring to fig. 7, fig. 7 is a schematic block diagram of a speech processing apparatus 1000 according to an embodiment of the present application, the speech processing apparatus being configured to perform the foregoing speech processing method. The voice processing device can be configured in a server or a terminal.
As shown in fig. 7, the speech processing apparatus 1000 includes: a voice data acquisition module 1001, a voice recognition module 1002, a emotion recognition module 1003, a response text generation module 1004, and a voice synthesis module 1005.
A voice data obtaining module 1001, configured to obtain voice data to be processed.
The voice recognition module 1002 is configured to perform voice recognition on the voice data to obtain text information corresponding to the voice data.
And the emotion recognition module 1003 is configured to invoke an emotion recognition model, and input the voice data into the emotion recognition model to perform emotion recognition, so as to obtain a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data.
And the response text generation module 1004 is configured to determine, according to the text information and the coarse-grained emotion category, response text information corresponding to the voice data.
A speech synthesis module 1005, configured to perform speech synthesis according to the fine-grained emotion category and the response text information, and obtain a response speech signal corresponding to the speech data.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the modules described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 8, the computer device includes a processor and a memory connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor, causes the processor to perform any of the speech processing methods.
It should be understood that the processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring voice data to be processed; performing voice recognition on the voice data to obtain text information corresponding to the voice data; calling an emotion recognition model, inputting the voice data into the emotion recognition model for emotion recognition, and obtaining a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data; determining response text information corresponding to the voice data according to the text information and the coarse-grained emotion types; and performing voice synthesis according to the fine-grained emotion category and the response text information to obtain a response voice signal corresponding to the voice data.
In one embodiment, the emotion recognition model comprises a first emotion recognition model and a second emotion recognition model; before the processor calls an emotion recognition model and inputs the voice data into the emotion recognition model for emotion recognition to obtain a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data, the processor is further configured to:
and respectively carrying out iterative training on the first emotion recognition model and the second emotion recognition model until convergence, so as to obtain the trained first emotion recognition model and the trained second emotion recognition model.
In one embodiment, the processor is configured to implement, when invoking an emotion recognition model, inputting the voice data into the emotion recognition model for emotion recognition, and obtaining a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data:
inputting the voice data into the first emotion recognition model for emotion prediction to obtain the coarse-grained emotion category corresponding to the voice data; and inputting the voice data into the second emotion recognition model corresponding to the coarse-grained emotion category for emotion prediction to obtain the fine-grained emotion category corresponding to the voice data.
In one embodiment, the processor, when implementing iterative training of the first emotion recognition model and the second emotion recognition model to convergence respectively to obtain the trained first emotion recognition model and the trained second emotion recognition model, is configured to implement:
acquiring first training data, wherein the first training data comprises a preset number of text data, a preset number of voice data and labeled coarse-grained emotion category labels; acquiring second training data, wherein the second training data comprises a preset number of text data, a preset number of voice data and labeled fine-grained emotion category labels; inputting the first training data into the first emotion recognition model for iterative training until the first emotion recognition model converges; and inputting the second training data into the second emotion recognition model for iterative training until the second emotion recognition model is converged.
In one embodiment, the processor, when implementing determining the response text information corresponding to the voice data according to the text information and the coarse-grained emotion category, is configured to implement:
performing word segmentation processing on the text information to obtain a plurality of word groups corresponding to the text information; inputting each word group into a word vector model for vectorization to obtain a word vector matrix corresponding to the text information; inputting the word vector matrix into an intention recognition model for intention recognition to obtain intention information corresponding to the text information; and performing dialog script matching according to the intention information and the coarse-grained emotion category to obtain the response text information.
In one embodiment, when implementing dialog script matching according to the intention information and the coarse-grained emotion category to obtain the response text information, the processor is configured to implement:
determining a target script database according to the coarse-grained emotion category based on a preset correspondence between emotion categories and script databases; and matching the dialog script corresponding to the intention information in the target script database to obtain the response text information.
In one embodiment, when implementing speech synthesis according to the fine-grained emotion category and the response text information to obtain a response speech signal corresponding to the speech data, the processor is configured to implement:
determining a target intonation type according to the fine-grained emotion category based on a preset corresponding relation between the emotion category and the intonation type; carrying out sound synthesis on the response text information to obtain a corresponding sound spectrogram; and adjusting the tone of the sound spectrogram according to the target tone type, and determining the adjusted sound spectrogram as the response voice signal.
In one embodiment, the processor, when implementing acquiring the voice data to be processed, is configured to implement:
acquiring a voice signal acquired by a voice acquisition device; extracting a useful voice signal in the voice signal based on a voice detection model preset in a block chain; and determining the voice data according to the useful voice signal.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and the processor executes the program instructions to implement any one of the voice processing methods provided in the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital Card (SD Card), a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The blockchain referred to by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated using cryptographic methods, each data block containing information on a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech processing, comprising:
acquiring voice data to be processed;
performing voice recognition on the voice data to obtain text information corresponding to the voice data;
calling an emotion recognition model, inputting the voice data into the emotion recognition model for emotion recognition, and obtaining a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data;
determining response text information corresponding to the voice data according to the text information and the coarse-grained emotion types;
and performing voice synthesis according to the fine-grained emotion category and the response text information to obtain a response voice signal corresponding to the voice data.
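Read end to end, claim 1 amounts to a five-step pipeline. The sketch below composes it from hypothetical callables for each step (ASR, joint coarse/fine emotion recognition, response generation, and emotion-conditioned synthesis); the component names are illustrative only.

    def handle_turn(voice_data, asr, emotion_model, respond, synthesize):
        text = asr(voice_data)                    # speech recognition -> text information
        coarse, fine = emotion_model(voice_data)  # coarse- and fine-grained emotion categories
        response_text = respond(text, coarse)     # response text from intent + coarse emotion
        return synthesize(response_text, fine)    # response voice signal conditioned on fine emotion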
2. The speech processing method of claim 1, wherein the emotion recognition model comprises a first emotion recognition model and a second emotion recognition model; and before the calling of the emotion recognition model and the inputting of the voice data into the emotion recognition model for emotion recognition to obtain the coarse-grained emotion category and the fine-grained emotion category corresponding to the voice data, the method further comprises:
performing iterative training on the first emotion recognition model and the second emotion recognition model respectively until convergence, so as to obtain the trained first emotion recognition model and the trained second emotion recognition model;
the calling of the emotion recognition model, inputting the voice data into the emotion recognition model for emotion recognition, and obtaining a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data, includes:
inputting the voice data into the first emotion recognition model for emotion prediction to obtain the coarse-grained emotion category corresponding to the voice data;
and inputting the voice data into the second emotion recognition model corresponding to the coarse-grained emotion category for emotion prediction to obtain the fine-grained emotion category corresponding to the voice data.
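A minimal sketch of the two-stage prediction in claim 2, assuming a first-stage classifier and one second-stage classifier per coarse-grained category, each exposing a predict() method (the model family is not fixed by the claim):

    def recognize_emotion(voice_features, first_model, second_models):
        # First stage: coarse-grained emotion category, e.g. "negative".
        coarse = first_model.predict(voice_features)
        # Second stage: route to the model trained for that coarse category
        # to obtain the fine-grained emotion category, e.g. "slightly annoyed".
        fine = second_models[coarse].predict(voice_features)
        return coarse, fine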
3. The speech processing method according to claim 2, wherein the performing iterative training on the first emotion recognition model and the second emotion recognition model respectively until convergence to obtain the trained first emotion recognition model and the trained second emotion recognition model comprises:
acquiring first training data, wherein the first training data comprises a preset number of text data, a preset number of voice data and labeled coarse-grained emotion category labels;
acquiring second training data, wherein the second training data comprises a preset number of text data, a preset number of voice data and labeled fine-grained emotion category labels;
inputting the first training data into the first emotion recognition model for iterative training until the first emotion recognition model converges;
and inputting the second training data into the second emotion recognition model for iterative training until the second emotion recognition model is converged.
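An illustrative training sketch for claim 3, assuming estimator objects with a fit() interface and pre-extracted feature matrices per coarse category; the claim itself only requires iterative training of each model until convergence and does not fix the model family or features.

    def train_emotion_models(first_model, second_models,
                             coarse_features, coarse_labels,
                             fine_features_by_category, fine_labels_by_category):
        # Train the first-stage model on coarse-grained emotion labels.
        first_model.fit(coarse_features, coarse_labels)
        # Train one second-stage model per coarse category on fine-grained labels.
        for category, model in second_models.items():
            model.fit(fine_features_by_category[category],
                      fine_labels_by_category[category])
        return first_model, second_models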
4. The speech processing method according to claim 1, wherein the determining the response text information corresponding to the voice data according to the text information and the coarse-grained emotion category comprises:
performing word segmentation processing on the text information to obtain a plurality of word groups corresponding to the text information;
inputting each word group into a word vector model for vectorization to obtain a word vector matrix corresponding to the text information;
inputting the word vector matrix into an intention recognition model for intention recognition to obtain intention information corresponding to the text information;
and performing talk-script matching on the intention information and the coarse-grained emotion category to obtain the response text information.
5. The speech processing method according to claim 4, wherein the performing talk-script matching on the intention information and the coarse-grained emotion category to obtain the response text information comprises:
determining a target talk-script database according to the coarse-grained emotion category based on a preset correspondence between emotion categories and talk-script databases;
and matching, in the target talk-script database, the talk-script information corresponding to the intention information to obtain the response text information.
6. The speech processing method according to claim 1, wherein the performing voice synthesis according to the fine-grained emotion category and the response text information to obtain a response voice signal corresponding to the voice data comprises:
determining a target intonation type according to the fine-grained emotion category based on a preset correspondence between emotion categories and intonation types;
carrying out sound synthesis on the response text information to obtain a corresponding sound spectrogram;
and adjusting the intonation of the sound spectrogram according to the target intonation type, and determining the adjusted sound spectrogram as the response voice signal.
7. The speech processing method according to any one of claims 1 to 6, wherein the acquiring the voice data to be processed comprises:
acquiring a voice signal acquired by a voice acquisition device;
extracting a useful voice signal from the voice signal based on a voice detection model preset in a blockchain;
and determining the voice data according to the useful voice signal.
8. A speech processing apparatus, comprising:
the voice data acquisition module is used for acquiring voice data to be processed;
the voice recognition module is used for carrying out voice recognition on the voice data to obtain text information corresponding to the voice data;
the emotion recognition module is used for calling an emotion recognition model, inputting the voice data into the emotion recognition model for emotion recognition, and obtaining a coarse-grained emotion category and a fine-grained emotion category corresponding to the voice data;
the response text generation module is used for determining response text information corresponding to the voice data according to the text information and the coarse-grained emotion categories;
and the voice synthesis module is used for carrying out voice synthesis according to the fine-grained emotion categories and the response text information to obtain response voice signals corresponding to the voice data.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory for storing a computer program;
the processor for executing the computer program and implementing the speech processing method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the speech processing method according to any one of claims 1 to 7.
CN202110217729.4A 2021-02-26 2021-02-26 Voice processing method, device, computer equipment and storage medium Pending CN112992147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110217729.4A CN112992147A (en) 2021-02-26 2021-02-26 Voice processing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112992147A true CN112992147A (en) 2021-06-18

Family

ID=76351077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110217729.4A Pending CN112992147A (en) 2021-02-26 2021-02-26 Voice processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112992147A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016090776A (en) * 2014-11-04 2016-05-23 トヨタ自動車株式会社 Response generation apparatus, response generation method, and program
CN104916282A (en) * 2015-03-27 2015-09-16 北京捷通华声语音技术有限公司 Speech synthesis method and apparatus
CN107307923A (en) * 2016-04-27 2017-11-03 北京航空航天大学 The artificial electronic larynx system and its fundamental frequency adjusting method of sounding fundamental frequency are adjusted using roller
CN106773923A (en) * 2016-11-30 2017-05-31 北京光年无限科技有限公司 The multi-modal affection data exchange method and device of object manipulator
CN108091321A (en) * 2017-11-06 2018-05-29 芋头科技(杭州)有限公司 A kind of phoneme synthesizing method
CN108197115A (en) * 2018-01-26 2018-06-22 上海智臻智能网络科技股份有限公司 Intelligent interactive method, device, computer equipment and computer readable storage medium
CN110309254A (en) * 2018-03-01 2019-10-08 富泰华工业(深圳)有限公司 Intelligent robot and man-machine interaction method
CN109284406A (en) * 2018-09-03 2019-01-29 四川长虹电器股份有限公司 Intension recognizing method based on difference Recognition with Recurrent Neural Network
CN111681645A (en) * 2019-02-25 2020-09-18 北京嘀嘀无限科技发展有限公司 Emotion recognition model training method, emotion recognition device and electronic equipment
CN110187760A (en) * 2019-05-14 2019-08-30 北京百度网讯科技有限公司 Intelligent interactive method and device
CN110188361A (en) * 2019-06-10 2019-08-30 北京智合大方科技有限公司 Speech intention recognition methods and device in conjunction with text, voice and emotional characteristics
CN111028827A (en) * 2019-12-10 2020-04-17 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111312245A (en) * 2020-02-18 2020-06-19 腾讯科技(深圳)有限公司 Voice response method, device and storage medium
CN112185389A (en) * 2020-09-22 2021-01-05 北京小米松果电子有限公司 Voice generation method and device, storage medium and electronic equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808619A (en) * 2021-08-13 2021-12-17 北京百度网讯科技有限公司 Voice emotion recognition method and device and electronic equipment
CN113808619B (en) * 2021-08-13 2023-10-20 北京百度网讯科技有限公司 Voice emotion recognition method and device and electronic equipment
CN115116475A (en) * 2022-06-13 2022-09-27 北京邮电大学 Voice depression automatic detection method and device based on time delay neural network
CN115116475B (en) * 2022-06-13 2024-02-02 北京邮电大学 Voice depression automatic detection method and device based on time delay neural network
CN115063155A (en) * 2022-06-25 2022-09-16 平安银行股份有限公司 Data labeling method and device, computer equipment and storage medium
CN115063155B (en) * 2022-06-25 2024-05-24 平安银行股份有限公司 Data labeling method, device, computer equipment and storage medium
CN116052646A (en) * 2023-03-06 2023-05-02 北京水滴科技集团有限公司 Speech recognition method, device, storage medium and computer equipment

Similar Documents

Publication Publication Date Title
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
CN111028827B (en) Interaction processing method, device, equipment and storage medium based on emotion recognition
CN108428447B (en) Voice intention recognition method and device
CN112992147A (en) Voice processing method, device, computer equipment and storage medium
WO2022105861A1 (en) Method and apparatus for recognizing voice, electronic device and medium
US20170148429A1 (en) Keyword detector and keyword detection method
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
Maghilnan et al. Sentiment analysis on speaker specific speech data
US9711167B2 (en) System and method for real-time speaker segmentation of audio interactions
Masumura et al. Online end-of-turn detection from speech based on stacked time-asynchronous sequential networks.
CN112949708B (en) Emotion recognition method, emotion recognition device, computer equipment and storage medium
US11270691B2 (en) Voice interaction system, its processing method, and program therefor
Gupta et al. Speech feature extraction and recognition using genetic algorithm
JP2024502946A (en) Punctuation and capitalization of speech recognition transcripts
Joshi et al. Speech emotion recognition: a review
Patni et al. Speech emotion recognition using MFCC, GFCC, chromagram and RMSE features
Trabelsi et al. Evaluation of the efficiency of state-of-the-art Speech Recognition engines
Gupta et al. Speech emotion recognition using SVM with thresholding fusion
KR102415519B1 (en) Computing Detection Device for AI Voice
Sakamoto et al. Stargan-vc+ asr: Stargan-based non-parallel voice conversion regularized by automatic speech recognition
CN112216270B (en) Speech phoneme recognition method and system, electronic equipment and storage medium
Singh et al. Speaker Recognition Assessment in a Continuous System for Speaker Identification
CN111640423A (en) Word boundary estimation method and device and electronic equipment
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination