CN116469420A - Speech emotion recognition method, device, equipment and medium - Google Patents

Speech emotion recognition method, device, equipment and medium

Info

Publication number
CN116469420A
Authority
CN
China
Prior art keywords: emotion recognition, emotion, voice, user, final
Prior art date
Legal status
Pending
Application number
CN202310498896.XA
Other languages
Chinese (zh)
Inventor
肖湘粤
张琪
王竟成
郑天文
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310498896.XA
Publication of CN116469420A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G07 - CHECKING-DEVICES
    • G07C - TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C13/00 - Voting apparatus

Abstract

The disclosure provides a speech emotion recognition method and relates to the field of artificial intelligence. The method comprises the following steps: extracting speech emotion features from a first speech signal of a user; inputting the speech emotion features into N emotion recognition models respectively to obtain N emotion recognition results, one output by each model, where N is greater than or equal to 2; determining, according to the current business scenario of the user, a target voting mechanism corresponding to that scenario, where M voting mechanisms corresponding one-to-one to M business scenarios are provided in advance and M is greater than or equal to 2; and determining a final emotion recognition result according to the target voting mechanism and the N emotion recognition results. The present disclosure also provides a speech emotion recognition apparatus, device, storage medium and program product.

Description

Speech emotion recognition method, device, equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to a speech emotion recognition method, apparatus, device, medium, and program product.
Background
Speech, as the primary attribute of language, plays a decisive supporting role in it: a speech signal carries not only the textual content the speaker wishes to express but also the emotional information behind it. Emotion is a phenomenon that combines human behavior, thinking and feeling, and speech emotion recognition refers to obtaining the corresponding emotional information from speech signals.
In the related art, speech signal analysis is generally performed based on a single deep learning algorithm to obtain emotion recognition results.
While implementing the inventive concept, the inventors found that analyzing the speech signal with a single deep learning algorithm suffers from low accuracy, which makes the emotion recognition result inaccurate.
Disclosure of Invention
In view of the foregoing, the present disclosure provides speech emotion recognition methods, apparatus, devices, media, and program products.
In one aspect of the embodiments of the present disclosure, there is provided a speech emotion recognition method, including: extracting speech emotion features from a first speech signal of a user; inputting the speech emotion features into N emotion recognition models respectively to obtain N emotion recognition results, one output by each model, where N is greater than or equal to 2; determining, according to the current business scenario of the user, a target voting mechanism corresponding to that scenario, where M voting mechanisms corresponding one-to-one to M business scenarios are provided in advance and M is greater than or equal to 2; and determining a final emotion recognition result according to the target voting mechanism and the N emotion recognition results.
According to an embodiment of the present disclosure, the speech emotion recognition method is applied to a remote telephone banking system, and before the speech emotion features are extracted from the first speech signal of the user, the method further includes: acquiring the first speech signal in response to the user talking to customer service through the remote telephone banking system.
According to an embodiment of the present disclosure, a filtering device connected to the customer service communication device is provided, the filtering device including a DSP chip, and extracting the speech emotion features from the first speech signal of the user includes: filtering the first speech signal with the DSP chip; and extracting Mel-frequency cepstral coefficients from the filtered first speech signal.
According to an embodiment of the disclosure, the M business scenarios include at least one financial business scenario and at least one non-financial business scenario, each of the M voting mechanisms includes a weight for each of the N emotion recognition models, and any voting mechanism differs from at least one other voting mechanism in at least one weight.
In accordance with an embodiment of the present disclosure, prior to determining the target voting mechanism, the method further comprises: recognizing first speaking content based on the first speech signal and/or recognizing second speaking content based on a second speech signal of at least one person other than the user; and determining the current business scenario according to the first speaking content and/or the second speaking content.
According to an embodiment of the disclosure, the target voting mechanism includes the weights of the N emotion recognition models, and the N emotion recognition results include N emotion recognition categories and their corresponding voice attribute values. Determining the final emotion recognition result according to the target voting mechanism and the N emotion recognition results includes: multiplying the weight of each of the N emotion recognition models by the voice attribute value of the corresponding emotion recognition result; summing the products to obtain a final voice attribute value; and determining the final emotion recognition result according to the final voice attribute value.
According to an embodiment of the present disclosure, determining the final emotion recognition result from the final voice attribute value includes: comparing the final voice attribute value with S emotion category standard ranges, wherein S is greater than or equal to 1; and determining the final emotion recognition result according to the comparison result.
According to an embodiment of the present disclosure, before obtaining the N emotion recognition results, the method further includes: determining M training sets in one-to-one correspondence with the M business scenarios, where each training set includes speech emotion feature samples and emotion labels for the corresponding business scenario; training the N emotion recognition models to be trained and the corresponding voting mechanisms with each training set, where each voting mechanism includes a learnable weight for each of the N emotion recognition models and the training includes adjusting the model parameters of the N emotion recognition models and the learnable weights; and obtaining the trained N emotion recognition models and the M voting mechanisms, each of N and M being greater than or equal to 2.
Another aspect of an embodiment of the present disclosure provides a speech emotion recognition device, including: a feature extraction module for extracting speech emotion features from a first speech signal of a user; an emotion recognition module for inputting the speech emotion features into N emotion recognition models respectively to obtain N emotion recognition results, one output by each model, where N is greater than or equal to 2; a mechanism determination module for determining, according to the current business scenario of the user, a target voting mechanism corresponding to that scenario, where M voting mechanisms corresponding one-to-one to M business scenarios are provided in advance and M is greater than or equal to 2; and a final recognition module for determining a final emotion recognition result according to the target voting mechanism and the N emotion recognition results.
The apparatus comprises means for performing the steps of the method as claimed in any one of the preceding claims, respectively.
Another aspect of an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method as described above.
Another aspect of the disclosed embodiments also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the method as described above.
Another aspect of the disclosed embodiments also provides a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
One or more of the above embodiments have the following advantages. Compared with analyzing the speech signal with a single deep learning algorithm, the method uses a plurality of emotion recognition models to output a plurality of emotion recognition results and provides a voting mechanism matched to the specific business scenario. The target voting mechanism corresponding to the user's current business scenario can be determined and used to obtain the final emotion recognition result, so that the plurality of emotion recognition models are fused more effectively; compared with speech emotion analysis based on a single classification algorithm, the method achieves higher accuracy and robustness.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a speech emotion recognition method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a speech emotion recognition method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of several emotion recognition results, according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of extracting speech emotion features according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of wiener filtering according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a flow chart of extracting Mel cepstral coefficients according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of determining a current business scenario according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flowchart of determining a final emotion recognition result, according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a flow diagram of a pre-obtained voting mechanism in accordance with an embodiment of the present disclosure;
FIG. 10 schematically illustrates a flow chart of a method of speech emotion recognition according to another embodiment of the present disclosure;
FIG. 11 schematically illustrates a block diagram of a speech emotion recognition device in accordance with an embodiment of the present disclosure; and
FIG. 12 schematically illustrates a block diagram of an electronic device adapted to implement a speech emotion recognition method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression such as "at least one of A, B and C" is used, it should generally be interpreted in the sense commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together).
In current banking practice, a significant proportion of users transact business by remote telephone voice. For example, by recognizing the emotional characteristics reflected in a user's voice during a call, a bank account manager can better improve the quality of service for the user and increase the success rate of marketing financial products. When a user is seeking emergency help, abnormal behavior can be detected in advance so that staff can intervene and follow up immediately. However, in the related art, a single deep learning algorithm is used to analyze the speech signal, which suffers from low accuracy and makes the emotion recognition result inaccurate.
Some embodiments of the present disclosure provide a speech emotion recognition method that, instead of analyzing the speech signal with a single deep learning algorithm, uses a plurality of emotion recognition models to output a plurality of emotion recognition results and provides a voting mechanism matched to the specific business scenario. The target voting mechanism corresponding to the user's current business scenario can be determined and used to obtain the final emotion recognition result, so that the plurality of emotion recognition models are fused more effectively; compared with speech emotion analysis based on a single classification algorithm, the method achieves higher accuracy and robustness.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information (such as voice signals) of the user is obtained or collected. The related processes of collecting, storing, using, processing, transmitting, providing, disclosing and applying personal information of the user all accord with the regulations of related laws and regulations, necessary security measures are taken, and the public order is not violated.
Fig. 1 schematically illustrates an application scenario diagram of a speech emotion recognition method according to an embodiment of the present disclosure. It should be noted that fig. 1 is merely an example of a scenario in which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a user 110, a network 120, and a customer service system 130. Network 120 is the medium used to provide communications or communication links between users 110 and customer service system 130. Network 120 may include various connection types such as base stations, wired, wireless communication links, or fiber optic cables, among others.
The user 110 may interact with the customer service system 130 via the network 120 using a communication device to obtain assistance from customer service personnel. The customer service system 130 may include, for example, a plurality of customer service personnel 131, 132 and 133. Specifically, the user 110 may hold a voice call with one of the customer service personnel, who also uses a communication device for the call. The user 110 or the customer service personnel may place the call by dialing on a mobile phone, or through an application with voice capability (such as WeChat or QQ).
It should be noted that the speech emotion recognition provided in the embodiments of the present disclosure may generally be performed by a server on which the customer service system 130 is deployed, or by the communication device of each customer service agent, where the communication device may be any electronic device with a call function, including but not limited to a smartphone, a tablet computer, a laptop computer, a desktop computer, and so on.
It should be understood that the number of users or customer service agents in FIG. 1 is merely illustrative. Multiple users can access the customer service system at the same time and communicate with different agents respectively. The customer service scenario shown in FIG. 1 can be used, for example, in user services such as banking, hotels, ride hailing, ticket booking, telecom services or other after-sales scenarios, whether as a real-time conversation between a user and a customer service agent or as an intelligent customer service system that receives only the user's voice without a human agent. The above is by way of example only and not by way of limitation; the method may also be used in other scenarios where real-time or asynchronous emotion recognition is performed on a speech signal.
The speech emotion recognition method of the embodiment of the present disclosure will be described in detail below by way of fig. 2 to 10 based on the scenario described in fig. 1.
Fig. 2 schematically illustrates a flowchart of a speech emotion recognition method according to an embodiment of the present disclosure.
As shown in fig. 2, the speech emotion recognition method of this embodiment includes operations S210 to S240.
In operation S210, a speech emotion feature is extracted based on a first speech signal of a user.
In some embodiments, the speech emotion recognition method may be applied to a remote telephone banking system, and the first speech signal may be collected in response to the user talking with customer service through the remote telephone banking system, before the speech emotion features are extracted from the first speech signal. For example, telephone audio collection is implemented, the audio signals of the two parties are separated, and the user's telephone audio is collected in real time during the customer service call.
Illustratively, the speech emotion features may include one or more of frame energy, frame intensity, critical band spectrum, cepstral coefficients, auditory spectrum, linear prediction coefficients, fundamental frequency, and zero crossing rate, among others. These features may effectively represent various mood-related changes in the speech signal.
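By way of illustration only, the following Python sketch shows how a few of these features might be computed with the open-source librosa library; the sampling rate, frame parameters and the synthetic signal standing in for the first speech signal are assumptions, not part of the disclosed embodiments.

    import librosa
    import numpy as np

    sr = 16000
    signal = 0.1 * np.random.default_rng(0).standard_normal(3 * sr)  # stand-in for the first speech signal
    frame_length, hop_length = 400, 160                               # 25 ms frames, 10 ms hop (assumed)

    energy = librosa.feature.rms(y=signal, frame_length=frame_length, hop_length=hop_length)
    zcr = librosa.feature.zero_crossing_rate(signal, frame_length=frame_length, hop_length=hop_length)
    f0 = librosa.yin(signal, fmin=60, fmax=400, sr=sr, hop_length=hop_length)       # fundamental frequency
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12, hop_length=hop_length)  # cepstral coefficients

    n = min(energy.shape[1], zcr.shape[1], len(f0), mfcc.shape[1])
    features = np.vstack([energy[:, :n], zcr[:, :n], f0[np.newaxis, :n], mfcc[:, :n]])
    print(features.shape)  # (feature_dim, num_frames)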
In operation S220, the speech emotion features are input into the N emotion recognition models respectively to obtain N emotion recognition results, one output by each model, where N is greater than or equal to 2.
Illustratively, the N emotion recognition models include, for example, support vector machine classifiers, decision tree classifiers, neural network classifiers, and the like. Each emotion recognition model can be trained in advance; specifically, each model is trained on the feature sets of a large number of speech signal samples to recognize speech emotion types over a given set of emotion categories.
When the speech emotion features are input into an emotion recognition model, the model predicts and outputs an emotion category as its recognition result based on its pre-trained parameters. Each emotion recognition model therefore outputs one emotion category, yielding N emotion recognition results, any of which may be the same as or different from the others.
The emotion categories may include, for example, happy, urgent, restless, sad, etc., and those skilled in the art may set the category and number of emotion categories according to actual needs.
In operation S230, a target voting mechanism corresponding to the current business scenario of the user is determined, where M voting mechanisms corresponding one-to-one to M business scenarios are provided in advance and M is greater than or equal to 2.
In some embodiments, the M business scenarios include at least one financial business scenario and at least one non-financial business scenario, each of the M voting mechanisms includes a weight for each of the N emotion recognition models, and any voting mechanism differs from at least one other voting mechanism in at least one weight.
Illustratively, the at least one financial business scenario may include deposit, withdrawal, transfer, loan, remittance, financial management or payment scenarios, among others. The at least one non-financial business scenario may include after-sales service, consultation, complaint, reservation, registration or refund scenarios, among others.
Because users' demands and the service content differ from one business scenario to another, each business scenario has its own voting mechanism. Specifically, different weights can be assigned according to the type of features each emotion recognition model processes, or according to the results output by the models. For example, users often feel impatient or even angry when making a complaint, so in that scenario the weights associated with impatience or anger may be increased and those associated with, for example, happiness decreased.
In operation S240, a final emotion recognition result is determined according to the target voting mechanism and the N emotion recognition results.
For example, suppose there are three emotion recognition models:
Model 1: based on speech energy and fundamental frequency features; good at recognizing excited and calm emotions.
Model 2: based on cepstral coefficient features; good at recognizing vigilance and aversion emotions.
Model 3: based on linear prediction coefficient features; good at distinguishing high-spirited and sad emotions.
In a complaint scenario, the target voting mechanism is, for example, as follows.
Model 1: weight 0.3; the excited emotion it recognizes well is moderately relevant to complaints, while the calm emotion is less so.
Model 2: weight 0.5; the vigilance and aversion emotions it recognizes well are highly relevant to complaints, so its weight is the highest.
Model 3: weight 0.2; the high-spirited and sad emotions it recognizes well are less relevant in this scenario, so its weight is the lowest.
The recognition results of the three models are then weighted and summed; the results of models with higher weights contribute more, and the emotion classification result for the complaint scenario is finally obtained.
In the weighted summation, a multi-model weighted fusion classification method is used: the output of each model is softmax-normalized, weights are assigned according to the business scenario, the emotion confidences output by the models are weighted and summed, and the weighted sum is passed through softmax to obtain the emotion classification result matched to the business scenario.
In this way, different weights can be assigned in different business scenarios according to the feature types each model handles best, realizing scenario-adaptive multi-model emotion recognition and decision-making.
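By way of illustration only, the Python sketch below reproduces this fusion scheme with three models, five assumed emotion categories and the complaint-scenario weights 0.3/0.5/0.2 from the example above; the raw model scores are made-up numbers.

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    # Assumed raw scores of three models over five categories (cheerful, angry, flat, fearful, sad).
    model_outputs = np.array([
        [1.2, 2.8, 0.5, 0.1, 0.9],   # model 1
        [0.3, 3.1, 0.2, 0.4, 1.5],   # model 2
        [0.8, 1.0, 1.9, 0.2, 0.6],   # model 3
    ])
    weights = np.array([0.3, 0.5, 0.2])                           # complaint-scenario voting mechanism

    confidences = np.apply_along_axis(softmax, 1, model_outputs)  # softmax-normalize each model's output
    fused = softmax(weights @ confidences)                        # weighted sum, then softmax again

    categories = ["cheerful", "angry", "flat", "fearful", "sad"]
    print(categories[int(np.argmax(fused))])                      # emotion class for the complaint scenario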
Suppose the final emotion recognition result is determined to be fear. When the user's emotion is fear, the result is displayed to the human customer service agent in real time, so that the user can be attended to or guided in time to provide more information, and it can ultimately be judged whether the user is in danger. In other embodiments, the system may automatically record the situation and forward it to the relevant departments for handling.
It is understood that operation S220 and operation S230 may be performed simultaneously or sequentially. Operation S220 may be performed first and then operation S230 may be performed, or operation S230 may be performed first and then operation S220 may be performed.
According to the embodiments of the present disclosure, instead of analyzing the speech signal with a single deep learning algorithm, a plurality of emotion recognition models output a plurality of emotion recognition results and a voting mechanism matched to the specific business scenario is provided. The target voting mechanism corresponding to the user's current business scenario can be determined and used to obtain the final emotion recognition result, so that the plurality of emotion recognition models are fused more effectively; compared with speech emotion analysis based on a single classification algorithm, the method achieves higher accuracy and robustness.
In one embodiment of operation S220, the N emotion recognition models may be shallow networks; because they have relatively few network layers, the overall complexity is greatly reduced.
Because the speech signal has multiple kinds of features, such as time-domain, frequency-domain and energy features, three different models can be used to recognize the classified speech signal features. Specifically, the N emotion recognition models include a first recognition model based on a support vector machine, a second recognition model based on a convolutional neural network, and a third recognition model based on a recurrent neural network, and obtaining the N emotion recognition results output one each by the N emotion recognition models includes:
And obtaining a first recognition result output by the first recognition model.
The SVM (support vector machine) has good linear classification properties and classifies well those emotional audio samples whose speech signals differ markedly. Its linear computation has low complexity, enabling fast classification. FIG. 3 schematically shows several emotion recognition results according to an embodiment of the present disclosure; the speech signal features may be classified into five categories: cheerful, angry, flat, fearful and sad.
And obtaining a second recognition result output by the second recognition model.
For example, the MFCC features are arranged along the time axis into a matrix and input into the second recognition model, a convolutional neural network for classifying speech signal features whose framework is an input layer, a convolutional layer, a pooling layer, a convolutional layer, a fully connected layer and a Softmax output layer. During training, the cross-entropy loss function is used and ReLU is used as the activation function.
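As a non-authoritative sketch of such a second recognition model, the PyTorch snippet below follows the layer order described above (input, convolution, pooling, convolution, fully connected, Softmax); the channel counts and the assumed 12x100 MFCC input matrix are illustrative choices, not values taken from the disclosure.

    import torch
    import torch.nn as nn

    class CnnEmotionClassifier(nn.Module):
        """Sketch of the second recognition model: MFCC matrix in, emotion probabilities out."""
        def __init__(self, n_emotions=5):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),   # convolutional layer
                nn.MaxPool2d(2),                                        # pooling layer
                nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),  # convolutional layer
            )
            self.classifier = nn.Linear(16 * 6 * 50, n_emotions)        # fully connected layer (12x100 input)

        def forward(self, mfcc_matrix):                  # (batch, 1, 12, 100)
            z = self.features(mfcc_matrix).flatten(1)
            return self.classifier(z).softmax(dim=-1)    # Softmax output layer

    model = CnnEmotionClassifier()
    print(model(torch.randn(2, 1, 12, 100)).shape)       # torch.Size([2, 5])
    # Training would apply the cross-entropy loss to the pre-softmax logits.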
And obtaining a third recognition result output by the third recognition model.
A recurrent neural network (RNN) is a class of neural networks with short-term memory. In a recurrent neural network, a neuron can receive not only information from other neurons but also its own information, forming a network structure with loops. Compared with convolutional neural networks, recurrent neural networks can capture the characteristics of time-domain signals and are widely used for feature extraction from text and speech signals.
An LSTM (long short-term memory) network is a commonly used type of RNN: for example, the MFCC features are input as a time sequence, processed through the LSTM gates, and the result is output through a classification layer.
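Similarly, a minimal PyTorch sketch of such a third recognition model is shown below: a sequence of MFCC frames is processed by an LSTM and the final hidden state is passed to a classification layer. The hidden size and the five emotion categories are assumptions for illustration.

    import torch
    import torch.nn as nn

    class LstmEmotionClassifier(nn.Module):
        """Sketch of the third recognition model: MFCC frame sequence in, emotion logits out."""
        def __init__(self, n_mfcc=12, hidden=64, n_emotions=5):
            super().__init__()
            self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
            self.classifier = nn.Linear(hidden, n_emotions)

        def forward(self, mfcc_seq):                 # mfcc_seq: (batch, time, n_mfcc)
            _, (h_n, _) = self.lstm(mfcc_seq)        # h_n: (1, batch, hidden)
            return self.classifier(h_n[-1])          # logits: (batch, n_emotions)

    model = LstmEmotionClassifier()
    print(model(torch.randn(2, 100, 12)).shape)      # torch.Size([2, 5]): two utterances, 100 frames each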
Fig. 4 schematically illustrates a flowchart of extracting speech emotion features according to an embodiment of the present disclosure. Fig. 5 schematically illustrates a schematic diagram of wiener filtering according to an embodiment of the disclosure. Fig. 6 schematically illustrates a flowchart of extracting mel-frequency cepstral coefficients according to an embodiment of the present disclosure.
As shown in fig. 4, extracting a speech emotion feature based on a first speech signal of a user in operation S210 includes operations S410 to S420. The filtering equipment is connected with customer service communication equipment and comprises a DSP chip.
The filtering device can be hot-plugged into the communication device through a USB interface, providing the filtering function without major modification of the communication device and thus saving retrofit cost.
In operation S410, the first voice signal is filtered based on the DSP chip.
Wiener filtering can be implemented with the hardware DSP chip to remove the noise from the acquired signal. The main interference in a voice call is non-stationary noise, mainly including the sound of doors opening and closing, background speech and doorbells, whose statistical characteristics change over time. Wiener filtering can extract the actual signal from a noisy observation and can therefore be used to filter non-stationary noise. Referring to FIG. 5, Y(t) is the real-time observed speech signal, W(t) is the real-time speech of the caller and N(t) is the real-time noise signal; the filtered signal X(t) is obtained through the Wiener filter H(t).
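The Wiener filter itself runs on the DSP chip; purely as a software illustration of the same idea, the sketch below applies SciPy's Wiener filter to a toy noisy signal. The sine stand-in for W(t), the noise level and the window size are assumptions.

    import numpy as np
    from scipy.signal import wiener

    rng = np.random.default_rng(0)
    sr = 16000
    t = np.arange(sr) / sr
    w = 0.6 * np.sin(2 * np.pi * 220 * t)        # W(t): caller's speech (toy stand-in)
    n = 0.2 * rng.standard_normal(sr)            # N(t): noise
    y = w + n                                    # Y(t): observed real-time signal

    x = wiener(y, mysize=29)                     # X(t): filtered estimate through H(t)
    print(np.mean((y - w) ** 2), np.mean((x - w) ** 2))   # residual noise power before / after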
In operation S420, mel-frequency cepstral coefficients are extracted based on the filtered first speech signal.
Features are then extracted from the first speech signal based on the filtered signal obtained through the DSP chip. In terms of speech signal analysis, the speech emotion features include Mel-frequency cepstral coefficients (Mel-scale Frequency Cepstral Coefficients, MFCC for short).
Referring to FIG. 6, the process of extracting MFCC features includes the following. The filtered first speech signal is input (operation S601). Pre-emphasis, framing and windowing are performed first (operation S602); pre-emphasis and windowing also have a filtering effect. For each short-time analysis window, the corresponding spectrum is obtained through an FFT (fast Fourier transform) (operation S603). The magnitude or squared magnitude of the spectrum is taken (operation S604) and passed through a Mel filter bank to obtain the Mel spectrum (operation S605). Cepstral analysis is performed on the Mel spectrum: the logarithm is taken (operation S606) and an inverse transform is applied, in practice usually a DCT (discrete cosine transform), with the 2nd to 13th DCT coefficients taken as the MFCC coefficients (operation S607), yielding the Mel-frequency cepstral coefficients MFCC; the features of the speech frame include this MFCC feature vector (operation S608).
According to the embodiment of the disclosure, using a DSP chip for signal processing allows the digital signal features to be processed quickly, reduces the time lost to filtering in the overall system, and yields accurate Mel-frequency cepstral coefficients.
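A condensed NumPy/SciPy sketch of the MFCC pipeline of FIG. 6 (pre-emphasis, framing and windowing, FFT, power spectrum, Mel filter bank, logarithm, DCT, keeping the 2nd to 13th coefficients) is given below; the frame sizes, the 26-filter Mel bank built with librosa and the random stand-in signal are assumptions for illustration only.

    import numpy as np
    import librosa
    from scipy.fft import dct

    def mfcc_from_signal(x, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26):
        x = np.append(x[0], x[1:] - 0.97 * x[:-1])                      # S602: pre-emphasis
        frames = librosa.util.frame(x, frame_length=frame_len, hop_length=hop).T
        frames = frames * np.hamming(frame_len)                         # S602: framing and windowing
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2                 # S603-S604: FFT, squared magnitude
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        mel_spec = power @ mel_fb.T                                     # S605: Mel spectrum
        log_mel = np.log(mel_spec + 1e-10)                              # S606: logarithm
        return dct(log_mel, type=2, axis=1, norm="ortho")[:, 1:13]      # S607: DCT, 2nd-13th coefficients

    x = np.random.default_rng(1).standard_normal(16000)                 # stand-in for the filtered signal
    print(mfcc_from_signal(x).shape)                                    # (num_frames, 12)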
In other embodiments, the user's first speech signal is obtained by filtering, but in practice there are other sounds on the user's side. When the final emotion recognition result is determined to be fear, the user may be under threat; in that case the audio on the user's side is filtered again to isolate a third speech signal other than the first speech signal, and the speaking content of the third speech signal is recognized or its speech emotion features are extracted. In the environment surrounding the user, the third speech signal may come from a person threatening the user or from someone else in the same environment, and can provide additional information to assist the judgment. Recognizing the speaking content or emotion of the third speech signal proceeds in the same way as for the first speech signal and is not repeated here.
Fig. 7 schematically illustrates a flow chart of determining a current business scenario according to an embodiment of the present disclosure.
As shown in fig. 7, before the target voting mechanism is determined, determining the current business scenario in this embodiment includes operations S710 to S720.
In operation S710, first speaking content is recognized based on the first voice signal, and/or second speaking content is recognized based on a second voice signal of at least one person other than the user.
For example, the at least one person other than the user may be a customer service agent, or another person next to the user. In addition, a third speech signal, such as a system alert tone, may also be extracted.
In operation S720, the current business scenario is determined according to the first speaking content and/or the second speaking content.
For example, the user's first speaking content is "Hello, I would like to make a deposit", and the customer service agent's second speaking content is "Hello, please tell me your card number and the deposit amount". From the user's speech the current business scenario is judged to be a deposit, and the customer service speech further confirms it. By recognizing several rounds of voice interaction between the user and customer service, understanding the semantics and judging the deposit business from the exchange, the system realizes business scenario determination based on a person-to-person conversation.
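The disclosure does not prescribe a particular semantic-matching algorithm for this step; as a simple illustrative assumption, business scenario determination could be sketched as keyword matching over the recognized speaking content, as below.

    # Illustrative keyword matching for business scenario determination; keyword lists are assumptions.
    SCENARIO_KEYWORDS = {
        "deposit": ["deposit", "deposit amount"],
        "withdrawal": ["withdraw", "withdrawal"],
        "transfer": ["transfer", "payee"],
        "complaint": ["complain", "complaint"],
    }

    def determine_scenario(first_content: str, second_content: str = "") -> str:
        text = f"{first_content} {second_content}".lower()
        scores = {name: sum(kw in text for kw in kws) for name, kws in SCENARIO_KEYWORDS.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else "default"

    print(determine_scenario("Hello, I would like to make a deposit",
                             "Hello, please tell me your card number and the deposit amount"))  # deposit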
Fig. 8 schematically illustrates a flowchart of determining a final emotion recognition result according to an embodiment of the present disclosure.
As shown in fig. 8, determining the final emotion recognition result in operation S240 includes operations S810 to S830. The target voting mechanism comprises the weights of N emotion recognition models, and N emotion recognition results comprise N emotion recognition categories and corresponding voice attribute values. The voice attribute values may include values of a reference frequency or amplitude, etc.
In operation S810, weights of the N emotion recognition models are multiplied by voice attribute values corresponding to the N emotion recognition results, respectively.
The multiplied results are summed to obtain a final voice attribute value in operation S820.
For example, for the first, second and third recognition models described above, each model performs best in different scenarios, so classifier output weights are set per business scenario and the outputs are added in the same dimension. Suppose the category in the first recognition result is cheerful with a first predicted frequency as its voice attribute value, the category in the second recognition result is cheerful with a second predicted frequency, and the category in the third recognition result is flat with a third predicted frequency; each predicted frequency corresponds to a reference frequency of an emotion.
For financial business handled in a financial business scenario, the time-domain signal characteristics of the user are less pronounced, so the weights are set as:
RESULT = 0.3*A + 0.4*B + 0.3*C.
For non-financial business handled in a non-financial business scenario, the time-domain signal characteristics before and after the user's consultation are more pronounced, so the weights are set as:
RESULT = 0.3*A + 0.3*B + 0.4*C.
Here RESULT is the final attribute value, and A, B and C are the first, second and third predicted frequencies, respectively; the coefficient multiplying each frequency is the corresponding model's weight.
In operation S830, a final emotion recognition result is determined according to the final voice attribute value.
In some embodiments, a shallow neural network is designed for the voting mechanism, consisting of an input layer, a hidden layer and a softmax output layer. The final voice attribute value is fed into the input layer, and the softmax output layer outputs the emotion with the highest predicted probability.
In other embodiments, the final voice attribute value is compared with S emotion category standard ranges, where S is greater than or equal to 1, and the final emotion recognition result is determined according to the comparison result.
The final voice attribute value is used as a reference frequency value, and the S emotion category standard ranges include preset standard reference frequency ranges corresponding to S emotion categories, for example: happiness 180-220 Hz, anger 150-180 Hz, aversion 140-160 Hz, sadness 110-130 Hz, agitation 200-250 Hz and calm 90-110 Hz. The ranges given for these emotion categories are only examples; those skilled in the art can set them according to actual conditions and flexibly adjust the intervals.
For example, if the final voice attribute value is 185 Hz, it is compared with the standard attribute range of each emotion category, and the system judges that the final emotion recognition result of the speech signal is "happy".
According to this embodiment, the speech emotion judgment is obtained directly by comparing the final attribute value with the standard ranges, so the shallow neural network of the voting mechanism does not need to be trained, which saves training cost.
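Putting operations S810 to S830 together, the sketch below computes the weighted final frequency with the financial-scenario weights and looks it up in the standard ranges listed above; the predicted frequencies are made-up numbers, and overlapping ranges are resolved here simply by first match, which is an assumption.

    # Sketch of operations S810-S830: weighted sum of predicted frequencies, then range comparison.
    EMOTION_RANGES_HZ = {
        "happiness": (180, 220), "anger": (150, 180), "aversion": (140, 160),
        "sadness": (110, 130), "agitation": (200, 250), "calm": (90, 110),
    }

    def final_emotion(pred_freqs, weights):
        result = sum(w * f for w, f in zip(weights, pred_freqs))   # S810 + S820
        for emotion, (low, high) in EMOTION_RANGES_HZ.items():     # S830: compare with standard ranges
            if low <= result < high:
                return emotion, result
        return "unknown", result

    # Financial-scenario weights (0.3, 0.4, 0.3); the three predicted frequencies are assumptions.
    print(final_emotion([190.0, 185.0, 178.0], [0.3, 0.4, 0.3]))   # ('happiness', ~184.4)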
In other embodiments, the voting mechanism may be majority rule. For example, suppose 5 emotion recognition models are used to recognize the emotion category of the first speech signal. If three of the emotion recognition models output the same emotion category (e.g., happy), one outputs a different emotion category (e.g., restless) and one outputs yet another emotion category (e.g., sad), then a single emotion category, happy, is identified based on the votes of the 5 models and is taken as the final emotion category recognized by the plurality of emotion recognition models.
In other embodiments, the voting mechanism may be weighted majority rule, with the model weights differing between the voting mechanisms of different business scenarios, i.e., a model carries a different number of votes in different business scenarios. Suppose the weights of the three emotion recognition models that output happiness are 1, 1 and 2 respectively, the weight of the one model that outputs impatience is 3, and the weight of the one model that outputs sadness is 3. Happiness then receives 4 votes and becomes the final emotion category identified by the plurality of emotion recognition models. If two emotion categories receive the same number of votes, recognition is performed again.
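A minimal sketch of this weighted majority-rule variant, using the assumed 1/1/2/3/3 weights above and returning None when a tie calls for re-recognition, is shown below.

    from collections import Counter

    def weighted_vote(predictions, weights):
        """predictions: emotion category per model; weights: that model's votes in the current scenario."""
        tally = Counter()
        for emotion, votes in zip(predictions, weights):
            tally[emotion] += votes
        ranked = tally.most_common(2)
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            return None                                   # tie between two categories: recognize again
        return ranked[0][0]

    preds = ["happy", "happy", "happy", "impatient", "sad"]
    print(weighted_vote(preds, [1, 1, 2, 3, 3]))          # 'happy' wins with 4 votes against 3 and 3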
Fig. 9 schematically illustrates a flow chart of a pre-obtained voting mechanism according to an embodiment of the present disclosure.
As shown in fig. 9, before the N emotion recognition results are obtained, the voting mechanisms of this embodiment are obtained in advance through operations S910 to S930.
In operation S910, M training sets in one-to-one correspondence with the M business scenarios are determined, where each training set includes speech emotion feature samples and emotion labels for the corresponding business scenario.
Illustratively, each financial business scenario, such as deposit, withdrawal, transfer, loan, remittance, financial management or payment, corresponds to a training set that includes speech emotion feature samples and emotion labels covering a plurality of emotion categories. Likewise, each non-financial business scenario, such as after-sales service, consultation, complaint, reservation, registration or refund, corresponds to such a training set.
In operation S920, the N emotion recognition models to be trained and the corresponding voting mechanisms are trained with each training set, where each voting mechanism includes a learnable weight for each of the N emotion recognition models, and the training includes adjusting the model parameters of the N emotion recognition models and the learnable weights.
For example, training is performed independently for each business scenario. For each speech emotion feature sample, the N emotion recognition results output by the N emotion recognition models are combined by the corresponding voting mechanism into a final emotion recognition result, and the loss between that result and the sample's emotion label is computed. A back-propagation algorithm is then run, according to the magnitude of the loss value, to adjust the model parameters of the N emotion recognition models and the learnable weights.
In operation S930, trained N emotion recognition models and M voting mechanisms are obtained, with N and M being greater than or equal to 2.
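By way of illustration only, the condensed PyTorch sketch below shows one training step of this joint scheme: per-scenario learnable fusion weights are trained together with the models by back-propagation. The tiny stand-in models, the random batch and the softmax over the fusion weights are assumptions made to keep the example self-contained.

    import torch
    import torch.nn as nn

    n_models, n_emotions, feat_dim = 3, 5, 20

    # Tiny stand-ins for the N emotion recognition models (assumed architecture).
    models = nn.ModuleList(nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, n_emotions))
                           for _ in range(n_models))
    fusion_logits = nn.Parameter(torch.zeros(n_models))        # learnable voting weights for one scenario

    optimizer = torch.optim.Adam(list(models.parameters()) + [fusion_logits], lr=1e-3)
    loss_fn = nn.NLLLoss()

    features = torch.randn(16, feat_dim)                       # random batch standing in for one training set
    labels = torch.randint(0, n_emotions, (16,))               # emotion labels

    probs = torch.stack([m(features).softmax(dim=-1) for m in models])   # (n_models, batch, n_emotions)
    weights = fusion_logits.softmax(dim=0).view(-1, 1, 1)                # normalized voting weights
    fused = (weights * probs).sum(dim=0)                                 # voting-mechanism fusion
    loss = loss_fn(torch.log(fused + 1e-9), labels)                      # final result vs. emotion label

    optimizer.zero_grad()
    loss.backward()      # back-propagation adjusts both model parameters and the learnable weights
    optimizer.step()
    print(float(loss))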
In this embodiment, when operations S210 to S240 are performed, operation S230 may be performed first to determine the business scenario, and the N emotion recognition models and the voting mechanism trained for that scenario may then be invoked; that is, each business scenario may correspond to its own set of trained N emotion recognition models and voting mechanism.
According to the embodiment of the disclosure, within a specific business scenario users' speech and emotions show commonalities, and a model trained for that scenario can learn them better, giving higher recognition performance in that scenario. The voting mechanism also better matches the scenario's requirements: users' emotional focus differs between business scenarios, and the voting mechanism can set different weights per scenario so as to capture the user's real emotional experience in it.
Fig. 10 schematically illustrates a flowchart of a speech emotion recognition method according to another embodiment of the present disclosure.
Referring to FIG. 10, during a voice customer service conversation, the user's speech signal is input (operation S1010). Audio collection then converts the user's analog signal into a digital signal (operation S1020). After the digital signal is acquired, Wiener filtering is performed with the DSP chip to remove the noise signal (operation S1030). Speech emotion features are extracted from the denoised signal (operation S1040). After the signal features are extracted, speech emotion feature classification is performed to obtain a preliminary speech emotion classification result (operation S1050). The voting stage is then entered, and the user's speech emotion judgment result is output through the voting mechanism of the current business scenario (operation S1060). Apart from the DSP chip hardware, the audio collection, speech emotion feature extraction, speech emotion feature classification, business scenario recognition and voting mechanism may each correspond to its own software module.
Based on the voice emotion recognition method, the disclosure also provides a voice emotion recognition device. The device will be described in detail below with reference to fig. 11.
Fig. 11 schematically shows a block diagram of a speech emotion recognition device according to an embodiment of the present disclosure.
As shown in fig. 11, the speech emotion recognition device 1100 of this embodiment includes a feature extraction module 1110, an emotion recognition module 1120, a mechanism determination module 1130, and a final recognition module 1140.
The feature extraction module 1110 may perform operation S210 for extracting a voice emotion feature based on the first voice signal of the user.
In some embodiments, the feature extraction module 1110 may perform operations S410 to S420, which are not described herein.
The emotion recognition module 1120 may perform operation S220, configured to input the speech emotion features to the N emotion recognition models, respectively, and obtain N emotion recognition results output by the N emotion recognition models one by one, where N is greater than or equal to 2.
The mechanism determining module 1130 may perform operation S230, configured to determine, according to a current service scenario of a user, a target voting mechanism corresponding to the service scenario, where M voting mechanisms corresponding to M service scenarios one to one are provided in advance, and M is greater than or equal to 2.
The final recognition module 1140 may perform operation S240 for determining a final emotion recognition result according to the target voting mechanism and the N emotion recognition results.
In some embodiments, the final identifying module 1140 may perform operations S810-S830, which are not described herein.
In some embodiments, the speech emotion recognition device 1100 may include a scene confirmation module for performing operations S710 to S720, which are not described herein.
In some embodiments, the speech emotion recognition device 1100 may include a training module for performing operations S910 to S930, which are not described herein.
Note that, the speech emotion recognition device 1100 includes modules for performing the respective steps of any of the embodiments described above with reference to fig. 2 to 10. The implementation manner, the solved technical problems, the realized functions and the realized technical effects of each module/unit/sub-unit and the like in the apparatus part embodiment are the same as or similar to the implementation manner, the solved technical problems, the realized functions and the realized technical effects of each corresponding step in the method part embodiment, and are not repeated herein.
Any of the plurality of modules of feature extraction module 1110, emotion recognition module 1120, mechanism determination module 1130, and final recognition module 1140 may be combined in one module or any of the plurality of modules may be split into a plurality of modules, according to embodiments of the present disclosure. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module.
At least one of feature extraction module 1110, emotion recognition module 1120, mechanism determination module 1130, and final recognition module 1140 may be implemented, at least in part, as hardware circuitry, such as a Field Programmable Gate Array (FPGA), programmable Logic Array (PLA), system-on-chip, system-on-substrate, system-on-package, application Specific Integrated Circuit (ASIC), or by hardware or firmware, such as any other reasonable manner of integrating or packaging circuitry, or in any one of or a suitable combination of three of software, hardware, and firmware, in accordance with embodiments of the present disclosure. Alternatively, at least one of the feature extraction module 1110, the emotion recognition module 1120, the mechanism determination module 1130, and the final recognition module 1140 may be at least partially implemented as computer program modules that, when executed, perform the corresponding functions.
Fig. 12 schematically illustrates a block diagram of an electronic device adapted to implement a speech emotion recognition method according to an embodiment of the present disclosure.
As shown in fig. 12, an electronic device 1200 according to an embodiment of the present disclosure includes a processor 1201, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. The processor 1201 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 1201 may also include on-board memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 1203, various programs and data required for the operation of the electronic apparatus 1200 are stored. The processor 1201, the ROM 1202, and the RAM 1203 are connected to each other through a bus 1204. The processor 1201 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1202 and/or RAM 1203. Note that the program may be stored in one or more memories other than the ROM 1202 and the RAM 1203. The processor 1201 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in one or more memories.
According to an embodiment of the disclosure, the electronic device 1200 may also include an input/output (I/O) interface 1205, which is also connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, etc.; an output section 1207 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, etc.; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN card or a modem. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1210 as needed so that a computer program read therefrom is installed into the storage section 1208 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments. Or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include the ROM 1202 and/or the RAM 1203 and/or one or more memories other than the ROM 1202 and the RAM 1203 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to perform the methods provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1201. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed over a network medium in the form of a signal, and downloaded and installed through the communication section 1209 and/or from the removable medium 1211. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to wireless or wired media, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code for carrying out the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or sub-combined in a variety of ways, even if such combinations or sub-combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or sub-combined in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or sub-combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (12)

1. A method of speech emotion recognition, comprising:
extracting speech emotion features based on a first speech signal of a user;
inputting the speech emotion features into N emotion recognition models respectively, to obtain N emotion recognition results output by the N emotion recognition models in one-to-one correspondence, wherein N is greater than or equal to 2;
determining, according to a current service scenario of the user, a target voting mechanism corresponding to the service scenario, wherein M voting mechanisms in one-to-one correspondence with M service scenarios are provided in advance, and M is greater than or equal to 2; and
determining a final emotion recognition result according to the target voting mechanism and the N emotion recognition results.
2. The method of claim 1, wherein the speech emotion recognition method is applied to a remote telephone banking system, and the method further comprises, prior to extracting the speech emotion features based on the first speech signal of the user:
acquiring the first speech signal in response to the user talking to a customer service agent using the remote telephone banking system.
3. The method of claim 2, wherein a filtering device coupled to a customer service communication device is provided, the filtering device comprising a DSP chip, and extracting the speech emotion features based on the first speech signal of the user comprises:
filtering the first speech signal using the DSP chip; and
extracting mel-frequency cepstral coefficients based on the filtered first speech signal.
4. The method of claim 2, wherein the M service scenarios include at least one financial service scenario and at least one non-financial service scenario, each of the M voting mechanisms includes respective weights for the N emotion recognition models, and any one voting mechanism differs from at least one other voting mechanism in at least one weight.
5. The method of claim 1, wherein, prior to determining the target voting mechanism, the method further comprises:
identifying first speech content based on the first speech signal, and/or identifying second speech content based on a second speech signal of at least one person other than the user; and
determining the current service scenario according to the first speech content and/or the second speech content.
6. The method of claim 5, wherein
the target voting mechanism comprises weights of the N emotion recognition models, and the N emotion recognition results comprise N emotion recognition categories and voice attribute values corresponding to the N emotion recognition categories; and
determining the final emotion recognition result according to the target voting mechanism and the N emotion recognition results comprises:
multiplying the weight of each of the N emotion recognition models by the voice attribute value in the corresponding emotion recognition result;
summing the multiplied results to obtain a final voice attribute value; and
determining the final emotion recognition result according to the final voice attribute value.
7. The method of claim 6, wherein determining the final emotion recognition result according to the final voice attribute value comprises:
comparing the final voice attribute value with S emotion category standard ranges, wherein S is greater than or equal to 1; and
determining the final emotion recognition result according to the comparison result.
8. The method of claim 1, wherein, prior to obtaining the N emotion recognition results, the method further comprises:
determining M training sets in one-to-one correspondence with the M service scenarios, wherein each training set comprises speech emotion feature samples and emotion labels in the corresponding service scenario;
training, with each training set, N emotion recognition models to be trained and the corresponding voting mechanism, wherein each voting mechanism comprises respective learnable weights for the N emotion recognition models, and the training comprises adjusting the model parameters of the N emotion recognition models and the learnable weights; and
obtaining the trained N emotion recognition models and the M voting mechanisms, wherein each of N and M is greater than or equal to 2.
9. A speech emotion recognition device, comprising:
a feature extraction module configured to extract speech emotion features based on a first speech signal of a user;
an emotion recognition module configured to input the speech emotion features into N emotion recognition models respectively, to obtain N emotion recognition results output by the N emotion recognition models in one-to-one correspondence, wherein N is greater than or equal to 2;
a mechanism determination module configured to determine, according to a current service scenario of the user, a target voting mechanism corresponding to the service scenario, wherein M voting mechanisms in one-to-one correspondence with M service scenarios are provided in advance, and M is greater than or equal to 2; and
a final recognition module configured to determine a final emotion recognition result according to the target voting mechanism and the N emotion recognition results.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-8.
11. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.
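For illustration, the inference flow recited in claims 1, 3, 6 and 7 (mel-frequency cepstral coefficient extraction followed by a scenario-specific weighted vote over N model outputs) can be sketched in Python. This is a minimal sketch under assumptions not stated in the disclosure: it uses librosa for MFCC extraction, treats each of the N emotion recognition models as an opaque callable returning a voice attribute value in [0, 1], and the scenario names, weights, and emotion category standard ranges are invented placeholders, not values from the patent.

```python
# Illustrative sketch of the claimed weighted-voting inference (not the patented implementation).
import librosa
import numpy as np

# Hypothetical voting mechanisms: one weight vector per service scenario (M = 2, N = 3).
VOTING_MECHANISMS = {
    "financial":     [0.5, 0.3, 0.2],
    "non_financial": [0.2, 0.3, 0.5],
}

# Hypothetical emotion category standard ranges (S = 3) over the final voice attribute value.
EMOTION_RANGES = {
    "calm":    (0.0, 0.4),
    "anxious": (0.4, 0.7),
    "angry":   (0.7, 1.0),
}

def extract_features(wav_path: str) -> np.ndarray:
    """Extract MFCCs from the (already filtered) speech signal and pool them per utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, frames)
    return mfcc.mean(axis=1)

def recognize_emotion(wav_path: str, scenario: str, models) -> str:
    """Weighted vote over N model outputs using the scenario's target voting mechanism."""
    features = extract_features(wav_path)
    weights = VOTING_MECHANISMS[scenario]
    # Each model returns a voice attribute value in [0, 1] for the extracted features.
    attribute_values = [model(features) for model in models]
    final_value = sum(w * v for w, v in zip(weights, attribute_values))
    # Compare the final voice attribute value with the S emotion category ranges.
    for emotion, (low, high) in EMOTION_RANGES.items():
        if low <= final_value < high:
            return emotion
    return "angry" if final_value >= 1.0 else "calm"

# Usage with stand-in models (each a callable producing a score in [0, 1]):
models = [lambda f: 0.8, lambda f: 0.6, lambda f: 0.9]
# print(recognize_emotion("call.wav", "financial", models))
```

The per-scenario weight table plays the role of the M voting mechanisms provided in advance; only the weights differ between scenarios, as claim 4 requires.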
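Claim 8 further recites training the N emotion recognition models together with scenario-specific learnable voting weights on per-scenario training sets. Below is a hedged sketch of one way such joint training could be set up, assuming PyTorch, differentiable stand-in models, and a toy random dataset; the disclosure does not prescribe this particular model architecture or optimization scheme.

```python
# Illustrative sketch of jointly learning model parameters and voting weights (assumptions invented).
import torch
import torch.nn as nn

N_MODELS, FEAT_DIM, N_EMOTIONS = 3, 13, 3

class EnsembleWithLearnableWeights(nn.Module):
    """N stand-in emotion recognition models plus learnable voting weights for one scenario."""
    def __init__(self):
        super().__init__()
        self.models = nn.ModuleList(
            nn.Sequential(nn.Linear(FEAT_DIM, 32), nn.ReLU(), nn.Linear(32, N_EMOTIONS))
            for _ in range(N_MODELS)
        )
        # Learnable weights of the voting mechanism (normalized with softmax when used).
        self.vote_logits = nn.Parameter(torch.zeros(N_MODELS))

    def forward(self, x):
        weights = torch.softmax(self.vote_logits, dim=0)
        outputs = torch.stack([m(x) for m in self.models], dim=0)   # (N, batch, emotions)
        return (weights[:, None, None] * outputs).sum(dim=0)        # weighted vote

# One training set per service scenario; here a single toy scenario with random data.
features = torch.randn(64, FEAT_DIM)             # speech emotion feature samples
labels = torch.randint(0, N_EMOTIONS, (64,))     # emotion labels

model = EnsembleWithLearnableWeights()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adjusts model params AND vote weights
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()

print("learned voting weights:", torch.softmax(model.vote_logits, dim=0).tolist())
```

Repeating this loop once per service scenario's training set would yield the M voting mechanisms (weight vectors) that the method then selects among at inference time.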
CN202310498896.XA 2023-05-05 2023-05-05 Speech emotion recognition method, device, equipment and medium Pending CN116469420A (en)

Priority Applications (1)

Application Number: CN202310498896.XA
Publication Number: CN116469420A (en)
Priority Date: 2023-05-05
Filing Date: 2023-05-05
Title: Speech emotion recognition method, device, equipment and medium

Applications Claiming Priority (1)

Application Number: CN202310498896.XA
Publication Number: CN116469420A (en)
Priority Date: 2023-05-05
Filing Date: 2023-05-05
Title: Speech emotion recognition method, device, equipment and medium

Publications (1)

Publication Number: CN116469420A
Publication Date: 2023-07-21

Family

ID=87180808

Family Applications (1)

Application Number: CN202310498896.XA
Status: Pending
Publication Number: CN116469420A (en)
Title: Speech emotion recognition method, device, equipment and medium

Country Status (1)

Country: CN
Link: CN116469420A (en)

Similar Documents

Publication Publication Date Title
CN111276131B (en) Multi-class acoustic feature integration method and system based on deep neural network
US10623573B2 (en) Personalized support routing based on paralinguistic information
US10504504B1 (en) Image-based approaches to classifying audio data
US20150142446A1 (en) Credit Risk Decision Management System And Method Using Voice Analytics
US8145562B2 (en) Apparatus and method for fraud prevention
CN109767765A (en) Talk about art matching process and device, storage medium, computer equipment
TW201905675A (en) Data update method, client and electronic device
CN112949708B (en) Emotion recognition method, emotion recognition device, computer equipment and storage medium
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN113257283B (en) Audio signal processing method and device, electronic equipment and storage medium
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
GB2521050A (en) Credit risk decision management system and method using voice analytics
CN114416989A (en) Text classification model optimization method and device
CN112992147A (en) Voice processing method, device, computer equipment and storage medium
Badhon et al. A machine learning approach to automating Bengali voice based gender classification
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
Hamsa et al. Speaker identification from emotional and noisy speech using learned voice segregation and speech VGG
CN111475633B (en) Speech support system based on seat voice
Ahmad et al. Determining speaker attributes from stress-affected speech in emergency situations with hybrid SVM-DNN architecture
CN115376498A (en) Speech recognition method, model training method, device, medium, and electronic apparatus
CN116469420A (en) Speech emotion recognition method, device, equipment and medium
US11196864B1 (en) Analyzing voice response to telephone call to assign appropriate agent
CN113035230B (en) Authentication model training method and device and electronic equipment
Konduru et al. Multidimensional feature diversity based speech signal acquisition
Yadava et al. Improvements in spoken query system to access the agricultural commodity prices and weather information in Kannada language/dialects

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination