WO2021174757A1 - Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium - Google Patents

Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium

Info

Publication number
WO2021174757A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
emotion
audio
voice
recognition model
Prior art date
Application number
PCT/CN2020/105543
Other languages
French (fr)
Chinese (zh)
Inventor
王德勋
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021174757A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device, electronic device, and computer-readable storage medium for voice emotion recognition.
  • Emotional computing is an important technology that gives intelligent machines the ability to perceive, understand, and express various emotional states.
  • As an important carrier of emotional information, voice technology has received increasing attention.
  • Although current voice emotion detection achieves good results, the inventor realized that, limited by problems such as data set quality and the subjectivity of emotion annotation, most models can only judge a single emotion, the number of emotion categories that can be judged is small, and the hidden emotions in complex speech cannot be accurately described.
  • For the multiple emotions that may be contained in a segment of speech, the boundaries are also difficult to determine; these problems greatly limit the promotion and development of speech emotion recognition technology.
  • In order to solve the above technical problems, an object of the present application is to provide a voice emotion recognition method, apparatus, electronic device, and computer-readable storage medium.
  • In a first aspect, a voice emotion recognition method includes: when a user voice is received, extracting multiple types of audio features of the user voice; respectively matching the audio features with feature samples in an emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; constructing a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples; inputting the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; and obtaining the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
  • In a second aspect, a voice emotion recognition apparatus includes: an extraction module for extracting multiple types of audio features of a user voice when the user voice is received; a matching module for respectively matching the audio features with feature samples in an emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; a construction module for constructing a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples; a prediction module for inputting the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; and a determination module for obtaining the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
  • In a third aspect, an electronic device includes: a processor; and a memory for storing computer program instructions for the processor; wherein the processor is configured to perform the above method by executing the computer program instructions.
  • In a fourth aspect, a computer-readable storage medium has computer program instructions stored thereon; when the computer program instructions are executed by a processor, the above method is implemented.
  • First, when the user voice is received, multiple types of audio features of the user voice are extracted; the multiple types of audio features obtained in this way can reflect the change characteristics of the user's voice from different perspectives, that is, they can characterize the user's emotions from different angles. Then, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; in this way, the candidate emotions expressed by each feature can be obtained, which in turn guides the identification of the user's various hidden emotions in the subsequent steps.
  • The feature label matrix is then input into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; through the multi-emotion recognition model, multiple possible scenes and the corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix.
  • Finally, the scene label matched by the voice scene of the user voice is obtained, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user; in this way, the speech emotion recognition result is obtained according to the matching of the real scene of the voice. In this manner, various potential emotions can be recognized efficiently and accurately from speech.
  • Fig. 1 schematically shows a flow chart of a method for speech emotion recognition.
  • Fig. 2 schematically shows an example diagram of an application scenario of a voice emotion recognition method.
  • Fig. 3 schematically shows a flow chart of a feature extraction method.
  • Fig. 4 schematically shows a block diagram of a voice emotion recognition device.
  • Fig. 5 schematically shows an example block diagram of an electronic device for implementing the above-mentioned voice emotion recognition method.
  • Fig. 6 schematically shows a computer-readable storage medium for implementing the aforementioned voice emotion recognition method.
  • This exemplary embodiment first provides a voice emotion recognition method.
  • The voice emotion recognition method can be run on a server, a server cluster, a cloud server, or the like.
  • Referring to Fig. 1, the voice emotion recognition method may include the following steps:
  • Step S110, when a user voice is received, extract multiple types of audio features of the user voice;
  • Step S120, respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each of the audio features;
  • Step S130, based on the audio features and the emotion labels corresponding to the matched feature samples, construct a feature label matrix of the user voice;
  • Step S140, input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
  • Step S150, obtain the scene label matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
  • In the above voice emotion recognition method, first, when a user voice is received, multiple types of audio features of the user voice are extracted; the multiple types of audio features obtained in this way can reflect the change characteristics of the user's voice from different perspectives, that is, they can characterize the user's emotions from different angles.
  • Then, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; in this way, the candidate emotions expressed by each feature can be obtained, which in turn guides the identification of the user's various hidden emotions in the subsequent steps.
  • Then, the feature label matrix is input into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; through the multi-emotion recognition model, multiple possible scenes and the corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix.
  • Finally, the scene label matched by the voice scene of the user voice is obtained, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user; in this way, the speech emotion recognition result is obtained according to the matching of the real scene of the voice. In this manner, various potential emotions can be recognized efficiently and accurately from speech.
  • In step S110, when a user voice is received, multiple types of audio features of the user voice are extracted.
  • In this example embodiment, referring to Fig. 2, the server 201 receives the user voice sent by the server 202, and the server 201 can then extract multiple types of audio features of the user voice and perform emotion recognition in the subsequent steps.
  • The server 201 can be any terminal capable of executing program instructions and storing data, such as a cloud server, a mobile phone, or a computer; the server 202 can be any terminal with a storage function, such as a mobile phone or a computer.
  • The audio features can be various audio features such as a zero-crossing rate feature, a short-term energy feature, a short-term average amplitude difference feature, a pronunciation frame number feature, a pitch frequency feature, a formant feature, a harmonic-to-noise ratio feature, and a Mel cepstrum coefficient feature. These features can be extracted from a piece of audio using existing audio feature extraction methods.
  • The extracted multiple types of audio features of the user's voice can reflect the change characteristics of the user's voice from different angles, that is, they can characterize the user's emotions from different angles. For example, short-term energy reflects the strength of the signal at different moments and can therefore reflect how the stability of the user's emotion changes over a segment of speech. Audio has periodic characteristics, and under stationary noise the short-term average amplitude difference makes this periodicity easier to observe, so it can reflect the periodicity of the user's emotion within a segment of speech. A formant is the resonance characteristic produced when quasi-periodic pulses at the glottis excite the vocal tract, producing a set of resonance frequencies called formant frequencies, or formants for short. Formant parameters, which include the formant frequency and the bandwidth, are important parameters for distinguishing different finals and can characterize the user's emotions from a linguistic perspective.
  • In this way, by extracting multiple types of audio features of the user's voice, the user's emotions can be analyzed based on these features in the subsequent steps.
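  • For illustration only (not part of the patent text), a minimal Python sketch of extracting several of these feature types with the open-source librosa library might look as follows; the sample rate, frame parameters, and the simplified magnitude-difference computation are assumptions, and formant and harmonic-to-noise features would require an additional tool such as Praat:

```python
import numpy as np
import librosa

def extract_features(path, frame_length=1024, hop_length=512):
    # Load the user voice (16 kHz mono is an assumption, not specified by the patent)
    y, sr = librosa.load(path, sr=16000)

    # Zero-crossing rate and short-term (RMS) energy, one value per frame
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)[0]
    energy = librosa.feature.rms(y=y, frame_length=frame_length,
                                 hop_length=hop_length)[0]

    # Simplified per-frame average magnitude difference (a stand-in for AMDF)
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    amdf = np.mean(np.abs(np.diff(frames, axis=0)), axis=0)

    # Pitch (fundamental frequency) per frame via the YIN estimator
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                     frame_length=frame_length, hop_length=hop_length)

    # Mel cepstrum coefficients, averaged over frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    return {"zcr": zcr, "energy": energy, "amdf": amdf, "f0": f0, "mfcc": mfcc}
```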
  • In one implementation of this example, referring to Fig. 3, when a user voice is received, extracting the multiple types of audio features of the user voice includes:
  • Step S310, when the user voice is received, converting the user voice into text;
  • Step S320, matching the text with the text samples in a feature extraction category database to obtain a text sample matching the text;
  • Step S330, extracting, from the user voice, audio features of multiple feature categories associated with the text sample.
  • When the user's voice is received, it is converted into text, so the actual content expressed by the user can be obtained. The converted text is then matched against the text samples in the feature extraction category database to obtain the text sample that matches the converted text.
  • The feature extraction category database stores, for texts with different semantics, the feature categories of the audio features that most clearly reflect emotion when that text is spoken. By then extracting, from the user's voice, the audio features of the feature categories associated with the matched text sample, emotion recognition can be performed efficiently and accurately in the subsequent steps.
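  • As a purely illustrative sketch of steps S310 to S330 (the speech-to-text function and the contents of the feature extraction category database are hypothetical placeholders, not defined by the patent), the text-driven selection of feature categories could be organized as follows:

```python
from difflib import SequenceMatcher

# Hypothetical database: text sample -> feature categories that best expose emotion
feature_category_db = {
    "I want a refund right now": ["short_term_energy", "pitch", "zero_crossing_rate"],
    "thank you so much for your help": ["mfcc", "formant", "pitch"],
}

def best_matching_sample(text, db):
    # Pick the stored text sample most similar to the converted text
    return max(db, key=lambda sample: SequenceMatcher(None, text, sample).ratio())

def select_feature_categories(audio_path, speech_to_text, db=feature_category_db):
    text = speech_to_text(audio_path)        # step S310: convert speech to text
    sample = best_matching_sample(text, db)  # step S320: match against the database
    return db[sample]                        # step S330: categories to extract
```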
  • In one embodiment, the multiple types of audio features include at least three of: the zero-crossing rate feature, the short-term energy feature, the short-term average amplitude difference feature, the pronunciation frame number feature, the pitch frequency feature, the formant feature, the harmonic-to-noise ratio feature, and the Mel cepstrum coefficient feature.
  • Because the multiple types of audio features include at least three of these features, multi-emotion recognition can be achieved with relatively high accuracy.
  • In step S120, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature.
  • In this example embodiment, the emotion feature library stores feature samples of audio features of various categories, and each feature sample is associated with an emotion label of one category.
  • When the audio features are matched with the feature samples in the emotion feature library, the similarity between an audio feature and a feature sample can be computed using, for example, the Euclidean distance or the Hamming distance, and the emotion labels corresponding to the multiple feature samples that match each audio feature (for example, feature samples with a similarity greater than 50%) are then obtained. In this way, the multiple candidate emotions expressed by each feature can be obtained, which guides the identification of the user's various hidden emotions in the subsequent steps.
  • In one implementation of this example, respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples that match each audio feature includes: respectively comparing the audio features with the feature samples in the emotion feature library to obtain a plurality of feature samples whose similarity to each audio feature exceeds a predetermined threshold, where the predetermined threshold corresponds to the number of audio features; and obtaining, from the emotion feature library, the emotion label corresponding to each of the feature samples.
  • The predetermined threshold can be set according to accuracy requirements.
  • The predetermined threshold corresponds to the number of audio features; that is, its value is determined by the number of audio features, and the larger the number of audio features, the smaller the predetermined threshold can be. By respectively comparing the audio features with the feature samples in the emotion feature library to obtain, for each audio feature, the multiple feature samples whose similarity exceeds the predetermined threshold, and then obtaining the emotion label corresponding to each feature sample from the emotion feature library, the reliability of the emotion recognition of each audio feature can be ensured.
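  • A minimal sketch of this matching step is shown below; the library layout, the Euclidean-distance-based similarity score, and the exact rule tying the threshold to the number of features are assumptions made for illustration:

```python
import numpy as np

def similarity(a, b):
    # Map Euclidean distance to a similarity score in (0, 1]
    return 1.0 / (1.0 + np.linalg.norm(np.asarray(a) - np.asarray(b)))

def match_emotion_labels(audio_features, library, base_threshold=0.5):
    # One possible rule: the more feature types compared, the lower the threshold
    threshold = base_threshold / (1.0 + np.log1p(len(audio_features)))
    matches = {}
    for name, vec in audio_features.items():
        matches[name] = [
            (entry["label"], similarity(vec, entry["vector"]))
            for entry in library.get(name, [])   # feature samples of this category
            if similarity(vec, entry["vector"]) > threshold
        ]
    return matches
```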
  • In step S130, a feature label matrix of the user voice is constructed based on the audio features and the emotion labels corresponding to the matched feature samples.
  • In this example embodiment, the feature label matrix stores the audio features of the user's voice and the corresponding emotion labels, that is, emotion labels reflecting the possible emotions expressed by those audio features. The different types of audio features and the emotion labels of different likelihoods embodied by the feature samples of the corresponding similarities are structurally linked through the feature label matrix, with the emotion labels forming constraints on the combination of audio features; the matrix can therefore reflect possible patterns of latent emotional change.
  • In one implementation of this example, constructing the feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples includes:
  • adding the emotion labels corresponding to each audio feature to the column corresponding to that audio feature, in descending order of the similarity between each feature sample and the audio feature, to obtain the feature label matrix, where each row of the matrix corresponds to a similarity range.
  • Each audio feature is first added to the first row of an empty matrix, so that each column corresponds to one audio feature.
  • The emotion labels corresponding to each audio feature are then added to the column of that audio feature in descending order of the similarity between each feature sample and the audio feature to obtain the feature label matrix. For example, if the similarity between audio feature A and feature sample A1 is 63%, the emotion label corresponding to feature sample A1 can be added to the row for the 60%-70% interval in the column where audio feature A is located.
  • Each row of the matrix corresponds to a similarity range, for example, the row for the 60%-70% similarity range.
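  • The following sketch shows one possible way to assemble such a feature label matrix from the matches obtained in step S120; the 10% similarity bins are an assumption derived from the 60%-70% example above:

```python
def build_feature_label_matrix(matches,
                               bins=((0.9, 1.0), (0.8, 0.9), (0.7, 0.8),
                                     (0.6, 0.7), (0.5, 0.6))):
    # matches: {feature_name: [(emotion_label, similarity), ...]} from step S120
    features = list(matches)
    matrix = [features]                 # first row: one column per audio feature
    for low, high in bins:              # one row per descending similarity range
        row = []
        for name in features:
            row.append([label for label, sim in matches[name] if low < sim <= high])
        matrix.append(row)
    return matrix
```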
  • In step S140, the feature label matrix is input into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set.
  • In this example embodiment, the multi-emotion recognition model is a pre-trained machine learning model that can recognize multiple emotions at once.
  • The feature label matrix is input into the multi-emotion recognition model; based on the structured label matrix and the constraints formed by the multiple types of audio features, the machine learning model can readily compute the possible emotions of the user's voice, obtain multiple emotion combinations, and predict multiple emotion sets of the user's voice together with a scene label for the possible scene of each emotion set (such as a shopping scene or a chat scene). In this way, multiple possible scenes and the corresponding multiple emotions can be analyzed efficiently and accurately based on the feature label matrix through the multi-emotion recognition model.
  • In one implementation of this example, the method for constructing the multi-emotion recognition model includes:
  • using a multi-layer fully connected layer as the classifier of a pre-training model to obtain a recognition model, and training the recognition model with a labeled speech emotion data set to obtain the multi-emotion recognition model.
  • The restnet34 model is first trained using the AISHELL Chinese voiceprint database.
  • The first n layers of the trained network are then taken out as the pre-training model.
  • A multi-layer fully connected layer is used as the classifier.
  • Finally, the labeled speech emotion data set is used to train the model to obtain the final model.
  • During training, the ratio of positive and negative samples can be calculated in each training batch and used as the weighting matrix of the loss function, so that the model pays more attention to small-sample data, improving its accuracy.
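  • A hedged PyTorch sketch of this construction is given below; the use of torchvision's resnet34 as a stand-in for the patent's restnet34 voiceprint model, the layer split, the classifier sizes, and the encoding of the feature label matrix as a 3-channel image-like tensor are all assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

def build_recognition_model(num_emotions, n_pretrained_children=6):
    # Backbone assumed to carry AISHELL-trained voiceprint weights (loading not shown)
    backbone = resnet34(weights=None)
    pretrain = nn.Sequential(*list(backbone.children())[:n_pretrained_children])
    classifier = nn.Sequential(          # multi-layer fully connected classifier
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(128, 256), nn.ReLU(),  # 128 channels after layer2 of resnet34
        nn.Linear(256, num_emotions),
    )
    return nn.Sequential(pretrain, classifier)

def weighted_bce_loss(logits, targets):
    # Per-batch positive/negative ratio used as the loss weighting so that
    # under-represented (small-sample) labels receive more attention
    pos = targets.sum(dim=0).clamp(min=1.0)
    neg = (targets.shape[0] - pos).clamp(min=1.0)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=neg / pos)
```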
  • The first multi-emotion recognition model and the second multi-emotion recognition model are initialized at the same time; the first multi-emotion recognition model is trained using labeled and unlabeled raw data to obtain a first prediction value, and the classification error loss value of the labeled data part is obtained.
  • The first multi-emotion recognition model is updated using the sum of the classification error loss value and the consistency loss value.
  • The original model can be improved by means of the semi-supervised learning method Mean-Teacher, so that a large amount of unlabeled data can be reused.
  • The moving average can make the model more robust on the test data.
  • The noise-added data is input into Model_teacher to obtain the predicted value P_teacher; the error between P_teacher and P_student is calculated as the consistency loss value loss_consistency, and the first multi-emotion recognition model Model_student is updated with the loss value loss_classification + loss_consistency.
  • In this way, transfer learning and semi-supervised learning techniques can be used to effectively improve the classification performance of the model on small data sets, and also to alleviate the model overfitting problem to a certain extent.
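  • A hedged sketch of the Mean-Teacher style update described above is shown below; the Gaussian noise model, the loss weighting, and the EMA decay are assumptions, and the teacher is taken to be a structurally identical copy of the student whose weights track a moving average of the student:

```python
import torch
import torch.nn.functional as F

def mean_teacher_step(student, teacher, optimizer, x, labels, labeled_mask,
                      ema_decay=0.99, noise_std=0.1):
    p_student = student(x)                                         # predictions on raw data
    with torch.no_grad():
        p_teacher = teacher(x + noise_std * torch.randn_like(x))   # noise-added data

    # Classification error loss on the labeled part of the batch only
    loss_classification = F.binary_cross_entropy_with_logits(
        p_student[labeled_mask], labels[labeled_mask])
    # Consistency loss between student and teacher predictions on all data
    loss_consistency = F.mse_loss(torch.sigmoid(p_student), torch.sigmoid(p_teacher))

    loss = loss_classification + loss_consistency                  # summed loss updates the student
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher weights follow an exponential moving average of the student weights
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(ema_decay).add_(s_p, alpha=1 - ema_decay)
    return float(loss)
```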
  • In this way, the scheme can not only accurately detect the emotions explicitly expressed in the voice, but also accurately identify a variety of potential emotions, improving and expanding voice emotion recognition technology.
  • In step S150, the scene label matched by the voice scene of the user's voice is obtained, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
  • The scene of the user's voice can be determined by pre-calibration or by locating the voice source (such as a customer service voice).
  • The emotion set corresponding to the scene label matched by the scene of the user's voice is determined as the recognized speech emotion of the user, which ensures the accuracy of the recognition boundary and can further ensure the accuracy of the emotion recognition of the user's voice.
  • In this way, the speech emotion recognition result of the voice is obtained according to the matching of the real scene of the voice.
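  • Step S150 then reduces to selecting the emotion set whose scene label matches the calibrated scene of the voice; a trivial sketch follows (the scene and emotion names are illustrative only, not from the patent):

```python
def recognize_emotions(emotion_sets_by_scene, voice_scene_tag):
    # emotion_sets_by_scene: {scene_label: emotion_set} predicted by the model in step S140
    return emotion_sets_by_scene.get(voice_scene_tag, set())

# Example: a customer-service call pre-calibrated as a "shopping" scene
result = recognize_emotions({"shopping": {"impatience", "anger"},
                             "chat": {"calm"}}, "shopping")
```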
  • Further, the present application also provides a voice emotion recognition apparatus.
  • Referring to Fig. 4, the voice emotion recognition apparatus may include an extraction module 410, a matching module 420, a construction module 430, a prediction module 440, and a determination module 450, in which:
  • the extraction module 410 may be used to extract multiple types of audio feature vectors of the user voice when the user voice is received;
  • the matching module 420 may be configured to respectively match the audio feature vector with feature vector samples in the emotion feature library to obtain an emotion label corresponding to each feature vector sample that matches the audio feature vector;
  • the construction module 430 may be configured to construct a vector label matrix of the user voice based on the audio feature vector and the corresponding emotion label of the matched feature vector sample;
  • the prediction module 440 may be used to input the vector label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set;
  • the determining module 450 may be used to obtain a scene tag matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
  • Although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
  • The example embodiments described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which can be a personal computer, a server, a mobile terminal, a network device, etc.) to execute the method according to the embodiments of the present application.
  • an electronic device capable of implementing the above method is also provided.
  • the electronic device 500 according to this embodiment of the present invention will be described below with reference to FIG. 5.
  • the electronic device 500 shown in FIG. 5 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.
  • the electronic device 500 is represented in the form of a general-purpose computing device.
  • the components of the electronic device 500 may include, but are not limited to: the aforementioned at least one processing unit 510, the aforementioned at least one storage unit 520, and a bus 530 connecting different system components (including the storage unit 520 and the processing unit 510).
  • The storage unit stores program code, and the program code can be executed by the processing unit 510, so that the processing unit 510 executes the steps according to the various exemplary embodiments described in the "Exemplary Method" section of this specification.
  • For example, the processing unit 510 may perform the steps shown in Fig. 1:
  • Step S110, when a user voice is received, extract multiple types of audio features of the user voice;
  • Step S120, respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each of the audio features;
  • Step S130, based on the audio features and the emotion labels corresponding to the matched feature samples, construct a feature label matrix of the user voice;
  • Step S140, input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
  • Step S150, obtain the scene label matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
  • the storage unit 520 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 5201 and/or a cache storage unit 5202, and may further include a read-only storage unit (ROM) 5203.
  • The storage unit 520 may also include a program/utility tool 5204 having a set (at least one) of program modules 5205.
  • Such program modules 5205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment.
  • The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • The electronic device 500 can also communicate with one or more external devices 700 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any device (such as a router, modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may be performed through an input/output (I/O) interface 550, and a display unit 540 may also be connected to the input/output (I/O) interface 550.
  • the electronic device 500 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 560.
  • the network adapter 560 communicates with other modules of the electronic device 500 through the bus 530.
  • Other hardware and/or software modules can be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
  • The example embodiments described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which can be a personal computer, a server, a terminal device, a network device, etc.) to execute the method according to the embodiments of the present application.
  • a computer-readable storage medium is also provided.
  • The computer-readable storage medium may be non-volatile or volatile, and has stored thereon a program product capable of implementing the above-mentioned methods of this specification.
  • In some possible implementations, various aspects of the present invention may also be implemented in the form of a program product, which includes program code.
  • When the program product runs on a terminal device, the program code is used to cause the terminal device to execute the steps according to the various exemplary embodiments of the present invention described in the above "Exemplary Method" section of this specification.
  • Referring to Fig. 6, a program product 600 for implementing the above method according to an embodiment of the present invention is described; it may adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device, for example a personal computer.
  • the program product of the present invention is not limited to this.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or combined with an instruction execution system, device, or device.
  • the program product can use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, in which readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
  • the program code used to perform the operations of the present invention can be written in any combination of one or more programming languages.
  • The programming languages include object-oriented programming languages, such as Java, C++, and the like, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • The program code can be executed entirely on the client computing device, partly on the client device, as an independent software package, partly on the client computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • The remote computing device can be connected to the client computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present application relates to the technical field of artificial intelligence and relates to a method and apparatus for recognizing the emotion in a voice, an electronic device and a computer-readable storage medium. The method comprises: when a user voice is received, extracting multiple types of audio features of the user voice; matching the audio features with feature samples in an emotion feature library, and obtaining an emotion tag corresponding to a feature sample that matches each audio feature; constructing a feature tag matrix of the user voice on the basis of the audio features and the emotion tags corresponding to the matched feature samples; inputting the feature tag matrix into a multi-emotion recognition model, and obtaining a plurality of emotion sets and scene tags corresponding to the emotion sets; and acquiring a scene tag that matches the voice scene of the user voice so as to determine the emotion set corresponding to the matched scene tag as a recognized emotion in the user voice. According to the present application, various potential emotions can be efficiently and accurately recognized from a voice.

Description

Voice emotion recognition method, apparatus, electronic device and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 3, 2020, with application number 202010138561.3 and the invention title "Voice Emotion Recognition Method, Apparatus, Medium and Electronic Equipment", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a voice emotion recognition method, apparatus, electronic device, and computer-readable storage medium.
Background
Emotional computing is an important technology that gives intelligent machines the ability to perceive, understand, and express various emotional states, and voice technology, as an important carrier of emotional information, has received increasing attention. Although current voice emotion detection achieves good results, the inventor realized that, limited by problems such as data set quality and the subjectivity of emotion annotation, most models can only judge a single emotion, the number of emotion categories that can be judged is small, and the hidden emotions in complex speech cannot be accurately described; for the multiple emotions that may be contained in a segment of speech, the boundaries are also difficult to determine. These problems greatly limit the promotion and development of speech emotion recognition technology.
Summary of the Invention
In order to solve the above technical problems, an object of the present application is to provide a voice emotion recognition method, apparatus, electronic device, and computer-readable storage medium.
The technical solutions adopted in this application are as follows:
In a first aspect, a voice emotion recognition method includes: when a user voice is received, extracting multiple types of audio features of the user voice; respectively matching the audio features with feature samples in an emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; constructing a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples; inputting the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; and obtaining the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
In a second aspect, a voice emotion recognition apparatus includes: an extraction module for extracting multiple types of audio features of a user voice when the user voice is received; a matching module for respectively matching the audio features with feature samples in an emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; a construction module for constructing a feature label matrix of the user voice based on the audio features and the emotion labels corresponding to the matched feature samples; a prediction module for inputting the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set; and a determination module for obtaining the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
In a third aspect, an electronic device includes: a processor; and a memory for storing computer program instructions for the processor; wherein the processor is configured to perform the above method by executing the computer program instructions.
In a fourth aspect, a computer-readable storage medium has computer program instructions stored thereon; when the computer program instructions are executed by a processor, the above method is implemented.
In the above technical solution, first, when the user voice is received, multiple types of audio features of the user voice are extracted; the multiple types of audio features obtained in this way can reflect the change characteristics of the user's voice from different perspectives, that is, they can characterize the user's emotions from different angles. Then, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; in this way, the candidate emotions expressed by each feature can be obtained, which in turn guides the identification of the user's various hidden emotions in the subsequent steps. Then, a feature label matrix of the user voice is constructed based on the audio features and the emotion labels corresponding to the matched feature samples; the different types of audio features and the emotion labels of different likelihoods embodied by the feature samples of the corresponding similarities are structurally linked through the feature label matrix, which can reflect possible patterns of emotional change. Furthermore, the feature label matrix is input into the multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each emotion set; through the multi-emotion recognition model, multiple possible scenes and the corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix. Finally, the scene label matched by the voice scene of the user voice is obtained, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user; in this way, the speech emotion recognition result is obtained according to the matching of the real scene of the voice. In this manner, various potential emotions can be recognized efficiently and accurately from speech.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the application.
Description of the Drawings
Fig. 1 schematically shows a flow chart of a voice emotion recognition method.
Fig. 2 schematically shows an example diagram of an application scenario of a voice emotion recognition method.
Fig. 3 schematically shows a flow chart of a feature extraction method.
Fig. 4 schematically shows a block diagram of a voice emotion recognition apparatus.
Fig. 5 schematically shows an example block diagram of an electronic device for implementing the above voice emotion recognition method.
Fig. 6 schematically shows a computer-readable storage medium for implementing the above voice emotion recognition method.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this application will be more comprehensive and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in one or more embodiments in any suitable manner. In the following description, many specific details are provided to give a thorough understanding of the embodiments of the present application. However, those skilled in the art will appreciate that the technical solutions of the present application can be practiced without one or more of the specific details, or other methods, components, devices, steps, and the like can be used. In other cases, well-known technical solutions are not shown or described in detail so as not to obscure aspects of the present application.
In addition, the drawings are only schematic illustrations of the application and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, so their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities; these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
This exemplary embodiment first provides a voice emotion recognition method. The voice emotion recognition method can be run on a server, a server cluster, a cloud server, or the like; of course, those skilled in the art can also run the method of the present invention on other platforms according to their needs, which is not specifically limited in this exemplary embodiment. Referring to Fig. 1, the voice emotion recognition method may include the following steps:
Step S110, when a user voice is received, extract multiple types of audio features of the user voice;
Step S120, respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature;
Step S130, based on the audio features and the emotion labels corresponding to the matched feature samples, construct a feature label matrix of the user voice;
Step S140, input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and a scene label corresponding to each emotion set;
Step S150, obtain the scene label matched by the voice scene of the user voice, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user.
In the above voice emotion recognition method, first, when a user voice is received, multiple types of audio features of the user voice are extracted; the multiple types of audio features obtained in this way can reflect the change characteristics of the user's voice from different perspectives, that is, they can characterize the user's emotions from different angles. Then, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each audio feature; in this way, the candidate emotions expressed by each feature can be obtained, which in turn guides the identification of the user's various hidden emotions in the subsequent steps. Then, a feature label matrix of the user voice is constructed based on the audio features and the emotion labels corresponding to the matched feature samples; the different types of audio features and the emotion labels of different likelihoods embodied by the feature samples of the corresponding similarities are structurally linked through the feature label matrix, which can reflect possible patterns of emotional change. Furthermore, the feature label matrix is input into the multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each emotion set; through the multi-emotion recognition model, multiple possible scenes and the corresponding emotions can be analyzed efficiently and accurately based on the feature label matrix. Finally, the scene label matched by the voice scene of the user voice is obtained, so as to determine the emotion set corresponding to the matched scene label as the recognized speech emotion of the user; in this way, the speech emotion recognition result is obtained according to the matching of the real scene of the voice. In this manner, various potential emotions can be recognized efficiently and accurately from speech.
Hereinafter, each step of the above voice emotion recognition method in this exemplary embodiment will be explained and described in detail with reference to the accompanying drawings.
In step S110, when a user voice is received, multiple types of audio features of the user voice are extracted.
In this example embodiment, referring to Fig. 2, the server 201 receives the user voice sent by the server 202, and the server 201 can then extract multiple types of audio features of the user voice and perform emotion recognition in the subsequent steps. The server 201 can be any terminal capable of executing program instructions and storing data, such as a cloud server, a mobile phone, or a computer; the server 202 can be any terminal with a storage function, such as a mobile phone or a computer.
The audio features can be various audio features such as a zero-crossing rate feature, a short-term energy feature, a short-term average amplitude difference feature, a pronunciation frame number feature, a pitch frequency feature, a formant feature, a harmonic-to-noise ratio feature, and a Mel cepstrum coefficient feature. These features can be extracted from a piece of audio using existing audio feature extraction methods. The extracted multiple types of audio features of the user's voice can reflect the change characteristics of the user's voice from different angles, that is, they can characterize the user's emotions from different angles. For example, short-term energy reflects the strength of the signal at different moments and can therefore reflect how the stability of the user's emotion changes over a segment of speech. Audio has periodic characteristics, and under stationary noise the short-term average amplitude difference makes this periodicity easier to observe, so it can reflect the periodicity of the user's emotion within a segment of speech. A formant is the resonance characteristic produced when quasi-periodic pulses at the glottis excite the vocal tract, producing a set of resonance frequencies called formant frequencies, or formants for short. Formant parameters, which include the formant frequency and the bandwidth, are important parameters for distinguishing different finals and can characterize the user's emotions from a linguistic perspective.
In this way, by extracting multiple types of audio features of the user's voice, the user's emotions can be analyzed based on these features in the subsequent steps.
In one implementation of this example, referring to Fig. 3, when a user voice is received, extracting the multiple types of audio features of the user voice includes:
Step S310, when the user voice is received, converting the user voice into text;
Step S320, matching the text with the text samples in a feature extraction category database to obtain a text sample matching the text;
Step S330, extracting, from the user voice, audio features of multiple feature categories associated with the text sample.
When the user's voice is received, it is converted into text, so the actual content expressed by the user can be obtained. The converted text is then matched against the text samples in the feature extraction category database to obtain the text sample that matches the converted text. The feature extraction category database stores, for texts with different semantics, the feature categories of the audio features that most clearly reflect emotion when that text is spoken. By then extracting, from the user's voice, the audio features of the feature categories associated with the matched text sample, emotion recognition can be performed efficiently and accurately in the subsequent steps.
In one embodiment, the multiple types of audio features include at least three of: the zero-crossing rate feature, the short-term energy feature, the short-term average amplitude difference feature, the pronunciation frame number feature, the pitch frequency feature, the formant feature, the harmonic-to-noise ratio feature, and the Mel cepstrum coefficient feature.
Because the multiple types of audio features include at least three of these features, multi-emotion recognition can be achieved with relatively high accuracy.
在步骤S120中，分别将所述音频特征与情绪特征库中的特征样本进行匹配，得到与每个所述音频特征匹配的特征样本相应的情绪标签。In step S120, the audio features are respectively matched with the feature samples in the emotion feature library to obtain the emotion label corresponding to each feature sample that matches each of the audio features.
在本示例的实施方式中，情绪特征库中保存了各个类别的音频特征的特征样本，每个特征样本关联于一个类别的情绪标签。将音频特征与情绪特征库中的特征样本进行匹配，可以通过欧氏距离或者汉明距离等计算音频特征与特征样本的相似度，进而得到与每个音频特征匹配的多个特征样本(如相似度大于50%的特征样本)相应的情绪标签，这样可以获取到每个音频特征所表现的多个具有嫌疑的情绪，进而可以指导后续步骤中识别到用户潜在的各种隐藏的情绪。In this exemplary embodiment, the emotion feature library stores feature samples for audio features of each category, and each feature sample is associated with an emotion label of one category. When the audio features are matched with the feature samples in the emotion feature library, the similarity between an audio feature and a feature sample can be computed with, for example, the Euclidean distance or the Hamming distance, so as to obtain the emotion labels corresponding to the multiple feature samples that match each audio feature (for example, feature samples with a similarity greater than 50%). In this way, the several candidate emotions expressed by each audio feature are obtained, which guides the subsequent steps in identifying the various hidden emotions the user may have.
在本示例的一种实施方式中,所述分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签,包括:In an implementation manner of this example, the respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples matching each of the audio features includes:
分别将所述音频特征与情绪特征库中的特征样本进行对比,得到与每个所述音频特征相似度超过预定阈值的多个特征样本,所述预定阈值与所述音频特征的个数对应;Respectively comparing the audio feature with the feature samples in the emotional feature library to obtain a plurality of feature samples whose similarity to each of the audio features exceeds a predetermined threshold, where the predetermined threshold corresponds to the number of the audio features;
从所述情绪特征库中获取每个所述特征样本对应的情绪标签。Obtain the emotion label corresponding to each of the feature samples from the emotion feature library.
预定阈值可以根据精确度需求进行设定，预定阈值与音频特征的个数对应，即预定阈值的取值通过音频特征的个数决定，可以是音频特征的个数越多，预定阈值的取值越小。这样通过分别将音频特征与情绪特征库中的特征样本进行对比，得到与每个音频特征相似度超过预定阈值的多个特征样本，然后，从情绪特征库中获取每个特征样本对应的情绪标签，可以保证每个音频特征的情绪识别的可靠性。The predetermined threshold can be set according to the required accuracy. The predetermined threshold corresponds to the number of audio features, that is, its value is determined by the number of audio features; for example, the more audio features there are, the smaller the predetermined threshold may be. By comparing the audio features with the feature samples in the emotion feature library respectively, multiple feature samples whose similarity to each audio feature exceeds the predetermined threshold are obtained, and the emotion label corresponding to each of these feature samples is then obtained from the emotion feature library, which ensures the reliability of the emotion recognition for each audio feature.
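The matching described above can be sketched as follows. The distance-based similarity score and the threshold schedule are illustrative assumptions; the patent only requires the threshold to be tied to the number of extracted audio features.

```python
# Sketch of step S120: match each extracted audio feature against the emotion
# feature library and keep samples whose similarity exceeds the threshold.
import numpy as np


def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Map the Euclidean distance between two vectors into a (0, 1] score."""
    return 1.0 / (1.0 + float(np.linalg.norm(a - b)))


def match_emotion_labels(audio_features: dict, feature_library: dict) -> dict:
    """audio_features: {category: vector}; feature_library:
    {category: [(sample_vector, emotion_label), ...]}. Returns, per category,
    the (label, similarity) pairs above a threshold that shrinks as the
    number of extracted features grows (an assumed schedule)."""
    threshold = max(0.5, 0.9 - 0.05 * len(audio_features))
    matches = {}
    for category, vector in audio_features.items():
        hits = []
        for sample_vector, label in feature_library.get(category, []):
            sim = similarity(vector, sample_vector)
            if sim > threshold:
                hits.append((label, sim))
        matches[category] = sorted(hits, key=lambda x: -x[1])
    return matches
```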
在步骤S130中,基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵。In step S130, a feature tag matrix of the user voice is constructed based on the audio feature and the emotion tag corresponding to the matched feature sample.
在本示例的实施方式中，特征标签矩阵存储用户语音的音频特征及相应的情绪标签，既可以反映用户语音的音频特征，也可以反映这些特征所体现的可能情绪。将不同类别的音频特征及相应的各个相似度的特征样本所体现的不同可能性的情绪标签，通过特征标签矩阵结构化地联系起来，通过情绪标签形成音频特征组合的约束，从而可以反映出可能的潜在情绪变化规律。In this exemplary embodiment, the feature label matrix stores the audio features of the user's voice and the corresponding emotion labels, so it reflects both the audio features of the user's voice and the possible emotions these features convey. The emotion labels of different possibilities, embodied by the different categories of audio features and by the feature samples at their respective similarities, are linked together in a structured way through the feature label matrix; the emotion labels thereby constrain the combinations of audio features, so that potential patterns of hidden emotion change can be reflected.
在本示例的一种实施方式中,所述基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵,包括:In an implementation manner of this example, the constructing the feature tag matrix of the user voice based on the audio feature and the corresponding emotion tag of the matched feature sample includes:
将所述音频特征添加到矩阵的第一行;Adding the audio feature to the first row of the matrix;
将每个所述音频特征相应的所述情绪标签，按照每个所述特征样本与所述音频特征的相似度由高到低的顺序，添加到每个所述音频特征对应的列得到所述特征标签矩阵，其中，所述矩阵每行对应于一个相似度范围。The emotion label corresponding to each of the audio features is added, in descending order of the similarity between each feature sample and the audio feature, to the column corresponding to that audio feature to obtain the feature label matrix, wherein each row of the matrix corresponds to a similarity range.
各个音频特征添加到空矩阵的第一行，然后，每一列对应于一个音频特征。每个音频特征相应的情绪标签，按照每个特征样本与音频特征的相似度由高到低的顺序，添加到每个音频特征对应的列，得到特征标签矩阵。例如，A音频特征与A1特征样本相似度为63%，则可以将A1特征样本对应的情绪标签添加到A音频特征所在列的60%-70%区间的行。矩阵每行对应于一个相似度范围，例如，相似度范围为60%-70%区间的行。Each audio feature is added to the first row of an empty matrix, so that each column corresponds to one audio feature. The emotion label corresponding to each audio feature is then added to the column of that audio feature, in descending order of the similarity between each feature sample and the audio feature, to obtain the feature label matrix. For example, if the similarity between audio feature A and feature sample A1 is 63%, the emotion label corresponding to feature sample A1 can be added to the row for the 60%-70% interval in the column of audio feature A. Each row of the matrix corresponds to a similarity range, for example, the row for the 60%-70% similarity range.
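A sketch of building the feature label matrix is shown below. The 10% band width of each row is an assumption taken from the 60%-70% example above, and the `matches` input is the per-category label/similarity list produced by the matching sketch for step S120.

```python
# Sketch of step S130: first row holds the audio feature categories, each later
# row is a 10% similarity band in descending order, and the matched emotion
# labels fall into the cell of their band.
def build_feature_label_matrix(matches: dict) -> list:
    categories = list(matches)
    matrix = [categories]                      # first row: the audio features
    bands = [(hi / 100, (hi + 10) / 100) for hi in range(90, 40, -10)]
    for low, high in bands:                    # 0.9-1.0, 0.8-0.9, ..., 0.5-0.6
        row = []
        for cat in categories:
            row.append([label for label, sim in matches.get(cat, [])
                        if low <= sim < high or (high == 1.0 and sim == 1.0)])
        matrix.append(row)
    return matrix
```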
在步骤S140中，将所述特征标签矩阵输入多情绪识别模型，得到多个情绪集及每个所述情绪集对应的场景标签。In step S140, the feature label matrix is input into the multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each of the emotion sets.
在本示例的实施方式中，多情绪识别模型是预先训练好的、可以一次性识别出多种情绪的机器学习模型。将特征标签矩阵输入多情绪识别模型，可以基于结构化标签矩阵对多类音频特征的约束，使机器学习模型容易地计算得到用户语音可能的情绪，得到多个情绪组合，预测出用户语音的多个情绪集，及每个情绪集可能的场景(如购物场景、聊天场景)的场景标签。这样可以通过多情绪识别模型，高效准确地基于特征标签矩阵分析出多个可能的场景及对应的多个情绪。In this exemplary embodiment, the multi-emotion recognition model is a pre-trained machine learning model that can recognize multiple emotions at once. When the feature label matrix is input into the multi-emotion recognition model, the constraints that the structured label matrix imposes on the multiple categories of audio features allow the machine learning model to easily compute the possible emotions of the user's voice, obtain multiple emotion combinations, and predict multiple emotion sets of the user's voice together with the scene label of the possible scene for each emotion set (such as a shopping scene or a chat scene). In this way, multiple possible scenes and the corresponding multiple emotions can be analyzed efficiently and accurately from the feature label matrix by the multi-emotion recognition model.
在本示例的一种实施方式中,所述多情绪识别模型的构建方法,包括:In an implementation of this example, the method for constructing the multi-emotion recognition model includes:
利用AISHELL中文声纹数据库训练resnet34模型，训练结束后取出前n层网络作为预训练模型；Use the AISHELL Chinese voiceprint database to train a resnet34 model, and after training take the first n layers of the network as a pre-trained model;
为所述预训练模型接入多层全连接层作为分类器,得到识别模型,以使用标注好的语音情绪数据集对所述识别模型进行训练得到多情绪识别模型。A multi-layer fully connected layer is used as a classifier for the pre-training model to obtain a recognition model, and a multi-emotion recognition model is obtained by training the recognition model using the labeled speech emotion data set.
首先利用AISHELL中文声纹数据库训练resnet34模型，训练结束后取出前n层网络作为预训练模型，在这之后接入多层全连接层作为分类器，最后再使用标注好的语音情绪数据集对该模型进行训练得到最终模型。针对其中遇到的正负样本不均衡问题，可以在每个训练批次中计算正负样本比例作为损失函数的加权矩阵，使其更加关注少样本数据，提高模型的准确度。First, a resnet34 model is trained on the AISHELL Chinese voiceprint database; after training, the first n layers of the network are taken as a pre-trained model, a multi-layer fully connected layer is then attached as the classifier, and finally the labeled speech emotion data set is used to train this model to obtain the final model. To deal with the imbalance between positive and negative samples encountered during training, the ratio of positive to negative samples can be computed in each training batch and used as a weighting matrix for the loss function, so that the model pays more attention to classes with few samples and its accuracy is improved.
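The model construction can be sketched as follows in PyTorch. The ResNet-34 trunk stands in for the network pre-trained on the AISHELL voiceprint data (that pre-training is not reproduced here), and the layer count, feature sizes and the use of a multi-label BCE loss with per-batch class weighting are illustrative assumptions.

```python
# Hedged PyTorch sketch of the multi-emotion recognition model of this
# embodiment: truncated ResNet-34 trunk + multi-layer fully connected
# classifier, with a per-batch positive/negative weighting of the loss.
import torch
import torch.nn as nn
from torchvision.models import resnet34


class MultiEmotionNet(nn.Module):
    def __init__(self, n_emotions: int, n_pretrained_layers: int = 7):
        super().__init__()
        backbone = resnet34(weights=None)        # load AISHELL-pretrained weights here
        backbone.conv1 = nn.Conv2d(1, 64, 7, 2, 3, bias=False)  # 1-channel spectrograms
        # "前n层网络作为预训练模型": keep the first n children as the trunk.
        self.trunk = nn.Sequential(*list(backbone.children())[:n_pretrained_layers])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(         # multi-layer fully connected classifier
            nn.Flatten(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_emotions),
        )

    def forward(self, x):
        return self.classifier(self.pool(self.trunk(x)))


def weighted_bce_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """targets: multi-hot float labels. The per-batch positive/negative ratio
    is used as the loss weighting so rare emotions get more attention."""
    pos = targets.sum(dim=0).clamp(min=1.0)
    neg = (targets.size(0) - pos).clamp(min=1.0)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=neg / pos)
```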
在本示例的一种实施方式中，同时初始化第一多情绪识别模型和第二多情绪识别模型，并使用有标签混合无标签的原始数据在所述第一多情绪识别模型上进行训练得到第一预测值，并得到有标签数据部分的分类误差损失值；In an implementation of this example, a first multi-emotion recognition model and a second multi-emotion recognition model are initialized at the same time, original data in which labeled samples are mixed with unlabeled samples is used to train the first multi-emotion recognition model to obtain a first predicted value, and the classification error loss value of the labeled data part is obtained;
利用指数滑动平均更新所述第二多情绪识别模型,并将加上噪声的数据输入更新后的所述第二多情绪识别模型训练得到第二预测值;Updating the second multi-emotion recognition model by using an exponential moving average, and inputting noise-added data into the updated second multi-emotion recognition model to obtain a second predicted value;
计算所述第一预测值和所述第二预测值之间的误差作为一致性损失值;Calculating an error between the first predicted value and the second predicted value as a consistency loss value;
利用所述分类误差损失值与所述一致性损失值之和更新所述第一多情绪识别模型。The first multiple emotion recognition model is updated by using the sum of the classification error loss value and the consistency loss value.
可以使用半监督学习Mean-Teacher的方式改进原始模型，可以重复利用大量无标签数据。同时初始化两个模型：第一多情绪识别模型Model_student和第二多情绪识别模型Model_teacher，使用有标签混合无标签的原始数据在Model_student上进行训练得到各情绪概率值P_student，同时得到有标签数据部分的分类误差损失值loss_classification；然后，利用指数滑动平均来更新Model_teacher，滑动平均可以使模型在测试数据上更健壮。然后，将加上噪声的数据输入Model_teacher训练得到预测值P_teacher，计算P_teacher和P_student之间的误差作为一致性损失值loss_consistency，利用loss_classification+loss_consistency的损失值更新第一多情绪识别模型Model_student。The original model can be improved with the semi-supervised Mean-Teacher approach, which makes it possible to reuse a large amount of unlabeled data. Two models are initialized at the same time: the first multi-emotion recognition model Model_student and the second multi-emotion recognition model Model_teacher. The original data mixing labeled and unlabeled samples is used to train Model_student to obtain the probability value of each emotion P_student, and the classification error loss value loss_classification of the labeled data part is obtained at the same time. The exponential moving average is then used to update Model_teacher; the moving average makes the model more robust on test data. Next, the noise-added data is input into Model_teacher to obtain the predicted value P_teacher, the error between P_teacher and P_student is computed as the consistency loss value loss_consistency, and the loss value loss_classification + loss_consistency is used to update the first multi-emotion recognition model Model_student.
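A minimal Mean-Teacher training step, building on the MultiEmotionNet sketch above, might look as follows; the EMA decay, the noise level and the equal weighting of the two losses are illustrative values that the embodiment does not fix.

```python
# Minimal Mean-Teacher training step for the student/teacher pair.
import copy
import torch
import torch.nn.functional as F

student = MultiEmotionNet(n_emotions=8)              # n_emotions is an assumption
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
EMA_DECAY = 0.99


def train_step(x, y, labeled_mask):
    """x: spectrogram batch; y: multi-hot float labels (zeros for unlabeled
    samples); labeled_mask: boolean tensor marking which samples carry labels."""
    optimizer.zero_grad()

    p_student = student(x)                                        # first predicted value
    loss_classification = weighted_bce_loss(p_student[labeled_mask], y[labeled_mask])

    with torch.no_grad():
        # Exponential moving average update of the teacher weights.
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(EMA_DECAY).add_(s, alpha=1 - EMA_DECAY)
        # Second predicted value on the noise-added input.
        p_teacher = teacher(x + 0.05 * torch.randn_like(x))

    # Consistency loss between the two predictions, over all samples.
    loss_consistency = F.mse_loss(torch.sigmoid(p_student), torch.sigmoid(p_teacher))

    (loss_classification + loss_consistency).backward()
    optimizer.step()
```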
结合上述两个实施例构建多情绪识别模型，可以利用迁移学习和半监督学习技术有效地改善在少量数据集下模型的分类效果，也在一定程度上缓解模型过拟合问题。经过测试，该方案不仅可以准确地检测出语音中的显示情绪，也能准确识别出多种潜在情绪，改善和拓展了语音情绪识别技术。By combining the above two embodiments to construct the multi-emotion recognition model, transfer learning and semi-supervised learning can be used to effectively improve the classification performance of the model on small data sets and to alleviate model overfitting to a certain extent. Tests show that this scheme can not only accurately detect the overtly displayed emotions in speech but also accurately identify a variety of hidden emotions, improving and extending speech emotion recognition technology.
在步骤S150中,获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。In step S150, a scene tag matched by the voice scene of the user's voice is acquired, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
在本示例的实施方式中,通过事先标定或者定位语音来源(如客服语音)可以确定用户语音的场景。将用户语音的场景匹配的场景标签对应的情绪集确定为识别出的用户语音情绪,保证识别边界准确性,可以进一步保证用户语音的情绪识别准确性。根据语音真实场景的匹配,获取到语音的语音情绪识别结果。In the implementation of this example, the scene of the user's voice can be determined by pre-calibrating or locating the voice source (such as customer service voice). The emotion set corresponding to the scene tag matched by the scene of the user's voice is determined as the recognized user's voice emotion, so as to ensure the accuracy of the recognition boundary, which can further ensure the accuracy of the emotion recognition of the user's voice. According to the matching of the real scene of the voice, the voice emotion recognition result of the voice is obtained.
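As a simple illustration of step S150, assuming the model output is represented as a list of (scene tag, emotion set) candidates and that the real scene of the utterance is known from prior calibration or from the call source:

```python
# Select the emotion set whose scene tag matches the real voice scene.
def select_emotions(candidates: list, actual_scene: str) -> set:
    for scene_tag, emotions in candidates:
        if scene_tag == actual_scene:
            return emotions
    return set()          # no matching scene: no emotions are reported


# Example: a customer-service call known to be a shopping scene.
candidates = [("shopping", {"anxious", "impatient"}), ("chat", {"calm"})]
print(select_emotions(candidates, "shopping"))   # {'anxious', 'impatient'}
```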
以这种方式可以实现高效地、准确地从语音中识别出各类潜在情绪。In this way, various potential emotions can be recognized efficiently and accurately from speech.
本申请还提供了一种语音情绪识别装置。参考图4所示，该语音情绪识别装置可以包括提取模块410、匹配模块420、构建模块430、预测模块440以及确定模块450。其中：The application also provides a voice emotion recognition device. As shown in FIG. 4, the voice emotion recognition device may include an extraction module 410, a matching module 420, a construction module 430, a prediction module 440 and a determination module 450, wherein:
提取模块410可以用于当接收到用户语音，提取所述用户语音的多类音频特征；The extraction module 410 may be configured to extract multiple types of audio features of the user's voice when the user's voice is received;
匹配模块420可以用于分别将所述音频特征与情绪特征库中的特征样本进行匹配，得到与每个所述音频特征匹配的特征样本相应的情绪标签；The matching module 420 may be configured to respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to each feature sample that matches the audio features;
构建模块430可以用于基于所述音频特征及所述匹配的特征样本相应的情绪标签，构建所述用户语音的特征标签矩阵；The construction module 430 may be configured to construct the feature label matrix of the user's voice based on the audio features and the emotion labels corresponding to the matched feature samples;
预测模块440可以用于将所述特征标签矩阵输入多情绪识别模型，得到多个情绪集及每个所述情绪集对应的场景标签；The prediction module 440 may be configured to input the feature label matrix into the multi-emotion recognition model to obtain multiple emotion sets and the scene label corresponding to each emotion set;
确定模块450可以用于获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。The determining module 450 may be used to obtain a scene tag matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
上述语音情绪识别装置中各模块的具体细节已经在对应的语音情绪识别方法中进行了详细的描述,因此此处不再赘述。The specific details of each module in the above-mentioned voice emotion recognition device have been described in detail in the corresponding voice emotion recognition method, so it will not be repeated here.
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本申请的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory. In fact, according to the embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
此外,尽管在附图中以特定顺序描述了本申请中方法的各个步骤,但是,这并非要求或者暗示必须按照该特定顺序来执行这些步骤,或是必须执行全部所示的步骤才能实现期望的结果。附加的或备选的,可以省略某些步骤,将多个步骤合并为一个步骤执行,以及/或者将一个步骤分解为多个步骤执行等。In addition, although the various steps of the method in the present application are described in a specific order in the drawings, this does not require or imply that these steps must be performed in the specific order, or that all the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution, etc.
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式 可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本申请实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、移动终端、或者网络设备等)执行根据本申请实施方式的方法。Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , Including several instructions to make a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) execute the method according to the embodiment of the present application.
在本申请的示例性实施例中,还提供了一种能够实现上述方法的电子设备。In an exemplary embodiment of the present application, an electronic device capable of implementing the above method is also provided.
所属技术领域的技术人员能够理解,本发明的各个方面可以实现为系统、方法或程序产品。因此,本发明的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。Those skilled in the art can understand that various aspects of the present invention can be implemented as a system, a method, or a program product. Therefore, various aspects of the present invention can be specifically implemented in the following forms, namely: complete hardware implementation, complete software implementation (including firmware, microcode, etc.), or a combination of hardware and software implementations, which may be collectively referred to herein as "Circuit", "Module" or "System".
下面参照图5来描述根据本发明的这种实施方式的电子设备500。图5显示的电子设备500仅仅是一个示例,不应对本发明实施例的功能和使用范围带来任何限制。The electronic device 500 according to this embodiment of the present invention will be described below with reference to FIG. 5. The electronic device 500 shown in FIG. 5 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.
如图5所示,电子设备500以通用计算设备的形式表现。电子设备500的组件可以包括但不限于:上述至少一个处理单元510、上述至少一个存储单元520、连接不同系统组件(包括存储单元520和处理单元510)的总线530。As shown in FIG. 5, the electronic device 500 is represented in the form of a general-purpose computing device. The components of the electronic device 500 may include, but are not limited to: the aforementioned at least one processing unit 510, the aforementioned at least one storage unit 520, and a bus 530 connecting different system components (including the storage unit 520 and the processing unit 510).
其中，所述存储单元存储有程序代码，所述程序代码可以被所述处理单元510执行，使得所述处理单元510执行本说明书上述"示例性方法"部分中描述的根据本发明各种示例性实施方式的步骤。例如，所述处理单元510可以执行如图1中所示的：Wherein, the storage unit stores program code, and the program code can be executed by the processing unit 510, so that the processing unit 510 executes the steps according to various exemplary embodiments of the present invention described in the "Exemplary Method" section of this specification. For example, the processing unit 510 may perform the steps shown in FIG. 1:
步骤S110,当接收到用户语音,提取所述用户语音的多类音频特征;Step S110, when a user voice is received, extract multiple types of audio features of the user voice;
步骤S120,分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签;Step S120, respectively matching the audio features with the feature samples in the emotion feature library, to obtain the emotion label corresponding to the feature sample that matches each of the audio features;
步骤S130,基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵;Step S130, based on the audio features and the emotion labels corresponding to the matched feature samples, construct a feature label matrix of the user voice;
步骤S140,将所述特征标签矩阵输入多情绪识别模型,得到多个情绪集及每个所述情绪集对应的场景标签;Step S140: Input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
步骤S150,获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。Step S150: Obtain a scene tag matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
存储单元520可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)5201和/或高速缓存存储单元5202,还可以进一步包括只读存储单元(ROM)5203。The storage unit 520 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 5201 and/or a cache storage unit 5202, and may further include a read-only storage unit (ROM) 5203.
存储单元520还可以包括具有一组(至少一个)程序模块5205的程序/实用工具5204,这样的程序模块5205包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。The storage unit 520 may also include a program/utility tool 5204 having a set (at least one) program module 5205. Such program module 5205 includes but is not limited to: an operating system, one or more application programs, other program modules, and program data, Each of these examples or some combination may include the implementation of a network environment.
总线530可以为表示几类总线结构中的一种或多种，包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
电子设备500也可以与一个或多个外部设备700(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得客户能与该电子设备500交互的设备通信,和/或与使得该电子设备500能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口550进行,还可以包括与输入/输出(I/O)接口550连接的显示单元540。并且,电子设备500还可以通过网络适配器560与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器560通过总线530与电子设备500的其它模块通信。应当明白,尽管图中未示出,可以结合电子设备500使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。The electronic device 500 can also communicate with one or more external devices 700 (such as keyboards, pointing devices, Bluetooth devices, etc.), and can also communicate with one or more devices that enable customers to interact with the electronic device 500, and/or communicate with Any device (such as a router, modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may be performed through an input/output (I/O) interface 550, and may also include a display unit 540 connected to the input/output (I/O) interface 550. In addition, the electronic device 500 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 560. As shown in the figure, the network adapter 560 communicates with other modules of the electronic device 500 through the bus 530. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives And data backup storage system, etc.
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本申请实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、终端装置、或者网络设备等)执行根据本申请实施方式的方法。Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , Including several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiment of the present application.
在本申请的示例性实施例中,参考图6所示,还提供了一种计算机可读存储介质,该计算机可读存储介质可以是非易失性,也可以是易失性,其上存储有能够实现本说明书上述方法的程序产品。在一些可能的实施方式中,本发明的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当所述程序产品在终端设备上运行时,所述程序代码用于使所述终端设备执行本说明书上述“示例性方法”部分中描述的根据本发明各种示例性实施方式的步骤。In an exemplary embodiment of the present application, as shown in FIG. 6, a computer-readable storage medium is also provided. The computer-readable storage medium may be non-volatile or volatile, and stored thereon Program products that can implement the above-mentioned methods in this specification. In some possible implementation manners, various aspects of the present invention may also be implemented in the form of a program product, which includes program code. When the program product runs on a terminal device, the program code is used to enable the The terminal device executes the steps according to various exemplary embodiments of the present invention described in the above-mentioned "Exemplary Method" section of this specification.
参考图6所示,描述了根据本发明的实施方式的用于实现上述方法的程序产品600,其可以采用便携式紧凑盘只读存储器(CD-ROM)并包括程序代码,并可以在终端设备,例如个人电脑上运行。然而,本发明的程序产品不限于此,在本文件中,可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。Referring to FIG. 6, a program product 600 for implementing the above method according to an embodiment of the present invention is described. It can adopt a portable compact disk read-only memory (CD-ROM) and include program code, and can be installed in a terminal device, For example, running on a personal computer. However, the program product of the present invention is not limited to this. In this document, the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or combined with an instruction execution system, device, or device.
所述程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。The program product can use any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable Type programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承 载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。The computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, in which readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。The program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
可以以一种或多种程序设计语言的任意组合来编写用于执行本发明操作的程序代码,所述程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在客户计算设备上执行、部分地在客户设备上执行、作为一个独立的软件包执行、部分在客户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到客户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。The program code used to perform the operations of the present invention can be written in any combination of one or more programming languages. The programming languages include object-oriented programming languages—such as Java, C++, etc., as well as conventional procedural styles. Programming language-such as "C" language or similar programming language. The program code can be executed entirely on the client computing device, partly executed on the client device, executed as an independent software package, partly executed on the client computing device and partly executed on the remote computing device, or entirely on the remote computing device or server Executed on. In the case of a remote computing device, the remote computing device can be connected to a client computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, using Internet service providers). Business to connect via the Internet).
此外,上述附图仅是根据本发明示例性实施例的方法所包括的处理的示意性说明,而不是限制目的。易于理解,上述附图所示的处理并不表明或限制这些处理的时间顺序。另外,也易于理解,这些处理可以是例如在多个模块中同步或异步执行的。In addition, the above-mentioned drawings are merely schematic illustrations of the processing included in the method according to the exemplary embodiment of the present invention, and are not intended for limitation. It is easy to understand that the processing shown in the above drawings does not indicate or limit the time sequence of these processings. In addition, it is easy to understand that these processes can be executed synchronously or asynchronously in multiple modules, for example.
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其他实施例。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本申请的真正范围和精神由权利要求指出。After considering the specification and practicing the invention disclosed herein, those skilled in the art will easily think of other embodiments of the present application. This application is intended to cover any variations, uses, or adaptive changes of this application. These variations, uses, or adaptive changes follow the general principles of this application and include common knowledge or customary technical means in the technical field that are not disclosed in this application. . The description and the embodiments are only regarded as exemplary, and the true scope and spirit of the application are pointed out by the claims.

Claims (20)

  1. 一种语音情绪识别方法,其中,包括:A voice emotion recognition method, which includes:
    当接收到用户语音,提取所述用户语音的多类音频特征;When the user voice is received, extract multiple types of audio features of the user voice;
    分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签;Respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples that match each of the audio features;
    基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵;Constructing a feature tag matrix of the user's voice based on the audio feature and the emotion tag corresponding to the matched feature sample;
    将所述特征标签矩阵输入多情绪识别模型,得到多个情绪集及每个所述情绪集对应的场景标签;Input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
    获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。Acquire the scene tag matched by the voice scene of the user's voice to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
  2. 根据权利要求1所述的方法,其中,所述当接收到用户语音,提取所述用户语音的多类音频特征,包括:The method according to claim 1, wherein, when the user voice is received, extracting multiple types of audio features of the user voice comprises:
    当接收到用户语音,将所述用户语音转化为文本;When the user voice is received, convert the user voice into text;
    将所述文本与特征提取类别数据库中的文本样本匹配,得到与所述文本匹配的文本样本;Matching the text with a text sample in a feature extraction category database to obtain a text sample matching the text;
    从所述用户语音,提取与所述文本样本关联的多个特征类别的音频特征。From the user voice, audio features of a plurality of feature categories associated with the text sample are extracted.
  3. 根据权利要求1所述的方法,其中,所述分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签,包括:The method according to claim 1, wherein the respectively matching the audio feature with the feature samples in the emotional feature library to obtain the emotional label corresponding to the feature sample matched with each of the audio features comprises:
    分别将所述音频特征与情绪特征库中的特征样本进行对比,得到与每个所述音频特征相似度超过预定阈值的多个特征样本,所述预定阈值与所述音频特征的个数对应;Respectively comparing the audio feature with the feature samples in the emotional feature library to obtain a plurality of feature samples whose similarity to each of the audio features exceeds a predetermined threshold, where the predetermined threshold corresponds to the number of the audio features;
    从所述情绪特征库中获取每个所述特征样本对应的情绪标签。Obtain the emotion label corresponding to each of the feature samples from the emotion feature library.
  4. 根据权利要求1所述的方法,其中,所述基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵,包括:The method according to claim 1, wherein the constructing a feature label matrix of the user voice based on the audio feature and the corresponding emotion label of the matched feature sample comprises:
    将所述音频特征添加到矩阵的第一行;Adding the audio feature to the first row of the matrix;
    将每个所述音频特征相应的所述情绪标签，按照每个所述特征样本与所述音频特征的相似度由高到低的顺序，添加到每个所述音频特征对应的列得到所述特征标签矩阵，其中，所述矩阵每行对应于一个相似度范围。The emotion label corresponding to each audio feature is added to the column corresponding to each audio feature in the descending order of the similarity between each feature sample and the audio feature to obtain the feature label matrix, wherein each row of the matrix corresponds to a similarity range.
  5. 根据权利要求1所述的方法,其中,所述多情绪识别模型的构建方法,包括:The method according to claim 1, wherein the method for constructing the multi-emotion recognition model comprises:
    利用AISHELL中文声纹数据库训练resnet34模型，训练结束后取出前n层网络作为预训练模型；Use the AISHELL Chinese voiceprint database to train the resnet34 model, and take out the first n-layer network as a pre-training model after the training is complete;
    为所述预训练模型接入多层全连接层作为分类器,得到识别模型,以使用标注好的语音情绪数据集对所述识别模型进行训练得到多情绪识别模型。A multi-layer fully connected layer is used as a classifier for the pre-training model to obtain a recognition model, and a multi-emotion recognition model is obtained by training the recognition model using the labeled speech emotion data set.
  6. 根据权利要求5所述的方法,其中,还包括:The method according to claim 5, further comprising:
    同时初始化第一多情绪识别模型和第二多情绪识别模型，并使用有标签混合无标签的原始数据在所述第一多情绪识别模型上进行训练得到第一预测值，并得到有标签数据部分的分类误差损失值；Initialize the first multi-emotion recognition model and the second multi-emotion recognition model at the same time, and use the labeled mixed unlabeled raw data to train on the first multi-emotion recognition model to obtain the first predicted value, and obtain the classification error loss value of the labeled data part;
    利用指数滑动平均更新所述第二多情绪识别模型,并将加上噪声的数据输入更新后的所述第二多情绪识别模型训练得到第二预测值;Updating the second multi-emotion recognition model by using an exponential moving average, and inputting noise-added data into the updated second multi-emotion recognition model to obtain a second predicted value;
    计算所述第一预测值和所述第二预测值之间的误差作为一致性损失值;Calculating an error between the first predicted value and the second predicted value as a consistency loss value;
    利用所述分类误差损失值与所述一致性损失值之和更新所述第一多情绪识别模型。The first multiple emotion recognition model is updated by using the sum of the classification error loss value and the consistency loss value.
  7. 根据权利要求1或2所述的方法，其中，所述多类音频特征至少包括过零率特征、短时能量特征、短时平均幅度差特征、发音帧数特征、基音频率特征、共振峰特征、谐波噪声比特征以及梅尔倒谱系数特征中三个。The method according to claim 1 or 2, wherein the multiple types of audio features include at least three of: the zero-crossing rate feature, short-time energy feature, short-time average magnitude difference feature, voiced frame count feature, pitch frequency feature, formant feature, harmonic-to-noise ratio feature and Mel cepstral coefficient feature.
  8. 一种语音情绪识别装置,其中,包括:A voice emotion recognition device, which includes:
    提取模块,用于当接收到用户语音,提取所述用户语音的多类音频特征;The extraction module is used to extract multiple types of audio features of the user voice when the user voice is received;
    匹配模块,用于分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签;A matching module, configured to respectively match the audio features with the feature samples in the emotion feature library to obtain the emotion label corresponding to the feature sample that matches each of the audio features;
    构建模块,用于基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵;A construction module, configured to construct a feature tag matrix of the user voice based on the audio feature and the corresponding emotion tag of the matched feature sample;
    预测模块,用于将所述特征标签矩阵输入多情绪识别模型,得到多个情绪集及每个所述情绪集对应的场景标签;A prediction module, configured to input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
    确定模块,用于获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。The determining module is configured to obtain a scene tag matched by the voice scene of the user's voice, so as to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
  9. 一种电子设备,其中,包括:处理器;以及存储器,用于存储所述处理器的计算机程序指令;其中,所述处理器配置为经由执行所述计算机程序指令来执行以下处理:An electronic device, comprising: a processor; and a memory for storing computer program instructions of the processor; wherein the processor is configured to execute the following processing by executing the computer program instructions:
    当接收到用户语音,提取所述用户语音的多类音频特征;When the user voice is received, extract multiple types of audio features of the user voice;
    分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签;Respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples that match each of the audio features;
    基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵;Constructing a feature tag matrix of the user's voice based on the audio feature and the emotion tag corresponding to the matched feature sample;
    将所述特征标签矩阵输入多情绪识别模型,得到多个情绪集及每个所述情绪集对应的场景标签;Input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
    获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。Acquire the scene tag matched by the voice scene of the user's voice to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
  10. 根据权利要求9所述的电子设备,其中,所述当接收到用户语音,提取所述用户语音的多类音频特征,包括:The electronic device according to claim 9, wherein said extracting multiple types of audio features of the user voice when the user voice is received comprises:
    当接收到用户语音,将所述用户语音转化为文本;When the user voice is received, convert the user voice into text;
    将所述文本与特征提取类别数据库中的文本样本匹配，得到与所述文本匹配的文本样本；Matching the text with a text sample in the feature extraction category database to obtain a text sample matching the text;
    从所述用户语音,提取与所述文本样本关联的多个特征类别的音频特征。From the user voice, audio features of a plurality of feature categories associated with the text sample are extracted.
  11. 根据权利要求9所述的电子设备,其中,所述分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签,包括:The electronic device according to claim 9, wherein the matching the audio features with the feature samples in the emotional feature library respectively to obtain the emotional label corresponding to the feature sample that matches with each of the audio features comprises:
    分别将所述音频特征与情绪特征库中的特征样本进行对比,得到与每个所述音频特征相似度超过预定阈值的多个特征样本,所述预定阈值与所述音频特征的个数对应;Respectively comparing the audio feature with the feature samples in the emotional feature library to obtain a plurality of feature samples whose similarity to each of the audio features exceeds a predetermined threshold, where the predetermined threshold corresponds to the number of the audio features;
    从所述情绪特征库中获取每个所述特征样本对应的情绪标签。Obtain the emotion label corresponding to each of the feature samples from the emotion feature library.
  12. 根据权利要求9所述的电子设备，其中，所述基于所述音频特征及所述匹配的特征样本相应的情绪标签，构建所述用户语音的特征标签矩阵，包括：The electronic device according to claim 9, wherein the constructing the feature label matrix of the user voice based on the audio feature and the emotion label corresponding to the matched feature sample comprises:
    将所述音频特征添加到矩阵的第一行;Adding the audio feature to the first row of the matrix;
    将每个所述音频特征相应的所述情绪标签，按照每个所述特征样本与所述音频特征的相似度由高到低的顺序，添加到每个所述音频特征对应的列得到所述特征标签矩阵，其中，所述矩阵每行对应于一个相似度范围。The emotion label corresponding to each audio feature is added to the column corresponding to each audio feature in the descending order of the similarity between each feature sample and the audio feature to obtain the feature label matrix, wherein each row of the matrix corresponds to a similarity range.
  13. 根据权利要求9所述的电子设备,其中,所述多情绪识别模型的构建方法,包括:The electronic device according to claim 9, wherein the method for constructing the multi-emotion recognition model comprises:
    利用AISHELL中文声纹数据库训练resnet34模型，训练结束后取出前n层网络作为预训练模型；Use the AISHELL Chinese voiceprint database to train the resnet34 model, and take out the first n-layer network as a pre-training model after the training is complete;
    为所述预训练模型接入多层全连接层作为分类器,得到识别模型,以使用标注好的语音情绪数据集对所述识别模型进行训练得到多情绪识别模型。A multi-layer fully connected layer is used as a classifier for the pre-training model to obtain a recognition model, and a multi-emotion recognition model is obtained by training the recognition model using the labeled speech emotion data set.
  14. 根据权利要求13所述的电子设备,其中,还包括:The electronic device according to claim 13, further comprising:
    同时初始化第一多情绪识别模型和第二多情绪识别模型，并使用有标签混合无标签的原始数据在所述第一多情绪识别模型上进行训练得到第一预测值，并得到有标签数据部分的分类误差损失值；Initialize the first multi-emotion recognition model and the second multi-emotion recognition model at the same time, and use the labeled mixed unlabeled raw data to train on the first multi-emotion recognition model to obtain the first predicted value, and obtain the classification error loss value of the labeled data part;
    利用指数滑动平均更新所述第二多情绪识别模型,并将加上噪声的数据输入更新后的所述第二多情绪识别模型训练得到第二预测值;Updating the second multi-emotion recognition model by using an exponential moving average, and inputting noise-added data into the updated second multi-emotion recognition model to obtain a second predicted value;
    计算所述第一预测值和所述第二预测值之间的误差作为一致性损失值;Calculating an error between the first predicted value and the second predicted value as a consistency loss value;
    利用所述分类误差损失值与所述一致性损失值之和更新所述第一多情绪识别模型。The first multiple emotion recognition model is updated by using the sum of the classification error loss value and the consistency loss value.
  15. 一种计算机可读存储介质,其上存储有计算机程序指令,其中,所述计算机程序指令被处理器执行时执行以下处理:A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions execute the following processing when executed by a processor:
    当接收到用户语音,提取所述用户语音的多类音频特征;When the user voice is received, extract multiple types of audio features of the user voice;
    分别将所述音频特征与情绪特征库中的特征样本进行匹配,得到与每个所述音频特征匹配的特征样本相应的情绪标签;Respectively matching the audio features with the feature samples in the emotion feature library to obtain the emotion labels corresponding to the feature samples that match each of the audio features;
    基于所述音频特征及所述匹配的特征样本相应的情绪标签,构建所述用户语音的特征标签矩阵;Constructing a feature tag matrix of the user's voice based on the audio feature and the emotion tag corresponding to the matched feature sample;
    将所述特征标签矩阵输入多情绪识别模型,得到多个情绪集及每个所述情绪集对应的场景标签;Input the feature label matrix into a multi-emotion recognition model to obtain multiple emotion sets and scene labels corresponding to each of the emotion sets;
    获取所述用户语音的语音场景所匹配的场景标签,以将所述匹配的场景标签对应的情绪集确定为识别出的用户语音情绪。Acquire the scene tag matched by the voice scene of the user's voice to determine the emotion set corresponding to the matched scene tag as the recognized user's speech emotion.
  16. 根据权利要求15所述的计算机可读存储介质，其中，所述当接收到用户语音，提取所述用户语音的多类音频特征，包括：The computer-readable storage medium according to claim 15, wherein the extracting multiple types of audio features of the user voice when the user voice is received includes:
    当接收到用户语音,将所述用户语音转化为文本;When the user voice is received, convert the user voice into text;
    将所述文本与特征提取类别数据库中的文本样本匹配,得到与所述文本匹配的文本样本;Matching the text with a text sample in a feature extraction category database to obtain a text sample matching the text;
    从所述用户语音,提取与所述文本样本关联的多个特征类别的音频特征。From the user voice, audio features of a plurality of feature categories associated with the text sample are extracted.
  17. 根据权利要求15所述的计算机可读存储介质，其中，所述分别将所述音频特征与情绪特征库中的特征样本进行匹配，得到与每个所述音频特征匹配的特征样本相应的情绪标签，包括：The computer-readable storage medium according to claim 15, wherein said respectively matching said audio feature with a feature sample in an emotional feature library to obtain an emotional tag corresponding to each feature sample matching said audio feature comprises:
    分别将所述音频特征与情绪特征库中的特征样本进行对比,得到与每个所述音频特征相似度超过预定阈值的多个特征样本,所述预定阈值与所述音频特征的个数对应;Respectively comparing the audio feature with the feature samples in the emotional feature library to obtain a plurality of feature samples whose similarity to each of the audio features exceeds a predetermined threshold, where the predetermined threshold corresponds to the number of the audio features;
    从所述情绪特征库中获取每个所述特征样本对应的情绪标签。Obtain the emotion label corresponding to each of the feature samples from the emotion feature library.
  18. 根据权利要求15所述的计算机可读存储介质，其中，所述基于所述音频特征及所述匹配的特征样本相应的情绪标签，构建所述用户语音的特征标签矩阵，包括：The computer-readable storage medium according to claim 15, wherein the constructing the feature label matrix of the user voice based on the audio feature and the emotion label corresponding to the matched feature sample comprises:
    将所述音频特征添加到矩阵的第一行;Adding the audio feature to the first row of the matrix;
    将每个所述音频特征相应的所述情绪标签，按照每个所述特征样本与所述音频特征的相似度由高到低的顺序，添加到每个所述音频特征对应的列得到所述特征标签矩阵，其中，所述矩阵每行对应于一个相似度范围。The emotion label corresponding to each audio feature is added to the column corresponding to each audio feature in the descending order of the similarity between each feature sample and the audio feature to obtain the feature label matrix, wherein each row of the matrix corresponds to a similarity range.
  19. 根据权利要求15所述的计算机可读存储介质，其中，所述多情绪识别模型的构建方法，包括：The computer-readable storage medium of claim 15, wherein the method for constructing the multi-emotion recognition model comprises:
    利用AISHELL中文声纹数据库训练resnet34模型，训练结束后取出前n层网络作为预训练模型；Use the AISHELL Chinese voiceprint database to train the resnet34 model, and take out the first n-layer network as a pre-training model after the training is complete;
    为所述预训练模型接入多层全连接层作为分类器,得到识别模型,以使用标注好的语音情绪数据集对所述识别模型进行训练得到多情绪识别模型。A multi-layer fully connected layer is used as a classifier for the pre-training model to obtain a recognition model, and a multi-emotion recognition model is obtained by training the recognition model using the labeled speech emotion data set.
  20. 根据权利要求19所述的计算机可读存储介质,其中,还包括:The computer-readable storage medium according to claim 19, further comprising:
    同时初始化第一多情绪识别模型和第二多情绪识别模型，并使用有标签混合无标签的原始数据在所述第一多情绪识别模型上进行训练得到第一预测值，并得到有标签数据部分的分类误差损失值；Initialize the first multi-emotion recognition model and the second multi-emotion recognition model at the same time, and use the labeled mixed unlabeled raw data to train on the first multi-emotion recognition model to obtain the first predicted value, and obtain the classification error loss value of the labeled data part;
    利用指数滑动平均更新所述第二多情绪识别模型,并将加上噪声的数据输入更新后的所述第二多情绪识别模型训练得到第二预测值;Updating the second multi-emotion recognition model by using an exponential moving average, and inputting noise-added data into the updated second multi-emotion recognition model to obtain a second predicted value;
    计算所述第一预测值和所述第二预测值之间的误差作为一致性损失值;Calculating an error between the first predicted value and the second predicted value as a consistency loss value;
    利用所述分类误差损失值与所述一致性损失值之和更新所述第一多情绪识别模型。The first multiple emotion recognition model is updated by using the sum of the classification error loss value and the consistency loss value.
PCT/CN2020/105543 2020-03-03 2020-07-29 Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium WO2021174757A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010138561.3A CN111429946A (en) 2020-03-03 2020-03-03 Voice emotion recognition method, device, medium and electronic equipment
CN202010138561.3 2020-03-03

Publications (1)

Publication Number Publication Date
WO2021174757A1 true WO2021174757A1 (en) 2021-09-10

Family

ID=71551972

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105543 WO2021174757A1 (en) 2020-03-03 2020-07-29 Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111429946A (en)
WO (1) WO2021174757A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429946A (en) * 2020-03-03 2020-07-17 深圳壹账通智能科技有限公司 Voice emotion recognition method, device, medium and electronic equipment
CN112017670B (en) * 2020-08-13 2021-11-02 北京达佳互联信息技术有限公司 Target account audio identification method, device, equipment and medium
CN112423106A (en) * 2020-11-06 2021-02-26 四川长虹电器股份有限公司 Method and system for automatically translating accompanying sound
CN112466324A (en) * 2020-11-13 2021-03-09 上海听见信息科技有限公司 Emotion analysis method, system, equipment and readable storage medium
CN113806586B (en) * 2021-11-18 2022-03-15 腾讯科技(深圳)有限公司 Data processing method, computer device and readable storage medium
CN114093389B (en) * 2021-11-26 2023-03-28 重庆凡骄网络科技有限公司 Speech emotion recognition method and device, electronic equipment and computer readable medium
CN114242070B (en) * 2021-12-20 2023-03-24 阿里巴巴(中国)有限公司 Video generation method, device, equipment and storage medium
CN115460317A (en) * 2022-09-05 2022-12-09 西安万像电子科技有限公司 Emotion recognition and voice feedback method, device, medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014062521A1 (en) * 2012-10-19 2014-04-24 Sony Computer Entertainment Inc. Emotion recognition using auditory attention cues extracted from users voice
CN108363706A (en) * 2017-01-25 2018-08-03 北京搜狗科技发展有限公司 The method and apparatus of human-computer dialogue interaction, the device interacted for human-computer dialogue
CN108922564A (en) * 2018-06-29 2018-11-30 北京百度网讯科技有限公司 Emotion identification method, apparatus, computer equipment and storage medium
CN109784414A (en) * 2019-01-24 2019-05-21 出门问问信息科技有限公司 Customer anger detection method, device and electronic equipment in a kind of phone customer service
CN110136723A (en) * 2019-04-15 2019-08-16 深圳壹账通智能科技有限公司 Data processing method and device based on voice messaging
CN111429946A (en) * 2020-03-03 2020-07-17 深圳壹账通智能科技有限公司 Voice emotion recognition method, device, medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961776A (en) * 2017-12-18 2019-07-02 上海智臻智能网络科技股份有限公司 Speech information processing apparatus
CN110288974B (en) * 2018-03-19 2024-04-05 北京京东尚科信息技术有限公司 Emotion recognition method and device based on voice
CN109885713A (en) * 2019-01-03 2019-06-14 刘伯涵 Facial expression image recommended method and device based on voice mood identification
CN110120231B (en) * 2019-05-15 2021-04-02 哈尔滨工业大学 Cross-corpus emotion recognition method based on self-adaptive semi-supervised non-negative matrix factorization

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113903363A (en) * 2021-09-29 2022-01-07 平安银行股份有限公司 Violation detection method, device, equipment and medium based on artificial intelligence
CN113889150A (en) * 2021-10-15 2022-01-04 北京工业大学 Speech emotion recognition method and device
CN113889150B (en) * 2021-10-15 2023-08-29 北京工业大学 Speech emotion recognition method and device
CN114121041A (en) * 2021-11-19 2022-03-01 陈文琪 Intelligent accompanying method and system based on intelligent accompanying robot
CN114121041B (en) * 2021-11-19 2023-12-08 韩端科技(深圳)有限公司 Intelligent accompanying method and system based on intelligent accompanying robot
CN114169440A (en) * 2021-12-08 2022-03-11 北京百度网讯科技有限公司 Model training method, data processing method, device, electronic device and medium
CN114912502A (en) * 2021-12-28 2022-08-16 天翼数字生活科技有限公司 Bimodal deep semi-supervised emotion classification method based on expressions and voices
CN114912502B (en) * 2021-12-28 2024-03-29 天翼数字生活科技有限公司 Double-mode deep semi-supervised emotion classification method based on expressions and voices
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment
CN114666618B (en) * 2022-03-15 2023-10-13 广州欢城文化传媒有限公司 Audio auditing method, device, equipment and readable storage medium
CN115113781A (en) * 2022-06-28 2022-09-27 广州博冠信息科技有限公司 Interactive icon display method, device, medium and electronic equipment
WO2024040793A1 (en) * 2022-08-26 2024-02-29 天翼电子商务有限公司 Multi-modal emotion recognition method combined with hierarchical policy
CN115460166A (en) * 2022-09-06 2022-12-09 网易(杭州)网络有限公司 Instant voice communication method and device, electronic equipment and storage medium
CN115414042A (en) * 2022-09-08 2022-12-02 北京邮电大学 Multi-modal anxiety detection method and device based on emotion information assistance
CN116306686B (en) * 2023-05-22 2023-08-29 中国科学技术大学 Method for generating multi-emotion-guided co-emotion dialogue
CN116306686A (en) * 2023-05-22 2023-06-23 中国科学技术大学 Method for generating multi-emotion-guided co-emotion dialogue
CN116564281B (en) * 2023-07-06 2023-09-05 世优(北京)科技有限公司 Emotion recognition method and device based on AI
CN116564281A (en) * 2023-07-06 2023-08-08 世优(北京)科技有限公司 Emotion recognition method and device based on AI

Also Published As

Publication number Publication date
CN111429946A (en) 2020-07-17

Similar Documents

Publication Title
WO2021174757A1 (en) Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
CN109036384B (en) Audio recognition method and device
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN107481717B (en) Acoustic model training method and system
CN108428446A (en) Audio recognition method and device
EP3940693A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
CN112885336B (en) Training and recognition method and device of voice recognition system and electronic equipment
CN111653274B (en) Wake-up word recognition method, device and storage medium
US11322151B2 (en) Method, apparatus, and medium for processing speech signal
CN114330371A (en) Session intention identification method and device based on prompt learning, and electronic equipment
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
CN111966798A (en) Intention identification method and device based on multi-round K-means algorithm and electronic equipment
CN117115581A (en) Intelligent misoperation early warning method and system based on multi-mode deep learning
CN116361442A (en) Business hall data analysis method and system based on artificial intelligence
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
CN112735432B (en) Audio identification method, device, electronic equipment and storage medium
CN115512692A (en) Voice recognition method, device, equipment and storage medium
CN112951270B (en) Voice fluency detection method, device and electronic equipment
Sartiukova et al. Remote Voice Control of Computer Based on Convolutional Neural Network
CN115116443A (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN111883133A (en) Customer service voice recognition method, customer service voice recognition device, customer service voice recognition server and storage medium
CN112037772A (en) Multi-mode-based response obligation detection method, system and device
Fennir et al. Acoustic scene classification for speaker diarization

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20923083
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established
    Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.01.2023)

122 Ep: pct application non-entry in european phase
    Ref document number: 20923083
    Country of ref document: EP
    Kind code of ref document: A1