CN115240713A - Voice emotion recognition method and device based on multi-modal features and contrast learning - Google Patents

Voice emotion recognition method and device based on multi-modal features and contrast learning

Info

Publication number
CN115240713A
CN115240713A (application CN202210825038.7A)
Authority
CN
China
Prior art keywords
emotion
voice
text
speech
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210825038.7A
Other languages
Chinese (zh)
Other versions
CN115240713B (en)
Inventor
谭真
张俊丰
赵翔
唐九阳
王俞涵
吴菲
葛斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210825038.7A priority Critical patent/CN115240713B/en
Publication of CN115240713A publication Critical patent/CN115240713A/en
Application granted granted Critical
Publication of CN115240713B publication Critical patent/CN115240713B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a voice emotion recognition method and device based on multi-modal features and contrast learning. The method comprises the following steps: the constructed speech emotion recognition model uses a Fast RCNN preprocessing model, a bidirectional GRU model and a 3D convolutional network to extract features, obtaining speech emotion features, text emotion features and high-level emotion features; the emotion features are enhanced by a contrast learning method, the enhanced emotion features are spliced and then decoded to output a probability distribution over emotion categories; a cross entropy loss function is constructed from the probability distribution of the emotion categories and the labeled real emotion category labels, the pre-constructed speech emotion recognition model is trained with the cross entropy loss function together with the loss function used in contrast learning, and speech emotion recognition is performed on the speech-video data to be recognized with the trained model. The method improves the accuracy of speech emotion recognition.

Description

Voice emotion recognition method and device based on multi-modal features and contrast learning
Technical Field
The present application relates to the field of signal processing and artificial intelligence technologies, and in particular to a speech emotion recognition method and apparatus, computer device, and storage medium based on multi-modal features and contrast learning.
Background
In daily life, people convey emotion mainly through voice, and voice is likewise one of the main channels of human-computer interaction. Recognizing the emotion carried in voice therefore allows a machine to better understand the user's intentions and ideas, making the machine more intelligent and humanized.
However, early research on speech emotion recognition was limited to speech data alone, which led to a bottleneck in recognition accuracy. In fact, when humans convey emotion through voice, changes in facial expression and hand movement generally accompany the speech, and besides acoustic information the voice also carries textual information; each such channel of information is called a modality. This gave rise to speech emotion recognition based on multi-modal features, in which the task is aided by video capturing the facial expressions and hand movements made while speaking and by the text transcribed from the voice. Because the voice, the synchronously captured video and the transcribed text carry the same emotional information, the multi-modal features share a certain similarity in their emotional content. However, existing research on speech emotion recognition based on multi-modal features ignores the relations among the multi-modal features, so the multi-modal emotion feature representation is not accurate enough and the speech emotion recognition accuracy is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a speech emotion recognition method, apparatus, computer device and storage medium based on multi-modal feature and contrast learning, which can improve speech emotion recognition accuracy.
A speech emotion recognition method based on multi-modal feature and contrast learning, the method comprising:
acquiring voice video data to be recognized; the voice video data comprises voice text and video data;
constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
performing data preprocessing on the voice text to obtain a voice vector and a word vector;
extracting local features of the speaker in the video data by using a Fast RCNN preprocessing model, expanding the local features to the size of the global feature map of the video data and fusing them with it to obtain fused video data; the local features include facial expressions and hand movements;
respectively extracting emotional characteristics of the voice vector and the word vector according to the bidirectional GRU model to obtain voice emotional characteristics and text emotional characteristics; extracting emotional characteristics of the fused video data by using a 3D convolutional network to obtain high-grade emotional characteristics;
performing enhancement representation on the voice emotion feature, the text emotion feature and the high-level emotion feature according to a comparison learning method to obtain the enhanced voice emotion feature, the text emotion feature and the high-level emotion feature;
splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding by a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
constructing a cross entropy loss function according to the probability distribution of the emotion types and the labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model;
and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
In one embodiment, the voice text comprises voice data and text data; carrying out data preprocessing on the voice text to obtain a voice vector and a word vector, wherein the method comprises the following steps:
dividing the voice data into frames using a window of fixed duration, sliding the window backwards so that each window position overlaps the previous one, to obtain a plurality of single-frame voice data segments;
converting the single-frame voice data according to an OpenSMILE tool to obtain a voice vector;
taking the length of the sentence containing the most words in the text data as the maximum length, and zero-padding the sentences in the text data that are shorter than the maximum length to obtain a plurality of sentences of equal length;
converting words in the sentences with equal length according to the Bert preprocessing model to obtain word vectors; the word vector includes semantic information of the context of the word corresponding thereto.
In one embodiment, the extracting of emotion features of the speech vector and the word vector according to the bidirectional GRU model to obtain speech emotion features and text emotion features includes:
according to the bidirectional GRU model, the voice vector and the word vector are each fed simultaneously into a forward GRU model and a backward GRU model, and the state information vectors output by the two GRU models at each time step are spliced, yielding the corresponding voice emotion feature and text emotion feature respectively.
In one embodiment, the method for enhancing and representing the speech emotion feature, the text emotion feature and the high-level emotion feature according to a contrast learning method to obtain the enhanced speech emotion feature, the text emotion feature and the high-level emotion feature includes:
and according to the contrast learning method, the distances among the speech emotion features, the text emotion features and the high-level emotion features are drawn closer by training the model to reduce a loss function, so that the enhanced speech emotion features, text emotion features and high-level emotion features are obtained.
In one embodiment, the loss function is loss_cons = log(exp(D(S, T)/τ) + exp(D(S, V)/σ)), where D is the L2 distance function measuring the distance between two emotion feature representations, S is the speech emotion feature, T is the text emotion feature, V is the high-level emotion feature, and τ and σ are parameters that adjust the feature representation level.
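For illustration only, a minimal PyTorch sketch of this contrast learning loss, assuming S, T and V are batched feature tensors of equal dimension and taking D to be the L2 norm; the function name and the default values of τ and σ are assumptions rather than part of the disclosure:

import torch

def contrastive_alignment_loss(S, T, V, tau=1.0, sigma=1.0):
    # Sketch of loss_cons = log(exp(D(S, T)/tau) + exp(D(S, V)/sigma)).
    # S, T, V: (batch, dim) speech, text and high-level video emotion features.
    d_st = torch.norm(S - T, p=2, dim=-1)   # L2 distance between speech and text features
    d_sv = torch.norm(S - V, p=2, dim=-1)   # L2 distance between speech and video features
    loss = torch.log(torch.exp(d_st / tau) + torch.exp(d_sv / sigma))
    return loss.mean()                      # minimizing this draws the three features closer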
In one embodiment, splicing the enhanced speech emotion features, text emotion features and high-level emotion features, decoding the spliced features through a decoder consisting of full connection layers, and outputting the probability distribution of emotion categories includes:
splicing the enhanced voice emotion features, text emotion features and high-level emotion features, decoding the spliced features through a decoder consisting of full connection layers, and outputting the probability distribution of emotion categories as

p_j = exp(F_r^j) / Σ_i exp(F_r^i)

wherein F_r represents the multi-modal features after splicing, p_j represents the probability that the currently recognized emotion belongs to the j-th category, F_r^j is the j-th feature parameter of the multi-modal features, and F_r^i is the i-th feature parameter of the multi-modal features.
In one embodiment, the cross entropy loss function is constructed according to the probability distribution of the emotion classes and the labeled real emotion class labels, and comprises the following steps:
constructing a cross entropy loss function according to the probability distribution of the emotion classes and the labeled real emotion class labels as
loss_crE = -Σ_{i=1}^{n} y(x_i) log(p(x_i))
Wherein y represents the real distribution of the emotion types, x represents the emotion type labels, n represents the number of the emotion type labels, and i represents the emotion type label serial number.
A speech emotion recognition apparatus based on multi-modal features and contrast learning, the apparatus comprising:
the model building module is used for acquiring voice video data to be recognized; the voice video data comprises voice text and video data; constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
the data preprocessing module is used for preprocessing the data of the voice text to obtain a voice vector and a word vector; extracting local characteristics of speakers in the video data by using a Fast RCNN preprocessing model, and fusing the local characteristics with a global characteristic diagram of the video data after expanding the size to obtain fused video data; local features include facial expressions and hand movements;
the feature extraction module is used for respectively extracting the emotional features of the voice vector and the word vector according to the bidirectional GRU model to obtain voice emotional features and text emotional features; extracting emotional characteristics of the fused video data by using a 3D convolutional network to obtain high-grade emotional characteristics;
the comparison learning module is used for performing enhancement representation on the voice emotion characteristics, the text emotion characteristics and the high-level emotion characteristics according to a comparison learning method to obtain enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics; splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding by a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
the speech emotion recognition module is used for constructing a cross entropy loss function according to the probability distribution of emotion types and the labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model; and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring voice video data to be recognized; the voice video data comprises voice text and video data;
constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
performing data preprocessing on the voice text to obtain a voice vector and a word vector;
extracting local characteristics of speakers in the video data by using a Fast RCNN preprocessing model, and fusing the local characteristics with a global characteristic diagram of the video data after expanding the size to obtain fused video data; local features include facial expressions and hand movements;
respectively extracting emotional characteristics of the voice vector and the word vector according to the bidirectional GRU model to obtain voice emotional characteristics and text emotional characteristics; extracting emotional characteristics of the fused video data by using a 3D convolutional network to obtain high-grade emotional characteristics;
performing enhancement representation on the voice emotion characteristics, the text emotion characteristics and the high-level emotion characteristics according to a comparison learning method to obtain enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics;
splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding by a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
constructing a cross entropy loss function according to the probability distribution of the emotion types and the labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model;
and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring voice video data to be recognized; the voice video data comprises voice text and video data;
constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
performing data preprocessing on the voice text to obtain a voice vector and a word vector;
extracting local characteristics of speakers in the video data by using a Fast RCNN preprocessing model, and fusing the local characteristics with a global characteristic diagram of the video data after expanding the size to obtain fused video data; local features include facial expressions and hand movements;
respectively extracting emotional characteristics of the voice vector and the word vector according to the bidirectional GRU model to obtain voice emotional characteristics and text emotional characteristics; extracting the emotional characteristics of the fused video data by using a 3D convolution network to obtain high-grade emotional characteristics;
performing enhancement representation on the voice emotion characteristics, the text emotion characteristics and the high-level emotion characteristics according to a comparison learning method to obtain enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics;
splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding by a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
constructing a cross entropy loss function according to the probability distribution of the emotion types and the labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model;
and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
According to the voice emotion recognition method, device, computer equipment and storage medium based on multi-modal features and contrast learning, features are extracted from the audio, text and video data with the Fast RCNN preprocessing model, the bidirectional GRU model and the 3D convolutional network, yielding multi-modal features that express the emotion in the speech and making effective use of emotion information from multiple modalities. The contrast learning method effectively reduces the distances between the multi-modal emotion features, giving a more accurate emotion feature representation; the probability distribution of emotion categories is then computed from this representation, a cross entropy loss function is constructed with the real emotion category labels, the pre-constructed speech emotion recognition model is trained with this loss together with the loss function used in contrast learning, and speech emotion recognition is performed with the trained model, improving the accuracy of speech emotion recognition.
Drawings
FIG. 1 is a schematic flow chart of a speech emotion recognition method based on multi-modal feature and contrast learning in one embodiment;
FIG. 2 is a block diagram of a method for speech emotion recognition based on multi-modal features and contrast learning, under an embodiment;
FIG. 3 is a schematic diagram of a speech emotion recognition apparatus based on multi-modal features and contrast learning, under an embodiment;
FIG. 4 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a speech emotion recognition method based on multi-modal features and contrast learning is provided, which comprises the following steps:
step 102, acquiring voice video data to be recognized; the voice video data comprises voice text and video data; constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
Step 104, performing data preprocessing on the voice text to obtain a voice vector and a word vector; extracting local features of the speaker in the video data by using a Fast RCNN preprocessing model, expanding the local features to the size of the global feature map of the video data and fusing them with it to obtain fused video data; the local features include facial expressions and hand movements.
The voice text contains the words spoken by the speaker together with their emotional characteristics, and the voice vector and word vector obtained by preprocessing the voice text are better suited to feature extraction. In addition, when a speaker produces speech, it is generally accompanied by changes in facial expression and hand movement, which is why the local features extracted from the video include facial expressions and hand movements.
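As a hedged sketch of the fusion described above (not necessarily the exact implementation of the invention), local feature patches produced for the detected face and hand regions can be expanded to the spatial size of the global feature map and fused element-wise; the function names and the additive fusion are illustrative assumptions:

import torch
import torch.nn.functional as F

def fuse_local_global(global_map, local_patches):
    # global_map:    (C, H, W) global feature map of a video frame.
    # local_patches: list of (C, h, w) feature patches for the detected face/hand regions.
    fused = global_map.clone()
    for patch in local_patches:
        # expand the local patch to the spatial size of the global feature map
        expanded = F.interpolate(patch.unsqueeze(0), size=global_map.shape[-2:],
                                 mode="bilinear", align_corners=False).squeeze(0)
        fused = fused + expanded            # element-wise fusion of local and global information
    return fused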
Step 106, extracting emotion features from the voice vector and the word vector according to the bidirectional GRU model to obtain voice emotion features and text emotion features; and performing emotion feature extraction on the fused video data by using a 3D convolutional network to obtain high-level emotion features.
The bidirectional GRU model comprises a forward GRU model and a backward GRU model. The GRU is an effective variant of the long short-term memory network: it has a simpler structure than the LSTM while performing well. Using the bidirectional GRU for feature extraction yields a state information vector that combines the state information before and after the current time step, so the extracted features incorporate more context and are more accurate. Meanwhile, video data is in essence a multi-channel image sequence formed by its frames. The method extracts emotion features from the video with 3D convolution, applying a 3D convolution kernel to the whole multi-channel three-dimensional volume; compared with applying 2D convolution to each 2D channel separately, 3D convolution models temporal information better. The invention uses multiple 3D CNN layers and 3D pooling layers to extract the high-level emotion features of the video.
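For concreteness, a minimal sketch of a 3D convolutional feature extractor in the spirit of the description above, built from stacked 3D convolution and 3D pooling layers; the channel counts, kernel sizes and output dimension are illustrative assumptions rather than the configuration of the invention:

import torch.nn as nn

class Video3DEncoder(nn.Module):
    # Stacked 3D convolutions and 3D pooling producing a high-level emotion feature vector.
    def __init__(self, in_channels=3, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, clips):               # clips: (batch, C, T, H, W) fused video data
        h = self.features(clips).flatten(1)
        return self.proj(h)                 # (batch, feat_dim) high-level emotion features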
Step 108, performing enhanced representation of the voice emotion features, the text emotion features and the high-level emotion features according to a contrast learning method to obtain enhanced voice emotion features, text emotion features and high-level emotion features; and splicing the enhanced voice emotion features, text emotion features and high-level emotion features, decoding them with a decoder consisting of full connection layers, and outputting the probability distribution of emotion categories.
The video synchronously acquired with the voice and the text transcribed from the voice carry the same emotion information, so a certain similarity exists between the extracted multi-modal features. The contrast learning method draws the multi-modal features closer together, so the emotion information can be learned more accurately and a higher-order feature representation is obtained, which improves the accuracy of emotion feature recognition. The enhanced voice emotion features, text emotion features and high-level emotion features are then spliced, and the probability distribution of emotion categories output by a decoder consisting of full connection layers is used together with the labeled real emotion category labels to construct a cross entropy loss function for training the pre-constructed speech emotion recognition model, improving the accuracy with which the model recognizes the category of the speech emotion.
Step 110, constructing a cross entropy loss function according to the probability distribution of emotion types and labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model; and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
The probability distribution of the emotion categories and the one-hot codes of the real emotion category labels are used to construct a cross entropy loss function, whose expression is as follows:
loss_crE = -Σ_{i=1}^{n} y(x_i) log(p(x_i))
the real emotion category label is a label which is accurately labeled in advance, and the probability distribution of the emotion categories and the real emotion category label are used for constructing a cross entropy loss function to train the model, so that the model is trainedMore accurate emotion classification can be output during emotion recognition, and loss function loss in comparison learning is realized cons In combination with the cross-entropy loss function, the loss function is then minimized by continuously updating the parameters with a random gradient descent, as follows:
loss_θ = loss_cons + loss_crE
θ ← θ - η ∂loss_θ/∂θ
where θ represents all trainable parameters in the model and η is the learning rate of the model during training.
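A hedged sketch of one training step that combines the two losses and applies the parameter update above, reusing the contrastive_alignment_loss sketch given earlier; the assumption that the model returns the three modality features together with the softmax probabilities, and the names in the batch dictionary, are illustrative:

import torch

def cross_entropy_from_probs(probs, labels, eps=1e-8):
    # loss_crE = -sum_i y(x_i) * log(p(x_i)), with y the one-hot real emotion labels.
    one_hot = torch.nn.functional.one_hot(labels, num_classes=probs.size(-1)).float()
    return -(one_hot * torch.log(probs + eps)).sum(dim=-1).mean()

def train_step(model, optimizer, batch, tau=1.0, sigma=1.0):
    # One update minimizing loss_theta = loss_cons + loss_crE.
    S, T, V, probs = model(batch["speech"], batch["text"], batch["video"])
    loss_cons = contrastive_alignment_loss(S, T, V, tau, sigma)   # see the earlier sketch
    loss_cre = cross_entropy_from_probs(probs, batch["labels"])
    loss = loss_cons + loss_cre
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()        # with SGD: theta <- theta - eta * d(loss_theta)/d(theta)
    return float(loss.detach())

Pairing this with optimizer = torch.optim.SGD(model.parameters(), lr=eta) makes the update performed by optimizer.step() match the gradient descent formula above.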
Training the pre-constructed speech emotion recognition model with the loss function from contrast learning reduces the errors that arise in contrast learning in the prior art and makes the model more accurate, so that more accurate emotion categories are obtained when speech emotion recognition is performed on the speech-video data to be recognized with the trained speech emotion recognition model. The process of speech emotion recognition is shown in fig. 2.
In the speech emotion recognition method based on multi-modal features and contrast learning, features are extracted from the speech, text and video data with the Fast RCNN preprocessing model, the bidirectional GRU model and the 3D convolutional network, yielding multi-modal features that express the emotion in the speech and making effective use of emotion information from multiple modalities. The contrast learning method effectively reduces the distances between the multi-modal emotion features, giving a more accurate emotion feature representation; the probability distribution of emotion categories is then computed from this representation, a cross entropy loss function is constructed with the real emotion category labels, the pre-constructed speech emotion recognition model is trained with this loss together with the loss function used in contrast learning, and speech emotion recognition is performed with the trained model, improving the accuracy of speech emotion recognition.
In one embodiment, the voice text comprises voice data and text data; carrying out data preprocessing on the voice text to obtain a voice vector and a word vector, and comprising the following steps:
taking the voice data as a frame according to a window with a fixed time period length, sliding the window backwards, wherein the position of the window after each movement and the position of the previous window have an overlapping part, and obtaining a plurality of single-frame voice data;
converting the single-frame voice data according to an OpenSMILE tool to obtain a voice vector;
taking the sentence length containing most words in the text data as the maximum length, and carrying out zero filling processing on the sentences with the length less than the maximum length in the text data to obtain a plurality of sentences with equal length;
converting words in sentences with equal length according to a Bert preprocessing model to obtain word vectors; the word vector contains semantic information of the context of the word corresponding to the word vector.
In a specific embodiment, the voice is divided into frames with a window of a certain fixed duration that is slid backwards; to make the transition between successive frames natural, each window position overlaps the previous one. Each voice frame is then converted by the OpenSMILE tool into Mel-frequency cepstral coefficient (MFCC) feature parameters, i.e. the voice is vectorized. Meanwhile, because the model can only process text vectors of equal-length sequences, the length of the sentence containing the most words in the text is taken as the maximum length and shorter sentences are zero-padded so that all sentences have equal length; the words in the equal-length sentences are then converted into word vectors by the Bert preprocessing model, and each word vector contains the semantic information of its context.
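Purely as an illustration of the framing and padding described in this embodiment, a short sketch; the window length, hop length and padding id are assumed values, and the OpenSMILE and Bert conversions are only indicated by comments:

import numpy as np

def frame_speech(signal, sr, win_s=0.025, hop_s=0.010):
    # Split a waveform into overlapping fixed-length frames (window and hop are assumed values).
    win, hop = int(win_s * sr), int(hop_s * sr)
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]
    return np.stack(frames)                 # each frame would then be passed to OpenSMILE for MFCCs

def pad_sentences(token_id_seqs, pad_id=0):
    # Zero-pad tokenised sentences to the length of the longest one (Bert then maps tokens to vectors).
    max_len = max(len(s) for s in token_id_seqs)
    return np.array([list(s) + [pad_id] * (max_len - len(s)) for s in token_id_seqs])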
In one embodiment, the extracting of emotion features of the speech vector and the word vector according to the bidirectional GRU model to obtain speech emotion features and text emotion features includes:
according to the bidirectional GRU model, the voice vector and the word vector are each fed simultaneously into a forward GRU model and a backward GRU model, and the state information vectors output by the two GRU models at each time step are spliced, yielding the corresponding voice emotion feature and text emotion feature respectively.
The GRU model is an effective variant of the long short-term memory network; it has a simpler structure than the LSTM network while performing well.
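A minimal sketch of a bidirectional GRU encoder consistent with this description, in which the forward and backward state vectors at each time step are concatenated; the hidden dimension is an assumption:

import torch.nn as nn

class BiGRUEncoder(nn.Module):
    # Bidirectional GRU; the forward and backward state vectors are concatenated at every time step.
    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                   # x: (batch, seq_len, input_dim) voice or word vectors
        out, _ = self.gru(x)                # (batch, seq_len, 2 * hidden_dim)
        return out                          # emotion features combining past and future context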
In one embodiment, the enhancing representation of the speech emotion feature, the text emotion feature and the high-level emotion feature according to a contrast learning method to obtain enhanced speech emotion feature, text emotion feature and high-level emotion feature includes:
and according to the contrast learning method, the distances among the speech emotion features, the text emotion features and the high-level emotion features are drawn closer by training the model to reduce a loss function, so that the enhanced speech emotion features, text emotion features and high-level emotion features are obtained.
In one embodiment, the loss function is loss_cons = log(exp(D(S, T)/τ) + exp(D(S, V)/σ)), where D is the L2 distance function measuring the distance between two emotion feature representations, S is the speech emotion feature, T is the text emotion feature, V is the high-level emotion feature, and τ and σ are parameters that adjust the feature representation level.
In one embodiment, splicing the enhanced speech emotion features, text emotion features and high-level emotion features, decoding the spliced features through a decoder consisting of full connection layers, and outputting the probability distribution of emotion categories includes:
splicing the enhanced voice emotion features, text emotion features and high-level emotion features, decoding the spliced features through a decoder consisting of full connection layers, and outputting the probability distribution of emotion categories as

p_j = exp(F_r^j) / Σ_i exp(F_r^i)

wherein F_r represents the multi-modal features after splicing, p_j represents the probability that the currently recognized emotion belongs to the j-th category, F_r^j is the j-th feature parameter of the multi-modal features, and F_r^i is the i-th feature parameter of the multi-modal features.
In this specific embodiment, the video synchronously acquired with the voice and the text transcribed from the voice contain the same emotion information, so a certain similarity exists between the extracted multi-modal features. Adopting the idea of contrast learning, the distances between the emotion features of the speech, the text and the video are then shortened by training to reduce a loss function, where the loss function is:
loss_cons = log(exp(D(S, T)/τ) + exp(D(S, V)/σ))
wherein D is the L2 distance function measuring the distance between two emotion feature representations, S is the speech emotion feature, T is the text emotion feature, V is the high-level emotion feature, and τ and σ are parameters that adjust the feature representation level. The higher-order emotion feature representations of the three modalities obtained after contrast learning are then spliced, with the expression:
F = concat(S_C, T_C, M_C)
wherein S_C, T_C and M_C respectively represent the higher-order emotion features of the voice, the text and the video obtained after contrast learning. The spliced multi-modal emotion feature F is input into the full connection layer and the ReLU activation function layer; the formula is as follows:
F_r = ReLU(W^T F) = max(0, W^T F)
where W represents a trainable parameter matrix.
Finally, the probability distribution over the emotion categories is output through a softmax function, and the emotion category with the maximum probability is selected as the recognized emotion category. The expression is as follows:
p_j = exp(F_r^j) / Σ_i exp(F_r^i)
By drawing the multi-modal features closer together through contrast learning, the emotion information can be learned more accurately, a higher-order feature representation is obtained, and the accuracy of emotion recognition is enhanced.
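A hedged sketch of the decoding step just described: the three enhanced modality features are spliced, passed through a full connection layer with a ReLU activation, and converted into a probability distribution by softmax; the class name, layer sizes and variable names are assumptions:

import torch
import torch.nn as nn

class EmotionDecoder(nn.Module):
    # Splice the three enhanced features, apply a full connection layer with ReLU, then softmax.
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, S_c, T_c, M_c):
        F_cat = torch.cat([S_c, T_c, M_c], dim=-1)     # F = concat(S_C, T_C, M_C)
        F_r = torch.relu(self.fc(F_cat))               # F_r = ReLU(W^T F) = max(0, W^T F)
        return torch.softmax(F_r, dim=-1)              # p_j = exp(F_r^j) / sum_i exp(F_r^i)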
In one embodiment, the cross entropy loss function is constructed according to the probability distribution of the emotion classes and the labeled real emotion class labels, and comprises the following steps:
constructing a cross entropy loss function according to the probability distribution of the emotion classes and the labeled real emotion class labels as
loss_crE = -Σ_{i=1}^{n} y(x_i) log(p(x_i))
Wherein y represents the real distribution of the emotion types, x represents the emotion type labels, n represents the number of the emotion type labels, and i represents the serial numbers of the emotion type labels.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performing the sub-steps or stages is not necessarily sequential, but may be performed alternately or alternately with other steps or at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 3, there is provided a speech emotion recognition apparatus based on multi-modal feature and contrast learning, including: a model construction module 302, a data preprocessing module 304, a feature extraction module 306, a comparison learning module 308 and a speech emotion recognition module 310, wherein:
a model building module 302, configured to obtain voice and video data to be recognized; the voice video data comprises voice text and video data; constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
the data preprocessing module 304 is configured to perform data preprocessing on the voice text to obtain a voice vector and a word vector; extracting local characteristics of speakers in the video data by using a Fast RCNN preprocessing model, and fusing the local characteristics with a global characteristic diagram of the video data after expanding the size to obtain fused video data; local features include facial expressions and hand movements;
the feature extraction module 306 is configured to perform emotion feature extraction on the speech vector and the word vector according to the bidirectional GRU model, so as to obtain speech emotion features and text emotion features; extracting the emotional characteristics of the fused video data by using a 3D convolution network to obtain high-grade emotional characteristics;
the contrast learning module 308 is configured to perform enhancement representation on the speech emotion feature, the text emotion feature and the high-level emotion feature according to a contrast learning method to obtain an enhanced speech emotion feature, a text emotion feature and a high-level emotion feature; splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding by a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
the speech emotion recognition module 310 is configured to construct a cross entropy loss function according to the probability distribution of emotion categories and the labeled real emotion category labels, and train a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model; and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
In one embodiment, the voice text comprises voice data and text data, and the data preprocessing module 304 is further configured to perform data preprocessing on the voice text to obtain a voice vector and a word vector, including:
taking the voice data as a frame according to a window with a fixed time period length, sliding the window backwards, wherein the position of the window after each movement and the position of the previous window have an overlapping part, and obtaining a plurality of single-frame voice data;
converting the single-frame voice data according to an OpenSMILE tool to obtain a voice vector;
taking the sentence length containing most words in the text data as the maximum length, and carrying out zero filling processing on the sentences with the length less than the maximum length in the text data to obtain a plurality of sentences with equal length;
converting words in sentences with equal length according to a Bert preprocessing model to obtain word vectors; the word vector contains semantic information of the context of the word corresponding to the word vector.
In one embodiment, the feature extraction module 306 is further configured to perform emotion feature extraction on the speech vector and the word vector according to the bidirectional GRU model, respectively, to obtain speech emotion features and text emotion features, including:
according to the bidirectional GRU model, the voice vector and the word vector are simultaneously input into a forward GRU model and a backward GRU model respectively, and state information vectors output by the two GRU models at the same time are spliced to obtain corresponding voice emotion characteristics and text emotion characteristics respectively.
In one embodiment, the contrast learning module 308 is further configured to perform enhanced representation on the speech emotional feature, the text emotional feature and the high-level emotional feature according to a contrast learning method, so as to obtain an enhanced speech emotional feature, a text emotional feature and a high-level emotional feature, including:
and according to the contrast learning method, the distances among the speech emotion features, the text emotion features and the high-level emotion features are drawn closer by training the model to reduce a loss function, so that the enhanced speech emotion features, text emotion features and high-level emotion features are obtained.
In one embodiment, the loss function is loss_cons = log(exp(D(S, T)/τ) + exp(D(S, V)/σ)), where D is the L2 distance function measuring the distance between two emotion feature representations, S is the speech emotion feature, T is the text emotion feature, V is the high-level emotion feature, and τ and σ are parameters that adjust the feature representation level.
In one embodiment, the contrast learning module 308 is further configured to concatenate the enhanced speech emotion features, text emotion features, and high-level emotion features, decode the concatenated features by a decoder composed of full concatenation layers, and output a probability distribution of emotion classes, where the probability distribution includes:
splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding the spliced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics through a decoder consisting of all connection layers, and outputting probability distribution of emotion types
p_j = exp(F_r^j) / Σ_i exp(F_r^i)

wherein F_r represents the multi-modal features after splicing, p_j represents the probability that the currently recognized emotion belongs to the j-th category, F_r^j is the j-th feature parameter of the multi-modal features, and F_r^i is the i-th feature parameter of the multi-modal features.
In one embodiment, the speech emotion recognition module 310 is further configured to construct a cross entropy loss function according to the probability distribution of emotion classes and the labeled real emotion class labels, including:
constructing a cross entropy loss function according to the probability distribution of the emotion classes and the labeled real emotion class labels as
loss_crE = -Σ_{i=1}^{n} y(x_i) log(p(x_i))
Wherein y represents the real distribution of the emotion types, x represents the emotion type labels, n represents the number of the emotion type labels, and i represents the emotion type label serial number.
For specific limitations of the speech emotion recognition device based on multi-modal features and contrast learning, reference may be made to the above limitations of the speech emotion recognition method based on multi-modal features and contrast learning, and details are not repeated here. The modules in the speech emotion recognition device based on multi-modal features and contrast learning can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech emotion recognition method based on multi-modal features and contrast learning. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by hardware instructions of a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
All possible combinations of the technical features in the above embodiments may not be described for the sake of brevity, but should be considered as being within the scope of the present disclosure as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A speech emotion recognition method based on multi-modal features and contrast learning, the method comprising:
acquiring voice video data to be recognized; the voice video data comprises voice text and video data;
constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
carrying out data preprocessing on the voice text to obtain a voice vector and a word vector;
extracting local characteristics of speakers in the video data by using a Fast RCNN preprocessing model, and fusing the local characteristics with a global characteristic diagram of the video data after the local characteristics are expanded to obtain fused video data; the local features include facial expressions and hand movements;
extracting the emotional characteristics of the voice vector and the word vector according to a bidirectional GRU model to obtain voice emotional characteristics and text emotional characteristics; extracting emotional characteristics of the fused video data by using a 3D convolutional network to obtain high-grade emotional characteristics;
performing enhancement representation on the voice emotion characteristics, the text emotion characteristics and the high-level emotion characteristics according to a comparison learning method to obtain enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics;
splicing the enhanced voice emotion characteristics, the enhanced text emotion characteristics and the enhanced high-level emotion characteristics, decoding the voice emotion characteristics, the enhanced text emotion characteristics and the enhanced high-level emotion characteristics through a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
constructing a cross entropy loss function according to the probability distribution of the emotion types and the labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model;
and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
2. The method according to claim 1, wherein the speech text comprises speech data and text data; carrying out data preprocessing on the voice text to obtain a voice vector and a word vector, and comprising the following steps:
taking the voice data as a frame according to a window with a fixed time period length, sliding the window backwards, wherein the position of the window after each movement and the position of the previous window have an overlapping part, and obtaining a plurality of single-frame voice data;
converting the single-frame voice data according to an OpenSMILE tool to obtain a voice vector;
taking the sentence length containing the most words in the text data as the maximum length, and carrying out zero filling processing on the sentences in the text data whose length is less than the maximum length to obtain a plurality of sentences of equal length;
converting words in the sentences with the same length according to a Bert preprocessing model to obtain word vectors; the word vector contains semantic information of the context of the word corresponding to the word vector.
3. The method of claim 1, wherein performing emotion feature extraction on the speech vector and the word vector according to a bidirectional GRU model to obtain speech emotion features and text emotion features respectively comprises:
and according to the bidirectional GRU model, the voice vector and the word vector are respectively and simultaneously input into a forward GRU model and a reverse GRU model, and the state information vectors output by the two GRU models at the same time are spliced to obtain the voice emotion characteristic and the text emotion characteristic which respectively correspond to each other.
4. The method according to any one of claims 1 to 3, wherein the enhancing expression of the speech emotional features, the text emotional features and the high-level emotional features is performed according to a contrast learning method, so as to obtain enhanced speech emotional features, text emotional features and high-level emotional features, and the method comprises:
and according to the contrast learning method, the distances among the speech emotion features, the text emotion features and the high-level emotion features are narrowed by training to reduce a loss function, so that the enhanced speech emotion features, text emotion features and high-level emotion features are obtained.
5. The method of claim 4, wherein the loss function is loss_cons = log(exp(D(S, T)/τ) + exp(D(S, V)/σ)), where D is an L2 distance function measuring the distance between two emotion feature representations, S is the speech emotion feature, T is the text emotion feature, V is the high-level emotion feature, and τ and σ are parameters that adjust the feature representation level.
6. The method of claim 1, wherein the concatenating the enhanced speech emotion feature, the enhanced text emotion feature and the enhanced high-level emotion feature, decoding the concatenated features by a decoder composed of full concatenation layers, and outputting a probability distribution of emotion classes, comprises:
splicing the enhanced voice emotion characteristics, the enhanced text emotion characteristics and the enhanced high-level emotion characteristics, decoding the enhanced voice emotion characteristics, the enhanced text emotion characteristics and the enhanced high-level emotion characteristics through a decoder consisting of all connection layers, and outputting probability distribution of emotion types as
p_j = exp(F_r^j) / Σ_i exp(F_r^i)

wherein F_r represents the multi-modal features after splicing, p_j represents the probability that the currently recognized emotion belongs to the j-th category, F_r^j is the j-th feature parameter of the multi-modal features, and F_r^i is the i-th feature parameter of the multi-modal features.
7. The method of claim 6, wherein constructing the cross-entropy loss function according to the probability distribution of the emotion classes and the labeled true emotion class labels comprises:
constructing the cross-entropy loss function according to the probability distribution of the emotion classes and the labeled true emotion class labels as

loss_ce = -Σ_{i=1}^{n} y(x_i) log(p(x_i))

where y denotes the true distribution of the emotion classes, x denotes the emotion class labels, n denotes the number of emotion class labels, and i denotes the index of an emotion class label.
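The sketch below shows one way the cross-entropy term and the joint training objective could be computed, reusing the contrastive_loss sketch under claim 5; the weighting between the two terms is an assumption, since the claims only state that both losses are used during training.

```python
# Cross-entropy over the predicted emotion distribution plus the contrastive term.
import torch

def cross_entropy_loss(probs, labels, eps=1e-8):
    """-sum_i y_i * log(p_i), with y the one-hot true emotion distribution."""
    one_hot = torch.nn.functional.one_hot(labels, num_classes=probs.size(-1)).float()
    return -(one_hot * torch.log(probs + eps)).sum(dim=-1).mean()

def total_loss(probs, labels, S, T, V, lam=0.1):   # lam is an assumed weighting
    return cross_entropy_loss(probs, labels) + lam * contrastive_loss(S, T, V)
```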
8. A speech emotion recognition apparatus based on multi-modal features and contrast learning, the apparatus comprising:
a model building module, configured to acquire voice video data to be recognized, the voice video data comprising speech text and video data, and to construct a speech emotion recognition model, the speech emotion recognition model comprising a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolutional network and a fully connected layer;
a data preprocessing module, configured to preprocess the speech text to obtain a speech vector and word vectors, to extract local features of the speaker in the video data with the Fast RCNN preprocessing model, and to fuse the local features, after enlarging them, with the global feature map of the video data to obtain fused video data, the local features comprising facial expressions and hand movements;
a feature extraction module, configured to perform emotion feature extraction on the speech vector and the word vectors with the bidirectional GRU model to obtain speech emotion features and text emotion features, and to perform emotion feature extraction on the fused video data with the 3D convolutional network to obtain high-level emotion features;
a contrast learning module, configured to perform enhanced representation of the speech emotion features, the text emotion features and the high-level emotion features according to the contrast learning method to obtain enhanced speech emotion features, text emotion features and high-level emotion features, to concatenate the enhanced features, to decode them with a decoder composed of fully connected layers, and to output the probability distribution of emotion classes; and
a speech emotion recognition module, configured to construct a cross-entropy loss function according to the probability distribution of the emotion classes and the labeled true emotion class labels, to train the pre-constructed speech emotion recognition model with the cross-entropy loss function and the loss function of contrast learning to obtain a trained speech emotion recognition model, and to perform speech emotion recognition on the voice video data to be recognized according to the trained speech emotion recognition model.
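As a rough orientation only, the skeleton below suggests how the apparatus modules could fit together in PyTorch; the stand-in 3D convolutional network, the pooling of GRU outputs, and all dimensions are simplified assumptions rather than the patented implementation, and the Fast RCNN fusion is presumed to have been applied to the video frames beforehand.

```python
# High-level skeleton of the multi-modal speech emotion recognizer (assumed design).
import torch
import torch.nn as nn

class SpeechEmotionRecognizer(nn.Module):
    def __init__(self, speech_dim, text_dim, hidden_dim, num_classes):
        super().__init__()
        self.speech_gru = nn.GRU(speech_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.text_gru = nn.GRU(text_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.video_cnn = nn.Sequential(                    # stand-in for the 3D convolutional network
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(16, 2 * hidden_dim),
        )
        self.decoder = nn.Sequential(                      # fully connected decoder
            nn.Linear(6 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, speech_vec, word_vec, fused_video):
        # fused_video: [batch, 3, frames, H, W], frames in which the Fast RCNN local
        # features (face, hands) have already been fused with the global feature map
        S = self.speech_gru(speech_vec)[0].mean(dim=1)     # pooled speech emotion feature
        T = self.text_gru(word_vec)[0].mean(dim=1)         # pooled text emotion feature
        V = self.video_cnn(fused_video)                    # high-level (visual) emotion feature
        probs = torch.softmax(self.decoder(torch.cat([S, T, V], dim=-1)), dim=-1)
        return probs, (S, T, V)                            # features reused by the contrastive loss
```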
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, carries out the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210825038.7A 2022-07-14 2022-07-14 Voice emotion recognition method and device based on multi-modal characteristics and contrast learning Active CN115240713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210825038.7A CN115240713B (en) 2022-07-14 2022-07-14 Voice emotion recognition method and device based on multi-modal characteristics and contrast learning

Publications (2)

Publication Number Publication Date
CN115240713A true CN115240713A (en) 2022-10-25
CN115240713B CN115240713B (en) 2024-04-16

Family

ID=83674345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210825038.7A Active CN115240713B (en) 2022-07-14 2022-07-14 Voice emotion recognition method and device based on multi-modal characteristics and contrast learning

Country Status (1)

Country Link
CN (1) CN115240713B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662440A (en) * 2022-12-27 2023-01-31 广州佰锐网络科技有限公司 Voiceprint feature identification method and system based on machine learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766462A (en) * 2018-06-21 2018-11-06 浙江中点人工智能科技有限公司 A kind of phonic signal character learning method based on Meier frequency spectrum first derivative
US11194972B1 (en) * 2021-02-19 2021-12-07 Institute Of Automation, Chinese Academy Of Sciences Semantic sentiment analysis method fusing in-depth features and time sequence models
CN114511906A (en) * 2022-01-20 2022-05-17 重庆邮电大学 Cross-modal dynamic convolution-based video multi-modal emotion recognition method and device and computer equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUNFENG ZHANG et al.: "Multi-head attention fusion networks for multi-modal speech emotion recognition", COMPUTERS & INDUSTRIAL ENGINEERING, vol. 168, 10 March 2022 (2022-03-10), XP087028022, DOI: 10.1016/j.cie.2022.108078 *
ZHANG JUNFENG: "Research on speech emotion recognition methods based on multi-feature and multi-modal fusion", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY SERIES (MONTHLY), 15 June 2023 (2023-06-15) *
CAO QI: "Research on speech emotion recognition methods based on multi-feature and multi-modal fusion", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY SERIES (MONTHLY), 15 March 2022 (2022-03-15), pages 1-59 *

Also Published As

Publication number Publication date
CN115240713B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN110444193B (en) Method and device for recognizing voice keywords
CN111368565B (en) Text translation method, text translation device, storage medium and computer equipment
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN108305612B (en) Text processing method, text processing device, model training method, model training device, storage medium and computer equipment
CN111898670B (en) Multi-mode emotion recognition method, device, equipment and storage medium
CN111833845B (en) Multilingual speech recognition model training method, device, equipment and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
KR20230040951A (en) Speech recognition method, apparatus and device, and storage medium
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN111950275B (en) Emotion recognition method and device based on recurrent neural network and storage medium
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN112214585A (en) Reply message generation method, system, computer equipment and storage medium
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
JP2022503812A (en) Sentence processing method, sentence decoding method, device, program and equipment
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114882862A (en) Voice processing method and related equipment
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN114550239A (en) Video generation method and device, storage medium and terminal
CN112183106A (en) Semantic understanding method and device based on phoneme association and deep learning
CN116310983A (en) Multi-mode emotion recognition method and device
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
CN114446324A (en) Multi-mode emotion recognition method based on acoustic and text features
CN115374784A (en) Chinese named entity recognition method based on multi-mode information selective fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant