CN111931795A - Multi-modal emotion recognition method and system based on subspace sparse feature fusion - Google Patents

Multi-modal emotion recognition method and system based on subspace sparse feature fusion Download PDF

Info

Publication number
CN111931795A
Authority
CN
China
Prior art keywords
feature
features
low
sparse
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011019175.9A
Other languages
Chinese (zh)
Other versions
CN111931795B (en
Inventor
李树涛
马付严
孙斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202011019175.9A priority Critical patent/CN111931795B/en
Publication of CN111931795A publication Critical patent/CN111931795A/en
Application granted granted Critical
Publication of CN111931795B publication Critical patent/CN111931795B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/513 Sparse representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal emotion recognition method and system based on subspace sparse feature fusion. The method comprises: obtaining feature sequences of multiple modalities and performing word-level alignment, normalization and position coding; inputting each sequence into a corresponding multi-branch sparse attention module; decomposing the resulting features into low-dimensional feature subspaces to obtain low-dimensional features; concatenating all low-dimensional features in each subspace according to their weights; training the concatenated features in a multi-branch sparse attention network to obtain fused multi-modal information; and inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, where the emotion classifier is pre-trained to establish the mapping between fused multi-modal information and emotion categories. By exploiting the sparsity of associations among time-series information and decomposing the multi-modal information into multiple subspaces for fusion, the method captures contextual information both within and across modalities and improves the accuracy of multi-modal emotion recognition.

Description

Multi-modal emotion recognition method and system based on subspace sparse feature fusion
Technical Field
The invention relates to multi-modal human-machine natural interaction technology, and in particular to a multi-modal emotion recognition method and system based on subspace sparse feature fusion.
Background
Multi-modal human-computer natural interaction faces the challenge of emotion: before a machine can interact naturally with humans, it must first understand and recognize human emotions. Emotion recognition has therefore become an important research topic in human-computer interaction and has developed rapidly in recent years. The accuracy of emotion recognition that relies solely on facial images or speech signals has reached a bottleneck, and such single-modality systems lack robustness. Compared with single-modality emotion recognition, multi-modal emotion recognition can exploit the emotional cues in speech, facial-expression images and text more comprehensively and thereby further improve recognition performance. As a result, an increasing number of researchers are focusing their attention on multi-modal emotion recognition.
However, multi-modal emotion recognition still faces several challenges. First, the representation and fusion of emotional features from different modalities: audio and video are collected by different sensors with different data formats and capture rates, and the unified representation and fusion of emotional features in multi-modal signals remains unsolved. Second, missing modality information: existing multi-modal emotion recognition methods generally assume that all modalities are completely acquired and do not consider the absence of a modality, yet noise and occlusion in real environments can cause audio or video modalities to be missing. Third, uncertainty in emotional features: language, gender and culture lead to differences in how specific emotional states are expressed in different scenarios.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to address the above problems in the prior art, the invention provides a multi-modal emotion recognition method and system based on subspace sparse feature fusion.
In order to solve the above technical problems, the invention adopts the following technical solution:
a multi-modal emotion recognition method based on subspace sparse feature fusion comprises the following steps:
1) acquiring the current feature sequences of multiple modalities of the object to be recognized;
2) performing word-level alignment and normalization on the feature sequences of the multiple modalities;
3) applying position coding to the feature sequence of each modality of the object to be recognized to obtain feature sequences with position information introduced, and then inputting the position-coded feature sequence of each modality into the corresponding multi-branch sparse attention module to obtain the high-dimensional features of each modality;
4) decomposing the high-dimensional features of each modality into low-dimensional feature subspaces to obtain low-dimensional features, assigning corresponding weights to the low-dimensional features, and then concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights to obtain concatenated low-dimensional features;
5) training the concatenated low-dimensional features in a multi-branch sparse attention network to obtain fused multi-modal information;
6) inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, wherein the emotion classifier is pre-trained to establish the mapping between fused multi-modal information and emotion categories.
Optionally, the features of the plurality of modalities in step 1) include a text feature sequence, an audio feature sequence, and a video feature sequence.
Optionally, step 2) comprises: aligning the audio feature sequence and the video feature sequence with the text feature sequence by recording the start and end times of the i-th word and averaging the audio and video features within the corresponding time span; normalizing the aligned text, audio and video feature sequences to the range [0,1]; and finally limiting the length of the text content, truncating the excess and zero-padding the shortfall, so that the feature dimensions of the text, audio and video feature sequences are unified to (20, 300), (20, 74) and (20, 35), respectively.
Optionally, the function expression of the position code in step 3) is as follows:
PE_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad X = X_0 + PE
In the above formulas, pos denotes the position of a single feature in the input feature sequence X, i denotes the dimension index, d denotes the overall dimension of the feature, PE_(pos,2i) denotes the position code of dimension 2i at position pos in the position-coding matrix PE, PE_(pos,2i+1) denotes the position code of dimension 2i+1 at position pos, X_0 denotes the input feature sequence, X denotes the feature sequence with position information introduced, and PE denotes the position-coding matrix.
Optionally, the processing of the position-coded input feature sequence by the multi-branch sparse attention module in step 3) comprises: first performing multi-head dimensionality reduction on the input position-coded feature sequence and extracting sparse attention (SparseAttention); in parallel, extracting the local correlations of the position-coded feature sequence through a convolution layer and activating the output with a gated linear unit; and then adding the extracted multi-head feature to the output activated by the gated linear unit to obtain the high-dimensional features of each modality;
Performing multi-head dimensionality reduction means projecting the input feature sequence into 6 different feature spaces according to the following formula to obtain the query, key and value features in each of the 6 feature spaces:
X_i^q = X W_i^q, \qquad X_i^k = X W_i^k, \qquad X_i^v = X W_i^v, \qquad i = 1, 2, \dots, 6
In the above formula, W_i^q, W_i^k and W_i^v are the query, key and value weight matrices, and X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th of the 6 feature spaces;
Extracting sparse attention (SparseAttention) means computing, from the query, key and value features projected into the 6 feature spaces, the sparse attention head head_i of the i-th feature space according to the following formula:
head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right) X_i^v
In the above formula, head_i is the head in the i-th feature space, SparseAttention is the sparse attention computation network, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, d_k is the dimension of the input feature sequence, and sparse(X_i^q X_i^{k\top}) is the sparse similarity matrix, computed as:
M = softmax\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right), \qquad T = sigmoid\big(linear(pool(CNN(M)))\big), \qquad sparse(X_i^q X_i^{k\top}) = relu(M - T)
In the above formula, X is the input feature sequence, M is the similarity matrix of the input features, softmax denotes the softmax function, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, T is an intermediate variable (the threshold map), sigmoid denotes the sigmoid function, linear is a linear function, pool() is a pooling operation with a 2 x 2 window and stride 1, CNN() is a convolution with a 1 x 1 kernel and stride 1, and the linear function regresses the sparse attention threshold from the pooled activation values;
The head head_i of each feature space is then concatenated into the multi-head feature using:
MultiHead(X_i^q, X_i^k, X_i^v) = concat(head_1, head_2, \dots, head_6)\, W^O
In the above formula, MultiHead(X_i^q, X_i^k, X_i^v) is the obtained multi-head feature, concat denotes concatenation along the feature dimension, head_1 to head_6 are the heads in the 1st to 6th feature spaces, and W^O is the output weight matrix;
The output activated by the gated linear unit is given by:
h_l(X) = (X \ast W + b) \otimes sigmoid(X \ast V + c)
In the above formula, W and V are different convolution kernels, b and c are bias parameters, X is the input feature sequence with position information introduced, and h_l(X) is the output activated by the gated linear unit.
Optionally, the high-dimensional features of each modality are decomposed into low-dimensional feature subspaces in step 4) to obtain the low-dimensional features according to the following expression:
X_i^L = X_f^L W_i^L, \qquad X_i^A = X_f^A W_i^A, \qquad X_i^V = X_f^V W_i^V, \qquad i = 1, 2, \dots, 6
In the above formula, X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in the i-th low-dimensional feature subspace, X_f^L, X_f^A and X_f^V are the three single-modality features containing context information, and W_i^L, W_i^A and W_i^V are the dimension-reduction matrices corresponding to the three modalities;
When corresponding weights are assigned to the low-dimensional features in step 4), the weights satisfy the following condition:
α_i + β_i + γ_i = 1, i = 1, 2, …, 6
where α_i, β_i and γ_i are the weights assigned to the low-dimensional features of the respective modalities;
The function expression for concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights in step 4) is as follows:
F_i = concat(\alpha_i X_i^L,\ \beta_i X_i^A,\ \gamma_i X_i^V)
In the above formula, F_i is the concatenated low-dimensional feature of the i-th low-dimensional feature subspace, α_i, β_i and γ_i are the weights assigned to the low-dimensional features, and X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in that subspace.
Optionally, step 5) comprises:
5.1) calculating the head of the multi-branch sparse attention network according to the following formula;
head_i^f = SparseAttention(F_i, F_i, F_i)
In the above formula, head_i^f denotes the i-th head of the multi-branch sparse attention network, SparseAttention is the sparse attention computation network, and F_i denotes the concatenated low-dimensional feature of the i-th low-dimensional feature subspace;
5.2) calculating the fused multi-modal information according to the following formula;
F_f = MultiHead(F, F, F) = concat(head_1^f, head_2^f, \dots, head_6^f)\, W_f^O
In the above formula, F_f denotes the fused multi-modal information, MultiHead(F, F, F) denotes the multi-head computation over the concatenated low-dimensional features, head_1^f to head_6^f are the 1st to 6th heads of the multi-branch sparse attention network, and W_f^O is the output weight matrix.
Optionally, the emotion classifier in step 5) is composed of a fully connected layer and a softmax function, where the computational expression of the fully connected layer is:
Y=WF f +B
In the above formula, F_f denotes the fused multi-modal information, B is the corresponding bias, W is the weight matrix of all neurons, Y is the output of the fully connected layer, and the dimension of Y equals the number of emotion categories;
wherein the computational expression of the softmax function is:
y_i = softmax(Y_i) = \frac{e^{Y_i}}{\sum_{j=1}^{N} e^{Y_j}}
In the above formula, y_i is the probability of the i-th emotion category, softmax is the normalized exponential function, Y_i is the output of the fully connected layer corresponding to the i-th emotion category, Y_j is the output corresponding to the j-th emotion category, and N is the number of emotion categories.
In addition, the invention also provides a multi-modal emotion recognition system based on subspace sparse feature fusion, which comprises a computer device, wherein the computer device at least comprises a microprocessor and a memory, the microprocessor is programmed or configured to execute the steps of the multi-modal emotion recognition method based on subspace sparse feature fusion, or the memory stores a computer program which is programmed or configured to execute the multi-modal emotion recognition method based on subspace sparse feature fusion.
In addition, the invention also provides a computer readable storage medium, wherein a computer program programmed or configured to execute the multi-modal emotion recognition method based on subspace sparse feature fusion is stored in the computer readable storage medium.
Compared with the prior art, the invention has the following advantages: the feature sequences of multiple modalities are acquired and subjected to word-level alignment, normalization and position coding, then input into the corresponding multi-branch sparse attention modules and decomposed into low-dimensional feature subspaces to obtain low-dimensional features; all low-dimensional features in each subspace are concatenated according to their weights and trained in a multi-branch sparse attention network to obtain fused multi-modal information; the fused multi-modal information is input into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, the emotion classifier having been pre-trained to establish the mapping between fused multi-modal information and emotion categories. By exploiting the sparsity of associations among time-series information and decomposing the multi-modal information into multiple subspaces for fusion, the invention captures contextual information within and across modalities and improves the accuracy of multi-modal emotion recognition.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a frame structure of the method according to the embodiment of the present invention.
Fig. 3 is a schematic diagram of a frame structure of a multi-branch sparse attention module according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1 and fig. 2, the multi-modal emotion recognition method based on subspace sparse feature fusion in the present embodiment includes:
1) acquiring the current feature sequences of multiple modalities of the object to be recognized;
2) performing word-level alignment and normalization on the feature sequences of the multiple modalities;
3) applying position coding to the feature sequence of each modality of the object to be recognized to obtain feature sequences with position information introduced, and then inputting the position-coded feature sequence of each modality into the corresponding multi-branch sparse attention module to obtain the high-dimensional features of each modality;
4) decomposing the high-dimensional features of each modality into low-dimensional feature subspaces to obtain low-dimensional features, assigning corresponding weights to the low-dimensional features, and then concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights to obtain concatenated low-dimensional features;
5) training the concatenated low-dimensional features in a multi-branch sparse attention network to obtain fused multi-modal information;
6) inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, wherein the emotion classifier is pre-trained to establish the mapping between fused multi-modal information and emotion categories.
It should be noted that the method of this embodiment does not depend on a specific modality or combination of modalities; it is compatible with various modalities that exist now or may appear in the future. As an optional implementation, the features of the multiple modalities in step 1) of this embodiment include a text feature sequence, an audio feature sequence and a video feature sequence, generated as follows. First, audio and facial video of the user are captured with a speech development board and a camera, and the denoised speech is automatically transcribed into text by a speech open platform. A pre-trained GloVe model then extracts the text feature sequence l = {l_1, l_2, l_3, …, l_{N_l}}, l_n ∈ R^300, where l_n is a word-vector feature, the dimension of a single text feature is 300, and N_l is the number of words in the recognized text. COVAREP extracts the audio feature sequence a = {a_1, a_2, a_3, …, a_{N_a}}, a_n ∈ R^74, where a_n is a frame-level feature, the dimension of a single audio feature is 74, and N_a is the number of audio frames after segmentation. Facet extracts the video feature sequence v = {v_1, v_2, v_3, …, v_{N_v}}, v_n ∈ R^35, where v_n is a frame-level feature, the dimension of a single video feature is 35, and N_v is the total number of video frames.
In this embodiment, step 2) comprises: aligning the audio feature sequence and the video feature sequence with the text feature sequence by recording the start and end times of the i-th word and averaging the audio and video features within the corresponding time span; normalizing the aligned text, audio and video feature sequences to the range [0,1]; and finally limiting the length of the text content (for example, to 20 words in this embodiment), truncating the excess and zero-padding the shortfall, so that the feature dimensions of the text, audio and video feature sequences are unified to (20, 300), (20, 74) and (20, 35), respectively.
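A minimal sketch of this alignment and normalization step is shown below, assuming the GloVe, COVAREP and Facet features and their frame timestamps have already been extracted; the function names, the use of NumPy and the exact min-max normalization are illustrative assumptions rather than the patented implementation.

    import numpy as np

    def word_align(text_feats, audio_feats, video_feats, word_spans,
                   audio_times, video_times, max_len=20):
        """Align audio/video features to words by averaging the frames that fall
        inside each word's [start, end) interval, then pad/truncate to max_len.
        text_feats: (N_words, 300); audio_feats: (N_a, 74); video_feats: (N_v, 35)."""
        def span_mean(feats, times, start, end):
            mask = (times >= start) & (times < end)
            return feats[mask].mean(axis=0) if mask.any() else np.zeros(feats.shape[1])

        a = np.stack([span_mean(audio_feats, audio_times, s, e) for s, e in word_spans])
        v = np.stack([span_mean(video_feats, video_times, s, e) for s, e in word_spans])

        def minmax(x):                      # normalize each feature to [0, 1]
            lo, hi = x.min(axis=0), x.max(axis=0)
            return (x - lo) / (hi - lo + 1e-8)

        def pad(x):                         # truncate the excess, zero-pad the shortfall
            x = x[:max_len]
            return np.pad(x, ((0, max_len - len(x)), (0, 0)))

        return pad(minmax(text_feats)), pad(minmax(a)), pad(minmax(v))  # (20,300), (20,74), (20,35)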
In this embodiment, the function expression of the position code in step 3) is shown as follows:
PE_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad X = X_0 + PE
In the above formulas, pos denotes the position of a single feature in the input feature sequence X, i denotes the dimension index, d denotes the overall dimension of the feature, PE_(pos,2i) denotes the position code of dimension 2i at position pos, PE_(pos,2i+1) denotes the position code of dimension 2i+1 at position pos, X_0 denotes the input feature sequence, X denotes the feature sequence with position information introduced, and PE denotes the position-coding matrix. Position coding preserves the relative position information in the feature sequence: a sine transform is applied at the even-numbered dimensions of the feature sequence and a cosine transform at the odd-numbered dimensions to obtain the position-coding matrix PE, which is then added to the original input X_0 to give the feature sequence X with position information introduced. In this embodiment, the aligned text, audio and video feature sequences l, a and v are position-coded and then input into their respective multi-branch sparse attention modules to learn the single-modality context information X^L, X^A and X^V.
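A minimal sketch of the sinusoidal position coding described above (sine on the even dimensions, cosine on the odd dimensions, added to the input sequence); the helper name is an assumption, and the shapes follow the (20, 300), (20, 74) and (20, 35) sequences of this embodiment.

    import numpy as np

    def positional_encoding(seq_len, d):
        """Position-coding matrix PE with sin on even and cos on odd dimensions."""
        pos = np.arange(seq_len)[:, None]                       # (seq_len, 1)
        i = np.arange(d)[None, :]                               # (1, d)
        angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
        pe = np.zeros((seq_len, d))
        pe[:, 0::2] = np.sin(angles[:, 0::2])
        pe[:, 1::2] = np.cos(angles[:, 1::2])
        return pe

    # X = X_0 + PE: add position information to an aligned feature sequence,
    # e.g. X_text = text_feats + positional_encoding(20, 300)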
As shown in fig. 3, the processing of the position-coded input feature sequence by the multi-branch sparse attention module in step 3) of this embodiment comprises: first performing multi-head dimensionality reduction on the input position-coded feature sequence and extracting sparse attention (SparseAttention); in parallel, extracting the local correlations of the position-coded feature sequence through a convolution layer and activating the output with a gated linear unit; and then adding the extracted multi-head feature to the output activated by the gated linear unit to obtain the high-dimensional features of each modality;
Performing multi-head dimensionality reduction means projecting the input feature sequence into 6 different feature spaces according to the following formula to obtain the query, key and value features in each of the 6 feature spaces:
X_i^q = X W_i^q, \qquad X_i^k = X W_i^k, \qquad X_i^v = X W_i^v, \qquad i = 1, 2, \dots, 6
In the above formula, W_i^q, W_i^k and W_i^v are the query, key and value weight matrices, and X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th of the 6 feature spaces; the query, key and value weight matrices project the input feature sequence X into the 6 different feature spaces;
Extracting sparse attention (SparseAttention) means computing, from the query, key and value features projected into the 6 feature spaces, the sparse attention head head_i of the i-th feature space according to the following formula:
head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right) X_i^v
In the above formula, head_i is the head in the i-th feature space, SparseAttention is the sparse attention computation network, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, d_k is the dimension of the input feature sequence, and sparse(X_i^q X_i^{k\top}) is the sparse similarity matrix, computed as:
M = softmax\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right), \qquad T = sigmoid\big(linear(pool(CNN(M)))\big), \qquad sparse(X_i^q X_i^{k\top}) = relu(M - T)
In the above formula, X is the input feature sequence, M is the similarity matrix of the input features, softmax denotes the softmax function, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, T is an intermediate variable (the threshold map), sigmoid denotes the sigmoid function, linear is a linear function, pool() is a pooling operation with a 2 x 2 window and stride 1, CNN() is a convolution with a 1 x 1 kernel and stride 1, and the linear function regresses the sparse attention threshold from the pooled activation values;
The head head_i of each feature space is then concatenated into the multi-head feature using:
MultiHead(X_i^q, X_i^k, X_i^v) = concat(head_1, head_2, \dots, head_6)\, W^O
In the above formula, MultiHead(X_i^q, X_i^k, X_i^v) is the obtained multi-head feature, concat denotes concatenation along the feature dimension, head_1 to head_6 are the heads in the 1st to 6th feature spaces, and W^O is the output weight matrix; the features obtained from the 6 feature spaces are concatenated at the output and multiplied by the output weight matrix W^O to produce the module output. Considering the temporal sparsity of long feature sequences, the multi-head sparse attention module of this embodiment computes the head of the i-th feature space as
head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right) X_i^v
where SparseAttention is the sparse attention computation network, X_i^q, X_i^k and X_i^v are the query, key and value features, and d_k is the dimension of the input feature sequence.
The output activated by the gated linear unit is given by:
h_l(X) = (X \ast W + b) \otimes sigmoid(X \ast V + c)
In the above formula, W and V are different convolution kernels, b and c are bias parameters, X is the input feature sequence with position information introduced, and h_l(X) is the output activated by the gated linear unit. The output of the convolution branch h_l(X) is the element-wise product of the convolution-layer output without nonlinear transformation and the convolution-layer output passed through a sigmoid nonlinearity.
In this embodiment, a sigmoid function is used as the activation for the threshold when computing sparse attention; its expression is:
T(i,j) = sigmoid\big(m(i,j)\big) = \frac{1}{1 + e^{-m(i,j)}}
In the above equation, sigmoid is the activation function, e^x is the exponential function, m(i,j) is the element at position (i,j) of the input threshold map to be activated, and T(i,j) is the element at position (i,j) of the output threshold map.
In this embodiment, the correlation weight matrix is sparsified using the linear rectification function relu, computed as:
f(M - T) = relu(M - T) = \max(0,\ M - T)
In the above formula, f(M - T) is the sparsified correlation weight matrix, M is the correlation matrix, and T is the threshold matrix. The values of the correlation matrix M are compared with those of the threshold matrix T, and the linear rectification function relu yields the final sparse attention matrix.
In this embodiment, the context feature information captured by sparse attention and the local feature information extracted by the convolution branch are added element-wise to obtain the single-modality features X_f^L, X_f^A and X_f^V.
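A PyTorch sketch of one multi-branch sparse attention module as described above: six sparse-attention heads whose threshold is regressed from a 1 x 1 convolution, 2 x 2 pooling, a linear layer and a sigmoid, the similarity matrix then being sparsified with relu(M - T), in parallel with a convolution branch activated by a gated linear unit, the two branches being added element-wise. The class names, the kernel size of the gated convolution branch, the choice of average pooling and the per-head dimensionality are assumptions of this sketch, not a definitive implementation of the patent.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseAttention(nn.Module):
        """Scaled dot-product attention whose similarity matrix is sparsified by
        a learned threshold: relu(M - sigmoid(linear(pool(cnn(M)))))."""
        def __init__(self, seq_len):
            super().__init__()
            self.cnn = nn.Conv2d(1, 1, kernel_size=1, stride=1)
            self.pool = nn.AvgPool2d(kernel_size=2, stride=1)
            self.linear = nn.Linear((seq_len - 1) ** 2, seq_len * seq_len)
            self.seq_len = seq_len

        def forward(self, q, k, v):                               # each (B, L, d_head)
            d_k = q.size(-1)
            m = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)   # (B, L, L)
            t = self.pool(self.cnn(m.unsqueeze(1))).flatten(1)    # pooled activation values
            t = torch.sigmoid(self.linear(t)).view(-1, self.seq_len, self.seq_len)
            return F.relu(m - t) @ v                              # sparsified attention output

    class MultiBranchSparseAttention(nn.Module):
        """Six sparse-attention heads plus a gated-convolution branch, summed element-wise."""
        def __init__(self, d_model, seq_len=20, n_heads=6):
            super().__init__()
            d_head = max(d_model // n_heads, 1)
            self.proj = nn.ModuleList(
                [nn.ModuleDict({c: nn.Linear(d_model, d_head, bias=False)
                                for c in ("q", "k", "v")}) for _ in range(n_heads)])
            self.attn = SparseAttention(seq_len)
            self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)
            # gated linear unit branch: two 1-D convolutions over the time axis (kernel size assumed)
            self.conv_a = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
            self.conv_b = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)

        def forward(self, x):                                     # x: (B, L, d_model)
            heads = [self.attn(p["q"](x), p["k"](x), p["v"](x)) for p in self.proj]
            multi = self.w_o(torch.cat(heads, dim=-1))            # concat(head_1..head_6) W^O
            xc = x.transpose(1, 2)                                # (B, d_model, L) for Conv1d
            glu = (self.conv_a(xc) * torch.sigmoid(self.conv_b(xc))).transpose(1, 2)
            return multi + glu                                    # single-modality feature X_f

For example, MultiBranchSparseAttention(300) applied to a (batch, 20, 300) text sequence returns the single-modality text feature X_f^L; the audio and video branches use their own modules with d_model = 74 and 35.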
In step 4) of this embodiment, the high-dimensional features of each modality are decomposed into low-dimensional feature subspaces to obtain the low-dimensional features according to the following expression:
X_i^L = X_f^L W_i^L, \qquad X_i^A = X_f^A W_i^A, \qquad X_i^V = X_f^V W_i^V, \qquad i = 1, 2, \dots, 6
In the above formula, X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in the i-th low-dimensional feature subspace, X_f^L, X_f^A and X_f^V are the three single-modality features containing context information, and W_i^L, W_i^A and W_i^V are the dimension-reduction matrices corresponding to the three modalities (these matrices project the input single-modality feature sequences X_f^L, X_f^A and X_f^V into 6 different low-dimensional feature spaces);
When corresponding weights are assigned to the low-dimensional features in step 4), the weights satisfy the following condition:
α_i + β_i + γ_i = 1, i = 1, 2, …, 6
where α_i, β_i and γ_i are the weights assigned to the low-dimensional features of the respective modalities;
The function expression for concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights in step 4) is as follows:
F_i = concat(\alpha_i X_i^L,\ \beta_i X_i^A,\ \gamma_i X_i^V)
In the above formula, F_i is the concatenated low-dimensional feature of the i-th low-dimensional feature subspace, α_i, β_i and γ_i are the weights assigned to the low-dimensional features, and X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in that subspace.
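The subspace decomposition and weighted concatenation of step 4) could be sketched as follows; the low-dimensional size d_low, the softmax parameterisation used to keep alpha_i + beta_i + gamma_i = 1, and all class and parameter names are assumptions introduced for illustration.

    import torch
    import torch.nn as nn

    class SubspaceFusion(nn.Module):
        """Project each modality into 6 low-dimensional subspaces and concatenate
        them with per-subspace weights satisfying alpha_i + beta_i + gamma_i = 1."""
        def __init__(self, d_text=300, d_audio=74, d_video=35, d_low=32, n_sub=6):
            super().__init__()
            self.w_l = nn.ModuleList([nn.Linear(d_text, d_low, bias=False) for _ in range(n_sub)])
            self.w_a = nn.ModuleList([nn.Linear(d_audio, d_low, bias=False) for _ in range(n_sub)])
            self.w_v = nn.ModuleList([nn.Linear(d_video, d_low, bias=False) for _ in range(n_sub)])
            # unconstrained logits, softmaxed so that the three weights of each subspace sum to 1
            self.logits = nn.Parameter(torch.zeros(n_sub, 3))

        def forward(self, x_l, x_a, x_v):                 # (B, L, d_text) / (B, L, d_audio) / (B, L, d_video)
            weights = torch.softmax(self.logits, dim=-1)  # rows are (alpha_i, beta_i, gamma_i)
            subspaces = []
            for i, (wl, wa, wv) in enumerate(zip(self.w_l, self.w_a, self.w_v)):
                a, b, g = weights[i]
                subspaces.append(torch.cat([a * wl(x_l), b * wa(x_a), g * wv(x_v)], dim=-1))
            return subspaces                              # list of F_1..F_6, each (B, L, 3 * d_low)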
In this embodiment, step 5) comprises:
5.1) calculating the head of the multi-branch sparse attention network according to the following formula;
head_i^f = SparseAttention(F_i, F_i, F_i)
In the above formula, head_i^f denotes the i-th head of the multi-branch sparse attention network, SparseAttention is the sparse attention computation network, and F_i denotes the concatenated low-dimensional feature of the i-th low-dimensional feature subspace;
5.2) calculating the fused multi-modal information according to the following formula;
F_f = MultiHead(F, F, F) = concat(head_1^f, head_2^f, \dots, head_6^f)\, W_f^O
In the above formula, F_f denotes the fused multi-modal information, MultiHead(F, F, F) denotes the multi-head computation over the concatenated low-dimensional features, head_1^f to head_6^f are the 1st to 6th heads of the multi-branch sparse attention network, and W_f^O is the output weight matrix. The emotion classifier in step 5) of this embodiment is composed of a fully connected layer and a softmax function, where the computational expression of the fully connected layer is:
Y=WF f +B
In the above formula, F_f denotes the fused multi-modal information, B is the corresponding bias, W is the weight matrix of all neurons, Y is the output of the fully connected layer, and the dimension of Y equals the number of emotion categories;
wherein the computational expression of the softmax function is:
y_i = softmax(Y_i) = \frac{e^{Y_i}}{\sum_{j=1}^{N} e^{Y_j}}
In the above formula, y_i is the probability of the i-th emotion category, softmax is the normalized exponential function, Y_i is the output of the fully connected layer corresponding to the i-th emotion category, Y_j is the output corresponding to the j-th emotion category, and N is the number of emotion categories. During training of the emotion classifier, the concatenated low-dimensional features are trained in the multi-branch sparse attention network to obtain the fused multi-modal information, the emotion category information is obtained through the emotion classifier, and the network parameters are solved. The loss function used to solve the network parameters is the L1 loss (L1Loss), whose expression is:
L1Loss = \frac{1}{n} \sum_{i=1}^{n} \left| y_i^p - y_i \right|
In the above formula, y_i^p denotes the predicted probability of the i-th emotion category, y_i is the probability of the i-th emotion category, and n is the number of emotion categories.
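A sketch of the inter-modality fusion of step 5) and the emotion classifier of step 6) follows, reusing the SparseAttention class from the earlier sketch: every concatenated subspace feature F_i drives one sparse-attention head, the six heads are concatenated and projected by W_f^O, and the fused feature is classified by a fully connected layer with softmax trained under an L1 loss. The temporal mean pooling of the fused sequence, the feature sizes and the class names are assumptions, not the exact patented network.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        """head_i^f = SparseAttention(F_i, F_i, F_i); F_f = concat(head_1^f..head_6^f) W_f^O."""
        def __init__(self, d_sub, seq_len=20, n_sub=6, d_out=192):
            super().__init__()
            self.attn = SparseAttention(seq_len)          # class defined in the earlier sketch
            self.w_f_o = nn.Linear(n_sub * d_sub, d_out, bias=False)

        def forward(self, subspaces):                     # list of 6 tensors (B, L, d_sub)
            heads = [self.attn(f, f, f) for f in subspaces]
            return self.w_f_o(torch.cat(heads, dim=-1))   # fused sequence (B, L, d_out)

    class EmotionClassifier(nn.Module):
        """Fully connected layer followed by softmax over N emotion categories."""
        def __init__(self, d_fused, n_classes):
            super().__init__()
            self.fc = nn.Linear(d_fused, n_classes)       # Y = W F_f + B

        def forward(self, f_fused):                       # (B, d_fused)
            return torch.softmax(self.fc(f_fused), dim=-1)

    # one training step with an L1 loss between predicted and target class probabilities
    fusion = CrossModalFusion(d_sub=96)
    classifier = EmotionClassifier(d_fused=192, n_classes=7)
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(list(fusion.parameters()) + list(classifier.parameters()), lr=1e-4)

    subspaces = [torch.randn(8, 20, 96) for _ in range(6)]    # stand-ins for F_1..F_6
    target = torch.eye(7)[torch.randint(0, 7, (8,))]          # one-hot emotion labels
    optimizer.zero_grad()
    probs = classifier(fusion(subspaces).mean(dim=1))         # temporal mean pooling (assumed)
    criterion(probs, target).backward()
    optimizer.step()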
The trained multi-modal emotion recognition network analyses the input audio, facial video and text and predicts the user's emotion category contained in the multi-modal data: high-dimensional features are extracted from audio, video and text and aligned at the word level; position information is introduced into the feature sequences by the position-coding module; for the three modalities, multi-branch sparse attention modules extract the intra-modality context information; the three modality features are then reduced into different subspaces and the modality features in each subspace are given corresponding weights; a multi-branch sparse attention module extracts the inter-modality context information, and the emotion classifier finally outputs the user's emotion category contained in the multi-modal data.
The multi-modal emotion recognition method based on subspace sparse feature fusion of this embodiment was verified experimentally. Table 1 shows the emotion recognition results of the multi-modal subspace information fusion network on the CMU-MOSI and CMU-MOSEI datasets, where Acc2 denotes binary emotion classification accuracy, Acc7 denotes seven-class emotion classification accuracy, and F1 denotes the binary F1 score. MFN is the method of Amir Zadeh et al., "Memory Fusion Network for Multi-view Sequential Learning" (Thirty-Second AAAI Conference on Artificial Intelligence, 2018), and RAVEN is the method of "Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors" (Thirty-Third AAAI Conference on Artificial Intelligence, 2019).
Table 1: evaluation results of the method of this embodiment on the public emotion recognition datasets.
[Table 1 is presented as an image in the original publication; it lists the Acc2, Acc7 and F1 scores of MFN, RAVEN and the proposed method on the CMU-MOSI and CMU-MOSEI datasets.]
As can be seen from Table 1, the multi-modal emotion recognition method based on subspace sparse feature fusion achieves accurate multi-modal emotion classification on the CMU-MOSI and CMU-MOSEI datasets.
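For reference, the Acc2, Acc7 and F1 metrics on CMU-MOSI/CMU-MOSEI are conventionally computed from continuous sentiment predictions in [-3, 3] as sketched below; this reflects the usual evaluation protocol of these datasets and is not a procedure taken from the patent itself.

    import numpy as np
    from sklearn.metrics import f1_score

    def mosi_metrics(y_pred, y_true):
        """Binary accuracy (Acc2), seven-class accuracy (Acc7) and weighted binary F1
        for sentiment scores in [-3, 3]."""
        y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
        acc7 = np.mean(np.round(np.clip(y_pred, -3, 3)) == np.round(np.clip(y_true, -3, 3)))
        pred_bin, true_bin = y_pred >= 0, y_true >= 0
        acc2 = np.mean(pred_bin == true_bin)
        f1 = f1_score(true_bin, pred_bin, average="weighted")
        return acc2, acc7, f1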
In addition, the present embodiment also provides a multi-modal emotion recognition system based on subspace sparse feature fusion, which includes a computer device, where the computer device at least includes a microprocessor and a memory, where the microprocessor is programmed or configured to execute the steps of the foregoing multi-modal emotion recognition method based on subspace sparse feature fusion, or the memory stores a computer program that is programmed or configured to execute the foregoing multi-modal emotion recognition method based on subspace sparse feature fusion. In addition, the embodiment also provides a computer readable storage medium, which stores a computer program programmed or configured to execute the foregoing multi-modal emotion recognition method based on subspace sparse feature fusion.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM and optical storage) having computer-usable program code embodied therein. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application, and it should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. A multi-modal emotion recognition method based on subspace sparse feature fusion is characterized by comprising the following steps:
1) acquiring the current feature sequences of multiple modalities of the object to be recognized;
2) performing word-level alignment and normalization on the feature sequences of the multiple modalities;
3) applying position coding to the feature sequence of each modality of the object to be recognized to obtain feature sequences with position information introduced, and then inputting the position-coded feature sequence of each modality into the corresponding multi-branch sparse attention module to obtain the high-dimensional features of each modality;
4) decomposing the high-dimensional features of each modality into low-dimensional feature subspaces to obtain low-dimensional features, assigning corresponding weights to the low-dimensional features, and then concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights to obtain concatenated low-dimensional features;
5) training the concatenated low-dimensional features in a multi-branch sparse attention network to obtain fused multi-modal information;
6) inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, wherein the emotion classifier is pre-trained to establish the mapping between fused multi-modal information and emotion categories.
2. The method for multi-modal emotion recognition based on subspace sparse feature fusion as recited in claim 1, wherein the features of the plurality of modalities in step 1) comprise a text feature sequence, an audio feature sequence and a video feature sequence.
3. The method for multi-modal emotion recognition based on subspace sparse feature fusion as recited in claim 2, wherein step 2) comprises: aligning the audio feature sequence and the video feature sequence with the text feature sequence by recording the start and end times of the i-th word and averaging the audio and video features within the corresponding time span; normalizing the aligned text, audio and video feature sequences to the range [0,1]; and finally limiting the length of the text content, truncating the excess and zero-padding the shortfall, so that the feature dimensions of the text, audio and video feature sequences are unified to (20, 300), (20, 74) and (20, 35), respectively.
4. The method for multi-modal emotion recognition based on subspace sparse feature fusion as claimed in claim 1, wherein the position-encoded function expression in step 3) is as follows:
PE_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad X = X_0 + PE
In the above formulas, pos denotes the position of a single feature in the input feature sequence X, i denotes the dimension index, d denotes the overall dimension of the feature, PE_(pos,2i) denotes the position code of dimension 2i at position pos in the position-coding matrix PE, PE_(pos,2i+1) denotes the position code of dimension 2i+1 at position pos, X_0 denotes the input feature sequence, X denotes the feature sequence with position information introduced, and PE denotes the position-coding matrix.
5. The method for multi-modal emotion recognition based on subspace sparse feature fusion as claimed in claim 2, wherein the processing of the position-coded input feature sequence by the multi-branch sparse attention module in step 3) comprises: first performing multi-head dimensionality reduction on the input position-coded feature sequence and extracting sparse attention (SparseAttention); in parallel, extracting the local correlations of the position-coded feature sequence through a convolution layer and activating the output with a gated linear unit; and then adding the extracted multi-head feature to the output activated by the gated linear unit to obtain the high-dimensional features of each modality;
Performing multi-head dimensionality reduction means projecting the input feature sequence into 6 different feature spaces according to the following formula to obtain the query, key and value features in each of the 6 feature spaces:
X_i^q = X W_i^q, \qquad X_i^k = X W_i^k, \qquad X_i^v = X W_i^v, \qquad i = 1, 2, \dots, 6
In the above formula, W_i^q, W_i^k and W_i^v are the query, key and value weight matrices, and X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th of the 6 feature spaces;
Extracting sparse attention (SparseAttention) means computing, from the query, key and value features projected into the 6 feature spaces, the sparse attention head head_i of the i-th feature space according to the following formula:
head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right) X_i^v
In the above formula, head_i is the head in the i-th feature space, SparseAttention is the sparse attention computation network, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, d_k is the dimension of the input feature sequence, and sparse(X_i^q X_i^{k\top}) is the sparse similarity matrix, computed as:
M = softmax\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right), \qquad T = sigmoid\big(linear(pool(CNN(M)))\big), \qquad sparse(X_i^q X_i^{k\top}) = relu(M - T)
In the above formula, X is the input feature sequence, M is the similarity matrix of the input features, softmax denotes the softmax function, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, T is an intermediate variable (the threshold map), sigmoid denotes the sigmoid function, linear is a linear function, pool() is a pooling operation with a 2 x 2 window and stride 1, CNN() is a convolution with a 1 x 1 kernel and stride 1, and the linear function regresses the sparse attention threshold from the pooled activation values;
The head head_i of each feature space is then concatenated into the multi-head feature using:
MultiHead(X_i^q, X_i^k, X_i^v) = concat(head_1, head_2, \dots, head_6)\, W^O
In the above formula, MultiHead(X_i^q, X_i^k, X_i^v) is the obtained multi-head feature, concat denotes concatenation along the feature dimension, head_1 to head_6 are the heads in the 1st to 6th feature spaces, and W^O is the output weight matrix;
the output activated by the gated linear unit is given by:
h_l(X) = (X \ast W + b) \otimes sigmoid(X \ast V + c)
In the above formula, W and V are different convolution kernels, b and c are bias parameters, X is the input feature sequence with position information introduced, and h_l(X) is the output activated by the gated linear unit.
6. The multi-modal emotion recognition method based on subspace sparse feature fusion as claimed in claim 2, wherein the functional expression of the low-dimensional features obtained by decomposing the high-dimensional features corresponding to each modality into the low-dimensional feature subspace in step 4) is as follows:
X_i^L = X_f^L W_i^L, \qquad X_i^A = X_f^A W_i^A, \qquad X_i^V = X_f^V W_i^V, \qquad i = 1, 2, \dots, 6
In the above formula, X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in the i-th low-dimensional feature subspace, X_f^L, X_f^A and X_f^V are the three single-modality features containing context information, and W_i^L, W_i^A and W_i^V are the dimension-reduction matrices corresponding to the three modalities;
When corresponding weights are assigned to the low-dimensional features in step 4), the weights satisfy the following condition:
α_i + β_i + γ_i = 1, i = 1, 2, …, 6
where α_i, β_i and γ_i are the weights assigned to the low-dimensional features of the respective modalities;
The function expression for concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights in step 4) is as follows:
F_i = concat(\alpha_i X_i^L,\ \beta_i X_i^A,\ \gamma_i X_i^V)
In the above formula, F_i is the concatenated low-dimensional feature of the i-th low-dimensional feature subspace, α_i, β_i and γ_i are the weights assigned to the low-dimensional features, and X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in that subspace.
7. The multi-modal emotion recognition method based on subspace sparse feature fusion as claimed in claim 6, wherein step 5) comprises:
1) calculating a head of the multi-branch sparse attention network according to the following formula;
head_i^f = SparseAttention(F_i, F_i, F_i)
In the above formula, head_i^f denotes the i-th head of the multi-branch sparse attention network, SparseAttention is the sparse attention computation network, and F_i denotes the concatenated low-dimensional feature of the i-th low-dimensional feature subspace;
2) calculating fused multi-modal information according to the following formula;
F_f = MultiHead(F, F, F) = concat(head_1^f, head_2^f, \dots, head_6^f)\, W_f^O
In the above formula, F_f denotes the fused multi-modal information, MultiHead(F, F, F) denotes the multi-head computation over the concatenated low-dimensional features, head_1^f to head_6^f are the 1st to 6th heads of the multi-branch sparse attention network, and W_f^O is the output weight matrix.
8. The multi-modal emotion recognition method based on subspace sparse feature fusion as claimed in claim 7, wherein the emotion classifier in step 5) is composed of a fully connected layer and a softmax function, where the computational expression of the fully connected layer is:
Y=WF f +B
In the above formula, F_f denotes the fused multi-modal information, B is the corresponding bias, W is the weight matrix of all neurons, Y is the output of the fully connected layer, and the dimension of Y equals the number of emotion categories;
wherein the computational expression of the softmax function is:
y_i = softmax(Y_i) = \frac{e^{Y_i}}{\sum_{j=1}^{N} e^{Y_j}}
In the above formula, y_i is the probability of the i-th emotion category, softmax is the normalized exponential function, Y_i is the output of the fully connected layer corresponding to the i-th emotion category, Y_j is the output corresponding to the j-th emotion category, and N is the number of emotion categories.
9. A multi-modal emotion recognition system based on subspace sparse feature fusion, comprising a computer device including at least a microprocessor and a memory, wherein the microprocessor is programmed or configured to perform the steps of the multi-modal emotion recognition method based on subspace sparse feature fusion of any one of claims 1 to 8, or the memory has stored therein a computer program programmed or configured to perform the multi-modal emotion recognition method based on subspace sparse feature fusion of any one of claims 1 to 8.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and is programmed or configured to execute the method for multi-modal emotion recognition based on subspace sparse feature fusion according to any one of claims 1 to 8.
CN202011019175.9A 2020-09-25 2020-09-25 Multi-modal emotion recognition method and system based on subspace sparse feature fusion Active CN111931795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011019175.9A CN111931795B (en) 2020-09-25 2020-09-25 Multi-modal emotion recognition method and system based on subspace sparse feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011019175.9A CN111931795B (en) 2020-09-25 2020-09-25 Multi-modal emotion recognition method and system based on subspace sparse feature fusion

Publications (2)

Publication Number Publication Date
CN111931795A true CN111931795A (en) 2020-11-13
CN111931795B CN111931795B (en) 2020-12-25

Family

ID=73335167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011019175.9A Active CN111931795B (en) 2020-09-25 2020-09-25 Multi-modal emotion recognition method and system based on subspace sparse feature fusion

Country Status (1)

Country Link
CN (1) CN111931795B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190286889A1 (en) * 2018-03-19 2019-09-19 Buglife, Inc. Lossy facial expression training data pipeline
CN110188343A (en) * 2019-04-22 2019-08-30 浙江工业大学 Multi-modal emotion identification method based on fusion attention network
CN110209824A (en) * 2019-06-13 2019-09-06 中国科学院自动化研究所 Text emotion analysis method based on built-up pattern, system, device
CN111460213A (en) * 2020-03-20 2020-07-28 河海大学 Music emotion classification method based on multi-mode learning

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633364A (en) * 2020-12-21 2021-04-09 上海海事大学 Multi-modal emotion recognition method based on Transformer-ESIM attention mechanism
CN112633364B (en) * 2020-12-21 2024-04-05 上海海事大学 Multimode emotion recognition method based on transducer-ESIM attention mechanism
CN112598067A (en) * 2020-12-25 2021-04-02 中国联合网络通信集团有限公司 Emotion classification method and device for event, electronic equipment and storage medium
CN113177546A (en) * 2021-04-30 2021-07-27 中国科学技术大学 Target detection method based on sparse attention module
CN113378942A (en) * 2021-06-16 2021-09-10 中国石油大学(华东) Small sample image classification method based on multi-head feature cooperation
CN114022668A (en) * 2021-10-29 2022-02-08 北京有竹居网络技术有限公司 Method, device, equipment and medium for aligning text with voice
CN114022668B (en) * 2021-10-29 2023-09-22 北京有竹居网络技术有限公司 Method, device, equipment and medium for aligning text with voice
CN113870259A (en) * 2021-12-02 2021-12-31 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion assessment method, device, equipment and storage medium
CN113870259B (en) * 2021-12-02 2022-04-01 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion assessment method, device, equipment and storage medium
WO2023098524A1 (en) * 2021-12-02 2023-06-08 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion evaluation method and apparatus, device, and storage medium
CN114926716A (en) * 2022-04-08 2022-08-19 山东师范大学 Learning participation degree identification method, device and equipment and readable storage medium
CN114863520A (en) * 2022-04-25 2022-08-05 陕西师范大学 Video expression recognition method based on C3D-SA

Also Published As

Publication number Publication date
CN111931795B (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN111931795B (en) Multi-modal emotion recognition method and system based on subspace sparse feature fusion
CN109522818B (en) Expression recognition method and device, terminal equipment and storage medium
CN108804530B (en) Subtitling areas of an image
WO2021233112A1 (en) Multimodal machine learning-based translation method, device, equipment, and storage medium
CN113420807A (en) Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
Huang et al. Multimodal continuous emotion recognition with data augmentation using recurrent neural networks
CN113822192A (en) Method, device and medium for identifying emotion of escort personnel based on Transformer multi-modal feature fusion
CN113723166A (en) Content identification method and device, computer equipment and storage medium
Wöllmer et al. Analyzing the memory of BLSTM neural networks for enhanced emotion classification in dyadic spoken interactions
Zhang et al. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
CN110705490B (en) Visual emotion recognition method
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN112926525A (en) Emotion recognition method and device, electronic equipment and storage medium
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN113435421B (en) Cross-modal attention enhancement-based lip language identification method and system
Ousmane et al. Automatic recognition system of emotions expressed through the face using machine learning: Application to police interrogation simulation
Cornia et al. A unified cycle-consistent neural model for text and image retrieval
CN117132923A (en) Video classification method, device, electronic equipment and storage medium
Xue et al. Lcsnet: End-to-end lipreading with channel-aware feature selection
Hou et al. Confidence-guided self refinement for action prediction in untrimmed videos
Yang et al. Self-adaptive context and modal-interaction modeling for multimodal emotion recognition
CN116226347A (en) Fine granularity video emotion content question-answering method and system based on multi-mode data
WO2021129410A1 (en) Method and device for text processing
CN114462073A (en) De-identification effect evaluation method and device, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant