CN116933051A - Multi-mode emotion recognition method and system for modal missing scene - Google Patents

Multi-mode emotion recognition method and system for modal missing scene

Info

Publication number
CN116933051A
CN116933051A
Authority
CN
China
Prior art keywords
features
mode
feature
text
missing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310840266.6A
Other languages
Chinese (zh)
Inventor
罗威
赖韩江
印鉴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202310840266.6A priority Critical patent/CN116933051A/en
Publication of CN116933051A publication Critical patent/CN116933051A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to the technical field of emotion recognition, in particular to a multi-mode emotion recognition method and system for a mode missing scene, comprising the following steps: acquiring missing condition features and multi-mode joint features; carrying out missing mode feature reconstruction on the high-level features of each mode, the multi-mode joint features and the missing condition features by utilizing a self-attention mechanism to obtain multi-mode reconstruction features; and mapping the reconstructed visual features and the reconstructed audio features to the reconstructed text feature space, and carrying out feature fusion between every two modes by utilizing a multi-mode gating fusion mechanism to obtain text visual fusion features and text audio fusion features, so as to classify emotion categories and obtain emotion category prediction results. According to the application, by reconstructing the missing multi-mode data and considering the semantic feature differences among modes, the classification robustness under the mode missing scene is enhanced, the emotion classification accuracy is improved, and the method has a good development prospect in practical application.

Description

Multi-mode emotion recognition method and system for modal missing scene
Technical Field
The application relates to the technical field of emotion recognition, in particular to a multi-mode emotion recognition method and system for a mode missing scene.
Background
With the rapid development of applications such as social media, network live broadcast and online short video, people generate and spread information faster and faster. In these applications, the large amounts of visual, audio and text data produced by users carry information about feelings and emotions. However, conventional emotion recognition methods are often based on only one or a few of these data sources and ignore the importance of fusing multiple sources. Multi-modal emotion recognition technology arose to address this problem: it acquires visual, audio and text signals and integrates them to realize emotion recognition. Compared with a single modality, multi-modal emotion recognition captures information more comprehensively and more accurately, so researchers have gradually explored the fusion of visual, audio, text and other modalities and how to better apply multi-modal emotion recognition in different application scenarios; the technology shows broad development prospects in fields such as video surveillance, the internet and intelligent customer service. However, existing multi-modal emotion recognition methods generally assume that the data of all modalities are complete, whereas in practical applications the input data are often incomplete: one or more modalities may be missing or of low quality due to factors such as acquisition failure, noise or transmission loss, so how to perform robust emotion recognition under such modality-missing conditions remains a challenge.
At present, the multi-mode emotion recognition method under the condition of modal deficiency is mainly divided into the following two types:
1) Methods that recover the data of the missing mode through data interpolation or a generative network. The generative network adopted by such methods has high requirements on the amount of training data; with limited training data the network is difficult to converge, and the generation effect for the missing-mode data is poor;
2) Feature-fusion-based methods, which fuse the features extracted from multiple modes into a joint representation and use the joint representation for emotion classification. The difficulty of such methods is that, because mode missing exists in the input samples, it is hard to directly learn a robust joint representation; meanwhile, these methods do not consider that the contributions of the modes differ, but simply fuse the features of the multiple modes, ignoring the differences in semantic richness among different modes, so the emotion classification accuracy is low.
Disclosure of Invention
The application provides a multi-mode emotion recognition method and system for a modal missing scene, which solve the technical problems that existing multi-modal emotion recognition methods for modality missing not only require a large amount of training data but also ignore the differences in semantic richness among different modalities.
In order to solve the technical problems, the application provides a multi-mode emotion recognition method and system for a mode missing scene.
In a first aspect, the present application provides a method for identifying multi-modal emotion for a modal missing scene, the method comprising the steps of:
extracting features of an original video sample to obtain primary features, and carrying out missing-condition coding on the primary features to obtain missing condition features;
extracting the high-level features of each mode according to the primary features, and splicing and fusing the high-level features of each mode to obtain multi-mode joint features;
carrying out missing mode feature reconstruction on the high-level features of each mode, the multi-mode joint features and the missing condition features by using a self-attention mechanism to obtain multi-mode reconstruction features; wherein the multi-modal reconstructed features include reconstructed visual features, reconstructed audio features, and reconstructed text features;
mapping the reconstructed visual features and the reconstructed audio features to the reconstructed text feature space through a linear layer, and carrying out feature fusion between every two modes by utilizing a multi-mode gating fusion mechanism to obtain text visual fusion features and text audio fusion features;
and carrying out emotion type classification according to the text visual fusion characteristics and the text audio fusion characteristics to obtain emotion type prediction results.
In a further embodiment, the primary features include primary visual features, primary audio features, and primary text features, the modal high-level features include high-level visual features, high-level audio features, and high-level text features, and the step of extracting modal high-level features from the primary features includes:
coding the primary visual features through a long short-term memory network to obtain a visual coding output sequence, and carrying out maximum pooling on the visual coding output sequence to obtain advanced visual features;
coding the primary audio features through a long short-term memory network to obtain an audio coding output sequence, and carrying out maximum pooling on the audio coding output sequence to obtain advanced audio features;
and encoding the primary text features through a text classification network to obtain advanced text features.
In a further embodiment, the expression for the missing condition feature is:
f_i = MLP([I_v, I_a, I_t])
wherein f_i represents the missing condition feature; MLP represents a multi-layer perceptron; I_v indicates the presence of the visual modality; I_a indicates the presence of the audio modality; I_t indicates the presence of the text modality.
In a further embodiment, the step of reconstructing missing mode features of the high-level features of each mode, the multi-mode joint features and the missing condition features by using a self-attention mechanism to obtain multi-mode reconstructed features includes:
splicing the high-level features of each mode, the multi-mode joint features and the missing condition features to obtain an input feature sequence;
mapping the input feature sequence into query matrix features, key matrix features and value matrix features through a linear layer;
according to the query matrix characteristics and the key matrix characteristics, calculating to obtain a self-attention matrix;
and carrying out dot product operation on the self-attention matrix and the value matrix characteristic to obtain a multi-mode reconstruction characteristic.
In a further embodiment, the self-attention matrix is calculated by the formula:
A' = softmax(K^T · Q / √dim)
wherein A' represents the self-attention matrix; softmax represents the normalization operation; T represents the transpose symbol; K represents the key matrix feature; Q represents the query matrix feature; dim represents the dimension of the linear layer network used to encode the query matrix feature, the key matrix feature and the value matrix feature.
In a further embodiment, the loss function for training the reconstruction of missing modality features is a reconstruction loss function, and the loss function for training the emotion classification is a classification cross entropy loss function, wherein the calculation formula of the reconstruction loss function is:
L_rec = Σ_{s∈{v,a,t}} MSE(f'_s, f_s^pre)
In the formula, L_rec represents the reconstruction loss function; MSE represents the mean square error between the reconstructed features and the pre-training features; f'_s represents a multi-mode reconstruction feature; f_s^pre represents the pre-acquired pre-training feature; v represents the visual modality; a represents the audio modality; t represents the text modality.
In a further embodiment, the text visual fusion feature is calculated as:
h_{t,v} = z * h_t + (1 − z) * h_v
h_t = tanh(W_t · f'_t)
h_v = tanh(W_v · f'_{v→t})
z = σ(W_z · [h_t, h_v])
In the formula, h_{t,v} represents the text visual fusion feature; z represents the relative importance of the text modality and the visual modality; W_t represents the weight matrix of the text modality; f'_t represents the reconstructed text feature; W_v represents the weight matrix of the visual modality; f'_{v→t} represents the reconstructed visual feature mapped to the reconstructed text feature space by the linear layer; W_z represents the relative importance weight matrix.
In a second aspect, the present application provides a multi-modal emotion recognition system for a modal absence scenario, the system comprising:
the missing condition coding module is used for extracting the features of the original video sample to obtain primary features, and carrying out missing-condition coding on the primary features to obtain missing condition features;
the high-level feature extraction module is used for extracting high-level features of all modes according to the primary features, and splicing and fusing the high-level features of all modes to obtain multi-mode joint features;
the missing mode reconstruction module is used for reconstructing missing mode features of the high-level features of each mode, the multi-mode joint features and the missing condition features by using a self-attention mechanism to obtain multi-mode reconstruction features; wherein the multi-modal reconstructed features include reconstructed visual features, reconstructed audio features, and reconstructed text features;
the feature mapping fusion module is used for mapping the reconstructed visual features and the reconstructed audio features to the reconstructed text feature space through a linear layer, and carrying out feature fusion between every two modes by utilizing a multi-mode gating fusion mechanism to obtain text visual fusion features and text audio fusion features;
and the emotion classification and identification module is used for classifying emotion types according to the text visual fusion characteristics and the text audio fusion characteristics to obtain emotion type prediction results.
Meanwhile, in a third aspect, the present application also provides a computer device, including a processor and a memory, where the processor is connected to the memory, the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the computer device performs steps for implementing the method.
In a fourth aspect, the present application also provides a computer readable storage medium having stored therein a computer program which when executed by a processor performs the steps of the above method.
The application provides a multi-mode emotion recognition method and a system for a modal missing scene, wherein the method comprises the steps of encoding missing condition features and multi-mode joint features, and carrying out missing modal feature reconstruction on each modal advanced feature, the multi-mode joint features and the missing condition features by adopting a self-attention mechanism to obtain multi-mode reconstruction features; mapping the reconstructed visual features and the reconstructed audio features to a reconstructed text feature space, and carrying out feature fusion between two modes by utilizing a multi-mode gating fusion mechanism to obtain text visual fusion features and text audio fusion features; and carrying out emotion type classification according to the text visual fusion characteristics and the text audio fusion characteristics to obtain emotion type prediction results. Compared with the prior art, the method realizes the reconstruction of the missing mode features by a self-attention mechanism, takes the multi-mode joint feature codes and the multi-mode missing condition codes as additional input information for the reconstruction of the missing mode features, thereby assisting the reconstruction of the missing mode features and improving the classification robustness under the mode missing scene; meanwhile, the application dynamically carries out the two-mode fusion by utilizing a multi-mode gating fusion mechanism, fully considers the importance difference between modes and improves the accuracy of emotion classification and identification.
Drawings
FIG. 1 is a schematic flow chart of a multi-mode emotion recognition method for a modal missing scene provided by an embodiment of the application;
FIG. 2 is a diagram illustrating a multi-modal emotion recognition process according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a missing modality feature reconstruction process provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a multi-mode gating fusion mechanism according to an embodiment of the present application;
FIG. 5 is a block diagram of a multimodal emotion recognition system for a modality deficiency scene provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following embodiments are given for the purpose of illustration only and are not to be construed as limiting the application; the drawings are for reference and description only and are not to be construed as limiting the scope of the application, since many variations are possible without departing from the spirit and scope of the application.
Referring to fig. 1, an embodiment of the present application provides a multi-modal emotion recognition method for a modal missing scene, as shown in fig. 1, the method includes the following steps:
s1, extracting features of an original video sample to obtain primary features, and carrying out deletion condition coding on the primary features to obtain deletion condition features; wherein the primary features include primary visual features, primary audio features, and primary text features.
Specifically, after the original video sample is obtained, preliminary feature extraction must be performed so that the sample can be fed into the neural network, yielding frame-level primary features. Facial expression frames are first detected and cropped from the original video sample, and the primary visual feature x_v is then extracted with a DenseNet pre-trained on the Facial Expression Recognition Plus (FER+) corpus; its dimension is T × d_v, where T is the number of frames of the video clip of the input sentence and d_v is the dimension of the primary visual feature. The frame-level primary audio feature x_a, of dimension T × d_a, is extracted with the OpenSMILE tool. The primary text feature x_t, of dimension T × d_t, is extracted using a pre-trained BERT-Large model.
This embodiment assumes that one or two modes may be missing from the input; for a three-mode input there are therefore 6 possible missing patterns in total. The missing-condition code is a digital code for each missing pattern; for example, if the audio mode of the multi-mode input (comprising the visual mode, the audio mode and the text mode) is missing, the corresponding missing-condition code can be written as [1, 0, 1]. In the missing-condition code, this embodiment preferably uses 0 to indicate that a mode is missing and 1 to indicate that it is present. Taking the audio-mode-missing case shown in fig. 2 as an example, the multi-mode input of the network is [x_v, x_a(miss), x_t] and the corresponding missing-condition code is [1, 0, 1]. The missing-condition code is input into a multi-layer perceptron for encoding to obtain a missing condition feature f_i of dimension d, and f_i is used as additional information in the feature reconstruction input to assist the reconstruction of the missing mode features. In this embodiment, the expression of the missing condition feature is:
f_i = MLP([I_v, I_a, I_t])
wherein f_i represents the missing condition feature; MLP represents a multi-layer perceptron; I_v indicates whether the visual mode is present: if the visual mode is present it is coded as 1, otherwise as 0; I_a indicates whether the audio mode is present: if the audio mode is present it is coded as 1, otherwise as 0; I_t indicates whether the text mode is present: if the text mode is present it is coded as 1, otherwise as 0; v represents the visual mode; a represents the audio mode; t represents the text mode.
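As a concrete illustration of this encoding step, the following PyTorch-style sketch is one possible implementation; the class name, the two-layer MLP structure and the hidden dimension d = 128 are assumptions chosen for the example, not details fixed by the text.

```python
import torch
import torch.nn as nn

class MissingConditionEncoder(nn.Module):
    """Encode the modality-presence indicators [I_v, I_a, I_t] into the missing condition feature f_i."""

    def __init__(self, d: int = 128):
        super().__init__()
        # Small multi-layer perceptron: 3 presence bits -> d-dimensional missing condition feature
        self.mlp = nn.Sequential(
            nn.Linear(3, d),
            nn.ReLU(),
            nn.Linear(d, d),
        )

    def forward(self, presence: torch.Tensor) -> torch.Tensor:
        # presence: (batch, 3), where 1.0 means the modality is present and 0.0 means it is missing
        return self.mlp(presence)

# Example: audio mode missing -> missing-condition code [1, 0, 1]
encoder = MissingConditionEncoder(d=128)
f_i = encoder(torch.tensor([[1.0, 0.0, 1.0]]))  # f_i has shape (1, 128)
```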
In the network training process, since the condition of mode missing needs to be simulated in the network input, this embodiment performs manual construction on an existing multi-modal emotion recognition data set to obtain a multi-modal emotion recognition data set containing mode missing. Assume that (x_v^i, x_a^i, x_t^i, y_i) is one input sample of the original data set after primary feature extraction; for the cases in which one or two modes are missing, 6 input patterns containing mode missing can be constructed from it, namely:
(x_v(miss)^i, x_a^i, x_t^i, y_i), (x_v^i, x_a(miss)^i, x_t^i, y_i), (x_v^i, x_a^i, x_t(miss)^i, y_i),
(x_v(miss)^i, x_a(miss)^i, x_t^i, y_i), (x_v(miss)^i, x_a^i, x_t(miss)^i, y_i), (x_v^i, x_a(miss)^i, x_t(miss)^i, y_i)
where i represents the i-th sample of the original data set; y_i represents the true emotion category corresponding to the sample; and (miss) indicates that the corresponding mode is in a missing state in the input.
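The construction of these six variants can be sketched as follows; the function name and the use of None to mark a dropped modality are illustrative assumptions (in practice the dropped feature could equally be replaced by a zero tensor, as described later for inference).

```python
from itertools import combinations

def build_missing_variants(x_v, x_a, x_t, y):
    """Construct the 6 modality-missing variants of one complete sample (one or two modalities dropped)."""
    modalities = {"v": x_v, "a": x_a, "t": x_t}
    variants = []
    # Drop every subset of size 1 or 2 (3 + 3 = 6 missing patterns)
    for k in (1, 2):
        for missing in combinations(modalities, k):
            sample = {m: (None if m in missing else feat) for m, feat in modalities.items()}
            presence = [0.0 if m in missing else 1.0 for m in ("v", "a", "t")]
            variants.append((sample, presence, y))
    return variants

# Example: each complete training sample yields 6 modality-missing training samples
variants = build_missing_variants(x_v="vis_feat", x_a="aud_feat", x_t="txt_feat", y=2)
assert len(variants) == 6
```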
S2, extracting the high-level features of all modes according to the primary features, and performing splicing and fusion on the high-level features of all modes to obtain multi-mode joint features.
In this embodiment, the primary features are frame-level sequence features: for any mode, a sentence-level video segment is extracted into a sequence of length T, and there is usually contextual correlation between neighbouring frames of the sequence, so the primary features need to be further encoded to obtain the advanced features of each mode. The advanced features include advanced visual features, advanced audio features and advanced text features, and the step of extracting the advanced features of each mode according to the primary features includes:
to capture the contextual information of the image sequence, coding the primary visual feature x_v through a long short-term memory network (LSTM) to obtain a visual coding output sequence, and carrying out max pooling on the visual coding output sequence through a max-pooling layer to obtain the advanced visual feature f_v, whose dimension is d;
coding the primary audio feature through a long short-term memory network (LSTM) to obtain an audio coding output sequence, and carrying out max pooling on the audio coding output sequence to obtain the advanced audio feature f_a, whose dimension is d;
coding the primary text feature through a text classification network (TextCNN) to obtain the advanced text feature f_t, whose dimension is also d.
Then, this embodiment concatenates the advanced visual feature, the advanced audio feature and the advanced text feature along the feature dimension to obtain a spliced feature of dimension d_m = d_v + d_a + d_t = 3d, which is input into a multi-layer perceptron for fusion to obtain the multi-mode joint feature f_m. The calculation formula of the multi-mode joint feature f_m is:
f_m = MLP(Concat([f_v, f_a, f_t]))
It should be noted that the multi-mode joint feature f_m obtained through the above processing fuses the information of the available modes into a single feature, which is equivalent to global information shared by the multiple modes. This embodiment uses f_m as additional information in the subsequent missing mode feature reconstruction input, which can assist the reconstruction of the missing mode features and improve the classification robustness in the mode missing scene.
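As an illustration of how these encoders and the joint-feature MLP could be wired together, consider the sketch below; the class names, the TextCNN kernel sizes and the dimension d = 128 are assumptions for the example rather than values prescribed by the text.

```python
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    """LSTM over a frame-level sequence followed by max pooling, giving a d-dimensional advanced feature."""
    def __init__(self, in_dim: int, d: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, d, batch_first=True)

    def forward(self, x):                       # x: (batch, T, in_dim)
        out, _ = self.lstm(x)                   # (batch, T, d)
        return out.max(dim=1).values            # max pooling over time -> (batch, d)

class TextCNNEncoder(nn.Module):
    """A minimal TextCNN: 1-D convolutions with several kernel sizes, max-pooled and projected to d."""
    def __init__(self, in_dim: int, d: int = 128, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv1d(in_dim, d, k, padding=k // 2) for k in kernel_sizes])
        self.proj = nn.Linear(d * len(kernel_sizes), d)

    def forward(self, x):                       # x: (batch, T, in_dim)
        x = x.transpose(1, 2)                   # (batch, in_dim, T) for Conv1d
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.proj(torch.cat(pooled, dim=1))   # (batch, d)

def joint_feature(f_v, f_a, f_t, mlp):
    """f_m = MLP(Concat([f_v, f_a, f_t])): fuse the advanced features into the multi-mode joint feature."""
    return mlp(torch.cat([f_v, f_a, f_t], dim=-1))   # (batch, d)

# Usage with assumed dimensions (d = 128):
# mlp = nn.Sequential(nn.Linear(3 * 128, 128), nn.ReLU(), nn.Linear(128, 128))
# f_m = joint_feature(f_v, f_a, f_t, mlp)
```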
S3, carrying out missing mode feature reconstruction on the high-level features of each mode, the multi-mode joint features and the missing condition features by using a self-attention mechanism to obtain multi-mode reconstruction features; wherein the multi-modal reconstructed features include reconstructed visual features, reconstructed audio features, and reconstructed text features.
In this embodiment, the step of obtaining the multi-modal reconstruction feature by performing the missing modal feature reconstruction on the high-level feature of each mode, the multi-modal joint feature and the missing condition feature by using a self-attention mechanism includes:
splicing the high-level features of each mode, the multi-mode joint features and the missing condition features to obtain an input feature sequence;
mapping the input feature sequence into query matrix features, key matrix features and value matrix features through a linear layer;
according to the query matrix characteristics and the key matrix characteristics, calculating to obtain a self-attention matrix;
and carrying out dot product operation on the self-attention matrix and the value matrix characteristic to obtain a multi-mode reconstruction characteristic.
Specifically, as shown in fig. 3, this embodiment adopts a self-attention mechanism to reconstruct the missing mode features. It should be noted that the self-attention mechanism can capture the internal correlation of a sequence. The advanced features of the modes (including those for which the mode is missing), the multi-mode joint feature and the missing condition feature are spliced to obtain the input feature sequence F = [f_v, f_a, f_t, f_m, f_i], and F is mapped through a linear layer into the query matrix feature Q, the key matrix feature K and the value matrix feature V, where Q = W_q · F, K = W_k · F, V = W_v · F, and W_q, W_k, W_v are the corresponding linear-layer weight matrices. For ease of understanding: the query matrix feature Q is a matrix computed from the input feature sequence and is used to compute the similarity score between each element of the sequence and the other elements; the key matrix feature K is a matrix computed from the input feature sequence and is used in the dot-product operation with the query matrix to obtain the weight of each sequence element; the value matrix feature V is a representation of the input feature sequence that is used to compute the weighted sum of the sequence elements, and usually represents the corresponding input features or hidden states.
Then, this embodiment calculates the self-attention matrix A' from the query matrix feature Q and the key matrix feature K, where the calculation formula of the self-attention matrix A' is:
A' = softmax(K^T · Q / √dim)
wherein A' represents the self-attention matrix; softmax represents the normalization operation; T represents the transpose symbol; K represents the key matrix feature; Q represents the query matrix feature; dim represents the dimension of the linear layer network used to encode the query matrix feature, the key matrix feature and the value matrix feature.
It should be noted that the self-attention matrix in this embodiment is used to represent the autocorrelation of the input feature sequence F, that is, the similarity score between each feature and all features of the sequence; a dot-product operation with the value matrix feature V is then carried out to obtain the multi-mode reconstruction features, whose calculation formula is:
F' = V · A'
wherein F' represents a multi-modal reconstruction feature; v represents a value matrix feature; a' represents a self-attention matrix.
In this embodiment, the dimension of the multi-mode reconstruction features F' is consistent with that of the input feature sequence, and F' can be represented as F' = [f'_v, f'_a, f'_t, f'_m, f'_i], corresponding position by position to the input feature sequence; the first three reconstructed features [f'_v, f'_a, f'_t] are taken as the reconstructed visual feature, the reconstructed audio feature and the reconstructed text feature. Because this embodiment carries out a self-attention operation, each feature in the output multi-mode reconstruction features is obtained by weighting and summing the features in the input feature sequence, with the self-attention matrix A' as the weights, computed dynamically from the feature interactions inside the input feature sequence; and since the input feature sequence is constructed from the advanced features of the three modes, the multi-mode joint feature f_m and the missing condition feature f_i, all of this information is taken into account when generating each output multi-mode reconstruction feature. The loss function for training the reconstruction of the missing mode features is the reconstruction loss function, whose calculation formula is:
L_rec = Σ_{s∈{v,a,t}} MSE(f'_s, f_s^pre)
In the formula, L_rec represents the reconstruction loss function; MSE represents the mean square error between the reconstructed features and the pre-training features; f'_s represents a multi-mode reconstruction feature; f_s^pre represents the pre-acquired pre-training feature, which in this embodiment is extracted by a set of pre-trained feature extractors. These extractors adopt a network structure composed of LSTM and TextCNN, but take the complete multi-mode input: after the advanced feature of each single mode is extracted, the features are directly spliced and input into a linear layer for classification pre-training, and the pre-training features are obtained in this way. The mean square error between the reconstruction features and the pre-training features then measures the quality of the reconstruction: the smaller the mean square error, the better the feature reconstruction effect, and reducing this error is one of the training objectives of the network.
It should be noted that the advanced feature of a single mode provides mode-specific information, such as visual expression changes, the tone and intonation of the audio, and the tone-assisting words of the text; the multi-mode joint feature f_m provides information shared (generic) across modes, such as the emotional tendency information contained in the modes; and the missing condition feature f_i indicates which modes are missing from the input data. Because there is a certain correlation among the multiple modes, the reconstruction of a missing mode depends on the existing modes; explicitly indicating the missing condition of the input data to the reconstruction network therefore guides the learning of the self-attention matrix A', makes the reconstruction focus more on generating the features of the missing modes, and improves the effect of missing mode reconstruction.
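A compact sketch of the reconstruction step described above is given below; it treats the five features as a length-5 token sequence per sample, and the module name and dimensions are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class MissingModalityReconstructor(nn.Module):
    """Self-attention over the sequence [f_v, f_a, f_t, f_m, f_i] to reconstruct the per-modality features."""

    def __init__(self, d: int = 128, dim: int = 128):
        super().__init__()
        self.w_q = nn.Linear(d, dim, bias=False)  # query projection
        self.w_k = nn.Linear(d, dim, bias=False)  # key projection
        self.w_v = nn.Linear(d, dim, bias=False)  # value projection
        self.dim = dim

    def forward(self, f_v, f_a, f_t, f_m, f_i):
        # Stack the five d-dimensional features into a length-5 sequence: (batch, 5, d)
        F = torch.stack([f_v, f_a, f_t, f_m, f_i], dim=1)
        Q, K, V = self.w_q(F), self.w_k(F), self.w_v(F)  # each (batch, 5, dim)
        # Row-major equivalent of A' = softmax(K^T · Q / sqrt(dim)) and F' = V · A' from the text
        A = torch.softmax(Q @ K.transpose(1, 2) / math.sqrt(self.dim), dim=-1)  # (batch, 5, 5)
        F_prime = A @ V  # each output feature is a weighted sum of all five input features
        # The first three positions are the reconstructed visual, audio and text features
        return F_prime[:, 0], F_prime[:, 1], F_prime[:, 2]
```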
S4, mapping the reconstructed visual features and the reconstructed audio features to the reconstructed text feature space through a linear layer, and carrying out feature fusion between two modes by utilizing a multi-mode gating fusion mechanism to obtain text visual fusion features and text audio fusion features.
As shown in fig. 4, in order to reduce the semantic differences among the three modes and improve the subsequent multi-mode fusion effect, this embodiment maps the reconstructed visual feature f'_v and the reconstructed audio feature f'_a to the reconstructed text feature space through a linear layer, and denotes the mapped reconstructed visual feature and the mapped reconstructed audio feature as f'_{v→t} and f'_{a→t}. Their calculation formulas are respectively:
f'_{v→t} = Linear(f'_v)
f'_{a→t} = Linear(f'_a)
In the formula, f'_{v→t} represents the reconstructed visual feature after mapping; f'_{a→t} represents the reconstructed audio feature after mapping.
After the reconstructed visual feature and the reconstructed audio feature are mapped to the reconstructed text feature space, a multi-mode gating fusion unit is used to mine the differences in importance among the modes and dynamically carry out feature fusion between every two modes; the text visual fusion feature and the text audio fusion feature are denoted h_{t,v} and h_{t,a} respectively. Taking the fusion of the text mode and the visual mode as an example, the multi-mode gating fusion unit is calculated as:
h_{t,v} = z * h_t + (1 − z) * h_v
h_t = tanh(W_t · f'_t)
h_v = tanh(W_v · f'_{v→t})
z = σ(W_z · [h_t, h_v])
In the formula, h_{t,v} represents the text visual fusion feature; z represents the relative importance of the text mode and the visual mode, whose value range is preferably set to [0, 1] in this embodiment; W_t represents the weight matrix of the text mode; f'_t represents the reconstructed text feature; W_v represents the weight matrix of the visual mode; f'_{v→t} represents the reconstructed visual feature mapped to the reconstructed text feature space by the linear layer; W_z represents the relative importance weight matrix.
In fig. 4, tanh represents a linear mapping operation+tanh activation function operation; sigma represents a sigmoid activation function; 1-represents 1 minus the value of the cell to which it is directed.
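The gating fusion unit of fig. 4 can be sketched as follows, following the formulas above; the module name and dimension are assumptions, and the same unit would be applied once to the text-visual pair and once to the text-audio pair.

```python
import torch
import torch.nn as nn

class GatedFusionUnit(nn.Module):
    """Gated fusion of two modality features: h = z * h_t + (1 - z) * h_v with z = sigmoid(W_z [h_t, h_v])."""

    def __init__(self, d: int = 128):
        super().__init__()
        self.w_t = nn.Linear(d, d, bias=False)      # W_t: text weight matrix
        self.w_v = nn.Linear(d, d, bias=False)      # W_v: visual (or audio) weight matrix
        self.w_z = nn.Linear(2 * d, d, bias=False)  # W_z: relative importance weight matrix

    def forward(self, f_t, f_other):
        h_t = torch.tanh(self.w_t(f_t))             # text branch
        h_v = torch.tanh(self.w_v(f_other))         # mapped visual (or audio) branch
        z = torch.sigmoid(self.w_z(torch.cat([h_t, h_v], dim=-1)))  # relative importance in [0, 1]
        return z * h_t + (1 - z) * h_v

# Usage: h_tv = fuse(f_t_rec, f_v_mapped); h_ta = fuse(f_t_rec, f_a_mapped)
```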
S5, carrying out emotion type classification according to the text visual fusion characteristics and the text audio fusion characteristics to obtain emotion type prediction results.
Specifically, this embodiment splices the text visual fusion feature h_{t,v} and the text audio fusion feature h_{t,a}, and then carries out emotion category classification through a linear layer and a softmax function to obtain the predicted emotion category probability distribution p; the category with the largest predicted probability is the final emotion category prediction result:
p = softmax(Linear([h_{t,v}, h_{t,a}]))
In this embodiment, in the training process of the whole network, the loss function for training emotion classification is a classification cross entropy loss function, and the expression of the classification cross entropy loss function is:
L_cls = (1/N) Σ_{i=1}^{N} H(p_i, q_i)
In the formula, L_cls represents the classification cross entropy loss function; N represents the number of samples in the data set; i represents the i-th sample currently being computed; H represents the cross entropy function; p represents the emotion category probability distribution predicted by the model; q represents the true one-hot probability distribution of the sample.
In this embodiment, the reconstruction loss function and the classification cross entropy loss function together serve as the training loss of the whole network, and an Adam optimizer is adopted to learn the learnable parameters in the network. After the network has been trained, if a piece of video needs to be recognized, the video is preprocessed into the three modes of vision, audio and text, and the advanced feature of each mode is extracted; if one or two modes are missing, the corresponding advanced features are replaced by zero-vector filling, thereby realizing emotion recognition on video samples that may have missing modes.
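One way the two losses could be combined in a single training step is sketched below; the model interface, the weighting coefficient lambda_rec and the learning rate are assumptions not specified in the text.

```python
import torch
import torch.nn.functional as F_nn

def train_step(model, optimizer, batch, lambda_rec=1.0):
    """One training step combining the classification and reconstruction losses (interface assumed)."""
    # Assumed model interface: returns emotion logits plus dicts of reconstructed and
    # pre-trained reference features keyed by modality ("v", "a", "t").
    logits, reconstructed, pretrained = model(batch["x_v"], batch["x_a"], batch["x_t"], batch["presence"])
    loss_cls = F_nn.cross_entropy(logits, batch["labels"])           # classification cross entropy loss
    loss_rec = sum(F_nn.mse_loss(reconstructed[m], pretrained[m].detach())
                   for m in ("v", "a", "t"))                         # reconstruction (MSE) loss
    loss = loss_cls + lambda_rec * loss_rec                          # combined training loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (assumed): optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```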
The embodiment of the application provides a multi-mode emotion recognition method for a modal missing scene, which takes missing condition features and multi-mode joint features as additional input information for missing modal feature reconstruction based on a self-attention mechanism, assists in missing modal feature reconstruction to obtain multi-mode reconstruction features, maps feature space after completing missing modal feature reconstruction, and maps reconstructed visual features and reconstructed audio features to reconstructed text feature space, thereby dynamically fusing modes two by two through a multi-mode gating fusion mechanism to obtain final fusion characterization for emotion classification. Compared with the traditional multi-mode emotion recognition method, the missing mode feature reconstruction method based on the self-attention mechanism can better recover the feature semantics of the missing mode, and improves the classification robustness in the mode missing scene; and meanwhile, semantic difference among modes can be better captured through a multi-mode gating fusion mechanism, so that the final emotion classification accuracy is improved, and the method has good popularization and application values.
It should be noted that the sequence numbers of the above processes do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and shall not constitute any limitation on the implementation process of the embodiments of the present application.
In one embodiment, as shown in fig. 5, an embodiment of the present application provides a multi-modal emotion recognition system for a modal missing scene, the system including:
the missing condition coding module 101 is configured to perform feature extraction on an original video sample to obtain primary features, and perform missing condition coding on the primary features to obtain missing condition features;
the advanced feature extraction module 102 is configured to extract advanced features of each mode according to the primary features, and splice and fuse the advanced features of each mode to obtain multi-mode joint features;
the missing mode reconstruction module 103 is configured to reconstruct missing mode features of the high-level features of each mode, the multi-mode joint features and the missing condition features by using a self-attention mechanism, so as to obtain multi-mode reconstruction features; wherein the multi-modal reconstructed features include reconstructed visual features, reconstructed audio features, and reconstructed text features;
the feature mapping fusion module 104 is configured to map the reconstructed visual feature and the reconstructed audio feature to the reconstructed text feature space through a linear layer, and perform feature fusion between two modes by using a multi-mode gating fusion mechanism, so as to obtain a text visual fusion feature and a text audio fusion feature;
and the emotion classification and identification module 105 is used for classifying emotion types according to the text visual fusion characteristics and the text audio fusion characteristics to obtain emotion type prediction results.
For a specific limitation of a multi-modal emotion recognition system for a modal missing scene, reference may be made to the above limitation of a multi-modal emotion recognition method for a modal missing scene, and the description thereof will not be repeated here. Those of ordinary skill in the art will appreciate that the various modules and steps described in connection with the disclosed embodiments of the application may be implemented in hardware, software, or a combination of both. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application provides a multi-mode emotion recognition system for a modal missing scene, which acquires missing condition features and multi-mode joint features through the missing condition coding module and the advanced feature extraction module; the missing mode reconstruction module takes the missing condition features and the multi-mode joint features as additional input information for the missing mode feature reconstruction based on a self-attention mechanism, thereby assisting the missing mode feature reconstruction to obtain multi-mode reconstruction features; the feature mapping fusion module and the emotion classification recognition module dynamically fuse the modes between text-audio and text-vision in pairs, so that the importance differences between modes are fully mined and the final emotion classification accuracy is improved. The system utilizes the multi-mode gating fusion module to mine the modal differences between text-audio and text-vision and dynamically fuse the features, takes the missing condition features and the multi-mode joint features as additional input, improves the reconstruction effect of the missing mode features, ensures that the features input for emotion classification contain richer information, and improves the accuracy of emotion classification.
FIG. 6 is a diagram of a computer device including a memory, a processor, and a transceiver connected by a bus, according to an embodiment of the present application; the memory is used to store a set of computer program instructions and data and the stored data may be transferred to the processor, which may execute the program instructions stored by the memory to perform the steps of the above-described method.
Wherein the memory may comprise volatile memory or nonvolatile memory, or may comprise both volatile and nonvolatile memory; the processor may be a central processing unit, a microprocessor, an application specific integrated circuit, a programmable logic device, or a combination thereof. By way of example and not limitation, the programmable logic device described above may be a complex programmable logic device, a field programmable gate array, general purpose array logic, or any combination thereof.
In addition, the memory may be a physically separate unit or may be integrated with the processor.
It will be appreciated by those of ordinary skill in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be implemented, and that a particular computer device may include more or fewer components than those shown, or may combine some of the components, or have the same arrangement of components.
In one embodiment, an embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described method.
The multi-mode emotion recognition method and system for the modal missing scene provided by the embodiment of the application can capture the deep modal differences among different modes, provide missing condition features and multi-mode combined features to assist in missing modal feature reconstruction, and improve the emotion classification accuracy and robustness under the modal missing scene.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., SSD), etc.
Those skilled in the art will appreciate that implementing all or part of the above described embodiment methods may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed, may comprise the steps of embodiments of the methods described above.
The foregoing examples represent only a few preferred embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the application. It should be noted that modifications and substitutions can be made by those skilled in the art without departing from the technical principles of the present application, and such modifications and substitutions should also be considered to be within the scope of the present application. Therefore, the protection scope of the patent of the application is subject to the protection scope of the claims.

Claims (10)

1. The multi-mode emotion recognition method for the modal missing scene is characterized by comprising the following steps of:
extracting features of an original video sample to obtain primary features, and carrying out missing-condition coding on the primary features to obtain missing condition features;
extracting the high-level features of each mode according to the primary features, and splicing and fusing the high-level features of each mode to obtain multi-mode joint features;
carrying out missing mode feature reconstruction on the high-level features of each mode, the multi-mode joint features and the missing condition features by using a self-attention mechanism to obtain multi-mode reconstruction features; wherein the multi-modal reconstructed features include reconstructed visual features, reconstructed audio features, and reconstructed text features;
mapping the reconstructed visual features and the reconstructed audio features to the reconstructed text feature space through a linear layer, and carrying out feature fusion between every two modes by utilizing a multi-mode gating fusion mechanism to obtain text visual fusion features and text audio fusion features;
and carrying out emotion type classification according to the text visual fusion characteristics and the text audio fusion characteristics to obtain emotion type prediction results.
2. The method for multimodal emotion recognition for a modality-missing scene of claim 1, wherein said primary features include primary visual features, primary audio features, and primary textual features, and wherein each modality-advanced feature includes an advanced visual feature, an advanced audio feature, and an advanced textual feature, and wherein said step of extracting each modality-advanced feature from said primary features comprises:
coding the primary visual features through a long short-term memory network to obtain a visual coding output sequence, and carrying out maximum pooling on the visual coding output sequence to obtain advanced visual features;
coding the primary audio features through a long short-term memory network to obtain an audio coding output sequence, and carrying out maximum pooling on the audio coding output sequence to obtain advanced audio features;
and encoding the primary text features through a text classification network to obtain advanced text features.
3. The method for identifying multi-modal emotion in a modal absence scenario of claim 1, wherein the expression of the missing condition feature is:
f_i = MLP([I_v, I_a, I_t])
wherein f_i represents the missing condition feature; MLP represents a multi-layer perceptron; I_v indicates the presence of the visual modality; I_a indicates the presence of the audio modality; I_t indicates the presence of the text modality.
4. The method for identifying multi-modal emotion in a modal missing scene as claimed in claim 1, wherein said step of reconstructing missing modal features from said high-level features of each modal, said multi-modal joint features and said missing condition features by using a self-attention mechanism includes:
splicing the high-level features of each mode, the multi-mode joint features and the missing condition features to obtain an input feature sequence;
mapping the input feature sequence into query matrix features, key matrix features and value matrix features through a linear layer;
according to the query matrix characteristics and the key matrix characteristics, calculating to obtain a self-attention matrix;
and carrying out dot product operation on the self-attention matrix and the value matrix characteristic to obtain a multi-mode reconstruction characteristic.
5. The method for identifying multi-modal emotion in a modal absence scenario of claim 4, wherein the self-attention matrix is calculated by the formula:
A' = softmax(K^T · Q / √dim)
wherein A' represents the self-attention matrix; softmax represents the normalization operation; T represents the transpose symbol; K represents the key matrix feature; Q represents the query matrix feature; dim represents the dimension of the linear layer network used to encode the query matrix feature, the key matrix feature and the value matrix feature.
6. The method for identifying multi-modal emotion in a modal absence scenario of claim 1, wherein the loss function for training the reconstruction of missing modal features is a reconstruction loss function, and the loss function for training emotion classification is a classification cross entropy loss function, and wherein the calculation formula of the reconstruction loss function is as follows:
L_rec = Σ_{s∈{v,a,t}} MSE(f'_s, f_s^pre)
In the formula, L_rec represents the reconstruction loss function; MSE represents the mean square error between the reconstructed features and the pre-training features; f'_s represents a multi-mode reconstruction feature; f_s^pre represents the pre-acquired pre-training feature; v represents the visual modality; a represents the audio modality; t represents the text modality.
7. The method for identifying multi-modal emotion in a modal absence scenario of claim 1, wherein the text visual fusion feature is calculated by the formula:
h_{t,v} = z * h_t + (1 − z) * h_v
h_t = tanh(W_t · f'_t)
h_v = tanh(W_v · f'_{v→t})
z = σ(W_z · [h_t, h_v])
In the formula, h_{t,v} represents the text visual fusion feature; z represents the relative importance of the text modality and the visual modality; W_t represents the weight matrix of the text modality; f'_t represents the reconstructed text feature; W_v represents the weight matrix of the visual modality; f'_{v→t} represents the reconstructed visual feature mapped to the reconstructed text feature space by the linear layer; W_z represents the relative importance weight matrix.
8. A multi-modal emotion recognition system for a modal absence scenario, the system comprising:
the missing condition coding module is used for extracting the features of the original video sample to obtain primary features, and carrying out missing-condition coding on the primary features to obtain missing condition features;
the high-level feature extraction module is used for extracting high-level features of all modes according to the primary features, and splicing and fusing the high-level features of all modes to obtain multi-mode joint features;
the missing mode reconstruction module is used for reconstructing missing mode features of the high-level features of each mode, the multi-mode joint features and the missing condition features by using a self-attention mechanism to obtain multi-mode reconstruction features; wherein the multi-modal reconstructed features include reconstructed visual features, reconstructed audio features, and reconstructed text features;
the feature mapping fusion module is used for mapping the reconstructed visual features and the reconstructed audio features to the reconstructed text feature space through a linear layer, and carrying out feature fusion between every two modes by utilizing a multi-mode gating fusion mechanism to obtain text visual fusion features and text audio fusion features;
and the emotion classification and identification module is used for classifying emotion types according to the text visual fusion characteristics and the text audio fusion characteristics to obtain emotion type prediction results.
9. A computer device, characterized by: comprising a processor and a memory, the processor being connected to the memory, the memory being for storing a computer program, the processor being for executing the computer program stored in the memory to cause the computer device to perform the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized by: the computer readable storage medium having stored therein a computer program which, when executed, implements the method of any of claims 1 to 7.
CN202310840266.6A 2023-07-10 2023-07-10 Multi-mode emotion recognition method and system for modal missing scene Pending CN116933051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310840266.6A CN116933051A (en) 2023-07-10 2023-07-10 Multi-mode emotion recognition method and system for modal missing scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310840266.6A CN116933051A (en) 2023-07-10 2023-07-10 Multi-mode emotion recognition method and system for modal missing scene

Publications (1)

Publication Number Publication Date
CN116933051A true CN116933051A (en) 2023-10-24

Family

ID=88383665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310840266.6A Pending CN116933051A (en) 2023-07-10 2023-07-10 Multi-mode emotion recognition method and system for modal missing scene

Country Status (1)

Country Link
CN (1) CN116933051A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576571A (en) * 2024-01-16 2024-02-20 汉中中园农业科技发展(集团)有限公司 Multi-mode fruit and vegetable leaf disease identification method and system based on images and texts
CN117576571B (en) * 2024-01-16 2024-04-26 汉中中园农业科技发展(集团)有限公司 Multi-mode fruit and vegetable leaf disease identification method and system based on images and texts


Similar Documents

Publication Publication Date Title
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN112860888B (en) Attention mechanism-based bimodal emotion analysis method
CN113762322A (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN116720004B (en) Recommendation reason generation method, device, equipment and storage medium
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN115131638B (en) Training method, device, medium and equipment for visual text pre-training model
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN114973062A (en) Multi-modal emotion analysis method based on Transformer
CN114549850B (en) Multi-mode image aesthetic quality evaluation method for solving modal missing problem
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
CN113642604A (en) Audio and video auxiliary tactile signal reconstruction method based on cloud edge cooperation
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN114661951A (en) Video processing method and device, computer equipment and storage medium
CN117391051B (en) Emotion-fused common attention network multi-modal false news detection method
CN117251791B (en) Multi-mode irony detection method based on global semantic perception of graph
CN115640418B (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
CN116977701A (en) Video classification model training method, video classification method and device
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN115982652A (en) Cross-modal emotion analysis method based on attention network
CN115169472A (en) Music matching method and device for multimedia data and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination