CN111931795A - Multi-modal emotion recognition method and system based on subspace sparse feature fusion - Google Patents
Multi-modal emotion recognition method and system based on subspace sparse feature fusion
- Publication number
- CN111931795A (application CN202011019175.9A)
- Authority
- CN
- China
- Prior art keywords
- feature
- features
- low
- sparse
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/513—Sparse representations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a multi-modal emotion recognition method and system based on subspace sparse feature fusion. The method comprises: obtaining feature sequences of multiple modalities; carrying out word-level alignment, normalization and position coding; inputting each sequence into a corresponding multi-branch sparse attention module; decomposing the resulting features into low-dimensional feature subspaces to obtain low-dimensional features; cascading all the low-dimensional features in each subspace based on weights; obtaining fused multi-modal information through training in a multi-branch sparse attention network; and inputting the fused information into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, wherein the emotion classifier is pre-trained to establish a mapping between the fused multi-modal information and emotion categories. By considering the sparsity of associations among time-series information and decomposing the multi-modal information into multiple subspaces for fusion, the method can capture context information both within and across modalities and improve the accuracy of multi-modal emotion recognition.
Description
Technical Field
The invention relates to a multi-modal man-machine natural interaction technology, in particular to a multi-modal emotion recognition method and system based on subspace sparse feature fusion.
Background
Multi-modal human-computer natural interaction faces emotional challenges; to overcome them, the problem of a machine understanding and recognizing human emotion must be solved first. Emotion recognition is therefore an important research subject in the field of human-computer interaction and has developed rapidly in recent years. The accuracy of emotion recognition using a facial image or a voice signal alone has reached a bottleneck, and its robustness is poor. Compared with single-modal emotion recognition, multi-modal emotion recognition can more comprehensively exploit the emotion signals in speech, facial expression images and text, further improving recognition performance. Thus, an increasing number of researchers are focusing their attention on multi-modal emotion recognition.
However, many challenges in multi-modal emotion recognition remain to be solved, mainly: First, the representation and fusion of emotional features from different modalities. Audio and video information is collected by different sensors with different data formats and capture rates, and the unified representation and fusion of emotional features in multi-modal signals remains an open problem. Second, missing modality information. Existing multi-modal emotion recognition methods generally assume that multi-modal information is completely acquired and do not consider the absence of a modality, yet noise and occlusion in real environments can cause audio or video modalities to be missing. Third, the uncertainty of emotional features. Language, gender and culture lead to differences in how specific emotional states are expressed in different scenarios.
Disclosure of Invention
The technical problems to be solved by the invention are as follows: in order to solve the problems in the prior art, the invention provides a multi-modal emotion recognition method and system based on subspace sparse feature fusion.
In order to solve the technical problems, the invention adopts the technical scheme that:
a multi-modal emotion recognition method based on subspace sparse feature fusion comprises the following steps:
1) acquiring the current feature sequences of multiple modalities of the identified object;
2) carrying out word-level alignment and normalization on the feature sequences of the multiple modalities;
3) applying position coding to each modality's feature sequence to obtain feature sequences with position information introduced, and then inputting each of these sequences into the corresponding multi-branch sparse attention module to obtain the high-dimensional features of each modality;
4) decomposing the high-dimensional features of each modality into low-dimensional feature subspaces to obtain low-dimensional features, assigning corresponding weights to the low-dimensional features, and then cascading all the low-dimensional features in each subspace based on the weights to obtain cascaded low-dimensional features;
5) training the cascaded low-dimensional features in a multi-branch sparse attention network to obtain fused multi-modal information;
6) inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the identified object, wherein the emotion classifier is pre-trained to establish a mapping between the fused multi-modal information and emotion categories.
Optionally, the features of the plurality of modalities in step 1) include a text feature sequence, an audio feature sequence, and a video feature sequence.
Optionally, step 2) comprises: aligning the audio feature sequence and the video feature sequence with the text feature sequence; recording the start time and end time of the i-th word and averaging the audio and video features within the corresponding time period; normalizing the aligned text, audio and video feature sequences to the range [0, 1]; and finally limiting the length of the text content, truncating any excess and zero-padding any shortfall, so that the feature dimensions of the text, audio and video feature sequences are unified to (20, 300), (20, 74) and (20, 35), respectively.
Optionally, the position code in step 3) is given by:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
X = X0 + PE

where pos denotes the position of a single feature in the input feature sequence X, i denotes the dimension of the feature, d denotes the overall dimension of the feature, PE(pos, 2i) denotes the position code at dimension 2i of position pos in the position-coding matrix PE, PE(pos, 2i+1) denotes the position code at dimension 2i+1 of position pos, X0 denotes the input feature sequence, X denotes the feature sequence with position information introduced, and PE denotes the position-coding matrix.
Optionally, the processing of the input feature sequence with position information by the multi-branch sparse attention module in step 3) comprises: first performing multi-head dimensionality reduction on the input sequence and extracting sparse attention SparseAttention; in parallel, extracting the local correlations of the input sequence through a convolution layer and activating the output through a gated linear unit; and then adding the extracted multi-head features to the gated-linear-unit output to obtain the high-dimensional features of each modality;
performing multi-head dimensionality reduction means projecting the input into 6 different feature spaces according to the following formulas to obtain the query, key and value features in each space:

X_i^q = X W_i^q, X_i^k = X W_i^k, X_i^v = X W_i^v, i = 1, …, 6

where W_i^q, W_i^k, W_i^v are the query, key and value weight matrices respectively, and X_i^q, X_i^k, X_i^v are the query, key and value features projected into the 6 different feature spaces;
wherein extracting sparse attention SparseAttention means computing, for the query, key and value features projected into the 6 feature spaces, the sparse attention head head_i in the i-th feature space according to:

head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse(X_i^q X_i^kT / √d_k) X_i^v

where head_i is the head in the i-th feature space, SparseAttention is the sparse attention computation network, X_i^q, X_i^k, X_i^v are the query, key and value features projected into the i-th feature space, d_k is the dimension of the input feature sequence, and sparse(·) is the sparse similarity matrix, computed as:

M = X_i^q X_i^kT / √d_k
Τ = sigmoid(linear(pool(CNN(M))))
sparse(M) = softmax(relu(M − Τ))

where X is the input feature sequence, M is the similarity matrix of the input features, softmax denotes the softmax function, Τ is the threshold matrix (an intermediate variable), sigmoid denotes the sigmoid function, linear is a linear function, pool() is a pooling window of size 2 × 2 with stride 1, and CNN() denotes a convolution with kernel size 1 × 1 and stride 1; the linear function regresses the sparse attention threshold from the pooled activation values;
then the heads head_i in the feature spaces are cascaded into the multi-head feature using:

MultiHead(X_i^q, X_i^k, X_i^v) = concat(head_1, …, head_6) W^O

where MultiHead(X_i^q, X_i^k, X_i^v) is the resulting multi-head feature, concat is the cascading operation along the feature dimension, head_1 ~ head_6 are the heads in the 1st to 6th feature spaces, and W^O is the output weight matrix;
the output of the gated linear unit activation is given by:

h_l(X) = (X ∗ W + b) ⊗ sigmoid(X ∗ V + c)

where W, V are different convolution kernels, b, c are offset parameters, X is the input feature sequence with position information introduced, ⊗ denotes element-wise multiplication, and h_l(X) is the gated-linear-unit output.
Optionally, decomposing the high-dimensional features of each modality into low-dimensional feature subspaces in step 4) to obtain the low-dimensional features is given by:

X_i^L = X_f^L W_i^L, X_i^A = X_f^A W_i^A, X_i^V = X_f^V W_i^V, i = 1, …, 6

where X_i^L, X_i^A, X_i^V are the low-dimensional features of each modality in the i-th low-dimensional feature subspace, X_f^L, X_f^A, X_f^V are the three single-modality features containing context information, and W_i^L, W_i^A, W_i^V are the dimension-reduction matrices for the three modalities;
when corresponding weights are given to the low-dimensional features in the step 4), the weights of the low-dimensional features meet the following conditions:
α i + β i + γ i =1,i=1,2,…,6
where α_i, β_i, γ_i are the weights assigned to the low-dimensional features of the respective modalities;
the weighted cascading of all the low-dimensional features in each low-dimensional feature subspace in step 4) is given by:

F_i = concat(α_i X_i^L, β_i X_i^A, γ_i X_i^V), i = 1, …, 6

where F_i denotes the cascaded low-dimensional feature of the i-th low-dimensional feature subspace, α_i, β_i, γ_i are the weights assigned to the low-dimensional features, and X_i^L, X_i^A, X_i^V are the low-dimensional features of the respective modalities in the subspace.
Optionally, step 5) comprises:
5.1) calculating the heads of the multi-branch sparse attention network according to:

head_i^f = SparseAttention(F_i, F_i, F_i), i = 1, …, 6

where head_i^f denotes the i-th head of the multi-branch sparse attention network, SparseAttention is the sparse attention computation of the multi-branch sparse attention network, and F_i denotes the cascaded low-dimensional feature of the i-th low-dimensional feature subspace;
5.2) calculating the fused multi-modal information according to:

F_f = MultiHead(F, F, F) = concat(head_1^f, …, head_6^f) W_f^O

where F_f denotes the fused multi-modal information, MultiHead(F, F, F) denotes the multi-head computation over the cascaded low-dimensional features, head_1^f ~ head_6^f are the 1st to 6th heads of the multi-branch sparse attention network, and W_f^O is the output weight matrix.
Optionally, the emotion classifier in step 5) is composed of a fully connected layer and a softmax function, wherein the fully connected layer is computed as:
Y=WF f +B
where F_f denotes the fused multi-modal information, B is the corresponding offset, W is the weight matrix of all neurons, and Y is the output of the fully connected layer, whose dimension equals the number of emotion categories;
wherein the softmax function is computed as:

y_i = softmax(Y_i) = e^{Y_i} / Σ_{j=1}^{N} e^{Y_j}

where y_i is the probability of the i-th emotion category, softmax is the normalized exponential function, Y_i is the output of the fully connected layer for the i-th emotion category, Y_j is the output for the j-th emotion category, and N is the number of emotion categories.
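The classifier above can be sketched as a plain NumPy computation (an illustrative sketch, not the patented implementation; the function name `classify` and the dimensions of W and B are assumptions for illustration):

```python
import numpy as np

def classify(F_f, W, B):
    """Fully connected layer followed by softmax; returns a probability
    for each of the N emotion categories."""
    Y = F_f @ W + B                 # Y = W * F_f + B, one score per class
    e = np.exp(Y - Y.max())         # subtract max for numerical stability
    return e / e.sum()              # y_i = e^{Y_i} / sum_j e^{Y_j}

# usage with toy fused features (128-dim) and 6 emotion categories
np.random.seed(3)
F_f = np.random.rand(128)
W = np.random.rand(128, 6)
B = np.zeros(6)
probs = classify(F_f, W, B)
```

The predicted emotion category is then simply `np.argmax(probs)`.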
In addition, the invention also provides a multi-modal emotion recognition system based on subspace sparse feature fusion, which comprises a computer device, wherein the computer device at least comprises a microprocessor and a memory, the microprocessor is programmed or configured to execute the steps of the multi-modal emotion recognition method based on subspace sparse feature fusion, or the memory stores a computer program which is programmed or configured to execute the multi-modal emotion recognition method based on subspace sparse feature fusion.
In addition, the invention also provides a computer readable storage medium, wherein a computer program programmed or configured to execute the multi-modal emotion recognition method based on subspace sparse feature fusion is stored in the computer readable storage medium.
Compared with the prior art, the invention has the following advantages: the method acquires the feature sequences of multiple modalities; performs word-level alignment, normalization and position coding; inputs each sequence into the corresponding multi-branch sparse attention module; decomposes the result into low-dimensional feature subspaces to obtain low-dimensional features; cascades all the low-dimensional features in each subspace based on weights; and trains in a multi-branch sparse attention network to obtain fused multi-modal information. The fused multi-modal information is input into a pre-trained emotion classifier to obtain the current emotion category of the identified object, the emotion classifier being pre-trained to establish a mapping between the fused multi-modal information and emotion categories. By considering the sparsity of associations among time-series information and decomposing the multi-modal information into multiple subspaces for fusion, the invention can capture context information within and across modalities and improve the accuracy of multi-modal emotion recognition.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a frame structure of the method according to the embodiment of the present invention.
Fig. 3 is a schematic diagram of a frame structure of a multi-branch sparse attention module according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1 and fig. 2, the multi-modal emotion recognition method based on subspace sparse feature fusion in the present embodiment includes:
1) acquiring the current feature sequences of multiple modalities of the identified object;
2) carrying out word-level alignment and normalization on the feature sequences of the multiple modalities;
3) applying position coding to each modality's feature sequence to obtain feature sequences with position information introduced, and then inputting each of these sequences into the corresponding multi-branch sparse attention module to obtain the high-dimensional features of each modality;
4) decomposing the high-dimensional features of each modality into low-dimensional feature subspaces to obtain low-dimensional features, assigning corresponding weights to the low-dimensional features, and then cascading all the low-dimensional features in each subspace based on the weights to obtain cascaded low-dimensional features;
5) training the cascaded low-dimensional features in a multi-branch sparse attention network to obtain fused multi-modal information;
6) inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the identified object, wherein the emotion classifier is pre-trained to establish a mapping between the fused multi-modal information and emotion categories.
It should be noted that the method of this embodiment does not depend on a specific modality or combination of modalities; it is compatible with the various modalities that exist or may appear in the future. As an optional implementation, the features of the multiple modalities in step 1) of this embodiment include a text feature sequence, an audio feature sequence and a video feature sequence, generated as follows: first, audio and video of the user's face are collected with a speech development board and a camera, and the denoised speech is automatically transcribed into text by a speech open platform; then a pre-trained GloVe model extracts the text feature sequence l = {l_1, l_2, l_3, …, l_Nl}, l_n ∈ R^300, where l_n is a word vector feature, the dimension of a single text feature is 300, and Nl is the number of words in the recognized text; COVAREP extracts the audio feature sequence a = {a_1, a_2, a_3, …, a_Na}, a_n ∈ R^74, where a_n is a frame feature, the dimension of a single audio feature is 74, and Na is the number of segmented audio frames; Facet extracts the video feature sequence v = {v_1, v_2, v_3, …, v_Nv}, v_n ∈ R^35, where v_n is a frame feature, the dimension of a single video feature is 35, and Nv is the total number of video frames.
In this embodiment, step 2) comprises: aligning the audio feature sequence and the video feature sequence with the text feature sequence; recording the start time and end time of the i-th word and averaging the audio and video features within the corresponding time period; normalizing the aligned text, audio and video feature sequences to the range [0, 1]; and finally limiting the length of the text content (to 20 in this embodiment), truncating any excess and zero-padding any shortfall, so that the feature dimensions of the text, audio and video feature sequences are unified to (20, 300), (20, 74) and (20, 35), respectively.
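The alignment step above can be sketched in NumPy as follows (an illustrative sketch only; the function name `align_and_pad` and its argument layout are assumptions, since the patent does not specify an implementation):

```python
import numpy as np

def align_and_pad(text_feats, audio_feats, video_feats, word_spans,
                  audio_times, video_times, max_len=20):
    """Align audio/video features to words by averaging over each word's
    time span, then normalize to [0, 1] and pad/truncate to max_len."""
    aligned_a, aligned_v = [], []
    for (t_start, t_end) in word_spans:          # one (start, end) per word
        a_mask = (audio_times >= t_start) & (audio_times < t_end)
        v_mask = (video_times >= t_start) & (video_times < t_end)
        aligned_a.append(audio_feats[a_mask].mean(axis=0))
        aligned_v.append(video_feats[v_mask].mean(axis=0))

    def norm_pad(x):
        x = np.asarray(x, dtype=float)
        x = (x - x.min()) / (x.max() - x.min() + 1e-8)   # scale to [0, 1]
        out = np.zeros((max_len, x.shape[1]))            # zero-pad shortfall
        out[:min(max_len, len(x))] = x[:max_len]         # truncate excess
        return out

    return norm_pad(text_feats), norm_pad(aligned_a), norm_pad(aligned_v)

# toy usage: 3 words over 3 seconds, 30 audio/video frames
np.random.seed(0)
text = np.random.rand(3, 300)
audio = np.random.rand(30, 74)
video = np.random.rand(30, 35)
frame_times = np.linspace(0, 3, 30, endpoint=False)
spans = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
tl, al, vl = align_and_pad(text, audio, video, spans, frame_times, frame_times)
```

The outputs have the unified shapes (20, 300), (20, 74) and (20, 35) described above.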
In this embodiment, the position code in step 3) is given by:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
X = X0 + PE

where pos denotes the position of a single feature in the input feature sequence X, i denotes the dimension of the feature, d denotes the overall dimension of the feature, PE(pos, 2i) and PE(pos, 2i+1) denote the position codes at dimensions 2i and 2i+1 of position pos in the position-coding matrix PE, X0 denotes the input feature sequence, and X denotes the feature sequence with position information introduced. Position coding preserves relative position information in the feature sequence: a sine transform is applied at the even positions and a cosine transform at the odd positions of the feature sequence to obtain the position-coding matrix PE, which is finally added to the original input X0 to introduce position information into the feature sequence X. In this embodiment, the aligned text, audio and video feature sequences l, a, v are position-coded and input into the respective multi-branch sparse attention modules to learn the single-modality context features X^L, X^A, X^V.
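The position-coding formulas above can be sketched as (a minimal NumPy sketch, assuming an even feature dimension d; the function name is an illustration, not the patent's code):

```python
import numpy as np

def positional_encoding(seq_len, d):
    """Sinusoidal position encoding: sine at even dimensions,
    cosine at odd dimensions (d assumed even)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d, 2)[None, :]                # even dimension indices
    angle = pos / np.power(10000.0, i / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angle)                    # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                    # PE(pos, 2i+1)
    return pe

# X = X0 + PE for a (20, 300) aligned text feature sequence
X0 = np.random.rand(20, 300)
X = X0 + positional_encoding(20, 300)
```

Because PE is simply added, the downstream attention modules receive both content and position information in one tensor.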
As shown in fig. 3, the processing of the input feature sequence with position information by the multi-branch sparse attention module in step 3) of this embodiment comprises: first performing multi-head dimensionality reduction on the input sequence and extracting sparse attention SparseAttention; in parallel, extracting the local correlations of the input sequence through a convolution layer and activating the output through a gated linear unit; and then adding the extracted multi-head features to the gated-linear-unit output to obtain the high-dimensional features of each modality;
performing multi-head dimensionality reduction means projecting the input into 6 different feature spaces according to the following formulas to obtain the query, key and value features in each space:

X_i^q = X W_i^q, X_i^k = X W_i^k, X_i^v = X W_i^v, i = 1, …, 6

where W_i^q, W_i^k, W_i^v are the query, key and value weight matrices respectively, and X_i^q, X_i^k, X_i^v are the query, key and value features projected into the 6 different feature spaces; the query, key and value weight matrices project the input feature sequence into the 6 different feature spaces;
wherein extracting sparse attention SparseAttention means computing, for the query, key and value features projected into the 6 feature spaces, the sparse attention head head_i in the i-th feature space according to:

head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse(X_i^q X_i^kT / √d_k) X_i^v

where head_i is the head in the i-th feature space, SparseAttention is the sparse attention computation network, X_i^q, X_i^k, X_i^v are the query, key and value features projected into the i-th feature space, d_k is the dimension of the input feature sequence, and sparse(·) is the sparse similarity matrix, computed as:

M = X_i^q X_i^kT / √d_k
Τ = sigmoid(linear(pool(CNN(M))))
sparse(M) = softmax(relu(M − Τ))

where X is the input feature sequence, M is the similarity matrix of the input features, softmax denotes the softmax function, Τ is the threshold matrix (an intermediate variable), sigmoid denotes the sigmoid function, linear is a linear function, pool() is a pooling window of size 2 × 2 with stride 1, and CNN() denotes a convolution with kernel size 1 × 1 and stride 1; the linear function regresses the sparse attention threshold from the pooled activation values;
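A minimal NumPy sketch of one sparse attention head follows. It is illustrative only: the learned 1×1-convolution and linear threshold regressor of the patent are replaced here by fixed scalar stand-ins `w` and `b`, and the 2×2 stride-1 pooling is approximated with edge padding so the threshold map keeps the shape of M.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def avg_pool_2x2_s1(m):
    """2x2 average pooling with stride 1 and edge padding (keeps shape)."""
    p = np.pad(m, ((0, 1), (0, 1)), mode="edge")
    return (p[:-1, :-1] + p[1:, :-1] + p[:-1, 1:] + p[1:, 1:]) / 4.0

def sparse_attention(Q, K, V, w=1.0, b=0.0):
    """Sparse attention head: similarities below a regressed threshold
    are zeroed by relu before the softmax renormalisation."""
    d_k = Q.shape[-1]
    M = Q @ K.T / np.sqrt(d_k)               # scaled similarity matrix
    T = sigmoid(w * avg_pool_2x2_s1(M) + b)  # per-position threshold map
    S = softmax(np.maximum(M - T, 0.0))      # relu-thresholded attention
    return S @ V

np.random.seed(1)
Q = np.random.rand(20, 50)
K = np.random.rand(20, 50)
V = np.random.rand(20, 50)
out = sparse_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, so the head keeps the value range of V while ignoring weakly correlated positions.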
then the heads head_i in the feature spaces are cascaded into the multi-head feature using:

MultiHead(X_i^q, X_i^k, X_i^v) = concat(head_1, …, head_6) W^O

where MultiHead(X_i^q, X_i^k, X_i^v) is the resulting multi-head feature, concat is the cascading operation along the feature dimension, which concatenates the features obtained from the 6 feature spaces at output time before multiplying by the output weight matrix W^O, and head_1 ~ head_6 are the heads in the 1st to 6th feature spaces. Considering the temporal sparsity of long feature sequences, the multi-head sparse attention module in this embodiment computes the head head_i in the i-th feature space as:

head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse(X_i^q X_i^kT / √d_k) X_i^v

where SparseAttention is the sparse attention computation network, X_i^q, X_i^k, X_i^v are the query, key and value features, and d_k is the dimension of the input feature sequence.
The output of the gated linear unit activation is given by:

h_l(X) = (X ∗ W + b) ⊗ sigmoid(X ∗ V + c)

where W, V are different convolution kernels, b, c are offset parameters, X is the input feature sequence with position information introduced, ⊗ denotes element-wise multiplication, and h_l(X) is the gated-linear-unit output. The output h_l(X) of the convolution branch is the element-wise product of the convolution output without nonlinear transformation and the convolution output passed through the sigmoid nonlinearity.
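The gated-linear-unit branch can be sketched as follows (an illustrative sketch: dense matrix multiplications stand in for the 1-D convolutions of the patent, and the function name `glu_branch` is an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu_branch(X, W, V, b, c):
    """Gated linear unit: h(X) = (X*W + b) elementwise-times
    sigmoid(X*V + c); the sigmoid gate controls how much of the
    linear branch passes through."""
    return (X @ W + b) * sigmoid(X @ V + c)

# toy usage on a (20, 300) position-coded sequence
np.random.seed(2)
X = np.random.rand(20, 300)
W = np.random.rand(300, 300) * 0.01
Vk = np.random.rand(300, 300) * 0.01
b = np.zeros(300)
c = np.zeros(300)
h = glu_branch(X, W, Vk, b, c)
```

Because the gate lies in (0, 1), the output magnitude is bounded by the ungated linear branch, which is the soft-selection behaviour the text describes.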
In this embodiment, the sigmoid function is used as the activation for the threshold in the sparse attention computation:

sigmoid(x) = 1 / (1 + e^{-x})
Τ(i, j) = sigmoid(m(i, j))

where e^x is the exponential function, m(i, j) is the element at position (i, j) of the input threshold map to be activated, and Τ(i, j) is the corresponding element of the output threshold map.
In this embodiment, a linear rectification function relu is used to sparsify the correlation weight matrix:

f(M − Τ) = relu(M − Τ) = max(0, M − Τ)

where f(M − Τ) denotes the sparsified correlation weight matrix, M denotes the correlation matrix, and Τ denotes the threshold matrix. The values of the correlation matrix M are compared against the threshold matrix Τ, and the linear rectification function relu yields the final sparse attention matrix.
In this embodiment, the context feature information captured by sparse attention and the local feature information extracted by the convolution branch are added element-wise to obtain the single-modality features X_f^L, X_f^A, X_f^V.
In step 4) of this embodiment, the high-dimensional features of each modality are decomposed into low-dimensional feature subspaces, giving the low-dimensional features:
X_i^L = X_f^L W_i^L, X_i^A = X_f^A W_i^A, X_i^V = X_f^V W_i^V, i = 1, 2, …, 6
In the above formula, X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in the i-th low-dimensional feature subspace, X_f^L, X_f^A and X_f^V are the three single-modality features containing context information, and W_i^L, W_i^A and W_i^V are the dimension-reduction matrices of the three modalities (through these matrices, the input single-modality feature sequences are projected into 6 different low-dimensional feature spaces);
When corresponding weights are assigned to the low-dimensional features in step 4), the weights satisfy:
α_i + β_i + γ_i = 1, i = 1, 2, …, 6
In the above formula, α_i, β_i and γ_i are the weights assigned to the low-dimensional features;
the functional expression for cascading all low-dimensional features in each low-dimensional feature subspace according to their weights in step 4) is as follows:
In the above formula, F_i is the cascaded low-dimensional feature of the i-th low-dimensional feature subspace, α_i, β_i and γ_i are the weights assigned to the low-dimensional features, and X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in the low-dimensional feature subspace.
In this embodiment, the step 5) includes:
5.1) calculating the heads of the multi-branch sparse attention network according to the following formula:
head_i^f = SparseAttention(F_i)
In the above formula, head_i^f is the i-th head of the multi-branch sparse attention network, SparseAttention is one branch of the sparse attention network, and F_i is the cascaded low-dimensional feature of the i-th low-dimensional feature subspace;
5.2) calculating the fused multi-modal information according to the following formula:
F_f = MultiHead(F, F, F) = concat(head_1^f, …, head_6^f) W_f^O
In the above formula, F_f is the fused multi-modal information, MultiHead(F, F, F) denotes the multi-head computation over the cascaded low-dimensional features, head_1^f to head_6^f are the 1st to 6th heads of the multi-branch sparse attention network, and W_f^O is the output weight matrix. The emotion classifier in step 5) of this embodiment is composed of a fully connected layer and a softmax function, where the calculation function expression of the fully connected layer is:
Y=WF f +B
In the above formula, F_f is the fused multi-modal information, B is the corresponding bias, W is the weight matrix of all neurons, Y is the output of the fully connected layer, and the dimension of Y equals the number of emotion categories;
wherein the computational function expression of the softmax function is:
y_i = softmax(Y_i) = e^(Y_i) / Σ_{j=1}^{N} e^(Y_j)
In the above formula, y_i is the probability of the i-th emotion category, the softmax function is a normalized exponential function, Y_i is the output of the fully connected layer for the i-th emotion category, Y_j is the output of the fully connected layer for the j-th emotion category, and N is the number of emotion categories. During training of the emotion classifier, the cascaded low-dimensional features are trained in the multi-branch sparse attention network to obtain the fused multi-modal information, the emotion category information is obtained through the emotion classifier, and the network parameters are solved. The loss function used to solve the network parameters is L1Loss, with calculation function expression:
L1Loss = (1/n) Σ_{i=1}^{n} |y_i^p − y_i|
In the above formula, y_i^p is the predicted probability of the i-th emotion category, y_i is the ground-truth probability of the i-th emotion category, and n is the number of emotion categories.
The trained multi-modal emotion recognition network model analyzes the input audio, face video and text, and predicts the user emotion category contained in the multi-modal data. High-dimensional features are extracted from audio, video and text, and the features are aligned in units of words. Position information is introduced into the feature sequences through a position encoding module. For each of the three modalities, a multi-branch sparse attention module extracts the context information within the modality. The three modal features are then reduced into different subspaces, and the modal features in each subspace are given corresponding weights. A multi-branch sparse attention module extracts the context information among the modalities, and finally the emotion classifier outputs the user emotion category contained in the multi-modal data.
Experimental verification of the multi-modal emotion recognition method based on subspace sparse feature fusion in this embodiment was carried out. Table 1 shows the emotion recognition results of the multi-modal subspace information fusion network on the CMU-MOSI and CMU-MOSEI data sets, where Acc2 denotes binary emotion classification accuracy, Acc7 denotes seven-class emotion classification accuracy, and F1 denotes the binary-classification F1 score. MFN is the method published by Amir Zadeh et al. in "Memory fusion network for multi-view sequential learning" (In Thirty-Second AAAI Conference on Artificial Intelligence, 2018), and RAVEN is the method in "Words can shift: Dynamically adjusting word representations using nonverbal behaviors" (In Thirty-Third AAAI Conference on Artificial Intelligence, 2019).
Table 1: Evaluation results of the method of this embodiment on public emotion recognition data sets.
As can be seen from Table 1, the multi-modal emotion recognition method based on subspace sparse feature fusion achieves accurate multi-modal emotion classification on the CMU-MOSI and CMU-MOSEI data sets.
In addition, the present embodiment also provides a multi-modal emotion recognition system based on subspace sparse feature fusion, which includes a computer device, where the computer device at least includes a microprocessor and a memory, where the microprocessor is programmed or configured to execute the steps of the foregoing multi-modal emotion recognition method based on subspace sparse feature fusion, or the memory stores a computer program that is programmed or configured to execute the foregoing multi-modal emotion recognition method based on subspace sparse feature fusion. In addition, the embodiment also provides a computer readable storage medium, which stores a computer program programmed or configured to execute the foregoing multi-modal emotion recognition method based on subspace sparse feature fusion.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application, where instructions executed by a processor create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.
Claims (10)
1. A multi-modal emotion recognition method based on subspace sparse feature fusion is characterized by comprising the following steps:
1) acquiring feature sequences of multiple current modalities of an identified object;
2) carrying out word-level alignment and normalization processing on the feature sequences of the multiple modalities;
3) performing position encoding on the feature sequences of the multiple modalities of the identified object to obtain feature sequences with position information introduced, and then inputting the feature sequence with position information of each modality into the corresponding multi-branch sparse attention module to obtain the high-dimensional features of each modality;
4) decomposing the high-dimensional features corresponding to each mode into a low-dimensional feature subspace to obtain low-dimensional features, giving corresponding weights to the low-dimensional features, and then cascading all the low-dimensional features in the low-dimensional feature subspace based on the weights to obtain cascaded low-dimensional features;
5) training the cascaded low-dimensional features in a multi-branch sparse attention network to obtain fused multi-modal information;
6) inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the identified object, wherein the emotion classifier is pre-trained to establish a mapping between the fused multi-modal information and the emotion categories.
2. The method for multi-modal emotion recognition based on subspace sparse feature fusion as recited in claim 1, wherein the features of the plurality of modalities in step 1) comprise a text feature sequence, an audio feature sequence and a video feature sequence.
3. The method for multi-modal emotion recognition based on subspace sparse feature fusion as claimed in claim 2, wherein the step 2) comprises: aligning the audio feature sequence and the video feature sequence with the text feature sequence by recording the start time and end time of the i-th word and averaging the features of the audio and video feature sequences over the corresponding time spans; normalizing the aligned text, audio and video feature sequences to the range [0,1]; and finally limiting the length of the text content, truncating any excess and zero-padding any deficit, so that the feature dimensions of the text, audio and video feature sequences are unified to (20, 300), (20, 74) and (20, 35) respectively.
4. The method for multi-modal emotion recognition based on subspace sparse feature fusion as claimed in claim 1, wherein the position-encoding function expressions in step 3) are as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), X = X_0 + PE
In the above formula, pos represents the position of a single feature in the input feature sequence X_0, i represents the dimension in which the feature is located, d represents the overall dimension of the feature, PE(pos,2i) represents the position encoding of dimension 2i at position pos in the position encoding matrix PE, PE(pos,2i+1) represents the position encoding of dimension 2i+1 at position pos, X_0 represents the input feature sequence, X represents the feature sequence with position information introduced, and PE represents the position encoding matrix.
5. The method for multi-modal emotion recognition based on subspace sparse feature fusion as claimed in claim 2, wherein the processing of the input feature sequence with position information by the multi-branch sparse attention module in step 3) comprises: first, multi-head dimensionality reduction and sparse attention extraction SparseAttention are performed on the input feature sequence with position information introduced; meanwhile, the local correlation of the feature sequence with position information introduced is extracted through a convolution layer and activated and output through a gated linear unit; the extracted multi-head features and the activated output of the gated linear unit are then added to obtain the high-dimensional features of each modality;
performing multi-head dimensionality reduction refers to projecting the input into 6 different feature spaces according to the following formula, giving the query features, key features and value features in the 6 feature spaces:
X_i^q = X W_i^q, X_i^k = X W_i^k, X_i^v = X W_i^v, i = 1, 2, …, 6
In the above formula, W_i^q, W_i^k and W_i^v are the query weight matrix, key weight matrix and value weight matrix respectively, and X_i^q, X_i^k and X_i^v are the query, key and value features projected into the 6 different feature spaces;
wherein extracting sparse attention SparseAttention means calculating, from the query, key and value features projected into the 6 different feature spaces, the sparse attention head head_i in the i-th feature space according to the following formula:
In the above formula, head_i is the head in the i-th feature space, SparseAttention is the sparse attention computation network, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space respectively, d_k is the dimension of the input feature sequence, and sparse(X_i^q X_i^kT) is the sparse similarity matrix; the functional expression for calculating the sparse similarity matrix sparse(X_i^q X_i^kT) is:
In the above formula, X is the input feature sequence, M is the similarity matrix of the input features, softmax denotes the softmax function, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space respectively, Τ is an intermediate variable, sigmoid denotes the sigmoid function, linear is a linear function, pool() is a pooling window of size 2×2 with stride 1, CNN() denotes a convolution operation with kernel size 1×1 and stride 1, and the linear function regresses the sparse attention threshold from the pooled activation values;
the heads head_i in the feature spaces are then concatenated into the multi-head feature using the following formula:
MultiHead(X_i^q, X_i^k, X_i^v) = concat(head_1, …, head_6) W^O
In the above formula, MultiHead(X_i^q, X_i^k, X_i^v) is the resulting multi-head feature, concat is the concatenation operation along the feature dimension, head_1 to head_6 are the heads in the 1st to 6th feature spaces, and W^O is the output weight matrix;
the functional expression of the activated output of the gated linear unit is as follows:
h_l(X) = (X ∗ W + b) ⊗ sigmoid(X ∗ V + c)
In the above formula, W and V are different convolution kernels, b and c are bias parameters, X is the input feature sequence with position information introduced, and h_l(X) is the activated output of the gated linear unit.
6. The multi-modal emotion recognition method based on subspace sparse feature fusion as claimed in claim 2, wherein the functional expression of the low-dimensional features obtained by decomposing the high-dimensional features corresponding to each modality into the low-dimensional feature subspace in step 4) is as follows:
In the above formula, X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in the i-th low-dimensional feature subspace, X_f^L, X_f^A and X_f^V are the three single-modality features containing context information, and W_i^L, W_i^A and W_i^V are the dimension-reduction matrices of the three modalities;
when corresponding weights are assigned to the low-dimensional features in step 4), the weights satisfy:
α_i + β_i + γ_i = 1, i = 1, 2, …, 6
In the above formula, α_i, β_i and γ_i are the weights assigned to the low-dimensional features;
the functional expression for cascading all low-dimensional features in each low-dimensional feature subspace according to their weights in step 4) is as follows:
In the above formula, F_i is the cascaded low-dimensional feature of the i-th low-dimensional feature subspace, α_i, β_i and γ_i are the weights assigned to the low-dimensional features, and X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in the low-dimensional feature subspace.
7. The multi-modal emotion recognition method based on subspace sparse feature fusion as claimed in claim 6, wherein the step of step 5) comprises:
1) calculating the heads of the multi-branch sparse attention network according to the following formula:
head_i^f = SparseAttention(F_i)
In the above formula, head_i^f is the i-th head of the multi-branch sparse attention network, SparseAttention is one branch of the sparse attention network, and F_i is the cascaded low-dimensional feature of the i-th low-dimensional feature subspace;
2) calculating the fused multi-modal information according to the following formula:
F_f = MultiHead(F, F, F) = concat(head_1^f, …, head_6^f) W_f^O
In the above formula, F_f is the fused multi-modal information, MultiHead(F, F, F) denotes the multi-head computation over the cascaded low-dimensional features, head_1^f to head_6^f are the 1st to 6th heads of the multi-branch sparse attention network, and W_f^O is the output weight matrix.
8. The multi-modal emotion recognition method based on subspace sparse feature fusion as claimed in claim 7, wherein the emotion classifier in step 5) is composed of a fully connected layer and a softmax function, where the calculation function expression of the fully connected layer is:
Y=WF f +B
In the above formula, F_f is the fused multi-modal information, B is the corresponding bias, W is the weight matrix of all neurons, Y is the output of the fully connected layer, and the dimension of Y equals the number of emotion categories;
wherein the computational function expression of the softmax function is:
y_i = softmax(Y_i) = e^(Y_i) / Σ_{j=1}^{N} e^(Y_j)
In the above formula, y_i is the probability of the i-th emotion category, the softmax function is a normalized exponential function, Y_i is the output of the fully connected layer for the i-th emotion category, Y_j is the output of the fully connected layer for the j-th emotion category, and N is the number of emotion categories.
9. A multi-modal emotion recognition system based on subspace sparse feature fusion, comprising a computer device including at least a microprocessor and a memory, wherein the microprocessor is programmed or configured to perform the steps of the multi-modal emotion recognition method based on subspace sparse feature fusion of any one of claims 1 to 8, or the memory has stored therein a computer program programmed or configured to perform the multi-modal emotion recognition method based on subspace sparse feature fusion of any one of claims 1 to 8.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and is programmed or configured to execute the method for multi-modal emotion recognition based on subspace sparse feature fusion according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011019175.9A CN111931795B (en) | 2020-09-25 | 2020-09-25 | Multi-modal emotion recognition method and system based on subspace sparse feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011019175.9A CN111931795B (en) | 2020-09-25 | 2020-09-25 | Multi-modal emotion recognition method and system based on subspace sparse feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111931795A true CN111931795A (en) | 2020-11-13 |
CN111931795B CN111931795B (en) | 2020-12-25 |
Family
ID=73335167
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011019175.9A Active CN111931795B (en) | 2020-09-25 | 2020-09-25 | Multi-modal emotion recognition method and system based on subspace sparse feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111931795B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112598067A (en) * | 2020-12-25 | 2021-04-02 | 中国联合网络通信集团有限公司 | Emotion classification method and device for event, electronic equipment and storage medium |
CN112633364A (en) * | 2020-12-21 | 2021-04-09 | 上海海事大学 | Multi-modal emotion recognition method based on Transformer-ESIM attention mechanism |
CN113177546A (en) * | 2021-04-30 | 2021-07-27 | 中国科学技术大学 | Target detection method based on sparse attention module |
CN113378942A (en) * | 2021-06-16 | 2021-09-10 | 中国石油大学(华东) | Small sample image classification method based on multi-head feature cooperation |
CN113870259A (en) * | 2021-12-02 | 2021-12-31 | 天津御锦人工智能医疗科技有限公司 | Multi-modal medical data fusion assessment method, device, equipment and storage medium |
CN114022668A (en) * | 2021-10-29 | 2022-02-08 | 北京有竹居网络技术有限公司 | Method, device, equipment and medium for aligning text with voice |
CN114863520A (en) * | 2022-04-25 | 2022-08-05 | 陕西师范大学 | Video expression recognition method based on C3D-SA |
CN114926716A (en) * | 2022-04-08 | 2022-08-19 | 山东师范大学 | Learning participation degree identification method, device and equipment and readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188343A (en) * | 2019-04-22 | 2019-08-30 | 浙江工业大学 | Multi-modal emotion identification method based on fusion attention network |
CN110209824A (en) * | 2019-06-13 | 2019-09-06 | 中国科学院自动化研究所 | Text emotion analysis method based on built-up pattern, system, device |
US20190286889A1 (en) * | 2018-03-19 | 2019-09-19 | Buglife, Inc. | Lossy facial expression training data pipeline |
CN111460213A (en) * | 2020-03-20 | 2020-07-28 | 河海大学 | Music emotion classification method based on multi-mode learning |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190286889A1 (en) * | 2018-03-19 | 2019-09-19 | Buglife, Inc. | Lossy facial expression training data pipeline |
CN110188343A (en) * | 2019-04-22 | 2019-08-30 | 浙江工业大学 | Multi-modal emotion identification method based on fusion attention network |
CN110209824A (en) * | 2019-06-13 | 2019-09-06 | 中国科学院自动化研究所 | Text emotion analysis method based on built-up pattern, system, device |
CN111460213A (en) * | 2020-03-20 | 2020-07-28 | 河海大学 | Music emotion classification method based on multi-mode learning |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112633364A (en) * | 2020-12-21 | 2021-04-09 | 上海海事大学 | Multi-modal emotion recognition method based on Transformer-ESIM attention mechanism |
CN112633364B (en) * | 2020-12-21 | 2024-04-05 | 上海海事大学 | Multimode emotion recognition method based on transducer-ESIM attention mechanism |
CN112598067A (en) * | 2020-12-25 | 2021-04-02 | 中国联合网络通信集团有限公司 | Emotion classification method and device for event, electronic equipment and storage medium |
CN113177546A (en) * | 2021-04-30 | 2021-07-27 | 中国科学技术大学 | Target detection method based on sparse attention module |
CN113378942A (en) * | 2021-06-16 | 2021-09-10 | 中国石油大学(华东) | Small sample image classification method based on multi-head feature cooperation |
CN114022668A (en) * | 2021-10-29 | 2022-02-08 | 北京有竹居网络技术有限公司 | Method, device, equipment and medium for aligning text with voice |
CN114022668B (en) * | 2021-10-29 | 2023-09-22 | 北京有竹居网络技术有限公司 | Method, device, equipment and medium for aligning text with voice |
CN113870259A (en) * | 2021-12-02 | 2021-12-31 | 天津御锦人工智能医疗科技有限公司 | Multi-modal medical data fusion assessment method, device, equipment and storage medium |
CN113870259B (en) * | 2021-12-02 | 2022-04-01 | 天津御锦人工智能医疗科技有限公司 | Multi-modal medical data fusion assessment method, device, equipment and storage medium |
WO2023098524A1 (en) * | 2021-12-02 | 2023-06-08 | 天津御锦人工智能医疗科技有限公司 | Multi-modal medical data fusion evaluation method and apparatus, device, and storage medium |
CN114926716A (en) * | 2022-04-08 | 2022-08-19 | 山东师范大学 | Learning participation degree identification method, device and equipment and readable storage medium |
CN114863520A (en) * | 2022-04-25 | 2022-08-05 | 陕西师范大学 | Video expression recognition method based on C3D-SA |
Also Published As
Publication number | Publication date |
---|---|
CN111931795B (en) | 2020-12-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111931795B (en) | Multi-modal emotion recognition method and system based on subspace sparse feature fusion | |
CN109522818B (en) | Expression recognition method and device, terminal equipment and storage medium | |
CN108804530B (en) | Subtitling areas of an image | |
WO2021233112A1 (en) | Multimodal machine learning-based translation method, device, equipment, and storage medium | |
CN113420807A (en) | Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method | |
Huang et al. | Multimodal continuous emotion recognition with data augmentation using recurrent neural networks | |
CN113822192A (en) | Method, device and medium for identifying emotion of escort personnel based on Transformer multi-modal feature fusion | |
CN113723166A (en) | Content identification method and device, computer equipment and storage medium | |
Wöllmer et al. | Analyzing the memory of BLSTM neural networks for enhanced emotion classification in dyadic spoken interactions | |
Zhang et al. | Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects | |
CN113449801B (en) | Image character behavior description generation method based on multi-level image context coding and decoding | |
CN110705490B (en) | Visual emotion recognition method | |
CN111985243B (en) | Emotion model training method, emotion analysis device and storage medium | |
CN112926525A (en) | Emotion recognition method and device, electronic equipment and storage medium | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
CN113435421B (en) | Cross-modal attention enhancement-based lip language identification method and system | |
Ousmane et al. | Automatic recognition system of emotions expressed through the face using machine learning: Application to police interrogation simulation | |
Cornia et al. | A unified cycle-consistent neural model for text and image retrieval | |
CN117132923A (en) | Video classification method, device, electronic equipment and storage medium | |
Xue et al. | Lcsnet: End-to-end lipreading with channel-aware feature selection | |
Hou et al. | Confidence-guided self refinement for action prediction in untrimmed videos | |
Yang et al. | Self-adaptive context and modal-interaction modeling for multimodal emotion recognition | |
CN116226347A (en) | Fine granularity video emotion content question-answering method and system based on multi-mode data | |
WO2021129410A1 (en) | Method and device for text processing | |
CN114462073A (en) | De-identification effect evaluation method and device, storage medium and product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |