CN111931795A - Multi-modal emotion recognition method and system based on subspace sparse feature fusion - Google Patents

Multi-modal emotion recognition method and system based on subspace sparse feature fusion Download PDF

Info

Publication number
CN111931795A
Authority
CN
China
Prior art keywords
feature
features
low
sparse
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011019175.9A
Other languages
Chinese (zh)
Other versions
CN111931795B (en
Inventor
李树涛
马付严
孙斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202011019175.9A priority Critical patent/CN111931795B/en
Publication of CN111931795A publication Critical patent/CN111931795A/en
Application granted granted Critical
Publication of CN111931795B publication Critical patent/CN111931795B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/513 Sparse representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal emotion recognition method and system based on subspace sparse feature fusion. The method comprises: obtaining feature sequences of multiple modalities and performing word-level alignment, normalization and position coding; inputting each sequence into a corresponding multi-branch sparse attention module; decomposing the resulting features into low-dimensional feature subspaces to obtain low-dimensional features; concatenating all low-dimensional features in each subspace according to their weights; training the concatenated features in a multi-branch sparse attention network to obtain fused multi-modal information; and inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, where the emotion classifier is pre-trained to establish the mapping between fused multi-modal information and emotion categories. By exploiting the sparsity of associations among time-series information and decomposing the multi-modal information into multiple subspaces for fusion, the method captures contextual information both within and across modalities and improves the accuracy of multi-modal emotion recognition.

Description

Multi-modal emotion recognition method and system based on subspace sparse feature fusion
Technical Field
The invention relates to multi-modal human-machine natural interaction technology, and in particular to a multi-modal emotion recognition method and system based on subspace sparse feature fusion.
Background
Multi-modal human-computer natural interaction faces the challenge of emotion: before a machine can interact naturally with humans, it must first understand and recognize human emotions. Emotion recognition has therefore become an important research topic in human-computer interaction and has developed rapidly in recent years. The accuracy of emotion recognition that relies solely on facial images or speech signals has reached a bottleneck, and such single-modality systems lack robustness. Compared with single-modality emotion recognition, multi-modal emotion recognition can exploit the emotional cues in speech, facial-expression images and text more comprehensively and thereby further improve recognition performance. As a result, an increasing number of researchers are focusing their attention on multi-modal emotion recognition.
However, multi-modal emotion recognition still faces several challenges. First, the representation and fusion of emotional features from different modalities: audio and video are collected by different sensors with different data formats and capture rates, and the unified representation and fusion of emotional features in multi-modal signals remains unsolved. Second, missing modality information: existing multi-modal emotion recognition methods generally assume that all modalities are completely acquired and do not consider the absence of a modality, yet noise and occlusion in real environments can cause audio or video modalities to be missing. Third, uncertainty in emotional features: language, gender and culture lead to differences in how specific emotional states are expressed in different scenarios.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to address the above problems in the prior art, the invention provides a multi-modal emotion recognition method and system based on subspace sparse feature fusion.
In order to solve the above technical problems, the invention adopts the following technical solution:
a multi-modal emotion recognition method based on subspace sparse feature fusion comprises the following steps:
1) acquiring the current feature sequences of multiple modalities of the object to be recognized;
2) performing word-level alignment and normalization on the feature sequences of the multiple modalities;
3) applying position coding to the feature sequence of each modality of the object to be recognized to obtain feature sequences with position information introduced, and then inputting the position-coded feature sequence of each modality into the corresponding multi-branch sparse attention module to obtain the high-dimensional features of each modality;
4) decomposing the high-dimensional features of each modality into low-dimensional feature subspaces to obtain low-dimensional features, assigning corresponding weights to the low-dimensional features, and then concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights to obtain concatenated low-dimensional features;
5) training the concatenated low-dimensional features in a multi-branch sparse attention network to obtain fused multi-modal information;
6) inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, wherein the emotion classifier is pre-trained to establish the mapping between fused multi-modal information and emotion categories.
Optionally, the features of the plurality of modalities in step 1) include a text feature sequence, an audio feature sequence, and a video feature sequence.
Optionally, step 2) comprises: aligning the audio feature sequence and the video feature sequence with the text feature sequence by recording the start and end times of the i-th word and averaging the audio and video features within the corresponding time span; normalizing the aligned text, audio and video feature sequences to the range [0,1]; and finally limiting the length of the text content, truncating the excess and zero-padding the shortfall, so that the feature dimensions of the text, audio and video feature sequences are unified to (20, 300), (20, 74) and (20, 35), respectively.
Optionally, the function expression of the position code in step 3) is as follows:
PE_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad X = X_0 + PE
In the above formulas, pos denotes the position of a single feature in the input feature sequence X, i denotes the dimension index, d denotes the overall dimension of the feature, PE_(pos,2i) denotes the position code of dimension 2i at position pos in the position-coding matrix PE, PE_(pos,2i+1) denotes the position code of dimension 2i+1 at position pos, X_0 denotes the input feature sequence, X denotes the feature sequence with position information introduced, and PE denotes the position-coding matrix.
Optionally, the processing of the position-coded input feature sequence by the multi-branch sparse attention module in step 3) comprises: first performing multi-head dimensionality reduction on the input position-coded feature sequence and extracting sparse attention (SparseAttention); in parallel, extracting the local correlations of the position-coded feature sequence through a convolution layer and activating the output with a gated linear unit; and then adding the extracted multi-head feature to the output activated by the gated linear unit to obtain the high-dimensional features of each modality;
Performing multi-head dimensionality reduction means projecting the input feature sequence into 6 different feature spaces according to the following formula to obtain the query, key and value features in each of the 6 feature spaces:
X_i^q = X W_i^q, \qquad X_i^k = X W_i^k, \qquad X_i^v = X W_i^v, \qquad i = 1, 2, \dots, 6
In the above formula, W_i^q, W_i^k and W_i^v are the query, key and value weight matrices, and X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th of the 6 feature spaces;
Extracting sparse attention (SparseAttention) means computing, from the query, key and value features projected into the 6 feature spaces, the sparse attention head head_i of the i-th feature space according to the following formula:
head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right) X_i^v
In the above formula, head_i is the head in the i-th feature space, SparseAttention is the sparse attention computation network, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, d_k is the dimension of the input feature sequence, and sparse(X_i^q X_i^{k\top}) is the sparse similarity matrix, computed as:
M = softmax\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right), \qquad T = sigmoid\big(linear(pool(CNN(M)))\big), \qquad sparse(X_i^q X_i^{k\top}) = relu(M - T)
In the above formula, X is the input feature sequence, M is the similarity matrix of the input features, softmax denotes the softmax function, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, T is an intermediate variable (the threshold map), sigmoid denotes the sigmoid function, linear is a linear function, pool() is a pooling operation with a 2 x 2 window and stride 1, CNN() is a convolution with a 1 x 1 kernel and stride 1, and the linear function regresses the sparse attention threshold from the pooled activation values;
The head head_i of each feature space is then concatenated into the multi-head feature using:
MultiHead(X_i^q, X_i^k, X_i^v) = concat(head_1, head_2, \dots, head_6)\, W^O
In the above formula, MultiHead(X_i^q, X_i^k, X_i^v) is the obtained multi-head feature, concat denotes concatenation along the feature dimension, head_1 to head_6 are the heads in the 1st to 6th feature spaces, and W^O is the output weight matrix;
The output activated by the gated linear unit is given by:
h_l(X) = (X \ast W + b) \otimes sigmoid(X \ast V + c)
In the above formula, W and V are different convolution kernels, b and c are bias parameters, X is the input feature sequence with position information introduced, and h_l(X) is the output activated by the gated linear unit.
Optionally, the high-dimensional features of each modality are decomposed into low-dimensional feature subspaces in step 4) to obtain the low-dimensional features according to the following expression:
X_i^L = X_f^L W_i^L, \qquad X_i^A = X_f^A W_i^A, \qquad X_i^V = X_f^V W_i^V, \qquad i = 1, 2, \dots, 6
In the above formula, X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in the i-th low-dimensional feature subspace, X_f^L, X_f^A and X_f^V are the three single-modality features containing context information, and W_i^L, W_i^A and W_i^V are the dimension-reduction matrices corresponding to the three modalities;
When corresponding weights are assigned to the low-dimensional features in step 4), the weights satisfy the following condition:
α_i + β_i + γ_i = 1, i = 1, 2, …, 6
where α_i, β_i and γ_i are the weights assigned to the low-dimensional features of the respective modalities;
The function expression for concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights in step 4) is as follows:
F_i = concat(\alpha_i X_i^L,\ \beta_i X_i^A,\ \gamma_i X_i^V)
In the above formula, F_i is the concatenated low-dimensional feature of the i-th low-dimensional feature subspace, α_i, β_i and γ_i are the weights assigned to the low-dimensional features, and X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in that subspace.
Optionally, step 5) comprises:
5.1) calculating the head of the multi-branch sparse attention network according to the following formula;
head_i^f = SparseAttention(F_i, F_i, F_i)
In the above formula, head_i^f denotes the i-th head of the multi-branch sparse attention network, SparseAttention is the sparse attention computation network, and F_i denotes the concatenated low-dimensional feature of the i-th low-dimensional feature subspace;
5.2) calculating the fused multi-modal information according to the following formula;
F_f = MultiHead(F, F, F) = concat(head_1^f, head_2^f, \dots, head_6^f)\, W_f^O
In the above formula, F_f denotes the fused multi-modal information, MultiHead(F, F, F) denotes the multi-head computation over the concatenated low-dimensional features, head_1^f to head_6^f are the 1st to 6th heads of the multi-branch sparse attention network, and W_f^O is the output weight matrix.
Optionally, the emotion classifier in step 5) is composed of a fully connected layer and a softmax function, where the computational expression of the fully connected layer is:
Y=WF f +B
In the above formula, F_f denotes the fused multi-modal information, B is the corresponding bias, W is the weight matrix of all neurons, Y is the output of the fully connected layer, and the dimension of Y equals the number of emotion categories;
wherein the computational expression of the softmax function is:
y_i = softmax(Y_i) = \frac{e^{Y_i}}{\sum_{j=1}^{N} e^{Y_j}}
In the above formula, y_i is the probability of the i-th emotion category, softmax is the normalized exponential function, Y_i is the output of the fully connected layer corresponding to the i-th emotion category, Y_j is the output corresponding to the j-th emotion category, and N is the number of emotion categories.
In addition, the invention also provides a multi-modal emotion recognition system based on subspace sparse feature fusion, which comprises a computer device, wherein the computer device at least comprises a microprocessor and a memory, the microprocessor is programmed or configured to execute the steps of the multi-modal emotion recognition method based on subspace sparse feature fusion, or the memory stores a computer program which is programmed or configured to execute the multi-modal emotion recognition method based on subspace sparse feature fusion.
In addition, the invention also provides a computer readable storage medium, wherein a computer program programmed or configured to execute the multi-modal emotion recognition method based on subspace sparse feature fusion is stored in the computer readable storage medium.
Compared with the prior art, the invention has the following advantages: the feature sequences of multiple modalities are acquired and subjected to word-level alignment, normalization and position coding, then input into the corresponding multi-branch sparse attention modules and decomposed into low-dimensional feature subspaces to obtain low-dimensional features; all low-dimensional features in each subspace are concatenated according to their weights and trained in a multi-branch sparse attention network to obtain fused multi-modal information; the fused multi-modal information is input into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, the emotion classifier having been pre-trained to establish the mapping between fused multi-modal information and emotion categories. By exploiting the sparsity of associations among time-series information and decomposing the multi-modal information into multiple subspaces for fusion, the invention captures contextual information within and across modalities and improves the accuracy of multi-modal emotion recognition.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a frame structure of the method according to the embodiment of the present invention.
Fig. 3 is a schematic diagram of a frame structure of a multi-branch sparse attention module according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1 and fig. 2, the multi-modal emotion recognition method based on subspace sparse feature fusion in the present embodiment includes:
1) acquiring the current feature sequences of multiple modalities of the object to be recognized;
2) performing word-level alignment and normalization on the feature sequences of the multiple modalities;
3) applying position coding to the feature sequence of each modality of the object to be recognized to obtain feature sequences with position information introduced, and then inputting the position-coded feature sequence of each modality into the corresponding multi-branch sparse attention module to obtain the high-dimensional features of each modality;
4) decomposing the high-dimensional features of each modality into low-dimensional feature subspaces to obtain low-dimensional features, assigning corresponding weights to the low-dimensional features, and then concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights to obtain concatenated low-dimensional features;
5) training the concatenated low-dimensional features in a multi-branch sparse attention network to obtain fused multi-modal information;
6) inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, wherein the emotion classifier is pre-trained to establish the mapping between fused multi-modal information and emotion categories.
It should be noted that the method of this embodiment does not depend on a specific modality or combination of modalities; it is compatible with various modalities that exist now or may appear in the future. As an optional implementation, the features of the multiple modalities in step 1) of this embodiment include a text feature sequence, an audio feature sequence and a video feature sequence, generated as follows. First, audio and facial video of the user are captured with a speech development board and a camera, and the denoised speech is automatically transcribed into text by a speech open platform. A pre-trained GloVe model then extracts the text feature sequence l = {l_1, l_2, l_3, …, l_{N_l}}, l_n ∈ R^300, where l_n is a word-vector feature, the dimension of a single text feature is 300, and N_l is the number of words in the recognized text. COVAREP extracts the audio feature sequence a = {a_1, a_2, a_3, …, a_{N_a}}, a_n ∈ R^74, where a_n is a frame-level feature, the dimension of a single audio feature is 74, and N_a is the number of audio frames after segmentation. Facet extracts the video feature sequence v = {v_1, v_2, v_3, …, v_{N_v}}, v_n ∈ R^35, where v_n is a frame-level feature, the dimension of a single video feature is 35, and N_v is the total number of video frames.
In this embodiment, step 2) comprises: aligning the audio feature sequence and the video feature sequence with the text feature sequence by recording the start and end times of the i-th word and averaging the audio and video features within the corresponding time span; normalizing the aligned text, audio and video feature sequences to the range [0,1]; and finally limiting the length of the text content (for example, to 20 words in this embodiment), truncating the excess and zero-padding the shortfall, so that the feature dimensions of the text, audio and video feature sequences are unified to (20, 300), (20, 74) and (20, 35), respectively.
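A minimal sketch of this alignment and normalization step is shown below, assuming the GloVe, COVAREP and Facet features and their frame timestamps have already been extracted; the function names, the use of NumPy and the exact min-max normalization are illustrative assumptions rather than the patented implementation.

    import numpy as np

    def word_align(text_feats, audio_feats, video_feats, word_spans,
                   audio_times, video_times, max_len=20):
        """Align audio/video features to words by averaging the frames that fall
        inside each word's [start, end) interval, then pad/truncate to max_len.
        text_feats: (N_words, 300); audio_feats: (N_a, 74); video_feats: (N_v, 35)."""
        def span_mean(feats, times, start, end):
            mask = (times >= start) & (times < end)
            return feats[mask].mean(axis=0) if mask.any() else np.zeros(feats.shape[1])

        a = np.stack([span_mean(audio_feats, audio_times, s, e) for s, e in word_spans])
        v = np.stack([span_mean(video_feats, video_times, s, e) for s, e in word_spans])

        def minmax(x):                      # normalize each feature to [0, 1]
            lo, hi = x.min(axis=0), x.max(axis=0)
            return (x - lo) / (hi - lo + 1e-8)

        def pad(x):                         # truncate the excess, zero-pad the shortfall
            x = x[:max_len]
            return np.pad(x, ((0, max_len - len(x)), (0, 0)))

        return pad(minmax(text_feats)), pad(minmax(a)), pad(minmax(v))  # (20,300), (20,74), (20,35)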
In this embodiment, the function expression of the position code in step 3) is shown as follows:
PE_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad X = X_0 + PE
In the above formulas, pos denotes the position of a single feature in the input feature sequence X, i denotes the dimension index, d denotes the overall dimension of the feature, PE_(pos,2i) denotes the position code of dimension 2i at position pos, PE_(pos,2i+1) denotes the position code of dimension 2i+1 at position pos, X_0 denotes the input feature sequence, X denotes the feature sequence with position information introduced, and PE denotes the position-coding matrix. Position coding preserves the relative position information in the feature sequence: a sine transform is applied at the even-numbered dimensions of the feature sequence and a cosine transform at the odd-numbered dimensions to obtain the position-coding matrix PE, which is then added to the original input X_0 to give the feature sequence X with position information introduced. In this embodiment, the aligned text, audio and video feature sequences l, a and v are position-coded and then input into their respective multi-branch sparse attention modules to learn the single-modality context information X^L, X^A and X^V.
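A minimal sketch of the sinusoidal position coding described above (sine on the even dimensions, cosine on the odd dimensions, added to the input sequence); the helper name is an assumption, and the shapes follow the (20, 300), (20, 74) and (20, 35) sequences of this embodiment.

    import numpy as np

    def positional_encoding(seq_len, d):
        """Position-coding matrix PE with sin on even and cos on odd dimensions."""
        pos = np.arange(seq_len)[:, None]                       # (seq_len, 1)
        i = np.arange(d)[None, :]                               # (1, d)
        angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
        pe = np.zeros((seq_len, d))
        pe[:, 0::2] = np.sin(angles[:, 0::2])
        pe[:, 1::2] = np.cos(angles[:, 1::2])
        return pe

    # X = X_0 + PE: add position information to an aligned feature sequence,
    # e.g. X_text = text_feats + positional_encoding(20, 300)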
As shown in fig. 3, the processing of the position-coded input feature sequence by the multi-branch sparse attention module in step 3) of this embodiment comprises: first performing multi-head dimensionality reduction on the input position-coded feature sequence and extracting sparse attention (SparseAttention); in parallel, extracting the local correlations of the position-coded feature sequence through a convolution layer and activating the output with a gated linear unit; and then adding the extracted multi-head feature to the output activated by the gated linear unit to obtain the high-dimensional features of each modality;
Performing multi-head dimensionality reduction means projecting the input feature sequence into 6 different feature spaces according to the following formula to obtain the query, key and value features in each of the 6 feature spaces:
X_i^q = X W_i^q, \qquad X_i^k = X W_i^k, \qquad X_i^v = X W_i^v, \qquad i = 1, 2, \dots, 6
In the above formula, W_i^q, W_i^k and W_i^v are the query, key and value weight matrices, and X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th of the 6 feature spaces; the query, key and value weight matrices project the input feature sequence X into the 6 different feature spaces;
Extracting sparse attention (SparseAttention) means computing, from the query, key and value features projected into the 6 feature spaces, the sparse attention head head_i of the i-th feature space according to the following formula:
head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right) X_i^v
In the above formula, head_i is the head in the i-th feature space, SparseAttention is the sparse attention computation network, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, d_k is the dimension of the input feature sequence, and sparse(X_i^q X_i^{k\top}) is the sparse similarity matrix, computed as:
M = softmax\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right), \qquad T = sigmoid\big(linear(pool(CNN(M)))\big), \qquad sparse(X_i^q X_i^{k\top}) = relu(M - T)
In the above formula, X is the input feature sequence, M is the similarity matrix of the input features, softmax denotes the softmax function, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, T is an intermediate variable (the threshold map), sigmoid denotes the sigmoid function, linear is a linear function, pool() is a pooling operation with a 2 x 2 window and stride 1, CNN() is a convolution with a 1 x 1 kernel and stride 1, and the linear function regresses the sparse attention threshold from the pooled activation values;
The head head_i of each feature space is then concatenated into the multi-head feature using:
MultiHead(X_i^q, X_i^k, X_i^v) = concat(head_1, head_2, \dots, head_6)\, W^O
In the above formula, MultiHead(X_i^q, X_i^k, X_i^v) is the obtained multi-head feature, concat denotes concatenation along the feature dimension, head_1 to head_6 are the heads in the 1st to 6th feature spaces, and W^O is the output weight matrix; the features obtained from the 6 feature spaces are concatenated at the output and multiplied by the output weight matrix W^O to produce the module output. Considering the temporal sparsity of long feature sequences, the multi-head sparse attention module of this embodiment computes the head of the i-th feature space as
head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right) X_i^v
where SparseAttention is the sparse attention computation network, X_i^q, X_i^k and X_i^v are the query, key and value features, and d_k is the dimension of the input feature sequence.
The output activated by the gated linear unit is given by:
h_l(X) = (X \ast W + b) \otimes sigmoid(X \ast V + c)
In the above formula, W and V are different convolution kernels, b and c are bias parameters, X is the input feature sequence with position information introduced, and h_l(X) is the output activated by the gated linear unit. The output of the convolution branch h_l(X) is the element-wise product of the convolution-layer output without nonlinear transformation and the convolution-layer output passed through a sigmoid nonlinearity.
In this embodiment, a sigmoid function is used as the activation for the threshold when computing sparse attention; its expression is:
T(i,j) = sigmoid\big(m(i,j)\big) = \frac{1}{1 + e^{-m(i,j)}}
In the above equation, sigmoid is the activation function, e^x is the exponential function, m(i,j) is the element at position (i,j) of the input threshold map to be activated, and T(i,j) is the element at position (i,j) of the output threshold map.
In this embodiment, the correlation weight matrix is sparsified using the linear rectification function relu, computed as:
f(M - T) = relu(M - T) = \max(0,\ M - T)
In the above formula, f(M - T) is the sparsified correlation weight matrix, M is the correlation matrix, and T is the threshold matrix. The values of the correlation matrix M are compared with those of the threshold matrix T, and the linear rectification function relu yields the final sparse attention matrix.
In this embodiment, the context feature information captured by sparse attention and the local feature information extracted by the convolution branch are added element-wise to obtain the single-modality features X_f^L, X_f^A and X_f^V.
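A PyTorch sketch of one multi-branch sparse attention module as described above: six sparse-attention heads whose threshold is regressed from a 1 x 1 convolution, 2 x 2 pooling, a linear layer and a sigmoid, the similarity matrix then being sparsified with relu(M - T), in parallel with a convolution branch activated by a gated linear unit, the two branches being added element-wise. The class names, the kernel size of the gated convolution branch, the choice of average pooling and the per-head dimensionality are assumptions of this sketch, not a definitive implementation of the patent.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseAttention(nn.Module):
        """Scaled dot-product attention whose similarity matrix is sparsified by
        a learned threshold: relu(M - sigmoid(linear(pool(cnn(M)))))."""
        def __init__(self, seq_len):
            super().__init__()
            self.cnn = nn.Conv2d(1, 1, kernel_size=1, stride=1)
            self.pool = nn.AvgPool2d(kernel_size=2, stride=1)
            self.linear = nn.Linear((seq_len - 1) ** 2, seq_len * seq_len)
            self.seq_len = seq_len

        def forward(self, q, k, v):                               # each (B, L, d_head)
            d_k = q.size(-1)
            m = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)   # (B, L, L)
            t = self.pool(self.cnn(m.unsqueeze(1))).flatten(1)    # pooled activation values
            t = torch.sigmoid(self.linear(t)).view(-1, self.seq_len, self.seq_len)
            return F.relu(m - t) @ v                              # sparsified attention output

    class MultiBranchSparseAttention(nn.Module):
        """Six sparse-attention heads plus a gated-convolution branch, summed element-wise."""
        def __init__(self, d_model, seq_len=20, n_heads=6):
            super().__init__()
            d_head = max(d_model // n_heads, 1)
            self.proj = nn.ModuleList(
                [nn.ModuleDict({c: nn.Linear(d_model, d_head, bias=False)
                                for c in ("q", "k", "v")}) for _ in range(n_heads)])
            self.attn = SparseAttention(seq_len)
            self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)
            # gated linear unit branch: two 1-D convolutions over the time axis (kernel size assumed)
            self.conv_a = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
            self.conv_b = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)

        def forward(self, x):                                     # x: (B, L, d_model)
            heads = [self.attn(p["q"](x), p["k"](x), p["v"](x)) for p in self.proj]
            multi = self.w_o(torch.cat(heads, dim=-1))            # concat(head_1..head_6) W^O
            xc = x.transpose(1, 2)                                # (B, d_model, L) for Conv1d
            glu = (self.conv_a(xc) * torch.sigmoid(self.conv_b(xc))).transpose(1, 2)
            return multi + glu                                    # single-modality feature X_f

For example, MultiBranchSparseAttention(300) applied to a (batch, 20, 300) text sequence returns the single-modality text feature X_f^L; the audio and video branches use their own modules with d_model = 74 and 35.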
In step 4) of this embodiment, the high-dimensional features of each modality are decomposed into low-dimensional feature subspaces to obtain the low-dimensional features according to the following expression:
X_i^L = X_f^L W_i^L, \qquad X_i^A = X_f^A W_i^A, \qquad X_i^V = X_f^V W_i^V, \qquad i = 1, 2, \dots, 6
In the above formula, X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in the i-th low-dimensional feature subspace, X_f^L, X_f^A and X_f^V are the three single-modality features containing context information, and W_i^L, W_i^A and W_i^V are the dimension-reduction matrices corresponding to the three modalities (these matrices project the input single-modality feature sequences X_f^L, X_f^A and X_f^V into 6 different low-dimensional feature spaces);
When corresponding weights are assigned to the low-dimensional features in step 4), the weights satisfy the following condition:
α_i + β_i + γ_i = 1, i = 1, 2, …, 6
where α_i, β_i and γ_i are the weights assigned to the low-dimensional features of the respective modalities;
The function expression for concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights in step 4) is as follows:
F_i = concat(\alpha_i X_i^L,\ \beta_i X_i^A,\ \gamma_i X_i^V)
In the above formula, F_i is the concatenated low-dimensional feature of the i-th low-dimensional feature subspace, α_i, β_i and γ_i are the weights assigned to the low-dimensional features, and X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in that subspace.
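The subspace decomposition and weighted concatenation of step 4) could be sketched as follows; the low-dimensional size d_low, the softmax parameterisation used to keep alpha_i + beta_i + gamma_i = 1, and all class and parameter names are assumptions introduced for illustration.

    import torch
    import torch.nn as nn

    class SubspaceFusion(nn.Module):
        """Project each modality into 6 low-dimensional subspaces and concatenate
        them with per-subspace weights satisfying alpha_i + beta_i + gamma_i = 1."""
        def __init__(self, d_text=300, d_audio=74, d_video=35, d_low=32, n_sub=6):
            super().__init__()
            self.w_l = nn.ModuleList([nn.Linear(d_text, d_low, bias=False) for _ in range(n_sub)])
            self.w_a = nn.ModuleList([nn.Linear(d_audio, d_low, bias=False) for _ in range(n_sub)])
            self.w_v = nn.ModuleList([nn.Linear(d_video, d_low, bias=False) for _ in range(n_sub)])
            # unconstrained logits, softmaxed so that the three weights of each subspace sum to 1
            self.logits = nn.Parameter(torch.zeros(n_sub, 3))

        def forward(self, x_l, x_a, x_v):                 # (B, L, d_text) / (B, L, d_audio) / (B, L, d_video)
            weights = torch.softmax(self.logits, dim=-1)  # rows are (alpha_i, beta_i, gamma_i)
            subspaces = []
            for i, (wl, wa, wv) in enumerate(zip(self.w_l, self.w_a, self.w_v)):
                a, b, g = weights[i]
                subspaces.append(torch.cat([a * wl(x_l), b * wa(x_a), g * wv(x_v)], dim=-1))
            return subspaces                              # list of F_1..F_6, each (B, L, 3 * d_low)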
In this embodiment, step 5) comprises:
5.1) calculating the head of the multi-branch sparse attention network according to the following formula;
head_i^f = SparseAttention(F_i, F_i, F_i)
In the above formula, head_i^f denotes the i-th head of the multi-branch sparse attention network, SparseAttention is the sparse attention computation network, and F_i denotes the concatenated low-dimensional feature of the i-th low-dimensional feature subspace;
5.2) calculating the fused multi-modal information according to the following formula;
F_f = MultiHead(F, F, F) = concat(head_1^f, head_2^f, \dots, head_6^f)\, W_f^O
In the above formula, F_f denotes the fused multi-modal information, MultiHead(F, F, F) denotes the multi-head computation over the concatenated low-dimensional features, head_1^f to head_6^f are the 1st to 6th heads of the multi-branch sparse attention network, and W_f^O is the output weight matrix. The emotion classifier in step 5) of this embodiment is composed of a fully connected layer and a softmax function, where the computational expression of the fully connected layer is:
Y=WF f +B
In the above formula, F_f denotes the fused multi-modal information, B is the corresponding bias, W is the weight matrix of all neurons, Y is the output of the fully connected layer, and the dimension of Y equals the number of emotion categories;
wherein the computational expression of the softmax function is:
y_i = softmax(Y_i) = \frac{e^{Y_i}}{\sum_{j=1}^{N} e^{Y_j}}
In the above formula, y_i is the probability of the i-th emotion category, softmax is the normalized exponential function, Y_i is the output of the fully connected layer corresponding to the i-th emotion category, Y_j is the output corresponding to the j-th emotion category, and N is the number of emotion categories. During training of the emotion classifier, the concatenated low-dimensional features are trained in the multi-branch sparse attention network to obtain the fused multi-modal information, the emotion category information is obtained through the emotion classifier, and the network parameters are solved. The loss function used to solve the network parameters is the L1 loss (L1Loss), whose expression is:
L1Loss = \frac{1}{n} \sum_{i=1}^{n} \left| y_i^p - y_i \right|
In the above formula, y_i^p denotes the predicted probability of the i-th emotion category, y_i is the probability of the i-th emotion category, and n is the number of emotion categories.
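A sketch of the inter-modality fusion of step 5) and the emotion classifier of step 6) follows, reusing the SparseAttention class from the earlier sketch: every concatenated subspace feature F_i drives one sparse-attention head, the six heads are concatenated and projected by W_f^O, and the fused feature is classified by a fully connected layer with softmax trained under an L1 loss. The temporal mean pooling of the fused sequence, the feature sizes and the class names are assumptions, not the exact patented network.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        """head_i^f = SparseAttention(F_i, F_i, F_i); F_f = concat(head_1^f..head_6^f) W_f^O."""
        def __init__(self, d_sub, seq_len=20, n_sub=6, d_out=192):
            super().__init__()
            self.attn = SparseAttention(seq_len)          # class defined in the earlier sketch
            self.w_f_o = nn.Linear(n_sub * d_sub, d_out, bias=False)

        def forward(self, subspaces):                     # list of 6 tensors (B, L, d_sub)
            heads = [self.attn(f, f, f) for f in subspaces]
            return self.w_f_o(torch.cat(heads, dim=-1))   # fused sequence (B, L, d_out)

    class EmotionClassifier(nn.Module):
        """Fully connected layer followed by softmax over N emotion categories."""
        def __init__(self, d_fused, n_classes):
            super().__init__()
            self.fc = nn.Linear(d_fused, n_classes)       # Y = W F_f + B

        def forward(self, f_fused):                       # (B, d_fused)
            return torch.softmax(self.fc(f_fused), dim=-1)

    # one training step with an L1 loss between predicted and target class probabilities
    fusion = CrossModalFusion(d_sub=96)
    classifier = EmotionClassifier(d_fused=192, n_classes=7)
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(list(fusion.parameters()) + list(classifier.parameters()), lr=1e-4)

    subspaces = [torch.randn(8, 20, 96) for _ in range(6)]    # stand-ins for F_1..F_6
    target = torch.eye(7)[torch.randint(0, 7, (8,))]          # one-hot emotion labels
    optimizer.zero_grad()
    probs = classifier(fusion(subspaces).mean(dim=1))         # temporal mean pooling (assumed)
    criterion(probs, target).backward()
    optimizer.step()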
The trained multi-modal emotion recognition network analyses the input audio, facial video and text and predicts the user's emotion category contained in the multi-modal data: high-dimensional features are extracted from audio, video and text and aligned at the word level; position information is introduced into the feature sequences by the position-coding module; for the three modalities, multi-branch sparse attention modules extract the intra-modality context information; the three modality features are then reduced into different subspaces and the modality features in each subspace are given corresponding weights; a multi-branch sparse attention module extracts the inter-modality context information, and the emotion classifier finally outputs the user's emotion category contained in the multi-modal data.
The multi-modal emotion recognition method based on subspace sparse feature fusion of this embodiment was verified experimentally. Table 1 shows the emotion recognition results of the multi-modal subspace information fusion network on the CMU-MOSI and CMU-MOSEI datasets, where Acc2 denotes binary emotion classification accuracy, Acc7 denotes seven-class emotion classification accuracy, and F1 denotes the binary F1 score. MFN is the method of Amir Zadeh et al., "Memory Fusion Network for Multi-view Sequential Learning" (Thirty-Second AAAI Conference on Artificial Intelligence, 2018), and RAVEN is the method of "Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors" (Thirty-Third AAAI Conference on Artificial Intelligence, 2019).
Table 1: evaluation results of the method of this embodiment on the public emotion recognition datasets.
[Table 1 is presented as an image in the original publication; it lists the Acc2, Acc7 and F1 scores of MFN, RAVEN and the proposed method on the CMU-MOSI and CMU-MOSEI datasets.]
As can be seen from Table 1, the multi-modal emotion recognition method based on subspace sparse feature fusion achieves accurate multi-modal emotion classification on the CMU-MOSI and CMU-MOSEI datasets.
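For reference, the Acc2, Acc7 and F1 metrics on CMU-MOSI/CMU-MOSEI are conventionally computed from continuous sentiment predictions in [-3, 3] as sketched below; this reflects the usual evaluation protocol of these datasets and is not a procedure taken from the patent itself.

    import numpy as np
    from sklearn.metrics import f1_score

    def mosi_metrics(y_pred, y_true):
        """Binary accuracy (Acc2), seven-class accuracy (Acc7) and weighted binary F1
        for sentiment scores in [-3, 3]."""
        y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
        acc7 = np.mean(np.round(np.clip(y_pred, -3, 3)) == np.round(np.clip(y_true, -3, 3)))
        pred_bin, true_bin = y_pred >= 0, y_true >= 0
        acc2 = np.mean(pred_bin == true_bin)
        f1 = f1_score(true_bin, pred_bin, average="weighted")
        return acc2, acc7, f1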
In addition, the present embodiment also provides a multi-modal emotion recognition system based on subspace sparse feature fusion, which includes a computer device, where the computer device at least includes a microprocessor and a memory, where the microprocessor is programmed or configured to execute the steps of the foregoing multi-modal emotion recognition method based on subspace sparse feature fusion, or the memory stores a computer program that is programmed or configured to execute the foregoing multi-modal emotion recognition method based on subspace sparse feature fusion. In addition, the embodiment also provides a computer readable storage medium, which stores a computer program programmed or configured to execute the foregoing multi-modal emotion recognition method based on subspace sparse feature fusion.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM and optical storage) having computer-usable program code embodied therein. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application, and it should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. A multi-modal emotion recognition method based on subspace sparse feature fusion is characterized by comprising the following steps:
1) acquiring the current feature sequences of multiple modalities of the object to be recognized;
2) performing word-level alignment and normalization on the feature sequences of the multiple modalities;
3) applying position coding to the feature sequence of each modality of the object to be recognized to obtain feature sequences with position information introduced, and then inputting the position-coded feature sequence of each modality into the corresponding multi-branch sparse attention module to obtain the high-dimensional features of each modality;
4) decomposing the high-dimensional features of each modality into low-dimensional feature subspaces to obtain low-dimensional features, assigning corresponding weights to the low-dimensional features, and then concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights to obtain concatenated low-dimensional features;
5) training the concatenated low-dimensional features in a multi-branch sparse attention network to obtain fused multi-modal information;
6) inputting the fused multi-modal information into a pre-trained emotion classifier to obtain the current emotion category of the object to be recognized, wherein the emotion classifier is pre-trained to establish the mapping between fused multi-modal information and emotion categories.
2. The method for multi-modal emotion recognition based on subspace sparse feature fusion as recited in claim 1, wherein the features of the plurality of modalities in step 1) comprise a text feature sequence, an audio feature sequence and a video feature sequence.
3. The method for multi-modal emotion recognition based on subspace sparse feature fusion as recited in claim 2, wherein step 2) comprises: aligning the audio feature sequence and the video feature sequence with the text feature sequence by recording the start and end times of the i-th word and averaging the audio and video features within the corresponding time span; normalizing the aligned text, audio and video feature sequences to the range [0,1]; and finally limiting the length of the text content, truncating the excess and zero-padding the shortfall, so that the feature dimensions of the text, audio and video feature sequences are unified to (20, 300), (20, 74) and (20, 35), respectively.
4. The method for multi-modal emotion recognition based on subspace sparse feature fusion as claimed in claim 1, wherein the position-encoded function expression in step 3) is as follows:
PE_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad X = X_0 + PE
In the above formulas, pos denotes the position of a single feature in the input feature sequence X, i denotes the dimension index, d denotes the overall dimension of the feature, PE_(pos,2i) denotes the position code of dimension 2i at position pos in the position-coding matrix PE, PE_(pos,2i+1) denotes the position code of dimension 2i+1 at position pos, X_0 denotes the input feature sequence, X denotes the feature sequence with position information introduced, and PE denotes the position-coding matrix.
5. The method for multi-modal emotion recognition based on subspace sparse feature fusion as claimed in claim 2, wherein the processing of the position-coded input feature sequence by the multi-branch sparse attention module in step 3) comprises: first performing multi-head dimensionality reduction on the input position-coded feature sequence and extracting sparse attention (SparseAttention); in parallel, extracting the local correlations of the position-coded feature sequence through a convolution layer and activating the output with a gated linear unit; and then adding the extracted multi-head feature to the output activated by the gated linear unit to obtain the high-dimensional features of each modality;
Performing multi-head dimensionality reduction means projecting the input feature sequence into 6 different feature spaces according to the following formula to obtain the query, key and value features in each of the 6 feature spaces:
X_i^q = X W_i^q, \qquad X_i^k = X W_i^k, \qquad X_i^v = X W_i^v, \qquad i = 1, 2, \dots, 6
In the above formula, W_i^q, W_i^k and W_i^v are the query, key and value weight matrices, and X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th of the 6 feature spaces;
Extracting sparse attention (SparseAttention) means computing, from the query, key and value features projected into the 6 feature spaces, the sparse attention head head_i of the i-th feature space according to the following formula:
head_i = SparseAttention(X_i^q, X_i^k, X_i^v) = sparse\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right) X_i^v
In the above formula, head_i is the head in the i-th feature space, SparseAttention is the sparse attention computation network, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, d_k is the dimension of the input feature sequence, and sparse(X_i^q X_i^{k\top}) is the sparse similarity matrix, computed as:
M = softmax\!\left(\frac{X_i^q X_i^{k\top}}{\sqrt{d_k}}\right), \qquad T = sigmoid\big(linear(pool(CNN(M)))\big), \qquad sparse(X_i^q X_i^{k\top}) = relu(M - T)
In the above formula, X is the input feature sequence, M is the similarity matrix of the input features, softmax denotes the softmax function, X_i^q, X_i^k and X_i^v are the query, key and value features projected into the i-th feature space, T is an intermediate variable (the threshold map), sigmoid denotes the sigmoid function, linear is a linear function, pool() is a pooling operation with a 2 x 2 window and stride 1, CNN() is a convolution with a 1 x 1 kernel and stride 1, and the linear function regresses the sparse attention threshold from the pooled activation values;
The head head_i of each feature space is then concatenated into the multi-head feature using:
MultiHead(X_i^q, X_i^k, X_i^v) = concat(head_1, head_2, \dots, head_6)\, W^O
In the above formula, MultiHead(X_i^q, X_i^k, X_i^v) is the obtained multi-head feature, concat denotes concatenation along the feature dimension, head_1 to head_6 are the heads in the 1st to 6th feature spaces, and W^O is the output weight matrix;
the output activated by the gated linear unit is given by:
h_l(X) = (X \ast W + b) \otimes sigmoid(X \ast V + c)
In the above formula, W and V are different convolution kernels, b and c are bias parameters, X is the input feature sequence with position information introduced, and h_l(X) is the output activated by the gated linear unit.
6. The multi-modal emotion recognition method based on subspace sparse feature fusion as claimed in claim 2, wherein the functional expression of the low-dimensional features obtained by decomposing the high-dimensional features corresponding to each modality into the low-dimensional feature subspace in step 4) is as follows:
X_i^L = X_f^L W_i^L, \qquad X_i^A = X_f^A W_i^A, \qquad X_i^V = X_f^V W_i^V, \qquad i = 1, 2, \dots, 6
In the above formula, X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in the i-th low-dimensional feature subspace, X_f^L, X_f^A and X_f^V are the three single-modality features containing context information, and W_i^L, W_i^A and W_i^V are the dimension-reduction matrices corresponding to the three modalities;
When corresponding weights are assigned to the low-dimensional features in step 4), the weights satisfy the following condition:
α_i + β_i + γ_i = 1, i = 1, 2, …, 6
where α_i, β_i and γ_i are the weights assigned to the low-dimensional features of the respective modalities;
The function expression for concatenating all low-dimensional features in each low-dimensional feature subspace according to the weights in step 4) is as follows:
F_i = concat(\alpha_i X_i^L,\ \beta_i X_i^A,\ \gamma_i X_i^V)
In the above formula, F_i is the concatenated low-dimensional feature of the i-th low-dimensional feature subspace, α_i, β_i and γ_i are the weights assigned to the low-dimensional features, and X_i^L, X_i^A and X_i^V are the low-dimensional features of each modality in that subspace.
7. The multi-modal emotion recognition method based on subspace sparse feature fusion as claimed in claim 6, wherein step 5) comprises:
1) calculating a head of the multi-branch sparse attention network according to the following formula;
head_i^f = SparseAttention(F_i, F_i, F_i)
In the above formula, head_i^f denotes the i-th head of the multi-branch sparse attention network, SparseAttention is the sparse attention computation network, and F_i denotes the concatenated low-dimensional feature of the i-th low-dimensional feature subspace;
2) calculating fused multi-modal information according to the following formula;
F_f = MultiHead(F, F, F) = concat(head_1^f, head_2^f, \dots, head_6^f)\, W_f^O
In the above formula, F_f denotes the fused multi-modal information, MultiHead(F, F, F) denotes the multi-head computation over the concatenated low-dimensional features, head_1^f to head_6^f are the 1st to 6th heads of the multi-branch sparse attention network, and W_f^O is the output weight matrix.
8. The multi-modal emotion recognition method based on subspace sparse feature fusion as claimed in claim 7, wherein the emotion classifier in step 5) is composed of a fully connected layer and a softmax function, where the computational expression of the fully connected layer is:
Y=WF f +B
In the above formula, F_f denotes the fused multi-modal information, B is the corresponding bias, W is the weight matrix of all neurons, Y is the output of the fully connected layer, and the dimension of Y equals the number of emotion categories;
wherein the computational expression of the softmax function is:
y_i = softmax(Y_i) = \frac{e^{Y_i}}{\sum_{j=1}^{N} e^{Y_j}}
In the above formula, y_i is the probability of the i-th emotion category, softmax is the normalized exponential function, Y_i is the output of the fully connected layer corresponding to the i-th emotion category, Y_j is the output corresponding to the j-th emotion category, and N is the number of emotion categories.
9. A multi-modal emotion recognition system based on subspace sparse feature fusion, comprising a computer device including at least a microprocessor and a memory, wherein the microprocessor is programmed or configured to perform the steps of the multi-modal emotion recognition method based on subspace sparse feature fusion of any one of claims 1 to 8, or the memory has stored therein a computer program programmed or configured to perform the multi-modal emotion recognition method based on subspace sparse feature fusion of any one of claims 1 to 8.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and is programmed or configured to execute the method for multi-modal emotion recognition based on subspace sparse feature fusion according to any one of claims 1 to 8.
CN202011019175.9A 2020-09-25 2020-09-25 Multi-modal emotion recognition method and system based on subspace sparse feature fusion Active CN111931795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011019175.9A CN111931795B (en) 2020-09-25 2020-09-25 Multi-modal emotion recognition method and system based on subspace sparse feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011019175.9A CN111931795B (en) 2020-09-25 2020-09-25 Multi-modal emotion recognition method and system based on subspace sparse feature fusion

Publications (2)

Publication Number Publication Date
CN111931795A true CN111931795A (en) 2020-11-13
CN111931795B CN111931795B (en) 2020-12-25

Family

ID=73335167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011019175.9A Active CN111931795B (en) 2020-09-25 2020-09-25 Multi-modal emotion recognition method and system based on subspace sparse feature fusion

Country Status (1)

Country Link
CN (1) CN111931795B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190286889A1 (en) * 2018-03-19 2019-09-19 Buglife, Inc. Lossy facial expression training data pipeline
CN110188343A (en) * 2019-04-22 2019-08-30 浙江工业大学 Multi-modal emotion identification method based on fusion attention network
CN110209824A (en) * 2019-06-13 2019-09-06 中国科学院自动化研究所 Text emotion analysis method based on built-up pattern, system, device
CN111460213A (en) * 2020-03-20 2020-07-28 河海大学 Music emotion classification method based on multi-mode learning

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633364A (en) * 2020-12-21 2021-04-09 上海海事大学 Multi-modal emotion recognition method based on Transformer-ESIM attention mechanism
CN112633364B (en) * 2020-12-21 2024-04-05 上海海事大学 Multimode emotion recognition method based on transducer-ESIM attention mechanism
CN112598067A (en) * 2020-12-25 2021-04-02 中国联合网络通信集团有限公司 Emotion classification method and device for event, electronic equipment and storage medium
CN113177546A (en) * 2021-04-30 2021-07-27 中国科学技术大学 Target detection method based on sparse attention module
CN113378942A (en) * 2021-06-16 2021-09-10 中国石油大学(华东) Small sample image classification method based on multi-head feature cooperation
CN114022668A (en) * 2021-10-29 2022-02-08 北京有竹居网络技术有限公司 Method, device, equipment and medium for aligning text with voice
CN114022668B (en) * 2021-10-29 2023-09-22 北京有竹居网络技术有限公司 Method, device, equipment and medium for aligning text with voice
CN113870259A (en) * 2021-12-02 2021-12-31 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion assessment method, device, equipment and storage medium
CN113870259B (en) * 2021-12-02 2022-04-01 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion assessment method, device, equipment and storage medium
WO2023098524A1 (en) * 2021-12-02 2023-06-08 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion evaluation method and apparatus, device, and storage medium
CN114926716A (en) * 2022-04-08 2022-08-19 山东师范大学 Learning participation degree identification method, device and equipment and readable storage medium
CN114863520A (en) * 2022-04-25 2022-08-05 陕西师范大学 Video expression recognition method based on C3D-SA

Also Published As

Publication number Publication date
CN111931795B (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN111931795B (en) Multi-modal emotion recognition method and system based on subspace sparse feature fusion
CN109522818B (en) Expression recognition method and device, terminal equipment and storage medium
CN108804530B (en) Subtitling areas of an image
WO2021233112A1 (en) Multimodal machine learning-based translation method, device, equipment, and storage medium
CN113420807A (en) Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
Huang et al. Multimodal continuous emotion recognition with data augmentation using recurrent neural networks
CN113822192A (en) Method, device and medium for identifying emotion of escort personnel based on Transformer multi-modal feature fusion
CN113723166A (en) Content identification method and device, computer equipment and storage medium
Wöllmer et al. Analyzing the memory of BLSTM neural networks for enhanced emotion classification in dyadic spoken interactions
Zhang et al. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
CN110705490B (en) Visual emotion recognition method
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN112926525A (en) Emotion recognition method and device, electronic equipment and storage medium
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN113435421B (en) Cross-modal attention enhancement-based lip language identification method and system
Ousmane et al. Automatic recognition system of emotions expressed through the face using machine learning: Application to police interrogation simulation
Cornia et al. A unified cycle-consistent neural model for text and image retrieval
CN117132923A (en) Video classification method, device, electronic equipment and storage medium
Xue et al. Lcsnet: End-to-end lipreading with channel-aware feature selection
Hou et al. Confidence-guided self refinement for action prediction in untrimmed videos
Yang et al. Self-adaptive context and modal-interaction modeling for multimodal emotion recognition
CN116226347A (en) Fine granularity video emotion content question-answering method and system based on multi-mode data
WO2021129410A1 (en) Method and device for text processing
CN114462073A (en) De-identification effect evaluation method and device, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant