CN112541541B - Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion - Google Patents

Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Info

Publication number
CN112541541B
CN112541541B CN202011452285.4A
Authority
CN
China
Prior art keywords
matrix
modal
matching
vector
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011452285.4A
Other languages
Chinese (zh)
Other versions
CN112541541A (en)
Inventor
李康
孔万增
金宣妤
唐佳佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202011452285.4A priority Critical patent/CN112541541B/en
Publication of CN112541541A publication Critical patent/CN112541541A/en
Application granted granted Critical
Publication of CN112541541B publication Critical patent/CN112541541B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Optimization (AREA)
  • Molecular Biology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Biophysics (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a lightweight multi-modal emotion analysis method based on multi-element layering depth fusion. The method establishes direct correlations among multi-modal elements in a hierarchical manner and can capture both short-term and long-term dependencies among different modalities. To avoid reducing resolution and to preserve the original spatial structure information of each modality, the invention applies the corresponding attention weights in broadcast form when selecting and emphasizing multi-modal information interactions. In addition, the invention proposes a new tensor operator, called Ex-KR addition, which fuses multi-modal element information through shared information to obtain a global tensor characterization. This effectively addresses the problem that most methods in the current multi-modal emotion recognition field only model a local time-series multi-modal space and cannot explicitly learn a complete representation of all modalities participating in the fusion.

Description

Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion
Technical Field
The invention belongs to the field of multi-modal emotion analysis at the intersection of natural language processing, computer vision and speech signal processing, and particularly relates to a cross-modal, multi-element hierarchical deep fusion technique.
Background
With the recent progress of machine learning research, the analysis of multi-modal time-series data has become an increasingly important research field. Multi-modal learning aims at constructing neural networks that can process and integrate information from multiple modalities, and multi-modal emotion analysis is one research sub-field of multi-modal learning. When a person expresses his or her emotion (negative or positive) in real life, several different categories of information are involved in the communication, including speech content (text modality), facial expressions and gestures (visual modality), and prosodic features of the voice (acoustic modality). Video is an important source of multi-modal data, providing visual, acoustic and text data simultaneously. For unified presentation and ease of understanding, we refer to each word in the speech as a text element, each frame in the video as a visual element, and the prosodic features corresponding to each word as an acoustic element. Through research on the three-modality data, we found that two types of element interactions exist simultaneously: intra-modality element interactions and inter-modality element interactions. Moreover, as time progresses, the element interactions between modalities provide increasingly complex complementary information across the time domain and the multi-modal data. If this complementary information between the multi-modal data is taken into account during the analysis, a person's emotional state can be analyzed more reliably and more accurately.
However, the heterogeneous properties of multi-modal data, such as the proper alignment between elements and the short-term and long-term dependencies between different modalities, are both important complementary information and major challenges that researchers must face when analyzing cross-modal information. Existing multi-modal fusion methods typically focus on modeling a local time-series multi-modal space and cannot explicitly learn a complete multi-dimensional representation of all the modalities involved in the fusion. For example, some methods compress the time-series data of the three modalities into three vectors for subsequent multi-modal fusion, so that the subsequent steps cannot perceive critical cross-modality timing information. Other approaches attempt to translate from a source modality to a target modality in order to learn a joint representation of the two. However, such "translation" methods are mainly implemented between two modalities and must include two-way translations, so that the joint representation has strong local features but lacks important global multi-modal properties.
Disclosure of Invention
To address the shortcomings of the prior art, the invention introduces a lightweight multi-modal fusion method based on a multi-element hierarchical deep fusion technique. The method propagates and integrates local associations into a common global space through two different types of matching mechanisms. When the multi-modal elements are matched and fused, a new tensor operator containing no additional parameters is provided; the operator combines the local matrix representations using the shared information of a specific modality to obtain a global tensor, thereby achieving more lightweight multi-modal data modeling.
The specific steps of the invention are as follows:
Step 1: collect data of n modalities of the tested subject, and convert each modality's data into matrix form; n ≥ 3.
Step 2: perform interaction modeling on each modality's data to obtain a matrix containing that modality's context information.
Step 3: take the matrix of one modality as the central matrix; perform element interaction matching between each of the remaining n-1 modality matrices and the central matrix to obtain n-1 matching matrices.
Step 4: generate a weight vector for each of the n-1 matching matrices obtained in Step 3.
Step 5: update the n-1 matching matrices respectively with the n-1 weight vectors obtained in Step 4.
Step 6: perform multi-modal element matching with the n-1 updated matching matrices to obtain an n-dimensional tensor P.
Step 7: compute the implicit matrix and weight vector of the central matrix, and update the tensor P with the central matrix's weight vector to obtain the tensor P′.
Step 8: predict the emotion distribution vector label from the tensor P′. The i-th element label[i] of the emotion distribution vector label is the probability that the tested subject is in the i-th emotion; this determines the emotion category of the tested subject at the time the multi-modal data was acquired.
Preferably, in Step 2, the matrix containing the context information is H = bi_FCN(E), where the intermediate matrix E = LSTM(X); LSTM(·) denotes a bidirectional long short-term memory network layer transformation and bi_FCN(·) denotes a nonlinear feed-forward fully connected layer transformation. The matrix X is the single-modality matrix obtained in Step 1.
Preferably, in Step 3 the matching matrix M_xy between a modality matrix H_x and the central matrix H_y is obtained by a bilinear transformation with a weight matrix W_1, where W_1 is obtained by adapting to the context information of the matrices H_x and H_y.
Preferably, the implicit matrix R_x and the weight vector α_x of the matching matrix M_xy in Step 4 are expressed as follows:
R_x = tanh(W_x M_xy^T + b_x)
α_x = softmax(w_x R_x)
where tanh(·) denotes the hyperbolic tangent function and softmax(·) denotes the logistic regression function; W_x is the parameter matrix used to compute the implicit matrix R_x; w_x is the parameter matrix used to compute the weight vector α_x; b_x is the bias used to compute the implicit matrix R_x.
Preferably, in Step 5 the matching matrix M_xy is updated to obtain the matching matrix M'_xy. The i-th row of M'_xy is M'_xy[i,:] = M_xy[i,:] α_x[i], where M_xy[i,:] denotes the i-th row of M_xy and α_x[i] denotes the i-th element of the vector α_x.
Preferably, in Step 6, when the matching matrices M'_xy and M'_zy undergo multi-modal element matching, the sub-vectors of the resulting tensor P are P[:,k,j] = M'_xy[:,j] + M'_zy[k,j], where M'_xy[:,j] is a column of M'_xy and M'_zy[k,j] is an element of M'_zy.
Preferably, in Step 7 the implicit matrix R_y of the central matrix is given by equation (10), and the weight vector α_y of the central matrix is given by equation (11):
R_y = tanh(W_y P^T + b_y) (10)
α_y = softmax(w_y R_y) (11)
where the matrix P in equation (10) is the unfolding matrix of the multi-dimensional tensor P; W_y is the parameter matrix used to compute the implicit matrix R_y; w_y is the parameter matrix used to compute the weight vector α_y; b_y is the bias used to compute the implicit matrix R_y.
The sub-matrices of the tensor P′ are P'[:,:,i] = P[:,:,i] α_y[i], where P[:,:,i] is a sub-matrix of the tensor P and α_y[i] is an element of the weight vector α_y.
Preferably, the specific process of Step 8 is as follows:
compute the corresponding vector p from the tensor P′ as shown in equation (13),
where vec(·) denotes the matrix vectorization operation, ×_3 denotes the mode-3 product, and the vector c is an all-ones vector;
predict the emotion distribution vector label from the vector p as shown in equation (14):
label = softmax(Wp + b) ∈ R^d (14)
where W is a weight matrix and b is a bias.
Preferably, in Step 1 the modality data comprises three modalities: a text modality, a visual modality and an acoustic modality.
The beneficial effects of the invention are as follows:
1. The invention establishes direct correlations among the multi-modal elements in a hierarchical manner and can capture both short-term and long-term dependencies among different modalities. To avoid reducing resolution and to preserve the original spatial structure information of each modality, the invention applies the corresponding attention weights in broadcast form when selecting and emphasizing multi-modal information interactions.
2. The invention also proposes a new tensor operator, called Ex-KR addition, which fuses multi-modal element information through shared information to obtain a global tensor characterization. This effectively addresses the problem that most methods in the current multi-modal emotion recognition field only model a local time-series multi-modal space and cannot explicitly learn a complete representation of all modalities participating in the fusion.
Drawings
FIG. 1 is a schematic diagram of a lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion;
FIG. 2 is a schematic diagram of two different types of matching mechanisms according to the present invention;
FIG. 3 is a schematic diagram of an attention generation module;
Detailed Description
The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of the invention is described in detail below with reference to the accompanying drawings.
Example 1
As shown in FIG. 1, the lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion specifically comprises the following steps:
step 1, data matrix representation of single modality
The data of a plurality of modes according to the present invention are represented by X, Y, and Z. Three types of modality data are used when the framework is actually used: the text mode, the visual mode and the acoustic mode sense the emotion state of the tested person. We refer to the text modality as uppercase letter Y, the visual modality as uppercase letter X, and the acoustic modality as uppercase letter Z. The data of each modality may be organized in the form of a two-dimensional matrix, i.eWherein t is x 、t y 、t z Respectively represent three modal elementsNumber of elements d x 、d y 、d z Respectively representing the characteristic length of the corresponding element. Taking text modality as an example, the test excites to say: "this movie is attractive". "(the test is expressing a positive excited emotion) we can get t y The actual value of (2) is 8.
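For illustration only, the following Python/NumPy sketch shows how the three modality matrices could be organized; the element counts and feature dimensions below are placeholder assumptions, not values taken from the invention.

```python
import numpy as np

# Illustrative shapes only (assumed): rows are modality elements, columns are features.
t_x, d_x = 20, 35   # visual modality: 20 frames, 35-dim features per frame
t_y, d_y = 8, 300   # text modality: 8 words, 300-dim word embeddings
t_z, d_z = 8, 74    # acoustic modality: 8 prosodic segments, 74-dim features

X = np.random.randn(t_x, d_x)  # visual modality matrix
Y = np.random.randn(t_y, d_y)  # text modality matrix
Z = np.random.randn(t_z, d_z)  # acoustic modality matrix
```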
Step 2: interaction modeling among single-modality elements
Before multi-modal fusion can be performed, the two-dimensional matrix representation of each modality's raw data needs to be transformed to establish interactions between the elements within that modality, i.e. the element feature representation of each modality needs to contain context information from its neighboring elements. Here, the context information of adjacent elements within a single modality is modeled using a bidirectional long short-term memory network layer (LSTM) that concatenates the forward and backward hidden-state representations. To further enrich the single-modality feature representation, we transform the encoded features through a nonlinear feed-forward fully connected mapping for the subsequent multi-modal fusion, obtaining the matrices H_x, H_y, H_z containing the context information of the three modalities as follows:
H_x = bi_FCN(E_x),  H_y = bi_FCN(E_y),  H_z = bi_FCN(E_z) (1)
E_x = LSTM(X),  E_y = LSTM(Y),  E_z = LSTM(Z) (2)
where LSTM(·) denotes the bidirectional long short-term memory network layer transformation and bi_FCN(·) denotes the nonlinear feed-forward fully connected layer transformation.
To illustrate the specific computation of LSTM(·) more clearly, take the matrix M = [m_1, m_2, ..., m_T] ∈ R^(T×D) (where m_t ∈ R^D is a vector, t = 1, 2, ..., T) as an example:
i_t = σ(W_ii m_t + b_ii + W_hi h_(t-1) + b_hi)
f_t = σ(W_if m_t + b_if + W_hf h_(t-1) + b_hf)
g_t = tanh(W_ig m_t + b_ig + W_hg h_(t-1) + b_hg)
o_t = σ(W_io m_t + b_io + W_ho h_(t-1) + b_ho)
c_t = f_t * c_(t-1) + i_t * g_t
h_t = o_t * tanh(c_t)
where h_t is the output at time t, c_t is the cell state at time t, m_t is the input at time t, and i_t, f_t, g_t, o_t are the input gate, forget gate, cell gate and output gate, respectively. σ is the sigmoid function and * denotes the Hadamard product; W_ii, W_hi, W_if, W_hf, W_ig, W_hg, W_io, W_ho are weight matrices, and b_ii, b_hi, b_if, b_hf, b_ig, b_hg, b_io, b_ho are bias vectors. Thus we have [h_1, h_2, ..., h_T] = LSTM(M).
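As a concrete illustration, here is a minimal NumPy sketch of the above recurrence for a single (forward) direction; the method itself uses a bidirectional layer, the random weights and hidden size are placeholder assumptions, and the two bias terms per gate are merged into one.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_forward(M, D_h, seed=0):
    """Single-direction pass of the LSTM recurrence above.
    M: (T, D) input matrix; D_h: hidden size. Weights are random placeholders."""
    rng = np.random.default_rng(seed)
    T, D = M.shape
    # Input-to-hidden (W) and hidden-to-hidden (U) weights plus one bias per gate.
    W = {g: rng.standard_normal((D_h, D)) * 0.1 for g in "ifgo"}
    U = {g: rng.standard_normal((D_h, D_h)) * 0.1 for g in "ifgo"}
    b = {g: np.zeros(D_h) for g in "ifgo"}
    h, c = np.zeros(D_h), np.zeros(D_h)
    outputs = []
    for t in range(T):
        m_t = M[t]
        i = sigmoid(W["i"] @ m_t + U["i"] @ h + b["i"])   # input gate
        f = sigmoid(W["f"] @ m_t + U["f"] @ h + b["f"])   # forget gate
        g = np.tanh(W["g"] @ m_t + U["g"] @ h + b["g"])   # cell gate
        o = sigmoid(W["o"] @ m_t + U["o"] @ h + b["o"])   # output gate
        c = f * c + i * g                                  # cell state update
        h = o * np.tanh(c)                                 # hidden state output
        outputs.append(h)
    return np.stack(outputs)                               # (T, D_h) = [h_1, ..., h_T]

# Example: encode an assumed 8-word text sequence with 300-dim embeddings.
E_y = lstm_forward(np.random.randn(8, 300), D_h=64)
```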
To illustrate the calculation of bi_FCN(·), we likewise take the matrix M ∈ R^(T×D) as an example:
H = bi_FCN(M) = tanh(M W_f1 + b_f1) W_f2 + b_f2 ∈ R^(T×D')
where W_f1 and W_f2 are weight matrices, and b_f1 and b_f2 are bias vectors.
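A direct NumPy transcription of this formula might look as follows; the parameter shapes (W_f1 of size D×D1, W_f2 of size D1×D') are assumed for illustration.

```python
import numpy as np

def bi_fcn(M, W_f1, b_f1, W_f2, b_f2):
    """H = tanh(M @ W_f1 + b_f1) @ W_f2 + b_f2, the two-layer feed-forward map above.
    M: (T, D); W_f1: (D, D1); b_f1: (D1,); W_f2: (D1, D'); b_f2: (D',)."""
    return np.tanh(M @ W_f1 + b_f1) @ W_f2 + b_f2   # (T, D')
```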
Step 3: element matching fusion between two modalities
As shown in FIG. 2(a), the two single-modality feature representations containing intra-modality element interactions, H_x and H_y, are combined to model the relationship between the two modalities. Applying a bilinear transformation to the vector feature representations from the two modalities achieves such element-matching fusion. The weight matrices W_1 and W_2 in the bilinear transformations represent the context information between the two modalities and ensure a more flexible coupling of two different modalities. By matching the vector feature representations of the bimodal elements, we model the interaction of elements between the two modalities and obtain the corresponding matching matrices M_xy and M_zy,
where the weight matrix W_1 is obtained by adapting to the matrices H_x and H_y, and the weight matrix W_2 is obtained by adapting to the matrices H_y and H_z.
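The exact bilinear formula is not reproduced above; assuming the common bilinear form M_xy = H_x W_1 H_y^T (an assumption, not a statement of the patented formula), a sketch could be:

```python
import numpy as np

def bilinear_match(H_x, H_y, W_1):
    """Assumed bilinear matching: M_xy[i, j] couples element i of one modality
    with element j of the central modality. H_x: (t_x, d); H_y: (t_y, d); W_1: (d, d)."""
    return H_x @ W_1 @ H_y.T          # matching matrix, (t_x, t_y)
```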
Step 4: generating the attention weight vectors of the matching matrices M_xy and M_zy
Based on the matching matrices M_xy and M_zy that capture the interactions between two modalities, we compute the relative importance of each of their elements for a particular modality by means of an attention mechanism, in order to optimize the joint feature representation between the two modalities. The computed importance of each element is expressed as a vector whose length equals the number of elements of the specific modality, as shown in FIG. 3. The implicit matrix R_x of the matching matrix M_xy and the implicit matrix R_z of the matching matrix M_zy are obtained as follows:
R_x = tanh(W_x M_xy^T + b_x),  R_z = tanh(W_z M_zy^T + b_z) (5)
α_x = softmax(w_x R_x),  α_z = softmax(w_z R_z) (6)
where the vectors α_x and α_z are the attention weight vectors associated with the modalities X and Z, respectively, and their lengths equal the numbers of elements of modalities X and Z. tanh(·) denotes the hyperbolic tangent function; softmax(·) denotes the logistic regression function. W_x and W_z are the parameter matrices used to compute the implicit matrices R_x and R_z, respectively; w_x and w_z are the parameter matrices used to compute the weight vectors α_x and α_z, respectively; b_x and b_z are the biases used to compute the implicit matrices R_x and R_z.
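A minimal sketch of this attention generation for one matching matrix, with the parameter shapes assumed (W_x of size k×t_y, b_x of size k×1, w_x of length k), might read as follows; the same function applies to M_zy with W_z, b_z, w_z.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention_weights(M_xy, W_x, b_x, w_x):
    """Equations (5)-(6): R_x = tanh(W_x M_xy^T + b_x), alpha_x = softmax(w_x R_x).
    M_xy: (t_x, t_y); W_x: (k, t_y); b_x: (k, 1); w_x: (k,)."""
    R_x = np.tanh(W_x @ M_xy.T + b_x)      # implicit matrix, (k, t_x)
    alpha_x = softmax(w_x @ R_x)           # attention weight per element of modality X, (t_x,)
    return R_x, alpha_x
```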
Step 5: broadcast operation of the attention weights on the matching matrices
By assigning attention weights to the corresponding values of the bimodal joint matching matrix, the model can be made to focus on the important inter-modality interactions. However, the conventional assignment approach averages the attention weights over different positions, which degrades the resolution of the feature representation. To avoid this, the proposed fusion framework applies the corresponding attention weights in broadcast form, using the weight vectors α_x and α_z to update the matching matrices M_xy and M_zy and obtain the new matching matrices M'_xy and M'_zy:
M'_xy = attention_broadcast(M_xy, α_x), where M'_xy[i,:] = M_xy[i,:] α_x[i]
M'_zy = attention_broadcast(M_zy, α_z), where M'_zy[i,:] = M_zy[i,:] α_z[i]
Here M'_xy[i,:] denotes the i-th row of M'_xy; M_xy[i,:] denotes the i-th row of M_xy; α_x[i] denotes the i-th element of the vector α_x; M_xy[i,:] α_x[i] denotes the multiplication of a vector by a scalar. Likewise, M'_zy[i,:] denotes the i-th row of M'_zy, M_zy[i,:] denotes the i-th row of M_zy, and α_z[i] denotes the i-th element of the vector α_z.
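The broadcast itself is a simple row-wise scaling; a short NumPy sketch with placeholder values:

```python
import numpy as np

def attention_broadcast(M, alpha):
    """Row-wise broadcast: M_new[i, :] = M[i, :] * alpha[i], keeping the matrix resolution intact."""
    return M * alpha[:, None]

# Example: scale an assumed 3 x 4 matching matrix with a length-3 attention vector.
M_prime = attention_broadcast(np.arange(12.0).reshape(3, 4), np.array([0.1, 0.7, 0.2]))
```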
Step 6: exploiting the multi-modal common information through Ex-KR addition
Since the normalization in equation (6) is performed with the softmax function, some weight values in the attention weight vector become very close to 0. When the attention vector is broadcast over the matching matrix, some values of the resulting matrix are therefore close to 0, which introduces a certain sparsity into the matching matrix. On the other hand, the two matching matrices M'_xy and M'_zy both contain common information about modality Y, i.e. each column of the two matching matrices corresponds to an element from modality Y, which is advantageous for the deep fusion of multi-modal elements. To better exploit these matching properties, we propose a new tensor operator, Ex-KR addition. Using the common information in the two matching matrices, Ex-KR addition realizes multi-modal element matching centered on modality Y, so that a multi-dimensional tensor P characterizing the multi-modal data fusion can be computed, as shown in FIG. 2(b).
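Under the definition P[:,k,j] = M'_xy[:,j] + M'_zy[k,j] given in the preferred embodiment of Step 6, Ex-KR addition can be sketched with plain NumPy broadcasting:

```python
import numpy as np

def ex_kr_addition(M_xy, M_zy):
    """Ex-KR addition over the shared modality Y (the columns):
    P[i, k, j] = M_xy[i, j] + M_zy[k, j], giving a tensor of shape (t_x, t_z, t_y)."""
    return M_xy[:, None, :] + M_zy[None, :, :]
```

Note that the operator introduces no learnable parameters, which is what keeps the fusion lightweight.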
Step 7: broadcast operation of the attention weights on the multi-modal tensor
Using the same attention generation module as in Step 4, the implicit matrix R_y of the multi-modal multi-dimensional tensor is obtained as shown in equation (10), and the attention weight vector α_y as shown in equation (11). Applying this vector in broadcast form to the multi-dimensional tensor P further increases the expressive power of the tensor; the updated tensor P′ is given by equation (12).
R_y = tanh(W_y P^T + b_y) (10)
α_y = softmax(w_y R_y) (11)
P′ = attention_broadcast(P, α_y), where P'[:,:,i] = P[:,:,i] α_y[i] (12)
Here the matrix P in equation (10) is the unfolding matrix of the multi-dimensional tensor P; W_y is the parameter matrix used to compute the implicit matrix R_y; w_y is the parameter matrix used to compute the weight vector α_y; b_y is the bias used to compute the implicit matrix R_y.
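A sketch of this step, assuming the unfolding places each slice P[:,:,i] (one element of the shared modality Y) in one row of the unfolding matrix; the parameter shapes (W_y of size k×(t_x·t_z), b_y of size k×1, w_y of length k) are likewise assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def tensor_attention_broadcast(P, W_y, b_y, w_y):
    """Equations (10)-(12) on the fused tensor P of shape (t_x, t_z, t_y)."""
    t_x, t_z, t_y = P.shape
    P_unf = P.reshape(t_x * t_z, t_y).T            # assumed unfolding: (t_y, t_x * t_z)
    R_y = np.tanh(W_y @ P_unf.T + b_y)             # implicit matrix, (k, t_y), equation (10)
    alpha_y = softmax(w_y @ R_y)                   # attention over the t_y shared elements, equation (11)
    return P * alpha_y[None, None, :]              # P'[:, :, i] = P[:, :, i] * alpha_y[i], equation (12)
```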
Step 8: classification
Applying sum pooling to the multi-dimensional tensor characterization P′ obtained through the broadcast attention calculation finally summarizes the information of the three modalities and yields the corresponding vector p, as shown in equation (13). The vector p serves as a high-level multi-modal representation for the final emotion classification task,
where vec(·) denotes the matrix vectorization operation, ×_3 denotes the mode-3 product, and the vector c is an all-ones vector.
Then, the emotion distribution vector label is predicted from the vector p as shown in equation (14):
label = softmax(Wp + b) ∈ R^d (14)
where W is a learnable weight matrix and b is a bias; the i-th element label[i] of the emotion distribution vector label is the predicted probability of the i-th emotion.
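Since equation (13) itself is not reproduced above, the following sketch assumes that the mode-3 product with the all-ones vector c sums the tensor over its third (shared-modality) mode before vectorization, and that W and b are sized to match the resulting vector; it is an illustrative reading, not the patented formula.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def classify(P_prime, W, b):
    """Assumed pooling and equation (14).
    P_prime: (t_x, t_z, t_y); W: (d, t_x * t_z); b: (d,)."""
    p = (P_prime @ np.ones(P_prime.shape[-1])).reshape(-1)   # vec(P' x_3 c), assumed sum pooling
    return softmax(W @ p + b)                                # emotion distribution vector label, (d,)
```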
As shown in Table 1, the invention and four multi-modal fusion methods were evaluated on the emotion-state discrimination task over two multi-modal emotion databases, CMU-MOSI and YouTube, where MAE is the mean absolute error, CORR is the Pearson correlation coefficient, ACC-7 is the 7-class classification accuracy, and ACC-3 is the 3-class classification accuracy. Across these metrics, the results of the invention are superior to those of the four compared multi-modal fusion methods.
TABLE 1 results comparison Table
Example 2
This embodiment differs from Embodiment 1 as follows. When the lightweight multi-modal emotion analysis method based on multi-element layering depth fusion is applied to four types of modality data (for example text, visual, acoustic and electroencephalogram data), the matching matrices between pairs of modalities are computed first, and the matching matrices representing the local fusion of two modalities are then integrated with the proposed Ex-KR addition to obtain the multi-modal global tensor representation. To describe the computation flow, the four modalities involved in the fusion are denoted m_1, m_2, m_3, m_4, and the corresponding modality data are organized as two-dimensional matrices. As in equations (1) and (2), the bidirectional long short-term memory network layer transformation LSTM(·) and the nonlinear feed-forward fully connected layer transformation bi_FCN(·) are used to obtain the context-information matrix representation within each single modality. With modality m_1 as the shared modality, the same procedure as in Steps 3, 4 and 5 is used to realize element matching between modalities and to apply the generated attention weight vectors to the matching matrices in broadcast form, yielding three matching matrices of m_2, m_3 and m_4 against m_1. Since these three matching matrices all contain the common information of modality m_1, Ex-KR addition can use this common information to realize the multi-modal global tensor representation; a sketch is given after this description. The specific computation flow is as follows:
First, the matching matrices of m_2 and m_3 against m_1 are fused on the basis of Ex-KR addition to obtain a tensor P.
The tensor P, which fuses the three modalities (m_1, m_2 and m_3), is then unfolded along its third dimension to obtain a matrix. Since this matrix and the matching matrix of m_4 against m_1 both contain the information of modality m_1, they can be further fused by Ex-KR addition.
The tensor Q computed in this way exploits the information about modality m_1 to realize the global fusion of the four modalities at once; since Ex-KR addition contains no weight parameters to be learned, a more lightweight multi-modal fusion modeling is achieved.
Following the computation flow of Step 7, the attention vector of the tensor Q is computed and applied to Q in broadcast form.
Finally, the corresponding emotion distribution vector label is computed using the sum pooling operation and equation (14) described in Step 8, realizing multi-modal emotion analysis based on multi-element layering depth fusion of the four modality data.
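Reusing the Ex-KR addition sketch from Step 6, the chained fusion described here could look as follows; the matrix shapes and the unfolding order are assumptions made for illustration.

```python
import numpy as np

def ex_kr_addition(A, B):
    """Ex-KR addition over the shared last axis: out[i, k, j] = A[i, j] + B[k, j]."""
    return A[:, None, :] + B[None, :, :]

def fuse_four_modalities(M_21, M_31, M_41):
    """Hedged sketch of Example 2: chain Ex-KR additions over the shared modality m_1.
    M_21, M_31, M_41 are matching matrices of m_2, m_3, m_4 against m_1, shapes (t_i, t_1)."""
    P = ex_kr_addition(M_21, M_31)                       # (t_2, t_3, t_1), fuses m_1, m_2, m_3
    t_2, t_3, t_1 = P.shape
    P_unf = P.reshape(t_2 * t_3, t_1)                    # unfolding, columns still indexed by m_1 elements
    Q = ex_kr_addition(P_unf, M_41)                      # (t_2 * t_3, t_4, t_1)
    return Q.reshape(t_2, t_3, M_41.shape[0], t_1)       # global tensor fusing all four modalities
```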

Claims (9)

1. A lightweight multi-modal emotion analysis method based on multi-element layering depth fusion, characterized by comprising the following steps: Step 1: collect data of n modalities of a tested subject and convert each modality's data into matrix form, where n ≥ 3;
Step 2: perform interaction modeling on each modality's data to obtain a matrix containing that modality's context information;
Step 3: take the matrix of one modality as the central matrix; perform element interaction matching between each of the remaining n-1 modality matrices and the central matrix to obtain n-1 matching matrices;
Step 4: generate a weight vector for each of the n-1 matching matrices obtained in Step 3;
Step 5: update the n-1 matching matrices respectively with the n-1 weight vectors obtained in Step 4;
Step 6: perform multi-modal element matching with the n-1 updated matching matrices to obtain an n-dimensional tensor P;
Step 7: compute the implicit matrix and weight vector of the central matrix, and update the tensor P with the central matrix's weight vector to obtain the tensor P′;
Step 8: predict the emotion distribution vector label from the tensor P′; the i-th element label[i] of the emotion distribution vector label is the probability that the tested subject is in the i-th emotion; thereby the emotion category of the tested subject at the time the multi-modal data was acquired is determined.
2. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 2, the matrix containing the context information is H = bi_FCN(E), where the intermediate matrix E = LSTM(X); LSTM(·) denotes a bidirectional long short-term memory network layer transformation; bi_FCN(·) denotes a nonlinear feed-forward fully connected layer transformation; the matrix X is the single-modality matrix obtained in Step 1.
3. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 3, the matching matrix M_xy of a modality matrix H_x and the central matrix H_y is obtained by a bilinear transformation with a weight matrix W_1, where W_1 is obtained by adapting to the context information of the matrices H_x and H_y.
4. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: the implicit matrix R_x and the weight vector α_x of the matching matrix M_xy in Step 4 are expressed as follows:
R_x = tanh(W_x M_xy^T + b_x)
α_x = softmax(w_x R_x)
where tanh(·) denotes the hyperbolic tangent function; softmax(·) denotes the logistic regression function; W_x is the parameter matrix used to compute the implicit matrix R_x; w_x is the parameter matrix used to compute the weight vector α_x; b_x is the bias used to compute the implicit matrix R_x.
5. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 5, the matching matrix M_xy is updated to obtain the matching matrix M'_xy; the i-th row of M'_xy is M'_xy[i,:] = M_xy[i,:] α_x[i], where M_xy[i,:] denotes the i-th row of M_xy and α_x[i] denotes the i-th element of the vector α_x.
6. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 6, when the matching matrices M'_xy and M'_zy undergo multi-modal element matching, the sub-vectors of the resulting tensor P are P[:,k,j] = M'_xy[:,j] + M'_zy[k,j], where M'_xy[:,j] is a column of M'_xy and M'_zy[k,j] is an element of M'_zy.
7. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 7, the implicit matrix R_y of the central matrix is given by equation (10) and the weight vector α_y of the central matrix is given by equation (11):
R_y = tanh(W_y P^T + b_y) (10)
α_y = softmax(w_y R_y) (11)
where the matrix P is the unfolding matrix of the multi-dimensional tensor P; W_y is the parameter matrix used to compute the implicit matrix R_y; w_y is the parameter matrix used to compute the weight vector α_y; b_y is the bias used to compute the implicit matrix R_y;
the sub-matrices of the tensor P′ are P'[:,:,i] = P[:,:,i] α_y[i], where P[:,:,i] is a sub-matrix of the tensor P and α_y[i] is an element of the weight vector α_y.
8. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: the specific process of Step 8 is as follows:
compute the corresponding vector p from the tensor P′ as shown in equation (13),
where vec(·) denotes the matrix vectorization operation, ×_3 denotes the mode-3 product, and the vector c is an all-ones vector;
predict the emotion distribution vector label from the vector p as shown in equation (14):
label = softmax(Wp + b) ∈ R^d (14)
where W is a weight matrix and b is a bias.
9. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 1, the modality data comprises three modalities: a text modality, a visual modality and an acoustic modality.
CN202011452285.4A 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion Active CN112541541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011452285.4A CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011452285.4A CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Publications (2)

Publication Number Publication Date
CN112541541A (en) 2021-03-23
CN112541541B true CN112541541B (en) 2024-03-22

Family

ID=75018379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011452285.4A Active CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Country Status (1)

Country Link
CN (1) CN112541541B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064968B (en) * 2021-04-06 2022-04-19 齐鲁工业大学 Social media emotion analysis method and system based on tensor fusion network
CN115019237B (en) * 2022-06-30 2023-12-08 中国电信股份有限公司 Multi-mode emotion analysis method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325120A (en) * 2018-09-14 2019-02-12 江苏师范大学 A kind of text sentiment classification method separating user and product attention mechanism
CN111012307A (en) * 2019-11-26 2020-04-17 清华大学 Method and device for evaluating training input degree of patient based on multi-mode information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325120A (en) * 2018-09-14 2019-02-12 江苏师范大学 A kind of text sentiment classification method separating user and product attention mechanism
CN111012307A (en) * 2019-11-26 2020-04-17 清华大学 Method and device for evaluating training input degree of patient based on multi-mode information

Also Published As

Publication number Publication date
CN112541541A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
Li et al. A survey of multi-view representation learning
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
CN111368993B (en) Data processing method and related equipment
CN110738984A (en) Artificial intelligence CNN, LSTM neural network speech recognition system
Ning et al. Semantics-consistent representation learning for remote sensing image–voice retrieval
Suman et al. A multi-modal personality prediction system
CN112508077A (en) Social media emotion analysis method and system based on multi-modal feature fusion
CN112328900A (en) Deep learning recommendation method integrating scoring matrix and comment text
Pandey et al. Attention gated tensor neural network architectures for speech emotion recognition
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN112541541B (en) Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion
Yang et al. Meta captioning: A meta learning based remote sensing image captioning framework
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN114443899A (en) Video classification method, device, equipment and medium
Han et al. Cross-modality co-attention networks for visual question answering
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN110781970A (en) Method, device and equipment for generating classifier and storage medium
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
Lin et al. PS-mixer: A polar-vector and strength-vector mixer model for multimodal sentiment analysis
CN116432019A (en) Data processing method and related equipment
Goutsu et al. Classification of multi-class daily human motion using discriminative body parts and sentence descriptions
CN116976505A (en) Click rate prediction method of decoupling attention network based on information sharing
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN116127175A (en) Mobile application classification and recommendation method based on multi-modal feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant