CN112541541B - Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion - Google Patents

Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Info

Publication number
CN112541541B
CN112541541B CN202011452285.4A
Authority
CN
China
Prior art keywords
matrix
modal
matching
vector
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011452285.4A
Other languages
Chinese (zh)
Other versions
CN112541541A (en)
Inventor
李康
孔万增
金宣妤
唐佳佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202011452285.4A priority Critical patent/CN112541541B/en
Publication of CN112541541A publication Critical patent/CN112541541A/en
Application granted granted Critical
Publication of CN112541541B publication Critical patent/CN112541541B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Optimization (AREA)
  • Molecular Biology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Biophysics (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a lightweight multi-modal emotion analysis method based on multi-element layering depth fusion. The method establishes direct correlations among multi-modal elements in a hierarchical manner and can capture both short-term and long-term dependencies among different modalities. To avoid reducing resolution and to preserve the original spatial structure information of each modality, the invention applies the corresponding attention weights in broadcast form when selecting and emphasizing multi-modal information interactions. In addition, the invention proposes a new tensor operator, called Ex-KR addition, which fuses multi-modal element information through shared information to obtain a global tensor characterization. This effectively addresses the problem that most methods in the current multi-modal emotion recognition field only model a local time-series multi-modal space and cannot explicitly learn a complete representation of all modalities participating in the fusion.

Description

Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion
Technical Field
The invention belongs to the field of multi-modal emotion analysis at the intersection of natural language processing, computer vision and speech signal processing, and particularly relates to a cross-modal, multi-element hierarchical deep fusion technique.
Background
With the recent progress of machine learning research, the analysis of multi-modal time-series data has become an increasingly important research field. Multi-modal learning aims at constructing neural networks that can process and integrate information from multiple modalities, and multi-modal emotion analysis is one research sub-field of multi-modal learning. When a person expresses his or her emotion (negative or positive) in real life, several different categories of information are involved in the communication, including speech content (text modality), facial expressions and gestures (visual modality), and prosodic features of the voice (acoustic modality). Video is an important source of multi-modal data, providing visual, acoustic and text data simultaneously. For unified presentation and ease of understanding, we refer to each word in the speech as a text element, each frame in the video as a visual element, and the prosodic features corresponding to each word as an acoustic element. Through research on the three-modality data, we found that two types of element interactions exist simultaneously: intra-modality element interactions and inter-modality element interactions. Moreover, as time progresses, the element interactions between modalities provide increasingly complex complementary information across the time domain and the multi-modal data. If this complementary information between the multi-modal data is taken into account during the analysis, a person's emotional state can be analyzed more reliably and more accurately.
However, the heterogeneous properties of multi-modal data, such as the proper alignment between elements and the short-term and long-term dependencies between different modalities, are both important complementary information and major challenges that researchers must face when analyzing cross-modal information. Existing multi-modal fusion methods typically focus on modeling a local time-series multi-modal space and cannot explicitly learn a complete multi-dimensional representation of all the modalities involved in the fusion. For example, some methods compress the time-series data of the three modalities into three vectors for subsequent multi-modal fusion, so that the subsequent steps cannot perceive critical cross-modality timing information. Other approaches attempt to translate from a source modality to a target modality in order to learn a joint representation of the two. However, such "translation" methods are mainly implemented between two modalities and must include two-way translations, so that the joint representation has strong local features but lacks important global multi-modal properties.
Disclosure of Invention
To address the shortcomings of the prior art, the invention introduces a lightweight multi-modal fusion method based on a multi-element hierarchical deep fusion technique. The method propagates and integrates local associations into a common global space through two different types of matching mechanisms. When the multi-modal elements are matched and fused, a new tensor operator containing no additional parameters is provided; the operator combines the local matrix representations using the shared information of a specific modality to obtain a global tensor, thereby achieving more lightweight multi-modal data modeling.
The specific steps of the invention are as follows:
Step 1: collect data of n modalities of the tested subject, and convert each modality's data into matrix form; n ≥ 3.
Step 2: perform interaction modeling on each modality's data to obtain a matrix containing that modality's context information.
Step 3: take the matrix of one modality as the central matrix; perform element interaction matching between each of the remaining n-1 modality matrices and the central matrix to obtain n-1 matching matrices.
Step 4: generate a weight vector for each of the n-1 matching matrices obtained in Step 3.
Step 5: update the n-1 matching matrices respectively with the n-1 weight vectors obtained in Step 4.
Step 6: perform multi-modal element matching with the n-1 updated matching matrices to obtain an n-dimensional tensor P.
Step 7: compute the implicit matrix and weight vector of the central matrix, and update the tensor P with the central matrix's weight vector to obtain the tensor P′.
Step 8: predict the emotion distribution vector label from the tensor P′. The i-th element label[i] of the emotion distribution vector label is the probability that the tested subject is in the i-th emotion; this determines the emotion category of the tested subject at the time the multi-modal data was acquired.
Preferably, in Step 2, the matrix containing the context information is H = bi_FCN(E), where the intermediate matrix E = LSTM(X); LSTM(·) denotes a bidirectional long short-term memory network layer transformation and bi_FCN(·) denotes a nonlinear feed-forward fully connected layer transformation. The matrix X is the single-modality matrix obtained in Step 1.
Preferably, in Step 3 the matching matrix M_xy between a modality matrix H_x and the central matrix H_y is obtained by a bilinear transformation with a weight matrix W_1, where W_1 is obtained by adapting to the context information of the matrices H_x and H_y.
Preferably, the implicit matrix R_x and the weight vector α_x of the matching matrix M_xy in Step 4 are expressed as follows:
R_x = tanh(W_x M_xy^T + b_x)
α_x = softmax(w_x R_x)
where tanh(·) denotes the hyperbolic tangent function and softmax(·) denotes the logistic regression function; W_x is the parameter matrix used to compute the implicit matrix R_x; w_x is the parameter matrix used to compute the weight vector α_x; b_x is the bias used to compute the implicit matrix R_x.
Preferably, in Step 5 the matching matrix M_xy is updated to obtain the matching matrix M'_xy. The i-th row of M'_xy is M'_xy[i,:] = M_xy[i,:] α_x[i], where M_xy[i,:] denotes the i-th row of M_xy and α_x[i] denotes the i-th element of the vector α_x.
Preferably, in Step 6, when the matching matrices M'_xy and M'_zy undergo multi-modal element matching, the sub-vectors of the resulting tensor P are P[:,k,j] = M'_xy[:,j] + M'_zy[k,j], where M'_xy[:,j] is a column of M'_xy and M'_zy[k,j] is an element of M'_zy.
Preferably, in Step 7 the implicit matrix R_y of the central matrix is given by equation (10), and the weight vector α_y of the central matrix is given by equation (11):
R_y = tanh(W_y P^T + b_y) (10)
α_y = softmax(w_y R_y) (11)
where the matrix P in equation (10) is the unfolding matrix of the multi-dimensional tensor P; W_y is the parameter matrix used to compute the implicit matrix R_y; w_y is the parameter matrix used to compute the weight vector α_y; b_y is the bias used to compute the implicit matrix R_y.
The sub-matrices of the tensor P′ are P'[:,:,i] = P[:,:,i] α_y[i], where P[:,:,i] is a sub-matrix of the tensor P and α_y[i] is an element of the weight vector α_y.
Preferably, the specific process of Step 8 is as follows:
compute the corresponding vector p from the tensor P′ as shown in equation (13),
where vec(·) denotes the matrix vectorization operation, ×_3 denotes the mode-3 product, and the vector c is an all-ones vector;
predict the emotion distribution vector label from the vector p as shown in equation (14):
label = softmax(Wp + b) ∈ R^d (14)
where W is a weight matrix and b is a bias.
Preferably, in Step 1 the modality data comprises three modalities: a text modality, a visual modality and an acoustic modality.
The beneficial effects of the invention are as follows:
1. The invention establishes direct correlations among the multi-modal elements in a hierarchical manner and can capture both short-term and long-term dependencies among different modalities. To avoid reducing resolution and to preserve the original spatial structure information of each modality, the invention applies the corresponding attention weights in broadcast form when selecting and emphasizing multi-modal information interactions.
2. The invention also proposes a new tensor operator, called Ex-KR addition, which fuses multi-modal element information through shared information to obtain a global tensor characterization. This effectively addresses the problem that most methods in the current multi-modal emotion recognition field only model a local time-series multi-modal space and cannot explicitly learn a complete representation of all modalities participating in the fusion.
Drawings
FIG. 1 is a schematic diagram of a lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion;
FIG. 2 is a schematic diagram of two different types of matching mechanisms according to the present invention;
FIG. 3 is a schematic diagram of an attention generation module;
Detailed Description
The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of the invention is described in detail below with reference to the accompanying drawings.
Example 1
As shown in FIG. 1, the lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion specifically comprises the following steps:
step 1, data matrix representation of single modality
The data of a plurality of modes according to the present invention are represented by X, Y, and Z. Three types of modality data are used when the framework is actually used: the text mode, the visual mode and the acoustic mode sense the emotion state of the tested person. We refer to the text modality as uppercase letter Y, the visual modality as uppercase letter X, and the acoustic modality as uppercase letter Z. The data of each modality may be organized in the form of a two-dimensional matrix, i.eWherein t is x 、t y 、t z Respectively represent three modal elementsNumber of elements d x 、d y 、d z Respectively representing the characteristic length of the corresponding element. Taking text modality as an example, the test excites to say: "this movie is attractive". "(the test is expressing a positive excited emotion) we can get t y The actual value of (2) is 8.
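For illustration only, the following Python/NumPy sketch shows how the three modality matrices could be organized; the element counts and feature dimensions below are placeholder assumptions, not values taken from the invention.

```python
import numpy as np

# Illustrative shapes only (assumed): rows are modality elements, columns are features.
t_x, d_x = 20, 35   # visual modality: 20 frames, 35-dim features per frame
t_y, d_y = 8, 300   # text modality: 8 words, 300-dim word embeddings
t_z, d_z = 8, 74    # acoustic modality: 8 prosodic segments, 74-dim features

X = np.random.randn(t_x, d_x)  # visual modality matrix
Y = np.random.randn(t_y, d_y)  # text modality matrix
Z = np.random.randn(t_z, d_z)  # acoustic modality matrix
```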
Step 2: interaction modeling among single-modality elements
Before multi-modal fusion can be performed, the two-dimensional matrix representation of each modality's raw data needs to be transformed to establish interactions between the elements within that modality, i.e. the element feature representation of each modality needs to contain context information from its neighboring elements. Here, the context information of adjacent elements within a single modality is modeled using a bidirectional long short-term memory network layer (LSTM) that concatenates the forward and backward hidden-state representations. To further enrich the single-modality feature representation, we transform the encoded features through a nonlinear feed-forward fully connected mapping for the subsequent multi-modal fusion, obtaining the matrices H_x, H_y, H_z containing the context information of the three modalities as follows:
H_x = bi_FCN(E_x),  H_y = bi_FCN(E_y),  H_z = bi_FCN(E_z) (1)
E_x = LSTM(X),  E_y = LSTM(Y),  E_z = LSTM(Z) (2)
where LSTM(·) denotes the bidirectional long short-term memory network layer transformation and bi_FCN(·) denotes the nonlinear feed-forward fully connected layer transformation.
To illustrate the specific computation of LSTM(·) more clearly, take the matrix M = [m_1, m_2, ..., m_T] ∈ R^(T×D) (where m_t ∈ R^D is a vector, t = 1, 2, ..., T) as an example:
i_t = σ(W_ii m_t + b_ii + W_hi h_(t-1) + b_hi)
f_t = σ(W_if m_t + b_if + W_hf h_(t-1) + b_hf)
g_t = tanh(W_ig m_t + b_ig + W_hg h_(t-1) + b_hg)
o_t = σ(W_io m_t + b_io + W_ho h_(t-1) + b_ho)
c_t = f_t * c_(t-1) + i_t * g_t
h_t = o_t * tanh(c_t)
where h_t is the output at time t, c_t is the cell state at time t, m_t is the input at time t, and i_t, f_t, g_t, o_t are the input gate, forget gate, cell gate and output gate, respectively. σ is the sigmoid function and * denotes the Hadamard product; W_ii, W_hi, W_if, W_hf, W_ig, W_hg, W_io, W_ho are weight matrices, and b_ii, b_hi, b_if, b_hf, b_ig, b_hg, b_io, b_ho are bias vectors. Thus we have [h_1, h_2, ..., h_T] = LSTM(M).
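As a concrete illustration, here is a minimal NumPy sketch of the above recurrence for a single (forward) direction; the method itself uses a bidirectional layer, the random weights and hidden size are placeholder assumptions, and the two bias terms per gate are merged into one.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_forward(M, D_h, seed=0):
    """Single-direction pass of the LSTM recurrence above.
    M: (T, D) input matrix; D_h: hidden size. Weights are random placeholders."""
    rng = np.random.default_rng(seed)
    T, D = M.shape
    # Input-to-hidden (W) and hidden-to-hidden (U) weights plus one bias per gate.
    W = {g: rng.standard_normal((D_h, D)) * 0.1 for g in "ifgo"}
    U = {g: rng.standard_normal((D_h, D_h)) * 0.1 for g in "ifgo"}
    b = {g: np.zeros(D_h) for g in "ifgo"}
    h, c = np.zeros(D_h), np.zeros(D_h)
    outputs = []
    for t in range(T):
        m_t = M[t]
        i = sigmoid(W["i"] @ m_t + U["i"] @ h + b["i"])   # input gate
        f = sigmoid(W["f"] @ m_t + U["f"] @ h + b["f"])   # forget gate
        g = np.tanh(W["g"] @ m_t + U["g"] @ h + b["g"])   # cell gate
        o = sigmoid(W["o"] @ m_t + U["o"] @ h + b["o"])   # output gate
        c = f * c + i * g                                  # cell state update
        h = o * np.tanh(c)                                 # hidden state output
        outputs.append(h)
    return np.stack(outputs)                               # (T, D_h) = [h_1, ..., h_T]

# Example: encode an assumed 8-word text sequence with 300-dim embeddings.
E_y = lstm_forward(np.random.randn(8, 300), D_h=64)
```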
To illustrate the calculation of bi_FCN(·), we likewise take the matrix M ∈ R^(T×D) as an example:
H = bi_FCN(M) = tanh(M W_f1 + b_f1) W_f2 + b_f2 ∈ R^(T×D')
where W_f1 and W_f2 are weight matrices, and b_f1 and b_f2 are bias vectors.
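A direct NumPy transcription of this formula might look as follows; the parameter shapes (W_f1 of size D×D1, W_f2 of size D1×D') are assumed for illustration.

```python
import numpy as np

def bi_fcn(M, W_f1, b_f1, W_f2, b_f2):
    """H = tanh(M @ W_f1 + b_f1) @ W_f2 + b_f2, the two-layer feed-forward map above.
    M: (T, D); W_f1: (D, D1); b_f1: (D1,); W_f2: (D1, D'); b_f2: (D',)."""
    return np.tanh(M @ W_f1 + b_f1) @ W_f2 + b_f2   # (T, D')
```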
Step 3: element matching fusion between two modalities
As shown in FIG. 2(a), the two single-modality feature representations containing intra-modality element interactions, H_x and H_y, are combined to model the relationship between the two modalities. Applying a bilinear transformation to the vector feature representations from the two modalities achieves such element-matching fusion. The weight matrices W_1 and W_2 in the bilinear transformations represent the context information between the two modalities and ensure a more flexible coupling of two different modalities. By matching the vector feature representations of the bimodal elements, we model the interaction of elements between the two modalities and obtain the corresponding matching matrices M_xy and M_zy,
where the weight matrix W_1 is obtained by adapting to the matrices H_x and H_y, and the weight matrix W_2 is obtained by adapting to the matrices H_y and H_z.
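The exact bilinear formula is not reproduced above; assuming the common bilinear form M_xy = H_x W_1 H_y^T (an assumption, not a statement of the patented formula), a sketch could be:

```python
import numpy as np

def bilinear_match(H_x, H_y, W_1):
    """Assumed bilinear matching: M_xy[i, j] couples element i of one modality
    with element j of the central modality. H_x: (t_x, d); H_y: (t_y, d); W_1: (d, d)."""
    return H_x @ W_1 @ H_y.T          # matching matrix, (t_x, t_y)
```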
Step 4: generating the attention weight vectors of the matching matrices M_xy and M_zy
Based on the matching matrices M_xy and M_zy that capture the interactions between two modalities, we compute the relative importance of each of their elements for a particular modality by means of an attention mechanism, in order to optimize the joint feature representation between the two modalities. The computed importance of each element is expressed as a vector whose length equals the number of elements of the specific modality, as shown in FIG. 3. The implicit matrix R_x of the matching matrix M_xy and the implicit matrix R_z of the matching matrix M_zy are obtained as follows:
R_x = tanh(W_x M_xy^T + b_x),  R_z = tanh(W_z M_zy^T + b_z) (5)
α_x = softmax(w_x R_x),  α_z = softmax(w_z R_z) (6)
where the vectors α_x and α_z are the attention weight vectors associated with the modalities X and Z, respectively, and their lengths equal the numbers of elements of modalities X and Z. tanh(·) denotes the hyperbolic tangent function; softmax(·) denotes the logistic regression function. W_x and W_z are the parameter matrices used to compute the implicit matrices R_x and R_z, respectively; w_x and w_z are the parameter matrices used to compute the weight vectors α_x and α_z, respectively; b_x and b_z are the biases used to compute the implicit matrices R_x and R_z.
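A minimal sketch of this attention generation for one matching matrix, with the parameter shapes assumed (W_x of size k×t_y, b_x of size k×1, w_x of length k), might read as follows; the same function applies to M_zy with W_z, b_z, w_z.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention_weights(M_xy, W_x, b_x, w_x):
    """Equations (5)-(6): R_x = tanh(W_x M_xy^T + b_x), alpha_x = softmax(w_x R_x).
    M_xy: (t_x, t_y); W_x: (k, t_y); b_x: (k, 1); w_x: (k,)."""
    R_x = np.tanh(W_x @ M_xy.T + b_x)      # implicit matrix, (k, t_x)
    alpha_x = softmax(w_x @ R_x)           # attention weight per element of modality X, (t_x,)
    return R_x, alpha_x
```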
Step 5: broadcast operation of the attention weights on the matching matrices
By assigning attention weights to the corresponding values of the bimodal joint matching matrix, the model can be made to focus on the important inter-modality interactions. However, the conventional assignment approach averages the attention weights over different positions, which degrades the resolution of the feature representation. To avoid this, the proposed fusion framework applies the corresponding attention weights in broadcast form, using the weight vectors α_x and α_z to update the matching matrices M_xy and M_zy and obtain the new matching matrices M'_xy and M'_zy:
M'_xy = attention_broadcast(M_xy, α_x), where M'_xy[i,:] = M_xy[i,:] α_x[i]
M'_zy = attention_broadcast(M_zy, α_z), where M'_zy[i,:] = M_zy[i,:] α_z[i]
Here M'_xy[i,:] denotes the i-th row of M'_xy; M_xy[i,:] denotes the i-th row of M_xy; α_x[i] denotes the i-th element of the vector α_x; M_xy[i,:] α_x[i] denotes the multiplication of a vector by a scalar. Likewise, M'_zy[i,:] denotes the i-th row of M'_zy, M_zy[i,:] denotes the i-th row of M_zy, and α_z[i] denotes the i-th element of the vector α_z.
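The broadcast itself is a simple row-wise scaling; a short NumPy sketch with placeholder values:

```python
import numpy as np

def attention_broadcast(M, alpha):
    """Row-wise broadcast: M_new[i, :] = M[i, :] * alpha[i], keeping the matrix resolution intact."""
    return M * alpha[:, None]

# Example: scale an assumed 3 x 4 matching matrix with a length-3 attention vector.
M_prime = attention_broadcast(np.arange(12.0).reshape(3, 4), np.array([0.1, 0.7, 0.2]))
```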
Step 6: exploiting the multi-modal common information through Ex-KR addition
Since the normalization in equation (6) is performed with the softmax function, some weight values in the attention weight vector become very close to 0. When the attention vector is broadcast over the matching matrix, some values of the resulting matrix are therefore close to 0, which introduces a certain sparsity into the matching matrix. On the other hand, the two matching matrices M'_xy and M'_zy both contain common information about modality Y, i.e. each column of the two matching matrices corresponds to an element from modality Y, which is advantageous for the deep fusion of multi-modal elements. To better exploit these matching properties, we propose a new tensor operator, Ex-KR addition. Using the common information in the two matching matrices, Ex-KR addition realizes multi-modal element matching centered on modality Y, so that a multi-dimensional tensor P characterizing the multi-modal data fusion can be computed, as shown in FIG. 2(b).
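Under the definition P[:,k,j] = M'_xy[:,j] + M'_zy[k,j] given in the preferred embodiment of Step 6, Ex-KR addition can be sketched with plain NumPy broadcasting:

```python
import numpy as np

def ex_kr_addition(M_xy, M_zy):
    """Ex-KR addition over the shared modality Y (the columns):
    P[i, k, j] = M_xy[i, j] + M_zy[k, j], giving a tensor of shape (t_x, t_z, t_y)."""
    return M_xy[:, None, :] + M_zy[None, :, :]
```

Note that the operator introduces no learnable parameters, which is what keeps the fusion lightweight.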
Step 7: broadcast operation of the attention weights on the multi-modal tensor
Using the same attention generation module as in Step 4, the implicit matrix R_y of the multi-modal multi-dimensional tensor is obtained as shown in equation (10), and the attention weight vector α_y as shown in equation (11). Applying this vector in broadcast form to the multi-dimensional tensor P further increases the expressive power of the tensor; the updated tensor P′ is given by equation (12).
R_y = tanh(W_y P^T + b_y) (10)
α_y = softmax(w_y R_y) (11)
P′ = attention_broadcast(P, α_y), where P'[:,:,i] = P[:,:,i] α_y[i] (12)
Here the matrix P in equation (10) is the unfolding matrix of the multi-dimensional tensor P; W_y is the parameter matrix used to compute the implicit matrix R_y; w_y is the parameter matrix used to compute the weight vector α_y; b_y is the bias used to compute the implicit matrix R_y.
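A sketch of this step, assuming the unfolding places each slice P[:,:,i] (one element of the shared modality Y) in one row of the unfolding matrix; the parameter shapes (W_y of size k×(t_x·t_z), b_y of size k×1, w_y of length k) are likewise assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def tensor_attention_broadcast(P, W_y, b_y, w_y):
    """Equations (10)-(12) on the fused tensor P of shape (t_x, t_z, t_y)."""
    t_x, t_z, t_y = P.shape
    P_unf = P.reshape(t_x * t_z, t_y).T            # assumed unfolding: (t_y, t_x * t_z)
    R_y = np.tanh(W_y @ P_unf.T + b_y)             # implicit matrix, (k, t_y), equation (10)
    alpha_y = softmax(w_y @ R_y)                   # attention over the t_y shared elements, equation (11)
    return P * alpha_y[None, None, :]              # P'[:, :, i] = P[:, :, i] * alpha_y[i], equation (12)
```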
Step 8: classification
Applying sum pooling to the multi-dimensional tensor characterization P′ obtained through the broadcast attention calculation finally summarizes the information of the three modalities and yields the corresponding vector p, as shown in equation (13). The vector p serves as a high-level multi-modal representation for the final emotion classification task,
where vec(·) denotes the matrix vectorization operation, ×_3 denotes the mode-3 product, and the vector c is an all-ones vector.
Then, the emotion distribution vector label is predicted from the vector p as shown in equation (14):
label = softmax(Wp + b) ∈ R^d (14)
where W is a learnable weight matrix and b is a bias; the i-th element label[i] of the emotion distribution vector label is the predicted probability of the i-th emotion.
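Since equation (13) itself is not reproduced above, the following sketch assumes that the mode-3 product with the all-ones vector c sums the tensor over its third (shared-modality) mode before vectorization, and that W and b are sized to match the resulting vector; it is an illustrative reading, not the patented formula.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def classify(P_prime, W, b):
    """Assumed pooling and equation (14).
    P_prime: (t_x, t_z, t_y); W: (d, t_x * t_z); b: (d,)."""
    p = (P_prime @ np.ones(P_prime.shape[-1])).reshape(-1)   # vec(P' x_3 c), assumed sum pooling
    return softmax(W @ p + b)                                # emotion distribution vector label, (d,)
```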
As shown in Table 1, the invention and four multi-modal fusion methods were evaluated on the emotion-state discrimination task over two multi-modal emotion databases, CMU-MOSI and YouTube, where MAE is the mean absolute error, CORR is the Pearson correlation coefficient, ACC-7 is the 7-class classification accuracy, and ACC-3 is the 3-class classification accuracy. Across these metrics, the results of the invention are superior to those of the four compared multi-modal fusion methods.
TABLE 1 results comparison Table
Example 2
This embodiment differs from Embodiment 1 as follows. When the lightweight multi-modal emotion analysis method based on multi-element layering depth fusion is applied to four types of modality data (for example text, visual, acoustic and electroencephalogram data), the matching matrices between pairs of modalities are computed first, and the matching matrices representing the local fusion of two modalities are then integrated with the proposed Ex-KR addition to obtain the multi-modal global tensor representation. To describe the computation flow, the four modalities involved in the fusion are denoted m_1, m_2, m_3, m_4, and the corresponding modality data are organized as two-dimensional matrices. As in equations (1) and (2), the bidirectional long short-term memory network layer transformation LSTM(·) and the nonlinear feed-forward fully connected layer transformation bi_FCN(·) are used to obtain the context-information matrix representation within each single modality. With modality m_1 as the shared modality, the same procedure as in Steps 3, 4 and 5 is used to realize element matching between modalities and to apply the generated attention weight vectors to the matching matrices in broadcast form, yielding three matching matrices of m_2, m_3 and m_4 against m_1. Since these three matching matrices all contain the common information of modality m_1, Ex-KR addition can use this common information to realize the multi-modal global tensor representation; a sketch is given after this description. The specific computation flow is as follows:
First, the matching matrices of m_2 and m_3 against m_1 are fused on the basis of Ex-KR addition to obtain a tensor P.
The tensor P, which fuses the three modalities (m_1, m_2 and m_3), is then unfolded along its third dimension to obtain a matrix. Since this matrix and the matching matrix of m_4 against m_1 both contain the information of modality m_1, they can be further fused by Ex-KR addition.
The tensor Q computed in this way exploits the information about modality m_1 to realize the global fusion of the four modalities at once; since Ex-KR addition contains no weight parameters to be learned, a more lightweight multi-modal fusion modeling is achieved.
Following the computation flow of Step 7, the attention vector of the tensor Q is computed and applied to Q in broadcast form.
Finally, the corresponding emotion distribution vector label is computed using the sum pooling operation and equation (14) described in Step 8, realizing multi-modal emotion analysis based on multi-element layering depth fusion of the four modality data.
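Reusing the Ex-KR addition sketch from Step 6, the chained fusion described here could look as follows; the matrix shapes and the unfolding order are assumptions made for illustration.

```python
import numpy as np

def ex_kr_addition(A, B):
    """Ex-KR addition over the shared last axis: out[i, k, j] = A[i, j] + B[k, j]."""
    return A[:, None, :] + B[None, :, :]

def fuse_four_modalities(M_21, M_31, M_41):
    """Hedged sketch of Example 2: chain Ex-KR additions over the shared modality m_1.
    M_21, M_31, M_41 are matching matrices of m_2, m_3, m_4 against m_1, shapes (t_i, t_1)."""
    P = ex_kr_addition(M_21, M_31)                       # (t_2, t_3, t_1), fuses m_1, m_2, m_3
    t_2, t_3, t_1 = P.shape
    P_unf = P.reshape(t_2 * t_3, t_1)                    # unfolding, columns still indexed by m_1 elements
    Q = ex_kr_addition(P_unf, M_41)                      # (t_2 * t_3, t_4, t_1)
    return Q.reshape(t_2, t_3, M_41.shape[0], t_1)       # global tensor fusing all four modalities
```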

Claims (9)

1. A lightweight multi-modal emotion analysis method based on multi-element layering depth fusion, characterized by comprising the following steps: Step 1: collect data of n modalities of a tested subject and convert each modality's data into matrix form, where n ≥ 3;
Step 2: perform interaction modeling on each modality's data to obtain a matrix containing that modality's context information;
Step 3: take the matrix of one modality as the central matrix; perform element interaction matching between each of the remaining n-1 modality matrices and the central matrix to obtain n-1 matching matrices;
Step 4: generate a weight vector for each of the n-1 matching matrices obtained in Step 3;
Step 5: update the n-1 matching matrices respectively with the n-1 weight vectors obtained in Step 4;
Step 6: perform multi-modal element matching with the n-1 updated matching matrices to obtain an n-dimensional tensor P;
Step 7: compute the implicit matrix and weight vector of the central matrix, and update the tensor P with the central matrix's weight vector to obtain the tensor P′;
Step 8: predict the emotion distribution vector label from the tensor P′; the i-th element label[i] of the emotion distribution vector label is the probability that the tested subject is in the i-th emotion; thereby the emotion category of the tested subject at the time the multi-modal data was acquired is determined.
2. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 2, the matrix containing the context information is H = bi_FCN(E), where the intermediate matrix E = LSTM(X); LSTM(·) denotes a bidirectional long short-term memory network layer transformation; bi_FCN(·) denotes a nonlinear feed-forward fully connected layer transformation; the matrix X is the single-modality matrix obtained in Step 1.
3. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 3, the matching matrix M_xy of a modality matrix H_x and the central matrix H_y is obtained by a bilinear transformation with a weight matrix W_1, where W_1 is obtained by adapting to the context information of the matrices H_x and H_y.
4. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: the implicit matrix R_x and the weight vector α_x of the matching matrix M_xy in Step 4 are expressed as follows:
R_x = tanh(W_x M_xy^T + b_x)
α_x = softmax(w_x R_x)
where tanh(·) denotes the hyperbolic tangent function; softmax(·) denotes the logistic regression function; W_x is the parameter matrix used to compute the implicit matrix R_x; w_x is the parameter matrix used to compute the weight vector α_x; b_x is the bias used to compute the implicit matrix R_x.
5. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 5, the matching matrix M_xy is updated to obtain the matching matrix M'_xy; the i-th row of M'_xy is M'_xy[i,:] = M_xy[i,:] α_x[i], where M_xy[i,:] denotes the i-th row of M_xy and α_x[i] denotes the i-th element of the vector α_x.
6. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 6, when the matching matrices M'_xy and M'_zy undergo multi-modal element matching, the sub-vectors of the resulting tensor P are P[:,k,j] = M'_xy[:,j] + M'_zy[k,j], where M'_xy[:,j] is a column of M'_xy and M'_zy[k,j] is an element of M'_zy.
7. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 7, the implicit matrix R_y of the central matrix is given by equation (10) and the weight vector α_y of the central matrix is given by equation (11):
R_y = tanh(W_y P^T + b_y) (10)
α_y = softmax(w_y R_y) (11)
where the matrix P is the unfolding matrix of the multi-dimensional tensor P; W_y is the parameter matrix used to compute the implicit matrix R_y; w_y is the parameter matrix used to compute the weight vector α_y; b_y is the bias used to compute the implicit matrix R_y;
the sub-matrices of the tensor P′ are P'[:,:,i] = P[:,:,i] α_y[i], where P[:,:,i] is a sub-matrix of the tensor P and α_y[i] is an element of the weight vector α_y.
8. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: the specific process of Step 8 is as follows:
compute the corresponding vector p from the tensor P′ as shown in equation (13),
where vec(·) denotes the matrix vectorization operation, ×_3 denotes the mode-3 product, and the vector c is an all-ones vector;
predict the emotion distribution vector label from the vector p as shown in equation (14):
label = softmax(Wp + b) ∈ R^d (14)
where W is a weight matrix and b is a bias.
9. The lightweight multi-modal emotion analysis method based on multi-element layering depth fusion of claim 1, characterized in that: in Step 1, the modality data comprises three modalities: a text modality, a visual modality and an acoustic modality.
CN202011452285.4A 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion Active CN112541541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011452285.4A CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011452285.4A CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Publications (2)

Publication Number Publication Date
CN112541541A (en) 2021-03-23
CN112541541B true CN112541541B (en) 2024-03-22

Family

ID=75018379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011452285.4A Active CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Country Status (1)

Country Link
CN (1) CN112541541B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064968B (en) * 2021-04-06 2022-04-19 齐鲁工业大学 Social media emotion analysis method and system based on tensor fusion network
CN115019237B (en) * 2022-06-30 2023-12-08 中国电信股份有限公司 Multi-mode emotion analysis method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325120A (en) * 2018-09-14 2019-02-12 江苏师范大学 A kind of text sentiment classification method separating user and product attention mechanism
CN111012307A (en) * 2019-11-26 2020-04-17 清华大学 Method and device for evaluating training input degree of patient based on multi-mode information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325120A (en) * 2018-09-14 2019-02-12 江苏师范大学 A kind of text sentiment classification method separating user and product attention mechanism
CN111012307A (en) * 2019-11-26 2020-04-17 清华大学 Method and device for evaluating training input degree of patient based on multi-mode information

Also Published As

Publication number Publication date
CN112541541A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
Li et al. A survey of multi-view representation learning
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
CN111368993B (en) Data processing method and related equipment
CN110738984A (en) Artificial intelligence CNN, LSTM neural network speech recognition system
Ning et al. Semantics-consistent representation learning for remote sensing image–voice retrieval
Suman et al. A multi-modal personality prediction system
CN112508077A (en) Social media emotion analysis method and system based on multi-modal feature fusion
CN112328900A (en) Deep learning recommendation method integrating scoring matrix and comment text
Pandey et al. Attention gated tensor neural network architectures for speech emotion recognition
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN112541541B (en) Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion
Yang et al. Meta captioning: A meta learning based remote sensing image captioning framework
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN114443899A (en) Video classification method, device, equipment and medium
Han et al. Cross-modality co-attention networks for visual question answering
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN110781970A (en) Method, device and equipment for generating classifier and storage medium
CN114140885A (en) Emotion analysis model generation method and device, electronic equipment and storage medium
Lin et al. PS-mixer: A polar-vector and strength-vector mixer model for multimodal sentiment analysis
CN116432019A (en) Data processing method and related equipment
Goutsu et al. Classification of multi-class daily human motion using discriminative body parts and sentence descriptions
CN116976505A (en) Click rate prediction method of decoupling attention network based on information sharing
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN116127175A (en) Mobile application classification and recommendation method based on multi-modal feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant