CN112541541A - Lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion - Google Patents

Lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion

Info

Publication number
CN112541541A
CN112541541A (application CN202011452285.4A)
Authority
CN
China
Prior art keywords
matrix
modal
matching
vector
tensor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011452285.4A
Other languages
Chinese (zh)
Other versions
CN112541541B (en)
Inventor
李康
孔万增
金宣妤
唐佳佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202011452285.4A priority Critical patent/CN112541541B/en
Publication of CN112541541A publication Critical patent/CN112541541A/en
Application granted granted Critical
Publication of CN112541541B publication Critical patent/CN112541541B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 18/00 Pattern recognition > G06F 18/20 Analysing > G06F 18/25 Fusion techniques
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions > G06F 17/10 Complex mathematical operations > G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 40/00 Handling natural language data > G06F 40/20 Natural language analysis > G06F 40/253 Grammatical analysis; Style critique
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N 3/00 Computing arrangements based on biological models > G06N 3/02 Neural networks > G06N 3/04 Architecture, e.g. interconnection topology > G06N 3/045 Combinations of networks
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N 3/00 Computing arrangements based on biological models > G06N 3/02 Neural networks > G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Biophysics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Optimization (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion. The invention establishes direct correlations among multi-modal elements in a layered manner and can capture short-term and long-term dependencies between different modalities. To avoid reducing resolution and to preserve the original spatial structure information of each modality, the invention applies the corresponding attention weights in the form of a broadcast when selecting and emphasizing multi-modal information interactions. In addition, the invention provides a new tensor operator, called Ex-KR addition, which fuses multi-modal element information using shared information to obtain a global tensor representation. This effectively addresses the problem that most methods in the current multi-modal emotion recognition field focus only on modeling in a local temporal-multimodal space and cannot explicitly learn a complete representation of all the modalities participating in the fusion.

Description

Lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion
Technical Field
The invention belongs to the field of multi-modal emotion analysis at the intersection of natural language processing, computer vision and speech signal processing, and particularly relates to a cross-modal multi-element hierarchical deep fusion technique.
Background
With recent progress in machine learning research, the analysis of multimodal time-series data has become an increasingly important research field. Multimodal learning aims at building neural networks that can process and integrate information from multiple modalities, and multimodal emotion analysis is a research sub-field of multimodal learning. In real life, when people express their emotions (negative or positive), several categories of information take part in the communication, including the spoken words (text modality), facial expressions and gestures (visual modality), and the prosodic features of the voice (acoustic modality). Video is an important source of multimodal data and can provide visual, acoustic and textual data simultaneously. For uniform presentation and ease of understanding, we refer to each word in the speech as a text element, each frame in the video as a visual element, and the prosodic features corresponding to each word as acoustic elements. A study of these three types of modality data shows that two types of element interactions are present at the same time: intra-modality element interactions and inter-modality element interactions. Moreover, over time, the inter-modality element interactions provide more complex complementary information across the time domain and the multimodal data. If this complementary information among the multimodal data is taken into account during analysis, a person's emotional state can be analyzed more reliably and accurately.
However, the heterogeneous properties of multimodal data, such as the proper alignment between elements and the short-term and long-term dependencies between different modalities, carry important complementary information and are also the main challenges researchers face when analyzing cross-modal information. Existing multimodal fusion methods typically focus on modeling in a local temporal-multimodal space and do not explicitly learn a complete multi-dimensional representation of all the modalities involved in the fusion. For example, some methods compress the time-series data of the three modalities into three vectors before multimodal fusion, so that the subsequent steps cannot perceive critical cross-modal timing information. Other approaches translate from a source modality to a target modality to learn a joint representation of the two modalities. However, such a "translation" method operates mainly between two modalities and must involve translation in both directions, so the joint representation has strong local features but lacks important global multimodal properties.
Disclosure of Invention
Aiming at the defects of the prior art, the invention introduces a lightweight multi-modal fusion method based on a multi-element hierarchical deep fusion technique. The method propagates and integrates local associations into a common global space through two different types of matching mechanisms. For the multi-modal element matching and fusion, a new tensor operator without additional parameters, called Ex-KR addition, is proposed; it combines local matrix representations using the common information of a specific modality to obtain a global tensor, enabling more lightweight modeling of multi-modal data.
The method comprises the following specific steps:
Step 1, collecting n modalities of data from a subject and converting each modality's data into matrix form; n ≥ 3.
Step 2, performing interactive modeling on each modality's data to obtain, for each modality, a matrix containing context information.
Step 3, taking the matrix of one modality as the central matrix, and performing element-interaction matching between each of the remaining n-1 modality matrices and the central matrix to obtain n-1 matching matrices.
Step 4, generating weight vectors for the n-1 matching matrices obtained in step 3.
Step 5, updating the n-1 matching matrices with the n-1 weight vectors obtained in step 4.
Step 6, performing multi-modal element matching with the n-1 matching matrices to obtain an n-dimensional tensor P.
Step 7, computing the implicit matrix and weight vector of the central matrix, and updating the tensor P with the central matrix's weight vector to obtain the tensor P'.
Step 8, predicting the emotion distribution vector label from the tensor P'. The i-th element label[i] of the emotion distribution vector is the probability that the subject is in the i-th emotion; the subject's emotion during the multi-modal data acquisition is determined accordingly.
Preferably, in step 2, the matrix containing context information is H = bi_FCN(E), where the intermediate matrix E = LSTM(X); LSTM(·) denotes a bidirectional long short-term memory network layer transformation; bi_FCN(·) denotes a nonlinear feed-forward fully-connected layer transformation; the matrix X is the matrix of a single modality obtained in step 1.
Preferably, in step 3 the matching matrix between a matrix H_x and the central matrix H_y is M_xy = H_x·W_1·H_y^T, where W_1 is the weight matrix obtained by adapting the matrices H_x and H_y.
Preferably, in step 4 the implicit matrix R_x and the weight vector α_x of the matching matrix M_xy are expressed as follows:
R_x = tanh(W_x·M_xy^T + b_x)
α_x = softmax(w_x·R_x)
where tanh(·) denotes the hyperbolic tangent function and softmax(·) denotes the logistic regression function; W_x is the parameter matrix used to determine the implicit matrix R_x; w_x is the parameter matrix used to determine the weight vector α_x; b_x is the bias used to determine the implicit matrix R_x.
Preferably, in step 5 the matching matrix M_xy is updated to obtain the matching matrix M'_xy, whose i-th row is M'_xy[i,:] = M_xy[i,:]·α_x[i], where M_xy[i,:] is the i-th row of M_xy and α_x[i] is the i-th element of the vector α_x.
Preferably, in step 6 the tensor P obtained by multi-modal element matching of the matching matrices M'_xy and M'_zy has sub-vectors P[:,k,j] = M'_xy[:,j] + M'_zy[k,j], where M'_xy[:,j] is the j-th column of M'_xy and M'_zy[k,j] is an element of M'_zy.
Preferably, in step 7 the implicit matrix R_y of the central matrix is given by equation (10) and the weight vector α_y of the central matrix by equation (11):
R_y = tanh(W_y·P^T + b_y)    (10)
α_y = softmax(w_y·R_y)    (11)
where the matrix P is the unfolding matrix of the multi-dimensional tensor P; W_y is the parameter matrix used to determine the implicit matrix R_y; w_y is the parameter matrix used to determine the weight vector α_y; b_y is the bias used to determine the implicit matrix R_y.
The sub-matrices of the tensor P' are P'[:,:,i] = P[:,:,i]·α_y[i], where P[:,:,i] is a sub-matrix of the tensor P and α_y[i] is an element of the weight vector α_y.
Preferably, the specific process of step 8 is as follows:
the vector p corresponding to the tensor P' is computed as in equation (13):
p = vec(P' ×_3 c)    (13)
where vec(·) denotes the vectorization of a matrix, ×_3 denotes the mode-3 product, and c is an all-ones vector;
the emotion distribution vector label is then predicted from the vector p as in equation (14):
label = softmax(W·p + b) ∈ R^d    (14)
where W is a weight matrix and b is a bias.
Preferably, in step 1 the modality data are of three types, namely a text modality, a visual modality and an acoustic modality.
The invention has the beneficial effects that:
1. the invention establishes direct correlation among multi-modal elements in a layered manner, and can capture short-term and long-term dependencies among different modalities. To avoid reducing resolution and preserve the original spatial structure information corresponding to each modality, the present invention applies corresponding attention weights in the form of broadcast when selecting and emphasizing multi-modal information interactions.
2. The invention also provides a new tensor operator, called Ex-KR addition, which fuses multi-modal element information using shared information to obtain a global tensor representation. This effectively addresses the problem that most methods in the current multi-modal emotion recognition field focus only on modeling in a local temporal-multimodal space and cannot explicitly learn a complete representation of all the modalities participating in the fusion.
Drawings
FIG. 1 is a schematic diagram of a lightweight multi-modal sentiment analysis method based on multi-element hierarchical depth fusion;
FIG. 2 is a schematic diagram of two different types of matching mechanisms in the present invention;
FIG. 3 is a schematic view of an attention generating module;
Detailed Description
The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion is described in detail below with reference to the accompanying drawings.
Example 1
As shown in fig. 1, the lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion specifically includes the following steps:
Step 1, data matrix representation of a single modality
The data of the modalities involved in the invention are denoted X, Y and Z. When the framework is used in practice, three types of modality data are used to sense the subject's emotional state: a text modality, a visual modality and an acoustic modality. We denote the text modality by Y, the visual modality by X and the acoustic modality by Z. The data of each modality is organized in the form of a two-dimensional matrix:
X ∈ R^(t_x×d_x), Y ∈ R^(t_y×d_y), Z ∈ R^(t_z×d_z)
where t_x, t_y and t_z are the numbers of elements of the three modalities and d_x, d_y and d_z are the feature lengths of the corresponding elements. Taking the text modality as an example, if the subject says excitedly "This movie is very appealing." (the subject is expressing a positive emotion), then the value of t_y is 8.
Step 2, interactive modeling between the elements of a single modality
Before multi-modal fusion, the two-dimensional matrix representation of each modality's raw data is transformed so that interactions between the elements within a single modality are established, i.e. the feature representation of each element contains the context of its neighboring elements. Here, the context of neighboring elements within a single modality is modeled with a bidirectional long short-term memory network layer (LSTM), which concatenates the forward and backward hidden-state representations. To further enrich the single-modality representation, the encoded features are transformed by a nonlinear feed-forward fully-connected mapping for the subsequent multi-modal fusion, yielding the matrices H_x, H_y and H_z containing the context information of the three modalities:
H_x = bi_FCN(E_x),  H_y = bi_FCN(E_y),  H_z = bi_FCN(E_z)    (1)
E_x = LSTM(X),  E_y = LSTM(Y),  E_z = LSTM(Z)    (2)
where LSTM(·) denotes the bidirectional long short-term memory network layer transformation and bi_FCN(·) denotes the nonlinear feed-forward fully-connected layer transformation.
To illustrate the computation of LSTM(·) more clearly, take the matrix M = [m_1, m_2, ..., m_T] ∈ R^(T×D) (where m_t ∈ R^D is a vector, t = 1, 2, ..., T) as an example:
i_t = σ(W_ii·m_t + b_ii + W_hi·h_(t-1) + b_hi)
f_t = σ(W_if·m_t + b_if + W_hf·h_(t-1) + b_hf)
g_t = tanh(W_ig·m_t + b_ig + W_hg·h_(t-1) + b_hg)
o_t = σ(W_io·m_t + b_io + W_ho·h_(t-1) + b_ho)
c_t = f_t ∗ c_(t-1) + i_t ∗ g_t
h_t = o_t ∗ tanh(c_t)
where h_t is the output at time t, c_t is the cell state at time t, m_t is the input at time t, and i_t, f_t, g_t and o_t are the input gate, forget gate, cell gate and output gate respectively. σ is the sigmoid function, ∗ is the Hadamard product, W_ii, W_hi, W_if, W_hf, W_ig, W_hg, W_io and W_ho are weight matrices, and b_ii, b_hi, b_if, b_hf, b_ig, b_hg, b_io and b_ho are bias vectors. We therefore have [h_1, h_2, ..., h_T] = LSTM(M).
To illustrate the computation of bi_FCN(·), again take the matrix M ∈ R^(T×D) as an example:
H = bi_FCN(M) = tanh(M·W_f1 + b_f1)·W_f2 + b_f2 ∈ R^(T×D')
where W_f1 and W_f2 are weight matrices and b_f1 and b_f2 are bias vectors.
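As an illustration of step 2, the following is a minimal PyTorch sketch of the single-modality encoder, a bidirectional LSTM followed by the two-layer fully-connected mapping bi_FCN; the class name ModalityEncoder and all layer sizes are illustrative assumptions rather than values from the patent.

```python
# Minimal sketch of the step-2 encoder (assumed PyTorch implementation).
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    def __init__(self, feat_dim, hidden_dim, out_dim):
        super().__init__()
        # bidirectional LSTM: forward and backward hidden states are concatenated
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # bi_FCN: H = tanh(E * W_f1 + b_f1) * W_f2 + b_f2
        self.fc1 = nn.Linear(2 * hidden_dim, out_dim)
        self.fc2 = nn.Linear(out_dim, out_dim)

    def forward(self, x):                          # x: (batch, t, feat_dim)
        e, _ = self.lstm(x)                        # E = LSTM(X), shape (batch, t, 2*hidden_dim)
        return self.fc2(torch.tanh(self.fc1(e)))   # H = bi_FCN(E)

# example usage with random data standing in for the text modality Y
Y = torch.randn(1, 8, 300)                         # 8 words, 300-dim features (illustrative)
H_y = ModalityEncoder(300, 64, 50)(Y)              # (1, 8, 50) context-aware representation
```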
Step 3, element matching and fusion between two modalities
As shown in Fig. 2(a), the two single-modality feature representations H_x and H_y, each containing intra-modality element interactions, are joined to model the relationship between the two modalities. Applying a bilinear transformation to the vector feature representations from the two modalities achieves this element-matching fusion. The weight matrices W_1 and W_2 in the bilinear transformations represent context information between the modalities and ensure a more flexible coupling between two different modalities. By matching the vector feature representations of the elements of two modalities, the inter-modal element interactions are modeled to obtain the corresponding matching matrices M_xy and M_zy:
M_xy = H_x·W_1·H_y^T    (3)
M_zy = H_z·W_2·H_y^T    (4)
where the weight matrix W_1 is obtained by adapting the context information of the matrices H_x and H_y, and the weight matrix W_2 is obtained by adapting the context information of the matrices H_y and H_z.
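A minimal numpy sketch of this bilinear element matching, using the form M_xy = H_x·W_1·H_y^T reconstructed above; all sizes and the random stand-ins for H_x, H_y, H_z, W_1 and W_2 are illustrative assumptions.

```python
# Bilinear matching of step 3 for one sample (numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
t_x, t_z, t_y, d = 20, 30, 8, 50          # element counts and feature length (illustrative)
H_x = rng.standard_normal((t_x, d))       # visual context matrix
H_y = rng.standard_normal((t_y, d))       # text (central) context matrix
H_z = rng.standard_normal((t_z, d))       # acoustic context matrix
W_1 = rng.standard_normal((d, d))         # learned in practice; random here
W_2 = rng.standard_normal((d, d))

M_xy = H_x @ W_1 @ H_y.T                  # (t_x, t_y) visual-text matching matrix
M_zy = H_z @ W_2 @ H_y.T                  # (t_z, t_y) acoustic-text matching matrix
```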
Step 4, generating the attention weight vectors of the matching matrices M_xy and M_zy
Based on the matching matrices M_xy and M_zy describing the interactions between two modalities, an attention mechanism computes the relative importance of each of their elements for a specific modality, so as to optimize the joint feature representation between the modalities. The computed importance values are expressed as a vector whose length equals the number of elements of the specific modality, as shown in Fig. 3. The implicit matrix R_x of the matching matrix M_xy and the implicit matrix R_z of the matching matrix M_zy are obtained as follows:
R_x = tanh(W_x·M_xy^T + b_x),  R_z = tanh(W_z·M_zy^T + b_z)    (5)
α_x = softmax(w_x·R_x),  α_z = softmax(w_z·R_z)    (6)
where α_x and α_z are the attention weight vectors for the specific modalities X and Z respectively, and their lengths equal the numbers of elements of modalities X and Z. tanh(·) denotes the hyperbolic tangent function and softmax(·) denotes the logistic regression function. W_x and W_z are the parameter matrices used to determine the implicit matrices R_x and R_z; w_x and w_z are the parameter matrices used to determine the weight vectors α_x and α_z; b_x and b_z are the biases used to determine the implicit matrices R_x and R_z.
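A minimal numpy sketch of the attention-generation module for M_xy (the computation for M_zy is analogous); the inner width k and the random parameter values are illustrative assumptions.

```python
# Attention generation of step 4 (numpy sketch).
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
t_x, t_y, k = 20, 8, 16
M_xy = rng.standard_normal((t_x, t_y))    # matching matrix from step 3

W_x = rng.standard_normal((k, t_y))       # parameters; learned in practice
b_x = rng.standard_normal((k, 1))
w_x = rng.standard_normal(k)

R_x = np.tanh(W_x @ M_xy.T + b_x)         # implicit matrix, shape (k, t_x)
alpha_x = softmax(w_x @ R_x)              # attention weights, one per element of modality X
```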
Step 5, broadcasting the attention weights onto the matching matrices
Assigning attention weights to the corresponding values of the bimodal joint matching matrices lets the model focus on the important inter-modal interactions. However, the conventional way of averaging attention weights over different positions can reduce the resolution of the feature representation. To avoid this, the proposed fusion framework applies the attention weights in the form of a broadcast: the weight vectors α_x and α_z update the matching matrices M_xy and M_zy to yield the new matching matrices M'_xy and M'_zy:
M'_xy = attention_broadcast(M_xy, α_x), where M'_xy[i,:] = M_xy[i,:]·α_x[i]    (7)
M'_zy = attention_broadcast(M_zy, α_z), where M'_zy[i,:] = M_zy[i,:]·α_z[i]    (8)
where M'_xy[i,:] is the i-th row of M'_xy, M_xy[i,:] is the i-th row of M_xy, and α_x[i] is the i-th element of α_x; M_xy[i,:]·α_x[i] denotes the multiplication of a vector by a scalar. Similarly, M'_zy[i,:] is the i-th row of M'_zy, M_zy[i,:] is the i-th row of M_zy, and α_z[i] is the i-th element of α_z.
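A minimal numpy sketch of the broadcast of equations (7) and (8): each row of the matching matrix is scaled by its attention weight, so the matrix keeps its original resolution; the sizes and the stand-in weights are illustrative assumptions.

```python
# Attention broadcast of step 5 (numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
t_x, t_y = 20, 8
M_xy = rng.standard_normal((t_x, t_y))
alpha_x = rng.random(t_x)
alpha_x /= alpha_x.sum()                      # stand-in for the softmax attention weights

M_xy_prime = M_xy * alpha_x[:, None]          # M'_xy[i, :] = M_xy[i, :] * alpha_x[i]
```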
Step 6, exploiting multi-modal common information with Ex-KR addition
Because the softmax function is used for normalization in equation (6), some values in the attention weight vectors are very close to 0. When the attention vectors are broadcast onto the matching matrices, some values of the resulting matrices are therefore close to 0, which introduces a certain sparsity into the matching matrices. On the other hand, the two matching matrices M'_xy and M'_zy both contain information common to modality Y, i.e. each column of the two matching matrices corresponds to an element from modality Y, and this common information facilitates the deep fusion of multi-modal elements. To make better use of these matching properties, we propose a new tensor operator, Ex-KR addition. Through Ex-KR addition, multi-modal element matching centered on modality Y is realized using the common information in the two matching matrices, so that the multi-dimensional tensor P characterizing the multi-modal data fusion can be computed, as shown in Fig. 2(b):
P[:,k,j] = M'_xy[:,j] + M'_zy[k,j]    (9)
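Because each column of M'_xy and M'_zy corresponds to the same element of modality Y, Ex-KR addition can be written as a broadcast sum over that shared axis; a minimal numpy sketch with illustrative sizes follows. The operator introduces no learnable parameters, which is what keeps this fusion step lightweight.

```python
# Ex-KR addition of step 6 (numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
t_x, t_z, t_y = 20, 30, 8
M_xy_prime = rng.standard_normal((t_x, t_y))   # attended visual-text matching matrix
M_zy_prime = rng.standard_normal((t_z, t_y))   # attended acoustic-text matching matrix

# P[:, k, j] = M'_xy[:, j] + M'_zy[k, j]  ->  global tensor of shape (t_x, t_z, t_y)
P = M_xy_prime[:, None, :] + M_zy_prime[None, :, :]
```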
Step 7, broadcasting the attention weight onto the multi-modal tensor
Using the same attention-generation module as in step 4, the implicit matrix R_y is obtained as in equation (10) and the attention weight vector α_y for the multi-modal multi-dimensional tensor as in equation (11). Applying this vector to the multi-dimensional tensor P in the form of a broadcast further increases the expressive capacity of the tensor; the updated tensor P' is given by equation (12):
R_y = tanh(W_y·P^T + b_y)    (10)
α_y = softmax(w_y·R_y)    (11)
P' = attention_broadcast(P, α_y), where P'[:,:,i] = P[:,:,i]·α_y[i]    (12)
where the matrix P is the unfolding matrix of the multi-dimensional tensor P ∈ R^(t_x×t_z×t_y); W_y is the parameter matrix used to determine the implicit matrix R_y; w_y is the parameter matrix used to determine the weight vector α_y; b_y is the bias used to determine the implicit matrix R_y.
Step 8, classification
Sum pooling is applied to the multi-dimensional tensor P' obtained through the broadcast attention computation, summarizing the information of the three modalities into a corresponding vector p as in equation (13). The vector p, as a multi-modal high-level representation, is used for the final emotion classification task:
p = vec(P' ×_3 c)    (13)
where vec(·) denotes the vectorization of a matrix, ×_3 denotes the mode-3 product, and c is an all-ones vector.
The emotion distribution vector label is then predicted from the vector p as in equation (14):
label = softmax(W·p + b) ∈ R^d    (14)
where W is a learnable weight matrix and b is a bias; the i-th element label[i] of the emotion distribution vector is the probability of the i-th predicted emotion.
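A minimal numpy sketch of step 8, where the mode-3 product with an all-ones vector reduces to a sum over the text-modality axis; the number of emotion classes d and the random classifier parameters W and b are illustrative assumptions.

```python
# Sum pooling and classification of step 8 (numpy sketch).
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
t_x, t_z, t_y, d = 20, 30, 8, 7
P_prime = rng.standard_normal((t_x, t_z, t_y))   # attended tensor from step 7

c = np.ones(t_y)                                 # all-ones vector
p = (P_prime @ c).reshape(-1)                    # vec(P' x_3 c), length t_x * t_z

W = rng.standard_normal((d, p.size))             # classifier parameters; learned in practice
b = rng.standard_normal(d)
label = softmax(W @ p + b)                       # label[i] = probability of the i-th emotion
```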
As shown in Table 1, the proposed multi-modal emotion fusion method is evaluated on the emotion-state discrimination task on two multi-modal emotion databases, CMU-MOSI and YouTube. MAE is the mean absolute error, CORR is the Pearson correlation coefficient, ACC-7 is the 7-class accuracy and ACC-3 is the 3-class accuracy. Comparing these metrics of the discrimination task, the proposed method outperforms the four compared multi-modal fusion methods on most indicators.
TABLE 1 Comparison of results (the table content is reproduced as an image in the original publication and is not available in text form)
Example 2
This example differs from Example 1 in the following way. When the lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion is applied to data of four modalities (for example a text modality, a visual modality, an acoustic modality and electroencephalogram data), the matching matrices between pairs of modalities are computed first, and the matching matrices representing the local fusion of two modalities are then integrated with the proposed Ex-KR addition to obtain the multi-modal global tensor representation. To illustrate the computational flow, the four modalities involved in the fusion are denoted m1, m2, m3 and m4, and the data of each modality is organized in the form of a two-dimensional matrix.
As in equations (1) and (2), the bidirectional long short-term memory network layer transformation LSTM(·) and the nonlinear feed-forward fully-connected layer transformation bi_FCN(·) are used to obtain the context-information matrix representation of each single modality.
With modality m1 as the common modality, the same procedures as in steps 3, 4 and 5 are used to perform element matching between modalities and to apply the generated attention weight vectors to the matching matrices through the broadcast operation, yielding three matching matrices that all share modality m1.
Since these three matching matrices all contain the common information of modality m1, the multi-modal global tensor representation can be obtained with Ex-KR addition using this common information. The specific computation flow, sketched in the code after this example, is as follows:
First, the matching matrices of modalities m2 and m3 against m1 are fused by Ex-KR addition, analogously to equation (9), into a three-dimensional tensor P fusing the three modalities (m1, m2 and m3).
The tensor P is then unfolded along its third (m1) dimension into a matrix. Because this matrix and the matching matrix of modality m4 against m1 both contain the common information of modality m1, a further Ex-KR addition fuses them into a four-dimensional tensor Q.
The tensor Q computed in this way fuses the information of the four modalities by means of the common information of modality m1, and because Ex-KR addition contains no weight parameters that need to be learned, a more lightweight multi-modal fusion modeling is achieved.
Following the computation procedure of step 7, the attention vector of the tensor Q can be computed and applied to Q through the broadcast operation.
Finally, the sum-pooling operation described in step 8 and equation (14) are used to compute the corresponding emotion distribution vector label, realizing multi-modal emotion analysis based on multi-element hierarchical deep fusion of the four modalities' data.
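A minimal numpy sketch of this four-modality fusion with m1 as the shared modality; the matrix names M21, M31 and M41, the element counts, and the reshape-based unfolding are illustrative assumptions, not notation from the patent.

```python
# Four-modality Ex-KR fusion of Example 2 (numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
t1, t2, t3, t4 = 8, 20, 30, 12
M21 = rng.standard_normal((t2, t1))   # attended matching matrix of m2 against m1
M31 = rng.standard_normal((t3, t1))   # attended matching matrix of m3 against m1
M41 = rng.standard_normal((t4, t1))   # attended matching matrix of m4 against m1

# first Ex-KR addition: fuse m1, m2, m3 into a 3-D tensor
P = M21[:, None, :] + M31[None, :, :]            # (t2, t3, t1)

# unfold P while keeping the shared m1 axis as columns, then fuse with M41
P_mat = P.reshape(t2 * t3, t1)                   # (t2*t3, t1)
Q = P_mat[:, None, :] + M41[None, :, :]          # second Ex-KR addition, (t2*t3, t4, t1)
Q = Q.reshape(t2, t3, t4, t1)                    # 4-D global tensor over all four modalities
```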

Claims (9)

1. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion is characterized by comprising the following steps:
step 1, collecting n modalities of data from a subject and converting each modality's data into matrix form, with n ≥ 3;
step 2, performing interactive modeling on each modality's data to obtain, for each modality, a matrix containing context information;
step 3, taking the matrix of one modality as the central matrix, and performing element-interaction matching between each of the remaining n-1 modality matrices and the central matrix to obtain n-1 matching matrices;
step 4, generating weight vectors for the n-1 matching matrices obtained in step 3;
step 5, updating the n-1 matching matrices with the n-1 weight vectors obtained in step 4;
step 6, performing multi-modal element matching with the n-1 matching matrices to obtain an n-dimensional tensor P;
step 7, computing the implicit matrix and weight vector of the central matrix, and updating the tensor P with the central matrix's weight vector to obtain the tensor P';
step 8, predicting the emotion distribution vector label from the tensor P', wherein the i-th element label[i] of the emotion distribution vector is the probability that the subject is in the i-th emotion, and the subject's emotion during the multi-modal data acquisition is determined accordingly.
2. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 2, the matrix containing context information is H = bi_FCN(E), where the intermediate matrix E = LSTM(X); LSTM(·) denotes a bidirectional long short-term memory network layer transformation; bi_FCN(·) denotes a nonlinear feed-forward fully-connected layer transformation; the matrix X is the matrix of a single modality obtained in step 1.
3. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 3, the matching matrix between a matrix H_x and the central matrix H_y is M_xy = H_x·W_1·H_y^T, where W_1 is the weight matrix obtained by adapting the matrices H_x and H_y.
4. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 4, the implicit matrix R_x and the weight vector α_x of the matching matrix M_xy are expressed as follows:
R_x = tanh(W_x·M_xy^T + b_x)
α_x = softmax(w_x·R_x)
where tanh(·) denotes the hyperbolic tangent function; softmax(·) denotes the logistic regression function; W_x is the parameter matrix used to determine the implicit matrix R_x; w_x is the parameter matrix used to determine the weight vector α_x; b_x is the bias used to determine the implicit matrix R_x.
5. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 5, the matching matrix M_xy is updated to obtain the matching matrix M'_xy, whose i-th row is M'_xy[i,:] = M_xy[i,:]·α_x[i], where M_xy[i,:] is the i-th row of M_xy and α_x[i] is the i-th element of the vector α_x.
6. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 6, the tensor P obtained by multi-modal element matching of the matching matrices M'_xy and M'_zy has sub-vectors P[:,k,j] = M'_xy[:,j] + M'_zy[k,j], where M'_xy[:,j] is the j-th column of M'_xy and M'_zy[k,j] is an element of M'_zy.
7. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 7, the implicit matrix R_y of the central matrix is given by equation (10) and the weight vector α_y of the central matrix by equation (11):
R_y = tanh(W_y·P^T + b_y)    (10)
α_y = softmax(w_y·R_y)    (11)
where the matrix P is the unfolding matrix of the multi-dimensional tensor P; W_y is the parameter matrix used to determine the implicit matrix R_y; w_y is the parameter matrix used to determine the weight vector α_y; b_y is the bias used to determine the implicit matrix R_y;
the sub-matrices of the tensor P' are P'[:,:,i] = P[:,:,i]·α_y[i], where P[:,:,i] is a sub-matrix of the tensor P and α_y[i] is an element of the weight vector α_y.
8. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: the specific process of step 8 is as follows:
the vector p corresponding to the tensor P' is computed as in equation (13):
p = vec(P' ×_3 c)    (13)
where vec(·) denotes the vectorization of a matrix, ×_3 denotes the mode-3 product, and c is an all-ones vector;
the emotion distribution vector label is predicted from the vector p as in equation (14):
label = softmax(W·p + b) ∈ R^d    (14)
where W is a weight matrix and b is a bias.
9. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 1, the modality data are of three types, namely a text modality, a visual modality and an acoustic modality.
CN202011452285.4A 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion Active CN112541541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011452285.4A CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011452285.4A CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Publications (2)

Publication Number Publication Date
CN112541541A (en) 2021-03-23
CN112541541B (en) 2024-03-22

Family

ID=75018379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011452285.4A Active CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Country Status (1)

Country Link
CN (1) CN112541541B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064968A (en) * 2021-04-06 2021-07-02 齐鲁工业大学 Social media emotion analysis method and system based on tensor fusion network
CN115019237A (en) * 2022-06-30 2022-09-06 中国电信股份有限公司 Multi-modal emotion analysis method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325120A (en) * 2018-09-14 2019-02-12 江苏师范大学 A kind of text sentiment classification method separating user and product attention mechanism
CN111012307A (en) * 2019-11-26 2020-04-17 清华大学 Method and device for evaluating training input degree of patient based on multi-mode information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325120A (en) * 2018-09-14 2019-02-12 江苏师范大学 A kind of text sentiment classification method separating user and product attention mechanism
CN111012307A (en) * 2019-11-26 2020-04-17 清华大学 Method and device for evaluating training input degree of patient based on multi-mode information

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064968A (en) * 2021-04-06 2021-07-02 齐鲁工业大学 Social media emotion analysis method and system based on tensor fusion network
CN113064968B (en) * 2021-04-06 2022-04-19 齐鲁工业大学 Social media emotion analysis method and system based on tensor fusion network
CN115019237A (en) * 2022-06-30 2022-09-06 中国电信股份有限公司 Multi-modal emotion analysis method and device, electronic equipment and storage medium
CN115019237B (en) * 2022-06-30 2023-12-08 中国电信股份有限公司 Multi-mode emotion analysis method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112541541B (en) 2024-03-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant