CN111178389B - Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling - Google Patents

Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling

Info

Publication number
CN111178389B
CN111178389B (application CN201911244389.3A)
Authority
CN
China
Prior art keywords
modal
tensor
data
dimension
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911244389.3A
Other languages
Chinese (zh)
Other versions
CN111178389A (en)
Inventor
唐佳佳 (Tang Jiajia)
金宣妤 (Jin Xuanyu)
孔万增 (Kong Wanzeng)
张建海 (Zhang Jianhai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201911244389.3A priority Critical patent/CN111178389B/en
Publication of CN111178389A publication Critical patent/CN111178389A/en
Application granted granted Critical
Publication of CN111178389B publication Critical patent/CN111178389B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling. An attention mechanism assigns a weight to each modality and grades the importance of the different modal data, so that, according to each modality's contribution to the task, the interactions of the high-contribution modalities are amplified in the fusion stage. Compared with a single-channel polynomial tensor pooling module, the multi-channel polynomial tensor pooling module captures robust, local, high-dimensional and complex nonlinear interaction information at a fine-grained level. By first judging the importance of the multi-modal data, the invention characterises stable local high-dimensional complex dynamic interaction information at a fine-grained level, and is an effective complement to existing multi-modal fusion frameworks in the field of emotion recognition.

Description

Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling
Technical Field
The invention belongs to the field of multi-modal emotion recognition at the intersection of natural language processing, vision and speech, and particularly relates to a method that judges a subject's emotional state by hierarchically fusing multi-modal information at a fine-grained level with an attention-based multi-channel polynomial tensor pooling technique.
Background
How to effectively judge an individual's emotional state is a long-standing research topic. For example, an e-commerce website can analyse a consumer's facial expression, speech or written review of a specific product to obtain the consumer's emotional feedback on that product (negative or positive emotion).
Single-modality data such as facial expressions, speech or text can each be used for emotion recognition on its own, but a single modality is not sufficient to fully characterise an emotional state. Multi-modal data supplement the emotion recognition task with information from multiple perspectives. For example, text alone may allow only a fuzzy judgment of the emotional state, whereas combining it with facial-expression information can pin down the emotion type: an individual may say "you can really be annoying" with a broad smile; from the text alone the current valence appears negative, but the facial expression supports the opposite, positive, judgment. At the same time, the interaction information shared across the modalities acts as a common pattern contained in all of them and strengthens the robustness of the emotion recognition task.
Current multi-modal data fusion methods generally analyse from a coarse-grained perspective and usually consider only two simple linear fusion schemes, bilinear and trilinear fusion, so they can capture only low-dimensional, simple interaction information between the modalities. Moreover, existing tensor-based linear fusion methods decompose the fused tensor as a whole, which increases the storage burden and the computational complexity (the required storage tends to grow exponentially with the fusion order) and prevents higher-order, more complex interactions. In addition, existing multi-modal interaction models treat every modality as equally important during interaction and do not assign different weights to the modalities, which biases the final task accuracy.
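As a rough, purely illustrative calculation of this storage growth (the 128-dimensional feature size below is an assumption, not a figure from this patent):

```python
d = 128                     # assumed per-modality feature dimension
bilinear  = d ** 2          # 16,384 entries for a two-modality outer product
trilinear = d ** 3          # 2,097,152 entries for three modalities
order_5   = d ** 5          # ~3.4e10 entries for a 5th-order fusion: storage grows exponentially with the order
print(bilinear, trilinear, order_5)
```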
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling. First, an attention network is applied to the multi-modal data and a separate weight is set for each modality to represent its importance (the effect of high-contribution modal data on the interaction part can thereby be amplified). Second, the multi-modal data obtained through the attention network are characterised by multi-channel tensor pooling (which stabilises the data representation). Finally, the multi-channel tensor-pooled representations are fused in a deep, hierarchical and iterative manner, and the resulting global representation is used to judge the emotion task.
The technical scheme adopted by the invention is as follows:
step 1, acquiring multi-modal information data
A modality is a source or form of information; the multi-modal information data comprise speech, video, text and other media capable of recording human emotional information.
Step 2, multi-mode information data preprocessing
In order to avoid excessive differences between the feature distributions of the individual modalities, a Long Short-Term Memory (LSTM) network or a Gated Recurrent Unit (GRU) network is used to extract, for each modality, the short-term memory vector at every time step as the feature vector of that time step:

x_t^m = g_out ⊙ f(C(t))        (1)

where x_t^m is the feature vector of the m-th modality at the t-th time step, i.e. the short-term memory vector of the LSTM network at time t, g_out is the output gate of the LSTM network, C(t) is the long-term memory cell of the LSTM network, and f is the activation function.
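As a concrete illustration of this preprocessing step, the sketch below (assuming PyTorch and illustrative input/feature dimensions that the patent does not prescribe) extracts the per-time-step short-term memory vectors of three modalities:

```python
import torch
import torch.nn as nn

T, d = 8, 32                       # assumed number of time steps and feature size
raw = {                            # one raw sequence per modality (batch of 1); input sizes are assumptions
    "text":  torch.randn(1, T, 300),
    "video": torch.randn(1, T, 35),
    "audio": torch.randn(1, T, 74),
}

encoders = nn.ModuleDict({m: nn.LSTM(x.shape[-1], d, batch_first=True)
                          for m, x in raw.items()})

features = {}
for m, x in raw.items():
    h, _ = encoders[m](x)          # h[:, t, :] is the output-gated short-term memory at time t, cf. eq. (1)
    features[m] = h                # feature vectors of modality m at every time step, shape (1, T, d)
```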
Step 3, multi-mode data information organization
The feature vectors of the modal information data preprocessed in step 2 are organised into a pseudo two-dimensional matrix G, whose first dimension is the time dimension and whose second dimension is the modality dimension; each element of the matrix is the feature vector of the corresponding modality at the corresponding time:

G = [ x_t^m ],  t = 1, …, T,  m = 1, …, M        (2)

where T is the size of the time dimension of the data and M is the number of modalities;
step 4, attention mechanism setting
For the pseudo two-dimensional matrix G obtained in step 3, an attention network is set over all modal data at all time steps to obtain a new pseudo two-dimensional matrix G_1:

G_1 = [ α_t^m ∘ x_t^m ],  t = 1, …, T,  m = 1, …, M        (3)

where α_t^m is the weight of modality x_t^m at the t-th time step and ∘ denotes the modular multiplication.
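A minimal sketch of steps 3 and 4, reusing the per-modality features from the previous sketch; the concrete attention network (here a single linear scoring layer with a softmax over the modalities) and the interpretation of the modular multiplication as element-wise scaling are assumptions, since the patent does not fix these details:

```python
import torch
import torch.nn as nn

T, M, d = 8, 3, 32                                     # assumed sizes (T and M follow the embodiment below)
G = torch.stack([features["text"], features["video"], features["audio"]], dim=2)  # (1, T, M, d): pseudo 2-D matrix G

scorer = nn.Linear(d, 1)                               # assumed attention scoring network
alpha = torch.softmax(scorer(G).squeeze(-1), dim=2)    # (1, T, M): weight of each modality at each time step
G1 = alpha.unsqueeze(-1) * G                           # modular multiplication, read here as scaling each vector by its weight
```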
Step 5, multi-channel high-order polynomial tensor pooling operation of multi-modal information
5.1 Initialise the iteration count k = 1 and the time-dimension size T_0 = T;
5.2 Within the time dimension of size T_0, for the pseudo two-dimensional matrix G_k, splice the feature vectors of any two modalities inside a time window into a new feature vector z_ij; then perform a high-order (P-order) polynomial fusion operation on z_ij according to equation (4) to obtain the P-order data tensor Z_P:

Z_P = z_ij ⊗ z_ij ⊗ … ⊗ z_ij   (P factors)        (4)

where ⊗ denotes the tensor product and i, j ∈ [1, M]; the length of the time window is T_1 and its step size is s.

Then apply C single-channel low-rank tensor pooling operations to Z_P along the dimensions of the P-order tensor, finally obtaining C new feature vectors z_ij^1, …, z_ij^C, where the h-th data element z_h of the feature vector z_ij^c is:

z_h = Σ_{i_1,…,i_P} W_h(i_1, …, i_P) · Z_P(i_1, …, i_P)        (5)

where W_h is a P-order weight tensor and i_1, …, i_P are the indices along the dimensions of the P-order tensor.
for the C new feature vectors
Figure BDA0002307123420000036
Performing maximum pooling to obtain local feature vector of two-mode information fusion in the time window
Figure BDA0002307123420000037
Wherein
Figure BDA0002307123420000038
H-th data element z'hThe following were used:
Figure BDA0002307123420000039
wherein C is the number of times of single-channel tensor pooling operation of modal information in the same time window, namely the number of channels of the multi-channel tensor pooling operation; whcA P-order tensor weight for the c-th channel;
for pseudo two-dimensional matrix GkAll the modal feature vectors are subjected to the two-modal fusion operation to obtain a plurality of modal feature vectors
Figure BDA00023071234200000310
The final build size is
Figure BDA00023071234200000311
Pseudo two-dimensional matrix G ofk+1
5.3 Judge whether k ≥ N, where N is the maximum number of iterations. If so, output the current pseudo two-dimensional matrix G_{k+1}; otherwise reset k = k + 1, update T_0 to the time-dimension size of the newly built matrix, and jump back to step 5.2.
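The following is a minimal sketch of the multi-channel P-order polynomial tensor pooling used in step 5.2. It assumes PyTorch and rank-R CP-factorised P-order weight tensors, so that the P-order data tensor Z_P never has to be materialised; P, R, C and the output size are illustrative choices rather than values prescribed by the patent:

```python
import torch
import torch.nn as nn

class MultiChannelPTP(nn.Module):
    """Multi-channel polynomial tensor pooling of one spliced vector z_ij (eqs. (4)-(6))."""
    def __init__(self, in_dim, out_dim, P=3, R=4, C=4):
        super().__init__()
        # factors[c][p, :, r, h] is the p-th CP factor of the c-th channel's P-order weight tensor W_hc
        self.factors = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(P, in_dim, R, out_dim)) for _ in range(C)])

    def forward(self, z):                                  # z: (batch, in_dim), the spliced two-modality vector
        channel_outputs = []
        for W in self.factors:                             # one single-channel low-rank pooling per channel, cf. eq. (5)
            proj = torch.einsum('bi,pirh->bprh', z, W)     # inner product of z with every CP factor
            channel_outputs.append(proj.prod(dim=1).sum(dim=1))   # product over the P orders, sum over rank R
        # max pooling across the C channels gives the local feature vector z'_ij, cf. eq. (6)
        return torch.stack(channel_outputs, dim=0).max(dim=0).values

# Illustrative use on one time window: splice the vectors of two modalities and pool them.
x_i, x_j = torch.randn(1, 64), torch.randn(1, 64)
z_ij = torch.cat([x_i, x_j], dim=-1)                       # spliced feature vector z_ij
local_vec = MultiChannelPTP(in_dim=128, out_dim=32)(z_ij)  # (1, 32) local fused vector for this window
```

The CP factorisation is one common way to realise a "low-rank tensor pooling" without building the full P-order tensor; the patent itself does not specify which low-rank parameterisation is used.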
Step 6, multi-modal global interaction
All feature vectors of the pseudo two-dimensional matrix G_{k+1} output by step 5 are spliced into a new feature vector z'; a high-order (P-order) polynomial fusion operation (as in equation (4)) is then applied to z' to obtain the P-order data tensor Z'_P, and a multi-channel low-rank tensor pooling operation (as in equation (6)) is applied to Z'_P along the dimensions of the P-order tensor, finally yielding the global feature vector z.
Step 7, multi-modal information data classification
The global interaction vector z obtained in step 6 is compared with the pre-annotated emotion class label to finally obtain the classification result.
The emotion class label is the label annotated in advance when the emotion modal information data were collected in step 1.
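A minimal sketch of steps 6 and 7 under the same assumptions, reusing the MultiChannelPTP module above; the list final_vectors, the hidden sizes and the binary valence label are illustrative placeholders, and the patent does not prescribe a specific classifier:

```python
import torch
import torch.nn as nn

# final_vectors: assumed stand-in for the feature vectors making up the last pseudo 2-D matrix G_{k+1}
final_vectors = [torch.randn(1, 32) for _ in range(6)]

z_prime = torch.cat(final_vectors, dim=-1)                 # step 6: splice everything into one vector z'
global_ptp = MultiChannelPTP(in_dim=z_prime.shape[-1], out_dim=64)
z_global = global_ptp(z_prime)                             # global interaction vector z

classifier = nn.Linear(64, 2)                              # step 7: assumed binary valence head (negative/positive)
logits = classifier(z_global)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1]))    # compare against the pre-annotated emotion label
```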
The invention has the following beneficial effects. The method combines an attention mechanism that assigns each modality a weight reflecting its importance, so that, according to the contribution of the different modal data to the task, the interactions of the high-contribution modalities are amplified in the fusion stage. In addition, the multi-channel tensor pooling operation overcomes the instability of the high-dimensional complex interactions produced by single-channel tensor pooling. Through iterative fusion that respects the different contributions of the multi-modal data, the method characterises stable, highly robust, high-dimensional complex dynamic interaction information at a fine-grained level, and is an effective complement to multi-modal fusion frameworks in the current field of emotion recognition.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the multi-channel higher-order polynomial tensor pooling operation of the multi-modal information of the present invention;
FIG. 3 is a diagram of a layered fusion framework of the present invention;
FIG. 4 is a schematic illustration of an attention mechanism;
FIG. 5 is a schematic diagram of a single channel polynomial tensor pooling module;
FIG. 6 is a schematic diagram of a multi-channel polynomial tensor pooling module.
Detailed Description
The method of the present invention will be described in detail below with reference to the accompanying drawings.
The multi-modal depth layered fusion emotion analysis method based on multi-channel tensor pooling is shown in figure 1:
step 1, obtaining three modal information data of text, video and audio of an individual by the prior art
Text information alone allows only a fuzzy judgment of the emotional state, i.e. the emotion type (such as negative or positive emotion) cannot be accurately determined from the text; the emotional valence (positive or negative) can be preliminarily determined from the individual's facial expression in the video; and the emotional activation can be objectively judged from the fluctuation of the voice (e.g. its amplitude) over a period of time.
Step 2, multi-mode information data preprocessing
In order to avoid excessive differences between the feature distributions of the individual modalities, a Long Short-Term Memory (LSTM) network or a Gated Recurrent Unit (GRU) network is used to extract, for each modality, the short-term memory vector at every time step as the feature vector of that time step:

x_t^m = g_out ⊙ f(C(t))        (1)

where x_t^m is the feature vector of the m-th modality at the t-th time step, i.e. the short-term memory vector of the LSTM network at time t, g_out is the output gate of the LSTM network, C(t) is the long-term memory cell of the LSTM network, and f is the activation function.
Step 3, multi-mode data information organization
The feature vectors of the modal information data preprocessed in step 2 are organised into a pseudo two-dimensional matrix G, whose first dimension is the time dimension (T = 8) and whose second dimension is the modality dimension (M = 3); each element of the matrix is the feature vector of the corresponding modality at the corresponding time:

G = [ x_t^m ],  t = 1, …, T,  m = 1, …, M        (2)

where T is the size of the time dimension of the data and M is the number of modalities;
step 4, attention mechanism setting
For the pseudo two-dimensional matrix G obtained in step 3, an attention network is set over all modal data at all time steps to obtain a new pseudo two-dimensional matrix G_1:

G_1 = [ α_t^m ∘ x_t^m ],  t = 1, …, T,  m = 1, …, M        (3)

where α_t^m is the weight of modality x_t^m at the t-th time step, ∘ denotes the modular multiplication, and x_t^1, x_t^2 and x_t^3 are the feature vectors of the text, video and audio modalities, respectively.
Step 5, multi-channel high-order polynomial tensor pooling of the multi-modal information. First, a time window is scanned along the modality dimension to obtain the modality pairs [video, audio], [text, audio] and [text, video]; after the scan over the modality dimension is finished, the scan proceeds along the time dimension, so that 12 new feature vectors are obtained from the first layer and used as the pseudo two-dimensional matrix G_2 of the second layer. Pairwise modal fusion is then applied to the feature vectors of the second layer, yielding 6 new feature vectors that form the pseudo two-dimensional matrix G_3 of the third layer. Finally, at the third layer, all nodes of the time window of the current layer are fused, and the resulting output feature vector serves as the basis for judging the emotional state (a small bookkeeping sketch of these layer sizes follows below).
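Purely as a bookkeeping illustration of the layer sizes quoted above (the patent states only the resulting counts of 12 and 6), the arithmetic below shows how they can arise from T = 8, M = 3, a window length T1 = 2 and a stride s = 2:

```python
T, M, T1, s = 8, 3, 2, 2
pairs = M * (M - 1) // 2              # [video, audio], [text, audio], [text, video] -> 3 modality pairs
windows_1 = (T - T1) // s + 1         # 4 time windows in the first layer
layer2 = windows_1 * pairs            # 12 feature vectors -> pseudo 2-D matrix G2 (4 windows x 3 pairs)
windows_2 = (windows_1 - T1) // s + 1 # 2 time windows in the second layer
layer3 = windows_2 * pairs            # 6 feature vectors -> pseudo 2-D matrix G3
print(layer2, layer3)                 # 12 6
```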
5.1 Initialise the iteration count k = 1 and the time-dimension size T_0 = T;
5.2 Within the time dimension of size T_0, as in Fig. 3, for the pseudo two-dimensional matrix G_k, splice the feature vectors of any two modalities inside a time window into a new feature vector z_ij; then perform a high-order (P-order) polynomial fusion operation on z_ij according to equation (4) to obtain the P-order data tensor Z_P:

Z_P = z_ij ⊗ z_ij ⊗ … ⊗ z_ij   (P factors)        (4)

where ⊗ denotes the tensor product and i, j ∈ [1, 3]. The length of the time window is T_1 = 2 (covering the data at times t_1 and t_2) and the step size is s = 2.

As shown in Fig. 5, the conventional approach applies a single single-channel low-rank tensor pooling operation to Z_P along the dimensions of the P-order tensor and outputs one new feature vector z'_ij for each time window, whose h-th data element z_h is:

z_h = Σ_{i_1,…,i_P} W_h(i_1, …, i_P) · Z_P(i_1, …, i_P)        (5)

where W_h is a P-order weight tensor and i_1, …, i_P are the indices along the dimensions of the P-order tensor.
However, although single-channel high-order (P-order) polynomial tensor pooling can capture high-dimensional complex interaction information, the resulting model may be unstable. To strengthen robustness, the invention therefore performs several single-channel high-order (P-order) polynomial tensor pooling operations in parallel, as shown in Fig. 6; specifically:
to ZpPerforming C single-channel low-rank tensor pooling operation according to each dimension of the P-order tensor to finally obtain C new eigenvectors
Figure BDA0002307123420000064
Wherein the feature vector
Figure BDA0002307123420000065
H-th data element zhThe following were used:
Figure BDA0002307123420000066
wherein WhIs a tensor weight of the P order, i1,…,ipSubscripts for each dimension of the P-order tensor;
for the C new feature vectors
Figure BDA0002307123420000067
Performing maximum pooling to obtain local feature vector of two-mode information fusion in the time window
Figure BDA0002307123420000068
Wherein
Figure BDA0002307123420000069
H-th data element z'hThe following were used:
Figure BDA00023071234200000610
formula (where C is the number of times of single-channel tensor pooling operation of modal information in the same time window, i.e. the number of channels of multi-channel tensor pooling operation; WhcA P-order tensor weight for the c-th channel;
as shown in fig. 6, which is a schematic diagram of the multi-channel polynomial tensor pooling module of the present invention, compared to the single-channel polynomial tensor pooling module, the multi-channel pooling operation performs multiple high-order (P-order) polynomial fusion operations on the spliced data to obtain multiple P-order data tensors, and finally outputs multiple new eigenvectors in one time window, and a maximum pooling operation is performed on the multiple eigenvectors, that is, a maximum value solving operation is performed on all element sets specified by the same subscript of the multiple eigenvectors, and the obtained maximum value is used as a new element specified by the subscript, so that finally the multiple eigenvectors perform a dimension reduction operation along the channel dimension, and only one eigenvector is obtained as an output of the time window, which greatly increases robustness and reduces randomness.
The above two-modality fusion operation is applied to all modal feature vectors of the pseudo two-dimensional matrix G_k, yielding a set of local feature vectors z'_ij that finally form the pseudo two-dimensional matrix G_{k+1} of size ((T_0 - T_1)/s + 1) × M(M - 1)/2 (number of time windows × number of modality pairs).
5.3 Judge whether k ≥ N, where N is the maximum number of iterations (N = 2). If so, output the current pseudo two-dimensional matrix G_{k+1}; otherwise reset k = k + 1, update T_0 to the time-dimension size of the newly built matrix, and jump back to step 5.2.
Step 6, multi-modal global interaction
All feature vectors of the pseudo two-dimensional matrix G_{k+1} output by step 5 are spliced into a new feature vector z'; a high-order (P-order) polynomial fusion operation (as in equation (4)) is then applied to z' to obtain the P-order data tensor Z'_P, and a multi-channel low-rank tensor pooling operation (as in equation (6)) is applied to Z'_P along the dimensions of the P-order tensor, finally yielding the global feature vector z.
Step 7, multi-modal information data classification
The global interaction vector z obtained in step 6 is compared with the pre-annotated emotion class label to finally obtain the classification result.
As shown in Table 1, the emotion-state discrimination task was carried out with the proposed method and four baseline multi-modal fusion methods on the two multi-modal emotion databases CMU-MOSI and IEMOCAP. MAE is the mean absolute error, CORR the Pearson correlation coefficient, and ACC-7 the 7-class accuracy. Across these metrics, the results of the proposed method are better than or comparable to those of the baseline models.
Table 1. Comparison of results (the table body is reproduced only as an image in the original publication).

Claims (1)

1. The multi-modal depth layered fusion emotion analysis method based on multi-channel tensor pooling is characterized by comprising the following steps of:
step 1, acquiring multi-modal information data
Step 2, multi-mode information data preprocessing
A long short-term memory (LSTM) network or a gated recurrent unit (GRU) network is used to extract, for each modality, the short-term memory vector at every time step as the feature vector of that time step:

x_t^m = g_out ⊙ f(C(t))        (1)

where x_t^m is the feature vector of the m-th modality at the t-th time step, i.e. the short-term memory vector of the LSTM network at time t, g_out is the output gate of the LSTM network, C(t) is the long-term memory cell of the LSTM network, and f is the activation function;
step 3, multi-mode data information organization
The feature vectors of the modal information data preprocessed in step 2 are organised into a pseudo two-dimensional matrix G, whose first dimension is the time dimension and whose second dimension is the modality dimension; each element of the matrix is the feature vector of the corresponding modality at the corresponding time:

G = [ x_t^m ],  t = 1, …, T,  m = 1, …, M        (2)
wherein T represents the size of the data time dimension, and M represents the number of modes;
step 4, attention mechanism setting
For the pseudo two-dimensional matrix G obtained in step 3, an attention network is set over all modal data at all time steps to obtain a new pseudo two-dimensional matrix G_1:

G_1 = [ α_t^m ∘ x_t^m ],  t = 1, …, T,  m = 1, …, M        (3)

where α_t^m is the weight of modality x_t^m at the t-th time step and ∘ denotes the modular multiplication;
step 5, multi-channel high-order polynomial tensor pooling operation of multi-modal information
5.1 Initialise the iteration count k = 1 and the time-dimension size T_0 = T;
5.2 Within the time dimension of size T_0, for the pseudo two-dimensional matrix G_k, splice the feature vectors of any two modalities inside a time window into a new feature vector z_ij; then perform a high-order polynomial fusion operation on z_ij according to equation (4) to obtain the P-order data tensor Z_P:

Z_P = z_ij ⊗ z_ij ⊗ … ⊗ z_ij   (P factors)        (4)

where ⊗ denotes the tensor product and i, j ∈ [1, M]; the length of the time window is T_1 and its step size is s;

Then C single-channel low-rank tensor pooling operations are applied to Z_P along the dimensions of the P-order tensor, finally obtaining C new feature vectors z_ij^1, …, z_ij^C, where the h-th data element z_h of the feature vector z_ij^c is:

z_h = Σ_{i_1,…,i_P} W_h(i_1, …, i_P) · Z_P(i_1, …, i_P)        (5)

where W_h is a P-order weight tensor and i_1, …, i_P are the indices along the dimensions of the P-order tensor;
for the C new feature vectors
Figure FDA0003333210340000026
Performing maximum pooling to obtain local feature vector of two-mode information fusion in the time window
Figure FDA0003333210340000027
Wherein
Figure FDA0003333210340000028
H-th data element z'hThe following were used:
Figure FDA0003333210340000029
wherein C is the number of times of single-channel tensor pooling operation of modal information in the same time window, namely the number of channels of the multi-channel tensor pooling operation; whcA P-order tensor weight for the c-th channel;
for pseudo two-dimensional matrix GkAll the modal feature vectors are subjected to the two-modal fusion operation to obtain a plurality of modal feature vectors
Figure FDA00033332103400000210
The final build size is
Figure FDA00033332103400000211
Pseudo two-dimensional matrix G ofk+1
5.3 Judge whether k ≥ N, where N is the maximum number of iterations; if so, output the current pseudo two-dimensional matrix G_{k+1}; otherwise reset k = k + 1, update T_0 to the time-dimension size of the newly built matrix, and jump back to step 5.2;
step 6, multi-modal global interaction
All feature vectors of the pseudo two-dimensional matrix G_{k+1} output by step 5 are spliced into a new feature vector z'; a high-order polynomial fusion operation is then applied to z' to obtain the P-order data tensor Z'_P, and a multi-channel low-rank tensor pooling operation is applied to Z'_P along the dimensions of the P-order tensor, finally obtaining the global feature vector z;
step 7, multi-modal information data classification
The global interaction vector z obtained in step 6 is compared with the pre-annotated emotion class label to finally obtain the classification result.
CN201911244389.3A 2019-12-06 2019-12-06 Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling Active CN111178389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911244389.3A CN111178389B (en) 2019-12-06 2019-12-06 Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911244389.3A CN111178389B (en) 2019-12-06 2019-12-06 Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling

Publications (2)

Publication Number Publication Date
CN111178389A CN111178389A (en) 2020-05-19
CN111178389B true CN111178389B (en) 2022-02-11

Family

ID=70655407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911244389.3A Active CN111178389B (en) 2019-12-06 2019-12-06 Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling

Country Status (1)

Country Link
CN (1) CN111178389B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753549B (en) * 2020-05-22 2023-07-21 江苏大学 Multi-mode emotion feature learning and identifying method based on attention mechanism
CN111786979B (en) * 2020-06-24 2022-07-22 杭州电子科技大学 Power attack identification method based on multi-mode learning
CN112199504B (en) * 2020-10-30 2022-06-03 福州大学 Visual angle level text emotion classification method and system integrating external knowledge and interactive attention mechanism
CN112329604B (en) * 2020-11-03 2022-09-20 浙江大学 Multi-modal emotion analysis method based on multi-dimensional low-rank decomposition
CN112329633B (en) * 2020-11-05 2022-08-23 南开大学 Emotion identification method, device, medium and electronic equipment based on tensor decomposition
CN112597841B (en) * 2020-12-14 2023-04-18 之江实验室 Emotion analysis method based on door mechanism multi-mode fusion
CN112612936B (en) * 2020-12-28 2022-03-08 杭州电子科技大学 Multi-modal emotion classification method based on dual conversion network
CN113064968B (en) * 2021-04-06 2022-04-19 齐鲁工业大学 Social media emotion analysis method and system based on tensor fusion network
CN113208593A (en) * 2021-04-08 2021-08-06 杭州电子科技大学 Multi-modal physiological signal emotion classification method based on correlation dynamic fusion
CN113469365B (en) * 2021-06-30 2024-03-19 上海寒武纪信息科技有限公司 Reasoning and compiling method based on neural network model and related products thereof
CN114511494A (en) * 2021-12-21 2022-05-17 北京医准智能科技有限公司 Gland density grade determining method and device and computer readable storage medium
CN116563751B (en) * 2023-04-19 2024-02-06 湖北工业大学 Multi-mode emotion analysis method and system based on attention mechanism

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409296A (en) * 2018-10-30 2019-03-01 河北工业大学 The video feeling recognition methods that facial expression recognition and speech emotion recognition are merged
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 The multi-modal sensibility classification method merged based on text, voice and video

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090164302A1 (en) * 2007-12-20 2009-06-25 Searete Llc, A Limited Liability Corporation Of The State Of Delaware Methods and systems for specifying a cohort-linked avatar attribute

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409296A (en) * 2018-10-30 2019-03-01 河北工业大学 The video feeling recognition methods that facial expression recognition and speech emotion recognition are merged
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 The multi-modal sensibility classification method merged based on text, voice and video

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
EmoSense: Automatically Sensing Emotions From Speech By Multi-way Classification; V Ramu Reddy et al.; IEEE; 2018-10-29; pp. 4987-4990 *
Image emotion analysis based on a multi-modal discriminative embedding space; Lyu Guangrui; Journal of Beijing University of Posts and Telecommunications; 2019-03-19; Vol. 42, No. 1; pp. 61-67 *

Also Published As

Publication number Publication date
CN111178389A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN111178389B (en) Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
Dering et al. A convolutional neural network model for predicting a product's function, given its form
Zheng et al. An ensemble model for multi-level speech emotion recognition
CN112699774A (en) Method and device for recognizing emotion of person in video, computer equipment and medium
CN112487949B (en) Learner behavior recognition method based on multi-mode data fusion
CN112560495A (en) Microblog rumor detection method based on emotion analysis
CN112508077A (en) Social media emotion analysis method and system based on multi-modal feature fusion
CN114973062A (en) Multi-modal emotion analysis method based on Transformer
Pandey et al. Attention gated tensor neural network architectures for speech emotion recognition
CN112732921B (en) False user comment detection method and system
CN111985612B (en) Encoder network model design method for improving video text description accuracy
CN102663432A (en) Kernel fuzzy c-means speech emotion identification method combined with secondary identification of support vector machine
CN110502757B (en) Natural language emotion analysis method
CN114443899A (en) Video classification method, device, equipment and medium
CN115545093A (en) Multi-mode data fusion method, system and storage medium
CN112541541B (en) Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion
Prasath Design of an integrated learning approach to assist real-time deaf application using voice recognition system
CN111160124A (en) Depth model customization method based on knowledge reorganization
CN109934304B (en) Blind domain image sample classification method based on out-of-limit hidden feature model
Świetlicka et al. Graph neural networks for natural language processing in human-robot interaction
Zheng et al. A two-channel speech emotion recognition model based on raw stacked waveform
CN112465054A (en) Multivariate time series data classification method based on FCN
Ghadirian et al. Hybrid adaptive modularized tri-factor non-negative matrix factorization for community detection in complex networks
Wan et al. Co-compressing and unifying deep cnn models for efficient human face and speaker recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant