Disclosure of Invention
In order to solve the above problems, the present invention provides a multi-modal emotion analysis method based on multi-dimensional low-rank decomposition, which is used for predicting the emotion score of a speaker in a video. The method first extracts the multi-modal features of the video, including image, audio and text features. Feature fusion is then carried out over multiple dimensions of the multi-modal features by using a tensor low-rank approximation method, so that the complexity of the model is reduced. The method makes full use of the various modalities in the video data, overcomes the defect that existing tensor methods ignore temporal information, and has better scalability.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a multi-modal emotion analysis method based on multi-dimensional low-rank decomposition comprises the following steps:
S1, acquiring a video data set for training a multi-modal emotion analysis model, wherein the video data set comprises a plurality of sample videos, and defining the algorithm target;
S2, extracting image features, audio features and text features from the video data set to obtain the image features, audio features and text features of the video data;
S3, establishing a multi-modal emotion analysis model based on a multi-dimensional low-rank decomposition mechanism from the extracted image features, audio features and text features;
and S4, performing emotion analysis on the input video by using the multi-modal emotion analysis model.
Further, in step S1, the video data set includes videos X_train and manually labeled emotion scores Y_train;
The algorithm target is defined as: given a video x = {x_1, x_2, ..., x_L}, where x_i represents the i-th video block, each video block contains a fixed number of video frames, L represents the total number of video blocks, the emotion score y of the video segment is to be predicted, where y is a continuous-valued score.
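Purely as an illustration of this block structure (the invention does not fix a block length or any particular implementation), the following sketch splits a decoded frame sequence into fixed-length video blocks x_1, ..., x_L; the function name and block size are hypothetical.

```python
# Hypothetical sketch: split a decoded frame sequence into video blocks
# x_1, ..., x_L, each containing a fixed number of frames.
def split_into_blocks(frames, frames_per_block=16):
    """frames: list of decoded video frames; returns the list of blocks."""
    blocks = [frames[i:i + frames_per_block]
              for i in range(0, len(frames), frames_per_block)]
    # keep only complete blocks so that every block has the same length
    return [b for b in blocks if len(b) == frames_per_block]
```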
Further, step S2 specifically includes:
S21, all images in each video block x_i are input into a two-dimensional convolutional neural network to extract the image features of the video, and the mean vector is calculated and recorded as the image feature of the block;
S22, the text in each video block x_i is represented by word vectors, and the mean vector is calculated and recorded as the text feature of the block;
S23, conventional MFCC audio features are extracted for each video block and recorded as the audio feature of the block;
S24, based on the above extraction results, the image features S, text features T and audio features A of all video blocks are obtained (see the illustrative sketch below).
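The following is a minimal sketch of step S2, assuming per-block inputs that have already been decoded and tokenized; the helper names, the callables cnn and word_emb, and the use of precomputed MFCC matrices are illustrative assumptions rather than part of the invention.

```python
# Illustrative sketch of steps S21-S24 (assumed helpers, not the patented
# implementation): mean-pooled CNN image features, mean word vectors and
# mean MFCC features per video block.
import torch

def block_features(frames, token_ids, mfcc, cnn, word_emb):
    """frames: (num_frames, 3, H, W) tensor; token_ids: (num_words,) long tensor;
    mfcc: (num_audio_frames, n_mfcc) tensor of precomputed MFCC coefficients."""
    with torch.no_grad():
        s_i = cnn(frames).mean(dim=0)        # S21: mean image feature of the block
    t_i = word_emb(token_ids).mean(dim=0)    # S22: mean word-vector feature
    a_i = mfcc.mean(dim=0)                   # S23: mean MFCC audio feature
    return s_i, t_i, a_i

# S24: stacking the per-block vectors over all L blocks yields the image
# features S, text features T and audio features A used in step S3.
```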
Further, in step S3, the multi-modal emotion analysis model based on multi-dimensional low-rank decomposition is composed of a series of linear layers, dot-product layers and mean pooling layers, and its video representation o is calculated by fusing the modal features, wherein V_m denotes a modal feature, with the modality m being the image features S, the audio features A or the text features T, the low-rank decomposition factors are training parameters, and R_1 and R_2 are the manually set ranks of the corresponding tensors;
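Because the fusion formula itself is not reproduced above, the following is only a minimal sketch of one possible two-stage low-rank tensor fusion built from the ingredients named in step S3 (linear layers, dot products, mean pooling, and manually set ranks R_1 and R_2); the class name and the exact factorisation are assumptions, not the patented formula.

```python
import torch
import torch.nn as nn

class MultiDimLowRankFusion(nn.Module):
    """Sketch of a two-stage low-rank fusion: rank R1 factors over the feature
    dimension, element-wise (dot) products across modalities, rank R2 factors
    over the fused representation, and mean pooling over the L video blocks."""
    def __init__(self, dims, d_hidden, d_out, r1, r2):
        super().__init__()
        self.r1, self.r2 = r1, r2
        self.d_hidden, self.d_out = d_hidden, d_out
        # one bank of R1 linear factors per modality (image S, audio A, text T)
        self.factors = nn.ModuleList(
            [nn.Linear(d_m, r1 * d_hidden) for d_m in dims])
        # R2 linear factors applied to the fused per-block representation
        self.mix = nn.Linear(d_hidden, r2 * d_out)

    def forward(self, feats):
        # feats: list of (batch, L, d_m) tensors in the order [S, A, T]
        fused = None
        for f, proj in zip(feats, self.factors):
            z = proj(f).view(f.size(0), f.size(1), self.r1, self.d_hidden)
            fused = z if fused is None else fused * z    # dot-product layer
        h = fused.sum(dim=2)                             # sum over the R1 factors
        k = self.mix(h).view(h.size(0), h.size(1), self.r2, self.d_out)
        o = k.mean(dim=1).sum(dim=1)                     # mean pool over L, sum over R2
        return o                                         # video representation o
```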
The emotion score of the speaker in the video is predicted based on the video representation o:
p = W_o o + b_o
wherein W_o and b_o ∈ R^1 are training parameters, and p represents the predicted emotion score.
Further, the multi-modal emotion analysis model is trained with an L1 loss function between the predicted value p and the label value y,
Loss = |y - p|_1
where the entire model is trained under the loss function Loss using the Adam optimization algorithm and the back-propagation algorithm.
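A minimal training sketch follows, assuming the fusion module sketched above; the regressor class, the scalar linear head standing in for W_o and b_o, and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmotionRegressor(nn.Module):
    """Video representation o -> predicted emotion score p = W_o o + b_o."""
    def __init__(self, fusion, d_out):
        super().__init__()
        self.fusion = fusion                    # e.g. the MultiDimLowRankFusion sketch
        self.head = nn.Linear(d_out, 1)         # stands in for W_o and b_o

    def forward(self, feats):
        o = self.fusion(feats)                  # video representation o
        return self.head(o).squeeze(-1)         # predicted emotion score p

def train_step(model, optimizer, feats, y):
    """One Adam / back-propagation update under the L1 loss Loss = |y - p|_1."""
    p = model(feats)
    loss = torch.abs(y - p).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical wiring (illustrative hyper-parameter):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```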
Compared with existing multi-modal emotion analysis methods, the multi-modal emotion analysis method based on multi-dimensional low-rank decomposition has the following beneficial effects:
Firstly, temporal features are introduced into the tensor fusion method, which remedies a major deficiency of existing methods.
Secondly, the invention is the first to propose tensor fusion over multiple dimensions, and derives a method for low-rank decomposition approximation over these dimensions, so that the performance of the model is improved without loss of efficiency.
The method has good application value in short video and live broadcast systems, and can effectively improve the accuracy of multi-modal emotion analysis.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
On the contrary, the invention is intended to cover alternatives, modifications and equivalents which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
Referring to fig. 1, in a preferred embodiment of the present invention, a multi-modal sentiment analysis method based on multi-dimensional low rank decomposition includes the following steps:
First, a video data set for training the multi-modal emotion analysis model is acquired, wherein the video data set used for training the emotion analysis model comprises videos X_train and manually labeled emotion scores Y_train;
The algorithm target is defined as: given a video x = {x_1, x_2, ..., x_L}, where x_i represents the i-th video block, each video block contains a fixed number of video frames, L represents the total number of video blocks, the emotion score y of the video segment is to be predicted, where y is a continuous-valued score.
Second, multi-modal features in the video data set are extracted. Specifically, the method comprises the following steps:
First, all images in each video block x_i are input into a two-dimensional convolutional neural network to extract the image features of the video, and the mean vector is calculated and recorded as the image feature of the block;
Second, the text in each video block x_i is represented by word vectors, and the mean vector is calculated and recorded as the text feature of the block;
Thirdly, conventional MFCC audio features are extracted for each video block and recorded as the audio feature of the block;
Fourthly, based on the above extraction results, the image features S, text features T and audio features A of all video blocks are obtained.
Then, a multi-modal emotion analysis model based on a multi-dimensional low-rank decomposition mechanism is established from the extracted image features, audio features and text features.
The multi-modal emotion analysis model based on multi-dimensional low-rank decomposition is composed of a series of linear layers, dot-product layers and mean pooling layers, and its video representation o is calculated by fusing the modal features, wherein V_m denotes a modal feature, with the modality m being the image features S, the audio features A or the text features T, the low-rank decomposition factors are training parameters, and R_1 and R_2 are the manually set ranks of the corresponding tensors;
The emotion score of the speaker in the video is predicted based on the above video representation o:
p = W_o o + b_o
wherein W_o and b_o ∈ R^1 are training parameters, and p represents the predicted emotion score;
Further, the training of the multi-modal emotion analysis model uses an L1 loss function between the predicted value p and the label value y,
Loss = |y - p|_1
where the entire model is trained under the loss function Loss using the Adam optimization algorithm and the back-propagation algorithm.
In the above embodiments, the multi-modal emotion analysis method of the present invention uses the image features, audio features and text features in the video. On this basis, a multi-dimensional low-rank decomposition mechanism is established. Finally, emotion analysis is performed on unlabeled videos using the trained model.
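For illustration only, the following sketch wires the hypothetical helpers from the earlier sketches (split_into_blocks, block_features and EmotionRegressor) into an inference call on an unlabeled video; none of these names are prescribed by the invention.

```python
import torch

def predict_emotion(model, blocks, texts, mfccs, cnn, word_emb):
    """blocks/texts/mfccs: per-block frame, token-id and MFCC tensors for one video."""
    feats = [block_features(f, t, m, cnn, word_emb)
             for f, t, m in zip(blocks, texts, mfccs)]
    S = torch.stack([s for s, _, _ in feats]).unsqueeze(0)   # (1, L, d_s)
    T = torch.stack([t for _, t, _ in feats]).unsqueeze(0)   # (1, L, d_t)
    A = torch.stack([a for _, _, a in feats]).unsqueeze(0)   # (1, L, d_a)
    model.eval()
    with torch.no_grad():
        return model([S, A, T]).item()                       # predicted emotion score p
```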
Through the above technical scheme, the embodiment of the invention develops a multi-modal emotion analysis algorithm for unprocessed video based on deep learning technology. Temporal information is introduced into the existing tensor fusion method, the tensor fusion method is applied in multiple dimensions, and low-rank decomposition improves the efficiency of the model, so that the emotion analysis is more accurate and faster.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.