CN112329604B - Multi-modal emotion analysis method based on multi-dimensional low-rank decomposition

Multi-modal emotion analysis method based on multi-dimensional low-rank decomposition

Info

Publication number
CN112329604B
Authority
CN
China
Prior art keywords
video
features
emotion analysis
modal
video data
Prior art date
Legal status
Active
Application number
CN202011209001.9A
Other languages
Chinese (zh)
Other versions
CN112329604A (en)
Inventor
金涛
李英明
张仲非
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202011209001.9A
Publication of CN112329604A
Application granted
Publication of CN112329604B
Active legal status (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques specially adapted for comparison or discrimination, for estimating an emotional state

Abstract

The invention discloses a multi-modal emotion analysis method based on multi-dimensional low-rank decomposition, which fuses high-dimensional multi-modal features into low-dimensional vectors and then uses these vectors for video emotion analysis. The method specifically comprises the following steps: acquiring a video data set for training a multi-modal emotion analysis model, wherein the video data set comprises a plurality of sample videos, and defining the algorithm target; extracting image features, audio features and text features from the video data set; establishing a multi-modal emotion analysis model based on a multi-dimensional low-rank decomposition mechanism from the extracted image, audio and text features; and performing emotion analysis on an input video with the multi-modal emotion analysis model. The method is suited to multi-modal emotion analysis of real video scenes and shows better accuracy and robustness under a variety of complex conditions.

Description

Multi-modal emotion analysis method based on multi-dimensional low-rank decomposition
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a multi-modal emotion analysis method based on multi-dimensional low-rank decomposition.
Background
In modern society, video has become an indispensable and nearly ubiquitous part of daily life. This environment has driven substantial research into the semantic content of video. Multi-modal emotion analysis is an important branch of video analysis, and it is especially relevant in the current era of rapidly growing short-video and live-streaming platforms. It judges, in real time, changes in a speaker's emotion from the speaker's expression, language and voice, and serves downstream applications.
Most existing tensor-based multi-modal emotion analysis methods mean-pool the features of the different modalities over time and then map the pooled features to a high-order tensor representation for subsequent tasks. Such processing discards the rich temporal information of the video.
Disclosure of Invention
To solve the above problem, the present invention provides a multi-modal emotion analysis method based on multi-dimensional low-rank decomposition, which predicts the emotion score of the speaker in a video. The method first extracts multi-modal features of the video, namely images, audio and text. It then fuses the multi-modal features along multiple dimensions using a tensor low-rank approximation, which reduces model complexity. The method makes full use of the modalities present in video data, overcomes the drawback that existing tensor methods ignore temporal information, and has good extensibility.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a multi-modal emotion analysis method based on multi-dimensional low-rank decomposition comprises the following steps:
s1, acquiring a video data set for training a multi-modal emotion analysis model, wherein the video data set comprises a plurality of sample videos and defines an algorithm target;
s2, extracting image features, audio features and text features in the video data set to obtain the image features, the audio features and the text features of the video data;
s3, establishing a multi-modal emotion analysis model based on the multi-dimensional low-rank decomposition mechanism based on the extracted image features, audio features and text features;
and S4, performing emotion analysis on the input video by using the multi-modal emotion analysis model.
Further, in step S1, the video data set includes videos X_train and manually labeled emotion scores Y_train.
The algorithm target is defined as: given a video x = {x_1, x_2, ..., x_L}, where x_i denotes the i-th video block, each video block contains a fixed number of video frames, and L denotes the total number of video blocks, predict the emotion score y of the video segment, y being a continuous-valued score.
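For illustration only, a minimal sketch of this blocking step is given below; the block size of 16 frames and the NumPy frame array are assumptions that the patent does not specify.

```python
# Illustrative sketch: split a decoded video into L blocks of a fixed number of frames.
# The value frames_per_block=16 is an assumed example, not a value from the patent.
import numpy as np

def split_into_blocks(frames: np.ndarray, frames_per_block: int = 16) -> list:
    """frames: array of shape (num_frames, H, W, 3); returns blocks x_1, ..., x_L."""
    num_blocks = len(frames) // frames_per_block  # any trailing partial block is dropped
    return [frames[i * frames_per_block:(i + 1) * frames_per_block]
            for i in range(num_blocks)]
```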
Further, step S2 specifically includes:
S21, taking all the images in each video block x_i, inputting them into a two-dimensional convolutional neural network to extract the image features of the video, and computing their mean vector, denoted s_i;
S22, extracting the text in each video block x_i, representing it with word vectors, and computing their mean vector, denoted t_i;
S23, extracting conventional MFCC audio features for each video block, denoted a_i;
S24, obtaining from the extraction results the image features S = {s_1, ..., s_L}, the text features T = {t_1, ..., t_L} and the audio features A = {a_1, ..., a_L} of all video blocks, as sketched in the example below.
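The following sketch illustrates the per-block extraction of s_i, t_i and a_i under stated assumptions: the patent does not name the CNN backbone, the word-embedding table or the MFCC configuration, so a caller-supplied 2-D CNN, a generic word-vector lookup and librosa with 40 MFCC coefficients are used here purely for illustration.

```python
# Minimal per-block feature-extraction sketch (assumptions: caller-supplied 2-D CNN,
# generic word-vector table, librosa MFCCs with 40 coefficients).
import numpy as np
import torch
import librosa

def image_feature(block_frames: torch.Tensor, cnn: torch.nn.Module) -> torch.Tensor:
    """block_frames: (num_frames, 3, H, W). Returns the mean CNN feature vector s_i."""
    with torch.no_grad():
        feats = cnn(block_frames)      # (num_frames, d_s)
    return feats.mean(dim=0)           # mean over the frames of the block

def text_feature(tokens: list, word_vectors: dict) -> np.ndarray:
    """Mean word vector of the words spoken in the block: t_i."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    if not vecs:                       # fallback when no token is in the vocabulary
        return np.zeros(next(iter(word_vectors.values())).shape)
    return np.mean(vecs, axis=0)

def audio_feature(waveform: np.ndarray, sample_rate: int, n_mfcc: int = 40) -> np.ndarray:
    """MFCC feature vector of the block's audio track: a_i (mean over time is an assumption)."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)  # (n_mfcc, T)
    return mfcc.mean(axis=1)
```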
Further, in step S3, the multi-modal emotion analysis model based on multi-dimensional low-rank decomposition is composed of a series of linear layers, dot-product layers and mean-pooling layers. Its video representation o is computed by a multi-dimensional low-rank fusion formula (given only as an image in the original publication), where V_m denotes the features of modality m, namely the image features S, the audio features A or the text features T, the weight tensors of the decomposition are training parameters, and R_1 and R_2 are manually set ranks of the tensors.
The emotion score of the speaker in the video is predicted from the video representation o as
p = W_o o + b_o,
where W_o and b_o ∈ R^1 are training parameters and p denotes the predicted emotion score.
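Because the fusion formula itself appears only as an image in the source, the module below is a hedged sketch of one plausible reading: each modality is projected through two factorised linear maps of ranks R_1 and R_2, the projections are combined by an element-wise (dot-product-style) product across modalities, the rank factors are summed, the L blocks are mean-pooled into o, and a final linear layer yields p = W_o o + b_o. The class name, layer shapes and combination order are assumptions, not the patent's exact equation.

```python
# Hedged sketch of a multi-dimensional low-rank fusion head (not the patent's exact formula).
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, dims: dict, d_o: int, r1: int, r2: int):
        super().__init__()
        # two factorised (rank-limited) linear maps per modality m in {image, audio, text}
        self.proj1 = nn.ModuleDict({m: nn.Linear(d, r1, bias=False) for m, d in dims.items()})
        self.proj2 = nn.ModuleDict({m: nn.Linear(r1, r2 * d_o, bias=False) for m, d in dims.items()})
        self.d_o, self.r2 = d_o, r2
        self.head = nn.Linear(d_o, 1)   # p = W_o o + b_o

    def forward(self, feats: dict) -> torch.Tensor:
        """feats[m]: tensor of shape (L, d_m) holding the per-block features of modality m."""
        fused = None
        for m, x in feats.items():
            z = self.proj2[m](self.proj1[m](x))        # (L, r2 * d_o), rank-limited projection
            z = z.view(x.size(0), self.r2, self.d_o)   # (L, r2, d_o)
            fused = z if fused is None else fused * z  # element-wise product across modalities
        o = fused.sum(dim=1).mean(dim=0)               # sum rank factors, mean-pool over blocks
        return self.head(o)                            # predicted emotion score p
```

In the patent's notation, dims would map the image, audio and text modalities to the dimensions of s_i, a_i and t_i, and r1 and r2 would correspond to the manually chosen ranks R_1 and R_2.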
Further, the multi-modal emotion analysis model is trained with the L1 loss between the predicted value p and the label value y,
Loss = |y - p|_1,
where the entire model is trained under the loss function Loss using the Adam optimization algorithm and the back-propagation algorithm.
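A minimal training-loop sketch consistent with this description is shown below; the learning rate, epoch count and the form of the data iterator are illustrative assumptions.

```python
# Minimal sketch of training the fusion model with the L1 loss, Adam and back-propagation.
import torch

def train(model: torch.nn.Module, data, epochs: int = 20, lr: float = 1e-3) -> None:
    """data yields (feats, y) pairs, where feats is the per-modality feature dict
    and y is the manually labeled emotion score as a scalar tensor."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    l1 = torch.nn.L1Loss()
    for _ in range(epochs):
        for feats, y in data:
            p = model(feats).squeeze()   # predicted emotion score
            loss = l1(p, y)              # Loss = |y - p|_1
            optimizer.zero_grad()
            loss.backward()              # back-propagation
            optimizer.step()
```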
Compared with existing multi-modal emotion analysis methods, the multi-modal emotion analysis method based on multi-dimensional low-rank decomposition has the following beneficial effects:
First, temporal features are introduced into the tensor fusion method, remedying a major shortcoming of existing methods.
Second, the invention is the first to propose tensor fusion over multiple dimensions, and derives a low-rank decomposition approximation over those dimensions, which improves model performance without sacrificing efficiency.
The method has good application value in short-video and live-streaming systems and can effectively improve the accuracy of multi-modal emotion analysis.
Drawings
FIG. 1 is a flow diagram of a multi-modal sentiment analysis method based on multi-dimensional low-rank decomposition according to the present invention.
FIG. 2 is a framework diagram of the multi-modal sentiment analysis model based on multi-dimensional low-rank decomposition according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
On the contrary, the invention is intended to cover alternatives, modifications and equivalents that may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
Referring to fig. 1, in a preferred embodiment of the present invention, a multi-modal sentiment analysis method based on multi-dimensional low rank decomposition includes the following steps:
first, a video data set for training a multimodal emotion analysis model is acquired. Wherein the video data set used for training the emotion analysis model comprises a video X train Artificially labeled video description sentence Y train
The algorithm targets are defined as: given video x ═ x 1 ,x 2 ,...,x L },x i Representing the ith block, each video block containing a fixed number of video frames, L representing the total number of video blocks, and the sentiment score y for the predicted video segment, y being a continuous value score.
Second, the multi-modal features in the video data set are extracted. Specifically:
First, all the images in each video block x_i are input into a two-dimensional convolutional neural network to extract the image features of the video, and their mean vector is computed and denoted s_i.
Second, the text in each video block x_i is extracted, represented with word vectors, and its mean vector is computed and denoted t_i.
Third, conventional MFCC audio features are extracted for each video block and denoted a_i.
Fourth, the image features S = {s_1, ..., s_L}, the text features T = {t_1, ..., t_L} and the audio features A = {a_1, ..., a_L} of all video blocks are obtained from the extraction results.
Then, a multi-modal emotion analysis model based on the multi-dimensional low-rank decomposition mechanism is established from the extracted image features, audio features and text features.
The multi-modal emotion analysis model based on multi-dimensional low-rank decomposition is composed of a series of linear layers, dot-product layers and mean-pooling layers. Its video representation o is computed by a multi-dimensional low-rank fusion formula (given only as an image in the original publication), where V_m denotes the features of modality m, namely the image features S, the audio features A or the text features T, the weight tensors of the decomposition are training parameters, and R_1 and R_2 are manually set ranks of the tensors.
The emotion score of the speaker in the video is predicted from the video representation o as
p = W_o o + b_o,
where W_o and b_o ∈ R^1 are training parameters and p denotes the predicted emotion score.
Further, the multi-modal emotion analysis model is trained with the L1 loss between the predicted value p and the label value y,
Loss = |y - p|_1,
where the entire model is trained under the loss function Loss using the Adam optimization algorithm and the back-propagation algorithm.
In the above embodiment, the multi-modal emotion analysis method of the present invention uses the image features, audio features and text features of the video, and on this basis establishes a multi-dimensional low-rank decomposition mechanism. Finally, emotion analysis is performed on unlabeled videos using the trained model.
Through the above technical scheme, the embodiment of the present invention develops, based on deep learning, a multi-modal emotion analysis algorithm applied to unprocessed video. Temporal information is introduced into the existing tensor fusion approach, tensor fusion is applied over multiple dimensions, and low-rank decomposition improves model efficiency, making emotion analysis more accurate and faster.
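Assuming the illustrative helpers sketched earlier in this description (split_into_blocks, the feature extractors and the LowRankFusion head), inference on a new, unlabeled video could look roughly as follows; every name and argument here is hypothetical.

```python
# Hypothetical end-to-end inference using the sketches above (all names are illustrative).
import numpy as np
import torch

def predict_emotion(frames, transcripts, waveforms, sample_rate, cnn, word_vectors, model):
    """frames/transcripts/waveforms: per-block image tensors, token lists and audio of one video."""
    feats = {
        "image": torch.stack([image_feature(f, cnn) for f in frames]),
        "text": torch.tensor(np.stack([text_feature(t, word_vectors) for t in transcripts]),
                             dtype=torch.float32),
        "audio": torch.tensor(np.stack([audio_feature(w, sample_rate) for w in waveforms]),
                              dtype=torch.float32),
    }
    return model(feats).item()   # continuous-valued emotion score p
```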
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (2)

1. A multi-modal emotion analysis method based on multi-dimensional low-rank decomposition, characterized by comprising the following steps:
S1, acquiring a video data set for training the multi-modal emotion analysis model, wherein the video data set comprises a plurality of sample videos, and the algorithm target is defined as: given a video x = {x_1, x_2, ..., x_L}, where x_i denotes the i-th video block, each video block contains a fixed number of video frames, and L denotes the total number of video blocks, predict the emotion score y of the video segment, y being a continuous-valued score;
S2, extracting image features, audio features and text features from the video data set to obtain the image features, audio features and text features of the video data, specifically comprising the following steps:
S21, taking all the images in each video block x_i, inputting them into a two-dimensional convolutional neural network to extract the image features of the video, and computing their mean vector, denoted s_i;
S22, extracting the text in each video block x_i, representing it with word vectors, and computing their mean vector, denoted t_i;
S23, extracting conventional MFCC audio features for each video block, denoted a_i;
S24, obtaining from the extraction results the image features S = {s_1, ..., s_L}, the text features T = {t_1, ..., t_L} and the audio features A = {a_1, ..., a_L} of all video blocks;
S3, establishing the multi-modal emotion analysis model based on the multi-dimensional low-rank decomposition mechanism from the extracted image features, audio features and text features, wherein the multi-modal emotion analysis model consists of a series of linear layers, dot-product layers and mean-pooling layers, and its video representation o is computed by a multi-dimensional low-rank fusion formula (given only as an image in the original publication), where V_m denotes the features of modality m, namely the image features S, the audio features A or the text features T, the weight tensors of the decomposition are training parameters, and R_1 and R_2 are manually set ranks of the tensors; the emotion score of the speaker in the video is predicted from the video representation o as
p = W_o o + b_o,
where W_o and b_o ∈ R^1 are training parameters and p denotes the predicted emotion score;
the multi-modal emotion analysis model is trained with the L1 loss between the predicted value p and the label value y:
Loss = |y - p|_1,
wherein the entire model is trained under the loss function Loss using the Adam optimization algorithm and the back-propagation algorithm;
S4, performing emotion analysis on an input video using the multi-modal emotion analysis model.
2. The multi-modal emotion analysis method based on multi-dimensional low-rank decomposition of claim 1, wherein in step S1, the video data set comprises videos X_train and manually labeled emotion scores Y_train.
CN202011209001.9A 2020-11-03 2020-11-03 Multi-modal emotion analysis method based on multi-dimensional low-rank decomposition Active CN112329604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011209001.9A CN112329604B (en) 2020-11-03 2020-11-03 Multi-modal emotion analysis method based on multi-dimensional low-rank decomposition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011209001.9A CN112329604B (en) 2020-11-03 2020-11-03 Multi-modal emotion analysis method based on multi-dimensional low-rank decomposition

Publications (2)

Publication Number Publication Date
CN112329604A CN112329604A (en) 2021-02-05
CN112329604B (en) 2022-09-20

Family

ID=74322845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011209001.9A Active CN112329604B (en) 2020-11-03 2020-11-03 Multi-modal emotion analysis method based on multi-dimensional low-rank decomposition

Country Status (1)

Country Link
CN (1) CN112329604B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20230129094A (en) * 2022-02-28 2023-09-06 에스케이텔레콤 주식회사 Method and Apparatus for Emotion Recognition in Real-Time Based on Multimodal
CN117688936B (en) * 2024-02-04 2024-04-19 江西农业大学 Low-rank multi-mode fusion emotion analysis method for graphic fusion

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310229B (en) * 2013-06-15 2016-09-07 浙江大学 A kind of multitask machine learning method for image classification and device thereof
CN104299216B (en) * 2014-10-22 2017-09-01 北京航空航天大学 Multimode medical image fusion method with low rank analysis is decomposed based on multiple dimensioned anisotropy
CN106056082B (en) * 2016-05-31 2019-03-08 杭州电子科技大学 A kind of video actions recognition methods based on sparse low-rank coding
CN107292858B (en) * 2017-05-22 2020-07-10 昆明理工大学 Multi-modal medical image fusion method based on low-rank decomposition and sparse representation
CN108197629B (en) * 2017-12-30 2021-12-31 北京工业大学 Multi-modal medical image feature extraction method based on label correlation constraint tensor decomposition
CN109934135B (en) * 2019-02-28 2020-04-28 北京航空航天大学 Rail foreign matter detection method based on low-rank matrix decomposition
CN110188770A (en) * 2019-05-17 2019-08-30 重庆邮电大学 A kind of non-convex low-rank well-marked target detection method decomposed based on structure matrix
CN110222213B (en) * 2019-05-28 2021-07-16 天津大学 Image classification method based on heterogeneous tensor decomposition
CN111178389B (en) * 2019-12-06 2022-02-11 杭州电子科技大学 Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling

Also Published As

Publication number Publication date
CN112329604A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
Hsu et al. Progressive domain adaptation for object detection
CN110751208B (en) Criminal emotion recognition method for multi-mode feature fusion based on self-weight differential encoder
CN112465008B (en) Voice and visual relevance enhancement method based on self-supervision course learning
CN111931795B (en) Multi-modal emotion recognition method and system based on subspace sparse feature fusion
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN112329604B (en) Multi-modal emotion analysis method based on multi-dimensional low-rank decomposition
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN111429885A (en) Method for mapping audio clip to human face-mouth type key point
CN113065344A (en) Cross-corpus emotion recognition method based on transfer learning and attention mechanism
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN117076693A (en) Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus
CN114896434A (en) Hash code generation method and device based on center similarity learning
Wang et al. Cross-modal knowledge distillation method for automatic cued speech recognition
CN116703857A (en) Video action quality evaluation method based on time-space domain sensing
CN113053361B (en) Speech recognition method, model training method, device, equipment and medium
CN114281948A (en) Summary determination method and related equipment thereof
Palaskar et al. Multimodal Speech Summarization Through Semantic Concept Learning.
Lee et al. Robust sound-guided image manipulation
CN110674265A (en) Unstructured information oriented feature discrimination and information recommendation system
CN112949284A (en) Text semantic similarity prediction method based on Transformer model
CN116167015A (en) Dimension emotion analysis method based on joint cross attention mechanism
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN115019137A (en) Method and device for predicting multi-scale double-flow attention video language event
CN114842301A (en) Semi-supervised training method of image annotation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant