CN114882412B - Labeling-associated short video emotion recognition method and system based on vision and language - Google Patents

Labeling-associated short video emotion recognition method and system based on vision and language Download PDF

Info

Publication number
CN114882412B
CN114882412B CN202210511572.0A
Authority
CN
China
Prior art keywords
emotion
short video
visual
network
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210511572.0A
Other languages
Chinese (zh)
Other versions
CN114882412A (en)
Inventor
刘天亮
肖允鸿
戴修斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210511572.0A priority Critical patent/CN114882412B/en
Publication of CN114882412A publication Critical patent/CN114882412A/en
Application granted granted Critical
Publication of CN114882412B publication Critical patent/CN114882412B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a vision- and language-based annotation-associated short video emotion recognition method and system. Firstly, low-level visual features of the video stream are extracted separately in the space and time dimensions, input into a multi-head self-attention network, and fused at the feature layer into high-level emotional features of the visual information, from which the emotion score matrix of the short video's visual modality is calculated; then a word vector tool converts the text content of the short video into word vectors, and an emotion dictionary is used to enhance the emotion polarity of the word vectors; the high-level semantic features contained in the language information are extracted, and the emotion score matrix of the short video's text modality is calculated; finally, the emotion score matrix is multiplied by a weighting coefficient matrix to obtain the emotion classification result of the short video. The invention effectively integrates the emotion information of short video vision and language, takes into account both the spatio-temporal variation of the video stream and the contextual semantic relations of the text content, breaks through the limitation of single-modality emotion classification, and improves the accuracy of short video emotion classification.

Description

Labeling-associated short video emotion recognition method and system based on vision and language
Technical Field
The invention relates to a method and a system for identifying annotation-associated short video emotion based on vision and language, and belongs to the technical field of computer vision video emotion identification.
Background
Nowadays, a variety of short video apps have emerged and are favored by many users, who record their lives, capture their current moods and express their feelings by shooting short videos. Compared with blogs, forums and other media recorded in the form of text and pictures, short videos express the psychological feelings of their creators more vividly and highlight the users' emotional tension. With the rapid development of artificial intelligence, computer vision, natural language processing and related fields, the great improvement of computer performance and the rapid growth of graphics processor computing power, multi-modal emotion analysis of social media short videos has become a hot research problem.
Emotion recognition methods based on hand-crafted features have been developed for decades and achieved certain results, but they require professional prior knowledge and a complex parameter-tuning process, and their generalization ability and robustness are poor. With the rapid development of machine learning and deep learning, multi-head self-attention neural networks have achieved very good performance in many sub-directions of the natural language processing field. In recent years, with the rapid growth of computing power, the multi-head self-attention neural network is no longer limited to natural language processing tasks and has begun to show its strength in the field of computer vision. The literature [Dosovitskiy.A, Beyer.L, Kolesnikov.A, Weissenborn.D, and Houlsby.N, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." 2020.] applies a multi-head self-attention neural network to the image classification task: an image is divided into image blocks, which are converted into one-dimensional vectors and input into the multi-head self-attention network to extract visual features for predicting the image class label; its classification performance on various large-scale image datasets is slightly superior to the most advanced residual convolutional networks in computer vision. The literature [Liu.Z, Lin.Y, Cao.Y, Hu.H, Wei.Y, and Zhang.Z, "Swin transformer: hierarchical vision transformer using shifted windows." 2021] proposes a multi-layer multi-head self-attention connection structure in which the data size is gradually reduced in each layer of the network; this design helps enlarge the receptive field of individual pixels, effectively combines local and global features, and achieves classification performance far superior to residual convolutional networks.
The Chinese patent application with application No. CN201810686822.8 and publication No. CN109145712B, a GIF short video emotion recognition method and system fusing text information, extracts the visual features of GIF short video images using a 3D convolutional neural network and a convolutional neural network, combines them with the emotion score value of the text information, and finally obtains an emotion classification result. Because the method only uses convolutional neural networks to process the image information, it cannot effectively extract the high-level emotional visual features of a short video sequence.
The Chinese patent application (application No. CN201910763722.5, publication No. CN110532911B) proposes a small-sample emotion recognition method that partitions the dataset and improves short video emotion recognition accuracy by measuring the emotion similarity between samples of a query set and a support set. The method is suited to cases with few samples, but its generalization ability is insufficient and it is prone to over-fitting.
Although multi-head self-attention neural networks have achieved excellent performance in both natural language processing and computer vision, many challenges remain when they are used to solve the short video emotion recognition problem. Firstly, the video stream of a short video differs from traditional image data in that it consists of multiple frames with temporal dependencies, so designing a suitable network structure to extract visual features becomes important. Secondly, the text content of short videos is rich and diverse in form, so a matching dictionary must be designed to enhance the emotion polarity of word vectors and effectively strengthen the emotion information of the text modality. Thirdly, multimedia short videos generally contain data of several modalities, and how to effectively fuse the emotion information of different modalities while suppressing inter-modal noise interference is a key factor in improving short video emotion recognition accuracy.
Disclosure of Invention
The technical problems to be solved by the invention are as follows: aiming at the problems existing in the prior art, the invention provides a vision- and language-based annotation-associated short video emotion recognition method and system, which compute the high-level visual emotion features of the short video stream and the high-level semantic emotion features of its text content based on a multi-head self-attention mechanism, and fuse the visual emotion scores and language emotion scores calculated from these high-level emotion features at the decision layer, thereby realizing emotion recognition and classification of short videos.
In order to solve the technical problems, the invention adopts the following technical scheme:
a labeling associated short video emotion recognition method based on vision and language comprises the following steps:
(1) Preprocessing a video stream of a short video sample, dividing the video stream into a plurality of image frames and a plurality of image blocks, and adjusting the resolutions of the image frames and the image blocks;
(2) Extracting visual characteristic information of an image frame by using a CNN network to form a time-dimensional low-level characteristic sequence of the short video, and extracting visual characteristic information of an image block stream by using a C3D network to form a space-dimensional low-level characteristic sequence of the short video;
(3) Respectively inputting the two feature sequences obtained in the step (2) into a multi-head self-attention neural network, connecting the calculated two attention feature vectors in series to obtain a high-level emotion visual feature vector, and calculating emotion scores of visual modes by using a Softmax classifier after inputting the high-level emotion visual feature vector into a full-connection layer network;
(4) The CNN network and C3D network of step (2), together with the multi-head self-attention neural network and full-connection layer network of step (3), form the visual-feature emotion recognition network of the short video; a loss function value is calculated from the visual emotion scores of step (3), and the network parameters are iteratively optimized with a gradient descent method to obtain a trained visual network model;
(5) Processing text content of the marked associated short video into word vectors by using a word vector tool, and enhancing emotion polarities of the word vectors based on an emotion dictionary to obtain word vector sequences with enhanced emotion polarities;
(6) Inputting the word vector sequence obtained in the step (5) into a multi-head self-attention neural network to extract high-level emotion semantic features, inputting the high-level emotion semantic features into a full-connection layer network, decoding the semantic features by using a Softmax classifier, and calculating emotion scores of text modes;
(7) The multi-head self-attention neural network and full-connection layer network of step (6) form the language-feature emotion recognition network of the short video; a loss function value is calculated from the text emotion scores of step (6), and the network parameters are iteratively optimized with a gradient descent method to obtain a trained language network model;
(8) Respectively calculating the visual and language emotion scores of the short video sample using the network models trained in step (4) and step (7), combining them into an emotion matrix, designing a weighting coefficient matrix according to the weight ratios of the different modalities and emotion categories, limiting the value range of each parameter of the weighting coefficient matrix using prior knowledge, setting a solving step and traversing the value space to search for the optimal solution, multiplying the emotion matrix by the optimal weighting coefficient matrix to obtain an emotion classification probability matrix, and judging the emotion category of the short video according to the values of the elements on the diagonal of the probability matrix.
As a further preferred embodiment of the present invention, the step (1) includes:
(1.1) dividing the short video sample video stream, and selecting F frames from the first frame at equal intervals;
(1.2) Adjusting the resolution of the image frames in step (1.1) to N×N;
(1.3) Cropping the N×N image of step (1.2) into M² regular image blocks, each image block having a resolution of (N/M)×(N/M);
(1.4) Forming an image block stream from the F successive image blocks at each cropping position in step (1.3); the M² cropping positions yield M² image block streams in total.
As a further preferable embodiment of the present invention, the step (2) includes:
(2.1) Inputting the N×N image frames of step (1.2) into a CNN network to extract the low-level visual features of the short video sample in the time dimension; the F frame images form a feature sequence of length F;
(2.2) Inputting the M² image block streams of step (1.4) into a C3D network to extract the low-level visual features of the short video sample in the space dimension; the M² image block streams form a feature sequence of length M².
As a further preferable embodiment of the present invention, the step (3) includes:
(3.1) embedding position information in the feature sequence in the step (2.1) and adding a category label, forming a feature sequence with the length of F+1, inputting the feature sequence into a multi-head self-attention network, and calculating to obtain attention feature vectors of the time feature sequence;
(3.2) embedding position information in the feature sequence in the step (2.2) and adding a category label, forming a feature sequence with the length of M²+1, inputting the feature sequence into a multi-head self-attention network, and calculating to obtain attention feature vectors of the space feature sequence;
(3.3) concatenating the attention feature vectors in the step (3.1) and the step (3.2) to obtain a high-level emotion visual feature vector of the short video;
(3.4) inputting the high-level emotion visual features of step (3.3) into a full-connection layer network, and calculating the visual emotion scores of the short video with a Softmax classifier:
$$\mathrm{Score}_j = \frac{e^{x_j}}{\sum_{i=1}^{K} e^{x_i}}, \quad j = 1, 2, \dots, K$$
wherein K is the number of emotion categories, Score_j is the emotion score of the j-th emotion category, x_i is the value of the i-th dimension of the classifier input vector x, and the Softmax classifier calculates each emotion category score by exponentially normalizing the input vector.
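For illustration, a minimal sketch of this Softmax scoring step is given below, assuming the full-connection layer has already produced a K-dimensional output vector; the function name and example values are illustrative and not taken from the patent.

```python
import numpy as np

def softmax_scores(x: np.ndarray) -> np.ndarray:
    """Exponentially normalize a K-dimensional vector into emotion category scores."""
    e = np.exp(x - x.max())      # subtract the max for numerical stability
    return e / e.sum()           # Score_j = exp(x_j) / sum_i exp(x_i)

# Example with K = 3 emotion categories
scores = softmax_scores(np.array([1.2, 0.3, -0.8]))
print(scores, scores.sum())      # the scores sum to 1
```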
As a further preferable embodiment of the present invention, the step (5) includes:
(5.1) converting text content of the annotation associated short video into a corresponding word vector sequence by using a word vector tool;
(5.2) multiplying the word vector obtained in step (5.1) by the natural exponent of the enhancement factor α to obtain the emotion-polarity-enhanced word vector:
$$\tilde{x} = x \cdot e^{\alpha}, \qquad \alpha = \frac{pos(x) + neg(x)}{2}$$
In the above formula, x is the original word vector, pos(x) and neg(x) are respectively the positive and negative emotion scores of the corresponding word in the emotion dictionary, and the enhancement factor α is obtained by averaging the two.
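As a minimal sketch of this enhancement step, assuming the word's positive and negative dictionary scores have already been looked up (the function name and example values are illustrative):

```python
import numpy as np

def enhance_word_vector(x: np.ndarray, pos: float, neg: float) -> np.ndarray:
    """Scale a word vector by exp(alpha), where alpha is the average of the
    word's positive and negative scores in the emotion dictionary."""
    alpha = (pos + neg) / 2.0
    return x * np.exp(alpha)

# Example: a toy 4-dimensional word vector with dictionary scores pos=0.625, neg=0.0
v = enhance_word_vector(np.array([0.1, -0.2, 0.05, 0.3]), pos=0.625, neg=0.0)
```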
As a further preferable embodiment of the present invention, the step (8) includes:
(8.1) The visual and language emotion scores calculated by the network models of step (4) and step (7) are combined to obtain the emotion matrix S of the annotation-associated short video:
$$S = \begin{bmatrix} s^{v}_{1} & s^{t}_{1} \\ s^{v}_{2} & s^{t}_{2} \\ \vdots & \vdots \\ s^{v}_{K} & s^{t}_{K} \end{bmatrix}$$
where s^{v}_{i} is the emotion score of the i-th emotion category in the visual modality and s^{t}_{i} is the emotion score of the i-th emotion category in the text modality;
(8.2) A weighting coefficient matrix W is designed according to the weight ratios of the different emotion categories of the visual and text modalities:
$$W = \begin{bmatrix} w^{v}_{1} & w^{v}_{2} & \cdots & w^{v}_{K} \\ w^{t}_{1} & w^{t}_{2} & \cdots & w^{t}_{K} \end{bmatrix}$$
where w^{v}_{i} and w^{t}_{i} are the weight ratios of the i-th emotion category in the visual and text modalities, respectively;
(8.3) The value range of each parameter in the weighting coefficient matrix is limited using prior knowledge, and the value space is traversed with a fixed step to search for the optimal weighting coefficient matrix;
(8.4) The optimal weighting coefficient matrix found in step (8.3) is multiplied by the emotion matrix of step (8.1) to obtain the emotion classification probability matrix P:
$$P = S\,W, \qquad P_{i} = [S\,W]_{ii} = w^{v}_{i}\, s^{v}_{i} + w^{t}_{i}\, s^{t}_{i}$$
where P_i is the probability that the short video is identified as emotion category i;
and (8.5) comparing the numerical values of the elements on the diagonal line of the probability matrix P in the step (8.4), and identifying the subscript of the maximum element value as the emotion type of the short video.
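As an illustration of the decision-layer fusion in step (8), the sketch below assumes the emotion matrix S holds one row per emotion category and one column per modality, W is its 2×K counterpart, and the diagonal of P = S·W gives the per-category probabilities; the weight values shown are placeholders rather than values taken from the patent.

```python
import numpy as np

def fuse_and_classify(vis_scores, txt_scores, vis_weights, txt_weights):
    """Decision-level fusion: stack per-modality emotion scores into S (K x 2)
    and weights into W (2 x K), take the diagonal of P = S @ W, and return the
    index of the largest diagonal element as the predicted emotion category."""
    S = np.stack([vis_scores, txt_scores], axis=1)    # K x 2
    W = np.stack([vis_weights, txt_weights], axis=0)  # 2 x K
    P = S @ W                                         # K x K
    probs = np.diag(P)                                # P_i = w_i^v * s_i^v + w_i^t * s_i^t
    return int(np.argmax(probs)), probs

# Example with K = 3 (positive, neutral, negative); the weights are placeholders.
label, probs = fuse_and_classify(
    vis_scores=np.array([0.6, 0.3, 0.1]),
    txt_scores=np.array([0.5, 0.2, 0.3]),
    vis_weights=np.array([0.30, 0.25, 0.25]),
    txt_weights=np.array([0.10, 0.05, 0.05]),
)
```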
In another aspect, the present invention provides a short video emotion recognition system based on visual and linguistic annotation association, which includes:
the video stream segmentation module is used for segmenting a video stream of a short video into image frames, selecting a fixed number of image frames at equal intervals, adjusting the resolution of the selected image frames and cutting the selected image frames into regular image blocks;
The visual feature extraction module is used for extracting a time feature sequence and a space feature sequence of the selected image frames and image blocks by utilizing a CNN network and a C3D network, respectively inputting the two feature sequences into a multi-head self-attention neural network to calculate to obtain emotion visual features in time dimension and space dimension, and connecting the two feature sequences in series to obtain high-level visual emotion features of the short video;
The visual emotion score calculation module is used for inputting the high-level visual emotion features obtained by the operation of the visual feature extraction module into the full-connection layer network for decoding, and calculating emotion scores of the short video visual modes by using a Softmax classifier;
the word vector conversion module is used for converting text content of the short video into word vectors by using a word vector tool and carrying out polarity enhancement on the word vectors according to emotion scores of words in an emotion dictionary;
the semantic feature extraction module is used for inputting the emotion polarity enhancement word vector obtained by the word vector conversion module into a multi-head self-attention neural network, and calculating to obtain high-level semantic emotion features of the short video;
the text language emotion score calculation module is used for inputting the high-level semantic emotion features obtained by the operation of the semantic feature extraction module into the full-connection layer network for decoding, and calculating emotion scores of short video text modes by using a Softmax classifier;
visual and language network model training module: the system is used for training vision and language network models respectively, calculating a loss function value according to the emotion score calculated in the emotion score calculation module and optimizing parameters in the two network models by using a gradient descent method;
the emotion fusion recognition module is used for fusing the visual and language emotion recognition results of the short video, calculating visual and language emotion scores to form an emotion matrix by using a network model trained by the visual and language network model training module, designing a weighting coefficient matrix according to the weight proportion of different emotion types of visual and text modes, limiting the value range of each parameter in the weighting coefficient matrix by using priori knowledge, traversing a value space with a fixed step length to search for obtaining an optimal weighting coefficient matrix, multiplying the weighting coefficient matrix by the emotion matrix to obtain an emotion classification probability matrix of the short video, comparing the numerical values of each element on the diagonal of the probability matrix, and recognizing and judging the emotion type of the short video.
The invention further provides a short video emotion recognition system based on the annotation association of vision and language, which comprises at least one computer device, wherein the computer device comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, and the processor realizes the short video emotion recognition method based on the annotation association of vision and language when executing the computer program.
Compared with the prior art, the invention has the following beneficial effects:
(1) According to the invention, the CNN network and the C3D network are used for respectively extracting the low-level visual characteristics of the video stream of the short video in the time dimension and the space dimension, the emotion characteristic information is enhanced based on the multi-head attention mechanism and then fused, the emotion information changes of the video content in different time sequences and different space positions are comprehensively considered, and the method is suitable for more diversified short video data.
(2) According to the invention, the emotion polarity enhancement is carried out on the text word vector of the short video based on the emotion dictionary, so that emotion semantic information contained in the text language can be enhanced, and the emotion recognition accuracy is improved.
(3) According to the invention, the prior knowledge is utilized to establish the emotion recognition mathematical model of vision and language, and the classification recognition probability of each emotion type is calculated by designing the weighting coefficient matrix to solve the optimal parameters, so that emotion information among different modes can be effectively fused, and noise interference among the modes is reduced.
Drawings
FIG. 1 is a flow chart of the overall steps of the present invention.
Fig. 2 is a block diagram of a short video emotion recognition system of the present invention.
Detailed Description
The following describes the specific embodiments of the present invention in detail with reference to the accompanying drawings:
As shown in fig. 1, the embodiment of the invention discloses a short video emotion recognition method based on visual and linguistic annotation association, which specifically comprises the following steps:
step (1): the video stream of the short video sample is preprocessed and divided into a plurality of image frames and a plurality of image blocks, and the resolutions of the image frames and the image blocks are adjusted. The present embodiment uses a T-GIF dataset as a data source to segment video content therein and adjust image resolution, comprising the following sub-steps:
(1.1) dividing the short video sample video stream, and selecting 9 frames from the first frame at equal intervals;
(1.2) adjusting the resolution of the image frames in step (1.1) to 216 x 216;
(1.3) cropping the 216×216 image of step (1.2) into 9 regular image blocks, each with a resolution of 72×72;
(1.4) forming one image block stream from the 9 successive image blocks at each cropping position in step (1.3); the 9 cropping positions yield 9 image block streams in total.
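A possible sketch of this preprocessing for the concrete values above (9 frames, 216×216 resolution, a 3×3 grid of 72×72 blocks), assuming the video frames have already been decoded into H×W×3 arrays; OpenCV is used only for resizing, and all names are illustrative.

```python
import numpy as np
import cv2  # used only for resizing; frame decoding is assumed to happen elsewhere

def preprocess(frames, f=9, n=216, m=3):
    """Sample f frames at equal intervals starting from the first frame, resize
    them to n x n, cut each frame into m*m regular blocks of (n//m) x (n//m),
    and group the blocks at the same position across frames into block streams."""
    idx = np.linspace(0, len(frames) - 1, f).astype(int)    # equal-interval sampling
    sampled = [cv2.resize(frames[i], (n, n)) for i in idx]  # f frames of n x n
    b = n // m                                              # block size, e.g. 72
    streams = np.empty((m * m, f, b, b, 3), dtype=sampled[0].dtype)
    for t, img in enumerate(sampled):
        for r in range(m):
            for c in range(m):
                streams[r * m + c, t] = img[r * b:(r + 1) * b, c * b:(c + 1) * b]
    return np.stack(sampled), streams   # shapes (f, n, n, 3) and (m*m, f, b, b, 3)
```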
Step (2): and extracting visual characteristic information of the image frames by using a CNN network to form a time-dimensional low-level characteristic sequence of the short video, and extracting visual characteristic information of the image block streams by using a C3D network to form a space-dimensional low-level characteristic sequence of the short video. The method specifically comprises the following substeps:
Inputting 9 frames of 216×216 images in the step (1.2) into the same CNN network, extracting low-layer visual characteristics of which the length is 128 in the time dimension of a short video sample, wherein the 9 frames of images form a characteristic sequence of which the length is 9, the CNN network is composed of three convolution layers and three pooling layers which are alternately connected and added with a last full-connection layer, the convolution kernels are 3×3, the three convolution layers are respectively activated by using 2, 4 and 8 convolution kernels according to the sequence, each of the convolution kernels is activated by using Relu functions, the pooling kernel size is 2×2, the maximum pooling strategy is adopted, the first layer of the full-connection layer comprises 1024 nodes, the second layer comprises 128 nodes, and the convolution kernels are activated by using Relu functions;
and (2.2) inputting 9 72 multiplied by 9 image block streams in the step (1.4) into the same C3D network, extracting low-layer visual characteristics of which the length of a short video sample is 128 in the space dimension, wherein the 9 image block streams form a characteristic sequence of which the length is 9, the C3D network is formed by alternately connecting three convolution layers and three pooling layers and adding a last full-connection layer, the convolution kernels are 3 multiplied by 3, the three convolution layers are respectively activated by using 2, 4 and 8 convolution kernels according to the sequence, the sizes of the pooling kernels are 2 multiplied by 2, the maximum pooling strategy is adopted, the first layer of the full-connection layer comprises 1024 nodes, the second layer comprises 128 nodes, and the function of Relu is activated.
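The two backbones of steps (2.1) and (2.2) could be sketched in PyTorch as follows; the convolution padding, the depth of the 3D convolution and pooling kernels, and the use of LazyLinear to infer the flattened size are assumptions not fixed by the description.

```python
import torch.nn as nn

class FrameCNN(nn.Module):
    """2D CNN for one frame: three 3x3 conv layers with 2/4/8 kernels and ReLU,
    each followed by 2x2 max pooling, then fully connected layers of 1024 and
    128 nodes, giving a 128-dimensional low-level feature per frame."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 2, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(2, 4, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(1024), nn.ReLU(),
                                nn.Linear(1024, 128), nn.ReLU())

    def forward(self, x):                     # x: (B, 3, 216, 216)
        return self.fc(self.features(x))      # (B, 128)

class BlockStreamC3D(nn.Module):
    """3D CNN for one image-block stream: three conv layers with 2/4/8 kernels
    and ReLU, each followed by 3D max pooling, then the same 1024 -> 128 head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 2, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(2, 4, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(4, 8, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(1024), nn.ReLU(),
                                nn.Linear(1024, 128), nn.ReLU())

    def forward(self, x):                     # x: (B, 3, 9, 72, 72)
        return self.fc(self.features(x))      # (B, 128)
```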
Step (3): the two feature sequences obtained in step (2) are respectively input into multi-head self-attention neural networks, the two calculated attention feature vectors are concatenated to obtain the high-level emotional visual feature vector, and after this vector is input into a full-connection layer network, the emotion scores of the visual modality are calculated with a Softmax classifier. The method specifically comprises the following sub-steps:
(3.1) embedding position information in the feature sequence of step (2.1) and adding a category label to form a feature sequence of size 128×10, inputting it into the time-dimension multi-head self-attention network, and calculating the 1024-dimensional attention feature vector of the time feature sequence;
(3.2) embedding position information in the feature sequence of step (2.2) and adding a category label to form a feature sequence of size 128×10, inputting it into the space-dimension multi-head self-attention network, and calculating the 1024-dimensional attention feature vector of the space feature sequence;
(3.3) concatenating the attention feature vectors of step (3.1) and step (3.2) to obtain the 2048-dimensional high-level emotional visual feature vector of the short video. The multi-head self-attention networks in step (3.1) and step (3.2) have the same structure: each contains 8 attention heads, and each attention head consists of one attention layer and one feed-forward neural network layer, with regularization and residual operations applied to both layers in turn. The attention layer contains parameter matrices W_Q, W_K, W_V, each of size 128×128; the first layer of the feed-forward network has 1024 nodes activated by the ReLU function, and the second layer has 128 nodes and outputs a 128-dimensional feature vector. The outputs of the 8 attention heads are concatenated to obtain the 1024-dimensional attention feature vector;
(3.4) inputting the high-level emotion visual features of step (3.3) into a full-connection layer network whose first two layers each have 2048 nodes activated by the ReLU function and whose third layer has 3 nodes and outputs a feature vector of length 3; the visual emotion scores of the short video are calculated from the values of this feature vector with a Softmax classifier:
$$\mathrm{Score}_j = \frac{e^{x_j}}{\sum_{i=1}^{3} e^{x_i}}, \quad j = 1, 2, 3$$
wherein Score_1, Score_2 and Score_3 are the emotion scores of the positive, neutral and negative emotions respectively, x_i is the value of the i-th dimension of the classifier input vector x, and the Softmax classifier calculates each emotion category score by exponentially normalizing the input vector.
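One possible PyTorch reading of the visual branch described in step (3) is sketched below: a class token and position embedding, 8 attention heads per sequence whose 128-dimensional outputs are concatenated into a 1024-dimensional vector, concatenation of the time and space vectors, and a 2048-2048-3 classification head. The exact placement of the normalization and residual operations and the class-token readout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    """One head: single-head self-attention with 128x128 W_Q, W_K, W_V matrices
    and a 1024 -> 128 feed-forward layer, each sub-layer wrapped in a residual
    connection followed by layer normalization."""
    def __init__(self, d=128, d_ff=1024):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, x):                       # x: (B, 10, 128)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        att = torch.softmax(q @ k.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        x = self.norm1(x + att @ v)             # attention sub-layer + residual
        x = self.norm2(x + self.ff(x))          # feed-forward sub-layer + residual
        return x[:, 0]                          # 128-d output at the class-token position

class VisualBranch(nn.Module):
    """Visual emotion branch: per-sequence encoding with 8 heads (outputs
    concatenated to 1024-d), concatenation of the time and space vectors
    (2048-d), and a 2048-2048-3 fully connected head with Softmax scores."""
    def __init__(self, seq_len=9, d=128, heads=8, classes=3):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        self.pos = nn.Parameter(torch.zeros(1, seq_len + 1, d))
        self.time_heads = nn.ModuleList(AttentionHead(d) for _ in range(heads))
        self.space_heads = nn.ModuleList(AttentionHead(d) for _ in range(heads))
        self.head = nn.Sequential(nn.Linear(2 * heads * d, 2048), nn.ReLU(),
                                  nn.Linear(2048, 2048), nn.ReLU(),
                                  nn.Linear(2048, classes))

    def encode(self, feats, heads):             # feats: (B, 9, 128)
        x = torch.cat([self.cls.expand(feats.size(0), -1, -1), feats], dim=1) + self.pos
        return torch.cat([h(x) for h in heads], dim=-1)     # (B, 1024)

    def forward(self, time_feats, space_feats):
        fused = torch.cat([self.encode(time_feats, self.time_heads),
                           self.encode(space_feats, self.space_heads)], dim=-1)
        return F.softmax(self.head(fused), dim=-1)           # three emotion scores
```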
Step (4): and (3) constructing a visual characteristic emotion recognition network of the short video by the CNN network and the C3D network in the simultaneous step (2) and the multi-head self-attention neural network and the full-connection layer network in the step (3), calculating a cross entropy loss function value according to the visual emotion score in the step (3), and optimizing iteration network parameters by using a random gradient descent method to obtain a trained visual network model.
Step (5): and processing the text content of the marked associated short video into word vectors by using a word vector tool, and enhancing the emotion polarity of the word vectors based on the emotion dictionary to obtain a word vector sequence with enhanced emotion polarity. The method specifically comprises the following substeps:
(5.1) converting the text content of the short video into a corresponding 128-dimensional Word vector sequence using the CBOW model of the Word2Vec Word vector tool;
(5.2) multiplying the word vector obtained in the step (5.1) by the natural index of the enhancement factor α to obtain an emotion polarity enhancement word vector
In the above formula, x is an primitive word vector, pos (x) and neg (x) are respectively the positive emotion score and the negative emotion score of a corresponding word in text content in a SentiWordNet3.0 emotion dictionary, and the enhancement factor alpha is obtained by averaging the positive emotion score and the negative emotion score.
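A sketch of steps (5.1) and (5.2) using the gensim Word2Vec CBOW model and NLTK's SentiWordNet interface; averaging a word's scores over all of its synsets is a simplification made here, since the description does not specify how a single positive/negative score pair is selected for a word.

```python
import numpy as np
from gensim.models import Word2Vec
from nltk.corpus import sentiwordnet as swn  # requires nltk.download('wordnet') and ('sentiwordnet')

def build_word_vectors(tokenized_texts, dim=128):
    """Train a CBOW Word2Vec model (sg=0) on the tokenized short-video texts."""
    return Word2Vec(sentences=tokenized_texts, vector_size=dim, sg=0, min_count=1)

def enhanced_sequence(model, tokens):
    """Look up each token's 128-d vector, average its positive and negative
    SentiWordNet scores over all synsets, and scale the vector by exp(alpha)."""
    seq = []
    for w in tokens:
        vec = model.wv[w]
        synsets = list(swn.senti_synsets(w))
        if synsets:  # SentiWordNet scores are per-synset; average them per word
            pos = float(np.mean([s.pos_score() for s in synsets]))
            neg = float(np.mean([s.neg_score() for s in synsets]))
        else:
            pos = neg = 0.0
        seq.append(vec * np.exp((pos + neg) / 2.0))
    return np.stack(seq)
```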
Step (6): embedding position information in the word vector sequence with enhanced emotion polarity in the step (5) and adding a category label, inputting a text language multi-head self-attention network after a feature sequence with the size of 128 multiplied by 10 is formed, extracting 1024-dimensional high-level emotion semantic features, inputting the high-level emotion semantic features into a full-connection layer network, decoding semantic features by using a Softmax classifier, calculating text modal emotion scores with three emotion polarities, and enabling the multi-head self-attention network structure to be the same as that in the step (3), wherein the first two layers of the full-connection layer network have 1024 nodes, the first two layers are activated by using Relu functions, the third layer has 3 nodes, and feature vectors with the length of 3 are output.
Step (7): and (3) constructing a language feature emotion recognition network of the short video by the multi-head self-attention neural network and the full-connection layer network in the simultaneous step (6), calculating a cross entropy loss function value according to the text emotion score in the step (6), and optimizing iteration network parameters by using a random gradient descent method to obtain a trained language network model.
Step (8): and (3) respectively calculating vision and language emotion scores of the short video sample by using the trained network models in the step (4) and the step (7), obtaining emotion matrixes by combining the vision and language emotion scores, designing a weighting coefficient matrix according to weight duty ratios of different modes and different emotion types, limiting the value range of each parameter of the weighting coefficient matrix by using priori knowledge, designing a solving step length, traversing a value space to search an optimal solution, multiplying the emotion matrixes and the optimal weighting coefficient matrix to obtain an emotion classification probability matrix, and judging the emotion types of the short video according to the numerical values of each element on diagonal lines in the probability matrix. Comprises the following substeps:
(8.1) the visual and language emotion scores calculated by the network model in the step (4) and the step (7) are combined to obtain an emotion matrix S of the marked-association short video:
the superscript v and t respectively correspond to visual and text modes, the subscript pos, neu, neg respectively corresponds to three types of positive emotion, neutral emotion and negative emotion, and the 6 score values are emotion scores of the corresponding modes and the corresponding polarities respectively;
(8.2) designing a weighting coefficient matrix W according to the weight ratio of different emotion categories of visual and text modalities:
The 6W values in the matrix are weight parameters of corresponding modes and corresponding polarities respectively;
(8.3) setting the emotion weight ratio of the visual mode as k, the positive emotion weight ratio in the visual mode as x, the negative emotion weight ratio as y, the positive emotion weight ratio in the text mode as m, the negative emotion weight ratio as n, and obtaining the value range of each parameter by using priori knowledge:
accordingly, the weighting coefficient matrix W is converted into the following form:
Searching an optimal weighting coefficient matrix by using a fixed step length 0.001 traversal parameter value space;
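The parameter search of step (8.3) could be sketched as follows; the search criterion (classification accuracy on a validation split) is an assumption, and the step size is made much coarser than the 0.001 of the description only to keep the exhaustive traversal tractable in the sketch.

```python
import numpy as np
from itertools import product

def search_weights(S_list, labels, step=0.05):
    """Grid-search (k, x, y, m, n) to maximize classification accuracy.
    S_list holds one 3x2 emotion-score matrix per validation sample
    (rows: positive, neutral, negative; columns: visual, text)."""
    grid = np.arange(step, 1.0, step)
    best_acc, best_W = -1.0, None
    for k, x, y, m, n in product(grid, repeat=5):
        if x + y >= 1.0 or m + n >= 1.0:
            continue                                  # keep the neutral weights positive
        W = np.array([[k * x, k * (1 - x - y), k * y],
                      [(1 - k) * m, (1 - k) * (1 - m - n), (1 - k) * n]])
        preds = [int(np.argmax(np.diag(S @ W))) for S in S_list]
        acc = float(np.mean(np.array(preds) == np.array(labels)))
        if acc > best_acc:
            best_acc, best_W = acc, W
    return best_W, best_acc
```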
(8.4) The optimal weighting coefficient matrix found in step (8.3) is multiplied by the emotion matrix of step (8.1) to obtain the emotion classification probability matrix P:
$$P = S\,W, \qquad P_{1} = [S\,W]_{11}, \quad P_{2} = [S\,W]_{22}, \quad P_{3} = [S\,W]_{33}$$
where P_1, P_2 and P_3 are the probabilities that the short video is identified as the positive, neutral and negative emotion category, respectively;
(8.5) The values of P_1, P_2 and P_3 on the diagonal of the probability matrix P of step (8.4) are compared, and the subscript of the maximum value is identified as the emotion category of the annotation-associated short video.
As shown in fig. 2, another embodiment of the present invention discloses a short video emotion recognition system associated with a labeling based on vision and language, which includes:
the video stream segmentation module is used for segmenting a video stream of a short video into image frames, selecting a fixed number of image frames at equal intervals, adjusting the resolution of the selected image frames and cutting the selected image frames into regular image blocks;
The visual feature extraction module is used for extracting a time feature sequence and a space feature sequence of the selected image frames and image blocks by utilizing a CNN network and a C3D network, respectively inputting the two feature sequences into a multi-head self-attention neural network to calculate to obtain emotion visual features in time dimension and space dimension, and connecting the two feature sequences in series to obtain high-level visual emotion features of the short video;
The visual emotion score calculation module is used for inputting the high-level visual emotion features obtained by the operation of the visual feature extraction module into the full-connection layer network for decoding, and calculating emotion scores of the short video visual modes by using a Softmax classifier;
the word vector conversion module is used for converting text content of the short video into word vectors by using a word vector tool and carrying out polarity enhancement on the word vectors according to emotion scores of words in an emotion dictionary;
the semantic feature extraction module is used for inputting the emotion polarity enhancement word vector obtained by the word vector conversion module into a multi-head self-attention neural network, and calculating to obtain high-level semantic emotion features of the short video;
the text language emotion score calculation module is used for inputting the high-level semantic emotion features obtained by the operation of the semantic feature extraction module into the full-connection layer network for decoding, and calculating emotion scores of short video text modes by using a Softmax classifier;
visual and language network model training module: the system is used for training vision and language network models respectively, calculating a loss function value according to the emotion score calculated in the emotion score calculation module and optimizing parameters in the two network models by using a gradient descent method;
the emotion fusion recognition module is used for fusing the visual and language emotion recognition results of the short video, calculating visual and language emotion scores to form an emotion matrix by using a network model trained by the visual and language network model training module, designing a weighting coefficient matrix according to the weight proportion of different emotion types of visual and text modes, limiting the value range of each parameter in the weighting coefficient matrix by using priori knowledge, traversing a value space with a fixed step length to search for obtaining an optimal weighting coefficient matrix, multiplying the weighting coefficient matrix by the emotion matrix to obtain an emotion classification probability matrix of the short video, comparing the numerical values of each element on the diagonal of the probability matrix, and recognizing and judging the emotion type of the short video.
The technical principle, the solved technical problems and the generated technical effects of the embodiment of the marking-associated short video emotion recognition system based on vision and language are similar to those of the embodiment of the method, and belong to the same inventive concept, and specific implementation details and related descriptions can refer to the corresponding processes in the embodiment of the marking-associated short video emotion recognition method based on vision and language, and are not repeated herein.
Those skilled in the art will appreciate that the modules in an embodiment may be adaptively changed and arranged in one or more systems different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components.
Based on the same inventive concept, the embodiment of the invention also provides a marking-associated short video emotion recognition system based on vision and language, which comprises at least one computer device, wherein the computer device comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, and the processor realizes the marking-associated short video emotion recognition method based on vision and language when executing the program.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereto, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the present invention.

Claims (8)

1. A method for identifying short video emotion based on annotation association of vision and language is characterized by comprising the following steps:
s1, preprocessing a video stream of a short video sample, dividing the video stream into a plurality of image frames and a plurality of image blocks, and adjusting the resolutions of the image frames and the image blocks;
S2, respectively extracting visual characteristic information of image frames and image blocks by adopting a convolutional neural network to form a time-dimensional and space-dimensional low-layer characteristic sequence of the short video;
S3, respectively inputting the two feature sequences obtained in the step S2 into a multi-head self-attention neural network, and connecting the calculated two attention feature vectors in series to obtain a high-level emotion visual feature vector, and calculating emotion scores of visual modes by using a classifier after inputting the high-level emotion visual feature vector into a full-connection layer network;
S4, constructing the visual-feature emotion recognition network of the short video from the convolutional neural network of step S2 together with the multi-head self-attention neural network and full-connection layer network of step S3, calculating a loss function value according to the emotion scores of the visual modality in step S3, and iteratively optimizing the network parameters with a gradient descent method to obtain a trained visual network model;
S5, processing text content of the marked correlation type short video into word vectors by using a word vector tool, and enhancing emotion polarities of the word vectors based on an emotion dictionary to obtain word vector sequences with enhanced emotion polarities;
S6, inputting the word vector sequence obtained in the step S5 into a multi-head self-attention neural network, extracting high-level emotion semantic features, inputting the high-level emotion semantic features into a full-connection layer network, decoding the semantic features by using a classifier, and calculating emotion scores of text modes;
S7, constructing the language-feature emotion recognition network of the short video from the multi-head self-attention neural network and full-connection layer network of step S6, calculating a loss function value according to the emotion scores of the text modality in step S6, and iteratively optimizing the network parameters with a gradient descent method to obtain a trained language network model;
S8, respectively calculating the visual and language emotion scores of the short video sample using the network models trained in step S4 and step S7, combining them into an emotion matrix, designing a weighting coefficient matrix according to the weight ratios of the different modalities and emotion categories, limiting the value range of each parameter of the weighting coefficient matrix using prior knowledge, setting a solving step and traversing the value space to search for the optimal solution, multiplying the emotion matrix by the optimal weighting coefficient matrix to obtain an emotion classification probability matrix, and judging the emotion category of the short video according to the values of the elements on the diagonal of the probability matrix.
2. The method for identifying short video emotion based on annotation association of vision and language according to claim 1, wherein said step S1 specifically comprises:
s1.1, dividing a short video sample video stream, and selecting F frames from a first frame at equal intervals;
S1.2, adjusting the resolution of the image frames in step S1.1 to N×N;
S1.3, cropping the N×N image of step S1.2 into M² regular image blocks, each image block having a resolution of (N/M)×(N/M);
S1.4, forming an image block stream from the F successive image blocks at each cropping position in step S1.3; the M² cropping positions yield M² image block streams in total.
3. The method for identifying short video emotion based on label association of vision and language according to claim 2, wherein said step S2 comprises:
S2.1, inputting the N×N image frames of step S1.2 into a CNN network to extract the low-level visual features of the short video sample in the time dimension, the F frame images forming a feature sequence of length F;
S2.2, inputting the M² image block streams of step S1.4 into a C3D network to extract the low-level visual features of the short video sample in the space dimension, the M² image block streams forming a feature sequence of length M².
4. The method for identifying short video emotion based on label association of vision and language according to claim 3, wherein said step S3 comprises:
s3.1, embedding position information in the feature sequence in the step S2.1, adding a category label, forming a feature sequence with the length of F+1, inputting the feature sequence into a multi-head self-attention network, and calculating to obtain attention feature vectors of the time feature sequence;
S3.2, embedding position information in the feature sequence in the step S2.2, adding a category label, forming a feature sequence with the length of M²+1, inputting the feature sequence into a multi-head self-attention network, and calculating to obtain attention feature vectors of the space feature sequence;
s3.3, connecting the attention feature vectors in the step S3.1 and the step S3.2 in series to obtain a high-level emotion vision feature vector of the short video;
S3.4, inputting the high-level emotion visual features of step S3.3 into a full-connection layer network, and calculating the visual emotion scores of the short video with a Softmax classifier:
$$\mathrm{Score}_j = \frac{e^{x_j}}{\sum_{i=1}^{K} e^{x_i}}, \quad j = 1, 2, \dots, K$$
wherein K is the number of emotion categories, Score_j is the emotion score of the j-th emotion category, x_i is the value of the i-th dimension of the classifier input vector x, and the Softmax classifier calculates each emotion category score by exponentially normalizing the input vector.
5. The method for identifying short video emotion based on label association of vision and language according to claim 1, wherein said step S5 comprises:
s5.1, converting text content of the short video into a corresponding word vector sequence by using a word vector tool;
S5.2, multiplying the word vector obtained in step S5.1 by the natural exponent of the enhancement factor α to obtain the emotion-polarity-enhanced word vector:
$$\tilde{x} = x \cdot e^{\alpha}, \qquad \alpha = \frac{pos(x) + neg(x)}{2}$$
wherein x is the original word vector, pos(x) and neg(x) are respectively the positive and negative emotion scores of the corresponding word in the emotion dictionary, and the enhancement factor α is obtained by averaging the two.
6. The method for identifying short video emotion based on label association of vision and language according to claim 1, wherein said step S8 comprises:
S8.1, combining the visual and language emotion scores calculated by the network models of step S4 and step S7 to obtain the emotion matrix S of the annotation-associated short video:
$$S = \begin{bmatrix} s^{v}_{1} & s^{t}_{1} \\ s^{v}_{2} & s^{t}_{2} \\ \vdots & \vdots \\ s^{v}_{K} & s^{t}_{K} \end{bmatrix}$$
wherein s^{v}_{i} is the emotion score of the i-th emotion category in the visual modality and s^{t}_{i} is the emotion score of the i-th emotion category in the text modality;
S8.2, designing a weighting coefficient matrix W according to the weight ratios of the different emotion categories of the visual and text modalities:
$$W = \begin{bmatrix} w^{v}_{1} & w^{v}_{2} & \cdots & w^{v}_{K} \\ w^{t}_{1} & w^{t}_{2} & \cdots & w^{t}_{K} \end{bmatrix}$$
wherein w^{v}_{i} and w^{t}_{i} are the weight ratios of the i-th emotion category in the visual and text modalities, respectively;
S8.3, limiting the value range of each parameter in the weighting coefficient matrix using prior knowledge, and traversing the value space with a fixed step to search for the optimal weighting coefficient matrix;
S8.4, multiplying the optimal weighting coefficient matrix found in step S8.3 by the emotion matrix of step S8.1 to obtain the emotion classification probability matrix P:
$$P = S\,W, \qquad P_{i} = [S\,W]_{ii} = w^{v}_{i}\, s^{v}_{i} + w^{t}_{i}\, s^{t}_{i}$$
wherein P_i is the probability that the short video is identified as emotion category i;
s8.5, comparing the numerical values of the elements on the diagonal of the probability matrix P in the step S8.4, and identifying the subscript of the maximum element value as the emotion type of the short video.
7. A vision- and language-based annotation-associated short video emotion recognition system, comprising:
the video stream segmentation module is used for segmenting a video stream of a short video into image frames, selecting a fixed number of image frames at equal intervals, adjusting the resolution of the selected image frames and cutting the selected image frames into regular image blocks;
The visual feature extraction module is used for extracting a time feature sequence and a space feature sequence of the selected image frames and image blocks by utilizing a CNN network and a C3D network, respectively inputting the two feature sequences into a multi-head self-attention neural network to calculate to obtain emotion visual features in time dimension and space dimension, and connecting the two feature sequences in series to obtain high-level visual emotion features of the short video;
The visual emotion score calculation module is used for inputting the high-level visual emotion features obtained by the operation of the visual feature extraction module into the full-connection layer network for decoding, and calculating emotion scores of the short video visual modes by using a Softmax classifier;
The word vector conversion module is used for converting the text content of the annotation-associated short video into word vectors by using a word vector tool, and carrying out polarity enhancement on the word vectors according to the emotion scores of the words in an emotion dictionary;
the semantic feature extraction module is used for inputting the emotion polarity enhancement word vector obtained by the word vector conversion module into a multi-head self-attention neural network, and calculating to obtain high-level semantic emotion features of the short video;
the text language emotion score calculation module is used for inputting the high-level semantic emotion features obtained by the operation of the semantic feature extraction module into the full-connection layer network for decoding, and calculating emotion scores of short video text modes by using a Softmax classifier;
visual and language network model training module: the system is used for training vision and language network models respectively, calculating a loss function value according to the emotion score calculated in the emotion score calculation module and optimizing parameters in the two network models by using a gradient descent method;
the emotion fusion recognition module is used for fusing the visual and language emotion recognition results of the short video, calculating visual and language emotion scores to form an emotion matrix by using a network model trained by the visual and language network model training module, designing a weighting coefficient matrix according to the weight proportion of different emotion types of visual and text modes, limiting the value range of each parameter in the weighting coefficient matrix by using priori knowledge, traversing a value space with a fixed step length to search for obtaining an optimal weighting coefficient matrix, multiplying the weighting coefficient matrix by the emotion matrix to obtain an emotion classification probability matrix of the short video, comparing the numerical values of each element on the diagonal of the probability matrix, and recognizing and judging the emotion type of the short video.
8. A visual and linguistic based annotation-associated short video emotion recognition system comprising at least one computer device, the computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the visual and linguistic based annotation-associated short video emotion recognition method of any of claims 1-6 when the computer program is executed.
CN202210511572.0A 2022-05-11 2022-05-11 Labeling-associated short video emotion recognition method and system based on vision and language Active CN114882412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210511572.0A CN114882412B (en) 2022-05-11 2022-05-11 Labeling-associated short video emotion recognition method and system based on vision and language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210511572.0A CN114882412B (en) 2022-05-11 2022-05-11 Labeling-associated short video emotion recognition method and system based on vision and language

Publications (2)

Publication Number Publication Date
CN114882412A CN114882412A (en) 2022-08-09
CN114882412B true CN114882412B (en) 2024-06-25

Family

ID=82675433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210511572.0A Active CN114882412B (en) 2022-05-11 2022-05-11 Labeling-associated short video emotion recognition method and system based on vision and language

Country Status (1)

Country Link
CN (1) CN114882412B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145712A (en) * 2018-06-28 2019-01-04 南京邮电大学 A kind of short-sighted frequency emotion identification method of the GIF of fusing text information and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN112818861B (en) * 2021-02-02 2022-07-26 南京邮电大学 Emotion classification method and system based on multi-mode context semantic features

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145712A (en) * 2018-06-28 2019-01-04 南京邮电大学 A kind of short-sighted frequency emotion identification method of the GIF of fusing text information and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Emotion classification method based on convolutional attention mechanism; 顾军华 et al.; Computer Engineering and Design (计算机工程与设计); 2020-01-16 (No. 01); full text *

Also Published As

Publication number Publication date
CN114882412A (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN110334705B (en) Language identification method of scene text image combining global and local information
Yang et al. Visual sentiment prediction based on automatic discovery of affective regions
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
Zhang et al. More is better: Precise and detailed image captioning using online positive recall and missing concepts mining
CN108804530B (en) Subtitling areas of an image
CN106650813B (en) A kind of image understanding method based on depth residual error network and LSTM
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN107239565B (en) Image retrieval method based on saliency region
Peng et al. Research on image feature extraction and retrieval algorithms based on convolutional neural network
CN110083729B (en) Image searching method and system
CN112686345B (en) Offline English handwriting recognition method based on attention mechanism
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN108427740B (en) Image emotion classification and retrieval algorithm based on depth metric learning
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN113516152B (en) Image description method based on composite image semantics
Sharma et al. Deep eigen space based ASL recognition system
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN112507800A (en) Pedestrian multi-attribute cooperative identification method based on channel attention mechanism and light convolutional neural network
Xiao et al. An extended attention mechanism for scene text recognition
CN116611024A (en) Multi-mode trans mock detection method based on facts and emotion oppositivity
Cui et al. Representation and correlation enhanced encoder-decoder framework for scene text recognition
Takimoto et al. Image aesthetics assessment based on multi-stream CNN architecture and saliency features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant