CN115995029A - Image emotion analysis method based on bidirectional connection - Google Patents

Image emotion analysis method based on bidirectional connection

Info

Publication number
CN115995029A
CN115995029A (application CN202211689872.4A)
Authority
CN
China
Prior art keywords
feature
layer
image
depth
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211689872.4A
Other languages
Chinese (zh)
Inventor
王兴起
何金义
邵艳利
魏丹
陈滨
方景龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211689872.4A priority Critical patent/CN115995029A/en
Publication of CN115995029A publication Critical patent/CN115995029A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image emotion analysis method based on bidirectional connection. The method widens the feature extraction model instead of stacking it deeper, so as to improve the adaptability of deep neural networks to abstract emotion targets in visual emotion analysis. A bidirectional feature connection scheme is adopted: data enhancement first expands the sample capacity of the training set; a transversely connected depth feature extraction network then widens the feature extractor; a longitudinal, feature-pyramid-like connection further strengthens the model's feature extraction capability; finally, a classification network produces the emotion classification result. The method overcomes the weakness of deep neural networks in recognizing abstract emotion targets in image emotion analysis and further improves the accuracy of visual emotion analysis.

Description

Image emotion analysis method based on bidirectional connection
Technical Field
The invention belongs to the field of image emotion analysis, and particularly relates to a visual emotion analysis method based on transverse and longitudinal bidirectional feature connection.
Background
Rich emotion is a mental phenomenon specific to humans. Emotion is part of biological intelligence and a derivative of it: in general, the richness of emotion is considered to correlate with intelligence, so organisms that can generate and express rich emotion tend to have higher intelligence, while those that cannot express emotion, or express it only in a single way, tend to have lower intelligence. In recent years, with the development of artificial intelligence and robotics, affective computing has become increasingly important in research on human-machine interaction; a natural interaction mode with emotion not only enables a friendlier interface but is also a necessary step toward strong artificial intelligence. Artificial intelligence that possesses both emotion and intelligence has high practical value and significance. For example, in management, measuring the emotions of leaders and staff through affective computing can improve the overall efficiency of an enterprise; in commerce, analyzing customer emotion from review texts enables precise promotion and helps enterprises build their brands more accurately; in the health field, emotion prediction based on doctor-patient dialogue can help doctors analyze a patient's psychology, assist psychological interviews, and further diagnose and treat psychological diseases and negative emotions such as suicidal tendencies. To achieve true artificial intelligence, natural human-machine interaction integrating intelligence and emotion must be realized. Research on artificial intelligence has recently reached a new climax, benefiting from breakthroughs in computer technology and deep neural networks.
Against this background, affective computing, as an emerging discipline, has entered the field of view of many researchers in information science and psychology. More and more researchers focus on emotion as a derivative of intelligence, hoping to reproduce biological emotional intelligence through affective computing and to endow machines with rich emotion and higher intelligence. Visual emotion analysis is an important component of affective computing: it aims to use computers and specific algorithms to predict the psychological change a person undergoes when viewing an image [1]. It can, to a certain extent, predict and guide the emotional change of readers while browsing and reading, has important application prospects in social media analysis, multimedia analysis, and public opinion prediction, and has become an important research direction in affective computing.
As part of affective computing, visual emotion analysis is both similar to and different from classical classification problems in machine learning. The similarity is that both require the neural network to understand the semantic information in an image: the demands on the model's feature extraction capability are consistent with most other classification problems, and emotion classification likewise requires analyzing and understanding the image content to obtain a classification result. The difference is the large gap between the emotion space and the visual feature space: the recognition target of a classical classification problem is a concrete object present in the image, whereas the key to visual emotion analysis is how to map the concrete image space to the abstract emotion space, obtaining emotion labels by simulating the emotional process of a human viewing the image.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an image emotion analysis method based on bidirectional connection.
The invention designs an emotion analysis method based on transverse and longitudinal bidirectional feature connection, improving the accuracy of abstract classification problems through width expansion. First, two deep convolutional neural networks are used as feature extraction backbones, and feature tensors at different scales are collected while their modules are stacked longitudinally. Then the two models are connected in parallel: feature tensors of the same scale are connected transversely at multiple stages to expand the extracted semantic information, an attention module adds channel attention to the feature information, and layer-by-layer convolutional downsampling and combination finally yield the multi-level joint feature output of the two-way combined deep convolutional network. Finally, a classification network outputs the emotion classification result. This bidirectional feature connection strengthens the feature extraction capability of the backbone networks, absorbs the advantage of feature pyramids in jointly considering feature maps of different scales, breaks the linear structure of a single classification network built from repeatedly stacked modules, and is therefore better suited to abstract emotion classification.
The image emotion analysis method based on the bidirectional connection specifically comprises the following steps:
step 1: deep neural network weight pre-training and image dataset pre-processing;
step 2: feature combination is completed through the depth feature transverse connection module;
a depth feature transverse connection module transversely connects a plurality of deep neural network models and completes the feature combination; the deep neural network models have convolution layers with the same output sizes; the depth feature transverse connection module consists of convolution layers, a connection layer and an SE attention module;
step 3: the longitudinal feature cascading module is used for carrying out joint feature fusion;
the joint features obtained by the depth feature transverse connection are used as input of the cascade structure; joint features of different sizes are fused in a manner similar to a feature pyramid, and an attention layer adds channel attention to improve model accuracy; the longitudinal combination follows the feature pyramid structure, combining the joint feature vectors from shallow to deep into the combined output; this process is expressed as:
Q_1 = f_Q(P_1)
Q_I = f_Q(Q_(I-1) ⊕ P_I), I > 1
Out = Avgpool(Q_N)
where Q_I is the output of layer I of the joint feature pyramid; f_Q() is the downsampling operation in the joint feature pyramid; ⊕ is the feature joint layer; Avgpool() is the global average pooling layer; Out is the overall output feature vector of the joint feature pyramid; and P_I is the input tensor of layer I of the depth feature transverse connection module;
step 4: after the longitudinal cascade module, a classification network is used, consisting of two fully connected layers with 4096 neurons each, one fully connected layer with 1000 neurons, and a final fully connected layer with 2 neurons and a Sigmoid activation function;
step 5: the normalized images are passed through steps 2-4 to obtain the emotion classification results.
Preferably, the deep neural network weight pre-training and dataset preprocessing are as follows:
step 1-1: training the deep neural network feature extraction models on the ImageNet image dataset to enhance the basic image perception capability of the image emotion analysis model;
step 1-2: for the target dataset, normalizing each sample in the dataset to the same size and expanding the samples of each cross-validation training set through image processing; depth features of each sample image are then extracted on the processed dataset using the pre-trained models.
Preferably, there are two deep neural network models: a deep residual network (Res-Net) and a neural architecture search network (NAS-Net).
Preferably, the deep residual network and the neural architecture search network are transversely connected through the depth feature transverse connection module, specifically:
for an input image Y, feature extraction is performed with the deep convolution models Res-Net and NAS-Net to obtain the output of each complete module layer, and the outputs are connected by the transverse connection module; the transverse connection part connects the outputs of the 3rd, 4th and 5th Res modules of the deep residual network with the output features of the 2nd, 3rd and 4th NAS modules of the neural architecture search network; to reduce the computational cost, the features to be combined are processed by a 1×1 convolution layer before each transverse combination, realizing cross-channel information exchange while limiting the growth of the number of channels; the process is expressed as:
P_I = g(f_A(F_(A,I)) ⊕ f_B(F_(B,I)))
wherein P_I is the input tensor of layer I; F_(A,I) represents the I-th specific layer output tensor of model A and F_(B,I) the I-th specific layer output tensor of model B; f_A and f_B are the normalization operations of model A and model B, each composed of convolution layers with kernel sizes 1×1 and 3×3 and used to add non-linearity to the joint feature tensor; ⊕ represents the feature combination; g is a normalization operation that limits the number of channels of the joint feature tensor and consists of a convolution layer with kernel size 1×1; through this step, a plurality of joint feature vectors are obtained from the input image Y for subsequent use.
Preferably, in the longitudinal combination process, the first transversely joined input tensor is processed by an additional convolution layer with kernel size 3×3, while the joined input tensors of the other layers are each processed by a convolution layer with kernel size 1×1, so as to limit the growth of the number of feature channels during the longitudinal combination; after the longitudinal cascade module, a large-scale joint feature vector is obtained.
The beneficial effects of the invention are:
1. For the abstract target classification problem of image emotion analysis, transversely connected feature extraction models are used to widen the model instead of the deep module stacking of traditional target classification, enhancing the model's perception surface and its ability to perceive abstract features.
2. The multi-level transverse joint features are longitudinally cascaded by a longitudinal feature connection model improved from the FPN, which absorbs the advantage of feature pyramids in jointly considering feature maps of different scales, breaks the linear structure of a single classification network built from repeatedly stacked modules, makes the method better suited to abstract emotion classification, and improves the model's accuracy on emotion classification problems.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a network structure diagram of the transverse connection module and the formation process of the transverse joint features;
FIG. 3 is a network structure diagram of the longitudinal cascade process and the longitudinal connection module;
Detailed Description
The invention provides an image emotion analysis method based on deep learning. The overall structure of the method is shown in fig. 1. Its main components are the feature transverse connection module of the transverse-longitudinal bidirectional feature connection parallel network and the longitudinal connection module based on the feature pyramid. First, the transverse connection module completes feature extraction through two backbone networks pre-trained on a large dataset, and the feature tensors extracted at different module levels of the two models are transversely connected at matching feature scales. Second, the longitudinal connection module longitudinally cascades the joint features produced by the transverse connection modules; the longitudinal connection adopts a joint feature pyramid structure similar to a feature pyramid, the output features of the joint feature pyramid are obtained through step-by-step convolution and downsampling, and an attention module adds channel-level attention. Finally, a classification network composed of a global average pooling layer and fully connected layers produces the classification result.
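To make the overall data flow concrete, the following sketch assumes a PyTorch implementation (the patent does not prescribe a framework); the class and argument names are illustrative, and the sub-modules are elaborated in the step-by-step sketches further below.

```python
import torch.nn as nn

class BidirectionalEmotionNet(nn.Module):
    """Sketch: two backbones -> transverse (lateral) joins at matched scales ->
    longitudinal (vertical) cascade -> classification head. The sub-modules are
    illustrative placeholders, not identifiers taken from the patent."""

    def __init__(self, backbone_a, backbone_b, lateral_joins, vertical_cascade, classifier):
        super().__init__()
        self.backbone_a = backbone_a            # e.g. Res-Net stages used as a feature extractor
        self.backbone_b = backbone_b            # e.g. NAS-Net cells used as a feature extractor
        self.lateral_joins = nn.ModuleList(lateral_joins)
        self.vertical_cascade = vertical_cascade
        self.classifier = classifier

    def forward(self, x):
        feats_a = self.backbone_a(x)            # list of stage outputs, shallow -> deep
        feats_b = self.backbone_b(x)            # list of cell outputs at matching spatial sizes
        joined = [join(a, b) for join, a, b in zip(self.lateral_joins, feats_a, feats_b)]
        fused = self.vertical_cascade(joined)   # pyramid-style fusion + global average pooling
        return self.classifier(fused)           # emotion classification result
```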
The invention will now be described in detail with reference to the accompanying drawings and examples, with the following specific steps:
and step 1, deep neural network weight pre-training and data set preprocessing.
Res-Net and NAS-Net are pre-trained on the large ImageNet dataset to enhance the basic image perception capability of the image emotion analysis model; for reproducibility, the officially released pre-trained weights are used as the initial weights.
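A minimal sketch of this initialization step, assuming torchvision and timm as weight sources; the exact Res-Net and NAS-Net variants are not specified in the patent, so ResNet-50 and NASNet-A-Large are used here purely as examples.

```python
import torchvision.models as models
import timm  # assumed dependency; provides NASNet variants

# Load ImageNet-pretrained weights as the initial weights (illustrative variants only).
resnet_backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
nasnet_backbone = timm.create_model("nasnetalarge", pretrained=True)
```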
Further, the dataset is divided into a training set and a test set; the training set samples are used for training and the test set samples for validating the model. Because the number of training set samples is small, the sample capacity is expanded by simple data enhancement: each training set sample is randomly scaled, randomly translated and randomly rotated to expand the sample capacity to 3 times, meeting the training requirement.
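A minimal augmentation sketch consistent with this description; the target image size and the rotation, translation and scaling ranges are assumed values, not taken from the patent.

```python
from torchvision import transforms

# Random scaling, translation and rotation used to enlarge each cross-validation training set.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),            # normalize every sample to the same size (assumed size)
    transforms.RandomAffine(degrees=15,       # random rotation (assumed range)
                            translate=(0.1, 0.1),
                            scale=(0.9, 1.1)),
    transforms.ToTensor(),
])

def expand_training_set(train_images):
    """Assumed expansion scheme: three randomly augmented copies per sample (3x capacity)."""
    return [train_transform(img) for img in train_images for _ in range(3)]
```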
Step 2: transverse connection of depth features.
The depth feature transverse connection is performed by the depth feature transverse connection module. For an input image Y, the deep convolution models Res-Net and NAS-Net perform feature extraction, the output of each complete module layer is obtained, and the outputs are connected with the transverse connection module. Specifically, as shown in fig. 2, the transverse connection part connects the outputs of the 3rd, 4th and 5th Res modules of the deep residual network (Res-Net) with the output features of the 2nd, 3rd and 4th NAS modules of the neural architecture search network (NAS-Net). To reduce the computational cost, the features to be combined are processed by a 1×1 convolution layer before each transverse combination, realizing cross-channel information exchange while limiting the growth of the number of channels. This process can be expressed as:
P_I = g(f_A(F_(A,I)) ⊕ f_B(F_(B,I)))
wherein P_I is the input tensor of layer I; F_(A,I) represents the I-th specific layer output tensor of model A and F_(B,I) the I-th specific layer output tensor of model B; f_A and f_B are the normalization operations of the two models, each composed of convolution layers with kernel sizes 1×1 and 3×3 and used to add non-linearity to the joint feature tensor; ⊕ represents the feature combination; g is a normalization operation that limits the number of channels of the joint feature tensor and consists of a convolution layer with kernel size 1×1. Through this step, a plurality of joint feature vectors can be obtained from the input image Y for subsequent use.
Step 3: longitudinal feature cascade and classification result output.
The longitudinal cascade is performed with the longitudinal cascade module, and a classification network outputs the classification result. As shown in fig. 3, the joint feature vectors obtained in step 2 are combined step by step with a network structure similar to a feature pyramid. The difference between the two is that a feature pyramid receives the output tensors of specific layers of one convolutional neural network, whereas the longitudinal cascade module receives the output tensors obtained by transversely connecting two layers of the same size from different neural networks, and then combines them longitudinally. The longitudinal combination follows the feature pyramid structure, combining the joint feature vectors from shallow to deep into the combined output, which can be expressed as:
Q_1 = f_Q(P_1)
Q_I = f_Q(Q_(I-1) ⊕ P_I), I > 1
Out = Avgpool(Q_N)
wherein Q_I is the output of layer I of the joint feature pyramid; f_Q() is the downsampling operation in the joint feature pyramid, composed mainly of a convolution layer with kernel size 1×1 and a pooling layer with stride 2, which realizes cross-space information exchange and determines the training weights between joint feature vectors; ⊕ is the feature joint layer (Concatenate Layer); Avgpool() is the global average pooling layer (Global Average Pooling Layer); Out is the overall output feature vector of the joint feature pyramid, which can be mapped to a classification result by the classification network. In the longitudinal combination process, the first transversely joined input tensor is processed by an additional convolution layer with kernel size 3×3, while the joined input tensors of the other layers are each processed by a convolution layer with kernel size 1×1, so as to limit the growth of the number of feature channels during the longitudinal combination. After the longitudinal cascade module, a joint feature vector of very large scale is obtained.
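A sketch of the longitudinal cascade module under the definitions above; the recursion Q_1 = f_Q(P_1), Q_I = f_Q(Q_(I-1) ⊕ P_I), Out = Avgpool(Q_N) is reconstructed from the symbol descriptions rather than copied verbatim, and the sketch assumes each deeper joint feature map has half the spatial size of the previous one.

```python
import torch
import torch.nn as nn

class VerticalCascade(nn.Module):
    """FPN-like shallow-to-deep fusion of the joint feature tensors P_1..P_N."""
    def __init__(self, in_channels, mid_ch):
        super().__init__()
        # First joint tensor: extra 3x3 conv; later ones: 1x1 conv to limit channel growth.
        self.pre = nn.ModuleList(
            [nn.Conv2d(in_channels[0], mid_ch, 3, padding=1)] +
            [nn.Conv2d(c, mid_ch, 1) for c in in_channels[1:]]
        )
        # f_Q: 1x1 conv plus stride-2 pooling (cross-space exchange + downsampling).
        self.f_q = nn.ModuleList([
            nn.Sequential(nn.Conv2d(mid_ch if i == 0 else 2 * mid_ch, mid_ch, 1),
                          nn.MaxPool2d(kernel_size=2, stride=2))
            for i in range(len(in_channels))
        ])
        self.avgpool = nn.AdaptiveAvgPool2d(1)      # Avgpool(): global average pooling

    def forward(self, joined_feats):                # shallow -> deep joint features P_1..P_N
        q = self.f_q[0](self.pre[0](joined_feats[0]))          # Q_1 = f_Q(P_1)
        for i in range(1, len(joined_feats)):
            p = self.pre[i](joined_feats[i])
            q = self.f_q[i](torch.cat([q, p], dim=1))          # Q_I = f_Q(Q_(I-1) (+) P_I)
        return torch.flatten(self.avgpool(q), 1)               # Out = Avgpool(Q_N)
```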
Further, the joint feature vector is processed by the classification network. The classification network consists of two fully connected layers with 4096 neurons each and one fully connected layer with 1000 neurons; finally, the emotion classification result is output through a fully connected layer with 2 neurons and a Sigmoid activation function.
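A sketch of this classification head; the patent specifies only the layer widths and the final Sigmoid, so the ReLU activations between layers and the input dimension are assumptions.

```python
import torch.nn as nn

def make_emotion_classifier(in_features):
    """Two fully connected layers of 4096, one of 1000, and a final 2-neuron layer with Sigmoid."""
    return nn.Sequential(
        nn.Linear(in_features, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 1000), nn.ReLU(inplace=True),
        nn.Linear(1000, 2), nn.Sigmoid(),     # binary emotion output (e.g. positive / negative)
    )
```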
The experimental results of the image emotion analysis are verified from two aspects: an ablation experiment and a comparison experiment. The experimental datasets are the Twitter I dataset and the ArtPhoto dataset, two well-known public datasets in the field of image emotion analysis. The ablation experiment results are as follows:
Table 1: ablation experiments of visual emotion analysis based on bidirectional feature connection
Compared with directly using a deep neural network model or simply connecting only the last layers of the deep neural networks, the image emotion analysis method based on bidirectional connection proposed by the invention achieves higher classification accuracy, which demonstrates its superiority. The comparison experiment selects methods with excellent performance in the field of image emotion analysis in recent years; the comparison results are as follows:
Table 2: comparative experiments
The comparison experiment shows that the invention performs comparatively well on the ArtPhoto and Twitter I datasets. In addition, since the method is an improvement targeting feature extraction and feature connection, it does not conflict with existing image emotion classification methods and has good development prospects.

Claims (5)

1. The image emotion analysis method based on the bidirectional connection is characterized by comprising the following steps of:
step 1: deep neural network weight pre-training and image dataset pre-processing;
step 2: feature combination is completed through the depth feature transverse connection module;
a depth feature transverse connection module transversely connects a plurality of deep neural network models and completes the feature combination; the deep neural network models have convolution layers with the same output sizes; the depth feature transverse connection module consists of convolution layers, a connection layer and an SE attention module;
step 3: the longitudinal feature cascading module is used for carrying out joint feature fusion;
the joint features obtained by the depth feature transverse connection are used as input of the cascade structure; joint features of different sizes are fused in a manner similar to a feature pyramid, and an attention layer adds channel attention to improve model accuracy; the longitudinal combination follows the feature pyramid structure, combining the joint feature vectors from shallow to deep into the combined output; this process is expressed as:
Q_1 = f_Q(P_1)
Q_I = f_Q(Q_(I-1) ⊕ P_I), I > 1
Out = Avgpool(Q_N)
where Q_I is the output of layer I of the joint feature pyramid; f_Q() is the downsampling operation in the joint feature pyramid; ⊕ is the feature joint layer; Avgpool() is the global average pooling layer; Out is the overall output feature vector of the joint feature pyramid; and P_I is the input tensor of layer I of the depth feature transverse connection module;
step 4: after the longitudinal cascade module, a classification network is used, consisting of two fully connected layers with 4096 neurons each, one fully connected layer with 1000 neurons, and a final fully connected layer with 2 neurons and a Sigmoid activation function;
step 5: the normalized images are passed through steps 2-4 to obtain the emotion classification results.
2. The image emotion analysis method based on bidirectional connection according to claim 1, characterized in that the deep neural network weight pre-training and dataset preprocessing comprise the following steps:
step 1-1: training the deep neural network feature extraction models on the ImageNet image dataset to enhance the basic image perception capability of the image emotion analysis model;
step 1-2: for the target dataset, normalizing each sample in the dataset to the same size and expanding the samples of each cross-validation training set through image processing; depth features of each sample image are then extracted on the processed dataset using the pre-trained models.
3. The image emotion analysis method based on bidirectional connection according to claim 1, characterized in that there are two deep neural network models: a deep residual network (Res-Net) and a neural architecture search network (NAS-Net).
4. The bidirectional connection based image emotion analysis method as set forth in claim 3, wherein the deep residual network and the neural architecture search network are transversely connected through the depth feature transverse connection module, specifically:
for an input image Y, feature extraction is performed with the deep convolution models Res-Net and NAS-Net to obtain the output of each complete module layer, and the outputs are connected by the transverse connection module; the transverse connection part connects the outputs of the 3rd, 4th and 5th Res modules of the deep residual network with the output features of the 2nd, 3rd and 4th NAS modules of the neural architecture search network; to reduce the computational cost, the features to be combined are processed by a 1×1 convolution layer before each transverse combination, realizing cross-channel information exchange while limiting the growth of the number of channels; the process is expressed as:
P_I = g(f_A(F_(A,I)) ⊕ f_B(F_(B,I)))
wherein P_I is the input tensor of layer I; F_(A,I) represents the I-th specific layer output tensor of model A and F_(B,I) the I-th specific layer output tensor of model B; f_A and f_B are the normalization operations of model A and model B, each composed of convolution layers with kernel sizes 1×1 and 3×3 and used to add non-linearity to the joint feature tensor; ⊕ represents the feature combination; g is a normalization operation that limits the number of channels of the joint feature tensor and consists of a convolution layer with kernel size 1×1; through this step, a plurality of joint feature vectors are obtained from the input image Y for subsequent use.
5. The image emotion analysis method based on bidirectional connection according to claim 1, characterized in that in the longitudinal combination process, the first transversely joined input tensor is processed by an additional convolution layer with kernel size 3×3, while the joined input tensors of the other layers are each processed by a convolution layer with kernel size 1×1, so as to limit the growth of the number of feature channels during the longitudinal combination; after the longitudinal cascade module, a large-scale joint feature vector is obtained.
CN202211689872.4A 2022-12-27 2022-12-27 Image emotion analysis method based on bidirectional connection Pending CN115995029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211689872.4A CN115995029A (en) 2022-12-27 2022-12-27 Image emotion analysis method based on bidirectional connection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211689872.4A CN115995029A (en) 2022-12-27 2022-12-27 Image emotion analysis method based on bidirectional connection

Publications (1)

Publication Number Publication Date
CN115995029A true CN115995029A (en) 2023-04-21

Family

ID=85994800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211689872.4A Pending CN115995029A (en) 2022-12-27 2022-12-27 Image emotion analysis method based on bidirectional connection

Country Status (1)

Country Link
CN (1) CN115995029A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593593A (en) * 2024-01-18 2024-02-23 湖北工业大学 Image emotion classification method for multi-scale semantic fusion under emotion gain
CN117593593B (en) * 2024-01-18 2024-04-09 湖北工业大学 Image emotion classification method for multi-scale semantic fusion under emotion gain


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination