CN111985532A - Scene-level context-aware emotion recognition deep network method - Google Patents
Scene-level context-aware emotion recognition deep network method
- Publication number
- CN111985532A (application CN202010664287.3A)
- Authority
- CN
- China
- Prior art keywords
- body part
- emotion
- context
- network
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a scene-level context-aware emotion recognition deep network method. A training sample set X_in is read and, from its body annotation values and original emotion annotation values, a body part image set X_B is obtained. After normalization, the body part images are fed into an upper-layer convolutional neural network and the context images into a lower-layer convolutional neural network to extract the body part emotional features T_F and the contextual emotional features T_C. T_F and T_C are then fed into the upper and lower adaptive layers to obtain the fusion weights λ_F and λ_C, and T_F, T_C, λ_F and λ_C are fused into the emotion fusion feature T_A. T_A is linearly mapped by a fully connected layer to initial predicted values of arousal and valence; the loss between these predictions and the original emotion annotation values is measured and gradually converges, and training is completed to obtain the network model. The test sample set is processed and fed into the network model to obtain the predicted label values of the test sample set X_tn. When fusing features, the proposed method takes into account how strongly features of different attributes influence human emotion, enriching image-based emotion recognition research while improving the prediction performance of the model.
Description
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a scene-level context-aware emotion recognition deep network method.
Background
Emotion is an essential form through which a person expresses his or her feelings. Understanding and recognizing a person's emotions from the actual scene in which the person is situated in daily life helps to perceive their mental state, predict their behavior, and interact effectively. In the 1990s, the concept of affective computing was proposed by the MIT Media Lab, and scientists have since worked on converting complex human emotions into numerical information recognizable by computers so as to better realize human-computer interaction and make computers more intelligent; this has become one of the key problems to be solved in the era of artificial intelligence.
Traditionally, emotion recognition for static images has mainly been studied on face images: emotional features are extracted with predefined feature extraction methods and fed into a classifier (or regressor) for model training, finally yielding an emotion prediction. However, emotion recognition based on face images is easily affected by the natural environment and by sample characteristics such as pose, illumination, and inter-face differences.
According to psychological research, facial expressions convey only about 55% of the emotional information in visual communication. In everyday emotional communication, a person's inner emotion can be estimated not only from the facial expression of the target person but also from a series of rich contextual cues such as body movement, interaction with other people, and the surrounding scene; even in the extreme case where the face cannot be detected, the emotion of the subject can still be estimated from the abundant context information.
In recent years, complex emotion recognition methods based on deep convolutional networks have attracted attention: the network learns and analyzes emotional features by itself instead of relying on traditionally hand-crafted definitions. However, current deep learning methods mainly perform emotion analysis on face images, lack a comprehensive treatment of people in the complex situations of natural scenes, and do not consider the influence of scene-level context information on recognizing a person's emotion. Meanwhile, the way in which features of different attributes are fused has not been studied sufficiently, and existing models ignore the differing contributions of such features to emotional state recognition.
Disclosure of Invention
The invention aims to provide a scene-level context-aware emotion recognition deep network method that addresses the limitations of the prior art: the scope of emotion analysis on static images is narrow, only face images are targeted, and emotion recognition is carried out by directly concatenating features of different attributes.
The technical scheme adopted by the invention is a scene-level context-aware emotion recognition deep network method, which specifically comprises the following steps:
Step 1, collect images and determine a training sample set X_in and a test sample set X_tn;
Step 2, read the training sample set X_in and extract the body part of each sample according to its body annotation value to obtain a body part image set X_B;
Step 3, normalize the training sample set X_in within the set to obtain a context emotion image set X_im; normalize the body part image set X_B within the set to obtain a normalized body part image set X_body;
Step 4, feed the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional features T_F, and feed the context emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional features T_C;
Step 5, feed the body part emotional features T_F and the scene-level contextual emotional features T_C into the upper and lower adaptive layers, respectively, for adaptive feature learning; the upper adaptive layer outputs the body part fusion weight λ_F and the lower adaptive layer outputs the context fusion weight λ_C;
Step 6, perform weighted fusion of the body part emotional features T_F and the scene-level contextual emotional features T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain the emotion fusion feature T_A that incorporates the context information; T_A is then linearly mapped by a fully connected layer to initial predicted values of arousal and valence. A KL divergence loss function measures the loss between these initial predictions and the corresponding original emotion annotation values; the loss is back-propagated through the network and, over multiple iterations, the network weights are updated so that the loss gradually decreases and the algorithm converges, completing training and yielding the network model;
Step 7, extract the body part of each test sample in the test sample set X_tn according to Step 2 to obtain a test body part image set X_tB; then, following Step 3, normalize the test sample set X_tn and the test body part image set X_tB, feed them into the network model obtained in Step 6, and finally obtain the predicted label values of the test sample set X_tn.
The present invention is also characterized in that,
The specific steps for extracting the body part from the training sample set X_in in Step 2 are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corners of the region containing the body part, and compute the position and size parameters of the body part by formula (1):
B_w = B_x2 − B_x1,  B_h = B_y2 − B_y1        (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image;
Step 2.2, according to the position and size parameters obtained in Step 2.1, crop each sample of the training sample set X_in to obtain the body part image set X_B.
The formula for the intra-set normalization of the training sample set X_in in Step 3 is as follows:
X_im = (X_in − x_mean) / σ        (2)
In formula (2), X_in is the training sample set, X_im is the context emotion image set, σ is the standard deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) Σ_{i=1}^{n} x_i        (3)
σ = sqrt( (1/n) Σ_{i=1}^{n} (x_i − x_mean)^2 )        (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
The formula for the intra-set normalization of the body part image set X_B in Step 3 is as follows:
X_body = (X_B − x'_mean) / σ'        (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) Σ_{i=1}^{n} x'_i        (6)
σ' = sqrt( (1/n) Σ_{i=1}^{n} (x'_i − x'_mean)^2 )        (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
In Step 4, the upper-layer and lower-layer convolutional neural networks have identical structural parameters, and both adopt the VGG16 architecture.
In Step 4, the body part emotional features T_F and the contextual emotional features T_C are computed as follows:
T_F = F(X_body, W_F)        (8)
T_C = F(X_im, W_C)        (9)
In formula (8), W_F represents all parameters of the convolutional and pooling layers of the upper-layer convolutional neural network; in formula (9), W_C represents all parameters of the convolutional and pooling layers of the lower-layer convolutional neural network; F denotes the convolution and pooling operations of the feature extraction network.
In Step 5, the body part fusion weight λ_F and the context fusion weight λ_C are computed as follows:
λ_F = F(T_F, W_D)        (10)
λ_C = F(T_C, W_E)        (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer; and λ_F + λ_C = 1.
In Step 5, the network structures of the upper adaptive layer and the lower adaptive layer are completely the same, with the following specific network architecture parameters:
the upper adaptive layer and the lower adaptive layer respectively comprise a maximum pooling layer, two convolution layers and a Softmax layer.
In Step 6, the formula for the weighted fusion of the body part emotional features T_F and the scene-level contextual features T_C with the body part fusion weight λ_F and the context fusion weight λ_C is as follows:
T_A = (λ_F ⊗ T_F) ∏ (λ_C ⊗ T_C)        (12)
In formula (12), T_A is the emotion fusion feature obtained after weighting, ∏ represents the connection (concatenation) operator that splices the weighted body part emotional features and the weighted scene-level contextual emotional features, and ⊗ represents the convolution operation between each attribute feature and its fusion weight.
The invention has the following beneficial effects. The invention discloses a scene-level context-aware emotion recognition deep network method and provides a two-stage context-aware emotion recognition network. On the one hand, adopting the two-stage context emotion recognition network addresses the practical shortcoming that existing image-based emotion recognition tasks mainly target face image data; on the other hand, the degree to which features of different attributes influence human emotion is fully considered during feature fusion, enriching image-based emotion recognition research while improving the prediction performance of the model.
Drawings
FIG. 1 is an overall flow diagram of a scene level context aware emotion recognition deep network method of the present invention;
FIG. 2 is a diagram showing a complex emotion image and emotion dimension labeling information thereof;
FIG. 3 is a schematic diagram of a convolution operation;
FIG. 4 is a schematic view of the expansion of the receptive field by small convolution kernel stacking;
FIG. 5 is a schematic view of a pooling operation.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses a scene level context-aware emotion recognition deep network method, which has the specific process shown in figure 1 and specifically comprises the following steps:
Step 1, collect images and determine a training sample set X_in and a test sample set X_tn;
Each training sample and each test sample has a corresponding original emotion annotation value and body annotation value.
For the training sample set X_in, the original emotion annotation is the n × 2-dimensional vector y = [(a_1, v_1), (a_2, v_2), ..., (a_n, v_n)], where (a_1, v_1) are the arousal and valence labels of the 1st sample in the training sample set X_in and (a_n, v_n) are the arousal and valence labels of the n-th sample; the body part annotation is an n × 4-dimensional vector whose i-th entry is the body annotation (B_x1, B_y1, B_x2, B_y2) of the i-th sample, from which the body part image set X_B is obtained.
For the test sample set X_tn, the original emotion annotation value is the m × 2-dimensional vector ty = [(ta_1, tv_1), (ta_2, tv_2), ..., (ta_m, tv_m)], the body part annotation is an m × 4-dimensional vector, and m represents the number of test samples.
Step 2, read the training sample set X_in and extract the body part of each sample according to its body annotation value to obtain a body part image set X_B;
The specific steps for extracting the body part from the training sample set X_in are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corners of the region containing the body part, and compute the position and size parameters of the body part by formula (1):
B_w = B_x2 − B_x1,  B_h = B_y2 − B_y1        (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image.
Step 2.2, according to the position and size parameters obtained in Step 2.1, crop each training sample of the training sample set X_in to obtain the body part image set X_B.
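For illustration only, the cropping in Step 2 can be sketched in Python as follows; the function name, the corner ordering, and the clamping to the image bounds are assumptions of this example rather than details taken from the patent.

```python
# Minimal sketch of the Step 2 body-part cropping, assuming the body annotation
# (Bx1, By1, Bx2, By2) gives two diagonally opposite corners in pixel coordinates.
import numpy as np

def crop_body_part(image: np.ndarray, box):
    """Crop the body region from an H x W x C image given (Bx1, By1, Bx2, By2)."""
    bx1, by1, bx2, by2 = box
    # Order the corners so the crop works regardless of which corner comes first.
    x1, x2 = sorted((int(bx1), int(bx2)))
    y1, y2 = sorted((int(by1), int(by2)))
    # Clamp to the image bounds to avoid empty or out-of-range slices (assumption).
    h, w = image.shape[:2]
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    return image[y1:y2, x1:x2]  # width B_w = x2 - x1, height B_h = y2 - y1

# Building X_B from the training images and their body annotations:
# X_B = [crop_body_part(x, b) for x, b in zip(X_in_images, body_boxes)]
```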
Step 3, normalize the training sample set X_in within the set to obtain a context emotion image set X_im; normalize the body part image set X_B within the set to obtain a normalized body part image set X_body;
The formula for the intra-set normalization of the training sample set X_in is as follows:
X_im = (X_in − x_mean) / σ        (2)
In formula (2), X_in is the training sample set, X_im is the context emotion image set, σ is the standard deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) Σ_{i=1}^{n} x_i        (3)
σ = sqrt( (1/n) Σ_{i=1}^{n} (x_i − x_mean)^2 )        (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
The formula for the intra-set normalization of the body part image set X_B is as follows:
X_body = (X_B − x'_mean) / σ'        (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) Σ_{i=1}^{n} x'_i        (6)
σ' = sqrt( (1/n) Σ_{i=1}^{n} (x'_i − x'_mean)^2 )        (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
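The per-set normalization of formulas (2)–(7) can be sketched as follows, assuming each image set has already been resized to a common size and stacked into one float array; the small epsilon added to the denominator is an assumption of the example to avoid division by zero.

```python
# Minimal sketch of the intra-set normalization of Step 3 for a stacked image
# set of shape (n, H, W, C); the stacking and eps term are assumptions.
import numpy as np

def normalize_set(images: np.ndarray, eps: float = 1e-8):
    """Return (normalized set, mean image, std image) for a stacked image set."""
    x_mean = images.mean(axis=0)                      # mean image, formula (3)/(6)
    sigma = images.std(axis=0)                        # standard-deviation image, formula (4)/(7)
    normalized = (images - x_mean) / (sigma + eps)    # formula (2)/(5)
    return normalized, x_mean, sigma

# X_im,   mean_in, std_in = normalize_set(X_in_stacked)
# X_body, mean_b,  std_b  = normalize_set(X_B_stacked)
```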
Step 4, feed the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional features T_F, and feed the context emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional features T_C;
Step 4.1, initialize the parameters of the whole network architecture, including all convolutional layers, pooling layers and fully connected layers in the network; the weights of each layer are initialized from a Gaussian distribution with mean 0 and standard deviation 1, and the bias terms are uniformly initialized to 0.001;
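A sketch of this initialization in PyTorch might look as follows; the use of PyTorch and of nn.Conv2d/nn.Linear modules is an assumption of the example, not stated in the patent.

```python
# Sketch of the Step 4.1 initialization: Gaussian(0, 1) weights, bias 0.001.
import torch.nn as nn

def init_weights(module: nn.Module):
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=1.0)   # Gaussian, mean 0, std 1
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.001)            # biases uniformly set to 0.001

# model.apply(init_weights)
```

In practice a standard deviation of 1 is unusually large for deep networks; the value above simply mirrors the initialization stated in Step 4.1.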
Step 4.2, feed the body part image set X_body into the upper-layer convolutional neural network and feed the context emotion image set X_im into the lower-layer convolutional neural network; the upper-layer and lower-layer convolutional neural network models have the same structure and both adopt the VGG16 network architecture, whose parameters are shown in Table 1 below:
Table 1: Emotion feature extraction convolutional network architecture parameters
As can be seen from the network architecture parameters in Table 1, the five convolutional layers C1, C2, C3, C4, and C5 in the network structure have 64, 128, 256, 512, and 512 feature maps, respectively. Each feature map is formed by convolving the input image or the output X_m of the previous layer with the corresponding number of convolution templates K_uv and adding a bias term b_v; the convolution process is shown in FIG. 3, and the feature map is computed as:
X_v = Σ_m X_m ⊛ K_uv + b_v        (13)
In formula (13), u takes values in {1, 2, 3, 4, 5} and indexes the convolutional layers, v indexes the convolution templates of each layer (64, 128, 256, 512, and 512, respectively), and ⊛ denotes a convolution operation with step size 1. The convolution kernels are all of size 3 × 3; stacking small convolution kernels enlarges the receptive field of the convolutional layers while effectively reducing the number of parameters, and a schematic diagram of the receptive field is shown in FIG. 4.
The pooling layers S1, S2, S3, and S4 down-sample the outputs of the corresponding convolutional layers by max pooling; in the present invention the pooling region size is 2 × 2 with a step size of 2, and the pooling process is shown in FIG. 5. For example, a 2 × 2 region of the 1st feature map X_m of convolutional layer C1 is sampled to produce the first output O_1 of the 1st feature map of pooling layer S1, where the sampling takes the maximum value in the 2 × 2 region; the other outputs are obtained similarly, and the horizontal and vertical spatial resolutions after sampling become 1/2 of the original.
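A small NumPy illustration of 2 × 2 max pooling with step size 2 on a single 4 × 4 feature map, matching the description above; the sample values are invented purely for the example.

```python
# Each 2 x 2 region is replaced by its maximum, halving the spatial resolution.
import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 2],
                        [2, 2, 3, 4]], dtype=float)

pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[4. 2.]
                #  [2. 5.]]
```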
Step 4.3, after the normalized body part image set X_body and the context emotion image set X_im pass through the upper-layer and lower-layer convolutional neural networks, respectively, the body part emotional features T_F and the scene-level contextual emotional features T_C are obtained; the calculation process can be expressed by the following formulas:
T_F = F(X_body, W_F)        (8)
T_C = F(X_im, W_C)        (9)
In formula (8), W_F represents the network parameters of the upper layer related to body part emotional feature extraction; in formula (9), W_C represents the network parameters of the lower layer related to scene-level contextual feature extraction; F denotes the convolution and pooling operations of the feature extraction network;
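A minimal two-stream sketch of Step 4 in PyTorch, using torchvision's VGG16 convolutional trunk for both streams; the choice of torchvision and of a 224 × 224 input size are assumptions of this example.

```python
# Two-stream feature extraction corresponding to formulas (8) and (9).
import torch
import torch.nn as nn
from torchvision.models import vgg16

class TwoStreamExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # Upper stream: body-part emotional features T_F (parameters W_F).
        self.upper = vgg16().features
        # Lower stream: scene-level contextual features T_C (parameters W_C).
        self.lower = vgg16().features

    def forward(self, x_body, x_im):
        t_f = self.upper(x_body)   # T_F = F(X_body, W_F), formula (8)
        t_c = self.lower(x_im)     # T_C = F(X_im, W_C), formula (9)
        return t_f, t_c

# t_f, t_c = TwoStreamExtractor()(torch.randn(1, 3, 224, 224),
#                                 torch.randn(1, 3, 224, 224))
```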
step 5, emotional characteristics T of the body partFContext and context sentiment feature T at scene levelCRespectively sending into the upper adaptive layer and the lower adaptive layer for adaptive weight learning, and outputting the fusion weight lambda of the body part by the upper adaptive layerFThe adaptive layer output context fusion weight λ of the lower layerC;
The adaptive layer network structures of the upper and lower layers are completely the same; each comprises a max pooling layer, two convolutional layers and a Softmax layer, and the overall structural parameters are shown in Table 2 below:
Table 2: Adaptive fusion network architecture parameters
Finally, the body part fusion weight λ_F and the context fusion weight λ_C are output through the Softmax layer; the calculation process is as follows:
λ_F = F(T_F, W_D)        (10)
λ_C = F(T_C, W_E)        (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer. The final Softmax layer of the adaptive network constrains the fusion weights so that λ_F + λ_C = 1.
Step 6, perform weighted fusion of the body part emotional features T_F and the scene-level contextual emotional features T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain the emotion fusion feature T_A that incorporates the context information; T_A is then linearly mapped by a fully connected layer to initial predicted values of arousal and valence. A KL divergence loss function measures the loss between these initial predictions and the corresponding original emotion annotation values; the loss is back-propagated through the network and, over multiple iterations, the network weights are updated so that the loss gradually decreases and the algorithm converges, completing training and yielding the network model.
Step 6.1, perform weighted fusion of the body part emotional features T_F and the scene-level contextual emotional features T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain the emotion fusion feature T_A; the expression is as follows:
T_A = (λ_F ⊗ T_F) ∏ (λ_C ⊗ T_C)        (12)
In formula (12), ∏ represents the connection (concatenation) operator, meaning that the weighted body part emotional features and the weighted scene-level contextual emotional features are spliced together, and ⊗ represents the convolution operation between each attribute feature and its fusion weight;
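The adaptive weighting of Step 5 and the fusion of Step 6.1 can be sketched as follows. Each branch follows the structure named above (one max-pooling layer, two convolutional layers, a Softmax constraint), but the kernel sizes, channel counts, intermediate activation, and the reduction of each branch to a scalar score are assumptions of this example; the shared Softmax enforces λ_F + λ_C = 1, and the weighted features are concatenated as in formula (12).

```python
# Sketch of the adaptive fusion: branch scores -> softmax weights -> weighted concat.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.MaxPool2d(2),                                     # max pooling layer
                nn.Conv2d(channels, 128, kernel_size=3, padding=1),  # first conv layer
                nn.ReLU(inplace=True),                               # activation (assumption)
                nn.Conv2d(128, 1, kernel_size=1),                    # second conv layer -> score map
                nn.AdaptiveAvgPool2d(1),                             # reduce to one score (assumption)
            )
        self.upper = branch()   # produces the body-part score (parameters W_D)
        self.lower = branch()   # produces the context score (parameters W_E)

    def forward(self, t_f, t_c):
        scores = torch.cat([self.upper(t_f).flatten(1),
                            self.lower(t_c).flatten(1)], dim=1)      # shape (N, 2)
        lam = torch.softmax(scores, dim=1)                           # lambda_F + lambda_C = 1
        lam_f, lam_c = lam[:, 0:1, None, None], lam[:, 1:2, None, None]
        # Weight each feature map and splice the two streams together (formula (12)).
        return torch.cat([lam_f * t_f, lam_c * t_c], dim=1)

# t_a = AdaptiveFusion()(t_f, t_c)
```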
Step 6.2, send the fused feature T_A to the fully connected layers for processing; because the predicted values are continuous, the last fully connected layer uses a linear activation function. The fully connected layer parameters are listed in the following table:
Fully connected layer parameter table
Step 6.3, linearly map the final 256-dimensional emotion feature to the 2-dimensional predicted label values (arousal and valence) through the fully connected layer Fc10; KL divergence is adopted as the loss function to measure the loss between the predicted label values and the original label values, the loss is back-propagated through the network over 80 iterations, the network weights are updated, the loss gradually decreases, the algorithm gradually converges, and training is completed.
The adopted loss function is the KL divergence, which is defined as follows:
KL(p ∥ q) = Σ_{i=1}^{n} p(y_i) log( p(y_i) / q(ly_i) )        (14)
In formula (14), p(y_i) represents the true distribution of the original emotion labels y, q(ly_i) represents the distribution of the model-predicted label values ly, and n represents the total number of training samples.
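A sketch of the regression head and KL-divergence loss of Steps 6.2–6.3 in PyTorch is shown below; the hidden size and the use of a softmax to turn the two continuous outputs and labels into distributions for KLDivLoss are assumptions of this example.

```python
# Fused feature -> fully connected layers -> 2-dim (arousal, valence); KL loss.
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, in_features, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),        # linear activation: continuous arousal / valence
        )

    def forward(self, t_a):
        return self.fc(torch.flatten(t_a, 1))

kl_loss = nn.KLDivLoss(reduction="batchmean")

def training_loss(pred, target):
    # KLDivLoss expects log-probabilities for the prediction and probabilities
    # for the target; converting the 2-dim outputs with (log-)softmax is an
    # assumption of this example.
    return kl_loss(torch.log_softmax(pred, dim=1), torch.softmax(target, dim=1))
```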
The back propagation of the convolutional neural network employed by the present invention includes three cases:
(1) When the pooling layer is followed by a fully connected layer, the error is back-propagated from the fully connected layer into the down-sampling layers, and the gradient of each pixel in the feature map must be obtained.
As shown in formula (15), the bias gradient δ_j^l of the l-th layer is obtained as
δ_j^l = ( δ_j^{l+1} ⊛ rot180(W_j^{l+1}) ) ⊙ f′(u_j^l)        (15)
where f′(u_j^l) is the derivative of the activation function of the l-th layer, j indexes the feature maps of the current layer, and δ_j^{l+1} is the bias gradient of the (l+1)-th layer: the (l+1)-th layer weight matrix W_j^{l+1} is rotated by 180 degrees, the neighborhood around δ_j^{l+1} is zero-padded, and the two are convolved, where ⊙ denotes the element-wise (dot) product of two matrices. After the bias gradients of the corresponding elements in the current-layer feature map are obtained, the bias gradient and the weight gradient of the down-sampling layer are given by formula (16), where d_j^l = downsample(x_j^{l-1}) is the down-sampling result of the j-th feature map of the (l−1)-th layer.
(2) When the convolutional layer is connected after the pooling layer, the solution of the bias and the weight gradient is the same as in the case (1).
(3) When the layer following the convolutional layer is a pooling layer, the feature maps are in one-to-one correspondence. Similarly, the bias gradient δ_j^l of each pixel in the current-layer feature map is first solved:
δ_j^l = w_j^{l+1} ( f′(u_j^l) × upsample(δ_j^{l+1}) )        (17)
In formula (17), upsample(δ_j^{l+1}) denotes up-sampling δ_j^{l+1}: the j-th result of the (l+1)-th layer down-sampling is up-sampled and restored to the same size as the convolution feature map so that it can be element-wise multiplied with the matrix f′(u_j^l); the bias gradient and the weight gradient of the convolutional layer are then given by formulas (18) and (19).
In formulas (18) and (19), w_j^l is the convolution kernel corresponding to the j-th feature map x_j^l of the l-th layer, and p_j^l is the result obtained by convolving the j-th feature map x_j^{l-1} of the (l−1)-th layer with the convolution kernel w_j^l.
Step 7, extract the body part of each test sample in the test sample set X_tn according to Step 2 to obtain a test body part image set X_tB; then, following Step 3, normalize the test sample set X_tn and the test body part image set X_tB, feed them into the network model obtained in Step 6, and finally obtain the predicted label values of the test sample set X_tn.
The specific process of step 7 is as follows:
Step 7.1, read the body annotation (tB_x1, tB_y1, tB_x2, tB_y2) of each sample in the test sample set X_tn and calculate the position and size parameters by formula (1);
Step 7.2, according to the position and size parameters obtained in Step 7.1, crop the test sample set X_tn to obtain the test body part image set X_tB.
Step 7.3, referring to Step 3, normalize the test sample set X_tn and the test body part image set X_tB within their respective sets to obtain the corresponding test context emotion image set X_tm and the normalized test body part image set X_tbody;
Step 7.4, feed the normalized test body part image set X_tbody into the upper-layer structure of the network model obtained in Step 6 and feed the test context emotion image set X_tm into the lower-layer structure of the network model; the predicted label values of the test sample set X_tn are then obtained through model prediction.
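Putting the pieces together, the test-time flow of Step 7 can be sketched as follows, reusing the hypothetical helpers from the earlier examples; the composition below is an illustration under those assumptions, not the literal patented implementation.

```python
# Inference sketch: upper/lower streams -> adaptive fusion -> (arousal, valence).
import torch

@torch.no_grad()
def predict(extractor, fusion, head, x_tbody, x_tm):
    t_f, t_c = extractor(x_tbody, x_tm)   # upper / lower streams (Step 7.4)
    t_a = fusion(t_f, t_c)                # adaptive weighted fusion
    return head(t_a)                      # predicted (arousal, valence) per test sample
```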
Examples
The experiments of the invention are carried out on the EMOTIC database. The EMOTIC dataset provides rich emotional images in complex scenes; the images contain not only the subject to be analyzed but also abundant scene-level context information such as the surrounding environment and other factors. The dataset has 23554 annotated samples, divided into 17077 training samples, 2088 validation samples, and 4389 test samples. The annotation information comprises both discrete labels and continuous dimensional labels, as well as a body part annotation for the subject in each image, which makes scene-level context research convenient; some of the complex emotion images and their annotations are shown in FIG. 2.
The experimental results are compared as follows:
1) Influence of different feature fusion modes on emotion recognition
Because features extracted from different network structures often have different attributes, directly concatenating the two kinds of features, namely the body part emotional features and the scene-level contextual emotional features, does not provide optimal discrimination. Therefore, to verify the effectiveness of the adaptive fusion network, under the same experimental setup the features output by the two convolutional neural networks were fused either by direct concatenation or by the adaptive fusion network, and the experimental results are shown in Table 3 below:
Table 3: Influence of different feature fusion modes on emotion recognition
As can be seen from the data in the table, for fusing emotional features the adaptive fusion network designed by the invention is superior to directly concatenating the two kinds of attribute features. This verifies the effectiveness of introducing an adaptive fusion network into the contextual emotion recognition network structure.
Claims (9)
1. A scene-level context-aware emotion recognition deep network method, characterized by specifically comprising the following steps:
Step 1, collect images and determine a training sample set X_in and a test sample set X_tn;
Step 2, read the training sample set X_in and extract the body part of each sample according to its body annotation value to obtain a body part image set X_B;
Step 3, normalize the training sample set X_in within the set to obtain a context emotion image set X_im; normalize the body part image set X_B within the set to obtain a normalized body part image set X_body;
Step 4, feed the normalized body part image set X_body into the upper-layer convolutional neural network to extract the body part emotional features T_F, and feed the context emotion image set X_im into the lower-layer convolutional neural network to extract the scene-level contextual emotional features T_C;
Step 5, feed the body part emotional features T_F and the scene-level contextual emotional features T_C into the upper and lower adaptive layers, respectively, for adaptive feature learning; the upper adaptive layer outputs the body part fusion weight λ_F and the lower adaptive layer outputs the context fusion weight λ_C;
Step 6, perform weighted fusion of the body part emotional features T_F and the scene-level contextual emotional features T_C with the body part fusion weight λ_F and the context fusion weight λ_C to obtain the emotion fusion feature T_A that incorporates the context information; T_A is then linearly mapped by a fully connected layer to initial predicted values of arousal and valence; a KL divergence loss function measures the loss between these initial predictions and the corresponding original emotion annotation values, the loss is back-propagated through the network and, over multiple iterations, the network weights are updated so that the loss gradually decreases and the algorithm converges, completing training and yielding the network model;
Step 7, extract the body part of each test sample in the test sample set X_tn according to Step 2 to obtain a test body part image set X_tB; then, following Step 3, normalize the test sample set X_tn and the test body part image set X_tB, feed them into the network model obtained in Step 6, and finally obtain the predicted label values of the test sample set X_tn.
2. The method as claimed in claim 1, wherein the specific steps for extracting the body part from the training sample set X_in in Step 2 are as follows:
Step 2.1, read the body annotation (B_x1, B_y1, B_x2, B_y2) of each sample in the training sample set X_in, where (B_x1, B_y1) and (B_x2, B_y2) are the coordinates of two diagonally opposite corners of the region containing the body part, and compute the position and size parameters of the body part by formula (1):
B_w = B_x2 − B_x1,  B_h = B_y2 − B_y1        (1)
In formula (1), B_w represents the width of the body part image and B_h represents the height of the body part image;
3. The method as claimed in claim 1, wherein the formula for the intra-set normalization of the training sample set X_in in Step 3 is as follows:
X_im = (X_in − x_mean) / σ        (2)
In formula (2), X_in is the training sample set, X_im is the context emotion image set, σ is the standard deviation image of the training sample set, and x_mean is the mean image of the training sample set;
x_mean and σ in formula (2) are defined as follows:
x_mean = (1/n) Σ_{i=1}^{n} x_i        (3)
σ = sqrt( (1/n) Σ_{i=1}^{n} (x_i − x_mean)^2 )        (4)
In formulas (3) and (4), x_i represents the i-th sample of the training sample set X_in, and n represents the total number of training samples, n ≥ 1.
4. The method as claimed in claim 1, wherein the formula for the intra-set normalization of the body part image set X_B in Step 3 is as follows:
X_body = (X_B − x'_mean) / σ'        (5)
In formula (5), X_B is the body part image set, X_body is the normalized body part image set, σ' is the standard deviation image of the body part image set, and x'_mean is the mean image of the body part image set;
x'_mean and σ' in formula (5) are defined as follows:
x'_mean = (1/n) Σ_{i=1}^{n} x'_i        (6)
σ' = sqrt( (1/n) Σ_{i=1}^{n} (x'_i − x'_mean)^2 )        (7)
In formulas (6) and (7), x'_i represents the i-th image of the body part image set X_B, and n represents the total number of training samples, n ≥ 1.
5. The method for emotion recognition depth network based on scene-level context awareness, according to claim 1, wherein in step 4, the convolutional neural network at the upper layer and the convolutional neural network at the lower layer have the same structural parameters, and both adopt VGG16 architecture.
6. The method as claimed in claim 1, wherein the body part emotional features T_F and the contextual emotional features T_C in Step 4 are computed as follows:
T_F = F(X_body, W_F)        (8)
T_C = F(X_im, W_C)        (9)
In formula (8), W_F represents all parameters of the convolutional and pooling layers of the upper-layer convolutional neural network; in formula (9), W_C represents all parameters of the convolutional and pooling layers of the lower-layer convolutional neural network; F denotes the convolution and pooling operations of the feature extraction network.
7. The method as claimed in claim 1, wherein the body part fusion weight λ_F and the context fusion weight λ_C in Step 5 are computed as follows:
λ_F = F(T_F, W_D)        (10)
λ_C = F(T_C, W_E)        (11)
In formula (10), W_D denotes the network parameters of the upper adaptive layer; in formula (11), W_E denotes the network parameters of the lower adaptive layer; and λ_F + λ_C = 1.
8. The method as claimed in claim 1, wherein in Step 5 the network structures of the upper adaptive layer and the lower adaptive layer are completely the same, with the following specific network architecture parameters:
the upper adaptive layer and the lower adaptive layer respectively comprise a maximum pooling layer, two convolution layers and a Softmax layer.
9. The method as claimed in claim 1, wherein in Step 6 the weighted fusion of the body part emotional features T_F and the scene-level contextual features T_C with the body part fusion weight λ_F and the context fusion weight λ_C is computed as follows:
T_A = (λ_F ⊗ T_F) ∏ (λ_C ⊗ T_C)        (12)
In formula (12), T_A is the emotion fusion feature obtained after weighting, ∏ represents the connection (concatenation) operator that splices the weighted body part emotional features and the weighted scene-level contextual emotional features, and ⊗ represents the convolution operation between each attribute feature and its fusion weight.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010664287.3A CN111985532B (en) | 2020-07-10 | 2020-07-10 | Scene-level context-aware emotion recognition deep network method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010664287.3A CN111985532B (en) | 2020-07-10 | 2020-07-10 | Scene-level context-aware emotion recognition deep network method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111985532A (en) | 2020-11-24
CN111985532B CN111985532B (en) | 2021-11-09 |
Family
ID=73439067
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010664287.3A Active CN111985532B (en) | 2020-07-10 | 2020-07-10 | Scene-level context-aware emotion recognition deep network method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111985532B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105512680A (en) * | 2015-12-02 | 2016-04-20 | 北京航空航天大学 | Multi-view SAR image target recognition method based on depth neural network |
WO2019174376A1 (en) * | 2018-03-14 | 2019-09-19 | 大连理工大学 | Lung texture recognition method for extracting appearance and geometrical feature based on deep neural network |
CN108830296A (en) * | 2018-05-18 | 2018-11-16 | 河海大学 | A kind of improved high score Remote Image Classification based on deep learning |
CN109977413A (en) * | 2019-03-29 | 2019-07-05 | 南京邮电大学 | A kind of sentiment analysis method based on improvement CNN-LDA |
CN110399490A (en) * | 2019-07-17 | 2019-11-01 | 武汉斗鱼网络科技有限公司 | A kind of barrage file classification method, device, equipment and storage medium |
CN110472245A (en) * | 2019-08-15 | 2019-11-19 | 东北大学 | A kind of multiple labeling emotional intensity prediction technique based on stratification convolutional neural networks |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114764906A (en) * | 2021-01-13 | 2022-07-19 | 长沙中车智驭新能源科技有限公司 | Multi-sensor post-fusion method for automatic driving, electronic equipment and vehicle |
CN112733756A (en) * | 2021-01-15 | 2021-04-30 | 成都大学 | Remote sensing image semantic segmentation method based on W divergence countermeasure network |
CN113011504A (en) * | 2021-03-23 | 2021-06-22 | 华南理工大学 | Virtual reality scene emotion recognition method based on visual angle weight and feature fusion |
CN113011504B (en) * | 2021-03-23 | 2023-08-22 | 华南理工大学 | Virtual reality scene emotion recognition method based on visual angle weight and feature fusion |
CN113076905A (en) * | 2021-04-16 | 2021-07-06 | 华南理工大学 | Emotion recognition method based on context interaction relationship |
CN117636426A (en) * | 2023-11-20 | 2024-03-01 | 北京理工大学珠海学院 | Attention mechanism-based facial and scene emotion recognition method |
Also Published As
Publication number | Publication date |
---|---|
CN111985532B (en) | 2021-11-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111985532B (en) | Scene-level context-aware emotion recognition deep network method | |
CN108460338B (en) | Human body posture estimation method and apparatus, electronic device, storage medium, and program | |
CN110532900B (en) | Facial expression recognition method based on U-Net and LS-CNN | |
CN112990054B (en) | Compact linguistics-free facial expression embedding and novel triple training scheme | |
CN112800903B (en) | Dynamic expression recognition method and system based on space-time diagram convolutional neural network | |
CN111028319B (en) | Three-dimensional non-photorealistic expression generation method based on facial motion unit | |
CN114220035A (en) | Rapid pest detection method based on improved YOLO V4 | |
CN107066583A (en) | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity | |
CN108830237B (en) | Facial expression recognition method | |
CN109740686A (en) | A kind of deep learning image multiple labeling classification method based on pool area and Fusion Features | |
CN112949622B (en) | Bimodal character classification method and device for fusing text and image | |
CN112949740B (en) | Small sample image classification method based on multilevel measurement | |
CN113780249B (en) | Expression recognition model processing method, device, equipment, medium and program product | |
CN112541529A (en) | Expression and posture fusion bimodal teaching evaluation method, device and storage medium | |
CN114936623A (en) | Multi-modal data fused aspect-level emotion analysis method | |
CN110111365B (en) | Training method and device based on deep learning and target tracking method and device | |
CN108182475A (en) | It is a kind of based on automatic coding machine-the multi-dimensional data characteristic recognition method of the learning machine that transfinites | |
CN111832573A (en) | Image emotion classification method based on class activation mapping and visual saliency | |
Zhai et al. | Face verification across aging based on deep convolutional networks and local binary patterns | |
CN112819510A (en) | Fashion trend prediction method, system and equipment based on clothing multi-attribute recognition | |
Zheng et al. | Facial expression recognition based on texture and shape | |
CN117576248A (en) | Image generation method and device based on gesture guidance | |
CN114155560B (en) | Light weight method of high-resolution human body posture estimation model based on space dimension reduction | |
CN113780350A (en) | Image description method based on ViLBERT and BilSTM | |
Wen | Research on Modern Book Packaging Design Based on Aesthetic Evaluation Based on a Deep Learning Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |