CN111914594B - Group emotion recognition method based on motion characteristics - Google Patents

Group emotion recognition method based on motion characteristics

Info

Publication number
CN111914594B
CN111914594B (application CN201910383943.XA)
Authority
CN
China
Prior art keywords: time, features, emotion recognition, network, level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910383943.XA
Other languages
Chinese (zh)
Other versions
CN111914594A (en)
Inventor
卿粼波
许盛宇
吴晓红
何小海
滕奇志
周文俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201910383943.XA
Publication of CN111914594A
Application granted
Publication of CN111914594B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention provides a group emotion recognition method based on motion characteristics, which analyzes emotion in scene video sequences using a multi-channel group emotion recognition network. The method comprises the following steps: constructing a multi-channel group emotion recognition network; extracting low-level motion features of different time segments in parallel; rearranging and fusing the low-level features extracted by each channel along the time dimension; and obtaining global high-level features through a 3D residual module to realize group emotion recognition. The method avoids the bias and long runtime of manual feature extraction, giving it stronger adaptability. In addition, the multi-channel network extracts features from the long video sequence in temporal order, fully exploiting the temporal correlation between frames; rearranging and fusing the low-level temporal features along the time dimension reduces the coupling between features and improves the accuracy and efficiency of group emotion recognition.

Description

Group emotion recognition method based on motion characteristics
Technical Field
The invention relates to emotion recognition in the field of deep learning, and in particular to a group emotion recognition method based on motion characteristics.
Background
Crowd emotion analysis judges the emotional state of a crowd by analyzing its behavior, clothing, and similar cues. Video is ubiquitous in real life, for example in unmanned aerial vehicle surveillance, network-shared video, and 3D video. By analyzing the emotion of the crowd in a video, the crowd's emotion and its changes can be tracked dynamically, which gives video emotion recognition a wide application prospect.
Group emotion recognition has mainly analyzed the emotions of people in a scene when the subjects are close to the camera. However, in an era of rapid development, analyzing only clearly visible faces and the emotions of small groups no longer fully satisfies the need to perceive people's emotional states. Research therefore needs not only to move from the face of an individual to a group, but also from small groups to large-scale crowds far from the camera. As the world's population grows year by year, large-scale gatherings and crowd events become more frequent, making emotion analysis of crowds particularly important.
Traditional crowd emotion recognition algorithms mainly use shallow algorithms to extract motion features between video frames. Shallow algorithms (support vector machines, single-layer neural networks, etc.) require manually extracted features; given a limited number of samples and computing units, a shallow structure struggles to express the features of a complex model effectively, and its generalization ability is clearly insufficient when the studied object carries rich meaning, so shallow structures have inherent limitations. Existing research on crowds focuses mainly on behavior, with little work on crowd emotion, even though the basic type of group movement can reflect a representative mood of the group. These traditional algorithms also tend to extract overly simple features, so the resulting analysis is not deep enough. Only a small body of related work fully exploits the advantages of deep learning to extract group motion features automatically, enrich the features, and analyze group emotion in video.
Disclosure of Invention
The invention aims to provide a group emotion recognition method based on motion characteristics. It combines deep learning with group emotion in video, introduces a 3D residual convolutional neural network structure, and analyzes temporal features in group videos to obtain the motion states of the people in the video and, from them, their emotion information.
For convenience of explanation, the following concepts are first introduced:
Convolutional Neural Network (CNN): a multilayer feedforward neural network designed from the inspiration of the visual neural mechanism. Each layer is composed of multiple two-dimensional planes, each neuron on a plane works independently, and the network mainly comprises feature extraction layers and feature mapping layers.
3D Residual Module: to avoid the difficulty of directly learning an identity mapping, the stacked layers are fitted to the residual function F(x) = H(x) - x rather than to H(x) itself; the main idea is to subtract the identical part of the signal and highlight small variations. Replacing the 2D convolutions in a residual module with 3D convolutions yields the 3D residual module.
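The residual idea above can be sketched in a few lines of NumPy. This is an illustrative toy, not the patented module: it uses a single-channel, naive 3D convolution and fixed kernels, whereas the actual network uses learned multi-channel 3D convolutions. The function names are hypothetical. The zero-kernel case shows why residual learning is attractive: when F(x) = 0, the block reduces exactly to the identity mapping.

```python
import numpy as np

def conv3d_same(x, k):
    """Naive 'same'-padded single-channel 3D convolution, for illustration only."""
    p = k.shape[0] // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x)
    D, H, W = x.shape
    for d in range(D):
        for h in range(H):
            for w in range(W):
                out[d, h, w] = np.sum(
                    xp[d:d + k.shape[0], h:h + k.shape[1], w:w + k.shape[2]] * k
                )
    return out

def residual_block_3d(x, k1, k2):
    """y = x + F(x): the stacked layers learn only the residual F(x) = H(x) - x."""
    f = np.maximum(conv3d_same(x, k1), 0.0)  # first 3D conv -> ReLU
    f = conv3d_same(f, k2)                   # second 3D conv
    return x + f                             # identity shortcut

x = np.random.rand(4, 8, 8)   # a small (time, height, width) volume
k = np.zeros((3, 3, 3))       # zero kernels => F(x) = 0
y = residual_block_3d(x, k, k)
print(np.allclose(y, x))      # True: the block degenerates to the identity
```

Note how the shortcut requires the convolution to preserve the input shape ('same' padding); the real 3D residual module satisfies the same constraint so that x and F(x) can be added elementwise.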
The invention specifically adopts the following technical scheme:
a group emotion recognition method based on motion characteristics is characterized by comprising the following steps:
a. dividing the long video sequence in time sequence, and respectively extracting low-level motion characteristics of each segment by channels;
b. analyzing low-level motion characteristics in the group video by using a 3D residual convolutional neural network;
c. rearranging and fusing the motion characteristics of the multi-channel network in the step a in the time dimension, and analyzing global high-level characteristics;
the method mainly comprises the following steps:
(1) preprocessing the group scene video sequence and uniformly scaling it to a resolution of 112 × 112;
(2) dividing the video sequence to be analyzed into 4 short videos and taking the first 4 frames of each as network input, obtaining low-level motion features at different points in the time sequence;
(3) introducing a multi-channel group emotion recognition network (channels Channel1 to Channel4) based on a 3D residual convolutional neural network, and extracting the low-level motion features of the corresponding time segment of each short video;
(4) recombining and fusing the acquired low-level motion features in the time dimension through a fusion module, sending the combined global low-level features into a 3D residual module, analyzing the global high-level features of the long video, and finally classifying to obtain the group emotion.
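Steps (1) and (2) above amount to simple array slicing. The sketch below, under the assumption that a preprocessed video is stored as a (frames, height, width, channels) array, shows one plausible way to cut a long sequence into 4 equal segments and take the first 4 frames of each as the input clip for the corresponding network channel; the function name and the equal-split policy are illustrative, not prescribed by the patent text.

```python
import numpy as np

def sample_segments(video, n_segments=4, n_frames=4):
    """Split a long video (T, H, W, C) into n_segments equal parts in time
    order and take the first n_frames of each part, one clip per channel."""
    T = video.shape[0]
    seg_len = T // n_segments
    clips = [video[i * seg_len : i * seg_len + n_frames] for i in range(n_segments)]
    return np.stack(clips)  # (n_segments, n_frames, H, W, C)

video = np.random.rand(64, 112, 112, 3)  # already scaled to 112 x 112, step (1)
clips = sample_segments(video)
print(clips.shape)  # (4, 4, 112, 112, 3)
```

Each of the 4 clips is then fed to its own channel, so only 16 of the 64 frames are processed while the clips still span the whole sequence, which is the data compression claimed in the benefits below.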
The invention has the beneficial effects that:
(1) It fully exploits the self-learning ability of deep learning: the machine learns image features automatically, which avoids the bias and inefficiency of manually selected features and gives stronger adaptability.
(2) The original long video sequence is divided into short segments in temporal order, compressing the data volume while keeping global information and improving network speed and computational efficiency.
(3) A 3D convolutional neural network replaces the 2D convolutional neural network for feature extraction, fully retaining the temporal information between frames, and the 3D residual module optimizes the performance and efficiency of the network.
(4) The motion features extracted by the channels are rearranged and fused in the time dimension; correlated features from the 4 channels are fused together, reducing the coupling between features, fully mining the temporal correlation of the motion features, and improving the network's group emotion analysis.
(5) Combining deep learning with emotion analysis of group scenes addresses the low accuracy of traditional methods and increases the research value.
Drawings
Fig. 1 is a composition diagram of the motion-feature group emotion recognition network based on a 3D convolutional neural network.
Fig. 2 illustrates how the low-level motion features extracted by the multiple channels are rearranged and fused in the time dimension.
Detailed Description
The present invention is described in further detail with reference to the drawings and examples. It should be noted that the following examples only illustrate the invention and should not be construed as limiting its scope; insubstantial modifications and adaptations made to the invention by those skilled in the art based on the above disclosure still fall within the scope of the invention.
The group emotion recognition method based on the motion characteristics specifically comprises the following steps:
(1) A mixed data set combining the CUHK crowd data set, the UCF data set, the Web data set, and the PET2009 data set is used. Each long video in the data set is divided into 4 short videos, each short video is split into groups of 4 frames and recombined, forming multiple recombined short video sequences for training and thereby expanding the training set.
(2) The model is first pre-trained on the Kinetics human motion video data set. The expanded short video data set is then fed in batches into the 4 channels of the network, and the motion features of each time segment are extracted to obtain the corresponding low-level motion features.
(3) The low-level motion features from the 4 channels are recombined by the short-video spatio-temporal feature fusion module: the features acquired by each channel are first split into 4 feature segments, and correlated feature segments are then stacked together to obtain the recombined global low-level feature.
(4) The fused global low-level features are sent into the subsequent 3D residual module for further training to obtain global high-level features of the long video, which are finally classified into a group emotion. Back propagation optimizes the network parameters according to the classification results until the optimal network model is obtained.
(5) The test set is fed into the network to verify the performance of the model.
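The recombination fusion in step (3), written in claim 1 as (4 × (C × H × W)) → ((C × 4) × H × W), can be sketched with a transpose and a reshape. The assumption that each channel outputs a (C, H, W) feature tensor and the function name are illustrative, not from the patent:

```python
import numpy as np

def fuse_time_rearrange(feats):
    """feats: (4, C, H, W), the low-level features from the 4 channels.
    For each feature index i, the 4 channels' i-th maps are stacked in time
    order (one 4 x H x W block per index), and the C blocks are concatenated
    sequentially, yielding a fused feature of shape (4*C, H, W)."""
    n_channels, C, H, W = feats.shape
    # transposing makes same-index maps from all channels adjacent before the flatten
    return feats.transpose(1, 0, 2, 3).reshape(n_channels * C, H, W)

feats = np.arange(4 * 2 * 3 * 3, dtype=float).reshape(4, 2, 3, 3)
fused = fuse_time_rearrange(feats)
print(fused.shape)  # (8, 3, 3)
```

The interleaved layout places temporally correlated maps next to each other, so the subsequent 3D residual module convolves across all 4 time segments at once; this is the reduced feature coupling the description claims.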

Claims (4)

1. A group emotion recognition method based on motion characteristics is characterized by comprising the following steps:
a. dividing the long video sequence in temporal order and extracting the low-level motion features of each segment channel by channel;
b. analyzing low-level motion characteristics in the group video by using a 3D residual convolutional neural network;
c. rearranging and fusing the motion characteristics of the multi-channel network in the step a in the time dimension, and analyzing global high-level characteristics;
the method mainly comprises the following steps:
(1) preprocessing the group scene video sequence and uniformly scaling it to a resolution of 112 × 112;
(2) dividing the video sequence to be analyzed into 4 short videos and taking the first 4 frames of each as network input, obtaining low-level motion features at different points in the time sequence;
(3) introducing a multi-channel group emotion recognition network based on a 3D residual convolutional neural network, and extracting the low-level motion features of the corresponding time segments of the 4 short videos with 4 channels sharing weight parameters;
(4) recombining and fusing the acquired low-level motion features (4 × (C × H × W)) in the time dimension through a fusion module: the feature maps of the i-th (i ∈ [0, C]) layer in the 4 channels are spliced sequentially in time order to obtain C feature blocks of 4 × H × W, and the feature blocks are then combined sequentially into a fused feature ((C × 4) × H × W); the combined global low-level features are then sent into a 3D residual module, the global high-level features of the long video are analyzed, and classification finally yields the group emotion.
2. The group emotion recognition method based on motion characteristics as claimed in claim 1, wherein step (2) adopts an average frame extraction method: the video sequence to be analyzed is first divided into 4 short videos, and the first 4 frames of each short video are then taken; the video sequence is thus compressed while retaining a degree of global information, improving computational efficiency.
3. The group emotion recognition method based on motion features as claimed in claim 1, wherein in step (3), a 3D convolutional neural network is used instead of a 2D convolutional neural network for feature extraction, so that time sequence information between frames is fully retained, and the performance and efficiency of the network are optimized by using a 3D residual module.
4. The group emotion recognition method based on motion features as claimed in claim 1, wherein the motion features extracted from the 4 channels respectively in step (4) are rearranged and fused in the time dimension, the features with correlation in the 4 channels are fused together, the coupling between the features is reduced, the correlation of the motion features in the time dimension is fully mined, and the group emotion analysis performance by the network is improved.
CN201910383943.XA, priority and filing date 2019-05-08, Group emotion recognition method based on motion characteristics, Active, CN111914594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910383943.XA CN111914594B (en) 2019-05-08 2019-05-08 Group emotion recognition method based on motion characteristics


Publications (2)

Publication Number Publication Date
CN111914594A CN111914594A (en) 2020-11-10
CN111914594B (en) 2022-07-01

Family

ID=73242780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910383943.XA Active CN111914594B (en) 2019-05-08 2019-05-08 Group emotion recognition method based on motion characteristics

Country Status (1)

Country Link
CN (1) CN111914594B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699774A (en) * 2020-12-28 2021-04-23 深延科技(北京)有限公司 Method and device for recognizing emotion of person in video, computer equipment and medium
CN112699785B (en) * 2020-12-29 2022-06-07 中国民用航空飞行学院 Group emotion recognition and abnormal emotion detection method based on dimension emotion model

Citations (4)

Publication number Priority date Publication date Assignee Title
CN107368798A (en) * 2017-07-07 2017-11-21 四川大学 A kind of crowd's Emotion identification method based on deep learning
CN107958260A (en) * 2017-10-27 2018-04-24 四川大学 A kind of group behavior analysis method based on multi-feature fusion
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN109299700A (en) * 2018-10-15 2019-02-01 南京地铁集团有限公司 Subway group abnormality behavioral value method based on crowd density analysis

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US8195598B2 (en) * 2007-11-16 2012-06-05 Agilence, Inc. Method of and system for hierarchical human/crowd behavior detection
CN107169426B (en) * 2017-04-27 2020-03-31 广东工业大学 Crowd emotion abnormality detection and positioning method based on deep neural network


Non-Patent Citations (2)

Title
卿粼波 et al. Group emotion recognition based on a multi-stream CNN-LSTM network. Application Research of Computers, 2018. *
张严浩. Group behavior analysis based on structured cognitive computing. China Doctoral Dissertations Full-text Database, Information Science and Technology, 2018. *

Also Published As

Publication number Publication date
CN111914594A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
US11010600B2 (en) Face emotion recognition method based on dual-stream convolutional neural network
CN108764072B (en) Blood cell subtype image classification method based on multi-scale fusion
JP7412847B2 (en) Image processing method, image processing device, server, and computer program
CN110837842A (en) Video quality evaluation method, model training method and model training device
CN111914594B (en) Group emotion recognition method based on motion characteristics
CN105160678A (en) Convolutional-neural-network-based reference-free three-dimensional image quality evaluation method
CN110084202A (en) A kind of video behavior recognition methods based on efficient Three dimensional convolution
CN112132197A (en) Model training method, image processing method, device, computer equipment and storage medium
CN108921037B (en) Emotion recognition method based on BN-Inception double-flow network
US20210056357A1 (en) Systems and methods for implementing flexible, input-adaptive deep learning neural networks
CN110472622B (en) Video processing method and related device, image processing method and related device
WO2021184754A1 (en) Video comparison method and apparatus, computer device and storage medium
CN112132797B (en) Short video quality screening method
CN110225368A (en) A kind of video locating method, device and electronic equipment
CN113392781A (en) Video emotion semantic analysis method based on graph neural network
CN111914600A (en) Group emotion recognition method based on space attention model
CN110110812B (en) Stream depth network model construction method for video motion recognition
Mansour et al. Design of integrated artificial intelligence techniques for video surveillance on iot enabled wireless multimedia sensor networks
CN113657272B (en) Micro video classification method and system based on missing data completion
CN114360018A (en) Rendering method and device of three-dimensional facial expression, storage medium and electronic device
CN111401116A (en) Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network
CN112508121B (en) Method and system for sensing outside of industrial robot
Chen et al. Design and implementation of video analytics system based on edge computing
CN109002808A (en) A kind of Human bodys' response method and system
WO2023217138A1 (en) Parameter configuration method and apparatus, device, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant