CN110796081A - Group behavior identification method based on relational graph analysis - Google Patents


Info

Publication number
CN110796081A
CN110796081A (application CN201911036597.4A; granted as CN110796081B)
Authority
CN
China
Prior art keywords
frame
video
group behavior
sampling
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911036597.4A
Other languages
Chinese (zh)
Other versions
CN110796081B (en)
Inventor
李楠楠
张世雄
赵翼飞
李若尘
李革
安欣赏
张伟民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Longgang Intelligent Audiovisual Research Institute
Original Assignee
Shenzhen Longgang Intelligent Audiovisual Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Longgang Intelligent Audiovisual Research Institute
Priority to CN201911036597.4A
Publication of CN110796081A
Application granted
Publication of CN110796081B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

A group behavior identification method based on relational graph analysis comprises the following steps: S1, sparse sampling is carried out on a video sequence containing a group behavior event to obtain video sampling frames that serve as representatives of the video event; S2, the person target features of each single sampling frame are obtained through a target detection network and a dimensionality reduction operation; S3, a graph model is constructed according to the appearance and position relations between individuals, and the single-frame group behavior expression features are extracted using a graph convolutional neural network; and S4, the group behavior expression feature of the whole video is obtained by fusing the multi-frame group behavior features. Compared with traditional video behavior dynamic description methods based on recurrent neural network models, the method provided by the invention can process multiple sampled video frames simultaneously, improving the time efficiency of the algorithm while achieving a leading detection level.

Description

Group behavior identification method based on relational graph analysis
Technical Field
The invention relates to the technical field of machine learning methods and video behavior analysis, in particular to a group behavior identification method based on relational graph analysis.
Background
In recent years, with the great success of deep learning techniques in the field of image analysis and understanding, more and more researchers have begun to use deep neural networks to tackle problems in video behavior analysis, such as video behavior classification and behavior localization along the timeline or the spatial and temporal axes. At present, the video behavior analysis problems handled by deep neural networks involve simple behaviors, such as single-person sports like the high jump and long jump, and daily activities like brushing teeth and combing hair. In real life, however, behaviors often involve multiple participants, and describing a group behavior requires not only describing the behavior of each individual in the scene but also modeling the interaction relationships among individuals. Existing methods use a pre-specified graph model to characterize the interactions between individuals, which lacks flexibility, since a different relation model must be designed in advance for each kind of event; moreover, to describe the dynamics of the motion, a recurrent neural network (RNN) is often used to process continuously sampled frames, which is computationally complex and must process the input video frames sequentially, so the computation time cost is correspondingly high.
Disclosure of Invention
The invention aims to provide a group behavior recognition method based on relational graph analysis, which comprises the steps of giving a section of multi-person video sequence as input, designing a relational graph model to model appearance and position relations between paired individuals, and utilizing a graph convolution neural network model to extract characteristics of single-frame multi-person behaviors; meanwhile, a video time axis sampling method is used for fusing single-frame multi-person behavior characteristics to represent the dynamic characteristics of group behaviors, and the purpose of classifying and identifying the group behaviors is achieved.
The technical scheme provided by the invention is as follows:
according to one aspect of the invention, a group behavior identification method based on relational graph analysis is provided, which comprises the following steps: s1 sparse sampling is carried out on the video sequence containing the group behavior event to obtain a video sampling frame which is used as a representative of the video event; s2, obtaining the character target characteristics of the single-frame sampling frame through a target detection network and dimensionality reduction operation; s3, constructing a graph model according to the appearance and position relation between individuals and extracting the behavior expression characteristics of the single-frame group by utilizing a graph convolution neural network; and S4, obtaining the group behavior expression characteristics of the whole video by fusing the multi-frame group behavior characteristics.
Preferably, in the above group behavior recognition method based on relational graph analysis, in step S1, a number of frames are sparsely extracted from the video sequence by random sampling or by uniform sampling in time order.
Preferably, in the group behavior recognition method based on relational graph analysis, in step S2, features are extracted from the person targets in the sampling frame P_s using the target detection network; the feature extraction process is: P_s is input into the target detection network for forward calculation, and the features of the last convolutional layer corresponding to the region where each person target is located are extracted as that person target's features, denoted F_a; a fully connected layer is then used to perform the dimensionality reduction operation on F_a.
Preferably, in the group behavior identification method based on relational graph analysis, in step S3, for the sampling frame P_s, the person target appearance features and position coordinates obtained after the processing of step S2 are {(f_i^a, l_i)}, i = 1, …, N, where f_i^a is the dimension-reduced feature and l_i is the centre coordinate (x_i, y_i) of the region where the target is located, and a relation network graph is constructed from them. Specifically, for the N person targets, an N x N relation network graph G is computed, where each entry G_ij combines an appearance similarity h_a(f_i^a, f_j^a) and a position similarity h_l(l_i, l_j); the similarity definitions appear only as formula images in the source. The relation of target i relative to target j is normalized using the Softmax function (formula shown as an image in the source).
preferably, in the group behavior recognition method based on the relational graph analysis, in step S3, after the construction of the relational network graph among the N human target objects is completed, the features of the individual human target are weighted and fused by using the convolutional neural network, and the convolutional neural network receives the feature after dimension reduction
Figure BDA0002251668710000029
And taking the relation network graph as input and output to a single personObject-object weighted fusion feature, denoted FgBy sampling the frame PsAll N personal object target features FgTaking an average operation to obtain PsThe single-frame group behavior expression characteristics are recorded as
Figure BDA00022516687100000210
Preferably, in the group behavior identification method based on relational graph analysis, in step S4, the group behavior expression features of all the sampling frames are summed bitwise through the summation operation to obtain the group behavior expression feature of the whole video, denoted F_v; F_v is passed through the fully connected layer for behavior category classification, and the classification result of the whole video is obtained and output.
Preferably, in the above group behavior identification method based on relational graph analysis, the classification loss function in the classification is defined as L_1(y_G, ŷ_G) (formula shown as an image in the source), where y_G is the ground-truth video group behavior category, ŷ_G is the model's predicted behavior category, and L_1 is the cross-entropy loss function.
According to another aspect of the invention, a group behavior identification system based on relational graph analysis is provided, which comprises a single-frame group behavior expression feature extraction module and a multi-frame behavior expression feature fusion and classification module, and is used for realizing the group behavior identification method, wherein the single-frame group behavior expression feature extraction module is used for extracting group behavior expression features from video frames obtained by sparse sampling; and the multi-frame behavior expression feature fusion and classification module is used for carrying out multi-frame fusion on the group behavior expression features extracted from the plurality of video sampling frames and constructing a classifier to classify the video behaviors.
Compared with the prior art, the invention has the beneficial effects that:
the group behavior identification method can realize the identification of the group behaviors in the video. Compared with the current method which adopts a manually designed graph model, the graph model based on appearance and position characteristics can automatically learn the relationship between pairs of individuals in an input video, and can be flexibly applied to wider occasions; meanwhile, compared with the traditional method for modeling the dynamic behavior characteristic by using the recurrent neural network, the method for sampling the video time axis provided by the invention can depict the behavior characteristic in a longer time period, and simultaneously, the characteristic extraction can be carried out on a plurality of time sampling frames in parallel, so that the characteristic that the recurrent neural network processes a plurality of frames in sequence is avoided, and the time efficiency of video processing is effectively improved.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below.
FIG. 1 is a flow chart of a group behavior recognition method based on a relational graph analysis according to the present invention.
Fig. 2 is a schematic structural diagram of an overall neural network of the group behavior recognition method based on the relationship diagram analysis of the present invention.
Detailed Description
The principle of the method of the invention is as follows: 1.) constructing a relational graph model between figures according to the appearance and position relation between paired individuals in a video frame, and extracting group behavior relational feature expression of a video single frame by utilizing a graph convolution neural network on the basis; 2.) randomly extracting a plurality of frames from the whole video event segment by utilizing a video time axis sampling method, extracting the single-frame group behavior relation expression characteristics, and performing multi-frame characteristic fusion to represent the dynamic characteristics of group behaviors on the basis of the single-frame group behavior relation expression characteristics to realize video group behavior identification.
A group behavior recognition method based on relational graph analysis extracts group behavior expression characteristics by performing sparse sampling on videos and constructing a pairwise inter-individual appearance and position relational graph model, and realizes video group behavior recognition by fusing multi-frame group behavior expression characteristics and constructing a classifier. Specifically, a system for realizing the identification method can be decomposed into a single-frame group behavior expression feature extraction module and a multi-frame behavior expression feature fusion and classification module. The single-frame group behavior expression feature extraction module is used for extracting group behavior expression features from video frames obtained by sparse sampling, specifically used for constructing a graph model according to appearance and position relations among individuals and extracting the single-frame group behavior expression features by utilizing a graph convolution neural network, namely extracting the features of multi-person behaviors of a single frame; the multi-frame behavior expression feature fusion and classification module is used for performing multi-frame fusion on the group behavior expression features extracted from a plurality of video sampling frames and constructing a classifier to classify video behaviors, namely fusing the single-frame group behavior expression features in an addition mode and constructing a full-connection-layer classifier to classify the video behaviors.
Compared with the existing method, the method provided by the invention has two main improvements: 1.) a relation graph model is automatically constructed based on the appearance and position characteristics among individuals, and compared with the traditional graph model which is artificially constructed based on prior knowledge, the graph model provided by the invention can be more widely applied to different group behavior expressions; 2.) randomly sampling video frames, extracting single-frame group behavior expression characteristics, and then performing characteristic fusion to represent the whole video group event. Compared with the traditional behavior dynamic characteristic description method based on the recurrent neural network, the method has the advantages that the characteristic extraction processes of all sampling frames can be simultaneously carried out, the characteristic that the recurrent neural network sequentially processes the video frames is avoided, and the time efficiency of video processing is greatly improved.
FIG. 1 is a flow chart of the method for identifying group behavior based on relational graph analysis according to the present invention, which includes steps S1 to S4; FIG. 2 is a schematic diagram of the overall neural network structure of the group behavior recognition method based on relational graph analysis of the present invention. Steps S1 to S4 are described below in conjunction with FIG. 2:
and s1, carrying out sparse sampling on the video sequence containing the group behavior event to obtain a video sample frame as a representative of the video event.That is, a number of frames are sparsely extracted from the video sequence according to a specified rule as a representative of the video event. The rule may be random sampling or uniform sampling according to time sequence, and the sampled video frame is marked as PsAs shown at 1 in fig. 2.
And S2, obtaining the person target features of the single sampling frame through the target detection network and the dimensionality reduction operation. Specifically, the features of the person targets in the sampling frame are extracted and compressed. Using a target detection network 2, such as Faster R-CNN (Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015) built on VGG-16 (Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014), features are extracted from the person targets in the sampling frame P_s. The feature extraction process is: P_s is input into the target detection network for forward calculation, and the features of the last convolutional layer corresponding to the region where each person target is located are extracted as that person target's features, denoted F_a. For the target detection network 2 based on the VGG-16 architecture, the person target feature is the feature of the 3rd convolution unit of the 5th group of convolutional layers at the region where the person target is located. A fully connected layer 3 is then used to reduce F_a to a d-dimensional feature f_i^a (the formulas appear only as images in the source). As shown in FIG. 2, the fully connected layer 3 is drawn processing three different frames for illustration only; in practice it processes all sampled frames. After the processing of step S2, P_s yields N person target features and their position coordinates, {(f_i^a, l_i)}, i = 1, …, N, where f_i^a is the dimension-reduced feature and l_i is the centre coordinate (x_i, y_i) of the region where the target is located.
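The dimensionality-reduction step can be illustrated with a minimal pure-Python fully connected layer. This is a sketch under the assumption that the detector's per-person features F_a arrive as plain vectors; the names `fc_reduce`, `weight`, and `bias` are hypothetical:

```python
def fc_reduce(features, weight, bias):
    """Project each person's appearance feature F_a down to d dimensions
    with one fully connected layer: out_i = weight @ f_i + bias.

    `features`: list of N vectors (length D); `weight`: d rows of length D;
    `bias`: length d. Returns N vectors of length d.
    """
    out = []
    for f in features:                       # one vector per detected person
        row = [sum(w * x for w, x in zip(w_row, f)) + b
               for w_row, b in zip(weight, bias)]
        out.append(row)
    return out
```

In a real pipeline this layer's weights would be learned jointly with the rest of the network.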
And S3, constructing a graph model according to the appearance and position relations among individuals and extracting the single-frame group behavior expression features by using a graph convolutional neural network. Specifically, for the person target appearance features and position coordinates {(f_i^a, l_i)} of the sampling frame P_s, a relation network graph is constructed, and the group behavior features expressed by P_s are extracted with the graph convolutional neural network. The relation network graph is built from the appearance and position features between pairs of people: for N person targets, an N x N matrix G, i.e. the relation network graph (shown as 8 in FIG. 2), is computed to describe their pairwise relations, where each entry G_ij combines an appearance similarity h_a(f_i^a, f_j^a) and a position similarity h_l(l_i, l_j); the similarity definitions appear only as formula images in the source. The relation of target i relative to target j is then normalized using the Softmax function.
after the construction of the relationship network graph G between N human target objects is completed, the characteristics of the single human target are weighted and fused by using a graph convolution neural network (GCN) (ThomasN. Kipf and Max welling. semi-collaborative classification with graphical connectivity network. CoRR, abs/1609.02907,2016.). As shown in operation 4 of FIG. 2, the atlas neural network receives the dimensionality reduced features
Figure BDA0002251668710000057
And the relation network graph G (8 in FIG. 2) is used as input, and the weighted fusion characteristic of the single person target is output and is marked as Fg. By sampling frames PsAll N personal object target features FgAn averaging operation, shown as operation 5 in FIG. 2, may result in PsThe single-frame group behavior expression characteristics are recorded as
Figure BDA0002251668710000058
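The weighted fusion can be illustrated as one linear graph-convolution step: each person's fused feature F_g is the G-weighted sum of the reduced features, and the frame-level feature is their average. Note this is a simplified sketch; the cited GCN also applies a learned weight matrix and a nonlinearity, which are omitted here:

```python
def graph_fuse(G, feats):
    """One linear graph-convolution step: fused_i = sum_j G[i][j] * feats[j].

    Returns the N per-person fused features and the frame-level group
    feature (their elementwise mean), mirroring operations 4 and 5 of
    FIG. 2 in simplified form.
    """
    n, d = len(feats), len(feats[0])
    fused = [[sum(G[i][j] * feats[j][k] for j in range(n)) for k in range(d)]
             for i in range(n)]
    frame_feat = [sum(f[k] for f in fused) / n for k in range(d)]
    return fused, frame_feat
```

With G set to the identity matrix, each person keeps their own feature, which makes the averaging step easy to verify by hand.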
And S4, obtaining the group behavior expression features of the whole video by fusing the multi-frame group behavior features. Specifically, after the single-frame group behavior expression features are obtained, the addition operation 6 sums the group behavior expression features of all the sampling frames bitwise to obtain the group behavior expression feature of the whole video, denoted F_v. F_v is passed through the fully connected layer 7 for behavior category classification, yielding the classification result of the whole video, which is output. The classification loss function is defined as L_1(y_G, ŷ_G) (formula shown as an image in the source), where y_G is the ground-truth video group behavior category, ŷ_G is the model's predicted behavior category, and L_1 is the cross-entropy loss function.
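The multi-frame fusion, classification, and cross-entropy loss of step S4 can be sketched as follows; the function names, layer shapes, and parameters are illustrative assumptions:

```python
import math

def classify_video(frame_feats, weight, bias):
    """Sum the per-frame group features bitwise into F_v, apply one fully
    connected layer, and Softmax the logits into class probabilities."""
    d = len(frame_feats[0])
    F_v = [sum(f[k] for f in frame_feats) for k in range(d)]   # addition operation 6
    logits = [sum(w * x for w, x in zip(row, F_v)) + b
              for row, b in zip(weight, bias)]                 # fully connected layer 7
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [v / s for v in exps]

def cross_entropy(probs, true_class):
    """L_1 in the text: cross-entropy between the predicted distribution
    and the ground-truth category."""
    return -math.log(probs[true_class])
```

During training, `cross_entropy` would be minimized over a labeled set of group-behavior videos; at inference, the arg-max of the probabilities is the predicted category.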
The above is a specific implementation of the group behavior identification method based on relational graph analysis provided by the invention. This embodiment was validated on the Volleyball dataset (Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, and Greg Mori. A hierarchical deep temporal model for group activity recognition. In CVPR, pages 1971–1980, 2016). Compared with existing models, the method provided by the invention achieves leading detection accuracy; the specific comparison results are shown in the table below. In Table 1, higher accuracy indicates a better model, and the scheme provided by the invention achieves a leading detection level.
TABLE 1 comparison of the method of the invention with existing methods in the classification of group behaviors
[Table 1 appears only as an image in the source.]
Comparative reference:
[1] Tianmin Shu, Sinisa Todorovic, and Song-Chun Zhu. CERN: confidence-energy recurrent network for group activity recognition. In CVPR, pages 4255–4263, 2017.
[2] Mostafa S. Ibrahim and Greg Mori. Hierarchical relational networks for group activity recognition and retrieval. In ECCV, pages 742–758, 2018.
[3] Timur M. Bagautdinov, Alexandre Alahi, François Fleuret, Pascal Fua, and Silvio Savarese. Social scene understanding: End-to-end multi-person action localization and collective activity recognition. In CVPR, pages 3425–3434, 2017.

Claims (8)

1. a group behavior identification method based on relational graph analysis is characterized by comprising the following steps:
s1 sparse sampling is carried out on the video sequence containing the group behavior event to obtain a video sampling frame which is used as a representative of the video event;
s2, obtaining the character target characteristics of the single-frame sampling frame through a target detection network and dimensionality reduction operation;
s3, constructing a graph model according to the appearance and position relation between individuals and extracting the behavior expression characteristics of the single-frame group by utilizing a graph convolution neural network; and
s4, obtaining the group behavior expression characteristics of the whole video by fusing the multi-frame group behavior characteristics.
2. The method for group behavior recognition based on relational graph analysis according to claim 1, wherein in step S1, a plurality of frames are sparsely extracted from the video sequence by random sampling or by uniform sampling in time order.
3. The method for group behavior recognition based on relational graph analysis according to claim 1, wherein in step S2, features are extracted from the person targets in the sampling frame P_s using a target detection network; the feature extraction process is: P_s is input into the target detection network for forward calculation, and the features of the last convolutional layer corresponding to the region where each person target is located are extracted as that person target's features, denoted F_a; a fully connected layer is then used to perform the dimensionality reduction operation on F_a.
4. The method for group behavior recognition based on relational graph analysis according to claim 1, wherein in step S3, for the sampling frame P_s, the person target appearance features and position coordinates obtained after the processing of step S2 are {(f_i^a, l_i)}, i = 1, …, N, and a relation network graph is constructed from them, where f_i^a is the dimension-reduced feature and l_i is the centre coordinate (x_i, y_i) of the region where the target is located; specifically, for the N person targets, an N x N relation network graph G is computed, where each entry G_ij combines an appearance similarity h_a(f_i^a, f_j^a) and a position similarity h_l(l_i, l_j), defined by formulas shown only as images in the source, and (x_i, y_i) is the centre coordinate of the region where the person target is located; and the relation of target i relative to target j is normalized using the Softmax function (formula shown as an image in the source).
5. the method for group behavior recognition based on relational graph analysis according to claim 4, wherein in step S3, after the construction of the relational network graph among N human target objects is completed, the weighted fusion is performed on the characteristics of the individual human target objects by using the graph convolution neural network, and the graph convolution neural network receives the characteristics after the dimension reduction
Figure FDA0002251668700000019
Taking the sum relation network graph as input, outputting the weighted fusion characteristic of the single person target, and recording the characteristic as FgBy sampling the frame PsAll N personal object featuresSign FgTaking an average operation to obtain PsThe single-frame group behavior expression characteristics are recorded as
Figure FDA0002251668700000021
6. The method for identifying group behaviors based on relational graph analysis according to claim 1, wherein in step S4, the group behavior expression features of all the sampling frames are summed bitwise through the summation operation to obtain the group behavior expression feature of the whole video, denoted F_v; F_v is passed through the fully connected layer for behavior category classification, and the classification result of the whole video is obtained and output.
7. The method of claim 6, wherein the classification loss function in the classification is defined as L_1(y_G, ŷ_G) (formula shown as an image in the source), wherein y_G is the ground-truth video group behavior category, ŷ_G is the model's predicted behavior category, and L_1 is the cross-entropy loss function.
8. A group behavior recognition system based on relational graph analysis, which is characterized by comprising a single-frame group behavior expression feature extraction module and a multi-frame behavior expression feature fusion and classification module, and is used for realizing the group behavior recognition method according to any one of claims 1 to 7, wherein,
the single-frame group behavior expression feature extraction module is used for extracting group behavior expression features from video frames obtained by sparse sampling; and
the multi-frame behavior expression feature fusion and classification module is used for performing multi-frame fusion on the group behavior expression features extracted from a plurality of video sampling frames and constructing a classifier to classify the video behaviors.
CN201911036597.4A 2019-10-29 2019-10-29 Group behavior recognition method based on relational graph analysis Active CN110796081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911036597.4A CN110796081B (en) 2019-10-29 2019-10-29 Group behavior recognition method based on relational graph analysis

Publications (2)

Publication Number Publication Date
CN110796081A true CN110796081A (en) 2020-02-14
CN110796081B CN110796081B (en) 2023-07-21

Family

ID=69441845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911036597.4A Active CN110796081B (en) 2019-10-29 2019-10-29 Group behavior recognition method based on relational graph analysis

Country Status (1)

Country Link
CN (1) CN110796081B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011014810A1 (en) * 2009-07-30 2011-02-03 Northwestern University Systems, methods, and apparatus for reconstruction of 3-d object morphology, position, orientation and texture using an array of tactile sensors
CN105701467A (en) * 2016-01-13 2016-06-22 河海大学常州校区 Many-people abnormal behavior identification method based on human body shape characteristic
CN106241534A (en) * 2016-06-28 2016-12-21 西安特种设备检验检测院 Many people boarding abnormal movement intelligent control method
CN106529467A (en) * 2016-11-07 2017-03-22 南京邮电大学 Group behavior identification method based on multi-feature fusion
US20170169297A1 (en) * 2015-12-09 2017-06-15 Xerox Corporation Computer-vision-based group identification
CN110222719A (en) * 2019-05-10 2019-09-10 中国科学院计算技术研究所 A kind of character recognition method and system based on multiframe audio-video converged network
CN110245603A (en) * 2019-06-12 2019-09-17 成都信息工程大学 A kind of group abnormality behavior real-time detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tan Chengwu; Xia Limin; Wang Jia: "Group Behavior Recognition Based on Fused Features", no. 01, pages 17-22 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210216777A1 (en) * 2020-01-15 2021-07-15 Drishti Technologies, Inc. Almost unsupervised cycle and action detection
US11875264B2 (en) * 2020-01-15 2024-01-16 R4N63R Capital Llc Almost unsupervised cycle and action detection
CN112131944A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system
CN112131944B (en) * 2020-08-20 2023-10-17 深圳大学 Video behavior recognition method and system

Also Published As

Publication number Publication date
CN110796081B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN108830157B (en) Human behavior identification method based on attention mechanism and 3D convolutional neural network
CN109255284B (en) Motion trajectory-based behavior identification method of 3D convolutional neural network
CN108985252B (en) Improved image classification method of pulse depth neural network
CN111814661A (en) Human behavior identification method based on residual error-recurrent neural network
CN110826447A (en) Restaurant kitchen staff behavior identification method based on attention mechanism
Liao et al. Triplet-based deep similarity learning for person re-identification
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN112597980B (en) Brain-like gesture sequence recognition method for dynamic vision sensor
Tang et al. Selective spatiotemporal features learning for dynamic gesture recognition
CN111276240A (en) Multi-label multi-mode holographic pulse condition identification method based on graph convolution network
CN110458235B (en) Motion posture similarity comparison method in video
He et al. Human emotion recognition in video using subtraction pre-processing
CN110796081A (en) Group behavior identification method based on relational graph analysis
Rehman et al. Deep learning for video classification: A review
CN110889335A (en) Human skeleton double-person interaction behavior recognition method based on multi-channel space-time fusion network
CN113936317A (en) Priori knowledge-based facial expression recognition method
Borra et al. Face recognition based on convolutional neural network
Wang et al. Single shot multibox detector with deconvolutional region magnification procedure
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
CN114863572B (en) Myoelectric gesture recognition method of multi-channel heterogeneous sensor
Arif et al. Video representation by dense trajectories motion map applied to human activity recognition
CN114882590B (en) Lip reading method based on event camera multi-granularity space-time feature perception
Huang et al. Group re-identification via transferred representation and adaptive fusion
Tong et al. Unconstrained Facial expression recognition based on feature enhanced CNN and cross-layer LSTM
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant