CN108647599B - Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network - Google Patents

Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network

Info

Publication number
CN108647599B
CN108647599B (application CN201810394571.6A)
Authority
CN
China
Prior art keywords
neural network
video
dimensional
frames
recurrent neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810394571.6A
Other languages
Chinese (zh)
Other versions
CN108647599A (en)
Inventor
宋佳蓉
杨忠
胡国雄
韩佳明
徐浩
陈聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN201810394571.6A
Publication of CN108647599A
Application granted
Publication of CN108647599B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification

Abstract

The invention discloses a human behavior recognition method combining 3D (three-dimensional) jump layer connection and a recurrent neural network, which comprises the following steps: step 1, dividing each video into N parts and extracting L frames of pictures from each part, where N and L are both natural numbers; step 2, performing space-time feature extraction on the video with a trained 3D convolutional neural network, and connecting the space-time features of different layers in series to obtain a high-dimensional feature vector; step 3, normalizing the high-dimensional feature vector obtained in step 2; step 4, sending the normalized high-dimensional feature vector into a recurrent neural network for feature fusion; and step 5, classifying the fused features to obtain the action category corresponding to the video. Compared with hand-designed motion features, the method has better robustness and can effectively process longer video sequences.

Description

Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network
Technical Field
The invention belongs to the technical field of computer vision recognition, and particularly relates to a human behavior recognition method combining 3D convolutional jump layer (skip-layer) connections and a recurrent neural network.
Background
Human behavior recognition has important application prospects and market value in fields such as video surveillance, human-computer interaction and virtual reality, so video-based human behavior recognition has become one of the research hotspots in computer vision. Meanwhile, as deep learning, and convolutional neural networks in particular, has achieved effective results in computer vision, human behavior recognition based on convolutional neural networks has attracted the attention of a large number of researchers.
The behavior recognition method based on trajectory and convolutional neural network feature extraction, with patent number CN201611117772.9, first extracts trajectories from the input video data, then extracts convolutional features with a convolutional neural network, then combines the trajectory and convolutional-layer features to obtain trajectory-constrained convolutional features and stacked local Fisher vector features, and finally trains a support vector machine model for classification.
In the human behavior recognition method based on a 3D convolutional neural network, with patent number CN201510527937.9, images with obvious human behavior characteristics are first screened and stored; five channels of information in total, comprising the gray scale, the gradients in the x and y directions, and the optical flow, are then extracted from the stored images; the convolutional features of the five channels are extracted with a convolutional neural network; and finally classification is performed.
Both methods need to extract low-level motion information from the video data in advance and cannot feed the raw video data directly into a network, so end-to-end classification and prediction cannot be realized.
The method for recognizing behaviors based on deep learning and multi-scale information, with patent number CN201610047682.0, first splits a deep video into a plurality of video segments, then learns each video segment with a branch neural network, then simply fuses and concatenates the high-level representations learned by the parallel neural network branches, and finally sends the fused high-level representation into a fully connected layer and a classification layer for classification and recognition. When a long video is input, the dimensionality of the fused features becomes too high, making the network difficult to train.
In summary, although there are many studies on action recognition based on convolutional neural networks at home and abroad, these methods either require manual extraction of motion information from the video data in advance or cannot process videos of long duration.
Disclosure of Invention
The invention aims to provide a human behavior recognition method combining 3D (three-dimensional) jump layer connection and a recurrent neural network that does not require low-level motion information to be extracted manually.
In order to achieve the above purpose, the solution of the invention is:
a human behavior recognition method combining 3D jump layer connection and a recurrent neural network comprises the following steps:
step 1, dividing each video into N parts and extracting L frames of pictures from each part, where N and L are both natural numbers;
step 2, performing space-time feature extraction on the video by using the trained 3D convolutional neural network, and performing series connection on the space-time features of different layers to obtain a high-dimensional feature vector;
step 3, carrying out normalization processing on the high-dimensional feature vector obtained in the step 2;
step 4, sending the high-dimensional feature vector subjected to the normalization processing in the step 3 into a recurrent neural network for feature fusion;
and 5, classifying the features fused in the step 4 to obtain action categories corresponding to the videos.
In step 1, if the total number of frames in the video is less than 48, the video is discarded, and if the total number of frames is not divisible by L, the trailing frames are discarded.
In step 1, each video is divided into N parts and L frames of pictures are extracted from each part as follows: one video is divided into N = 3 parts evenly by frame count, each part containing the same number of frames, and L = 16 frames of pictures are extracted from each part at equal intervals.
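As a purely illustrative aid, the following Python sketch shows one way to select the frame indices for this segmentation: N = 3 parts, L = 16 frames per part at equal intervals, a 48-frame minimum, and trailing frames dropped so that all parts contain the same number of frames. The function name and the exact spacing formula are assumptions, since the description only states that frames are taken at equal intervals.

```python
import numpy as np

def sample_clip_indices(total_frames, n_parts=3, frames_per_part=16):
    """Split a video into n_parts equal segments by frame count and pick
    frames_per_part frame indices at (approximately) equal intervals from
    each segment.  Returns None when the video is shorter than
    n_parts * frames_per_part frames, mirroring the 48-frame minimum."""
    if total_frames < n_parts * frames_per_part:
        return None                                   # video is discarded
    usable = total_frames - total_frames % n_parts    # drop trailing frames
    part_len = usable // n_parts
    indices = []
    for p in range(n_parts):
        start = p * part_len
        # equally spaced indices inside this part
        idx = np.linspace(start, start + part_len - 1, frames_per_part).astype(int)
        indices.append(idx)
    return np.stack(indices)                          # shape (3, 16)

# e.g. a 100-frame video yields a (3, 16) array of frame indices
print(sample_clip_indices(100).shape)
```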
The specific process of step 2 is as follows:
transfer learning: the trained convolution and pooling layers of a C3D network are used as the feature extractor to perform space-time feature extraction on each 16-frame input from step 1, producing a pool5num-dimensional output vector; the space-time feature extraction is performed over the whole video, and the extracted result is expressed as a two-dimensional tensor (3, pool5num), where pool5num denotes the output dimension of pooling layer 5 of the feature extractor;
series connection: for each 16-frame input, the outputs of pooling layers 1, 2, 3 and 5 of the feature extractor are connected in series to obtain a pool_num-dimensional feature vector; this feature series-connection operation is performed over the entire video, and the concatenated result is represented by a two-dimensional tensor (3, pool_num), where pool_num = pool1num + pool2num + pool3num + pool5num, and pool1num, pool2num and pool3num denote the output dimensions of pooling layers 1, 2 and 3 of the feature extractor, respectively.
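The series connection can be illustrated with the following PyTorch sketch. It is a simplified stand-in for the C3D feature extractor (one convolution per block instead of the original two or three, and randomly initialized rather than pre-trained weights), intended only to show how the pool1, pool2, pool3 and pool5 outputs are flattened and concatenated into one pool_num-dimensional vector per 16-frame input; the channel sizes follow the common C3D layout and are assumptions, not values given in this description.

```python
import torch
import torch.nn as nn

class SkipC3DExtractor(nn.Module):
    """Simplified C3D-style extractor that exposes pooling layers 1, 2, 3, 5
    and concatenates their flattened outputs (the 3D jump layer connection)."""
    def __init__(self):
        super().__init__()
        def block(cin, cout, pool):
            return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.MaxPool3d(pool))
        self.b1 = block(3,   64,  (1, 2, 2))   # -> pool1
        self.b2 = block(64,  128, (2, 2, 2))   # -> pool2
        self.b3 = block(128, 256, (2, 2, 2))   # -> pool3
        self.b4 = block(256, 512, (2, 2, 2))   # -> pool4 (not concatenated)
        self.b5 = block(512, 512, (2, 2, 2))   # -> pool5

    def forward(self, clip):                   # clip: (B, 3, 16, 112, 112)
        p1 = self.b1(clip)
        p2 = self.b2(p1)
        p3 = self.b3(p2)
        p5 = self.b5(self.b4(p3))
        # flatten each pooled feature map and connect them in series:
        # pool_num = pool1num + pool2num + pool3num + pool5num
        feats = [torch.flatten(p, 1) for p in (p1, p2, p3, p5)]
        return torch.cat(feats, dim=1)         # (B, pool_num)

# one 16-frame segment: batch of 1, 3 RGB channels, 16 frames of 112 x 112 pixels
vec = SkipC3DExtractor()(torch.randn(1, 3, 16, 112, 112))
```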
In step 3, the normalization processing is performed as follows:
The mean E[x^(k)] and variance Var[x^(k)] of each dimension k of the high-dimensional feature vector from step 2 are computed over the whole training set, and each dimension of the feature vector is normalized by the formula:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

where x^(k) denotes the activation value and x̂^(k) denotes the value after normalization.
Then x̂^(k) is transformed with the following formula to obtain a new value y^(k), scaled and shifted by γ^(k) and β^(k); y^(k) is the feature value after the normalization processing:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

where γ^(k) and β^(k) are parameters of the recurrent neural network, obtained by network learning.
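For clarity, a minimal sketch of this normalization is given below; it is essentially the standard batch-normalization transform. The small epsilon term is a numerical-stability detail added for the sketch and is not stated in the text above; the function name is an assumption.

```python
import torch

def normalize_features(x, gamma, beta, eps=1e-5):
    """Per-dimension normalization of the concatenated features:
    x_hat = (x - E[x]) / sqrt(Var[x]) followed by y = gamma * x_hat + beta.
    x: (num_samples, pool_num) features over the whole training set;
    gamma, beta: (pool_num,) parameters learned together with the network."""
    mean = x.mean(dim=0)                 # E[x^(k)] over the training set
    var = x.var(dim=0, unbiased=False)   # Var[x^(k)]
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

# in a framework implementation this is typically realized with
# torch.nn.BatchNorm1d(pool_num), whose weight and bias play the roles of
# gamma^(k) and beta^(k)
```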
In step 4, the high-dimensional feature vector normalized in step 3 is sent into a recurrent neural network for feature fusion as follows: the normalized two-dimensional tensor (3, pool_num) is fed into a recurrent neural network whose time step is 3 and which contains one hidden layer with 256 neurons.
In step 5, the output of the recurrent neural network from step 4 is linearly classified using a multi-class Softmax classifier.
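Steps 3 to 5 together might be sketched as the following PyTorch module. The choice of a vanilla RNN cell (rather than an LSTM or GRU) and the names pool_num and num_classes are assumptions; the description above only specifies a recurrent network with time step 3, one hidden layer of 256 neurons, a preceding normalization layer, and a multi-class Softmax classifier on its output.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Normalization + temporal feature fusion + classification head: the three
    normalized feature vectors of one video (time step = 3) pass through a
    single-hidden-layer recurrent network with 256 units, and the last hidden
    state is mapped to class scores."""
    def __init__(self, pool_num, num_classes, hidden=256):
        super().__init__()
        self.bn  = nn.BatchNorm1d(pool_num)        # normalization layer (step 3)
        self.rnn = nn.RNN(pool_num, hidden, num_layers=1, batch_first=True)
        self.fc  = nn.Linear(hidden, num_classes)  # linear layer of the Softmax classifier

    def forward(self, feats):                      # feats: (B, 3, pool_num)
        b, t, d = feats.shape
        x = self.bn(feats.reshape(b * t, d)).reshape(b, t, d)
        _, h_n = self.rnn(x)                       # h_n: (1, B, 256)
        return self.fc(h_n[-1])                    # class scores (logits)

# prediction: class probabilities via Softmax over the scores, e.g.
# probs = torch.softmax(FusionClassifier(4096, 10)(torch.randn(2, 3, 4096)), dim=1)
```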
After the above scheme is adopted, the invention has the following beneficial effects:
(1) the space-time information of the video is extracted directly with a C3D network, and no motion information needs to be extracted from the video data in advance, so an end-to-end recognition mode is realized;
(2) the feature information of different levels extracted by the convolution kernels is connected in series; compared with manually designed low-level motion features, the low-level space-time information output by the convolution kernels is more robust and more comprehensive;
(3) the features of different levels in the feature extractor are connected in series to obtain high-order feature vectors containing information from different levels, which noticeably improves the recognition accuracy;
(4) the high-dimensional feature vector is normalized to accelerate network convergence;
(5) the recurrent neural network performs further temporal fusion of the normalized feature vectors, so the whole network structure can process long video inputs.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a network architecture diagram of the present invention;
FIG. 3 is a detailed view of the recurrent neural network.
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a human behavior recognition method combining 3D jump layer connection and a recurrent neural network, and the specific process is embodied in the following steps:
and (3) video segmentation, namely dividing a video into 3 parts according to the average frame number, extracting 16 frames of pictures from each part at equal intervals to form a segment, wherein if the total frame number of the video is less than 48 frames, the video is discarded, and if the total frame number of the video cannot be divided by 3, the last frames are discarded.
After the video segmentation is finished, one video can be represented as 5-dimensional tensors (3, 16, H, W, 3), and each 16-frame segment can be represented as 4-dimensional tensors (16, H, W, 3), where 3 represents that the video is uniformly divided into 3 parts, 16 represents that 16 frames of pictures are extracted from each part, H and W represent the length and width of the pictures respectively, and 3 represents the number of channels of the pictures, here representing RGB pictures.
The videos of the training set are divided according to the principle, each video in the divided training set is represented as a 5-dimensional tensor (3, 16, H, W, 3), each video is scaled to a size of 3 × 16 × 128 × 171 × 3, each video can be represented as a 5-dimensional tensor (3, 16, 128, 171, 3), 16 represents the number of frames of each segment, and 128, 171, 3 represent the length, width and channel number of each frame of the picture respectively.
For a video, 5-dimensional tensors (3, 16, 128, 171, 3) are converted into 3 representations of 4-dimensional tensors (16, 128, 171, 3).
Following the previous step, all training set data is transformed into (16, 128, 171, 3) form, where each video includes 3 consecutive 4-dimensional tensors (16, 128, 171, 3).
All 4-dimensional tensors (16, 128, 171, 3) of the training set are averaged, and the obtained average value is represented by (16, 128, 171, 3) which is a 4-dimensional tensor mean.
Subtracting mean from all segments in the training set (16, 128, 171, 3) so that each pixel value in the training set is distributed around zero, this step can eliminate the effect of noise on the classification.
For a video, 3 consecutive 4-dimensional tensors (16, 128, 171, 3) with the mean subtracted are converted into 5-dimensional tensors (3, 16, 128, 171, 3).
And converting all video data in the training set into a 5-dimensional tensor (3, 16, 128, 171 and 3) representation form according to the previous step, and cutting the 5-dimensional tensor after the mean value reduction processing into a size of (3, 16, 112, 112 and 3).
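A possible NumPy sketch of this preprocessing (mean subtraction followed by cropping to 112 × 112) is shown here. Whether the training-time crop is random or centered is not specified above, so the random crop used for training in this sketch is an assumption borrowed from common C3D practice; at prediction time the description below uses a center crop.

```python
import numpy as np

def preprocess_video(video, mean, crop=112, train=True):
    """Preprocess one segmented video of shape (3, 16, 128, 171, 3):
    subtract the training-set mean clip and crop every frame to crop x crop.
    'mean' is the (16, 128, 171, 3) average over all training clips."""
    video = video.astype(np.float32) - mean            # broadcasts over the 3 clips
    h, w = video.shape[2], video.shape[3]
    if train:                                          # assumed random crop for training
        top  = np.random.randint(0, h - crop + 1)
        left = np.random.randint(0, w - crop + 1)
    else:                                              # center crop at prediction time
        top, left = (h - crop) // 2, (w - crop) // 2
    return video[:, :, top:top + crop, left:left + crop, :]   # (3, 16, 112, 112, 3)

# training-set mean: average of all (16, 128, 171, 3) clips
# mean = np.mean(np.stack(all_training_clips, axis=0), axis=0)
```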
The processed videos are fed into the C3D feature extractor: for each video, the three 16-frame segments are fed in succession, i.e. three 4-dimensional tensors (16, 112, 112, 3), each producing a pool5num-dimensional vector, so each video's features are finally expressed as a 2-dimensional tensor (3, pool5num), where pool5num denotes the output dimension of pooling layer 5 of the feature extractor.
For each video, the outputs of pooling layers 1, 2, 3 and 5 of the feature extractor are connected in series, as shown in FIG. 2; the concatenated high-dimensional features are represented by a two-dimensional tensor (3, pool_num), where pool_num = pool1num + pool2num + pool3num + pool5num, and pool1num, pool2num and pool3num are the output dimensions of pooling layers 1, 2 and 3 of the feature extractor, respectively.
The whole training set is passed through the feature extractor and the series-connection operation to obtain the high-dimensional feature training data.
The obtained high-dimensional feature training data is then input into a recurrent neural network; as shown in FIG. 2, a normalization layer is applied before the data enters the recurrent neural network to speed up and improve network convergence.
The normalization operation consists of two steps. First, the features are standardized: the mean E[x^(k)] and variance Var[x^(k)] of each dimension of the high-dimensional features are computed over the whole training set, and each activation input x^(k) is normalized by the formula:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

Second, in order not to change the expressive power of the feature vector, x̂^(k) is transformed with the following formula to obtain a new value y^(k), scaled and shifted by γ^(k) and β^(k); y^(k) is the feature value after normalization processing:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

where γ^(k) and β^(k) are obtained by network learning.
The recurrent neural network parameters and the parameters γ^(k), β^(k) are trained using back-propagation to obtain the trained network.
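A minimal back-propagation sketch follows, reusing the FusionClassifier module from the earlier sketch: only the normalization parameters γ^(k), β^(k), the recurrent layer and the classifier are updated, while the C3D extractor stays frozen. The optimizer, learning rate, batch size and random placeholder data are assumptions and not part of this description.

```python
import torch
import torch.nn as nn

pool_num, num_classes = 4096, 10                     # illustrative sizes only
model = FusionClassifier(pool_num, num_classes)      # defined in the earlier sketch
criterion = nn.CrossEntropyLoss()                    # Softmax cross-entropy over class scores
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

feats  = torch.randn(8, 3, pool_num)                 # placeholder fused-feature batch
labels = torch.randint(0, num_classes, (8,))         # placeholder action labels
for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()                                  # gradients for RNN, BN (gamma, beta), FC
    optimizer.step()
```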
When an input video is to be predicted, it is divided into 3 equal parts by frame count, and 16 frames of pictures are extracted at equal intervals from each part to form a segment, so the video can be expressed as a 5-dimensional tensor (3, 16, H, W, 3).
The video to be predicted (3, 16, H, W, 3) is scaled to a size of (3, 16, 128, 171, 3), the tensor mean is subtracted from each 16-frame segment (16, 128, 171, 3), and each frame of the picture is cropped at its center, so the processed video to be predicted can be represented as a 5-dimensional tensor (3, 16, 112, 112, 3).
The processed video to be predicted (3, 16, 112, 112, 3) is converted into 3 4-dimensional tensors (16, 112, 112, 3), which are fed into the network in sequence to obtain the concatenated high-dimensional features (3, pool_num).
The high-dimensional features (3, pool_num) of the video to be predicted are fed into the trained BN layer and the recurrent neural network to obtain the prediction output.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.

Claims (3)

1. A human behavior recognition method combining 3D jump layer connection and a recurrent neural network is characterized by comprising the following steps:
step 1, dividing each video into N parts and extracting L frames of pictures from each part, where N and L are both natural numbers; the L frames are extracted from each part as follows: one video is divided into N = 3 parts evenly by frame count, each part containing the same number of frames, and L = 16 frames are extracted from each part at equal intervals;
step 2, performing space-time feature extraction on the video by using the trained 3D convolutional neural network, and performing series connection on the space-time features of different layers to obtain a high-dimensional feature vector; the specific process is as follows:
transfer learning: the trained convolution and pooling layers of a C3D network are used as the feature extractor to perform space-time feature extraction on each 16-frame input of step 1, producing a pool5num-dimensional output vector; the space-time feature extraction is performed on the whole video, and the extracted result is expressed as a two-dimensional tensor (3, pool5num), where pool5num is the output dimension of pooling layer 5 of the feature extractor;
series connection: for each 16-frame input, the outputs of pooling layer 1, pooling layer 2, pooling layer 3 and pooling layer 5 of the feature extractor are connected in series to obtain a pool_num-dimensional feature vector; the feature series-connection operation is performed on the whole video, and the concatenated result is expressed as a two-dimensional tensor (3, pool_num), where pool_num = pool1num + pool2num + pool3num + pool5num, and pool1num, pool2num and pool3num are the output dimensions of pooling layer 1, pooling layer 2 and pooling layer 3 of the feature extractor, respectively;
step 3, carrying out normalization processing on the high-dimensional feature vector obtained in step 2; the normalization processing is performed as follows:
the mean E[x^(k)] and variance Var[x^(k)] of each dimension k of the high-dimensional feature vector from step 2 are computed over the whole training set, and each dimension of the feature vector is normalized by the formula:

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

where x^(k) denotes the activation value and x̂^(k) denotes the value after normalization;
then x̂^(k) is transformed with the following formula to obtain a new value y^(k), scaled and shifted by γ^(k) and β^(k); y^(k) is the feature value after the normalization processing:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

where γ^(k) and β^(k) are parameters of the recurrent neural network, obtained by network learning;
step 4, sending the high-dimensional feature vector normalized in step 3 into a recurrent neural network for feature fusion, specifically: the normalized two-dimensional tensor (3, pool_num) is fed into a recurrent neural network whose time step is 3 and which contains one hidden layer with 256 neurons;
and 5, classifying the features fused in the step 4 to obtain action categories corresponding to the videos.
2. The human behavior recognition method combining 3D jump layer connection and a recurrent neural network of claim 1, wherein: in step 1, if the total number of frames of the video is less than 48, the video is discarded, and if the total number of frames of the video is not divisible by L, the trailing frames are discarded.
3. The human behavior recognition method combining 3D jump layer connection and a recurrent neural network of claim 1, wherein: in step 5, the output of the recurrent neural network in step 4 is linearly classified using a multi-class Softmax classifier.
CN201810394571.6A 2018-04-27 2018-04-27 Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network Active CN108647599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810394571.6A CN108647599B (en) 2018-04-27 2018-04-27 Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810394571.6A CN108647599B (en) 2018-04-27 2018-04-27 Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network

Publications (2)

Publication Number Publication Date
CN108647599A CN108647599A (en) 2018-10-12
CN108647599B (en) 2022-04-15

Family

ID=63747937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810394571.6A Active CN108647599B (en) 2018-04-27 2018-04-27 Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network

Country Status (1)

Country Link
CN (1) CN108647599B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961037A (en) * 2019-03-20 2019-07-02 中共中央办公厅电子科技学院(北京电子科技学院) A kind of examination hall video monitoring abnormal behavior recognition methods
CN109977854A (en) * 2019-03-25 2019-07-05 浙江新再灵科技股份有限公司 Unusual checking analysis system under a kind of elevator monitoring environment
CN110839156A (en) * 2019-11-08 2020-02-25 北京邮电大学 Future frame prediction method and model based on video image
CN111460889B (en) * 2020-02-27 2023-10-31 平安科技(深圳)有限公司 Abnormal behavior recognition method, device and equipment based on voice and image characteristics
CN112449155A (en) * 2020-10-21 2021-03-05 苏州怡林城信息科技有限公司 Video monitoring method and system for protecting privacy of personnel
CN112863482B (en) * 2020-12-31 2022-09-27 思必驰科技股份有限公司 Speech synthesis method and system with rhythm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3465532A1 (en) * 2016-06-07 2019-04-10 Toyota Motor Europe Control device, system and method for determining the perceptual load of a visual and dynamic driving scene
CN106599907B (en) * 2016-11-29 2019-11-29 北京航空航天大学 The dynamic scene classification method and device of multiple features fusion
CN107506712B (en) * 2017-08-15 2021-05-18 成都考拉悠然科技有限公司 Human behavior identification method based on 3D deep convolutional network
CN107811626A (en) * 2017-09-10 2018-03-20 天津大学 A kind of arrhythmia classification method based on one-dimensional convolutional neural networks and S-transformation

Also Published As

Publication number Publication date
CN108647599A (en) 2018-10-12

Similar Documents

Publication Publication Date Title
CN108647599B (en) Human behavior recognition method combining 3D (three-dimensional) jump layer connection and recurrent neural network
CN110119757B (en) Model training method, video category detection method, device, electronic equipment and computer readable medium
CN109829443B (en) Video behavior identification method based on image enhancement and 3D convolution neural network
CN109886225B (en) Image gesture action online detection and recognition method based on deep learning
Sixt et al. Rendergan: Generating realistic labeled data
CN110097131B (en) Semi-supervised medical image segmentation method based on countermeasure cooperative training
CN103971137B (en) Based on the three-dimensional dynamic human face expression recognition method that structural sparse features learn
CN109784280A (en) Human bodys' response method based on Bi-LSTM-Attention model
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
CN109508669A (en) A kind of facial expression recognizing method based on production confrontation network
CN111144448A (en) Video barrage emotion analysis method based on multi-scale attention convolutional coding network
CN110135386B (en) Human body action recognition method and system based on deep learning
CN111539290B (en) Video motion recognition method and device, electronic equipment and storage medium
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN109948721A (en) A kind of video scene classification method based on video presentation
CN111369565A (en) Digital pathological image segmentation and classification method based on graph convolution network
Yang et al. Deeplab_v3_plus-net for image semantic segmentation with channel compression
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN112633153A (en) Facial expression motion unit identification method based on space-time graph convolutional network
CN113420703B (en) Dynamic facial expression recognition method based on multi-scale feature extraction and multi-attention mechanism modeling
CN105956604B (en) Action identification method based on two-layer space-time neighborhood characteristics
CN112990340B (en) Self-learning migration method based on feature sharing
CN113255464A (en) Airplane action recognition method and system
CN113705384A (en) Facial expression recognition method considering local space-time characteristics and global time sequence clues

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant