CN110688986A - Attention branch guided 3D convolution behavior recognition network method - Google Patents

Attention branch guided 3D convolution behavior recognition network method

Info

Publication number
CN110688986A
Authority
CN
China
Prior art keywords
layer
features
convolutional
output
convolutional layer
Prior art date
Legal status
Granted
Application number
CN201910984496.3A
Other languages
Chinese (zh)
Other versions
CN110688986B (en)
Inventor
成锋娜
周宏平
茹煜
张玉言
Current Assignee
Nanjing Forestry University
Original Assignee
Nanjing Forestry University
Priority date
Filing date
Publication date
Application filed by Nanjing Forestry University filed Critical Nanjing Forestry University
Priority to CN201910984496.3A priority Critical patent/CN110688986B/en
Publication of CN110688986A publication Critical patent/CN110688986A/en
Application granted granted Critical
Publication of CN110688986B publication Critical patent/CN110688986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/59: Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V 20/597: Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an attention-branch-guided 3D convolutional behavior recognition network method. The network designs 3D attention mechanisms at different resolutions so that it attends to the spatiotemporal information of greatest interest. Meanwhile, the variation of the spatiotemporal elements in the attention features is learned by convolution to help the 3D branch extract more robust spatiotemporal features. In addition, the two branches learn with different types of convolution and different depths, which helps the network build complementary information. The method has a small number of parameters and high robustness, and can be used for behavior recognition in public places such as schools and shopping malls.

Description

Attention branch guided 3D convolution behavior recognition network method
Technical Field
The invention relates to a 3D convolution behavior recognition network method guided by attention branches, and belongs to the technical field of image processing and pattern recognition.
Background
Behavior recognition has extremely wide application in fields such as public safety, medical care, education, and entertainment. Consider, for example, its application to driving assistance: the driver's state is critical to safe driving. In a driving-assistance system, in-vehicle video monitoring can detect and recognize various driver behaviors and thereby help ensure driving safety. First, it can detect undesirable behaviors during driving, such as fatigued driving, smoking, or passengers talking with the driver; if the driver is detected to be in such a state, the vehicle can issue an alarm to alert the driver or the relevant authorities and prevent a traffic accident. Second, it can improve driver comfort and reduce driver distraction: a driving-assistance system can suggest seat adjustments based on the driver's posture or movements, and can carry out everyday operations such as answering a call or skipping a song by recognizing the actions of other passengers. Although much good work has been proposed in the field of behavior recognition, its robustness still falls short of practical requirements, so more robust behavior recognition algorithms are still needed.
Disclosure of Invention
To solve the above problems, the invention provides an attention-branch-guided 3D convolutional behavior recognition network method that adopts a multi-resolution attention mechanism to construct a robust behavior recognition approach.
To achieve this purpose, the invention provides the following technical solution:
An attention-branch-directed 3D convolutional behavior recognition network method, comprising the following steps:
Step 1: build the training and testing data sets; collect behavior video sequences from the internet or by filming, and generate the data sets used for training and testing;
Step 2: during training, process and augment the data online according to given rules;
Step 3: build the attention-branch-guided 3D convolutional behavior recognition network;
Step 4: feed the training data into the network and train the network built in step 3;
Step 5: test the model trained in step 4.
As an improvement of the present invention, step 1 (building the training and testing data sets: collecting behavior video sequences from the internet or by filming, and generating the data sets used for training and testing) specifically comprises the following steps:
Step 101: determine the P behavior categories to be predicted, and build the data set by searching the internet for each category or by filming each category in multiple scenes;
Step 102: for each category of the data set built in step 101, generate a training set and a test set at a ratio of 5:1; if the training set has m samples, the test set has n = m/5 samples. Denote the training set as T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i is the video sequence of the i-th sample, and l_i is its label. Accordingly, T_test = {y_1, y_2, ..., y_k} is the test set;
Step 103: if the collected data are videos, decode each video sequence into pictures, resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the videos are processed, jump to step 2. If the collected data are pictures, jump to step 104;
Step 104: resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the pictures are processed, jump to step 2. This step normalizes the size and naming of the data, which facilitates the data processing and augmentation in step 2 (a preparation sketch is given below).
As an improvement of the present invention, step 2 (processing and augmenting the data during training) specifically comprises the following steps:
Step 201: randomly extract 32 consecutive video frames from a video as the network input; suppose the video sequence has n frames; if n is less than 32, append the first 32-n frames to the end of the sequence as padding;
Step 202: randomly crop a 224 × 224 × 32 network-input tensor from one of five spatial positions (the four corners and the center) in order to augment the data and train a more robust network model, as sketched below.
As an improvement of the present invention, the structure of the attention-branch-guided 3D convolutional behavior recognition network in step 3 is as follows:
Convolutional layer 1: convolve the 224 × 224 × 3 × 32 input with 32 convolution kernels of size 3 × 7 × 7 and stride 2 × 2 × 2, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 2: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 1 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 3: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 2 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Pooling layer 1: pass the 112 × 112 × 32 × 16 features output by convolutional layer 3 through a 2 × 2 × 2 3D max pooling layer to obtain 56 × 56 × 32 × 8 features;
Convolutional layer 4: convolve the 56 × 56 × 32 × 8 features output by pooling layer 1 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 4 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_1 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_3: multiply the outputs of convolutional layer 5 and convolutional layer 5_2 element-wise to obtain 56 × 56 × 64 × 8 features;
Pooling layer 2: pass the 56 × 56 × 64 × 8 features output by convolutional layer 5_3 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 6: convolve the 28 × 28 × 64 × 4 features output by pooling layer 2 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 6 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_1: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_1 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_3: multiply the outputs of convolutional layer 7 and convolutional layer 7_2 element-wise to obtain 28 × 28 × 128 × 4 features;
Pooling layer 3: pass the 28 × 28 × 128 × 4 features output by convolutional layer 7_3 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 8: convolve the 14 × 14 × 128 × 2 features output by pooling layer 3 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 8 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_1: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_1 with 256 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_3: multiply the outputs of convolutional layer 9 and convolutional layer 9_2 element-wise to obtain 14 × 14 × 256 × 2 features;
Pooling layer 4: pass the 14 × 14 × 256 × 2 features output by convolutional layer 9_3 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Convolutional layer 10: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_1 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_3: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_2 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_4: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_3 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Pooling layer 10_5: pass the 56 × 56 × 64 × 8 features output by convolutional layer 10_4 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 11: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 64 × 4 features;
Aggregation layer 11_0: concatenate the output of pooling layer 10_5 and the output of convolutional layer 11 along the channel dimension to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_1: convolve the 28 × 28 × 128 × 4 features output by aggregation layer 11_0 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_1 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_3: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_2 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_4: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_3 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Pooling layer 11_5: pass the 28 × 28 × 128 × 4 features output by convolutional layer 11_4 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 12: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_2 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 128 × 2 features;
Aggregation layer 12_0: concatenate the output of pooling layer 11_5 and the output of convolutional layer 12 along the channel dimension to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_1: convolve the 14 × 14 × 256 × 2 features output by aggregation layer 12_0 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_1 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_3: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_2 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_4: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_3 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Pooling layer 12_5: pass the 14 × 14 × 256 × 2 features output by convolutional layer 12_4 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Aggregation layer 13: concatenate the output of pooling layer 4 and the output of pooling layer 12_5 along the channel dimension to obtain 1 × 1 × 512 × 1 features;
Convolutional layer 14: convolve the 1 × 1 × 512 × 1 features output by aggregation layer 13 with P convolution kernels of size 1 × 1 × 1 to obtain 1 × 1 × P × 1 features;
Classification layer 15: convert the 1 × 1 × P × 1 output of convolutional layer 14 into a P-dimensional feature vector as the output of the network. The network model contains 3D attention modules at different resolutions, which help the network focus on features of interest in both space and time; the information from the convolutional spatiotemporal attention modules is further learned through different types of convolution to help the network learn more robust spatiotemporal features.
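To make the attention mechanism concrete, here is a hedged PyTorch sketch (not the authors' code) of the attention gate formed by convolutional layers 5_1, 5_2 and 5_3; tensors are assumed to be in N × C × T × H × W order and kernel sizes are written as (kT, kH, kW). The same pattern repeats at layers 7_1-7_3 and 9_1-9_3 with more channels, and the sigmoid attention map is also the tensor that feeds convolutional layers 10 and 11 of the attention branch.

```python
# Sketch of one attention gate (layers x_1, x_2, x_3); the module and variable
# names are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionGate3D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # layer x_1: 1x3x3 convolution + BN + ReLU
        self.spatial = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
        )
        # layer x_2: 1x1x1 convolution + BN + Sigmoid -> attention map in (0, 1)
        self.gate = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1),
            nn.BatchNorm3d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        attn = self.gate(self.spatial(x))   # same shape as x
        gated = x * attn                    # layer x_3: element-wise product
        return gated, attn                  # attn also feeds the attention branch

# usage after convolutional layer 5 (64 channels, 8 frames, 56 x 56 spatial)
gate5 = AttentionGate3D(64)
feat = torch.randn(1, 64, 8, 56, 56)
gated, attn = gate5(feat)                   # both 1 x 64 x 8 x 56 x 56
```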
As an improvement of the present invention, in step 4 the training data are fed into the network and the network built in step 3 is trained, specifically as follows:
Step 401: input the data generated in step 202 into the network model built in step 3.
Step 402: train the parameters of the network with the given labels. Denote the parameters of the deep network model of step 3 as Θ and the output of the network as C_l. Train the network using the cross-entropy loss function:
Loss(Θ) = − Σ_i log p(l_i | Seq_i; Θ), where p(l_i | Seq_i; Θ) is the softmax probability that the network output C_l assigns to the true label l_i of the i-th training sample.
step 403: the network was trained using a random gradient descent method (SGD). After a given number of training sessions, the parameters of the model are saved.
As an improvement of the present invention, step 5 (testing the model trained in step 4) specifically comprises the following steps:
Step 501: if the test video sequence has fewer than 32 frames, pad it as in step 201, input it into the model saved in step 403, and take the output as the classification result; if the test video sequence has 32 frames or more, jump to step 502;
Step 502: if the video sequence has 32 frames or more, split the video into consecutive 32-frame clips, input each clip into the model saved in step 403 in turn, sum the outputs, and take the class with the highest probability as the classification result.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the network designs multi-resolution 3D attention mechanisms that help it attend to spatiotemporal features at different resolutions, thereby extracting more robust spatiotemporal information;
(2) the method further learns the variation of the attention information through convolution to guide the original 3D convolutional layers and improve the robustness of prediction;
(3) the two branches adopt different convolution types and different depths, so that complementary spatiotemporal information is obtained.
Drawings
FIG. 1 is a diagram of a convolutional network model framework in the present invention.
Detailed Description
The present invention will be further described with reference to the following detailed description and the accompanying drawings, it being understood that the preferred embodiments described herein are merely illustrative and explanatory of the invention and are not restrictive thereof.
Example 1: referring to FIG. 1, an attention-branch-directed 3D convolutional behavior recognition network method comprises the following steps:
Step 1: build the training and testing data sets; collect behavior video sequences from the internet or by filming, and generate the data sets used for training and testing;
Step 2: during training, process and augment the data online according to given rules;
Step 3: build the attention-branch-guided 3D convolutional behavior recognition network;
Step 4: feed the training data into the network and train the network built in step 3;
Step 5: test the model trained in step 4.
Step 1 (building the training and testing data sets: collecting behavior video sequences from the internet or by filming, and generating the data sets used for training and testing) specifically comprises the following steps:
Step 101: determine the P behavior categories to be predicted, and build the data set by searching the internet for each category or by filming each category in multiple scenes.
Step 102: for each category of the data set built in step 101, generate a training set and a test set at a ratio of 5:1. If the training set has m samples, the test set has n = m/5 samples. Denote the training set as T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i is the video sequence of the i-th sample, and l_i is its label. Accordingly, T_test = {y_1, y_2, ..., y_k} is the test set.
Step 103: if the collected data are videos, decode each video sequence into pictures, resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the videos are processed, jump to step 2. If the collected data are pictures, jump to step 104.
Step 104: resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the pictures are processed, jump to step 2.
Step 2 (processing and augmenting the data during training) specifically comprises the following steps:
Step 201: randomly extract 32 consecutive video frames from a video as the network input; suppose the video sequence has n frames; if n is less than 32, append the first 32-n frames to the end of the sequence as padding;
Step 202: randomly crop a 224 × 224 × 32 network-input tensor from one of five spatial positions (the four corners and the center).
The structure of the attention-branch-guided 3D convolutional behavior recognition network in step 3 is as follows:
Convolutional layer 1: convolve the 224 × 224 × 3 × 32 input with 32 convolution kernels of size 3 × 7 × 7 and stride 2 × 2 × 2, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 2: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 1 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 3: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 2 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Pooling layer 1: pass the 112 × 112 × 32 × 16 features output by convolutional layer 3 through a 2 × 2 × 2 3D max pooling layer to obtain 56 × 56 × 32 × 8 features;
Convolutional layer 4: convolve the 56 × 56 × 32 × 8 features output by pooling layer 1 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 4 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_1 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_3: multiply the outputs of convolutional layer 5 and convolutional layer 5_2 element-wise to obtain 56 × 56 × 64 × 8 features;
Pooling layer 2: pass the 56 × 56 × 64 × 8 features output by convolutional layer 5_3 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 6: convolve the 28 × 28 × 64 × 4 features output by pooling layer 2 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 6 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_1: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_1 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_3: multiply the outputs of convolutional layer 7 and convolutional layer 7_2 element-wise to obtain 28 × 28 × 128 × 4 features;
Pooling layer 3: pass the 28 × 28 × 128 × 4 features output by convolutional layer 7_3 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 8: convolve the 14 × 14 × 128 × 2 features output by pooling layer 3 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 8 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_1: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_1 with 256 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_3: multiply the outputs of convolutional layer 9 and convolutional layer 9_2 element-wise to obtain 14 × 14 × 256 × 2 features;
Pooling layer 4: pass the 14 × 14 × 256 × 2 features output by convolutional layer 9_3 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Convolutional layer 10: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_1 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_3: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_2 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_4: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_3 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Pooling layer 10_5: pass the 56 × 56 × 64 × 8 features output by convolutional layer 10_4 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 11: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 64 × 4 features;
Aggregation layer 11_0: concatenate the output of pooling layer 10_5 and the output of convolutional layer 11 along the channel dimension to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_1: convolve the 28 × 28 × 128 × 4 features output by aggregation layer 11_0 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_1 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_3: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_2 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_4: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_3 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Pooling layer 11_5: pass the 28 × 28 × 128 × 4 features output by convolutional layer 11_4 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 12: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_2 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 128 × 2 features;
Aggregation layer 12_0: concatenate the output of pooling layer 11_5 and the output of convolutional layer 12 along the channel dimension to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_1: convolve the 14 × 14 × 256 × 2 features output by aggregation layer 12_0 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_1 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_3: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_2 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_4: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_3 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Pooling layer 12_5: pass the 14 × 14 × 256 × 2 features output by convolutional layer 12_4 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Aggregation layer 13: concatenate the output of pooling layer 4 and the output of pooling layer 12_5 along the channel dimension to obtain 1 × 1 × 512 × 1 features;
Convolutional layer 14: convolve the 1 × 1 × 512 × 1 features output by aggregation layer 13 with P convolution kernels of size 1 × 1 × 1 to obtain 1 × 1 × P × 1 features;
Classification layer 15: convert the 1 × 1 × P × 1 output of convolutional layer 14 into a P-dimensional feature vector as the output of the network.
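As a complement to the layer-by-layer description above, the fusion of the two branches (aggregation layer 13, convolutional layer 14 and classification layer 15) can be sketched in PyTorch as follows; the N × C × T × H × W tensor order and the module name are assumptions made for illustration only.

```python
# Sketch of the two-branch fusion head: global average pooling of both branches,
# channel concatenation, a P-way 1x1x1 convolution, and flattening to a P-vector.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                # pooling layers 4 and 12_5
        self.classifier = nn.Conv3d(512, num_classes, 1)   # convolutional layer 14

    def forward(self, main_feat, attn_feat):
        # main_feat, attn_feat: N x 256 x 2 x 14 x 14 features from the two branches
        fused = torch.cat([self.pool(main_feat), self.pool(attn_feat)], dim=1)  # N x 512 x 1 x 1 x 1
        return self.classifier(fused).flatten(1)           # classification layer 15: N x P
```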
In step 4, the training data are fed into the network and the network built in step 3 is trained, specifically as follows:
Step 401: input the data generated in step 202 into the network model built in step 3.
Step 402: train the parameters of the network with the given labels. Denote the parameters of the deep network model of step 3 as Θ and the output of the network as C_l. Train the network using the cross-entropy loss function:
Loss(Θ) = − Σ_i log p(l_i | Seq_i; Θ), where p(l_i | Seq_i; Θ) is the softmax probability that the network output C_l assigns to the true label l_i of the i-th training sample.
step 403: the network was trained using a random gradient descent method (SGD). After a given number of training sessions, the parameters of the model are saved.
Step 5 (testing the model trained in step 4) specifically comprises the following steps:
Step 501: if the test video sequence has fewer than 32 frames, pad it as in step 201, input it into the model saved in step 403, and take the output as the classification result; if the test video sequence has 32 frames or more, jump to step 502;
Step 502: if the video sequence has 32 frames or more, split the video into consecutive 32-frame clips, input each clip into the model saved in step 403 in turn, sum the outputs, and take the class with the highest probability as the classification result.
The above description covers only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and such modifications are also intended to fall within the scope of the invention.

Claims (6)

1. An attention-branch-directed 3D convolutional behavior recognition network method, comprising the steps of:
Step 1: build the training and testing data sets; collect behavior video sequences from the internet or by filming, and generate the data sets used for training and testing;
Step 2: during training, process and augment the data online according to given rules;
Step 3: build the attention-branch-guided 3D convolutional behavior recognition network;
Step 4: feed the training data into the network and train the network built in step 3;
Step 5: test the model trained in step 4.
2. The attention-branch-directed 3D convolution behavior recognition network method of claim 1,
wherein step 1 (building the training and testing data sets: collecting behavior video sequences from the internet or by filming, and generating the data sets used for training and testing) specifically comprises the following steps:
Step 101: determine the P behavior categories to be predicted, and build the data set by searching the internet for each category or by filming each category in multiple scenes;
Step 102: for each category of the data set built in step 101, generate a training set and a test set at a ratio of 5:1; if the training set has m samples, the test set has n = m/5 samples. Denote the training set as T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i is the video sequence of the i-th sample, and l_i is its label. Accordingly, T_test = {y_1, y_2, ..., y_k} is the test set;
Step 103: if the collected data are videos, decode each video sequence into pictures, resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the videos are processed, jump to step 2. If the collected data are pictures, jump to step 104;
Step 104: resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the pictures are processed, jump to step 2.
3. The attention-branch-directed 3D convolutional behavior recognition network method of claim 1, wherein step 2 (processing and augmenting the data during training) specifically comprises the following steps:
Step 201: randomly extract 32 consecutive video frames from a video as the network input; suppose the video sequence has n frames; if n is less than 32, append the first 32-n frames to the end of the sequence as padding;
Step 202: randomly crop a 224 × 224 × 32 network-input tensor from one of five spatial positions (the four corners and the center).
4. The attention-branch-directed 3D convolution behavior recognition network method of claim 1,
wherein the structure of the attention-branch-directed 3D convolutional behavior recognition network in step 3 is as follows:
Convolutional layer 1: convolve the 224 × 224 × 3 × 32 input with 32 convolution kernels of size 3 × 7 × 7 and stride 2 × 2 × 2, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 2: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 1 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 3: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 2 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Pooling layer 1: pass the 112 × 112 × 32 × 16 features output by convolutional layer 3 through a 2 × 2 × 2 3D max pooling layer to obtain 56 × 56 × 32 × 8 features;
Convolutional layer 4: convolve the 56 × 56 × 32 × 8 features output by pooling layer 1 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 4 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_1 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_3: multiply the outputs of convolutional layer 5 and convolutional layer 5_2 element-wise to obtain 56 × 56 × 64 × 8 features;
Pooling layer 2: pass the 56 × 56 × 64 × 8 features output by convolutional layer 5_3 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 6: convolve the 28 × 28 × 64 × 4 features output by pooling layer 2 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 6 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_1: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_1 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_3: multiply the outputs of convolutional layer 7 and convolutional layer 7_2 element-wise to obtain 28 × 28 × 128 × 4 features;
Pooling layer 3: pass the 28 × 28 × 128 × 4 features output by convolutional layer 7_3 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 8: convolve the 14 × 14 × 128 × 2 features output by pooling layer 3 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 8 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_1: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_1 with 256 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_3: multiply the outputs of convolutional layer 9 and convolutional layer 9_2 element-wise to obtain 14 × 14 × 256 × 2 features;
Pooling layer 4: pass the 14 × 14 × 256 × 2 features output by convolutional layer 9_3 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Convolutional layer 10: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_1 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_3: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_2 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_4: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_3 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Pooling layer 10_5: pass the 56 × 56 × 64 × 8 features output by convolutional layer 10_4 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 11: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 64 × 4 features;
Aggregation layer 11_0: concatenate the output of pooling layer 10_5 and the output of convolutional layer 11 along the channel dimension to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_1: convolve the 28 × 28 × 128 × 4 features output by aggregation layer 11_0 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_1 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_3: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_2 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_4: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_3 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Pooling layer 11_5: pass the 28 × 28 × 128 × 4 features output by convolutional layer 11_4 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 12: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_2 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 128 × 2 features;
Aggregation layer 12_0: concatenate the output of pooling layer 11_5 and the output of convolutional layer 12 along the channel dimension to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_1: convolve the 14 × 14 × 256 × 2 features output by aggregation layer 12_0 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_1 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_3: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_2 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_4: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_3 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Pooling layer 12_5: pass the 14 × 14 × 256 × 2 features output by convolutional layer 12_4 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Aggregation layer 13: concatenate the output of pooling layer 4 and the output of pooling layer 12_5 along the channel dimension to obtain 1 × 1 × 512 × 1 features;
Convolutional layer 14: convolve the 1 × 1 × 512 × 1 features output by aggregation layer 13 with P convolution kernels of size 1 × 1 × 1 to obtain 1 × 1 × P × 1 features;
Classification layer 15: convert the 1 × 1 × P × 1 output of convolutional layer 14 into a P-dimensional feature vector as the output of the network.
5. The attention-branch-directed 3D convolutional behavior recognition network method of claim 1, wherein in step 4 the training data are fed into the network and the network built in step 3 is trained, specifically as follows:
Step 401: input the data generated in step 202 into the network model built in step 3;
Step 402: train the parameters of the network with the given labels; denote the parameters of the deep network model of step 3 as Θ and the output of the network as C_l; train the network using the cross-entropy loss function:
Loss(Θ) = − Σ_i log p(l_i | Seq_i; Θ), where p(l_i | Seq_i; Θ) is the softmax probability that the network output C_l assigns to the true label l_i of the i-th training sample;
step 403: the network is trained using the stochastic gradient descent method (SGD), and after a given number of training passes, the parameters of the model are saved.
6. The attention-branch-directed 3D convolutional behavior recognition network method of claim 1, wherein step 5 (testing the model trained in step 4) specifically comprises the following steps:
Step 501: if the test video sequence has fewer than 32 frames, pad it as in step 201, input it into the model saved in step 403, and take the output as the classification result; if the test video sequence has 32 frames or more, jump to step 502;
Step 502: if the video sequence has 32 frames or more, split the video into consecutive 32-frame clips, input each clip into the model saved in step 403 in turn, sum the outputs, and take the class with the highest probability as the classification result.
CN201910984496.3A 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches Active CN110688986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910984496.3A CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910984496.3A CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Publications (2)

Publication Number Publication Date
CN110688986A true CN110688986A (en) 2020-01-14
CN110688986B CN110688986B (en) 2023-06-23

Family

ID=69112957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910984496.3A Active CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Country Status (1)

Country Link
CN (1) CN110688986B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185543A (en) * 2020-09-04 2021-01-05 南京信息工程大学 Construction method of medical induction data flow classification model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328630A1 (en) * 2015-05-08 2016-11-10 Samsung Electronics Co., Ltd. Object recognition apparatus and method
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328630A1 (en) * 2015-05-08 2016-11-10 Samsung Electronics Co., Ltd. Object recognition apparatus and method
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185543A (en) * 2020-09-04 2021-01-05 南京信息工程大学 Construction method of medical induction data flow classification model

Also Published As

Publication number Publication date
CN110688986B (en) 2023-06-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant