CN110688986B - 3D convolution behavior recognition network method guided by attention branches - Google Patents

3D convolution behavior recognition network method guided by attention branches

Info

Publication number
CN110688986B
CN110688986B (application number CN201910984496.3A)
Authority
CN
China
Prior art keywords
layer
convolution
features
output
relu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910984496.3A
Other languages
Chinese (zh)
Other versions
CN110688986A (en)
Inventor
成锋娜
周宏平
茹煜
张玉言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Forestry University
Original Assignee
Nanjing Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Forestry University filed Critical Nanjing Forestry University
Priority to CN201910984496.3A priority Critical patent/CN110688986B/en
Publication of CN110688986A publication Critical patent/CN110688986A/en
Application granted granted Critical
Publication of CN110688986B publication Critical patent/CN110688986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an attention-branch-guided 3D convolution behavior recognition network method. The network contains 3D attention mechanisms at different resolutions so that it focuses on the most relevant spatio-temporal information. At the same time, the changes of the spatio-temporal primitives in the attention features are learned through convolution to help the 3D branch extract more robust spatio-temporal features. In addition, the two branches learn through different types of convolution and different depths, which helps the network build complementary information. The invention has few parameters and high robustness, and can be used for behavior recognition in public places such as schools and shopping malls.

Description

3D convolution behavior recognition network method guided by attention branches
Technical Field
The invention relates to a 3D convolution behavior recognition network method guided by attention branches, and belongs to the technical field of image processing and pattern recognition.
Background
Behavior recognition has extremely wide application in public safety, medical, educational, entertainment, and other fields. For example, behavior recognition is applied to driver assistance: the state of the driver is critical for safe driving. In a driver-assistance system, in-vehicle video monitoring can detect and identify various behaviors of the driver so as to ensure driving safety. The first use is detecting bad behavior during driving, such as fatigue driving, smoking, or talking with passengers; if the driver is detected to be in such a state, the vehicle can give an alarm to remind the driver or notify the relevant departments, thereby preventing traffic accidents. The second use is improving driver comfort and reducing distraction: the driver-assistance system can suggest seat adjustments based on the driver's sitting posture or actions, and can also perform everyday operations such as answering calls or switching songs by recognizing the actions of other passengers. Although many good works have been proposed in the field of behavior recognition, their robustness still cannot meet practical requirements. Therefore, there is still a need to develop more robust behavior recognition algorithms.
Disclosure of Invention
To solve the above problems, the invention provides an attention-branch-guided 3D convolution behavior recognition network method, which uses a multi-resolution attention mechanism to construct a robust behavior recognition method.
In order to achieve the above purpose, the present invention provides the following technical solutions:
An attention-branch-guided 3D convolution behavior recognition network method, the method comprising the following steps:
step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing;
step 2: in the training process, processing and augmenting the data online according to given rules;
step 3: establishing the attention-branch-guided 3D convolution behavior recognition network;
step 4: feeding the training data into the network and training the network established in step 3;
step 5: testing the model trained in step 4.
As an improvement of the present invention, the step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing, specifically comprises the following steps:
step 101: determining the P behavior categories to be predicted, and building the data set by searching the web for each category or by shooting each category in multiple scenes;
step 102: generating a training set and a test set for each category of the data set produced in step 101 at a ratio of 5:1; if the training set has m samples, the test set has n = m/5 samples; the training set is denoted T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i denotes the video sequence of the i-th sample and l_i is its label; correspondingly, the test set is denoted T_test = {y_1, y_2, ..., y_k};
step 103: if the collected data are videos, decoding each video sequence into pictures, resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the videos are processed, jumping to step 2; if the collected data are pictures, jumping to step 104;
step 104: resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the pictures are processed, jumping to step 2; this step normalizes the size and naming of the data, which facilitates the data processing and augmentation in step 2 (a minimal sketch of the frame extraction follows).
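For concreteness, the following is a minimal sketch of steps 103-104, assuming OpenCV (cv2) is used for decoding and resizing; the patent does not name a library, and the function name and directory layout are illustrative only.

```python
# Minimal sketch of steps 103-104 (assumption: OpenCV is used). Decodes one video
# into frames, resizes each frame to 240x320 (height x width) and saves them as
# img000001.jpg, img000002.jpg, ...
import os
import cv2

def decode_and_resize(video_path, out_dir, height=240, width=320):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                                        # end of video
            break
        idx += 1
        frame = cv2.resize(frame, (width, height))        # cv2.resize expects (width, height)
        cv2.imwrite(os.path.join(out_dir, "img%06d.jpg" % idx), frame)
    cap.release()
    return idx                                            # number of frames written
```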
As an improvement of the present invention, the step 2: processing and augmenting the data in the training process specifically comprises:
step 201: randomly extracting 32 consecutive video frames from the video as the network input; assuming the video sequence has n frames, if n is less than 32, the first 32-n frames are appended after the n-th frame to complete the sequence;
step 202: randomly cropping a 224×224×3×32 network input tensor at one of five spatial positions of the pictures (namely the four corners and the center), in order to augment the data and train a more robust network model; a minimal sketch of this sampling and cropping is given below.
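A minimal sketch of steps 201-202, assuming the decoded frames of one video are held in a NumPy array of shape (n, 240, 320, 3); the function names and the uniform random choice among the five crop positions are assumptions, not details stated in the text.

```python
# Minimal sketch of steps 201-202 (assumptions: frames of one video are a NumPy
# array of shape (n, 240, 320, 3); the crop position is drawn uniformly from the
# four corners and the center).
import random
import numpy as np

def sample_clip(frames, clip_len=32):
    """Randomly take 32 consecutive frames; pad short videos as in step 201."""
    n = len(frames)
    if n >= clip_len:
        start = random.randint(0, n - clip_len)
        return frames[start:start + clip_len]
    clip = frames
    while len(clip) < clip_len:           # append the first 32-n frames after the n-th frame
        clip = np.concatenate([clip, frames[:clip_len - len(clip)]], axis=0)
    return clip

def five_position_crop(clip, size=224):
    """Crop a size x size window at one of five spatial positions (step 202)."""
    _, h, w, _ = clip.shape
    positions = [(0, 0), (0, w - size), (h - size, 0),
                 (h - size, w - size), ((h - size) // 2, (w - size) // 2)]
    top, left = random.choice(positions)
    return clip[:, top:top + size, left:left + size, :]   # (32, 224, 224, 3)
```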
As an improvement of the present invention, the attention-branch-guided 3D convolution behavior recognition network in step 3 has the following structure:
convolution layer 1: the 224×224×3×32 input is convolved with 32 convolution kernels of size 3×7×7 at a stride of 2×2×2, and then passes through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 2: the 112×112×32×16 features output by convolution layer 1 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 3: the 112×112×32×16 features output by convolution layer 2 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
pooling layer 1: the 112×112×32×16 features output by convolution layer 3 pass through a 2×2×2 3D max-pooling layer to obtain 56×56×32×8 features;
convolution layer 4: the 56×56×32×8 features output by pooling layer 1 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5: the 56×56×64×8 features output by convolution layer 4 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_1: the 56×56×64×8 features output by convolution layer 5 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_2: the 56×56×64×8 features output by convolution layer 5_1 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 56×56×64×8 features;
convolution layer 5_3: the outputs of convolution layer 5 and convolution layer 5_2 are multiplied element-wise to obtain 56×56×64×8 features;
pooling layer 2: the 56×56×64×8 features output by convolution layer 5_3 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 6: the 28×28×64×4 features output by pooling layer 2 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7: the 28×28×128×4 features output by convolution layer 6 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_1: the 28×28×128×4 features output by convolution layer 7 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_2: the 28×28×128×4 features output by convolution layer 7_1 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 28×28×128×4 features;
convolution layer 7_3: the outputs of convolution layer 7 and convolution layer 7_2 are multiplied element-wise to obtain 28×28×128×4 features;
pooling layer 3: the 28×28×128×4 features output by convolution layer 7_3 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 8: the 14×14×128×2 features output by pooling layer 3 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9: the 14×14×256×2 features output by convolution layer 8 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_1: the 14×14×256×2 features output by convolution layer 9 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_2: the 14×14×256×2 features output by convolution layer 9_1 are convolved with 256 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 14×14×256×2 features;
convolution layer 9_3: the outputs of convolution layer 9 and convolution layer 9_2 are multiplied element-wise to obtain 14×14×256×2 features;
pooling layer 4: the 14×14×256×2 features output by convolution layer 9_3 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
convolution layer 10: the 56×56×64×8 features output by convolution layer 5_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_1: the 56×56×64×8 features output by convolution layer 10 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_2: the 56×56×64×8 features output by convolution layer 10_1 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_3: the 56×56×64×8 features output by convolution layer 10_2 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_4: the 56×56×64×8 features output by convolution layer 10_3 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
pooling layer 10_5: the 56×56×64×8 features output by convolution layer 10_4 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 11: the 28×28×128×4 features output by convolution layer 7_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×64×4 features;
aggregation layer 11_0: the output of pooling layer 10_5 and the output of convolution layer 11 are concatenated along the channel dimension to obtain 28×28×128×4 features;
convolution layer 11_1: the 28×28×128×4 features output by aggregation layer 11_0 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_2: the 28×28×128×4 features output by convolution layer 11_1 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_3: the 28×28×128×4 features output by convolution layer 11_2 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_4: the 28×28×128×4 features output by convolution layer 11_3 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
pooling layer 11_5: the 28×28×128×4 features output by convolution layer 11_4 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 12: the 14×14×256×2 features output by convolution layer 9_2 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×128×2 features;
aggregation layer 12_0: the output of pooling layer 11_5 and the output of convolution layer 12 are concatenated along the channel dimension to obtain 14×14×256×2 features;
convolution layer 12_1: the 14×14×256×2 features output by aggregation layer 12_0 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_2: the 14×14×256×2 features output by convolution layer 12_1 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_3: the 14×14×256×2 features output by convolution layer 12_2 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_4: the 14×14×256×2 features output by convolution layer 12_3 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
pooling layer 12_5: the 14×14×256×2 features output by convolution layer 12_4 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
aggregation layer 13: the output of pooling layer 4 and the output of pooling layer 12_5 are concatenated along the channel dimension to obtain 1×1×512×1 features;
convolution layer 14: the 1×1×512×1 features output by aggregation layer 13 are convolved with P 1×1×1 convolution kernels to obtain 1×1×P×1 features;
classification layer 15: the 1×1×P×1 output of convolution layer 14 is converted into a P-dimensional feature vector, which serves as the output of the network. The network model contains 3D attention modules at different resolutions, which help the network focus on informative features in time and space, while the information from these attention modules is further learned through different types of convolution to help the network learn more robust spatio-temporal features. A minimal sketch of one such attention block is given below.
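As a concrete illustration, below is a minimal PyTorch sketch of one attention block at the 56×56 resolution (convolution layers 5, 5_1, 5_2 and 5_3). It is a sketch under the shape description above, not the patented implementation: padding choices, module names and the (N, C, T, H, W) tensor layout are assumptions.

```python
# Hedged PyTorch sketch of one attention block (convolution layers 5, 5_1, 5_2, 5_3).
# Tensors use PyTorch's (N, C, T, H, W) layout, so the 56x56x64x8 feature in the text
# corresponds to C=64, T=8, H=W=56. Padding choices and module names are assumptions.
import torch
import torch.nn as nn

class AttentionBlock3D(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # convolution layer 5: 3x3x3 convolution + BN + ReLU
        self.conv5 = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True))
        # convolution layer 5_1: 1x3x3 convolution + BN + ReLU
        self.conv5_1 = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True))
        # convolution layer 5_2: 1x1x1 convolution + BN + Sigmoid -> attention map
        self.conv5_2 = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1),
            nn.BatchNorm3d(channels), nn.Sigmoid())

    def forward(self, x):
        f = self.conv5(x)                  # 56x56x64x8 features
        a = self.conv5_2(self.conv5_1(f))  # attention weights in (0, 1)
        return f * a, a                    # layer 5_3: element-wise product; `a` also feeds convolution layer 10
```

For example, `feat, attn = AttentionBlock3D(64)(torch.randn(1, 64, 8, 56, 56))` yields the attention-weighted features that enter pooling layer 2 and the attention map that enters convolution layer 10; the 28×28 and 14×14 blocks (layers 7_x and 9_x) follow the same pattern with 128 and 256 channels.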
As an improvement of the present invention, in step 4, the training data are sent to the network to train the network established in step 3, which specifically comprises the following steps:
step 401: the data generated in step 202 are input into the network model established in step 3.
Step 402: the parameters of the network are trained based on the given labels. The parameters of the deep network model in step 3 are denoted Θ and the output of the network is denoted Cl. The network is trained with the cross-entropy loss function:
L(\Theta) = -\sum_{j=1}^{P} l_j \log(Cl_j)
where l_j is the one-hot label and Cl_j the predicted probability of the j-th class.
Step 403: the network is trained using stochastic gradient descent (SGD). After a given number of training iterations, the parameters of the model are saved. A minimal training-loop sketch is given below.
As an improvement of the present invention, step 5: testing the model trained in step 4 specifically comprises the following steps:
step 501: if the test video sequence has fewer than 32 frames, it is completed as in step 201, then input into the model saved in step 403, and the output is taken as the classification result; if the test video sequence has 32 frames or more, jump to step 502;
step 502: if the video sequence has 32 frames or more, the video is input into the model saved in step 403 clip by clip, one clip per 32 frames; the outputs are summed, and the class with the highest probability is selected as the classification result. A sketch of this test procedure follows.
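A minimal sketch of steps 501-502, assuming the test frames are already resized and cropped to 224×224 and reusing the `sample_clip` helper sketched for step 201; the tensor layout and model loading are assumptions.

```python
# Hedged sketch of steps 501-502 (assumptions: frames are a NumPy array of shape
# (n, 224, 224, 3); `sample_clip` is the padding helper sketched earlier).
import torch

@torch.no_grad()
def predict(model, frames, clip_len=32, device="cuda"):
    model = model.to(device).eval()
    n = len(frames)
    if n < clip_len:
        clips = [sample_clip(frames, clip_len)]             # step 501: pad as in step 201
    else:
        clips = [frames[i:i + clip_len]                     # step 502: one clip per 32 frames
                 for i in range(0, n - clip_len + 1, clip_len)]
    scores = None
    for clip in clips:
        x = torch.from_numpy(clip).float().permute(3, 0, 1, 2).unsqueeze(0).to(device)  # (1, C, T, H, W)
        out = model(x)                                      # (1, P) class scores
        scores = out if scores is None else scores + out    # sum the clip outputs
    return int(scores.argmax(dim=1))                        # class with the highest summed score
```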
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the network contains 3D attention mechanisms at multiple resolutions, which helps the network attend to spatio-temporal features at different resolutions and thus extract more robust spatio-temporal information;
(2) the invention further learns the changes of the attention information through convolution to guide the original 3D convolution layers, thereby improving the robustness of the prediction;
(3) the two branches of the scheme use different convolution types and different depths, so that complementary spatio-temporal information is acquired.
Drawings
FIG. 1 is a diagram of a convolutional network model framework in the present invention.
Detailed Description
The invention will be further described below with reference to specific embodiments and the accompanying drawings; it should be understood that the preferred embodiments described here are only for illustration and explanation and are not intended to limit the scope of the invention.
Example 1: referring to fig. 1, an attention-branch-guided 3D convolution behavior recognition network method comprises the following steps:
step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing;
step 2: in the training process, processing and augmenting the data online according to given rules;
step 3: establishing the attention-branch-guided 3D convolution behavior recognition network;
step 4: feeding the training data into the network and training the network established in step 3;
step 5: testing the model trained in step 4.
The step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing, specifically comprises the following steps:
step 101: determining the P behavior categories to be predicted, and building the data set by searching the web for each category or by shooting each category in multiple scenes.
Step 102: generating a training set and a test set for each category of the data set produced in step 101 at a ratio of 5:1. If the training set has m samples, the test set has n = m/5 samples. The training set is denoted T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i denotes the video sequence of the i-th sample and l_i is its label. Correspondingly, the test set is denoted T_test = {y_1, y_2, ..., y_k}.
Step 103: if the collected data are videos, decoding each video sequence into pictures, resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the videos are processed, jumping to step 2; if the collected data are pictures, jumping to step 104.
Step 104: resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the pictures are processed, jumping to step 2.
The step 2: processing and augmenting the data in the training process specifically comprises:
step 201: randomly extracting 32 consecutive video frames from the video as the network input; assuming the video sequence has n frames, if n is less than 32, the first 32-n frames are appended after the n-th frame to complete the sequence;
step 202: randomly cropping a 224×224×3×32 network input tensor at one of five spatial positions of the pictures (namely the four corners and the center).
The 3D convolution behavior recognition network guided by the attention branches in step 3 is structured as follows:
convolution layer 1: the 224×224×3×32 input is convolved with 32 convolution kernels of size 3×7×7 at a stride of 2×2×2, and then passes through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 2: the 112×112×32×16 features output by convolution layer 1 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 3: the 112×112×32×16 features output by convolution layer 2 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
pooling layer 1: the 112×112×32×16 features output by convolution layer 3 pass through a 2×2×2 3D max-pooling layer to obtain 56×56×32×8 features;
convolution layer 4: the 56×56×32×8 features output by pooling layer 1 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5: the 56×56×64×8 features output by convolution layer 4 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_1: the 56×56×64×8 features output by convolution layer 5 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_2: the 56×56×64×8 features output by convolution layer 5_1 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 56×56×64×8 features;
convolution layer 5_3: the outputs of convolution layer 5 and convolution layer 5_2 are multiplied element-wise to obtain 56×56×64×8 features;
pooling layer 2: the 56×56×64×8 features output by convolution layer 5_3 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 6: the 28×28×64×4 features output by pooling layer 2 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7: the 28×28×128×4 features output by convolution layer 6 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_1: the 28×28×128×4 features output by convolution layer 7 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_2: the 28×28×128×4 features output by convolution layer 7_1 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 28×28×128×4 features;
convolution layer 7_3: the outputs of convolution layer 7 and convolution layer 7_2 are multiplied element-wise to obtain 28×28×128×4 features;
pooling layer 3: the 28×28×128×4 features output by convolution layer 7_3 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 8: the 14×14×128×2 features output by pooling layer 3 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9: the 14×14×256×2 features output by convolution layer 8 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_1: the 14×14×256×2 features output by convolution layer 9 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_2: the 14×14×256×2 features output by convolution layer 9_1 are convolved with 256 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 14×14×256×2 features;
convolution layer 9_3: the outputs of convolution layer 9 and convolution layer 9_2 are multiplied element-wise to obtain 14×14×256×2 features;
pooling layer 4: the 14×14×256×2 features output by convolution layer 9_3 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
convolution layer 10: the 56×56×64×8 features output by convolution layer 5_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_1: the 56×56×64×8 features output by convolution layer 10 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_2: the 56×56×64×8 features output by convolution layer 10_1 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_3: the 56×56×64×8 features output by convolution layer 10_2 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_4: the 56×56×64×8 features output by convolution layer 10_3 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
pooling layer 10_5: the 56×56×64×8 features output by convolution layer 10_4 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 11: the 28×28×128×4 features output by convolution layer 7_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×64×4 features;
aggregation layer 11_0: the output of pooling layer 10_5 and the output of convolution layer 11 are concatenated along the channel dimension to obtain 28×28×128×4 features;
convolution layer 11_1: the 28×28×128×4 features output by aggregation layer 11_0 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_2: the 28×28×128×4 features output by convolution layer 11_1 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_3: the 28×28×128×4 features output by convolution layer 11_2 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_4: the 28×28×128×4 features output by convolution layer 11_3 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
pooling layer 11_5: the 28×28×128×4 features output by convolution layer 11_4 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 12: the 14×14×256×2 features output by convolution layer 9_2 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×128×2 features;
aggregation layer 12_0: the output of pooling layer 11_5 and the output of convolution layer 12 are concatenated along the channel dimension to obtain 14×14×256×2 features;
convolution layer 12_1: the 14×14×256×2 features output by aggregation layer 12_0 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_2: the 14×14×256×2 features output by convolution layer 12_1 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_3: the 14×14×256×2 features output by convolution layer 12_2 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_4: the 14×14×256×2 features output by convolution layer 12_3 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
pooling layer 12_5: the 14×14×256×2 features output by convolution layer 12_4 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
aggregation layer 13: the output of pooling layer 4 and the output of pooling layer 12_5 are concatenated along the channel dimension to obtain 1×1×512×1 features;
convolution layer 14: the 1×1×512×1 features output by aggregation layer 13 are convolved with P 1×1×1 convolution kernels to obtain 1×1×P×1 features;
classification layer 15: the 1×1×P×1 output of convolution layer 14 is converted into a P-dimensional feature vector as the output of the network (a minimal sketch of the two-branch fusion head follows);
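To make the two-branch fusion concrete, here is a minimal PyTorch sketch of pooling layers 4 and 12_5, aggregation layer 13, convolution layer 14 and classification layer 15; the module name `FusionHead` and the (N, C, T, H, W) tensor layout are assumptions, not details from the text.

```python
# Hedged PyTorch sketch of the fusion head: pooling layers 4 and 12_5 (3D adaptive
# average pooling), aggregation layer 13 (channel concatenation), convolution layer 14
# (P 1x1x1 kernels) and classification layer 15 (flatten to a P-dimensional vector).
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, num_classes, branch_channels=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                          # pooling layers 4 and 12_5
        self.classifier = nn.Conv3d(2 * branch_channels, num_classes, kernel_size=1)  # convolution layer 14

    def forward(self, main_feat, attn_feat):
        # main_feat: output of convolution layer 9_3, attn_feat: output of convolution layer 12_4,
        # both of shape (N, 256, 2, 14, 14) in this layout
        fused = torch.cat([self.pool(main_feat), self.pool(attn_feat)], dim=1)  # aggregation layer 13: (N, 512, 1, 1, 1)
        return self.classifier(fused).flatten(1)                     # classification layer 15: (N, P) scores
```

With `head = FusionHead(num_classes=P)`, `head(main_feat, attn_feat)` produces the P-dimensional vector that is fed to the cross-entropy loss in step 402.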
In step 4, the training data are sent to the network to train the network established in step 3, which specifically comprises the following steps:
step 401: the data generated in step 202 are input into the network model established in step 3.
Step 402: the parameters of the network are trained based on the given labels. The parameters of the deep network model in step 3 are denoted Θ and the output of the network is denoted Cl. The network is trained with the cross-entropy loss function:
L(\Theta) = -\sum_{j=1}^{P} l_j \log(Cl_j)
where l_j is the one-hot label and Cl_j the predicted probability of the j-th class.
Step 403: the network is trained using stochastic gradient descent (SGD). After a given number of training iterations, the parameters of the model are saved.
Step 5: testing the model trained in step 4 specifically comprises the following steps:
step 501: if the test video sequence has fewer than 32 frames, it is completed as in step 201, then input into the model saved in step 403, and the output is taken as the classification result; if the test video sequence has 32 frames or more, jump to step 502;
step 502: if the video sequence has 32 frames or more, the video is input into the model saved in step 403 clip by clip, one clip per 32 frames; the outputs are summed, and the class with the highest probability is selected as the classification result.
The foregoing is only a preferred embodiment of the invention; it should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention.

Claims (3)

1. An attention-branch-guided 3D convolution behavior recognition network method, the method comprising the following steps:
step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing;
step 2: in the training process, processing and augmenting the data online according to given rules;
step 3: establishing the attention-branch-guided 3D convolution behavior recognition network;
step 4: feeding the training data into the network and training the network established in step 3;
step 5: testing the model trained in step 4;
the step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing, specifically comprises the following steps:
step 101: determining the P behavior categories to be predicted, and building the data set by searching the web for each category or by shooting each category in multiple scenes;
step 102: generating a training set and a test set for each category of the data set produced in step 101 at a ratio of 5:1; if the training set has m samples, the test set has n = m/5 samples; the training set is denoted T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i denotes the video sequence of the i-th sample and l_i is its label; correspondingly, the test set is denoted T_test = {y_1, y_2, ..., y_k};
step 103: if the collected data are videos, decoding each video sequence into pictures, resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the videos are processed, jumping to step 2; if the collected data are pictures, jumping to step 104;
step 104: resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the pictures are processed, jumping to step 2;
the step 2: processing and augmenting the data in the training process specifically comprises:
step 201: randomly extracting 32 consecutive video frames from the video as the network input; assuming the video sequence has n frames, if n is less than 32, the first 32-n frames are appended after the n-th frame to complete the sequence;
step 202: randomly cropping a 224×224×3×32 network input tensor at one of five spatial positions of the pictures (namely the four corners and the center);
the 3D convolution behavior recognition network guided by the attention branches in step 3 is structured as follows:
convolution layer 1: the 224×224×3×32 input is convolved with 32 convolution kernels of size 3×7×7 at a stride of 2×2×2, and then passes through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 2: the 112×112×32×16 features output by convolution layer 1 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 3: the 112×112×32×16 features output by convolution layer 2 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
pooling layer 1: the 112×112×32×16 features output by convolution layer 3 pass through a 2×2×2 3D max-pooling layer to obtain 56×56×32×8 features;
convolution layer 4: the 56×56×32×8 features output by pooling layer 1 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5: the 56×56×64×8 features output by convolution layer 4 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_1: the 56×56×64×8 features output by convolution layer 5 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_2: the 56×56×64×8 features output by convolution layer 5_1 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 56×56×64×8 features;
convolution layer 5_3: the outputs of convolution layer 5 and convolution layer 5_2 are multiplied element-wise to obtain 56×56×64×8 features;
pooling layer 2: the 56×56×64×8 features output by convolution layer 5_3 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 6: the 28×28×64×4 features output by pooling layer 2 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7: the 28×28×128×4 features output by convolution layer 6 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_1: the 28×28×128×4 features output by convolution layer 7 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_2: the 28×28×128×4 features output by convolution layer 7_1 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 28×28×128×4 features;
convolution layer 7_3: the outputs of convolution layer 7 and convolution layer 7_2 are multiplied element-wise to obtain 28×28×128×4 features;
pooling layer 3: the 28×28×128×4 features output by convolution layer 7_3 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 8: the 14×14×128×2 features output by pooling layer 3 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9: the 14×14×256×2 features output by convolution layer 8 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_1: the 14×14×256×2 features output by convolution layer 9 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_2: the 14×14×256×2 features output by convolution layer 9_1 are convolved with 256 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 14×14×256×2 features;
convolution layer 9_3: the outputs of convolution layer 9 and convolution layer 9_2 are multiplied element-wise to obtain 14×14×256×2 features;
pooling layer 4: the 14×14×256×2 features output by convolution layer 9_3 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
convolution layer 10: the 56×56×64×8 features output by convolution layer 5_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_1: the 56×56×64×8 features output by convolution layer 10 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_2: the 56×56×64×8 features output by convolution layer 10_1 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_3: the 56×56×64×8 features output by convolution layer 10_2 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_4: the 56×56×64×8 features output by convolution layer 10_3 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
pooling layer 10_5: the 56×56×64×8 features output by convolution layer 10_4 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 11: the 28×28×128×4 features output by convolution layer 7_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×64×4 features;
aggregation layer 11_0: the output of pooling layer 10_5 and the output of convolution layer 11 are concatenated along the channel dimension to obtain 28×28×128×4 features;
convolution layer 11_1: the 28×28×128×4 features output by aggregation layer 11_0 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_2: the 28×28×128×4 features output by convolution layer 11_1 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_3: the 28×28×128×4 features output by convolution layer 11_2 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_4: the 28×28×128×4 features output by convolution layer 11_3 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
pooling layer 11_5: the 28×28×128×4 features output by convolution layer 11_4 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 12: the 14×14×256×2 features output by convolution layer 9_2 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×128×2 features;
aggregation layer 12_0: the output of pooling layer 11_5 and the output of convolution layer 12 are concatenated along the channel dimension to obtain 14×14×256×2 features;
convolution layer 12_1: the 14×14×256×2 features output by aggregation layer 12_0 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_2: the 14×14×256×2 features output by convolution layer 12_1 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_3: the 14×14×256×2 features output by convolution layer 12_2 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_4: the 14×14×256×2 features output by convolution layer 12_3 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
pooling layer 12_5: the 14×14×256×2 features output by convolution layer 12_4 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
aggregation layer 13: the output of pooling layer 4 and the output of pooling layer 12_5 are concatenated along the channel dimension to obtain 1×1×512×1 features;
convolution layer 14: the 1×1×512×1 features output by aggregation layer 13 are convolved with P 1×1×1 convolution kernels to obtain 1×1×P×1 features;
classification layer 15: the 1×1×P×1 output of convolution layer 14 is converted into a P-dimensional feature vector, which serves as the output of the network.
2. The attention-branch-guided 3D convolution behavior recognition network method according to claim 1, wherein in step 4 the training data are sent to the network to train the network established in step 3, specifically comprising the following steps:
step 401: inputting the data generated in step 202 into the network model established in step 3;
step 402: training the parameters of the network based on the given labels, the parameters of the deep network model in step 3 being denoted Θ and the output of the network being denoted Cl, the network being trained with the cross-entropy loss function:
L(\Theta) = -\sum_{j=1}^{P} l_j \log(Cl_j)
where l_j is the one-hot label and Cl_j the predicted probability of the j-th class;
step 403: training the network using stochastic gradient descent (SGD), and saving the parameters of the model after a given number of training iterations.
3. The attention-branch-guided 3D convolution behavior recognition network method according to claim 2, wherein step 5: testing the model trained in step 4 specifically comprises the following steps:
step 501: if the test video sequence has fewer than 32 frames, completing the video sequence as in step 201, inputting it into the model saved in step 403, and taking the output as the classification result; if the test video sequence has 32 frames or more, jumping to step 502;
step 502: if the video sequence has 32 frames or more, inputting the video into the model saved in step 403 clip by clip, one clip per 32 frames, summing the outputs, and selecting the class with the highest probability as the classification result.
CN201910984496.3A 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches Active CN110688986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910984496.3A CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910984496.3A CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Publications (2)

Publication Number Publication Date
CN110688986A CN110688986A (en) 2020-01-14
CN110688986B (en) 2023-06-23

Family

ID=69112957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910984496.3A Active CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Country Status (1)

Country Link
CN (1) CN110688986B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185543A (en) * 2020-09-04 2021-01-05 南京信息工程大学 Construction method of medical induction data flow classification model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9940539B2 (en) * 2015-05-08 2018-04-10 Samsung Electronics Co., Ltd. Object recognition apparatus and method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks

Also Published As

Publication number Publication date
CN110688986A (en) 2020-01-14

Similar Documents

Publication Publication Date Title
CN106960206B (en) Character recognition method and character recognition system
CN109754015B (en) Neural networks for drawing multi-label recognition and related methods, media and devices
Hoang Ngan Le et al. Robust hand detection and classification in vehicles and in the wild
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
US11797845B2 (en) Model learning device, model learning method, and program
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
CN113065550B (en) Text recognition method based on self-attention mechanism
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN111126401A (en) License plate character recognition method based on context information
CN117373111A (en) AutoHOINet-based human-object interaction detection method
CN114821058A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN111079665A (en) Morse code automatic identification method based on Bi-LSTM neural network
CN116071817A (en) Network architecture and training method of gesture recognition system for automobile cabin
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN110688986B (en) 3D convolution behavior recognition network method guided by attention branches
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN114241564A (en) Facial expression recognition method based on inter-class difference strengthening network
CN110991219B (en) Behavior identification method based on two-way 3D convolution network
Revi et al. Gan-generated fake face image detection using opponent color local binary pattern and deep learning technique
CN111242114B (en) Character recognition method and device
CN117315752A (en) Training method, device, equipment and medium for face emotion recognition network model
CN116311504A (en) Small sample behavior recognition method, system and equipment
Xuan et al. Scalable fine-grained generated image classification based on deep metric learning
CN115393743A (en) Vehicle detection method based on double-branch encoding and decoding network, unmanned aerial vehicle and medium
CN114241573A (en) Facial micro-expression recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant