CN110688986B - 3D convolution behavior recognition network method guided by attention branches - Google Patents

3D convolution behavior recognition network method guided by attention branches

Info

Publication number
CN110688986B
CN110688986B (application number CN201910984496.3A)
Authority
CN
China
Prior art keywords
layer
convolution
features
output
relu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910984496.3A
Other languages
Chinese (zh)
Other versions
CN110688986A (en)
Inventor
成锋娜
周宏平
茹煜
张玉言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Forestry University
Original Assignee
Nanjing Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Forestry University filed Critical Nanjing Forestry University
Priority to CN201910984496.3A priority Critical patent/CN110688986B/en
Publication of CN110688986A publication Critical patent/CN110688986A/en
Application granted granted Critical
Publication of CN110688986B publication Critical patent/CN110688986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an attention-branch-guided 3D convolution behavior recognition network method. The network contains 3D attention mechanisms at different resolutions so that it focuses on the most relevant spatio-temporal information. At the same time, the changes of the spatio-temporal primitives in the attention features are learned through convolution to help the 3D branch extract more robust spatio-temporal features. In addition, the two branches learn through different types of convolution and different depths, which helps the network build complementary information. The invention has few parameters and high robustness, and can be used for behavior recognition in public places such as schools and shopping malls.

Description

3D convolution behavior recognition network method guided by attention branches
Technical Field
The invention relates to a 3D convolution behavior recognition network method guided by attention branches, and belongs to the technical field of image processing and pattern recognition.
Background
Behavior recognition has extremely wide application in public safety, medical, educational, entertainment, and other fields. For example, behavior recognition is applied to driver assistance: the state of the driver is critical for safe driving. In a driver-assistance system, in-vehicle video monitoring can detect and identify various behaviors of the driver so as to ensure driving safety. The first use is detecting bad behavior during driving, such as fatigue driving, smoking, or talking with passengers; if the driver is detected to be in such a state, the vehicle can give an alarm to remind the driver or notify the relevant departments, thereby preventing traffic accidents. The second use is improving driver comfort and reducing distraction: the driver-assistance system can suggest seat adjustments based on the driver's sitting posture or actions, and can also perform everyday operations such as answering calls or switching songs by recognizing the actions of other passengers. Although many good works have been proposed in the field of behavior recognition, their robustness still cannot meet practical requirements. Therefore, there is still a need to develop more robust behavior recognition algorithms.
Disclosure of Invention
To solve the above problems, the invention provides an attention-branch-guided 3D convolution behavior recognition network method, which uses a multi-resolution attention mechanism to construct a robust behavior recognition method.
In order to achieve the above purpose, the present invention provides the following technical solutions:
An attention-branch-guided 3D convolution behavior recognition network method, the method comprising the following steps:
step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing;
step 2: in the training process, processing and augmenting the data online according to given rules;
step 3: establishing the attention-branch-guided 3D convolution behavior recognition network;
step 4: feeding the training data into the network and training the network established in step 3;
step 5: testing the model trained in step 4.
As an improvement of the present invention, the step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing, specifically comprises the following steps:
step 101: determining the P behavior categories to be predicted, and building the data set by searching the web for each category or by shooting each category in multiple scenes;
step 102: generating a training set and a test set for each category of the data set produced in step 101 at a ratio of 5:1; if the training set has m samples, the test set has n = m/5 samples; the training set is denoted T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i denotes the video sequence of the i-th sample and l_i is its label; correspondingly, the test set is denoted T_test = {y_1, y_2, ..., y_k};
step 103: if the collected data are videos, decoding each video sequence into pictures, resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the videos are processed, jumping to step 2; if the collected data are pictures, jumping to step 104;
step 104: resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the pictures are processed, jumping to step 2; this step normalizes the size and naming of the data, which facilitates the data processing and augmentation in step 2 (a minimal sketch of the frame extraction follows).
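For concreteness, the following is a minimal sketch of steps 103-104, assuming OpenCV (cv2) is used for decoding and resizing; the patent does not name a library, and the function name and directory layout are illustrative only.

```python
# Minimal sketch of steps 103-104 (assumption: OpenCV is used). Decodes one video
# into frames, resizes each frame to 240x320 (height x width) and saves them as
# img000001.jpg, img000002.jpg, ...
import os
import cv2

def decode_and_resize(video_path, out_dir, height=240, width=320):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                                        # end of video
            break
        idx += 1
        frame = cv2.resize(frame, (width, height))        # cv2.resize expects (width, height)
        cv2.imwrite(os.path.join(out_dir, "img%06d.jpg" % idx), frame)
    cap.release()
    return idx                                            # number of frames written
```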
As an improvement of the present invention, the step 2: processing and augmenting the data in the training process specifically comprises:
step 201: randomly extracting 32 consecutive video frames from the video as the network input; assuming the video sequence has n frames, if n is less than 32, the first 32-n frames are appended after the n-th frame to complete the sequence;
step 202: randomly cropping a 224×224×3×32 network input tensor at one of five spatial positions of the pictures (namely the four corners and the center), in order to augment the data and train a more robust network model; a minimal sketch of this sampling and cropping is given below.
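A minimal sketch of steps 201-202, assuming the decoded frames of one video are held in a NumPy array of shape (n, 240, 320, 3); the function names and the uniform random choice among the five crop positions are assumptions, not details stated in the text.

```python
# Minimal sketch of steps 201-202 (assumptions: frames of one video are a NumPy
# array of shape (n, 240, 320, 3); the crop position is drawn uniformly from the
# four corners and the center).
import random
import numpy as np

def sample_clip(frames, clip_len=32):
    """Randomly take 32 consecutive frames; pad short videos as in step 201."""
    n = len(frames)
    if n >= clip_len:
        start = random.randint(0, n - clip_len)
        return frames[start:start + clip_len]
    clip = frames
    while len(clip) < clip_len:           # append the first 32-n frames after the n-th frame
        clip = np.concatenate([clip, frames[:clip_len - len(clip)]], axis=0)
    return clip

def five_position_crop(clip, size=224):
    """Crop a size x size window at one of five spatial positions (step 202)."""
    _, h, w, _ = clip.shape
    positions = [(0, 0), (0, w - size), (h - size, 0),
                 (h - size, w - size), ((h - size) // 2, (w - size) // 2)]
    top, left = random.choice(positions)
    return clip[:, top:top + size, left:left + size, :]   # (32, 224, 224, 3)
```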
As an improvement of the present invention, the attention-branch-guided 3D convolution behavior recognition network in step 3 has the following structure:
convolution layer 1: the 224×224×3×32 input is convolved with 32 convolution kernels of size 3×7×7 at a stride of 2×2×2, and then passes through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 2: the 112×112×32×16 features output by convolution layer 1 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 3: the 112×112×32×16 features output by convolution layer 2 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
pooling layer 1: the 112×112×32×16 features output by convolution layer 3 pass through a 2×2×2 3D max-pooling layer to obtain 56×56×32×8 features;
convolution layer 4: the 56×56×32×8 features output by pooling layer 1 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5: the 56×56×64×8 features output by convolution layer 4 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_1: the 56×56×64×8 features output by convolution layer 5 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_2: the 56×56×64×8 features output by convolution layer 5_1 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 56×56×64×8 features;
convolution layer 5_3: the outputs of convolution layer 5 and convolution layer 5_2 are multiplied element-wise to obtain 56×56×64×8 features;
pooling layer 2: the 56×56×64×8 features output by convolution layer 5_3 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 6: the 28×28×64×4 features output by pooling layer 2 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7: the 28×28×128×4 features output by convolution layer 6 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_1: the 28×28×128×4 features output by convolution layer 7 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_2: the 28×28×128×4 features output by convolution layer 7_1 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 28×28×128×4 features;
convolution layer 7_3: the outputs of convolution layer 7 and convolution layer 7_2 are multiplied element-wise to obtain 28×28×128×4 features;
pooling layer 3: the 28×28×128×4 features output by convolution layer 7_3 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 8: the 14×14×128×2 features output by pooling layer 3 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9: the 14×14×256×2 features output by convolution layer 8 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_1: the 14×14×256×2 features output by convolution layer 9 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_2: the 14×14×256×2 features output by convolution layer 9_1 are convolved with 256 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 14×14×256×2 features;
convolution layer 9_3: the outputs of convolution layer 9 and convolution layer 9_2 are multiplied element-wise to obtain 14×14×256×2 features;
pooling layer 4: the 14×14×256×2 features output by convolution layer 9_3 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
convolution layer 10: the 56×56×64×8 features output by convolution layer 5_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_1: the 56×56×64×8 features output by convolution layer 10 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_2: the 56×56×64×8 features output by convolution layer 10_1 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_3: the 56×56×64×8 features output by convolution layer 10_2 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_4: the 56×56×64×8 features output by convolution layer 10_3 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
pooling layer 10_5: the 56×56×64×8 features output by convolution layer 10_4 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 11: the 28×28×128×4 features output by convolution layer 7_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×64×4 features;
aggregation layer 11_0: the output of pooling layer 10_5 and the output of convolution layer 11 are concatenated along the channel dimension to obtain 28×28×128×4 features;
convolution layer 11_1: the 28×28×128×4 features output by aggregation layer 11_0 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_2: the 28×28×128×4 features output by convolution layer 11_1 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_3: the 28×28×128×4 features output by convolution layer 11_2 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_4: the 28×28×128×4 features output by convolution layer 11_3 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
pooling layer 11_5: the 28×28×128×4 features output by convolution layer 11_4 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 12: the 14×14×256×2 features output by convolution layer 9_2 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×128×2 features;
aggregation layer 12_0: the output of pooling layer 11_5 and the output of convolution layer 12 are concatenated along the channel dimension to obtain 14×14×256×2 features;
convolution layer 12_1: the 14×14×256×2 features output by aggregation layer 12_0 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_2: the 14×14×256×2 features output by convolution layer 12_1 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_3: the 14×14×256×2 features output by convolution layer 12_2 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_4: the 14×14×256×2 features output by convolution layer 12_3 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
pooling layer 12_5: the 14×14×256×2 features output by convolution layer 12_4 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
aggregation layer 13: the output of pooling layer 4 and the output of pooling layer 12_5 are concatenated along the channel dimension to obtain 1×1×512×1 features;
convolution layer 14: the 1×1×512×1 features output by aggregation layer 13 are convolved with P 1×1×1 convolution kernels to obtain 1×1×P×1 features;
classification layer 15: the 1×1×P×1 output of convolution layer 14 is converted into a P-dimensional feature vector, which serves as the output of the network. The network model contains 3D attention modules at different resolutions, which help the network focus on informative features in time and space, while the information from these attention modules is further learned through different types of convolution to help the network learn more robust spatio-temporal features. A minimal sketch of one such attention block is given below.
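As a concrete illustration, below is a minimal PyTorch sketch of one attention block at the 56×56 resolution (convolution layers 5, 5_1, 5_2 and 5_3). It is a sketch under the shape description above, not the patented implementation: padding choices, module names and the (N, C, T, H, W) tensor layout are assumptions.

```python
# Hedged PyTorch sketch of one attention block (convolution layers 5, 5_1, 5_2, 5_3).
# Tensors use PyTorch's (N, C, T, H, W) layout, so the 56x56x64x8 feature in the text
# corresponds to C=64, T=8, H=W=56. Padding choices and module names are assumptions.
import torch
import torch.nn as nn

class AttentionBlock3D(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # convolution layer 5: 3x3x3 convolution + BN + ReLU
        self.conv5 = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True))
        # convolution layer 5_1: 1x3x3 convolution + BN + ReLU
        self.conv5_1 = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True))
        # convolution layer 5_2: 1x1x1 convolution + BN + Sigmoid -> attention map
        self.conv5_2 = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1),
            nn.BatchNorm3d(channels), nn.Sigmoid())

    def forward(self, x):
        f = self.conv5(x)                  # 56x56x64x8 features
        a = self.conv5_2(self.conv5_1(f))  # attention weights in (0, 1)
        return f * a, a                    # layer 5_3: element-wise product; `a` also feeds convolution layer 10
```

For example, `feat, attn = AttentionBlock3D(64)(torch.randn(1, 64, 8, 56, 56))` yields the attention-weighted features that enter pooling layer 2 and the attention map that enters convolution layer 10; the 28×28 and 14×14 blocks (layers 7_x and 9_x) follow the same pattern with 128 and 256 channels.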
As an improvement of the present invention, in step 4, the training data are sent to the network to train the network established in step 3, which specifically comprises the following steps:
step 401: the data generated in step 202 are input into the network model established in step 3.
Step 402: the parameters of the network are trained based on the given labels. The parameters of the deep network model in step 3 are denoted Θ and the output of the network is denoted Cl. The network is trained with the cross-entropy loss function:
L(\Theta) = -\sum_{j=1}^{P} l_j \log(Cl_j)
where l_j is the one-hot label and Cl_j the predicted probability of the j-th class.
Step 403: the network is trained using stochastic gradient descent (SGD). After a given number of training iterations, the parameters of the model are saved. A minimal training-loop sketch is given below.
As an improvement of the present invention, step 5: testing the model trained in step 4 specifically comprises the following steps:
step 501: if the test video sequence has fewer than 32 frames, it is completed as in step 201, then input into the model saved in step 403, and the output is taken as the classification result; if the test video sequence has 32 frames or more, jump to step 502;
step 502: if the video sequence has 32 frames or more, the video is input into the model saved in step 403 clip by clip, one clip per 32 frames; the outputs are summed, and the class with the highest probability is selected as the classification result. A sketch of this test procedure follows.
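A minimal sketch of steps 501-502, assuming the test frames are already resized and cropped to 224×224 and reusing the `sample_clip` helper sketched for step 201; the tensor layout and model loading are assumptions.

```python
# Hedged sketch of steps 501-502 (assumptions: frames are a NumPy array of shape
# (n, 224, 224, 3); `sample_clip` is the padding helper sketched earlier).
import torch

@torch.no_grad()
def predict(model, frames, clip_len=32, device="cuda"):
    model = model.to(device).eval()
    n = len(frames)
    if n < clip_len:
        clips = [sample_clip(frames, clip_len)]             # step 501: pad as in step 201
    else:
        clips = [frames[i:i + clip_len]                     # step 502: one clip per 32 frames
                 for i in range(0, n - clip_len + 1, clip_len)]
    scores = None
    for clip in clips:
        x = torch.from_numpy(clip).float().permute(3, 0, 1, 2).unsqueeze(0).to(device)  # (1, C, T, H, W)
        out = model(x)                                      # (1, P) class scores
        scores = out if scores is None else scores + out    # sum the clip outputs
    return int(scores.argmax(dim=1))                        # class with the highest summed score
```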
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the network contains 3D attention mechanisms at multiple resolutions, which helps the network attend to spatio-temporal features at different resolutions and thus extract more robust spatio-temporal information;
(2) the invention further learns the changes of the attention information through convolution to guide the original 3D convolution layers, thereby improving the robustness of the prediction;
(3) the two branches of the scheme use different convolution types and different depths, so that complementary spatio-temporal information is acquired.
Drawings
FIG. 1 is a diagram of a convolutional network model framework in the present invention.
Detailed Description
The invention will be further described below with reference to specific embodiments and the accompanying drawings; it should be understood that the preferred embodiments described here are only for illustration and explanation and are not intended to limit the scope of the invention.
Example 1: referring to fig. 1, an attention-branch-guided 3D convolution behavior recognition network method comprises the following steps:
step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing;
step 2: in the training process, processing and augmenting the data online according to given rules;
step 3: establishing the attention-branch-guided 3D convolution behavior recognition network;
step 4: feeding the training data into the network and training the network established in step 3;
step 5: testing the model trained in step 4.
The step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing, specifically comprises the following steps:
step 101: determining the P behavior categories to be predicted, and building the data set by searching the web for each category or by shooting each category in multiple scenes.
Step 102: generating a training set and a test set for each category of the data set produced in step 101 at a ratio of 5:1. If the training set has m samples, the test set has n = m/5 samples. The training set is denoted T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i denotes the video sequence of the i-th sample and l_i is its label. Correspondingly, the test set is denoted T_test = {y_1, y_2, ..., y_k}.
Step 103: if the collected data are videos, decoding each video sequence into pictures, resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the videos are processed, jumping to step 2; if the collected data are pictures, jumping to step 104.
Step 104: resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the pictures are processed, jumping to step 2.
The step 2: processing and augmenting the data in the training process specifically comprises:
step 201: randomly extracting 32 consecutive video frames from the video as the network input; assuming the video sequence has n frames, if n is less than 32, the first 32-n frames are appended after the n-th frame to complete the sequence;
step 202: randomly cropping a 224×224×3×32 network input tensor at one of five spatial positions of the pictures (namely the four corners and the center).
The 3D convolution behavior recognition network guided by the attention branches in step 3 is structured as follows:
convolution layer 1: the 224×224×3×32 input is convolved with 32 convolution kernels of size 3×7×7 at a stride of 2×2×2, and then passes through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 2: the 112×112×32×16 features output by convolution layer 1 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 3: the 112×112×32×16 features output by convolution layer 2 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
pooling layer 1: the 112×112×32×16 features output by convolution layer 3 pass through a 2×2×2 3D max-pooling layer to obtain 56×56×32×8 features;
convolution layer 4: the 56×56×32×8 features output by pooling layer 1 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5: the 56×56×64×8 features output by convolution layer 4 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_1: the 56×56×64×8 features output by convolution layer 5 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_2: the 56×56×64×8 features output by convolution layer 5_1 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 56×56×64×8 features;
convolution layer 5_3: the outputs of convolution layer 5 and convolution layer 5_2 are multiplied element-wise to obtain 56×56×64×8 features;
pooling layer 2: the 56×56×64×8 features output by convolution layer 5_3 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 6: the 28×28×64×4 features output by pooling layer 2 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7: the 28×28×128×4 features output by convolution layer 6 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_1: the 28×28×128×4 features output by convolution layer 7 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_2: the 28×28×128×4 features output by convolution layer 7_1 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 28×28×128×4 features;
convolution layer 7_3: the outputs of convolution layer 7 and convolution layer 7_2 are multiplied element-wise to obtain 28×28×128×4 features;
pooling layer 3: the 28×28×128×4 features output by convolution layer 7_3 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 8: the 14×14×128×2 features output by pooling layer 3 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9: the 14×14×256×2 features output by convolution layer 8 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_1: the 14×14×256×2 features output by convolution layer 9 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_2: the 14×14×256×2 features output by convolution layer 9_1 are convolved with 256 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 14×14×256×2 features;
convolution layer 9_3: the outputs of convolution layer 9 and convolution layer 9_2 are multiplied element-wise to obtain 14×14×256×2 features;
pooling layer 4: the 14×14×256×2 features output by convolution layer 9_3 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
convolution layer 10: the 56×56×64×8 features output by convolution layer 5_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_1: the 56×56×64×8 features output by convolution layer 10 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_2: the 56×56×64×8 features output by convolution layer 10_1 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_3: the 56×56×64×8 features output by convolution layer 10_2 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_4: the 56×56×64×8 features output by convolution layer 10_3 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
pooling layer 10_5: the 56×56×64×8 features output by convolution layer 10_4 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 11: the 28×28×128×4 features output by convolution layer 7_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×64×4 features;
aggregation layer 11_0: the output of pooling layer 10_5 and the output of convolution layer 11 are concatenated along the channel dimension to obtain 28×28×128×4 features;
convolution layer 11_1: the 28×28×128×4 features output by aggregation layer 11_0 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_2: the 28×28×128×4 features output by convolution layer 11_1 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_3: the 28×28×128×4 features output by convolution layer 11_2 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_4: the 28×28×128×4 features output by convolution layer 11_3 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
pooling layer 11_5: the 28×28×128×4 features output by convolution layer 11_4 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 12: the 14×14×256×2 features output by convolution layer 9_2 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×128×2 features;
aggregation layer 12_0: the output of pooling layer 11_5 and the output of convolution layer 12 are concatenated along the channel dimension to obtain 14×14×256×2 features;
convolution layer 12_1: the 14×14×256×2 features output by aggregation layer 12_0 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_2: the 14×14×256×2 features output by convolution layer 12_1 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_3: the 14×14×256×2 features output by convolution layer 12_2 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_4: the 14×14×256×2 features output by convolution layer 12_3 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
pooling layer 12_5: the 14×14×256×2 features output by convolution layer 12_4 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
aggregation layer 13: the output of pooling layer 4 and the output of pooling layer 12_5 are concatenated along the channel dimension to obtain 1×1×512×1 features;
convolution layer 14: the 1×1×512×1 features output by aggregation layer 13 are convolved with P 1×1×1 convolution kernels to obtain 1×1×P×1 features;
classification layer 15: the 1×1×P×1 output of convolution layer 14 is converted into a P-dimensional feature vector as the output of the network (a minimal sketch of the two-branch fusion head follows);
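To make the two-branch fusion concrete, here is a minimal PyTorch sketch of pooling layers 4 and 12_5, aggregation layer 13, convolution layer 14 and classification layer 15; the module name `FusionHead` and the (N, C, T, H, W) tensor layout are assumptions, not details from the text.

```python
# Hedged PyTorch sketch of the fusion head: pooling layers 4 and 12_5 (3D adaptive
# average pooling), aggregation layer 13 (channel concatenation), convolution layer 14
# (P 1x1x1 kernels) and classification layer 15 (flatten to a P-dimensional vector).
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, num_classes, branch_channels=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                          # pooling layers 4 and 12_5
        self.classifier = nn.Conv3d(2 * branch_channels, num_classes, kernel_size=1)  # convolution layer 14

    def forward(self, main_feat, attn_feat):
        # main_feat: output of convolution layer 9_3, attn_feat: output of convolution layer 12_4,
        # both of shape (N, 256, 2, 14, 14) in this layout
        fused = torch.cat([self.pool(main_feat), self.pool(attn_feat)], dim=1)  # aggregation layer 13: (N, 512, 1, 1, 1)
        return self.classifier(fused).flatten(1)                     # classification layer 15: (N, P) scores
```

With `head = FusionHead(num_classes=P)`, `head(main_feat, attn_feat)` produces the P-dimensional vector that is fed to the cross-entropy loss in step 402.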
In step 4, the training data are sent to the network to train the network established in step 3, which specifically comprises the following steps:
step 401: the data generated in step 202 are input into the network model established in step 3.
Step 402: the parameters of the network are trained based on the given labels. The parameters of the deep network model in step 3 are denoted Θ and the output of the network is denoted Cl. The network is trained with the cross-entropy loss function:
L(\Theta) = -\sum_{j=1}^{P} l_j \log(Cl_j)
where l_j is the one-hot label and Cl_j the predicted probability of the j-th class.
Step 403: the network is trained using stochastic gradient descent (SGD). After a given number of training iterations, the parameters of the model are saved.
Step 5: testing the model trained in step 4 specifically comprises the following steps:
step 501: if the test video sequence has fewer than 32 frames, it is completed as in step 201, then input into the model saved in step 403, and the output is taken as the classification result; if the test video sequence has 32 frames or more, jump to step 502;
step 502: if the video sequence has 32 frames or more, the video is input into the model saved in step 403 clip by clip, one clip per 32 frames; the outputs are summed, and the class with the highest probability is selected as the classification result.
The foregoing is only a preferred embodiment of the invention; it should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention.

Claims (3)

1. An attention-branch-guided 3D convolution behavior recognition network method, the method comprising the following steps:
step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing;
step 2: in the training process, processing and augmenting the data online according to given rules;
step 3: establishing the attention-branch-guided 3D convolution behavior recognition network;
step 4: feeding the training data into the network and training the network established in step 3;
step 5: testing the model trained in step 4;
the step 1: creating training and test data sets: acquiring behavior video sequences through web collection or shooting, and producing data sets for training and testing, specifically comprises the following steps:
step 101: determining the P behavior categories to be predicted, and building the data set by searching the web for each category or by shooting each category in multiple scenes;
step 102: generating a training set and a test set for each category of the data set produced in step 101 at a ratio of 5:1; if the training set has m samples, the test set has n = m/5 samples; the training set is denoted T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i denotes the video sequence of the i-th sample and l_i is its label; correspondingly, the test set is denoted T_test = {y_1, y_2, ..., y_k};
step 103: if the collected data are videos, decoding each video sequence into pictures, resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the videos are processed, jumping to step 2; if the collected data are pictures, jumping to step 104;
step 104: resizing the pictures to 240×320, numbering them sequentially as img000001.jpg, img000002.jpg, ..., and saving them in a local folder; after the pictures are processed, jumping to step 2;
the step 2: processing and augmenting the data in the training process specifically comprises:
step 201: randomly extracting 32 consecutive video frames from the video as the network input; assuming the video sequence has n frames, if n is less than 32, the first 32-n frames are appended after the n-th frame to complete the sequence;
step 202: randomly cropping a 224×224×3×32 network input tensor at one of five spatial positions of the pictures (namely the four corners and the center);
the 3D convolution behavior recognition network guided by the attention branches in step 3 is structured as follows:
convolution layer 1: the 224×224×3×32 input is convolved with 32 convolution kernels of size 3×7×7 at a stride of 2×2×2, and then passes through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 2: the 112×112×32×16 features output by convolution layer 1 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
convolution layer 3: the 112×112×32×16 features output by convolution layer 2 are convolved with 32 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 112×112×32×16 features;
pooling layer 1: the 112×112×32×16 features output by convolution layer 3 pass through a 2×2×2 3D max-pooling layer to obtain 56×56×32×8 features;
convolution layer 4: the 56×56×32×8 features output by pooling layer 1 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5: the 56×56×64×8 features output by convolution layer 4 are convolved with 64 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_1: the 56×56×64×8 features output by convolution layer 5 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 5_2: the 56×56×64×8 features output by convolution layer 5_1 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 56×56×64×8 features;
convolution layer 5_3: the outputs of convolution layer 5 and convolution layer 5_2 are multiplied element-wise to obtain 56×56×64×8 features;
pooling layer 2: the 56×56×64×8 features output by convolution layer 5_3 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 6: the 28×28×64×4 features output by pooling layer 2 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7: the 28×28×128×4 features output by convolution layer 6 are convolved with 128 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_1: the 28×28×128×4 features output by convolution layer 7 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 7_2: the 28×28×128×4 features output by convolution layer 7_1 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 28×28×128×4 features;
convolution layer 7_3: the outputs of convolution layer 7 and convolution layer 7_2 are multiplied element-wise to obtain 28×28×128×4 features;
pooling layer 3: the 28×28×128×4 features output by convolution layer 7_3 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 8: the 14×14×128×2 features output by pooling layer 3 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9: the 14×14×256×2 features output by convolution layer 8 are convolved with 256 3×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_1: the 14×14×256×2 features output by convolution layer 9 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 9_2: the 14×14×256×2 features output by convolution layer 9_1 are convolved with 256 1×1×1 convolution kernels, and then pass through a BN layer and a Sigmoid layer to obtain 14×14×256×2 features;
convolution layer 9_3: the outputs of convolution layer 9 and convolution layer 9_2 are multiplied element-wise to obtain 14×14×256×2 features;
pooling layer 4: the 14×14×256×2 features output by convolution layer 9_3 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
convolution layer 10: the 56×56×64×8 features output by convolution layer 5_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_1: the 56×56×64×8 features output by convolution layer 10 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_2: the 56×56×64×8 features output by convolution layer 10_1 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_3: the 56×56×64×8 features output by convolution layer 10_2 are convolved with 64 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
convolution layer 10_4: the 56×56×64×8 features output by convolution layer 10_3 are convolved with 64 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 56×56×64×8 features;
pooling layer 10_5: the 56×56×64×8 features output by convolution layer 10_4 pass through a 2×2×2 3D max-pooling layer to obtain 28×28×64×4 features;
convolution layer 11: the 28×28×128×4 features output by convolution layer 7_2 are convolved with 64 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×64×4 features;
aggregation layer 11_0: the output of pooling layer 10_5 and the output of convolution layer 11 are concatenated along the channel dimension to obtain 28×28×128×4 features;
convolution layer 11_1: the 28×28×128×4 features output by aggregation layer 11_0 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_2: the 28×28×128×4 features output by convolution layer 11_1 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_3: the 28×28×128×4 features output by convolution layer 11_2 are convolved with 128 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
convolution layer 11_4: the 28×28×128×4 features output by convolution layer 11_3 are convolved with 128 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 28×28×128×4 features;
pooling layer 11_5: the 28×28×128×4 features output by convolution layer 11_4 pass through a 2×2×2 3D max-pooling layer to obtain 14×14×128×2 features;
convolution layer 12: the 14×14×256×2 features output by convolution layer 9_2 are convolved with 128 1×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×128×2 features;
aggregation layer 12_0: the output of pooling layer 11_5 and the output of convolution layer 12 are concatenated along the channel dimension to obtain 14×14×256×2 features;
convolution layer 12_1: the 14×14×256×2 features output by aggregation layer 12_0 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_2: the 14×14×256×2 features output by convolution layer 12_1 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_3: the 14×14×256×2 features output by convolution layer 12_2 are convolved with 256 1×3×3 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
convolution layer 12_4: the 14×14×256×2 features output by convolution layer 12_3 are convolved with 256 3×1×1 convolution kernels, and then pass through a BN layer and a ReLU layer to obtain 14×14×256×2 features;
pooling layer 12_5: the 14×14×256×2 features output by convolution layer 12_4 pass through a 1×1×1 3D adaptive average pooling layer to obtain 1×1×256×1 features;
aggregation layer 13: the output of pooling layer 4 and the output of pooling layer 12_5 are concatenated along the channel dimension to obtain 1×1×512×1 features;
convolution layer 14: the 1×1×512×1 features output by aggregation layer 13 are convolved with P 1×1×1 convolution kernels to obtain 1×1×P×1 features;
classification layer 15: the 1×1×P×1 output of convolution layer 14 is converted into a P-dimensional feature vector, which serves as the output of the network.
2. The attention-branch-guided 3D convolution behavior recognition network method according to claim 1, wherein in step 4 the training data are sent to the network to train the network established in step 3, specifically comprising the following steps:
step 401: inputting the data generated in step 202 into the network model established in step 3;
step 402: training the parameters of the network based on the given labels, the parameters of the deep network model in step 3 being denoted Θ and the output of the network being denoted Cl, the network being trained with the cross-entropy loss function:
L(\Theta) = -\sum_{j=1}^{P} l_j \log(Cl_j)
where l_j is the one-hot label and Cl_j the predicted probability of the j-th class;
step 403: training the network using stochastic gradient descent (SGD), and saving the parameters of the model after a given number of training iterations.
3. The attention-branch-guided 3D convolution behavior recognition network method according to claim 2, wherein step 5: testing the model trained in step 4 specifically comprises the following steps:
step 501: if the test video sequence has fewer than 32 frames, completing the video sequence as in step 201, inputting it into the model saved in step 403, and taking the output as the classification result; if the test video sequence has 32 frames or more, jumping to step 502;
step 502: if the video sequence has 32 frames or more, inputting the video into the model saved in step 403 clip by clip, one clip per 32 frames, summing the outputs, and selecting the class with the highest probability as the classification result.
CN201910984496.3A 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches Active CN110688986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910984496.3A CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910984496.3A CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Publications (2)

Publication Number Publication Date
CN110688986A CN110688986A (en) 2020-01-14
CN110688986B (en) 2023-06-23

Family

ID=69112957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910984496.3A Active CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Country Status (1)

Country Link
CN (1) CN110688986B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185543A (en) * 2020-09-04 2021-01-05 南京信息工程大学 Construction method of medical induction data flow classification model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9940539B2 (en) * 2015-05-08 2018-04-10 Samsung Electronics Co., Ltd. Object recognition apparatus and method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks

Also Published As

Publication number Publication date
CN110688986A (en) 2020-01-14

Similar Documents

Publication Publication Date Title
CN106960206B (en) Character recognition method and character recognition system
CN109754015B (en) Neural networks for drawing multi-label recognition and related methods, media and devices
Hoang Ngan Le et al. Robust hand detection and classification in vehicles and in the wild
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
US11797845B2 (en) Model learning device, model learning method, and program
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
CN113065550B (en) Text recognition method based on self-attention mechanism
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN111126401A (en) License plate character recognition method based on context information
CN117373111A (en) AutoHOINet-based human-object interaction detection method
CN114821058A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN111079665A (en) Morse code automatic identification method based on Bi-LSTM neural network
CN116071817A (en) Network architecture and training method of gesture recognition system for automobile cabin
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN110688986B (en) 3D convolution behavior recognition network method guided by attention branches
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN114241564A (en) Facial expression recognition method based on inter-class difference strengthening network
CN110991219B (en) Behavior identification method based on two-way 3D convolution network
Revi et al. Gan-generated fake face image detection using opponent color local binary pattern and deep learning technique
CN111242114B (en) Character recognition method and device
CN117315752A (en) Training method, device, equipment and medium for face emotion recognition network model
CN116311504A (en) Small sample behavior recognition method, system and equipment
Xuan et al. Scalable fine-grained generated image classification based on deep metric learning
CN115393743A (en) Vehicle detection method based on double-branch encoding and decoding network, unmanned aerial vehicle and medium
CN114241573A (en) Facial micro-expression recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant