CN110688986A - Attention branch guided 3D convolution behavior recognition network method - Google Patents

Attention branch guided 3D convolution behavior recognition network method

Info

Publication number
CN110688986A
Authority
CN
China
Prior art keywords
layer
features
convolutional
output
convolutional layer
Prior art date
Legal status
Granted
Application number
CN201910984496.3A
Other languages
Chinese (zh)
Other versions
CN110688986B (en)
Inventor
成锋娜
周宏平
茹煜
张玉言
Current Assignee
Nanjing Forestry University
Original Assignee
Nanjing Forestry University
Priority date
Filing date
Publication date
Application filed by Nanjing Forestry University filed Critical Nanjing Forestry University
Priority to CN201910984496.3A priority Critical patent/CN110688986B/en
Publication of CN110688986A publication Critical patent/CN110688986A/en
Application granted granted Critical
Publication of CN110688986B publication Critical patent/CN110688986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/59: Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V 20/597: Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an attention-branch-guided 3D convolutional behavior recognition network method. The network designs 3D attention mechanisms at different resolutions so that it attends to the spatiotemporal information of greatest interest. Meanwhile, the variation of the spatiotemporal elements in the attention features is learned by convolution to help the 3D branch extract more robust spatiotemporal features. In addition, the two branches learn with different types of convolution and different depths, which helps the network build complementary information. The method has a small number of parameters and high robustness, and can be used for behavior recognition in public places such as schools and shopping malls.

Description

Attention branch guided 3D convolution behavior recognition network method
Technical Field
The invention relates to a 3D convolution behavior recognition network method guided by attention branches, and belongs to the technical field of image processing and pattern recognition.
Background
Behavior recognition has extremely wide application in fields such as public safety, medical care, education, and entertainment. Consider, for example, its application to driving assistance: the driver's state is critical to safe driving. In a driving-assistance system, in-vehicle video monitoring can detect and recognize various driver behaviors and thereby help ensure driving safety. First, it can detect undesirable behaviors during driving, such as fatigued driving, smoking, or passengers talking with the driver; if the driver is detected to be in such a state, the vehicle can issue an alarm to alert the driver or the relevant authorities and prevent a traffic accident. Second, it can improve driver comfort and reduce driver distraction: a driving-assistance system can suggest seat adjustments based on the driver's posture or movements, and can carry out everyday operations such as answering a call or skipping a song by recognizing the actions of other passengers. Although much good work has been proposed in the field of behavior recognition, its robustness still falls short of practical requirements, so more robust behavior recognition algorithms are still needed.
Disclosure of Invention
To solve the above problems, the invention provides an attention-branch-guided 3D convolutional behavior recognition network method that adopts a multi-resolution attention mechanism to construct a robust behavior recognition approach.
To achieve this purpose, the invention provides the following technical solution:
An attention-branch-directed 3D convolutional behavior recognition network method, comprising the following steps:
Step 1: build the training and testing data sets; collect behavior video sequences from the internet or by filming, and generate the data sets used for training and testing;
Step 2: during training, process and augment the data online according to given rules;
Step 3: build the attention-branch-guided 3D convolutional behavior recognition network;
Step 4: feed the training data into the network and train the network built in step 3;
Step 5: test the model trained in step 4.
As an improvement of the present invention, step 1 (building the training and testing data sets: collecting behavior video sequences from the internet or by filming, and generating the data sets used for training and testing) specifically comprises the following steps:
Step 101: determine the P behavior categories to be predicted, and build the data set by searching the internet for each category or by filming each category in multiple scenes;
Step 102: for each category of the data set built in step 101, generate a training set and a test set at a ratio of 5:1; if the training set has m samples, the test set has n = m/5 samples. Denote the training set as T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i is the video sequence of the i-th sample, and l_i is its label. Accordingly, T_test = {y_1, y_2, ..., y_k} is the test set;
Step 103: if the collected data are videos, decode each video sequence into pictures, resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the videos are processed, jump to step 2. If the collected data are pictures, jump to step 104;
Step 104: resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the pictures are processed, jump to step 2. This step normalizes the size and naming of the data, which facilitates the data processing and augmentation in step 2 (a preparation sketch is given below).
As an improvement of the present invention, step 2 (processing and augmenting the data during training) specifically comprises the following steps:
Step 201: randomly extract 32 consecutive video frames from a video as the network input; suppose the video sequence has n frames; if n is less than 32, append the first 32-n frames to the end of the sequence as padding;
Step 202: randomly crop a 224 × 224 × 32 network-input tensor from one of five spatial positions (the four corners and the center) in order to augment the data and train a more robust network model, as sketched below.
As an improvement of the present invention, the structure of the attention-branch-guided 3D convolutional behavior recognition network in step 3 is as follows:
Convolutional layer 1: convolve the 224 × 224 × 3 × 32 input with 32 convolution kernels of size 3 × 7 × 7 and stride 2 × 2 × 2, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 2: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 1 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 3: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 2 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Pooling layer 1: pass the 112 × 112 × 32 × 16 features output by convolutional layer 3 through a 2 × 2 × 2 3D max pooling layer to obtain 56 × 56 × 32 × 8 features;
Convolutional layer 4: convolve the 56 × 56 × 32 × 8 features output by pooling layer 1 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 4 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_1 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_3: multiply the outputs of convolutional layer 5 and convolutional layer 5_2 element-wise to obtain 56 × 56 × 64 × 8 features;
Pooling layer 2: pass the 56 × 56 × 64 × 8 features output by convolutional layer 5_3 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 6: convolve the 28 × 28 × 64 × 4 features output by pooling layer 2 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 6 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_1: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_1 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_3: multiply the outputs of convolutional layer 7 and convolutional layer 7_2 element-wise to obtain 28 × 28 × 128 × 4 features;
Pooling layer 3: pass the 28 × 28 × 128 × 4 features output by convolutional layer 7_3 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 8: convolve the 14 × 14 × 128 × 2 features output by pooling layer 3 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 8 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_1: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_1 with 256 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_3: multiply the outputs of convolutional layer 9 and convolutional layer 9_2 element-wise to obtain 14 × 14 × 256 × 2 features;
Pooling layer 4: pass the 14 × 14 × 256 × 2 features output by convolutional layer 9_3 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Convolutional layer 10: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_1 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_3: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_2 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_4: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_3 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Pooling layer 10_5: pass the 56 × 56 × 64 × 8 features output by convolutional layer 10_4 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 11: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 64 × 4 features;
Aggregation layer 11_0: concatenate the output of pooling layer 10_5 and the output of convolutional layer 11 along the channel dimension to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_1: convolve the 28 × 28 × 128 × 4 features output by aggregation layer 11_0 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_1 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_3: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_2 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_4: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_3 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Pooling layer 11_5: pass the 28 × 28 × 128 × 4 features output by convolutional layer 11_4 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 12: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_2 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 128 × 2 features;
Aggregation layer 12_0: concatenate the output of pooling layer 11_5 and the output of convolutional layer 12 along the channel dimension to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_1: convolve the 14 × 14 × 256 × 2 features output by aggregation layer 12_0 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_1 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_3: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_2 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_4: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_3 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Pooling layer 12_5: pass the 14 × 14 × 256 × 2 features output by convolutional layer 12_4 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Aggregation layer 13: concatenate the output of pooling layer 4 and the output of pooling layer 12_5 along the channel dimension to obtain 1 × 1 × 512 × 1 features;
Convolutional layer 14: convolve the 1 × 1 × 512 × 1 features output by aggregation layer 13 with P convolution kernels of size 1 × 1 × 1 to obtain 1 × 1 × P × 1 features;
Classification layer 15: convert the 1 × 1 × P × 1 output of convolutional layer 14 into a P-dimensional feature vector as the output of the network. The network model contains 3D attention modules at different resolutions, which help the network focus on features of interest in both space and time; the information from the convolutional spatiotemporal attention modules is further learned through different types of convolution to help the network learn more robust spatiotemporal features.
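To make the attention mechanism concrete, here is a hedged PyTorch sketch (not the authors' code) of the attention gate formed by convolutional layers 5_1, 5_2 and 5_3; tensors are assumed to be in N × C × T × H × W order and kernel sizes are written as (kT, kH, kW). The same pattern repeats at layers 7_1-7_3 and 9_1-9_3 with more channels, and the sigmoid attention map is also the tensor that feeds convolutional layers 10 and 11 of the attention branch.

```python
# Sketch of one attention gate (layers x_1, x_2, x_3); the module and variable
# names are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionGate3D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # layer x_1: 1x3x3 convolution + BN + ReLU
        self.spatial = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
        )
        # layer x_2: 1x1x1 convolution + BN + Sigmoid -> attention map in (0, 1)
        self.gate = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1),
            nn.BatchNorm3d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        attn = self.gate(self.spatial(x))   # same shape as x
        gated = x * attn                    # layer x_3: element-wise product
        return gated, attn                  # attn also feeds the attention branch

# usage after convolutional layer 5 (64 channels, 8 frames, 56 x 56 spatial)
gate5 = AttentionGate3D(64)
feat = torch.randn(1, 64, 8, 56, 56)
gated, attn = gate5(feat)                   # both 1 x 64 x 8 x 56 x 56
```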
As an improvement of the present invention, in step 4 the training data are fed into the network and the network built in step 3 is trained, specifically as follows:
Step 401: input the data generated in step 202 into the network model built in step 3.
Step 402: train the parameters of the network with the given labels. Denote the parameters of the deep network model of step 3 as Θ and the output of the network as C_l. Train the network using the cross-entropy loss function:
Loss(Θ) = − Σ_i log p(l_i | Seq_i; Θ), where p(l_i | Seq_i; Θ) is the softmax probability that the network output C_l assigns to the true label l_i of the i-th training sample.
step 403: the network was trained using a random gradient descent method (SGD). After a given number of training sessions, the parameters of the model are saved.
As an improvement of the present invention, step 5 (testing the model trained in step 4) specifically comprises the following steps:
Step 501: if the test video sequence has fewer than 32 frames, pad it as in step 201, input it into the model saved in step 403, and take the output as the classification result; if the test video sequence has 32 frames or more, jump to step 502;
Step 502: if the video sequence has 32 frames or more, split the video into consecutive 32-frame clips, input each clip into the model saved in step 403 in turn, sum the outputs, and take the class with the highest probability as the classification result.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the network designs multi-resolution 3D attention mechanisms that help it attend to spatiotemporal features at different resolutions, thereby extracting more robust spatiotemporal information;
(2) the method further learns the variation of the attention information through convolution to guide the original 3D convolutional layers and improve the robustness of prediction;
(3) the two branches adopt different convolution types and different depths, so that complementary spatiotemporal information is obtained.
Drawings
FIG. 1 is a diagram of a convolutional network model framework in the present invention.
Detailed Description
The present invention will be further described with reference to the following detailed description and the accompanying drawings, it being understood that the preferred embodiments described herein are merely illustrative and explanatory of the invention and are not restrictive thereof.
Example 1: referring to FIG. 1, an attention-branch-directed 3D convolutional behavior recognition network method comprises the following steps:
Step 1: build the training and testing data sets; collect behavior video sequences from the internet or by filming, and generate the data sets used for training and testing;
Step 2: during training, process and augment the data online according to given rules;
Step 3: build the attention-branch-guided 3D convolutional behavior recognition network;
Step 4: feed the training data into the network and train the network built in step 3;
Step 5: test the model trained in step 4.
Step 1 (building the training and testing data sets: collecting behavior video sequences from the internet or by filming, and generating the data sets used for training and testing) specifically comprises the following steps:
Step 101: determine the P behavior categories to be predicted, and build the data set by searching the internet for each category or by filming each category in multiple scenes.
Step 102: for each category of the data set built in step 101, generate a training set and a test set at a ratio of 5:1. If the training set has m samples, the test set has n = m/5 samples. Denote the training set as T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i is the video sequence of the i-th sample, and l_i is its label. Accordingly, T_test = {y_1, y_2, ..., y_k} is the test set.
Step 103: if the collected data are videos, decode each video sequence into pictures, resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the videos are processed, jump to step 2. If the collected data are pictures, jump to step 104.
Step 104: resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the pictures are processed, jump to step 2.
Step 2 (processing and augmenting the data during training) specifically comprises the following steps:
Step 201: randomly extract 32 consecutive video frames from a video as the network input; suppose the video sequence has n frames; if n is less than 32, append the first 32-n frames to the end of the sequence as padding;
Step 202: randomly crop a 224 × 224 × 32 network-input tensor from one of five spatial positions (the four corners and the center).
The structure of the attention-branch-guided 3D convolutional behavior recognition network in step 3 is as follows:
Convolutional layer 1: convolve the 224 × 224 × 3 × 32 input with 32 convolution kernels of size 3 × 7 × 7 and stride 2 × 2 × 2, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 2: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 1 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 3: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 2 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Pooling layer 1: pass the 112 × 112 × 32 × 16 features output by convolutional layer 3 through a 2 × 2 × 2 3D max pooling layer to obtain 56 × 56 × 32 × 8 features;
Convolutional layer 4: convolve the 56 × 56 × 32 × 8 features output by pooling layer 1 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 4 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_1 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_3: multiply the outputs of convolutional layer 5 and convolutional layer 5_2 element-wise to obtain 56 × 56 × 64 × 8 features;
Pooling layer 2: pass the 56 × 56 × 64 × 8 features output by convolutional layer 5_3 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 6: convolve the 28 × 28 × 64 × 4 features output by pooling layer 2 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 6 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_1: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_1 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_3: multiply the outputs of convolutional layer 7 and convolutional layer 7_2 element-wise to obtain 28 × 28 × 128 × 4 features;
Pooling layer 3: pass the 28 × 28 × 128 × 4 features output by convolutional layer 7_3 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 8: convolve the 14 × 14 × 128 × 2 features output by pooling layer 3 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 8 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_1: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_1 with 256 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_3: multiply the outputs of convolutional layer 9 and convolutional layer 9_2 element-wise to obtain 14 × 14 × 256 × 2 features;
Pooling layer 4: pass the 14 × 14 × 256 × 2 features output by convolutional layer 9_3 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Convolutional layer 10: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_1 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_3: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_2 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_4: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_3 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Pooling layer 10_5: pass the 56 × 56 × 64 × 8 features output by convolutional layer 10_4 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 11: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 64 × 4 features;
Aggregation layer 11_0: concatenate the output of pooling layer 10_5 and the output of convolutional layer 11 along the channel dimension to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_1: convolve the 28 × 28 × 128 × 4 features output by aggregation layer 11_0 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_1 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_3: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_2 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_4: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_3 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Pooling layer 11_5: pass the 28 × 28 × 128 × 4 features output by convolutional layer 11_4 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 12: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_2 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 128 × 2 features;
Aggregation layer 12_0: concatenate the output of pooling layer 11_5 and the output of convolutional layer 12 along the channel dimension to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_1: convolve the 14 × 14 × 256 × 2 features output by aggregation layer 12_0 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_1 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_3: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_2 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_4: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_3 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Pooling layer 12_5: pass the 14 × 14 × 256 × 2 features output by convolutional layer 12_4 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Aggregation layer 13: concatenate the output of pooling layer 4 and the output of pooling layer 12_5 along the channel dimension to obtain 1 × 1 × 512 × 1 features;
Convolutional layer 14: convolve the 1 × 1 × 512 × 1 features output by aggregation layer 13 with P convolution kernels of size 1 × 1 × 1 to obtain 1 × 1 × P × 1 features;
Classification layer 15: convert the 1 × 1 × P × 1 output of convolutional layer 14 into a P-dimensional feature vector as the output of the network.
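As a complement to the layer-by-layer description above, the fusion of the two branches (aggregation layer 13, convolutional layer 14 and classification layer 15) can be sketched in PyTorch as follows; the N × C × T × H × W tensor order and the module name are assumptions made for illustration only.

```python
# Sketch of the two-branch fusion head: global average pooling of both branches,
# channel concatenation, a P-way 1x1x1 convolution, and flattening to a P-vector.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                # pooling layers 4 and 12_5
        self.classifier = nn.Conv3d(512, num_classes, 1)   # convolutional layer 14

    def forward(self, main_feat, attn_feat):
        # main_feat, attn_feat: N x 256 x 2 x 14 x 14 features from the two branches
        fused = torch.cat([self.pool(main_feat), self.pool(attn_feat)], dim=1)  # N x 512 x 1 x 1 x 1
        return self.classifier(fused).flatten(1)           # classification layer 15: N x P
```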
In step 4, the training data are fed into the network and the network built in step 3 is trained, specifically as follows:
Step 401: input the data generated in step 202 into the network model built in step 3.
Step 402: train the parameters of the network with the given labels. Denote the parameters of the deep network model of step 3 as Θ and the output of the network as C_l. Train the network using the cross-entropy loss function:
Loss(Θ) = − Σ_i log p(l_i | Seq_i; Θ), where p(l_i | Seq_i; Θ) is the softmax probability that the network output C_l assigns to the true label l_i of the i-th training sample.
step 403: the network was trained using a random gradient descent method (SGD). After a given number of training sessions, the parameters of the model are saved.
Step 5 (testing the model trained in step 4) specifically comprises the following steps:
Step 501: if the test video sequence has fewer than 32 frames, pad it as in step 201, input it into the model saved in step 403, and take the output as the classification result; if the test video sequence has 32 frames or more, jump to step 502;
Step 502: if the video sequence has 32 frames or more, split the video into consecutive 32-frame clips, input each clip into the model saved in step 403 in turn, sum the outputs, and take the class with the highest probability as the classification result.
The above description covers only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and such modifications are also intended to fall within the scope of the invention.

Claims (6)

1. An attention-branch-directed 3D convolutional behavior recognition network method, comprising the steps of:
Step 1: build the training and testing data sets; collect behavior video sequences from the internet or by filming, and generate the data sets used for training and testing;
Step 2: during training, process and augment the data online according to given rules;
Step 3: build the attention-branch-guided 3D convolutional behavior recognition network;
Step 4: feed the training data into the network and train the network built in step 3;
Step 5: test the model trained in step 4.
2. The attention-branch-directed 3D convolution behavior recognition network method of claim 1,
wherein step 1 (building the training and testing data sets: collecting behavior video sequences from the internet or by filming, and generating the data sets used for training and testing) specifically comprises the following steps:
Step 101: determine the P behavior categories to be predicted, and build the data set by searching the internet for each category or by filming each category in multiple scenes;
Step 102: for each category of the data set built in step 101, generate a training set and a test set at a ratio of 5:1; if the training set has m samples, the test set has n = m/5 samples. Denote the training set as T_train = {x_1, x_2, ..., x_k}, where the i-th sample is x_i = {Seq_i; l_i}, Seq_i is the video sequence of the i-th sample, and l_i is its label. Accordingly, T_test = {y_1, y_2, ..., y_k} is the test set;
Step 103: if the collected data are videos, decode each video sequence into pictures, resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the videos are processed, jump to step 2. If the collected data are pictures, jump to step 104;
Step 104: resize the pictures to 240 × 320, number them sequentially as img000001.jpg, img000002.jpg, ..., and save them to a local folder; after the pictures are processed, jump to step 2.
3. The attention-branch-directed 3D convolutional behavior recognition network method of claim 1, wherein step 2 (processing and augmenting the data during training) specifically comprises the following steps:
Step 201: randomly extract 32 consecutive video frames from a video as the network input; suppose the video sequence has n frames; if n is less than 32, append the first 32-n frames to the end of the sequence as padding;
Step 202: randomly crop a 224 × 224 × 32 network-input tensor from one of five spatial positions (the four corners and the center).
4. The attention-branch-directed 3D convolution behavior recognition network method of claim 1,
wherein the structure of the attention-branch-directed 3D convolutional behavior recognition network in step 3 is as follows:
Convolutional layer 1: convolve the 224 × 224 × 3 × 32 input with 32 convolution kernels of size 3 × 7 × 7 and stride 2 × 2 × 2, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 2: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 1 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Convolutional layer 3: convolve the 112 × 112 × 32 × 16 features output by convolutional layer 2 with 32 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 112 × 112 × 32 × 16 features;
Pooling layer 1: pass the 112 × 112 × 32 × 16 features output by convolutional layer 3 through a 2 × 2 × 2 3D max pooling layer to obtain 56 × 56 × 32 × 8 features;
Convolutional layer 4: convolve the 56 × 56 × 32 × 8 features output by pooling layer 1 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 4 with 64 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_1 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 5_3: multiply the outputs of convolutional layer 5 and convolutional layer 5_2 element-wise to obtain 56 × 56 × 64 × 8 features;
Pooling layer 2: pass the 56 × 56 × 64 × 8 features output by convolutional layer 5_3 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 6: convolve the 28 × 28 × 64 × 4 features output by pooling layer 2 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 6 with 128 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_1: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_1 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 7_3: multiply the outputs of convolutional layer 7 and convolutional layer 7_2 element-wise to obtain 28 × 28 × 128 × 4 features;
Pooling layer 3: pass the 28 × 28 × 128 × 4 features output by convolutional layer 7_3 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 8: convolve the 14 × 14 × 128 × 2 features output by pooling layer 3 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 8 with 256 3 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_1: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_1 with 256 1 × 1 × 1 convolution kernels, then pass through a BN layer and a Sigmoid layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 9_3: multiply the outputs of convolutional layer 9 and convolutional layer 9_2 element-wise to obtain 14 × 14 × 256 × 2 features;
Pooling layer 4: pass the 14 × 14 × 256 × 2 features output by convolutional layer 9_3 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Convolutional layer 10: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 5_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_1: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_2: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_1 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_3: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_2 with 64 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Convolutional layer 10_4: convolve the 56 × 56 × 64 × 8 features output by convolutional layer 10_3 with 64 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 56 × 56 × 64 × 8 features;
Pooling layer 10_5: pass the 56 × 56 × 64 × 8 features output by convolutional layer 10_4 through a 2 × 2 × 2 3D max pooling layer to obtain 28 × 28 × 64 × 4 features;
Convolutional layer 11: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 7_2 with 64 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 64 × 4 features;
Aggregation layer 11_0: concatenate the output of pooling layer 10_5 and the output of convolutional layer 11 along the channel dimension to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_1: convolve the 28 × 28 × 128 × 4 features output by aggregation layer 11_0 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_2: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_1 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_3: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_2 with 128 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Convolutional layer 11_4: convolve the 28 × 28 × 128 × 4 features output by convolutional layer 11_3 with 128 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 28 × 28 × 128 × 4 features;
Pooling layer 11_5: pass the 28 × 28 × 128 × 4 features output by convolutional layer 11_4 through a 2 × 2 × 2 3D max pooling layer to obtain 14 × 14 × 128 × 2 features;
Convolutional layer 12: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 9_2 with 128 1 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 128 × 2 features;
Aggregation layer 12_0: concatenate the output of pooling layer 11_5 and the output of convolutional layer 12 along the channel dimension to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_1: convolve the 14 × 14 × 256 × 2 features output by aggregation layer 12_0 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_2: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_1 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_3: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_2 with 256 1 × 3 × 3 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Convolutional layer 12_4: convolve the 14 × 14 × 256 × 2 features output by convolutional layer 12_3 with 256 3 × 1 × 1 convolution kernels, then pass through a BN layer and a ReLU layer to obtain 14 × 14 × 256 × 2 features;
Pooling layer 12_5: pass the 14 × 14 × 256 × 2 features output by convolutional layer 12_4 through a 3D adaptive average pooling layer with output size 1 × 1 × 1 to obtain 1 × 1 × 256 × 1 features;
Aggregation layer 13: concatenate the output of pooling layer 4 and the output of pooling layer 12_5 along the channel dimension to obtain 1 × 1 × 512 × 1 features;
Convolutional layer 14: convolve the 1 × 1 × 512 × 1 features output by aggregation layer 13 with P convolution kernels of size 1 × 1 × 1 to obtain 1 × 1 × P × 1 features;
Classification layer 15: convert the 1 × 1 × P × 1 output of convolutional layer 14 into a P-dimensional feature vector as the output of the network.
5. The attention-branch-directed 3D convolutional behavior recognition network method of claim 1, wherein in step 4 the training data are fed into the network and the network built in step 3 is trained, specifically as follows:
Step 401: input the data generated in step 202 into the network model built in step 3;
Step 402: train the parameters of the network with the given labels; denote the parameters of the deep network model of step 3 as Θ and the output of the network as C_l; train the network using the cross-entropy loss function:
Loss(Θ) = − Σ_i log p(l_i | Seq_i; Θ), where p(l_i | Seq_i; Θ) is the softmax probability that the network output C_l assigns to the true label l_i of the i-th training sample;
step 403: the network is trained using the stochastic gradient descent method (SGD), and after a given number of training passes, the parameters of the model are saved.
6. The attention-branch-directed 3D convolutional behavior recognition network method of claim 1, wherein step 5 (testing the model trained in step 4) specifically comprises the following steps:
Step 501: if the test video sequence has fewer than 32 frames, pad it as in step 201, input it into the model saved in step 403, and take the output as the classification result; if the test video sequence has 32 frames or more, jump to step 502;
Step 502: if the video sequence has 32 frames or more, split the video into consecutive 32-frame clips, input each clip into the model saved in step 403 in turn, sum the outputs, and take the class with the highest probability as the classification result.
CN201910984496.3A 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches Active CN110688986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910984496.3A CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910984496.3A CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Publications (2)

Publication Number Publication Date
CN110688986A true CN110688986A (en) 2020-01-14
CN110688986B CN110688986B (en) 2023-06-23

Family

ID=69112957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910984496.3A Active CN110688986B (en) 2019-10-16 2019-10-16 3D convolution behavior recognition network method guided by attention branches

Country Status (1)

Country Link
CN (1) CN110688986B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185543A (en) * 2020-09-04 2021-01-05 南京信息工程大学 Construction method of medical induction data flow classification model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328630A1 (en) * 2015-05-08 2016-11-10 Samsung Electronics Co., Ltd. Object recognition apparatus and method
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328630A1 (en) * 2015-05-08 2016-11-10 Samsung Electronics Co., Ltd. Object recognition apparatus and method
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185543A (en) * 2020-09-04 2021-01-05 南京信息工程大学 Construction method of medical induction data flow classification model

Also Published As

Publication number Publication date
CN110688986B (en) 2023-06-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant