CN111325155A - Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy

Info

Publication number: CN111325155A (granted as CN111325155B)
Application number: CN202010107288.8A
Authority: CN (China)
Prior art keywords: network, residual, model, space, convolution
Other languages: Chinese (zh)
Inventors: 张祖凡, 吕宗明, 甘臣权, 张家波
Applicant and current assignee: Chongqing University of Posts and Telecommunications
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (G Physics; G06 Computing; G06V Image or video recognition or understanding; G06V 20/00 Scenes; G06V 20/40 Scenes in video content)
    • G06F 18/2411: Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines (G06F Electric digital data processing; G06F 18/00 Pattern recognition; G06F 18/24 Classification techniques)
    • G06F 18/253: Fusion techniques of extracted features (G06F 18/25 Fusion techniques)
    • G06N 3/045: Combinations of networks (G06N Computing arrangements based on specific computational models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/084: Backpropagation, e.g. using gradient descent (G06N 3/08 Learning methods)


Abstract

The invention relates to a video action classification method based on a residual 3D CNN and multi-modal feature fusion, and belongs to the fields of computer vision and deep learning. First, the connection pattern of the traditional C3D network is changed to residual connections. A 3D kernel decomposition technique then splits each 3D convolution kernel into a spatial convolution kernel and several parallel temporal kernels of different temporal scales, and an attention model is inserted after the spatial kernel, yielding an A3D residual module; stacking these modules produces a residual network. On this basis a two-stream action recognition model is built: RGB image features are fed to the spatial-stream network and optical-flow features to the temporal-stream network, multi-level convolutional features are extracted from both, and a multi-level feature fusion strategy combines the two networks so that spatial and temporal features complement each other. Finally, the score-fused global video action descriptors are reduced in dimension by PCA, and an SVM classifier completes the action classification.

Description

Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
Technical Field
The invention belongs to the fields of computer vision and deep learning, and relates to a video action recognition method based on a residual 3D CNN and a multi-modal feature fusion strategy.
Background
Today's digital content is essentially multimedia information containing text, audio, images, video and the like. With the prevalence of sensors and the proliferation of mobile devices in particular, communicating information through video has become popular and has begun to form a new medium of exchange between internet users. To mine and understand multimedia information more deeply and intelligently, the research community is encouraging ever more work on advanced video understanding technology, and representation learning is the foundation on which these advances succeed. In recent years convolutional neural networks (CNNs) have risen to prominence, especially in the image domain. A deep CNN uses many different convolution kernels, combined with the local receptive field's information-capture mechanism, to traverse the feature planes of the previous layer and capture local features of different granularities; as the number of layers increases, the extracted salient features are combined and compressed, and different feature layers cover visual perception at different levels. Deep CNNs have therefore been widely accepted in representation learning for their superior ability to learn visual appearance features; for example, the residual network achieved a top-5 error rate of 3.57% on the ImageNet test set, setting a new state of the art. Video frames, however, are time-series images, and the large dynamic changes and processing complexity of time-series images make it difficult for a model to learn a strong and universal spatio-temporal representation.
At present the main approach is to extend the CNN convolution kernel from 2D to 3D and train an entirely new 3D CNN. Adding a temporal dimension to the 2D CNN lets the network extract not only the visual appearance features present in each video image but also the dynamic information between consecutive frames. However, while 3D convolution kernels improve model performance, the expensive computation cost of network training becomes a problem in its own right. Taking the widely adopted 11-layer 3D CNN, the C3D network, as an example, the model size reaches 321 MB, and since parameter counts grow rapidly as kernels are enlarged, studying effective substitutes for the 3D convolution kernel is imperative. Moreover, in current two-stream action recognition models the spatial-stream and temporal-stream networks lack interaction before the final decision fusion, the representational capacity accumulated across the many network layers is not fully exploited, and there is relatively little research on how to realize the complementation of spatial and temporal features by fusing multi-level two-stream features. Addressing the defects that the C3D model's parameters are difficult to train and that its shallow network has limited representational capacity, so as to improve the capability and efficiency of 3D convolutional neural network models in processing video actions and to realize a sufficient, effective fusion and complementation of the two streams, is therefore a very important task.
Disclosure of Invention
In view of the above, the present invention provides a video action classification method based on a residual 3D CNN and multi-modal feature fusion.
In order to achieve the purpose, the invention provides the following technical scheme:
a video motion classification method based on residual difference type 3D CNN and multi-mode feature fusion comprises the following steps:
s1: based on a traditional convolution 3D Neural network (C3D), the connection mode of each convolution module is changed into residual error connection, and identity mapping (index mapping) is introduced;
s2: in a residual module, decomposing an original 3D convolution kernel into a space kernel and a plurality of parallel multi-scale time kernels (MTTL) by using A3D kernel decomposition technology to reduce model parameters, and then embedding an attention model (CBAM) to obtain a brand-new residual module (A3D block);
s3: the input and output settings of each module are adjusted by stacking an A3D block and a pooling layer, and the final building of an A3D residual network is completed;
s4: building a space-time double-flow identification model by using a designed A3D convolution residual neural network model, and respectively taking two modes of an RGB video image and an optical flow image as network input;
s5: by jointly utilizing a multi-stage feature fusion and decision fusion method, firstly fusing different layers of features in a time network and a space network on a feature level, and then weighing class fractional vectors of a plurality of softmax classifiers by a decision-stage weight fusion strategy to realize fractional decision fusion;
s6: and then, performing dimensionality reduction and decorrelation on the fused feature descriptors by using a Principal Component Analysis (PCA) dimensionality reduction algorithm, and finally completing classification and identification on video actions through a multi-classification SVM classifier.
Further, changing the sequential direct connection between the feature modules of the original C3D into residual connections in step S1 specifically comprises:
taking the original input of a feature module, x_{n-1} (i.e. the identity mapping), plus the module's output as the new output y_n, expressed as y_n = R(x_{n-1}, W) + x_{n-1}, where W denotes the trainable parameters in the residual module; the residual mapping R combines the original input x_{n-1} to fit the variable residual values during network training, and the term + x_{n-1} denotes the shortcut connection, so that front-layer information is not easily lost when propagated to the deeper layers of the network and vanishing and exploding gradients are avoided.
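For illustration, this residual wiring can be sketched in PyTorch as follows; the channel count, kernel size and input shape are assumptions for the example, not the patent's exact configuration.

    import torch
    import torch.nn as nn

    class ResidualWrapper(nn.Module):
        """Wraps a convolution module R so that y_n = R(x_{n-1}, W) + x_{n-1}."""
        def __init__(self, residual_branch: nn.Module):
            super().__init__()
            self.residual_branch = residual_branch  # residual mapping R with trainable W

        def forward(self, x):
            # Shortcut connection: the identity mapping is added to the residual
            # branch output, so front-layer information reaches deeper layers intact.
            return self.residual_branch(x) + x

    # Usage: wrap a 3D convolution whose input and output shapes match.
    block = ResidualWrapper(nn.Conv3d(64, 64, kernel_size=3, padding=1))
    y = block(torch.randn(1, 64, 8, 28, 28))  # (N, C, T, H, W)
    print(y.shape)  # torch.Size([1, 64, 8, 28, 28])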
Further, the 3D kernel decomposition described in step S2 includes:
the method comprises the steps of decomposing a3 × 3 × 3 convolution kernel along a space dimension and a time dimension by utilizing a3D kernel decomposition technology to obtain a space convolution kernel of 1 × 3 × 3 and a time convolution kernel of 3 × 1 × 1, reducing model parameters, and simultaneously enriching time kernel scales in order to solve the defect that time grabbing scales are single when a model processes time sequence frame image characteristic information, merging time kernels of 1 × 1 × 1 and 2 × 1 × 1 with different scales, and designing a multi-scale time transformation layer (MTTL) to improve the extraction capability of the model on multi-granularity time information in the time domain.
Further, in step S2 an attention module CBAM is introduced into the residual module; the CBAM is divided into a channel attention module (CAM) and a spatial attention module (SAM), wherein:
in the channel attention model, the input feature F ∈ R^{C×W×H} (where C, W and H denote the number of feature-plane channels, the width and the height respectively) is first compressed along the spatial dimensions by max pooling (maxpool) and average pooling (avgpool); a multi-layer perceptron (MLP) is then used to prepare the channel weights, the two results are summed and passed through a relu activation layer, and the outcome is remapped onto each feature channel of the input, realizing a rational allocation of the channel attention scores. The computation is expressed as M_c = relu{MLP(maxpool(F)) + MLP(avgpool(F))}, where M_c is the output of the CAM, i.e. the channel-weighted saliency features;
in the spatial attention model, max pooling (maxpool) and average pooling (avgpool) likewise compress M_c; the two feature descriptors are concatenated to obtain a two-channel feature carrying channel saliency, a convolution operation Conv computes Conv[maxpool(F), avgpool(F)] to obtain the spatial weights, and the normalized spatial weights are added to M_c to obtain the spatially salient features. Because the CAM and the SAM attend to complementary aspects, the CBAM screens feature spatial information in all dimensions. In the residual module, the CBAM directly receives the output of the spatial kernel as its input, endowing the model with an effective feature-screening mechanism.
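A sketch of the described CBAM, written directly from the equations above, might look like the following; the relu (where the original CBAM paper uses a sigmoid), the 7×7 spatial convolution, the reduction ratio, the extension of the pooling over the temporal axis (the text writes F ∈ R^{C×W×H}), and the literal additive combination at the end are taken from or assumed around the text.

    import torch
    import torch.nn as nn

    class CBAMSketch(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )
            self.conv = nn.Conv3d(2, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))

        def forward(self, f):                      # f: (N, C, T, H, W)
            n, c = f.shape[:2]
            # Channel attention: M_c = relu(MLP(maxpool(F)) + MLP(avgpool(F)))
            mx = f.amax(dim=(2, 3, 4))
            av = f.mean(dim=(2, 3, 4))
            m_c = torch.relu(self.mlp(mx) + self.mlp(av)).view(n, c, 1, 1, 1)
            fc = f * m_c                           # channel-weighted saliency features
            # Spatial attention: pool the channel dimension away, concatenate the
            # two descriptors, convolve to spatial weights, then normalize.
            desc = torch.cat([fc.amax(dim=1, keepdim=True),
                              fc.mean(dim=1, keepdim=True)], dim=1)
            m_s = torch.sigmoid(self.conv(desc))   # sigmoid as the normalization step
            # Literal reading of the text: the normalized spatial weight is ADDED to
            # the channel-weighted features; the original CBAM multiplies instead.
            return fc + m_s

    print(CBAMSketch(64)(torch.randn(2, 64, 8, 28, 28)).shape)  # (2, 64, 8, 28, 28)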
Further, the two-stream recognition model in step S4 is built as follows:
the A3D convolutional residual neural network is used as the base model of the two-stream network, and RGB image features and the corresponding optical-flow features are used as the inputs of the spatial-stream and temporal-stream networks respectively. The optical-flow features are obtained with a spatial pyramid network (SpyNet) that is connected directly into the two-stream network; through back-propagation of the gradient it participates in training together with the temporal-stream and spatial-stream networks, fine-tuning its parameters. Unlike methods that extract optical-flow information with hand-crafted procedures, optical flow computed by a learned network is more flexible for characterizing action classes in real scenes.
Further, the multi-level feature fusion and decision fusion method in step S5 specifically comprises:
deriving multi-level complementary features f_i^*, f_i from different feature layers of the A3D convolutional residual neural network, including the A3D_2a, A3D_3a, A3D_5a and softmax layers, where f_i^* and f_i denote the multi-level features of the temporal-stream and spatial-stream networks respectively. The derived features are fused by weighted summation to balance the contributions of the two streams, i.e. F_i = W_i[f_i, f_i^*] is computed, where F_i and W_i are the output of the i-th layer's feature fusion and the corresponding weight-fusion parameter matrix, written as (α_i, β_i). The weighted fused features then pass through a 1×1×1 convolution layer and a max-pooling layer; after softmax, the fused features of each layer yield a decision score, and the per-layer decision scores undergo score-level weight fusion once more to obtain a feature descriptor with strong representational power.
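One fusion level, with its per-level classifier, might be sketched as follows; the learnable scalars alpha and beta, the classifier head, and the fixed score-fusion weights are assumptions, since the text only specifies W_i = (α_i, β_i), the 1×1×1 convolution, max pooling and softmax.

    import torch
    import torch.nn as nn

    class LevelFusion(nn.Module):
        """F_i = alpha_i * f_i + beta_i * f_i_star, then a 1x1x1 convolution, max
        pooling and a softmax classifier producing this level's decision score."""
        def __init__(self, channels, num_classes):
            super().__init__()
            self.alpha = nn.Parameter(torch.tensor(0.5))   # spatial-stream weight
            self.beta = nn.Parameter(torch.tensor(0.5))    # temporal-stream weight
            self.conv = nn.Conv3d(channels, channels, kernel_size=1)
            self.head = nn.Linear(channels, num_classes)

        def forward(self, f_spatial, f_temporal):
            fused = self.alpha * f_spatial + self.beta * f_temporal
            pooled = torch.amax(self.conv(fused), dim=(2, 3, 4))  # global max pooling
            return torch.softmax(self.head(pooled), dim=1)        # per-level score

    # Score-level fusion of the per-level decisions (fixed weights, an assumption):
    levels = [LevelFusion(64, 101), LevelFusion(64, 101)]
    feats = [(torch.randn(2, 64, 8, 28, 28), torch.randn(2, 64, 8, 28, 28))] * 2
    scores = sum(0.5 * lvl(fs, ft) for lvl, (fs, ft) in zip(levels, feats))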
The beneficial effects of the invention are as follows: compared with the original C3D model, the proposed spatio-temporal two-stream A3D convolutional residual neural network achieves higher recognition efficiency with fewer model parameters; at the same time, the deeper network model further improves feature representation and can further raise action classification accuracy.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of a video motion classification method based on residual difference type 3D CNN and multi-modal feature fusion according to the present invention;
FIG. 2 is a structural diagram of the C3D model;
FIG. 3 is a schematic diagram of 2D convolution and 3D convolution operations;
FIG. 4 is a CBAM structural diagram;
FIG. 5 is a schematic diagram of the A3D residual module;
FIG. 6 is a diagram of the A3D convolution residual neural network structure;
FIG. 7 is a diagram of the overall two-stream action recognition model.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for the purpose of illustrating the invention only and are not intended to limit it; to better illustrate the embodiments, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; and it will be understood by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
As shown in FIG. 1, the present invention provides a video action classification method based on a residual 3D CNN and multi-modal feature fusion. First, the invention extracts the first 20 frames of each video and crops all input frames to 112×112 as the network input; the batch size used is 20 videos. The C3D convolutional neural network, an early classic 3D CNN model, is a shallow 11-layer model comprising 5 convolution modules and 2 fully connected layers; the concrete C3D structure is shown in FIG. 2. During training, the model propagates gradients and updates parameters through a single connection pattern in which each layer's output is passed to the next layer in sequence; its large model parameters and insufficient representational capacity are what the present invention sets out to improve.
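This input pipeline can be sketched as below; load_frames is a hypothetical decoder returning a (T, H, W, 3) uint8 array per video, and the naive top-left crop stands in for whatever cropping the authors actually used.

    import torch

    def make_batch(video_paths, load_frames, num_frames=20, size=112):
        """First 20 frames per video, 112x112 crops -> tensor (N, 3, 20, 112, 112)."""
        clips = []
        for path in video_paths:
            frames = load_frames(path)[:num_frames]        # first 20 frames
            t = torch.as_tensor(frames).float() / 255.0    # (T, H, W, 3) in [0, 1]
            t = t[:, :size, :size, :]                      # naive top-left crop
            clips.append(t.permute(3, 0, 1, 2))            # -> (3, T, H, W)
        return torch.stack(clips)                          # batch of 20 in this setup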
A3D convolutional residual neural network construction process:
(1) Establishing residual connections: the invention changes the sequential direct connection of all the feature modules in the original C3D into residual connections. The specific operation is to take the original input of a feature module, x_{n-1} (i.e. the identity mapping), plus the module's output as the new output y_n; the specific flow is expressed as y_n = R(x_{n-1}, W) + x_{n-1}, where W denotes the trainable parameters in the residual module, the residual mapping R combines the original input x_{n-1} to fit the variable residual values during network training, and the term + x_{n-1} denotes the shortcut connection, so that front-layer information is not easily lost when transmitted to the deeper layers of the network and vanishing and exploding gradients are avoided.
(2) 3D kernel decomposition: the output of a 2D convolution lacks temporal information, whereas a 3D convolution can capture temporal and spatial information at the same time; the specific operations are contrasted in FIG. 3. However, the heavy training parameters of 3D convolution reduce network training efficiency, which the kernel decomposition and the multi-scale temporal transformation layer described above are designed to remedy.
(3) Attention module introduction: following the above process, the invention introduces the attention model CBAM into the residual module; it is mainly divided into a channel attention module (CAM) and a spatial attention module (SAM), whose structure can be seen in FIG. 4. ① In the channel attention model, the input feature F ∈ R^{C×W×H} (where C, W and H denote the number of feature-plane channels, the width and the height respectively) is first compressed along the spatial dimensions by max pooling (maxpool) and average pooling (avgpool); a multi-layer perceptron (MLP) is then used to prepare the channel weights, the two results are summed and passed through a relu activation layer, and the outcome is remapped onto each feature channel of the input, realizing a rational allocation of the channel attention scores. The computation is expressed as M_c = relu{MLP(maxpool(F)) + MLP(avgpool(F))}, where M_c is the output of the CAM, i.e. the channel-weighted saliency features. ② In the spatial attention model, the same two pooling operations compress M_c; the two feature descriptors are concatenated to obtain a two-channel feature carrying channel saliency, a convolution operation Conv computes Conv[maxpool(F), avgpool(F)] to obtain the spatial weights, and the normalized spatial weights are added to M_c to obtain the spatially salient features. Because the CAM and the SAM attend to complementary aspects, the CBAM screens feature spatial information in all dimensions. In the residual module, the CBAM directly receives the output of the spatial kernel as its input, endowing the model with an effective feature-screening mechanism.
(4) A3D residual module: on top of the residual module, the kernel decomposition technique is used to reduce the model parameters, the designed MTTL enriches the temporal feature granularity the model captures, and the introduced attention model improves the model's robustness; combining these advantages yields the A3D residual module, whose detailed structure is shown in FIG. 5.
(5) Building the A3D convolutional residual neural network: the invention replaces the convolution modules at the corresponding positions of the original C3D with A3D modules and adjusts the corresponding dimension outputs, so as to keep the input and output dimensions of each convolution module consistent with C3D. Finally, stacking the A3D modules yields a convolutional neural network structure with more layers, namely the A3D convolutional residual neural network shown in FIG. 6.
The two-stream recognition model is built as follows:
(1) Deriving multi-modal features: the invention uses the A3D convolutional residual neural network as the base model of the two-stream network, and takes RGB image features and the corresponding optical-flow features as the inputs of the spatial-stream and temporal-stream networks respectively. The optical-flow features are obtained with a spatial pyramid network (SpyNet) that is connected directly into the two-stream network; through back-propagation of the gradient it participates in training together with the temporal-stream and spatial-stream networks, fine-tuning its parameters. Unlike methods that extract optical-flow information with hand-crafted procedures, optical flow computed by a learned network is more flexible for characterizing action classes in real scenes.
(2) Multi-stage feature fusion and decision method: in the constructed two-stream recognition network, the invention derives multi-level complementary features f_i^*, f_i from different feature layers of the A3D convolutional residual neural network (the A3D_2a, A3D_3a, A3D_5a and softmax layers), where f_i^* and f_i denote the multi-level features of the temporal-stream and spatial-stream networks respectively. The derived features are then fused by weighted summation to balance the contributions of the two streams, i.e. F_i = W_i[f_i, f_i^*] is computed, where F_i and W_i are the output of the i-th layer's feature fusion and the corresponding weight-fusion parameter matrix (detailed as α_i, β_i). The weighted fused features then pass through a 1×1×1 convolution layer and a max-pooling layer, and after softmax the fused features of each layer yield a decision score; similarly, the per-layer decision scores undergo score-level weight fusion once more to prepare a feature descriptor with strong representational power. Finally, PCA decorrelates the feature vectors and removes redundancy, and the resulting effective features enter a multi-class SVM classifier to complete the final recognition task. The overall two-stream action recognition model is shown in FIG. 7.
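The closing PCA-plus-SVM stage can be sketched with scikit-learn; the descriptor dimensionality, the number of retained components, the number of classes and the kernel choice are illustrative assumptions, and the random arrays stand in for real fused descriptors.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    X_train = np.random.randn(500, 4096)    # fused descriptors, one row per video
    y_train = np.random.randint(0, 101, 500)

    pca = PCA(n_components=256, whiten=True)   # whitening decorrelates the features
    X_red = pca.fit_transform(X_train)

    svm = SVC(kernel="linear", decision_function_shape="ovr")  # multi-class SVM
    svm.fit(X_red, y_train)
    pred = svm.predict(pca.transform(np.random.randn(1, 4096)))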
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (6)

1. A video action classification method based on a residual 3D CNN and multi-modal feature fusion, characterized by comprising the following steps:
S1: based on the traditional convolutional 3D neural network C3D, changing the connection mode of each convolution module to residual connection and introducing identity mapping;
S2: in the residual module, decomposing the original 3D convolution kernel with a 3D kernel decomposition technique into a spatial kernel and several parallel multi-scale temporal kernels (MTTL) to reduce model parameters, and then embedding the attention model CBAM to obtain a brand-new residual module, the A3D block;
S3: adjusting the input and output settings of each module by stacking A3D blocks and pooling layers, completing the construction of the A3D residual network;
S4: building a spatio-temporal two-stream recognition model from the designed A3D convolutional residual neural network, taking the two modalities of RGB video images and optical-flow images as the network inputs;
S5: first fusing features from different layers of the temporal and spatial networks at the feature level, then weighing the class-score vectors of the several softmax classifiers with a decision-level weight-fusion strategy to realize score-level decision fusion;
S6: finally, completing the classification and recognition of video actions with a multi-class SVM classifier.
2. The method according to claim 1, characterized in that changing the sequential direct connection of the feature modules in the original C3D into residual connections in step S1 specifically comprises:
taking the original input of a feature module, x_{n-1}, i.e. the identity mapping, plus the module's output as the new output y_n, expressed as y_n = R(x_{n-1}, W) + x_{n-1}, where W denotes the trainable parameters in the residual module; the residual mapping R combines the original input x_{n-1} to fit the variable residual values during network training, and R + x_{n-1} denotes the shortcut connection, so that front-layer information is not easily lost when propagated to the deeper layers of the network and vanishing and exploding gradients are avoided.
3. The method according to claim 1, characterized in that the 3D kernel decomposition in step S2 comprises:
decomposing the 3×3×3 convolution kernel along the spatial and temporal dimensions with the 3D kernel decomposition technique into a 1×3×3 spatial convolution kernel and a 3×1×1 temporal convolution kernel to reduce model parameters, while merging temporal kernels of the further scales 1×1×1 and 2×1×1, and designing a multi-scale temporal transformation layer (MTTL) to improve the extraction of multi-granularity temporal information in the time domain.
4. The method according to claim 1, characterized in that step S2 introduces an attention module CBAM into the residual module, the CBAM being divided into a channel attention module CAM and a spatial attention module SAM, wherein:
in the channel attention model, the input feature F ∈ R^{C×W×H}, where C, W and H denote the number of feature-plane channels, the width and the height respectively, is compressed along the spatial dimensions by max pooling and average pooling; a multi-layer perceptron (MLP) is then used to prepare the channel weights, the two results are summed, passed through a relu activation layer, and remapped onto each feature channel of the input, realizing a rational allocation of the channel attention scores; the computation is expressed as M_c = relu{MLP(maxpool(F)) + MLP(avgpool(F))}, M_c being the output of the CAM, i.e. the channel-weighted saliency features;
in the spatial attention model, max pooling and average pooling likewise compress M_c; the two feature descriptors are concatenated to obtain a two-channel feature carrying channel saliency, a convolution operation Conv computes Conv[maxpool(F), avgpool(F)] to obtain the spatial weights, and the normalized spatial weights are added to M_c to obtain the spatially salient features; because the CAM and the SAM attend to complementary aspects, the CBAM screens feature spatial information in all dimensions; in the residual module, the CBAM directly receives the output of the spatial kernel as its input, endowing the model with an effective feature-screening mechanism.
5. The method according to claim 1, characterized in that the two-stream recognition model in step S4 is built as follows:
the A3D convolutional residual neural network is used as the base model of the two-stream network, and RGB image features and the corresponding optical-flow features are used as the inputs of the spatial-stream and temporal-stream networks respectively; the optical-flow features are obtained with the spatial pyramid network SpyNet, which is connected directly into the two-stream network and, through back-propagation of the gradient, is trained together with the temporal-stream and spatial-stream networks to fine-tune its parameters.
6. The method according to claim 1, characterized in that the multi-level feature fusion and decision fusion in step S5 specifically comprises:
deriving multi-level complementary features f_i^*, f_i from different feature layers of the A3D convolutional residual neural network, including the A3D_2a, A3D_3a, A3D_5a and softmax layers, where f_i^* and f_i denote the multi-level features of the temporal-stream and spatial-stream networks respectively; fusing the corresponding temporal-stream and spatial-stream features by weighted summation to balance the contributions of the two streams, i.e. computing F_i = W_i[f_i, f_i^*], where F_i and W_i are the output of the i-th layer's feature fusion and the corresponding weight-fusion parameter matrix, written as (α_i, β_i); passing the weighted fused features through a 1×1×1 convolution layer and a max-pooling layer, obtaining the decision score of each layer's fused features after softmax, and fusing the per-layer decision scores again with score-level weights to obtain a feature descriptor with strong representational power.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010107288.8A CN111325155B (en) 2020-02-21 2020-02-21 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy

Publications (2)

Publication Number Publication Date
CN111325155A 2020-06-23
CN111325155B 2022-09-23

Family

ID=71171398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010107288.8A Active CN111325155B (en) 2020-02-21 2020-02-21 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy

Country Status (1)

Country Link
CN (1) CN111325155B (en)

Also Published As

Publication number Publication date
CN111325155B (en) 2022-09-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant