CN113128395B - Video action recognition method and system based on hybrid convolution multistage feature fusion model - Google Patents

Video action recognition method and system based on hybrid convolution multistage feature fusion model

Info

Publication number
CN113128395B
CN113128395B (application CN202110413461.1A)
Authority
CN
China
Prior art keywords
feature
convolution
time
level
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110413461.1A
Other languages
Chinese (zh)
Other versions
CN113128395A (en
Inventor
张祖凡
彭月
甘臣权
张家波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110413461.1A priority Critical patent/CN113128395B/en
Publication of CN113128395A publication Critical patent/CN113128395A/en
Application granted granted Critical
Publication of CN113128395B publication Critical patent/CN113128395B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items, of sport video content
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods
    • G06V 10/40 Extraction of image or video features
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video action recognition method and system based on a hybrid convolution multistage feature fusion model, belonging to the technical field of computer vision. A hybrid convolution module is constructed from two-dimensional convolution and separable three-dimensional convolution; a channel shift operation is performed on each input feature along the time dimension to build a time shift module that promotes information flow between adjacent frames and compensates for the limited ability of two-dimensional convolution to capture dynamic features; multi-level complementary features are derived from different convolutional layers of a backbone network and subjected to spatial and temporal modulation, so that the features of each level carry consistent semantic information in the spatial dimension and variable visual rhythm cues in the time dimension; a bottom-up feature stream and a top-down feature stream that supplement each other are constructed and processed in parallel to realize multi-level feature fusion; and the model is trained with a two-stage training strategy.

Description

Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
Technical Field
The invention belongs to the technical field of computer vision, and relates to a video motion recognition method and system based on a hybrid convolution multistage feature fusion model.
Background
The rapid development of artificial intelligence research has pushed human-computer interaction technology into people's daily life, and the human action recognition research derived from it has attracted wide attention. In video-based action recognition, traditional methods rely mainly on hand-crafted feature design and suffer from severe domain limitations. To overcome these deficiencies and obtain more general feature representations, convolutional neural networks (CNNs), built on the mechanism of biological visual perception, have been widely applied to action recognition.
The action recognition performance of a model is closely tied to how well it can represent a video. Since video data is the sequential extension of two-dimensional plane images into a three-dimensional space-time volume, video-based feature extraction divides into two equally important and complementary parts: spatial appearance representation and temporal dynamic modeling. A 2D CNN model can effectively capture the spatial neighborhood correlations of input video frames, but its limited dynamic feature extraction capability leaves the temporal changes of body motion largely unattended, which imposes many limitations. A 3D CNN, by virtue of its internal structure, fuses spatio-temporal features directly on the raw input and therefore has an inherent advantage, but it requires expensive computation and memory overhead. To find a better compromise between computation speed and recognition performance, constructing a high-performance action recognition model that combines the low complexity of two-dimensional convolution with the effectiveness of three-dimensional convolution is clearly a promising research direction.
In addition, when limb movements with highly similar visual appearance are judged only by the dynamic changes of the observed target in the spatio-temporal dimensions, similar action categories are easily confused during classification. To increase the distinctiveness between similar action categories, the model should give equal attention to visual rhythm. Predefining visual rhythm variation at the input stage can clearly improve recognition, but training the parameters of multiple network branches greatly increases model complexity. Researchers have confirmed that, as the number of network layers increases, the convolutional features derived from layers of different depths already contain information about visual rhythm changes. How to model the dynamic changes of visual rhythm at the feature level is therefore another important research direction.
Disclosure of Invention
In view of the above, the present invention provides a video motion recognition method and system based on a hybrid convolution multi-level feature fusion model.
In order to achieve the purpose, the invention provides the following technical scheme:
in one aspect, the invention provides a video motion recognition method based on a hybrid convolution multistage feature fusion model, which comprises the following steps:
step one: constructing a hybrid convolution module by adopting two-dimensional convolution and separable three-dimensional convolution;
step two: performing a channel shift operation on each input feature along the time dimension to construct a time shift module, promoting information flow between adjacent frames and compensating for the two-dimensional convolution operation's weakness in capturing dynamic features;
step three: deriving multi-level complementary features from different convolutional layers of a backbone network, and performing spatial modulation and time modulation on the multi-level complementary features, so that each level of features has consistent semantic information in a spatial dimension and has changeable visual rhythm clues in a time dimension;
step four: constructing a feature stream from bottom to top and a feature stream from top to bottom to supplement each other and perform parallel processing on the feature streams to realize multi-level feature fusion;
step five: and carrying out model training by using a two-stage training strategy.
Further, the hybrid convolution module construction process in the first step includes: following the basic architecture of a three-dimensional residual network, extracting low-level spatial features in a residual network bottom structure by adopting a two-dimensional convolution operation, and extracting high-level space-time features in a network top structure by adopting a separable three-dimensional convolution operation, so as to build a hybrid convolution network, wherein the separable three-dimensional convolution operation refers to decomposing a three-dimensional convolution with a convolution kernel size of t × h × w along a space-time dimension, so as to obtain a time convolution kernel with a size of t × 1 × 1 and a space convolution kernel with a size of 1 × h × w, wherein t, h, and w respectively represent the time dimension, height and width of the convolution kernel.
Further, the second step specifically includes:
first, define $F_t \in \mathbb{R}^{H\times W\times C}$ as the feature tensor at time t, where H, W and C respectively denote the height, width and channel dimension of the input feature; the time shift module shifts part of the channel information of the input feature at each moment along the time dimension, so that the spatial semantic information of adjacent frames is fused into the current frame and information interaction between adjacent frames is promoted; the mathematical expression of the time shift module is:

$$\hat{F}_t = \left[\,F_{t-1}^{+1},\; F_t^{0},\; F_{t+1}^{-1}\,\right],$$

where $F_{t-1}^{+1}$ denotes the channel information of $F_{t-1}$ moved forward along the time dimension to time t, $F_{t+1}^{-1}$ denotes the channel information of $F_{t+1}$ moved backward along the time dimension to time t, and $F_t^{0}$ denotes the channel information of $F_t$ that does not participate in the time shift;

in the time shift module the shift operation occurs only in the residual mapping branch, so that the original spatial semantic information can still be fully transferred into subsequent network layers; only a small fraction of the channels is moved to model the temporal flow, with the unidirectional channel movement ratio set to 1/8.
Further, the third step specifically includes the following steps:
firstly, the input of the multi-level feature fusion module is defined by collecting the convolutional-layer features of M different depths, expressed as:

$$F = \{F_1, F_2, \ldots, F_M\},$$

where $F_i \in \mathbb{R}^{T_i \times H_i \times W_i \times C_i}$ denotes the convolution feature derived from the network layer of depth level i, $i \in (1, M)$;

the spatio-temporal modulation process is as follows:

1) spatial modulation: for the network top-level feature $F_{top} \in \mathbb{R}^{T\times H\times W\times C}$, spatial modulation is equivalent to an identity mapping and the original size is retained; for the convolution features $F_i$ of the remaining network depths, a two-dimensional convolution operation with a specifically designed stride is used to reduce the spatial size of each level's feature so that it matches the network top-level feature, namely:

$$\tilde{F}_i = M_S(F_i),$$

where $M_S(\cdot)$ denotes the spatial modulation operation;

2) temporal modulation: the features updated by the spatial modulation operation are first re-expressed as $\tilde{F} = \{\tilde{F}_1, \tilde{F}_2, \ldots, \tilde{F}_M\}$ and then downsampled along the time dimension, where the downsampling factor is determined by a set of carefully designed hyper-parameters $\{\alpha_1, \alpha_2, \ldots, \alpha_M\}$, with $\alpha_i$ denoting the downsampling factor corresponding to the feature of depth level i; a downsampling operation is also performed on the channel dimension, with a downsampling factor determined by the number n of network layers participating in feature derivation, namely:

$$F_i' = M_T(\tilde{F}_i),$$

where $M_T(\cdot)$ denotes the temporal modulation operation.
Further, the feature fusion of the fourth step specifically includes:
define $F' = \{F_1', F_2', \ldots, F_M'\}$ as the convolution features after spatio-temporal modulation; for the features of different depth levels, feature aggregation is carried out with a bottom-up feature flow and a top-down feature flow;

for the bottom-up feature flow, starting from the top-level feature, the feature $F_i'$ of one level supplements the next-level feature $F_{i+1}'$ through element-level addition and a downsampling operation applied in sequence, namely:

$$F_{i+1}'' = F_{i+1}' \oplus g(F_i''),$$

where $F_{i+1}''$ denotes the feature after bottom-up flow aggregation, $\oplus$ denotes element-level addition, $g(\cdot)$ denotes a downsampling operation that keeps the dimensions of the features of different levels from conflicting during aggregation, and $T_i/T_{i-1}$ is the sampling factor;

for the top-down feature flow, starting from the bottom-level feature, the next-level feature $F_{i+1}'$ sequentially enriches the spatial semantic information of the upper-level feature $F_i'$, namely:

$$\hat{F}_i = F_i' \oplus f(\hat{F}_{i+1}),$$

where $\hat{F}_i$ is the feature after top-down flow aggregation, $f(\cdot)$ denotes an upsampling operation, and $T_i/T_{i-1}$ is the sampling factor;

the two feature flows are then fused, that is, the two parallel feature flows are processed simultaneously to generate the final classification discrimination feature, and the classification prediction produced by the multi-level fused feature is obtained with the Softmax function.
Further, the two-stage training strategy in the fifth step specifically comprises: in the first stage, firstly, training is carried out on a backbone network, then parameters of a backbone network part are fixed, and a subsequent multi-stage feature fusion module is trained independently; and in the second stage, initializing a multi-stage feature fusion module by using the weight learned in the first stage, and performing joint training on the whole model through an end-to-end training paradigm.
On the other hand, the invention provides a video motion recognition system based on hybrid convolution and multi-level feature fusion, which comprises a hybrid convolution module, a time shift module, a multi-level feature fusion module and a two-stage training strategy module;
the hybrid convolution module follows the basic architecture of a three-dimensional residual network: low-level spatial features are extracted by adopting a two-dimensional convolution operation in the bottom layer structure of the residual network, high-level space-time features are extracted by adopting a separable three-dimensional convolution operation in the top layer structure of the network, and a hybrid convolution network is thereby built, wherein the separable three-dimensional convolution operation means that a three-dimensional convolution with a convolution kernel size of t × h × w is decomposed along the space-time dimensions, so that a time convolution kernel with a size of t × 1 × 1 and a space convolution kernel with a size of 1 × h × w are obtained, and t, h and w respectively represent the time dimension, height and width of the convolution kernel;
the time shift module is used for shifting part of the channel information of the input feature at each time step along the time dimension, to compensate for the two-dimensional convolution's lack of dynamic feature extraction capability;
the multi-level feature fusion module is used for deriving multi-level complementary features from convolutional layers of different depths of the backbone network, then making all features the same shape in the spatial dimension with a spatial modulation operation, capturing the dynamic changes of the visual rhythm of an action instance with a temporal modulation operation, and finally producing high-quality classification discrimination features through feature fusion;
the two-stage training strategy module is used for performing model training in stages, making maximal use of the limited video data.
The invention has the beneficial effects that: the hybrid convolution module is utilized to combine the advantages of low complexity of two-dimensional convolution operation and high efficiency of separable three-dimensional convolution operation, so that the model complexity is obviously reduced, and a better compromise is sought between the calculation speed and the recognition performance; the dynamic change of the limb movement on a short-term time scale is simulated by using a low-cost time shift module, so that the dynamic feature extraction capability which is lacked by two-dimensional convolution operation is compensated to a certain extent; the multi-level feature fusion module is used for realizing effective fusion of multi-depth features, and visual rhythm change of an action example is effectively represented under the condition of fully utilizing visual information of each level of feature.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of the steps of a video motion recognition method based on a hybrid convolution multistage feature fusion model according to the present invention;
FIG. 2 is a model diagram of a video motion recognition system based on a hybrid convolution multistage feature fusion model according to the present invention;
FIG. 3 is a schematic diagram of a hybrid convolution module;
FIG. 4 is a schematic diagram of a time shifting module;
FIG. 5 is a schematic diagram of a multi-level feature fusion module.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for the purpose of illustrating the invention only and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
As shown in fig. 1, the present invention provides a video motion recognition method based on hybrid convolution and multi-level feature fusion, which includes the following steps:
the method comprises the following steps: adopting a low-cost two-dimensional convolution operation at the bottom layer of the network and adopting a separable three-dimensional convolution operation at the top layer of the network so as to construct a hybrid convolution module aiming at reducing the redundancy of a full three-dimensional convolution model;
the construction process of the hybrid convolution module is as follows: following the basic architecture of a three-dimensional residual network, apparent detail features on spatial dimensions are extracted by utilizing two-dimensional convolution operation in a residual network bottom layer structure to weaken low-level space-time semantic information, and a three-dimensional convolution structure is reserved in a network top layer structure to extract high-level space-time features, so that more abstract space-time semantic clues are emphasized. In order to further reduce the complexity of model calculation, the three-dimensional convolution involved in the network top-level structure is replaced by separable three-dimensional convolution, specifically, a three-dimensional convolution kernel with the size of t × h × w is decomposed along the space-time dimension, so that a time convolution kernel with the size of t × 1 × 1 and a space convolution kernel with the size of 1 × h × w are obtained, wherein t, h, and w respectively represent the time dimension, height, and width of the convolution kernel.
Step two: performing channel shift operation on each input feature along the time dimension, and constructing a time shift module, thereby promoting information flow between adjacent frames to compensate for dynamic feature extraction capability lacking in two-dimensional convolution operation;
First, define $F_t \in \mathbb{R}^{H\times W\times C}$ as the feature tensor at time t, where H, W and C respectively denote the height, width and channel dimensions of the input feature. The time shift module shifts part of the channel information of the input feature at each moment along the time dimension, so that the spatial semantic information of adjacent frames is fused into the current frame and information interaction between adjacent frames is promoted. The mathematical representation is:

$$\hat{F}_t = \left[\,F_{t-1}^{+1},\; F_t^{0},\; F_{t+1}^{-1}\,\right],$$

where $F_{t-1}^{+1}$ denotes the channel information of $F_{t-1}$ moved forward along the time dimension to time t, $F_{t+1}^{-1}$ denotes the channel information of $F_{t+1}$ moved backward along the time dimension to time t, and $F_t^{0}$ denotes the channel information of $F_t$ that does not participate in the time shift.

In addition, moving channel information over a large proportion of the channels would damage the spatial modeling capability of the model and cause performance degradation. The time shift module therefore moves only a small fraction of the channels to model the temporal flow, and the unidirectional channel movement ratio is typically set to 1/8. To further retain the spatial feature learning capability of the model, the shift operation occurs only in the residual mapping branch, so the original spatial semantic information can still be completely transferred to subsequent network layers.
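The channel shift itself reduces to a few lines of tensor indexing. The sketch below assumes an input laid out as (N, T, C, H, W) and the 1/8 unidirectional shift ratio mentioned above; the residual block around it is a simplified stand-in for the backbone block, not its exact structure.

```python
import torch
import torch.nn as nn

def temporal_shift(x, shift_ratio=1 / 8):
    """Shift a fraction of channels forward and backward along the time axis.
    x: tensor of shape (N, T, C, H, W)."""
    n, t, c, h, w = x.size()
    fold = int(c * shift_ratio)
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # F_{t-1} channels arrive at time t
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # F_{t+1} channels arrive at time t
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # unshifted channels (F_t^0)
    return out

class ShiftResidualBlock(nn.Module):
    """Shift only inside the residual-mapping branch so the identity path
    keeps the original spatial semantics intact."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):                      # x: (N, T, C, H, W)
        n, t, c, h, w = x.size()
        y = temporal_shift(x).reshape(n * t, c, h, w)
        y = torch.relu(self.bn(self.conv(y))).reshape(n, t, c, h, w)
        return x + y                           # identity branch is left unshifted
```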
Step three: deriving multi-level complementary features from different convolutional layers of a backbone network, and performing spatial modulation and time modulation on the multi-level complementary features, so that each level of features has consistent semantic information in a spatial dimension and has changeable visual rhythm clues in a time dimension;
first, the inputs of the multi-level feature fusion module are defined. In order to fully utilize the visual information of each layer of convolution characteristics, convolution layer characteristics of M different depths are collected and expressed as:
F={F1,F2,…FM},
wherein the content of the first and second substances,
Figure BDA0003024904870000064
representing the convolution characteristics, i e (1, M), derived from a certain depth network layer. To ensure efficient fusion of features, space-time modulation is introduced. The detailed introduction process is as follows:
1) and (4) spatial modulation. On the one hand, for the network top-level feature Ftop∈RT×H×W×CThe spatial modulation is equivalent to identity mapping, and the original size is reserved. Convolution features for the remaining network depths, on the other hand
Figure BDA0003024904870000065
Utilizing a two-dimensional convolution operation with a specific step size design to reduce the size of the space dimension of each hierarchy feature so that the space dimension of each hierarchy feature is matched with the network top-level feature, namely:
Figure BDA0003024904870000071
wherein M isS(-) represents a spatial modulation operation.
2) And (5) time modulation. The features updated by the spatial modulation operation are first re-expressed as
Figure BDA0003024904870000072
Then down-sampling it in the time dimension, wherein the down-sampling factor is composed of a set of well-designed hyper-parameters
Figure BDA0003024904870000073
Determination of alphaiRepresenting a down-sampling factor corresponding to a feature at depth level i. In addition, in order to facilitate the aggregation operation performed by the subsequent features, the channel dimension also needs to perform a downsampling operation, and a downsampling factor is determined by the number n of network layers participating in feature derivation. Namely:
Figure BDA0003024904870000074
wherein M isT(. cndot.) denotes a time modulation operation.
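As a rough sketch of the two modulation steps, the module below matches the top-level spatial size with a strided per-frame (1 × 3 × 3) convolution and downsamples time with average pooling by a factor alpha_i; the stride value, alpha_i (assumed to be an integer) and the output channel count standing in for the channel-dimension downsampling are placeholders chosen for the example, not values taken from the patent.

```python
import torch
import torch.nn as nn

class SpatioTemporalModulation(nn.Module):
    """Spatial modulation: strided per-frame convolution so the feature matches
    the top-level spatial size (and the target channel count). Temporal
    modulation: downsampling along the time axis by a factor alpha_i."""
    def __init__(self, in_channels, out_channels, spatial_stride, alpha_i):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3),
                                 stride=(1, spatial_stride, spatial_stride),
                                 padding=(0, 1, 1), bias=False)
        self.temporal = nn.AvgPool3d(kernel_size=(alpha_i, 1, 1),
                                     stride=(alpha_i, 1, 1))

    def forward(self, x):            # x: (N, C_i, T_i, H_i, W_i)
        x = self.spatial(x)          # match top-level H x W, adjust channels
        x = self.temporal(x)         # downsample along the time dimension
        return x
```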
Step four: constructing a feature stream from bottom to top and a feature stream from top to bottom to supplement each other and perform parallel processing on the feature streams to realize multi-level feature fusion;
definition of
Figure BDA0003024904870000075
Representing the convolution characteristics after space-time modulation. Considering that the time receptive field range of the low-level features is small, and the high-level features lack the description of local details, for the features of different depth levels, research is carried out on feature aggregation by using a feature flow from bottom to top and a feature flow from top to bottom, so that the features of different levels supplement each other and complement each other. Next, the characteristic polymerization method employed is described in detail.
For bottom-up feature flow, starting with the top-level feature, the top-level feature F'iSequentially applying element-level addition and downsampling operations to the next-level feature F'i+1And (3) supplementing, namely:
Figure BDA0003024904870000076
wherein, F ″)i+1Showing the characteristics after bottom-up flow addition polymerization,
Figure BDA0003024904870000077
representing element-level addition, g (-) representing a downsampling operation to ensure that dimensions of features of layers do not conflict during aggregation, Ti/Ti-1Is a sampling factor;
for top-down feature flow, starting with the bottom-level feature, the next-level feature F'i+1Sequentially enriching the characteristics F of the upper leveliThe spatial semantic information of' i.e.:
Figure BDA0003024904870000078
wherein, Fi' is a feature after top-down flow aggregation, f (-) denotes an upsampling operation, Ti/Ti-1Is the sampling factor.
In order to execute subsequent classification prediction, the two feature streams need to be fused, namely, the two parallel feature streams are processed simultaneously to generate a final classification discrimination feature, and then a classification prediction result generated by the multi-level fusion feature is obtained by using a Softmax function.
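The two aggregation flows can be sketched as below, assuming all modulated features share the same channel count and spatial size and that g(·) and f(·) are realized as adaptive temporal pooling and trilinear interpolation; the indexing direction, the pooling choices and the final classifier are assumptions made for the example rather than the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def bottom_up(features):
    """features: list of modulated features, shallow to deep, each of shape
    (N, C, T_i, H, W). Each level is supplemented by the aggregated feature of
    the previous level, downsampled in time to match (g)."""
    out = [features[0]]
    for f_next in features[1:]:
        prev = out[-1]
        target = (f_next.size(2), f_next.size(3), f_next.size(4))
        out.append(f_next + F.adaptive_avg_pool3d(prev, target))  # element-level addition
    return out

def top_down(features):
    """Each level is enriched by the aggregated deeper level, upsampled in time (f)."""
    out = [features[-1]]
    for f_prev in reversed(features[:-1]):
        up = F.interpolate(out[0], size=f_prev.shape[2:], mode='trilinear',
                           align_corners=False)
        out.insert(0, f_prev + up)
    return out

def fuse_and_classify(features, classifier):
    """Process the two flows in parallel, merge them, and predict with Softmax.
    classifier is assumed to be e.g. nn.Linear(C, num_classes)."""
    bu = bottom_up(features)[-1]               # deepest output of the bottom-up flow
    td = top_down(features)[0]                 # shallowest output of the top-down flow
    pooled = (F.adaptive_avg_pool3d(bu, 1).flatten(1)
              + F.adaptive_avg_pool3d(td, 1).flatten(1))
    return torch.softmax(classifier(pooled), dim=1)
```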
Step five: performing model training by using a two-stage training strategy to further improve the performance of the model;
the model mainly comprises two parts, namely a backbone network formed by a hybrid convolution module and a time shift module, and a multi-stage feature fusion module. The training mode adopted by the model is as follows: in the first stage, firstly, training a backbone network, then fixing backbone network parameters, and training a multilevel feature fusion module; in the second stage, the multi-stage feature fusion module is initialized by utilizing the pre-training weight in the first stage, and then the whole model is trained through an end-to-end training paradigm. Model training is performed through a two-stage training strategy, limited training set data can be utilized to the maximum extent, and the recognition effect of the model is further improved.
Fig. 2 is a system model diagram of the present invention, which is described below with reference to the accompanying drawings, and includes the following modules:
a first module: extracting low-level spatial features by adopting low-cost two-dimensional convolution operation in a network bottom layer structure, preparing high-level space-time features by adopting separable three-dimensional convolution operation in a network top layer structure, and combining the advantages of low complexity of the two-dimensional convolution operation and high efficiency of the separable three-dimensional convolution operation so as to construct a mixed convolution module;
and a second module: by introducing a time shift module, shifting partial channel information of input features at all times along a time dimension to compensate the defect that two-dimensional convolution lacks dynamic feature extraction capability;
and a third module: the method comprises the steps that a multilevel feature fusion module is used for deriving multilevel complementary features from convolutional layers with different depths of a backbone network, then spatial modulation operation is used for enabling all the features to have the same shape in spatial dimension, time modulation operation is used for capturing dynamic change conditions of visual rhythm of an action example, and finally high-quality classification distinguishing features are prepared through effective feature fusion;
and a module IV: model training is performed in stages using a two-stage training strategy, thereby maximizing the use of limited video data and further improving model performance.
Fig. 3 is a schematic diagram of a hybrid convolution module, illustrating the arrangement of hybrid convolution and the corresponding convolution kernel size in detail.
Fig. 4 is a schematic diagram of a time shift module, where after the input features are subjected to channel shift operation in the time dimension, the spatial semantic information of adjacent frames is fused with the current frame, so as to promote the flow of time information between adjacent frames.
Fig. 5 is a schematic diagram of a multi-level feature fusion module, and the final classification distinguishing features are obtained after the input feature set is subjected to spatial modulation operation, temporal modulation operation and feature fusion operation.
Optionally, the module one specifically includes:
and a hybrid convolution module. As shown in fig. 3, the construction process of the hybrid convolution module is as follows: following the basic architecture of a three-dimensional residual network, apparent detail features on spatial dimensions are extracted by utilizing two-dimensional convolution operation in a residual network bottom layer structure to weaken low-level space-time semantic information, and a three-dimensional convolution structure is reserved in a network top layer structure to extract high-level space-time features, so that more abstract space-time semantic clues are emphasized. In order to further reduce the complexity of model calculation, the three-dimensional convolution involved in the network top-level structure is replaced by separable three-dimensional convolution, specifically, a three-dimensional convolution kernel with the size of t × h × w is decomposed along the space-time dimension, so that a time convolution kernel with the size of t × 1 × 1 and a space convolution kernel with the size of 1 × h × w are obtained, wherein t, h, and w respectively represent the time dimension, height, and width of the convolution kernel.
Optionally, the module two specifically includes:
and a time shifting module. As shown in FIG. 4, first define Ft∈RH×W×CThe feature tensor at time t is represented, H, W and C represent the height, width and channel dimensions of the input features, respectively. The time shifting module shifts partial channel information of the input features at each moment in the time dimension, so that the space semantic information of adjacent frames is fused into the current frame, and information interaction between the adjacent frames is promoted. The mathematical representation is as follows:
Figure BDA0003024904870000091
wherein the content of the first and second substances,
Figure BDA0003024904870000092
is shown ast-1Moves forward to time t in the time dimension,
Figure BDA0003024904870000093
is shown ast+1Is shifted backwards in the time dimension to time t, Ft 0Is represented by FtDoes not participate in time shifted channel information.
In addition, in order to avoid the phenomenon that the spatial modeling capacity of the model is damaged due to the fact that the channel information is moved in a large area, and further performance attenuation of the model is caused. Thus, the time shift module moves only a small fraction of the channels to model the time flow, and the unidirectional channel movement ratio is typically set to 1/8. In order to further retain the spatial feature learning capability of the model, in the adopted time shifting module, the shifting operation only occurs in the residual mapping branch, so that the original spatial semantic information can still be completely transferred to the subsequent network layer.
Optionally, the module iii specifically includes:
and a multi-level feature fusion module. As shown in FIG. 5, first, the inputs of the multi-level feature fusion module are defined. In order to fully utilize the visual information of each layer of convolution characteristics, convolution layer characteristics of M different depths are collected and expressed as:
F={F1,F2,…FM},
wherein the content of the first and second substances,
Figure BDA0003024904870000094
representing the convolution characteristics, i e (1, M), derived from a certain depth network layer. To ensure efficient fusion of features, space-time modulation is introduced. The detailed introduction process is as follows:
1) and (4) spatial modulation. On the one hand, for the network top-level feature Ftop∈RT×H×W×CThe spatial modulation is equivalent to identity mapping, and the original size is reserved. Convolution features for the remaining network depths, on the other hand
Figure BDA0003024904870000095
Utilizing a two-dimensional convolution operation with a specific step size design to reduce the size of the space dimension of each hierarchy feature so that the space dimension of each hierarchy feature is matched with the network top-level feature, namely:
Figure BDA0003024904870000096
wherein M isS(-) represents a spatial modulation operation.
2) And (5) time modulation. Will first be subjected to a spatial modulation operationThe updated features are re-expressed as
Figure BDA0003024904870000097
Then down-sampling it in the time dimension, wherein the down-sampling factor is composed of a set of well-designed hyper-parameters
Figure BDA0003024904870000098
Determination of alphaiRepresenting a down-sampling factor corresponding to a feature at depth level i. In addition, in order to facilitate the aggregation operation performed by the subsequent features, the channel dimension also needs to perform a downsampling operation, and a downsampling factor is determined by the number n of network layers participating in feature derivation. Namely:
Figure BDA0003024904870000099
wherein M isT(. cndot.) denotes a time modulation operation.
Feature fusion. Define $F' = \{F_1', F_2', \ldots, F_M'\}$ as the convolution features after spatio-temporal modulation. Considering that the temporal receptive field of low-level features is small while high-level features lack the description of local details, feature aggregation over the different depth levels is carried out with a bottom-up feature flow and a top-down feature flow, so that features of different levels supplement and complement each other. The feature aggregation method is described in detail next.

For the bottom-up feature flow, starting from the top-level feature, the feature $F_i'$ of one level supplements the next-level feature $F_{i+1}'$ through element-level addition and a downsampling operation applied in sequence, namely:

$$F_{i+1}'' = F_{i+1}' \oplus g(F_i''),$$

where $F_{i+1}''$ denotes the feature after bottom-up flow aggregation, $\oplus$ denotes element-level addition, $g(\cdot)$ denotes a downsampling operation that keeps the dimensions of the features of different levels from conflicting during aggregation, and $T_i/T_{i-1}$ is the sampling factor.

For the top-down feature flow, starting from the bottom-level feature, the next-level feature $F_{i+1}'$ sequentially enriches the spatial semantic information of the upper-level feature $F_i'$, namely:

$$\hat{F}_i = F_i' \oplus f(\hat{F}_{i+1}),$$

where $\hat{F}_i$ is the feature after top-down flow aggregation, $f(\cdot)$ denotes an upsampling operation, and $T_i/T_{i-1}$ is the sampling factor.

To perform the subsequent classification prediction, the two feature flows need to be fused, that is, the two parallel feature flows are processed simultaneously to generate the final classification discrimination feature, and the classification prediction produced by the multi-level fused feature is then obtained with the Softmax function.
Optionally, the module iv specifically includes:
a two-stage training strategy module. The model mainly comprises two parts, namely a backbone network formed by a hybrid convolution module and a time shift module, and a multi-stage feature fusion module. The training mode adopted by the model is as follows: in the first stage, firstly, training a backbone network, then fixing backbone network parameters, and training a multilevel feature fusion module; in the second stage, the multi-stage feature fusion module is initialized by utilizing the pre-training weight in the first stage, and then the whole model is trained through an end-to-end training paradigm. Model training is performed through a two-stage training strategy, limited training set data can be utilized to the maximum extent, and the recognition effect of the model is further improved. Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (6)

1. A video motion recognition method based on a hybrid convolution multistage feature fusion model is characterized by comprising the following steps: the method comprises the following steps:
step one: constructing a hybrid convolution module by adopting two-dimensional convolution and separable three-dimensional convolution; the hybrid convolution module construction process comprises: following the basic architecture of a three-dimensional residual network, extracting low-level spatial features in the bottom structure of the residual network by adopting a two-dimensional convolution operation, and extracting high-level space-time features in the top structure of the network by adopting a separable three-dimensional convolution operation, so as to build a hybrid convolution network, wherein the separable three-dimensional convolution operation refers to decomposing a three-dimensional convolution with a convolution kernel size of t × h × w along the space-time dimensions, so as to obtain a time convolution kernel with a size of t × 1 × 1 and a space convolution kernel with a size of 1 × h × w, wherein t, h, w respectively represent the time dimension, height and width of the convolution kernel;
step two: performing a channel shift operation on each input feature along the time dimension to construct a time shift module, promoting information flow between adjacent frames and compensating for the two-dimensional convolution operation's weakness in capturing dynamic features;
step three: deriving multi-level complementary features from different convolutional layers of a backbone network, and performing spatial modulation and time modulation on the multi-level complementary features, so that each level of features has consistent semantic information in a spatial dimension and has changeable visual rhythm clues in a time dimension;
step four: constructing a feature stream from bottom to top and a feature stream from top to bottom to supplement each other and perform parallel processing on the feature streams to realize multi-level feature fusion;
step five: and carrying out model training by using a two-stage training strategy.
2. The method for video motion recognition based on the hybrid convolution multistage feature fusion model according to claim 1, wherein: the second step specifically comprises:
first, define $F_t \in \mathbb{R}^{H\times W\times C}$ as the feature tensor at the t-th moment, where H, W and C respectively denote the height, width and channel dimension of the input feature; the time shift module shifts part of the channel information of the input feature at each moment along the time dimension, so that the spatial semantic information of adjacent frames is fused into the current frame and information interaction between adjacent frames is promoted; the mathematical expression of the time shift module is:

$$\hat{F}_t = \left[\,F_{t-1}^{+1},\; F_t^{0},\; F_{t+1}^{-1}\,\right],$$

where $F_{t-1}^{+1}$ denotes the channel information of $F_{t-1}$ moved forward along the time dimension to time t, $F_{t+1}^{-1}$ denotes the channel information of $F_{t+1}$ moved backward along the time dimension to time t, and $F_t^{0}$ denotes the channel information of $F_t$ that does not participate in the time shift;

in the time shift module the shift operation occurs only in the residual mapping branch, so that the original spatial semantic information can still be fully transferred into subsequent network layers; only a small fraction of the channels is moved to model the temporal flow, with the unidirectional channel movement ratio set to 1/8.
3. The method for video motion recognition based on the hybrid convolution multistage feature fusion model according to claim 1, wherein: the third step specifically comprises the following steps:
firstly, the input of the multi-level feature fusion module is defined by collecting the convolutional-layer features of M different depths, expressed as:

$$F = \{F_1, F_2, \ldots, F_M\},$$

where $F_i \in \mathbb{R}^{T_i \times H_i \times W_i \times C_i}$ denotes the convolution feature derived from the network layer of depth level i, $i \in (1, M)$;

the spatio-temporal modulation process is as follows:

1) spatial modulation: for the network top-level feature $F_{top} \in \mathbb{R}^{T\times H\times W\times C}$, spatial modulation is equivalent to an identity mapping and the original size is retained; for the convolution features $F_i$ of the remaining network depths, a two-dimensional convolution operation with a specifically designed stride is used to reduce the spatial size of each level's feature so that it matches the network top-level feature, namely:

$$\tilde{F}_i = M_S(F_i),$$

where $M_S(\cdot)$ denotes the spatial modulation operation;

2) temporal modulation: the features updated by the spatial modulation operation are first re-expressed as $\tilde{F} = \{\tilde{F}_1, \tilde{F}_2, \ldots, \tilde{F}_M\}$ and then downsampled along the time dimension, where the downsampling factor is determined by a set of carefully designed hyper-parameters $\{\alpha_1, \alpha_2, \ldots, \alpha_M\}$, with $\alpha_i$ denoting the downsampling factor corresponding to the feature of depth level i; a downsampling operation is also performed on the channel dimension, with a downsampling factor determined by the number n of network layers participating in feature derivation, namely:

$$F_i' = M_T(\tilde{F}_i),$$

where $M_T(\cdot)$ denotes the temporal modulation operation.
4. The method for video motion recognition based on the hybrid convolution multistage feature fusion model according to claim 1, wherein: the feature fusion specifically includes:
define $F' = \{F_1', F_2', \ldots, F_M'\}$ as the convolution features after spatio-temporal modulation; for the features of different depth levels, feature aggregation is carried out with a bottom-up feature flow and a top-down feature flow;

for the bottom-up feature flow, starting from the top-level feature, the feature $F_i'$ of one level supplements the next-level feature $F_{i+1}'$ through element-level addition and a downsampling operation applied in sequence, namely:

$$F_{i+1}'' = F_{i+1}' \oplus g(F_i''),$$

where $F_{i+1}''$ denotes the feature after bottom-up flow aggregation, $\oplus$ denotes element-level addition, $g(\cdot)$ denotes a downsampling operation that keeps the dimensions of the features of different levels from conflicting during aggregation, and $T_i/T_{i-1}$ is the sampling factor;

for the top-down feature flow, starting from the bottom-level feature, the next-level feature $F_{i+1}'$ sequentially enriches the spatial semantic information of the upper-level feature $F_i'$, namely:

$$\hat{F}_i = F_i' \oplus f(\hat{F}_{i+1}),$$

where $\hat{F}_i$ is the feature after top-down flow aggregation, $f(\cdot)$ denotes an upsampling operation, and $T_i/T_{i-1}$ is the sampling factor;

the two feature flows are fused, that is, the two parallel feature flows are processed simultaneously to generate the final classification discrimination feature, and the classification prediction produced by the multi-level fused feature is then obtained with the Softmax function.
5. The method for video motion recognition based on the hybrid convolution multistage feature fusion model according to claim 1, wherein: the two-stage training strategy in the fifth step is specifically as follows: in the first stage, firstly, training is carried out on a backbone network, then parameters of a backbone network part are fixed, and a subsequent multi-stage feature fusion module is trained independently; and in the second stage, initializing a multi-stage feature fusion module by using the weight learned in the first stage, and performing joint training on the whole model through an end-to-end training paradigm.
6. A video motion recognition system based on hybrid convolution and multi-level feature fusion is characterized in that: the system comprises a hybrid convolution module, a time shift module, a multi-stage feature fusion module and a two-stage training strategy module;
the hybrid convolution module follows the basic architecture of a three-dimensional residual network: low-level spatial features are extracted by adopting a two-dimensional convolution operation in the bottom layer structure of the residual network, high-level space-time features are extracted by adopting a separable three-dimensional convolution operation in the top layer structure of the network, and a hybrid convolution network is thereby built, wherein the separable three-dimensional convolution operation means that a three-dimensional convolution with a convolution kernel size of t × h × w is decomposed along the space-time dimensions, so that a time convolution kernel with a size of t × 1 × 1 and a space convolution kernel with a size of 1 × h × w are obtained, and t, h and w respectively represent the time dimension, height and width of the convolution kernel;
the time shifting module is used for shifting partial channel information of the input features at all times along a time dimension to compensate the defect that the two-dimensional convolution lacks the dynamic feature extraction capability;
the multi-level feature fusion module is used for deriving multi-level complementary features from convolutional layers with different depths of a backbone network, then enabling each feature to have the same shape in a space dimension by utilizing a space modulation operation, capturing a dynamic change condition of a visual rhythm of an action example by utilizing a time modulation operation, and finally preparing high-quality classification distinguishing features through feature fusion;
the two-stage training strategy module is used for carrying out model training in stages, and limited video data are utilized to the maximum extent.
CN202110413461.1A 2021-04-16 2021-04-16 Video action recognition method and system based on hybrid convolution multistage feature fusion model Active CN113128395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110413461.1A CN113128395B (en) 2021-04-16 2021-04-16 Video action recognition method and system based on hybrid convolution multistage feature fusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110413461.1A CN113128395B (en) 2021-04-16 2021-04-16 Video action recognition method and system based on hybrid convolution multistage feature fusion model

Publications (2)

Publication Number Publication Date
CN113128395A CN113128395A (en) 2021-07-16
CN113128395B true CN113128395B (en) 2022-05-20

Family

ID=76777114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110413461.1A Active CN113128395B (en) 2021-04-16 2021-04-16 Video action recognition method and system based on hybrid convolution multistage feature fusion model

Country Status (1)

Country Link
CN (1) CN113128395B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113850368A (en) * 2021-09-08 2021-12-28 深圳供电局有限公司 Lightweight convolutional neural network model suitable for edge-end equipment
CN116612524A (en) * 2022-02-07 2023-08-18 北京字跳网络技术有限公司 Action recognition method and device, electronic equipment and storage medium
CN115083434B (en) * 2022-07-22 2022-11-25 平安银行股份有限公司 Emotion recognition method and device, computer equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689065A (en) * 2019-09-23 2020-01-14 云南电网有限责任公司电力科学研究院 Hyperspectral image classification method based on flat mixed convolution neural network
CN111062264A (en) * 2019-11-27 2020-04-24 重庆邮电大学 Document object classification method based on dual-channel hybrid convolution network
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
US10713493B1 (en) * 2020-02-06 2020-07-14 Shenzhen Malong Technologies Co., Ltd. 4D convolutional neural networks for video recognition
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution
CN112270246A (en) * 2020-10-23 2021-01-26 泰康保险集团股份有限公司 Video behavior identification method and device, storage medium and electronic equipment
CN112348222A (en) * 2020-05-08 2021-02-09 东南大学 Network coupling time sequence information flow prediction method based on causal logic and graph convolution feature extraction
CN112613349A (en) * 2020-12-04 2021-04-06 北京理工大学 Time sequence action detection method and device based on deep hybrid convolutional neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163133A (en) * 2019-05-10 2019-08-23 南京邮电大学 A kind of Human bodys' response method based on depth residual error network
CN111382677B (en) * 2020-02-25 2023-06-20 华南理工大学 Human behavior recognition method and system based on 3D attention residual error model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689065A (en) * 2019-09-23 2020-01-14 云南电网有限责任公司电力科学研究院 Hyperspectral image classification method based on flat mixed convolution neural network
CN111062264A (en) * 2019-11-27 2020-04-24 重庆邮电大学 Document object classification method based on dual-channel hybrid convolution network
US10713493B1 (en) * 2020-02-06 2020-07-14 Shenzhen Malong Technologies Co., Ltd. 4D convolutional neural networks for video recognition
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN112348222A (en) * 2020-05-08 2021-02-09 东南大学 Network coupling time sequence information flow prediction method based on causal logic and graph convolution feature extraction
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution
CN112270246A (en) * 2020-10-23 2021-01-26 泰康保险集团股份有限公司 Video behavior identification method and device, storage medium and electronic equipment
CN112613349A (en) * 2020-12-04 2021-04-06 北京理工大学 Time sequence action detection method and device based on deep hybrid convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Human Action Recognition Based on Spatio-temporal Features and Deep Learning; Li Xuyang; China Master's Theses Full-text Database (Information Science and Technology); 2017-03-15; I138-5818 *

Also Published As

Publication number Publication date
CN113128395A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN113128395B (en) Video action recognition method and system based on hybrid convolution multistage feature fusion model
Zhou et al. Contextual ensemble network for semantic segmentation
Mo et al. Review the state-of-the-art technologies of semantic segmentation based on deep learning
Yang et al. Small object augmentation of urban scenes for real-time semantic segmentation
Chandio et al. Precise single-stage detector
CN111325155A (en) Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN111402310A (en) Monocular image depth estimation method and system based on depth estimation network
CN113065450B (en) Human body action recognition method based on separable three-dimensional residual error attention network
CN112800937A (en) Intelligent face recognition method
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
Fu et al. Learning semantic-aware spatial-temporal attention for interpretable action recognition
Kumar Shukla et al. Comparative analysis of machine learning based approaches for face detection and recognition
Le et al. REDN: a recursive encoder-decoder network for edge detection
CN112819692A (en) Real-time arbitrary style migration method based on double attention modules
CN114511502A (en) Gastrointestinal endoscope image polyp detection system based on artificial intelligence, terminal and storage medium
CN113505719A (en) Gait recognition model compression system and method based on local-integral joint knowledge distillation algorithm
Lu et al. MFNet: Multi-feature fusion network for real-time semantic segmentation in road scenes
Yan et al. Weakly supervised regional and temporal learning for facial action unit recognition
Cheng et al. A survey on image semantic segmentation using deep learning techniques
Wani et al. Deep learning-based video action recognition: a review
Li et al. Progressive cross-domain knowledge distillation for efficient unsupervised domain adaptive object detection
Chong et al. Multi-hierarchy feature extraction and multi-step cost aggregation for stereo matching
CN114091583A (en) Salient object detection system and method based on attention mechanism and cross-modal fusion
Robert The Role of Deep Learning in Computer Vision
Xiao et al. Lightweight sea cucumber recognition network using improved YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant