CN113128395B - Video action recognition method and system based on hybrid convolution multistage feature fusion model - Google Patents

Video action recognition method and system based on hybrid convolution multistage feature fusion model

Info

Publication number
CN113128395B
CN113128395B (application CN202110413461.1A)
Authority
CN
China
Prior art keywords
feature
convolution
time
level
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110413461.1A
Other languages
Chinese (zh)
Other versions
CN113128395A (en
Inventor
张祖凡
彭月
甘臣权
张家波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110413461.1A priority Critical patent/CN113128395B/en
Publication of CN113128395A publication Critical patent/CN113128395A/en
Application granted granted Critical
Publication of CN113128395B publication Critical patent/CN113128395B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items, of sport video content
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods
    • G06V 10/40 Extraction of image or video features
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video action recognition method and system based on a hybrid convolution multistage feature fusion model, belonging to the technical field of computer vision. A hybrid convolution module is constructed from two-dimensional convolution and separable three-dimensional convolution; a channel shift operation is performed on each input feature along the time dimension to build a time shift module that promotes information flow between adjacent frames and compensates for the limited ability of two-dimensional convolution to capture dynamic features; multi-level complementary features are derived from different convolutional layers of a backbone network and subjected to spatial and temporal modulation, so that the features of each level carry consistent semantic information in the spatial dimension and variable visual rhythm cues in the time dimension; a bottom-up feature stream and a top-down feature stream that supplement each other are constructed and processed in parallel to realize multi-level feature fusion; and the model is trained with a two-stage training strategy.

Description

Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
Technical Field
The invention belongs to the technical field of computer vision, and relates to a video motion recognition method and system based on a hybrid convolution multistage feature fusion model.
Background
The rapid development of artificial intelligence research has pushed human-computer interaction technology into people's daily life, and the human action recognition research derived from it has attracted wide attention. In video-based action recognition, traditional methods rely mainly on hand-crafted feature design and suffer from severe domain limitations. To overcome these deficiencies and obtain more general feature representations, convolutional neural networks (CNNs), built on the mechanism of biological visual perception, have been widely applied to action recognition.
The action recognition performance of a model is closely tied to how well it can represent a video. Since video data is the sequential extension of two-dimensional plane images into a three-dimensional space-time volume, video-based feature extraction divides into two equally important and complementary parts: spatial appearance representation and temporal dynamic modeling. A 2D CNN model can effectively capture the spatial neighborhood correlations of input video frames, but its limited dynamic feature extraction capability leaves the temporal changes of body motion largely unattended, which imposes many limitations. A 3D CNN, by virtue of its internal structure, fuses spatio-temporal features directly on the raw input and therefore has an inherent advantage, but it requires expensive computation and memory overhead. To find a better compromise between computation speed and recognition performance, constructing a high-performance action recognition model that combines the low complexity of two-dimensional convolution with the effectiveness of three-dimensional convolution is clearly a promising research direction.
In addition, when limb movements with highly similar visual appearance are judged only by the dynamic changes of the observed target in the spatio-temporal dimensions, similar action categories are easily confused during classification. To increase the distinctiveness between similar action categories, the model should give equal attention to visual rhythm. Predefining visual rhythm variation at the input stage can clearly improve recognition, but training the parameters of multiple network branches greatly increases model complexity. Researchers have confirmed that, as the number of network layers increases, the convolutional features derived from layers of different depths already contain information about visual rhythm changes. How to model the dynamic changes of visual rhythm at the feature level is therefore another important research direction.
Disclosure of Invention
In view of the above, the present invention provides a video motion recognition method and system based on a hybrid convolution multi-level feature fusion model.
In order to achieve the purpose, the invention provides the following technical scheme:
in one aspect, the invention provides a video motion recognition method based on a hybrid convolution multistage feature fusion model, which comprises the following steps:
step one: constructing a hybrid convolution module by adopting two-dimensional convolution and separable three-dimensional convolution;
step two: performing a channel shift operation on each input feature along the time dimension to construct a time shift module, promoting information flow between adjacent frames and compensating for the two-dimensional convolution operation's weakness in capturing dynamic features;
step three: deriving multi-level complementary features from different convolutional layers of a backbone network, and performing spatial modulation and time modulation on the multi-level complementary features, so that each level of features has consistent semantic information in a spatial dimension and has changeable visual rhythm clues in a time dimension;
step four: constructing a feature stream from bottom to top and a feature stream from top to bottom to supplement each other and perform parallel processing on the feature streams to realize multi-level feature fusion;
step five: and carrying out model training by using a two-stage training strategy.
Further, the hybrid convolution module construction process in the first step includes: following the basic architecture of a three-dimensional residual network, extracting low-level spatial features in a residual network bottom structure by adopting a two-dimensional convolution operation, and extracting high-level space-time features in a network top structure by adopting a separable three-dimensional convolution operation, so as to build a hybrid convolution network, wherein the separable three-dimensional convolution operation refers to decomposing a three-dimensional convolution with a convolution kernel size of t × h × w along a space-time dimension, so as to obtain a time convolution kernel with a size of t × 1 × 1 and a space convolution kernel with a size of 1 × h × w, wherein t, h, and w respectively represent the time dimension, height and width of the convolution kernel.
Further, the second step specifically includes:
first, define $F_t \in \mathbb{R}^{H\times W\times C}$ as the feature tensor at time t, where H, W and C respectively denote the height, width and channel dimension of the input feature; the time shift module shifts part of the channel information of the input feature at each moment along the time dimension, so that the spatial semantic information of adjacent frames is fused into the current frame and information interaction between adjacent frames is promoted; the mathematical expression of the time shift module is:

$$\hat{F}_t = \left[\,F_{t-1}^{+1},\; F_t^{0},\; F_{t+1}^{-1}\,\right],$$

where $F_{t-1}^{+1}$ denotes the channel information of $F_{t-1}$ moved forward along the time dimension to time t, $F_{t+1}^{-1}$ denotes the channel information of $F_{t+1}$ moved backward along the time dimension to time t, and $F_t^{0}$ denotes the channel information of $F_t$ that does not participate in the time shift;

in the time shift module the shift operation occurs only in the residual mapping branch, so that the original spatial semantic information can still be fully transferred into subsequent network layers; only a small fraction of the channels is moved to model the temporal flow, with the unidirectional channel movement ratio set to 1/8.
Further, the third step specifically includes the following steps:
firstly, the input of the multi-level feature fusion module is defined by collecting the convolutional-layer features of M different depths, expressed as:

$$F = \{F_1, F_2, \ldots, F_M\},$$

where $F_i \in \mathbb{R}^{T_i \times H_i \times W_i \times C_i}$ denotes the convolution feature derived from the network layer of depth level i, $i \in (1, M)$;

the spatio-temporal modulation process is as follows:

1) spatial modulation: for the network top-level feature $F_{top} \in \mathbb{R}^{T\times H\times W\times C}$, spatial modulation is equivalent to an identity mapping and the original size is retained; for the convolution features $F_i$ of the remaining network depths, a two-dimensional convolution operation with a specifically designed stride is used to reduce the spatial size of each level's feature so that it matches the network top-level feature, namely:

$$\tilde{F}_i = M_S(F_i),$$

where $M_S(\cdot)$ denotes the spatial modulation operation;

2) temporal modulation: the features updated by the spatial modulation operation are first re-expressed as $\tilde{F} = \{\tilde{F}_1, \tilde{F}_2, \ldots, \tilde{F}_M\}$ and then downsampled along the time dimension, where the downsampling factor is determined by a set of carefully designed hyper-parameters $\{\alpha_1, \alpha_2, \ldots, \alpha_M\}$, with $\alpha_i$ denoting the downsampling factor corresponding to the feature of depth level i; a downsampling operation is also performed on the channel dimension, with a downsampling factor determined by the number n of network layers participating in feature derivation, namely:

$$F_i' = M_T(\tilde{F}_i),$$

where $M_T(\cdot)$ denotes the temporal modulation operation.
Further, the feature fusion of the fourth step specifically includes:
define $F' = \{F_1', F_2', \ldots, F_M'\}$ as the convolution features after spatio-temporal modulation; for the features of different depth levels, feature aggregation is carried out with a bottom-up feature flow and a top-down feature flow;

for the bottom-up feature flow, starting from the top-level feature, the feature $F_i'$ of one level supplements the next-level feature $F_{i+1}'$ through element-level addition and a downsampling operation applied in sequence, namely:

$$F_{i+1}'' = F_{i+1}' \oplus g(F_i''),$$

where $F_{i+1}''$ denotes the feature after bottom-up flow aggregation, $\oplus$ denotes element-level addition, $g(\cdot)$ denotes a downsampling operation that keeps the dimensions of the features of different levels from conflicting during aggregation, and $T_i/T_{i-1}$ is the sampling factor;

for the top-down feature flow, starting from the bottom-level feature, the next-level feature $F_{i+1}'$ sequentially enriches the spatial semantic information of the upper-level feature $F_i'$, namely:

$$\hat{F}_i = F_i' \oplus f(\hat{F}_{i+1}),$$

where $\hat{F}_i$ is the feature after top-down flow aggregation, $f(\cdot)$ denotes an upsampling operation, and $T_i/T_{i-1}$ is the sampling factor;

the two feature flows are then fused, that is, the two parallel feature flows are processed simultaneously to generate the final classification discrimination feature, and the classification prediction produced by the multi-level fused feature is obtained with the Softmax function.
Further, the two-stage training strategy in the fifth step specifically comprises: in the first stage, firstly, training is carried out on a backbone network, then parameters of a backbone network part are fixed, and a subsequent multi-stage feature fusion module is trained independently; and in the second stage, initializing a multi-stage feature fusion module by using the weight learned in the first stage, and performing joint training on the whole model through an end-to-end training paradigm.
On the other hand, the invention provides a video motion recognition system based on hybrid convolution and multi-level feature fusion, which comprises a hybrid convolution module, a time shift module, a multi-level feature fusion module and a two-stage training strategy module;
the hybrid convolution module follows the basic architecture of a three-dimensional residual network: low-level spatial features are extracted by adopting a two-dimensional convolution operation in the bottom layer structure of the residual network, high-level space-time features are extracted by adopting a separable three-dimensional convolution operation in the top layer structure of the network, and a hybrid convolution network is thereby built, wherein the separable three-dimensional convolution operation means that a three-dimensional convolution with a convolution kernel size of t × h × w is decomposed along the space-time dimensions, so that a time convolution kernel with a size of t × 1 × 1 and a space convolution kernel with a size of 1 × h × w are obtained, and t, h and w respectively represent the time dimension, height and width of the convolution kernel;
the time shift module is used for shifting part of the channel information of the input feature at each time step along the time dimension, to compensate for the two-dimensional convolution's lack of dynamic feature extraction capability;
the multi-level feature fusion module is used for deriving multi-level complementary features from convolutional layers of different depths of the backbone network, then making all features the same shape in the spatial dimension with a spatial modulation operation, capturing the dynamic changes of the visual rhythm of an action instance with a temporal modulation operation, and finally producing high-quality classification discrimination features through feature fusion;
the two-stage training strategy module is used for performing model training in stages, making maximal use of the limited video data.
The invention has the beneficial effects that: the hybrid convolution module is utilized to combine the advantages of low complexity of two-dimensional convolution operation and high efficiency of separable three-dimensional convolution operation, so that the model complexity is obviously reduced, and a better compromise is sought between the calculation speed and the recognition performance; the dynamic change of the limb movement on a short-term time scale is simulated by using a low-cost time shift module, so that the dynamic feature extraction capability which is lacked by two-dimensional convolution operation is compensated to a certain extent; the multi-level feature fusion module is used for realizing effective fusion of multi-depth features, and visual rhythm change of an action example is effectively represented under the condition of fully utilizing visual information of each level of feature.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of the steps of a video motion recognition method based on a hybrid convolution multistage feature fusion model according to the present invention;
FIG. 2 is a model diagram of a video motion recognition system based on a hybrid convolution multistage feature fusion model according to the present invention;
FIG. 3 is a schematic diagram of a hybrid convolution module;
FIG. 4 is a schematic diagram of a time shifting module;
FIG. 5 is a schematic diagram of a multi-level feature fusion module.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for the purpose of illustrating the invention only and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
As shown in fig. 1, the present invention provides a video motion recognition method based on hybrid convolution and multi-level feature fusion, which includes the following steps:
the method comprises the following steps: adopting a low-cost two-dimensional convolution operation at the bottom layer of the network and adopting a separable three-dimensional convolution operation at the top layer of the network so as to construct a hybrid convolution module aiming at reducing the redundancy of a full three-dimensional convolution model;
the construction process of the hybrid convolution module is as follows: following the basic architecture of a three-dimensional residual network, apparent detail features on spatial dimensions are extracted by utilizing two-dimensional convolution operation in a residual network bottom layer structure to weaken low-level space-time semantic information, and a three-dimensional convolution structure is reserved in a network top layer structure to extract high-level space-time features, so that more abstract space-time semantic clues are emphasized. In order to further reduce the complexity of model calculation, the three-dimensional convolution involved in the network top-level structure is replaced by separable three-dimensional convolution, specifically, a three-dimensional convolution kernel with the size of t × h × w is decomposed along the space-time dimension, so that a time convolution kernel with the size of t × 1 × 1 and a space convolution kernel with the size of 1 × h × w are obtained, wherein t, h, and w respectively represent the time dimension, height, and width of the convolution kernel.
Step two: performing channel shift operation on each input feature along the time dimension, and constructing a time shift module, thereby promoting information flow between adjacent frames to compensate for dynamic feature extraction capability lacking in two-dimensional convolution operation;
First, define $F_t \in \mathbb{R}^{H\times W\times C}$ as the feature tensor at time t, where H, W and C respectively denote the height, width and channel dimensions of the input feature. The time shift module shifts part of the channel information of the input feature at each moment along the time dimension, so that the spatial semantic information of adjacent frames is fused into the current frame and information interaction between adjacent frames is promoted. The mathematical representation is:

$$\hat{F}_t = \left[\,F_{t-1}^{+1},\; F_t^{0},\; F_{t+1}^{-1}\,\right],$$

where $F_{t-1}^{+1}$ denotes the channel information of $F_{t-1}$ moved forward along the time dimension to time t, $F_{t+1}^{-1}$ denotes the channel information of $F_{t+1}$ moved backward along the time dimension to time t, and $F_t^{0}$ denotes the channel information of $F_t$ that does not participate in the time shift.

In addition, moving channel information over a large proportion of the channels would damage the spatial modeling capability of the model and cause performance degradation. The time shift module therefore moves only a small fraction of the channels to model the temporal flow, and the unidirectional channel movement ratio is typically set to 1/8. To further retain the spatial feature learning capability of the model, the shift operation occurs only in the residual mapping branch, so the original spatial semantic information can still be completely transferred to subsequent network layers.
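The channel shift itself reduces to a few lines of tensor indexing. The sketch below assumes an input laid out as (N, T, C, H, W) and the 1/8 unidirectional shift ratio mentioned above; the residual block around it is a simplified stand-in for the backbone block, not its exact structure.

```python
import torch
import torch.nn as nn

def temporal_shift(x, shift_ratio=1 / 8):
    """Shift a fraction of channels forward and backward along the time axis.
    x: tensor of shape (N, T, C, H, W)."""
    n, t, c, h, w = x.size()
    fold = int(c * shift_ratio)
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # F_{t-1} channels arrive at time t
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # F_{t+1} channels arrive at time t
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # unshifted channels (F_t^0)
    return out

class ShiftResidualBlock(nn.Module):
    """Shift only inside the residual-mapping branch so the identity path
    keeps the original spatial semantics intact."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):                      # x: (N, T, C, H, W)
        n, t, c, h, w = x.size()
        y = temporal_shift(x).reshape(n * t, c, h, w)
        y = torch.relu(self.bn(self.conv(y))).reshape(n, t, c, h, w)
        return x + y                           # identity branch is left unshifted
```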
Step three: deriving multi-level complementary features from different convolutional layers of a backbone network, and performing spatial modulation and time modulation on the multi-level complementary features, so that each level of features has consistent semantic information in a spatial dimension and has changeable visual rhythm clues in a time dimension;
first, the inputs of the multi-level feature fusion module are defined. In order to fully utilize the visual information of each layer of convolution characteristics, convolution layer characteristics of M different depths are collected and expressed as:
F={F1,F2,…FM},
wherein the content of the first and second substances,
Figure BDA0003024904870000064
representing the convolution characteristics, i e (1, M), derived from a certain depth network layer. To ensure efficient fusion of features, space-time modulation is introduced. The detailed introduction process is as follows:
1) and (4) spatial modulation. On the one hand, for the network top-level feature Ftop∈RT×H×W×CThe spatial modulation is equivalent to identity mapping, and the original size is reserved. Convolution features for the remaining network depths, on the other hand
Figure BDA0003024904870000065
Utilizing a two-dimensional convolution operation with a specific step size design to reduce the size of the space dimension of each hierarchy feature so that the space dimension of each hierarchy feature is matched with the network top-level feature, namely:
Figure BDA0003024904870000071
wherein M isS(-) represents a spatial modulation operation.
2) And (5) time modulation. The features updated by the spatial modulation operation are first re-expressed as
Figure BDA0003024904870000072
Then down-sampling it in the time dimension, wherein the down-sampling factor is composed of a set of well-designed hyper-parameters
Figure BDA0003024904870000073
Determination of alphaiRepresenting a down-sampling factor corresponding to a feature at depth level i. In addition, in order to facilitate the aggregation operation performed by the subsequent features, the channel dimension also needs to perform a downsampling operation, and a downsampling factor is determined by the number n of network layers participating in feature derivation. Namely:
Figure BDA0003024904870000074
wherein M isT(. cndot.) denotes a time modulation operation.
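As a rough sketch of the two modulation steps, the module below matches the top-level spatial size with a strided per-frame (1 × 3 × 3) convolution and downsamples time with average pooling by a factor alpha_i; the stride value, alpha_i (assumed to be an integer) and the output channel count standing in for the channel-dimension downsampling are placeholders chosen for the example, not values taken from the patent.

```python
import torch
import torch.nn as nn

class SpatioTemporalModulation(nn.Module):
    """Spatial modulation: strided per-frame convolution so the feature matches
    the top-level spatial size (and the target channel count). Temporal
    modulation: downsampling along the time axis by a factor alpha_i."""
    def __init__(self, in_channels, out_channels, spatial_stride, alpha_i):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3),
                                 stride=(1, spatial_stride, spatial_stride),
                                 padding=(0, 1, 1), bias=False)
        self.temporal = nn.AvgPool3d(kernel_size=(alpha_i, 1, 1),
                                     stride=(alpha_i, 1, 1))

    def forward(self, x):            # x: (N, C_i, T_i, H_i, W_i)
        x = self.spatial(x)          # match top-level H x W, adjust channels
        x = self.temporal(x)         # downsample along the time dimension
        return x
```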
Step four: constructing a feature stream from bottom to top and a feature stream from top to bottom to supplement each other and perform parallel processing on the feature streams to realize multi-level feature fusion;
definition of
Figure BDA0003024904870000075
Representing the convolution characteristics after space-time modulation. Considering that the time receptive field range of the low-level features is small, and the high-level features lack the description of local details, for the features of different depth levels, research is carried out on feature aggregation by using a feature flow from bottom to top and a feature flow from top to bottom, so that the features of different levels supplement each other and complement each other. Next, the characteristic polymerization method employed is described in detail.
For bottom-up feature flow, starting with the top-level feature, the top-level feature F'iSequentially applying element-level addition and downsampling operations to the next-level feature F'i+1And (3) supplementing, namely:
Figure BDA0003024904870000076
wherein, F ″)i+1Showing the characteristics after bottom-up flow addition polymerization,
Figure BDA0003024904870000077
representing element-level addition, g (-) representing a downsampling operation to ensure that dimensions of features of layers do not conflict during aggregation, Ti/Ti-1Is a sampling factor;
for top-down feature flow, starting with the bottom-level feature, the next-level feature F'i+1Sequentially enriching the characteristics F of the upper leveliThe spatial semantic information of' i.e.:
Figure BDA0003024904870000078
wherein, Fi' is a feature after top-down flow aggregation, f (-) denotes an upsampling operation, Ti/Ti-1Is the sampling factor.
In order to execute subsequent classification prediction, the two feature streams need to be fused, namely, the two parallel feature streams are processed simultaneously to generate a final classification discrimination feature, and then a classification prediction result generated by the multi-level fusion feature is obtained by using a Softmax function.
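The two aggregation flows can be sketched as below, assuming all modulated features share the same channel count and spatial size and that g(·) and f(·) are realized as adaptive temporal pooling and trilinear interpolation; the indexing direction, the pooling choices and the final classifier are assumptions made for the example rather than the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def bottom_up(features):
    """features: list of modulated features, shallow to deep, each of shape
    (N, C, T_i, H, W). Each level is supplemented by the aggregated feature of
    the previous level, downsampled in time to match (g)."""
    out = [features[0]]
    for f_next in features[1:]:
        prev = out[-1]
        target = (f_next.size(2), f_next.size(3), f_next.size(4))
        out.append(f_next + F.adaptive_avg_pool3d(prev, target))  # element-level addition
    return out

def top_down(features):
    """Each level is enriched by the aggregated deeper level, upsampled in time (f)."""
    out = [features[-1]]
    for f_prev in reversed(features[:-1]):
        up = F.interpolate(out[0], size=f_prev.shape[2:], mode='trilinear',
                           align_corners=False)
        out.insert(0, f_prev + up)
    return out

def fuse_and_classify(features, classifier):
    """Process the two flows in parallel, merge them, and predict with Softmax.
    classifier is assumed to be e.g. nn.Linear(C, num_classes)."""
    bu = bottom_up(features)[-1]               # deepest output of the bottom-up flow
    td = top_down(features)[0]                 # shallowest output of the top-down flow
    pooled = (F.adaptive_avg_pool3d(bu, 1).flatten(1)
              + F.adaptive_avg_pool3d(td, 1).flatten(1))
    return torch.softmax(classifier(pooled), dim=1)
```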
Step five: performing model training by using a two-stage training strategy to further improve the performance of the model;
the model mainly comprises two parts, namely a backbone network formed by a hybrid convolution module and a time shift module, and a multi-stage feature fusion module. The training mode adopted by the model is as follows: in the first stage, firstly, training a backbone network, then fixing backbone network parameters, and training a multilevel feature fusion module; in the second stage, the multi-stage feature fusion module is initialized by utilizing the pre-training weight in the first stage, and then the whole model is trained through an end-to-end training paradigm. Model training is performed through a two-stage training strategy, limited training set data can be utilized to the maximum extent, and the recognition effect of the model is further improved.
Fig. 2 is a system model diagram of the present invention, which is described below with reference to the accompanying drawings, and includes the following modules:
a first module: extracting low-level spatial features by adopting low-cost two-dimensional convolution operation in a network bottom layer structure, preparing high-level space-time features by adopting separable three-dimensional convolution operation in a network top layer structure, and combining the advantages of low complexity of the two-dimensional convolution operation and high efficiency of the separable three-dimensional convolution operation so as to construct a mixed convolution module;
and a second module: by introducing a time shift module, shifting partial channel information of input features at all times along a time dimension to compensate the defect that two-dimensional convolution lacks dynamic feature extraction capability;
and a third module: the method comprises the steps that a multilevel feature fusion module is used for deriving multilevel complementary features from convolutional layers with different depths of a backbone network, then spatial modulation operation is used for enabling all the features to have the same shape in spatial dimension, time modulation operation is used for capturing dynamic change conditions of visual rhythm of an action example, and finally high-quality classification distinguishing features are prepared through effective feature fusion;
and a module IV: model training is performed in stages using a two-stage training strategy, thereby maximizing the use of limited video data and further improving model performance.
Fig. 3 is a schematic diagram of a hybrid convolution module, illustrating the arrangement of hybrid convolution and the corresponding convolution kernel size in detail.
Fig. 4 is a schematic diagram of a time shift module, where after the input features are subjected to channel shift operation in the time dimension, the spatial semantic information of adjacent frames is fused with the current frame, so as to promote the flow of time information between adjacent frames.
Fig. 5 is a schematic diagram of a multi-level feature fusion module, and the final classification distinguishing features are obtained after the input feature set is subjected to spatial modulation operation, temporal modulation operation and feature fusion operation.
Optionally, the module one specifically includes:
and a hybrid convolution module. As shown in fig. 3, the construction process of the hybrid convolution module is as follows: following the basic architecture of a three-dimensional residual network, apparent detail features on spatial dimensions are extracted by utilizing two-dimensional convolution operation in a residual network bottom layer structure to weaken low-level space-time semantic information, and a three-dimensional convolution structure is reserved in a network top layer structure to extract high-level space-time features, so that more abstract space-time semantic clues are emphasized. In order to further reduce the complexity of model calculation, the three-dimensional convolution involved in the network top-level structure is replaced by separable three-dimensional convolution, specifically, a three-dimensional convolution kernel with the size of t × h × w is decomposed along the space-time dimension, so that a time convolution kernel with the size of t × 1 × 1 and a space convolution kernel with the size of 1 × h × w are obtained, wherein t, h, and w respectively represent the time dimension, height, and width of the convolution kernel.
Optionally, the module two specifically includes:
and a time shifting module. As shown in FIG. 4, first define Ft∈RH×W×CThe feature tensor at time t is represented, H, W and C represent the height, width and channel dimensions of the input features, respectively. The time shifting module shifts partial channel information of the input features at each moment in the time dimension, so that the space semantic information of adjacent frames is fused into the current frame, and information interaction between the adjacent frames is promoted. The mathematical representation is as follows:
Figure BDA0003024904870000091
wherein the content of the first and second substances,
Figure BDA0003024904870000092
is shown ast-1Moves forward to time t in the time dimension,
Figure BDA0003024904870000093
is shown ast+1Is shifted backwards in the time dimension to time t, Ft 0Is represented by FtDoes not participate in time shifted channel information.
In addition, in order to avoid the phenomenon that the spatial modeling capacity of the model is damaged due to the fact that the channel information is moved in a large area, and further performance attenuation of the model is caused. Thus, the time shift module moves only a small fraction of the channels to model the time flow, and the unidirectional channel movement ratio is typically set to 1/8. In order to further retain the spatial feature learning capability of the model, in the adopted time shifting module, the shifting operation only occurs in the residual mapping branch, so that the original spatial semantic information can still be completely transferred to the subsequent network layer.
Optionally, the module iii specifically includes:
and a multi-level feature fusion module. As shown in FIG. 5, first, the inputs of the multi-level feature fusion module are defined. In order to fully utilize the visual information of each layer of convolution characteristics, convolution layer characteristics of M different depths are collected and expressed as:
F={F1,F2,…FM},
wherein the content of the first and second substances,
Figure BDA0003024904870000094
representing the convolution characteristics, i e (1, M), derived from a certain depth network layer. To ensure efficient fusion of features, space-time modulation is introduced. The detailed introduction process is as follows:
1) and (4) spatial modulation. On the one hand, for the network top-level feature Ftop∈RT×H×W×CThe spatial modulation is equivalent to identity mapping, and the original size is reserved. Convolution features for the remaining network depths, on the other hand
Figure BDA0003024904870000095
Utilizing a two-dimensional convolution operation with a specific step size design to reduce the size of the space dimension of each hierarchy feature so that the space dimension of each hierarchy feature is matched with the network top-level feature, namely:
Figure BDA0003024904870000096
wherein M isS(-) represents a spatial modulation operation.
2) And (5) time modulation. Will first be subjected to a spatial modulation operationThe updated features are re-expressed as
Figure BDA0003024904870000097
Then down-sampling it in the time dimension, wherein the down-sampling factor is composed of a set of well-designed hyper-parameters
Figure BDA0003024904870000098
Determination of alphaiRepresenting a down-sampling factor corresponding to a feature at depth level i. In addition, in order to facilitate the aggregation operation performed by the subsequent features, the channel dimension also needs to perform a downsampling operation, and a downsampling factor is determined by the number n of network layers participating in feature derivation. Namely:
Figure BDA0003024904870000099
wherein M isT(. cndot.) denotes a time modulation operation.
Feature fusion. Define $F' = \{F_1', F_2', \ldots, F_M'\}$ as the convolution features after spatio-temporal modulation. Considering that the temporal receptive field of low-level features is small while high-level features lack the description of local details, feature aggregation over the different depth levels is carried out with a bottom-up feature flow and a top-down feature flow, so that features of different levels supplement and complement each other. The feature aggregation method is described in detail next.

For the bottom-up feature flow, starting from the top-level feature, the feature $F_i'$ of one level supplements the next-level feature $F_{i+1}'$ through element-level addition and a downsampling operation applied in sequence, namely:

$$F_{i+1}'' = F_{i+1}' \oplus g(F_i''),$$

where $F_{i+1}''$ denotes the feature after bottom-up flow aggregation, $\oplus$ denotes element-level addition, $g(\cdot)$ denotes a downsampling operation that keeps the dimensions of the features of different levels from conflicting during aggregation, and $T_i/T_{i-1}$ is the sampling factor.

For the top-down feature flow, starting from the bottom-level feature, the next-level feature $F_{i+1}'$ sequentially enriches the spatial semantic information of the upper-level feature $F_i'$, namely:

$$\hat{F}_i = F_i' \oplus f(\hat{F}_{i+1}),$$

where $\hat{F}_i$ is the feature after top-down flow aggregation, $f(\cdot)$ denotes an upsampling operation, and $T_i/T_{i-1}$ is the sampling factor.

To perform the subsequent classification prediction, the two feature flows need to be fused, that is, the two parallel feature flows are processed simultaneously to generate the final classification discrimination feature, and the classification prediction produced by the multi-level fused feature is then obtained with the Softmax function.
Optionally, the module iv specifically includes:
a two-stage training strategy module. The model mainly comprises two parts, namely a backbone network formed by a hybrid convolution module and a time shift module, and a multi-stage feature fusion module. The training mode adopted by the model is as follows: in the first stage, firstly, training a backbone network, then fixing backbone network parameters, and training a multilevel feature fusion module; in the second stage, the multi-stage feature fusion module is initialized by utilizing the pre-training weight in the first stage, and then the whole model is trained through an end-to-end training paradigm. Model training is performed through a two-stage training strategy, limited training set data can be utilized to the maximum extent, and the recognition effect of the model is further improved. Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (6)

1. A video motion recognition method based on a hybrid convolution multistage feature fusion model is characterized by comprising the following steps: the method comprises the following steps:
step one: constructing a hybrid convolution module by adopting two-dimensional convolution and separable three-dimensional convolution; the hybrid convolution module construction process comprises: following the basic architecture of a three-dimensional residual network, extracting low-level spatial features in the bottom structure of the residual network by adopting a two-dimensional convolution operation, and extracting high-level space-time features in the top structure of the network by adopting a separable three-dimensional convolution operation, so as to build a hybrid convolution network, wherein the separable three-dimensional convolution operation refers to decomposing a three-dimensional convolution with a convolution kernel size of t × h × w along the space-time dimensions, so as to obtain a time convolution kernel with a size of t × 1 × 1 and a space convolution kernel with a size of 1 × h × w, wherein t, h, w respectively represent the time dimension, height and width of the convolution kernel;
step two: performing a channel shift operation on each input feature along the time dimension to construct a time shift module, promoting information flow between adjacent frames and compensating for the two-dimensional convolution operation's weakness in capturing dynamic features;
step three: deriving multi-level complementary features from different convolutional layers of a backbone network, and performing spatial modulation and time modulation on the multi-level complementary features, so that each level of features has consistent semantic information in a spatial dimension and has changeable visual rhythm clues in a time dimension;
step four: constructing a feature stream from bottom to top and a feature stream from top to bottom to supplement each other and perform parallel processing on the feature streams to realize multi-level feature fusion;
step five: and carrying out model training by using a two-stage training strategy.
2. The method for video motion recognition based on the hybrid convolution multistage feature fusion model according to claim 1, wherein: the second step specifically comprises:
first, define $F_t \in \mathbb{R}^{H\times W\times C}$ as the feature tensor at the t-th moment, where H, W and C respectively denote the height, width and channel dimension of the input feature; the time shift module shifts part of the channel information of the input feature at each moment along the time dimension, so that the spatial semantic information of adjacent frames is fused into the current frame and information interaction between adjacent frames is promoted; the mathematical expression of the time shift module is:

$$\hat{F}_t = \left[\,F_{t-1}^{+1},\; F_t^{0},\; F_{t+1}^{-1}\,\right],$$

where $F_{t-1}^{+1}$ denotes the channel information of $F_{t-1}$ moved forward along the time dimension to time t, $F_{t+1}^{-1}$ denotes the channel information of $F_{t+1}$ moved backward along the time dimension to time t, and $F_t^{0}$ denotes the channel information of $F_t$ that does not participate in the time shift;

in the time shift module the shift operation occurs only in the residual mapping branch, so that the original spatial semantic information can still be fully transferred into subsequent network layers; only a small fraction of the channels is moved to model the temporal flow, with the unidirectional channel movement ratio set to 1/8.
3. The method for video motion recognition based on the hybrid convolution multistage feature fusion model according to claim 1, wherein: the third step specifically comprises the following steps:
firstly, the input of the multi-level feature fusion module is defined by collecting the convolutional-layer features of M different depths, expressed as:

$$F = \{F_1, F_2, \ldots, F_M\},$$

where $F_i \in \mathbb{R}^{T_i \times H_i \times W_i \times C_i}$ denotes the convolution feature derived from the network layer of depth level i, $i \in (1, M)$;

the spatio-temporal modulation process is as follows:

1) spatial modulation: for the network top-level feature $F_{top} \in \mathbb{R}^{T\times H\times W\times C}$, spatial modulation is equivalent to an identity mapping and the original size is retained; for the convolution features $F_i$ of the remaining network depths, a two-dimensional convolution operation with a specifically designed stride is used to reduce the spatial size of each level's feature so that it matches the network top-level feature, namely:

$$\tilde{F}_i = M_S(F_i),$$

where $M_S(\cdot)$ denotes the spatial modulation operation;

2) temporal modulation: the features updated by the spatial modulation operation are first re-expressed as $\tilde{F} = \{\tilde{F}_1, \tilde{F}_2, \ldots, \tilde{F}_M\}$ and then downsampled along the time dimension, where the downsampling factor is determined by a set of carefully designed hyper-parameters $\{\alpha_1, \alpha_2, \ldots, \alpha_M\}$, with $\alpha_i$ denoting the downsampling factor corresponding to the feature of depth level i; a downsampling operation is also performed on the channel dimension, with a downsampling factor determined by the number n of network layers participating in feature derivation, namely:

$$F_i' = M_T(\tilde{F}_i),$$

where $M_T(\cdot)$ denotes the temporal modulation operation.
4. The method for video motion recognition based on the hybrid convolution multistage feature fusion model according to claim 1, wherein: the feature fusion specifically includes:
define $F' = \{F_1', F_2', \ldots, F_M'\}$ as the convolution features after spatio-temporal modulation; for the features of different depth levels, feature aggregation is carried out with a bottom-up feature flow and a top-down feature flow;

for the bottom-up feature flow, starting from the top-level feature, the feature $F_i'$ of one level supplements the next-level feature $F_{i+1}'$ through element-level addition and a downsampling operation applied in sequence, namely:

$$F_{i+1}'' = F_{i+1}' \oplus g(F_i''),$$

where $F_{i+1}''$ denotes the feature after bottom-up flow aggregation, $\oplus$ denotes element-level addition, $g(\cdot)$ denotes a downsampling operation that keeps the dimensions of the features of different levels from conflicting during aggregation, and $T_i/T_{i-1}$ is the sampling factor;

for the top-down feature flow, starting from the bottom-level feature, the next-level feature $F_{i+1}'$ sequentially enriches the spatial semantic information of the upper-level feature $F_i'$, namely:

$$\hat{F}_i = F_i' \oplus f(\hat{F}_{i+1}),$$

where $\hat{F}_i$ is the feature after top-down flow aggregation, $f(\cdot)$ denotes an upsampling operation, and $T_i/T_{i-1}$ is the sampling factor;

the two feature flows are fused, that is, the two parallel feature flows are processed simultaneously to generate the final classification discrimination feature, and the classification prediction produced by the multi-level fused feature is then obtained with the Softmax function.
5. The method for video motion recognition based on the hybrid convolution multistage feature fusion model according to claim 1, wherein: the two-stage training strategy in the fifth step is specifically as follows: in the first stage, firstly, training is carried out on a backbone network, then parameters of a backbone network part are fixed, and a subsequent multi-stage feature fusion module is trained independently; and in the second stage, initializing a multi-stage feature fusion module by using the weight learned in the first stage, and performing joint training on the whole model through an end-to-end training paradigm.
6. A video motion recognition system based on hybrid convolution and multi-level feature fusion is characterized in that: the system comprises a hybrid convolution module, a time shift module, a multi-stage feature fusion module and a two-stage training strategy module;
the hybrid convolution module follows the basic architecture of a three-dimensional residual network: low-level spatial features are extracted by adopting a two-dimensional convolution operation in the bottom layer structure of the residual network, high-level space-time features are extracted by adopting a separable three-dimensional convolution operation in the top layer structure of the network, and a hybrid convolution network is thereby built, wherein the separable three-dimensional convolution operation means that a three-dimensional convolution with a convolution kernel size of t × h × w is decomposed along the space-time dimensions, so that a time convolution kernel with a size of t × 1 × 1 and a space convolution kernel with a size of 1 × h × w are obtained, and t, h and w respectively represent the time dimension, height and width of the convolution kernel;
the time shifting module is used for shifting partial channel information of the input features at all times along a time dimension to compensate the defect that the two-dimensional convolution lacks the dynamic feature extraction capability;
the multi-level feature fusion module is used for deriving multi-level complementary features from convolutional layers with different depths of a backbone network, then enabling each feature to have the same shape in a space dimension by utilizing a space modulation operation, capturing a dynamic change condition of a visual rhythm of an action example by utilizing a time modulation operation, and finally preparing high-quality classification distinguishing features through feature fusion;
the two-stage training strategy module is used for carrying out model training in stages, and limited video data are utilized to the maximum extent.
CN202110413461.1A 2021-04-16 2021-04-16 Video action recognition method and system based on hybrid convolution multistage feature fusion model Active CN113128395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110413461.1A CN113128395B (en) 2021-04-16 2021-04-16 Video action recognition method and system based on hybrid convolution multistage feature fusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110413461.1A CN113128395B (en) 2021-04-16 2021-04-16 Video action recognition method and system based on hybrid convolution multistage feature fusion model

Publications (2)

Publication Number Publication Date
CN113128395A CN113128395A (en) 2021-07-16
CN113128395B true CN113128395B (en) 2022-05-20

Family

ID=76777114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110413461.1A Active CN113128395B (en) 2021-04-16 2021-04-16 Video action recognition method and system based on hybrid convolution multistage feature fusion model

Country Status (1)

Country Link
CN (1) CN113128395B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113850368A (en) * 2021-09-08 2021-12-28 深圳供电局有限公司 Lightweight convolutional neural network model suitable for edge-end equipment
CN116612524A (en) * 2022-02-07 2023-08-18 北京字跳网络技术有限公司 Action recognition method and device, electronic equipment and storage medium
CN115083434B (en) * 2022-07-22 2022-11-25 平安银行股份有限公司 Emotion recognition method and device, computer equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689065A (en) * 2019-09-23 2020-01-14 云南电网有限责任公司电力科学研究院 Hyperspectral image classification method based on flat mixed convolution neural network
CN111062264A (en) * 2019-11-27 2020-04-24 重庆邮电大学 Document object classification method based on dual-channel hybrid convolution network
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
US10713493B1 (en) * 2020-02-06 2020-07-14 Shenzhen Malong Technologies Co., Ltd. 4D convolutional neural networks for video recognition
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution
CN112270246A (en) * 2020-10-23 2021-01-26 泰康保险集团股份有限公司 Video behavior identification method and device, storage medium and electronic equipment
CN112348222A (en) * 2020-05-08 2021-02-09 东南大学 Network coupling time sequence information flow prediction method based on causal logic and graph convolution feature extraction
CN112613349A (en) * 2020-12-04 2021-04-06 北京理工大学 Time sequence action detection method and device based on deep hybrid convolutional neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163133A (en) * 2019-05-10 2019-08-23 南京邮电大学 A kind of Human bodys' response method based on depth residual error network
CN111382677B (en) * 2020-02-25 2023-06-20 华南理工大学 Human behavior recognition method and system based on 3D attention residual error model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689065A (en) * 2019-09-23 2020-01-14 云南电网有限责任公司电力科学研究院 Hyperspectral image classification method based on flat mixed convolution neural network
CN111062264A (en) * 2019-11-27 2020-04-24 重庆邮电大学 Document object classification method based on dual-channel hybrid convolution network
US10713493B1 (en) * 2020-02-06 2020-07-14 Shenzhen Malong Technologies Co., Ltd. 4D convolutional neural networks for video recognition
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN112348222A (en) * 2020-05-08 2021-02-09 东南大学 Network coupling time sequence information flow prediction method based on causal logic and graph convolution feature extraction
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution
CN112270246A (en) * 2020-10-23 2021-01-26 泰康保险集团股份有限公司 Video behavior identification method and device, storage medium and electronic equipment
CN112613349A (en) * 2020-12-04 2021-04-06 北京理工大学 Time sequence action detection method and device based on deep hybrid convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Human Action Recognition Based on Spatio-temporal Features and Deep Learning; Li Xuyang; China Master's Theses Full-text Database (Information Science and Technology); 2017-03-15; I138-5818 *

Also Published As

Publication number Publication date
CN113128395A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN113128395B (en) Video action recognition method and system based on hybrid convolution multistage feature fusion model
Zhou et al. Contextual ensemble network for semantic segmentation
Mo et al. Review the state-of-the-art technologies of semantic segmentation based on deep learning
Yang et al. Small object augmentation of urban scenes for real-time semantic segmentation
Chandio et al. Precise single-stage detector
CN111325155A (en) Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN111402310A (en) Monocular image depth estimation method and system based on depth estimation network
CN113065450B (en) Human body action recognition method based on separable three-dimensional residual error attention network
CN112800937A (en) Intelligent face recognition method
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
Fu et al. Learning semantic-aware spatial-temporal attention for interpretable action recognition
Kumar Shukla et al. Comparative analysis of machine learning based approaches for face detection and recognition
Le et al. REDN: a recursive encoder-decoder network for edge detection
CN112819692A (en) Real-time arbitrary style migration method based on double attention modules
CN114511502A (en) Gastrointestinal endoscope image polyp detection system based on artificial intelligence, terminal and storage medium
CN113505719A (en) Gait recognition model compression system and method based on local-integral joint knowledge distillation algorithm
Lu et al. MFNet: Multi-feature fusion network for real-time semantic segmentation in road scenes
Yan et al. Weakly supervised regional and temporal learning for facial action unit recognition
Cheng et al. A survey on image semantic segmentation using deep learning techniques
Wani et al. Deep learning-based video action recognition: a review
Li et al. Progressive cross-domain knowledge distillation for efficient unsupervised domain adaptive object detection
Chong et al. Multi-hierarchy feature extraction and multi-step cost aggregation for stereo matching
CN114091583A (en) Salient object detection system and method based on attention mechanism and cross-modal fusion
Robert The Role of Deep Learning in Computer Vision
Xiao et al. Lightweight sea cucumber recognition network using improved YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant