CN115424175A - Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application - Google Patents

Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application

Info

Publication number
CN115424175A
CN115424175A (application CN202211053069.1A)
Authority
CN
China
Prior art keywords
convolution
hourglass
network
layer
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211053069.1A
Other languages
Chinese (zh)
Inventor
郝艳宾
谭懿
汪远
何向南
王硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202211053069.1A priority Critical patent/CN115424175A/en
Publication of CN115424175A publication Critical patent/CN115424175A/en
Pending legal-status Critical Current

Classifications

    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08: Neural network learning methods
    • G06V10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video motion classification method based on hierarchical dynamic modeling with hourglass convolution, and an application thereof. The method comprises the following steps: 1. extracting and preprocessing video data; 2. constructing a hierarchical hourglass convolutional network comprising a frame-level dynamic information capture network, a segment-level dynamic information capture network and a classification network; 3. constructing a cross-entropy loss function and training the hierarchical hourglass convolutional network to obtain a video action classifier for video action classification. The hourglass convolution enables better modeling of video dynamics, and the frame-level and segment-level dynamic information capture networks built on it model video dynamic information hierarchically from multiple levels, so that higher-precision human action video recognition can be achieved.

Description

Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application
Technical Field
The invention relates to the field of computer vision, in particular to a video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application thereof.
Background
The scale, position and viewing-angle patterns of visual cues (such as semantics and objects) in a video evolve along the time axis, and the discriminative motion patterns obtained by aggregating this dynamic change information are important for classifying video content. To capture these features, the following methods currently exist:
Using optical flow as external information can enhance the dynamic modeling of motion in a video. The most representative work is the two-stream network, which represents motion in the form of optical flow, feeds the static (RGB) and dynamic (optical flow) information into two independent convolutional neural networks, and then fuses the classification results of the two streams to obtain the final video classification result. Although the two-stream network is effective at learning dynamic features, acquiring optical flow and adding an extra convolutional neural network branch make it computationally expensive.
It was subsequently found that dynamic information can be modeled well by one-dimensional temporal convolution, which aggregates features at the same spatial location across adjacent times. In particular, a two-dimensional convolutional neural network gains temporal awareness when one-dimensional temporal convolution and two-dimensional spatial convolution are combined in a cascaded or parallel manner, so this paradigm is widely favored in network design for video classification tasks. However, networks designed under this paradigm have limited temporal modeling capability if the time dimension is not given particular attention. At the same time, the potentially large visual displacement between adjacent frames makes rigid one-dimensional temporal convolution poorly suited to capturing motion patterns. For example, the action "picking up a table tennis ball and placing it on a table" involves the interaction of core objects such as the hand and the ball. Over time, the spatial semantics of a single frame gradually change from "picking up the ball" to "lifting the ball in the air" and then "putting the ball on the table". During this process, the scale, position and appearance of the hand and the ball all change. Rigid one-dimensional temporal convolution only considers dynamic changes at the same spatial position across time and ignores such large changes, so when the target object moves out of the receptive field between adjacent frames, the core visual cues of the object are easily lost.
Attention strategies, which represent motion patterns through similarities between spatio-temporal positions, can also model dynamically changing information effectively. However, because pairwise similarity is computationally inefficient, they carry a heavy computational burden, as do the optical-flow-based approaches.
In summary, current technical means applied to video classification have many shortcomings, which lead to poor classification performance and low accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a video motion classification method based on hierarchical dynamic modeling with hourglass convolution, and an application thereof, so that the hourglass convolution can better model video dynamics, and the frame-level and segment-level dynamic information capture networks built on it model video dynamic information hierarchically from multiple levels, thereby improving the accuracy of human action video recognition.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to a video motion classification method based on hierarchical dynamic modeling of hourglass convolution, which is characterized by comprising the following steps of:
Step 1, video data extraction and preprocessing:
uniformly sampling T key frame images from the human action video V at a fixed frame interval, denoted as F = [F_1, F_2, …, F_t, …, F_T], where F_t represents the t-th key frame and T represents the number of key frames;
sampling the two consecutive frames before and the two consecutive frames after the t-th key frame F_t in the human action video V, and denoting F_t together with these frames as the t-th segment C_t = [F_t^{-2}, F_t^{-1}, F_t, F_t^{+1}, F_t^{+2}], where F_t^{-2} denotes the frame two positions before F_t, F_t^{-1} the frame immediately before F_t, F_t^{+1} the frame immediately after F_t, and F_t^{+2} the frame two positions after F_t;
scaling the resolution of each frame of the t-th segment C_t, cropping an image block of resolution H×W from each frame, and applying normalization preprocessing to obtain the t-th input video data tensor C'_t, thereby obtaining the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V, where H and W respectively denote the height and width of C'_t and D denotes the number of channels of C'_t;
Step 2, constructing a hierarchical hourglass convolution network comprising: a frame-level dynamic information capture network, a segment-level dynamic information capture network and a classification network;
step 2.1, constructing hourglass convolution:
the hourglass convolution consists of a group of spatial convolutions with kernel size (p·|i|+1)×(p·|i|+1) and a temporal convolution with kernel size K, where p is a parameter and i is the time offset;
the hourglass convolution processes any input tensor X of dimension T'×H'×W'×D' to obtain an output feature HgC(X), where T' denotes the time dimension, H' the height, W' the width and D' the number of channels; the t-th feature HgC(X)_t of the output feature HgC(X) is obtained by formula (1):

HgC(X)_t = Σ_{i=-⌊K/2⌋..⌊K/2⌋} W^T_i · f(X_{t+i}; W_{p·|i|+1, p·|i|+1})    (1)

In formula (1), X_{t+i} is the (t+i)-th input feature of tensor X along the time dimension, W^T_i is the i-th parameter of the temporal convolution layer, f is the spatial convolution function, and W_{p·|i|+1, p·|i|+1} is the parameter of the spatial convolution layer; t ∈ [0, T'-1];
Step 2.2, the frame-level dynamic information capture network consists of the first convolution block of a ResNet50 network and a frame-level dynamic information capture module:
the first convolution block of the ResNet50 network is a spatial convolution with an a×a kernel;
the frame-level dynamic information capture module consists of a down-sampling layer, an hourglass convolution layer, a spatial convolution layer and an up-sampling layer:
the down-sampling layer is a spatial average pooling layer with kernel size b×b; the hourglass convolution layer consists of two serially connected hourglass convolutions; the spatial convolution layer is a spatial convolution with an a×a kernel; the up-sampling layer performs an up-sampling operation that copies each pixel into four adjacent pixels;
the key frame images F = [F_1, F_2, …, F_t, …, F_T] of the human action video V are input into the first convolution block of the ResNet50 network for processing, obtaining an output feature F_S;
the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V is input into the frame-level dynamic information capture module and processed sequentially by the down-sampling layer, the hourglass convolution layer, the spatial convolution layer and the up-sampling layer, obtaining an output feature F_fm;
F_S and F_fm are added to obtain the output M_fm of the frame-level dynamic information capture network;
Step 2.3, the segment-level dynamic information capture network consists of four serially connected convolution blocks; each convolution block consists of serially connected repeating units, and different convolution blocks contain different numbers of repeating units;
the repeating unit consists of a residual block and a segment-level dynamic information capture module; the residual block comprises two convolution layers with 1×1×1 kernels and one convolution layer with a 3×3 kernel; the segment-level dynamic information capture module comprises two 1×1×1 convolution layers, an hourglass convolution, a global average pooling layer and a Sigmoid activation function layer;
M_fm is input into the first 1×1×1 convolution layer of the first repeating unit in the first convolution block of the segment-level dynamic information capture network to obtain a feature Y; Y is input into the segment-level dynamic information capture module and processed sequentially by its first 1×1×1 convolution layer, the hourglass convolution layer, the global average pooling layer, the second 1×1×1 convolution layer and the Sigmoid activation function layer to obtain a feature A; A is multiplied with Y, and the product is input into the residual block of the first repeating unit of the first convolution block and processed sequentially by the 3×3 convolution layer and the second 1×1×1 convolution layer to obtain the output Z' of the first repeating unit of the first convolution block;
Z' is then input into the second repeating unit of the first convolution block, the result of the same processing is passed to the next repeating unit, and so on, so that the results of all repeating units in the first convolution block are input into the next convolution block for processing; the last repeating unit of the fourth convolution block finally produces the output Z of the hierarchical hourglass convolution network;
Step 3, the classification network is formed by connecting a global average pooling layer and a fully connected layer in series; Z is input into the classification network for processing to obtain the final action category;
Step 4, constructing a cross-entropy loss function as the loss function L of the hierarchical hourglass convolutional network, training the hierarchical hourglass convolutional network with an SGD optimizer, and computing the loss function L to adjust the network parameters, finally obtaining the trained hierarchical hourglass convolutional network as a video action classifier for video action classification.
The electronic device comprises a memory and a processor, and is characterized in that the memory is used for storing a program for supporting the processor to execute the video action classification method, and the processor is configured to execute the program stored in the memory.
The invention relates to a computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the video motion classification method.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention proposes a novel temporal convolution, the Hourglass Convolution (HgC), and constructs a video action recognition network hierarchically based on it, which effectively alleviates the target loss caused by visual displacement between different moments of a video and improves the accuracy of human action video recognition.
2. The hourglass convolution proposed by the invention has an hourglass-like receptive field: the spatial receptive field is enlarged at the preceding and following time points, so large visual displacements can be captured, which improves the spatio-temporal dynamic modeling capability of the hourglass convolution and ultimately the recognition accuracy of human action videos.
3. The human action video classification network (H2CN) constructed hierarchically from hourglass convolutions mines spatio-temporal dynamic information at two levels simultaneously, between adjacent frames and between adjacent segments, providing the network with rich spatio-temporal dynamic information and further improving its recognition accuracy on human action videos.
Drawings
FIG. 1 is a flow chart of a video classification method according to an embodiment of the present invention;
FIG. 2 is a schematic illustration of an hourglass convolution in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a video motion classification network based on a hierarchical dynamic modeling of hourglass convolution according to an embodiment of the present invention;
FIG. 4 is a diagram of a frame-level motion information capture network in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a segment-level dynamic information capture network according to an embodiment of the present invention.
Detailed Description
In this embodiment, as shown in fig. 1, a video motion classification method based on a hierarchical dynamic modeling of hourglass convolution is performed according to the following steps:
step 1, video data extraction and preprocessing:
uniformly sampling T key frame images from the human action video V at a fixed frame interval, denoted as F = [F_1, F_2, …, F_t, …, F_T], where F_t represents the t-th key frame and T represents the number of key frames, typically 8, 16 or 32;
sampling the two consecutive frames before and the two consecutive frames after the t-th key frame F_t in the human action video V, and denoting F_t together with these frames as the t-th segment C_t = [F_t^{-2}, F_t^{-1}, F_t, F_t^{+1}, F_t^{+2}], where F_t^{-2} denotes the frame two positions before F_t, F_t^{-1} the frame immediately before F_t, F_t^{+1} the frame immediately after F_t, and F_t^{+2} the frame two positions after F_t;
scaling the resolution of each frame of the t-th segment C_t, cropping an image block of resolution H×W from each frame, and applying normalization preprocessing to obtain the t-th input video data tensor C'_t, thereby obtaining the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V, where H and W respectively denote the height and width of C'_t and can each be taken as 224 to balance recognition accuracy and computational efficiency, and D denotes the number of channels of C'_t, which is 3 for the widely used RGB images;
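For illustration, a minimal Python sketch of step 1 is given below. The function name extract_segments, the center-crop strategy and the simple normalization to [-1, 1] are assumptions of this sketch rather than limitations of the invention, and frame scaling is assumed to have been performed during decoding:

import numpy as np

def extract_segments(video, T=8, crop=224):
    # video: decoded frames as a float array of shape (num_frames, H0, W0, 3)
    num_frames = video.shape[0]
    # Uniformly spaced key-frame indices F_1 ... F_T.
    key_idx = np.linspace(0, num_frames - 1, T).astype(int)
    segments = []
    for t in key_idx:
        # The two frames before and after the key frame (clamped at the borders).
        idx = np.clip(np.array([t - 2, t - 1, t, t + 1, t + 2]), 0, num_frames - 1)
        seg = video[idx]                                    # (5, H0, W0, 3)
        # Take an image block of resolution crop x crop from each frame (center crop here).
        h0, w0 = seg.shape[1], seg.shape[2]
        y, x = (h0 - crop) // 2, (w0 - crop) // 2
        seg = seg[:, y:y + crop, x:x + crop, :]
        # Simple normalization to [-1, 1] (per-channel mean/std statistics could be used instead).
        seg = seg / 255.0 * 2.0 - 1.0
        segments.append(seg)
    return np.stack(segments)                               # C': (T, 5, crop, crop, 3)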
step 2, as shown in fig. 3, constructing a hierarchical hourglass convolution network comprising: a frame-level dynamic information capture network, a segment-level dynamic information capture network and a classification network;
step 2.1, constructing hourglass convolution:
the hourglass convolution is made up of a set of spatial convolutions with kernel size (p·|i|+1)×(p·|i|+1) and a temporal convolution with kernel size K, where i is the time offset and p is the slope with which the receptive field grows as the time offset increases. For example, when K=3 and p=2, the spatial convolution kernel sizes corresponding to frames {t-1, t, t+1} are {(3,3), (1,1), (3,3)}.
The hourglass convolution processes any tensor X of dimension T'×H'×W'×D', where T' denotes the time dimension, H' the height, W' the width and D' the number of channels, and obtains the output feature HgC(X) as follows: the spatial convolution kernels described above first aggregate the spatial information of the frames at the corresponding time offsets, and the temporal convolution then aggregates information along the time axis. The t-th feature HgC(X)_t of the output feature HgC(X) is obtained as in formula (1):

HgC(X)_t = Σ_{i=-⌊K/2⌋..⌊K/2⌋} W^T_i · f(X_{t+i}; W_{p·|i|+1, p·|i|+1})    (1)

In formula (1), X_{t+i} is the (t+i)-th input feature of tensor X along the time dimension, W^T_i is the i-th parameter of the temporal convolution layer, f is the spatial convolution function, and W_{p·|i|+1, p·|i|+1} is the parameter of the spatial convolution layer; t ∈ [0, T'-1]. Compared with traditional video action recognition methods, the hourglass convolution additionally uses spatial convolution to first aggregate the spatial information of frames at different time offsets, giving it an hourglass-shaped receptive field (as shown in fig. 2). This helps the hourglass convolution aggregate spatio-temporal information at other time offsets that would otherwise be hard to aggregate because of visual displacement. Therefore, compared with the temporal convolution widely used in traditional methods, the hourglass convolution can capture spatio-temporal information that temporal convolution cannot, and better fits the spatio-temporal dynamics of video data.
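For concreteness, the following PyTorch sketch shows one way formula (1) could be implemented. The class name HourglassConv, the depthwise form of the spatial convolutions and the zero padding at the temporal borders are assumptions of this sketch and not details fixed by the invention:

import torch
import torch.nn as nn

class HourglassConv(nn.Module):
    # Hourglass convolution: offset-dependent spatial convolution followed by
    # temporal aggregation, as in formula (1).
    def __init__(self, channels, K=3, p=2):
        super().__init__()
        self.offsets = list(range(-(K // 2), K // 2 + 1))      # e.g. [-1, 0, 1] for K = 3
        # Spatial convolution f(.; W_{p|i|+1, p|i|+1}) for each time offset i
        # (depthwise here to keep the parameter count low).
        self.spatial = nn.ModuleList([
            nn.Conv2d(channels, channels,
                      kernel_size=p * abs(i) + 1,
                      padding=(p * abs(i)) // 2,
                      groups=channels, bias=False)
            for i in self.offsets])
        # Temporal weights W^T_i, one per offset and channel.
        self.temporal = nn.Parameter(torch.randn(len(self.offsets), channels) * 0.01)

    def forward(self, x):
        # x: (N, D', T', H', W')
        n, c, t, h, w = x.shape
        out = torch.zeros_like(x)
        for idx, i in enumerate(self.offsets):
            # Spatially convolve every frame with the kernel of size p*|i|+1.
            frames = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
            feat = self.spatial[idx](frames).reshape(n, t, c, h, w).permute(0, 2, 1, 3, 4)
            # Shift by i along the time axis (zero padding at the sequence borders),
            # so that position t receives the feature of frame t + i.
            shifted = torch.zeros_like(feat)
            if i == 0:
                shifted = feat
            elif i > 0:
                shifted[:, :, :-i] = feat[:, :, i:]
            else:
                shifted[:, :, -i:] = feat[:, :, :i]
            # Weight by W^T_i and accumulate over offsets.
            out = out + shifted * self.temporal[idx].view(1, c, 1, 1, 1)
        return out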
Step 2.2, the frame-level dynamic information capture network consists of the first convolution block of a ResNet50 network and a frame-level dynamic information capture module:
the first convolution block of the ResNet50 network is a spatial convolution with an a×a kernel, where a typically takes the value 7;
the frame-level dynamic information capture module consists of a down-sampling layer, an hourglass convolution layer, a spatial convolution layer and an up-sampling layer:
the down-sampling layer is a spatial average pooling layer with kernel size b×b; to balance recognition accuracy and computational efficiency, the classical value of b in the invention is 2; the hourglass convolution layer consists of two serially connected hourglass convolutions; the spatial convolution layer is a spatial convolution with an a×a kernel; the up-sampling layer performs an up-sampling operation that copies each pixel into four adjacent pixels;
the key frame images F = [F_1, F_2, …, F_t, …, F_T] of the human action video V are input into the first convolution block of the ResNet50 network for processing, obtaining an output feature F_S;
The process of obtaining frame-level dynamic information is shown in fig. 4: the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V is input into the frame-level dynamic information capture module and processed sequentially by the down-sampling layer, the hourglass convolution layer, the spatial convolution layer and the up-sampling layer to obtain the output feature F_fm. The invention uses the down-sampling layer to reduce the resolution of the input video data before the hourglass convolution layer, thereby reducing computation, and uses the up-sampling layer to restore the resolution after the hourglass convolution layer, so that subsequent computation is not affected.
F_S and F_fm are then added to obtain the output M_fm of the frame-level dynamic information capture network. Traditional methods obtain only F_S at this stage and therefore lack frame-level dynamic information, so the method proposed by the invention achieves stronger recognition accuracy.
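A minimal sketch of the frame-level dynamic information capture module follows, reusing the HourglassConv sketch given above. The channel counts, the stride used to match the shape of F_S, and the folding of the five segment frames into the time axis are assumptions of this sketch; a = 7 and b = 2 are the embodiment's values:

import torch.nn as nn

class FrameLevelCaptureModule(nn.Module):
    # Down-sampling -> two serial hourglass convolutions -> spatial convolution -> up-sampling.
    def __init__(self, in_ch=3, out_ch=64, a=7, b=2):
        super().__init__()
        self.down = nn.AvgPool3d(kernel_size=(1, b, b))                 # spatial average pooling (b x b)
        self.hgc = nn.Sequential(HourglassConv(in_ch), HourglassConv(in_ch))
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, a, a),
                                 stride=(1, 2, 2), padding=(0, a // 2, a // 2), bias=False)
        self.up = nn.Upsample(scale_factor=(1, 2, 2), mode='nearest')   # copy one pixel into four neighbours

    def forward(self, c):
        # c: (N, D, T*5, H, W), the five segment frames folded into the time axis (an assumption)
        x = self.down(c)      # reduce the spatial resolution before the costly hourglass layers
        x = self.hgc(x)
        x = self.spatial(x)
        return self.up(x)     # F_fm, added element-wise to F_S from the first ResNet50 block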
Step 2.3, the segment-level dynamic information capture network consists of four serially connected convolution blocks; each convolution block consists of serially connected repeating units, and different convolution blocks contain different numbers of repeating units;
the repeating unit consists of a residual block and a segment-level dynamic information capture module; the residual block comprises two convolution layers with 1×1×1 kernels and one convolution layer with a 3×3 kernel; the segment-level dynamic information capture module comprises two 1×1×1 convolution layers, an hourglass convolution, a global average pooling layer and a Sigmoid activation function layer;
the process of obtaining fragment-level dynamic information is shown in fig. 5: will M fm Inputting the characteristic Y into a segment-level dynamic information capture module after inputting the characteristic Y into a first 1 x 1 convolutional layer of a first repeating unit in a first convolutional block of a segment-level dynamic information capture network, and sequentially passing the first 1 x 1 convolutional layer, an hourglass convolutional layer, a global average pooling layer, a second 1 x 1 convolutional layer and the second 1 x 1 convolutional layerObtaining a characteristic A after the Sigmoid activation function layer is processed, multiplying the A and the Y, inputting the multiplied A and Y into a residual block of a first repeating unit in a first volume block, and obtaining an output Z' of the first repeating unit of the first volume block after the processing of a 3 x 3 volume layer and a second 1 x 1 volume layer in sequence; in the process, the hourglass convolution layer which needs to expend extra calculation amount is placed between two 1 × 1 × 1 convolution layers, the channel dimension reduction is carried out by using the first 1 × 1 × 1 convolution layer, the consumption of calculation resources is reduced, and then the second 1 × 1 × 1 convolution layer is restored to the channel dimension. The traditional network uses time convolution to carry out segment-level dynamic information modeling, and the invention can capture space-time information which cannot be captured by the time convolution by utilizing hourglass convolution. Meanwhile, by modeling the dynamic information at the frame level and the fragment level, the invention models the spatio-temporal dynamic information in the video data in a layering manner, so that compared with the traditional method, the method provided by the invention has higher identification precision.
Z' is then input into the second repeating unit of the first convolution block, the result of the same processing is passed to the next repeating unit, and so on, so that the results of all repeating units in the first convolution block are input into the next convolution block for processing; the last repeating unit of the fourth convolution block finally produces the output Z of the hierarchical hourglass convolution network;
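The following sketch illustrates one repeating unit, again reusing the HourglassConv sketch given above. The residual shortcut, the channel reduction ratio inside the capture branch and the exact channel counts are assumptions of this sketch:

import torch
import torch.nn as nn

class SegmentLevelUnit(nn.Module):
    # One repeating unit: first 1x1x1 convolution -> segment-level dynamic
    # information capture branch (1x1x1 conv, hourglass conv, global average
    # pooling, 1x1x1 conv, Sigmoid) -> channel attention A multiplied with Y
    # -> 3x3 convolution -> second 1x1x1 convolution.
    def __init__(self, in_ch, mid_ch, reduction=4):
        super().__init__()
        self.reduce = nn.Conv3d(in_ch, mid_ch, kernel_size=1, bias=False)          # first 1x1x1 conv
        self.cap_reduce = nn.Conv3d(mid_ch, mid_ch // reduction, kernel_size=1, bias=False)
        self.cap_hgc = HourglassConv(mid_ch // reduction)
        self.cap_pool = nn.AdaptiveAvgPool3d(1)                                     # global average pooling
        self.cap_expand = nn.Conv3d(mid_ch // reduction, mid_ch, kernel_size=1, bias=False)
        self.conv3 = nn.Conv3d(mid_ch, mid_ch, kernel_size=(1, 3, 3),
                               padding=(0, 1, 1), bias=False)                        # 3x3 spatial convolution
        self.expand = nn.Conv3d(mid_ch, in_ch, kernel_size=1, bias=False)            # second 1x1x1 conv

    def forward(self, m):
        y = self.reduce(m)                                                           # feature Y
        a = torch.sigmoid(self.cap_expand(
            self.cap_pool(self.cap_hgc(self.cap_reduce(y)))))                        # feature A
        z = self.expand(self.conv3(a * y))
        return z + m                                                                 # residual shortcut (assumed)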
step 3, the classification network is formed by connecting a global average pooling layer and a fully connected layer in series; Z is input into the classification network for processing to obtain the final action category;
step 4, constructing a cross-entropy loss function as the loss function L of the hierarchical hourglass convolutional network, training the hierarchical hourglass convolutional network with an SGD optimizer while computing the loss function L to adjust the network parameters, and finally obtaining the trained hierarchical hourglass convolutional network as a video action classifier for video action classification.
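A minimal training-loop sketch for step 4 follows; the wrapper class H2CN (assumed to combine the networks of step 2 with the classifier of step 3), the number of classes, the data loader and the optimizer hyper-parameters are illustrative assumptions:

import torch
import torch.nn as nn

model = H2CN(num_classes=174)        # assumed wrapper around the hierarchical hourglass convolution network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()    # loss function L

model.train()
for epoch in range(50):
    for segments, labels in train_loader:     # segments: preprocessed tensors C'; labels: action categories
        logits = model(segments)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                      # adjust the network parameters according to L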
In this embodiment, an electronic device includes a memory for storing a program that enables the processor to execute the video action classification method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program, and the computer program is executed by a processor to perform the steps of the video motion classification method.
To demonstrate the effectiveness of the present invention, the following experiments were performed for verification.
1) The hourglass convolution was inserted into a ResNet network, named HgC-ResNet, and compared on Something-Something V1 with TSN, which has no temporal convolution, and with R(2+1)D, which uses ordinary temporal convolution; the results are shown in Table 1.
Table 1. Performance comparison of the hourglass convolution with R(2+1)D and TSN
Method       Top-1   #P      FLOPS
TSN          19.7    23.9M   32.9G
R(2+1)D      46.0    23.9M   32.9G
HgC-ResNet   47.0    23.9M   33.1G
As can be observed from Table 1, both temporal convolution (R(2+1)D) and hourglass convolution (HgC-ResNet) significantly improve the performance of the two-dimensional convolutional neural network backbone (TSN), while HgC-ResNet exceeds R(2+1)D by a notable margin (1%) at nearly the same computational cost. This comparison mainly demonstrates the strong capability of the hourglass convolution in video motion modeling.
2) On Something-Something V1 & V2, the video motion classification method based on hierarchical dynamic modeling with hourglass convolution proposed by the invention (H2CN) was compared with other state-of-the-art action recognition models; the results are shown in Table 2.
Table 2. Performance comparison of H2CN and other models on Something-Something V1 & V2
Method BackBone #Pretrain Something V1 Something V2
GST ResNet-50 ImageNet 47.0 61.6
TSM+TPN ResNet-50 ImageNet 49.0 62.0
TEINeT ResNet-50 ImageNet 47.4 61.3
TAM ResNet-50 ImageNet 46.5 60.5
STM ResNet-50 ImageNet 49.2 62.3
TDN ResNet-50 ImageNet 52.3 64.0
SELFYNeT ResNet-50 ImageNet 52.5 64.5
SmallBig ResNet-50 ImageNet 48.3 61.6
TimeSformer-HR Transformer Kinetics -- 62.5
ECO ResNet-18 Kinetics 39.6 --
I3D 3DResNet-50 ImageNet 41.6 --
H2CN ResNet-50 ImageNet 53.6 65.2
As shown in Table 2, H2CN is compared with convolutional neural network based architectures, including classical methods such as I3D, GST and TSM, and more recent methods such as TDN and SELFYNet. H2CN achieves Top-1 accuracies of 53.6% and 65.2% on Something-Something V1 and V2, respectively, showing a significant advantage over the convolutional neural network based methods. These results demonstrate the ability of H2CN to capture a variety of dynamic information. Compared with more sophisticated Transformer-based methods such as TimeSformer-HR, the performance of H2CN remains competitive.
3) On Diving48, the action recognition accuracy of the invention was compared with other state-of-the-art action recognition models; the results are shown in Table 3.
Table 3. Performance comparison of H2CN with other state-of-the-art models on Diving48 (table provided only as an image in the original document).
From Table 3 it can be seen that, compared with the convolutional neural network baselines, H2CN achieves the best performance of 87.0%. More importantly, H2CN outperforms VIMPAC (85.5%), the best Transformer-based method.

Claims (3)

1. A video motion classification method based on hierarchical dynamic modeling of hourglass convolution is characterized by comprising the following steps:
step 1, video data extraction and preprocessing:
uniformly sampling T key frame images from the human action video V at a fixed frame interval, denoted as F = [F_1, F_2, …, F_t, …, F_T], where F_t represents the t-th key frame and T represents the number of key frames;
sampling the two consecutive frames before and the two consecutive frames after the t-th key frame F_t in the human action video V, and denoting F_t together with these frames as the t-th segment C_t = [F_t^{-2}, F_t^{-1}, F_t, F_t^{+1}, F_t^{+2}], where F_t^{-2} denotes the frame two positions before F_t, F_t^{-1} the frame immediately before F_t, F_t^{+1} the frame immediately after F_t, and F_t^{+2} the frame two positions after F_t;
scaling the resolution of each frame of the t-th segment C_t, cropping an image block of resolution H×W from each frame, and applying normalization preprocessing to obtain the t-th input video data tensor C'_t, thereby obtaining the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V, where H and W respectively denote the height and width of C'_t and D denotes the number of channels of C'_t;
Step 2, constructing a hierarchical hourglass convolution network comprising: a frame-level dynamic information capture network, a segment-level dynamic information capture network and a classification network;
step 2.1, constructing hourglass convolution:
the hourglass convolution consists of a group of spatial convolutions with kernel size (p·|i|+1)×(p·|i|+1) and a temporal convolution with kernel size K, where p is a parameter and i is the time offset;
the hourglass convolution processes any input tensor X of dimension T'×H'×W'×D' to obtain an output feature HgC(X), where T' denotes the time dimension, H' the height, W' the width and D' the number of channels; the t-th feature HgC(X)_t of the output feature HgC(X) is obtained by formula (1):

HgC(X)_t = Σ_{i=-⌊K/2⌋..⌊K/2⌋} W^T_i · f(X_{t+i}; W_{p·|i|+1, p·|i|+1})    (1)

In formula (1), X_{t+i} is the (t+i)-th input feature of tensor X along the time dimension, W^T_i is the i-th parameter of the temporal convolution layer, f is the spatial convolution function, and W_{p·|i|+1, p·|i|+1} is the parameter of the spatial convolution layer; t ∈ [0, T'-1];
Step 2.2, the frame-level dynamic information capture network consists of the first convolution block of a ResNet50 network and a frame-level dynamic information capture module:
the first convolution block of the ResNet50 network is a spatial convolution with an a×a kernel;
the frame-level dynamic information capture module consists of a down-sampling layer, an hourglass convolution layer, a spatial convolution layer and an up-sampling layer:
the down-sampling layer is a spatial average pooling layer with kernel size b×b; the hourglass convolution layer consists of two serially connected hourglass convolutions; the spatial convolution layer is a spatial convolution with an a×a kernel; the up-sampling layer performs an up-sampling operation that copies each pixel into four adjacent pixels;
the key frame images F = [F_1, F_2, …, F_t, …, F_T] of the human action video V are input into the first convolution block of the ResNet50 network for processing, obtaining an output feature F_S;
the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V is input into the frame-level dynamic information capture module and processed sequentially by the down-sampling layer, the hourglass convolution layer, the spatial convolution layer and the up-sampling layer, obtaining an output feature F_fm;
F_S and F_fm are added to obtain the output M_fm of the frame-level dynamic information capture network;
Step 2.3, the segment-level dynamic information capture network consists of four serially connected convolution blocks; each convolution block consists of serially connected repeating units, and different convolution blocks contain different numbers of repeating units;
the repeating unit consists of a residual block and a segment-level dynamic information capture module; the residual block comprises two convolution layers with 1×1×1 kernels and one convolution layer with a 3×3 kernel; the segment-level dynamic information capture module comprises two 1×1×1 convolution layers, an hourglass convolution, a global average pooling layer and a Sigmoid activation function layer;
M_fm is input into the first 1×1×1 convolution layer of the first repeating unit in the first convolution block of the segment-level dynamic information capture network to obtain a feature Y; Y is input into the segment-level dynamic information capture module and processed sequentially by its first 1×1×1 convolution layer, the hourglass convolution layer, the global average pooling layer, the second 1×1×1 convolution layer and the Sigmoid activation function layer to obtain a feature A; A is multiplied with Y, and the product is input into the residual block of the first repeating unit of the first convolution block and processed sequentially by the 3×3 convolution layer and the second 1×1×1 convolution layer to obtain the output Z' of the first repeating unit of the first convolution block;
Z' is then input into the second repeating unit of the first convolution block, the result of the same processing is passed to the next repeating unit, and so on, so that the results of all repeating units in the first convolution block are input into the next convolution block for processing; the last repeating unit of the fourth convolution block finally produces the output Z of the hierarchical hourglass convolution network;
Step 3, the classification network is formed by connecting a global average pooling layer and a fully connected layer in series; Z is input into the classification network for processing to obtain the final action category;
Step 4, constructing a cross-entropy loss function as the loss function L of the hierarchical hourglass convolutional network, training the hierarchical hourglass convolutional network with an SGD optimizer, and computing the loss function L to adjust the network parameters, finally obtaining the trained hierarchical hourglass convolutional network as a video action classifier for video action classification.
2. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that enables the processor to perform the video action classification method of claim 1, and the processor is configured to execute the program stored in the memory.
3. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the video action classification method according to claim 1.
CN202211053069.1A 2022-08-31 2022-08-31 Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application Pending CN115424175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211053069.1A CN115424175A (en) 2022-08-31 2022-08-31 Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211053069.1A CN115424175A (en) 2022-08-31 2022-08-31 Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application

Publications (1)

Publication Number Publication Date
CN115424175A true CN115424175A (en) 2022-12-02

Family

ID=84200282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211053069.1A Pending CN115424175A (en) 2022-08-31 2022-08-31 Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application

Country Status (1)

Country Link
CN (1) CN115424175A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination