CN115424175A - Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application - Google Patents

Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application

Info

Publication number
CN115424175A
CN115424175A (application CN202211053069.1A)
Authority
CN
China
Prior art keywords
convolution
hourglass
network
layer
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211053069.1A
Other languages
Chinese (zh)
Inventor
郝艳宾
谭懿
汪远
何向南
王硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202211053069.1A priority Critical patent/CN115424175A/en
Publication of CN115424175A publication Critical patent/CN115424175A/en
Pending legal-status Critical Current

Classifications

    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08: Neural network learning methods
    • G06V10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video motion classification method based on hierarchical dynamic modeling with hourglass convolution, and an application thereof. The method comprises the following steps: 1. extracting and preprocessing video data; 2. constructing a hierarchical hourglass convolutional network comprising a frame-level dynamic information capture network, a segment-level dynamic information capture network and a classification network; 3. constructing a cross-entropy loss function and training the hierarchical hourglass convolutional network to obtain a video action classifier for video action classification. The hourglass convolution enables better modeling of video dynamics, and the frame-level and segment-level dynamic information capture networks built on it model video dynamic information hierarchically from multiple levels, so that higher-precision human action video recognition can be achieved.

Description

Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application
Technical Field
The invention relates to the field of computer vision, in particular to a video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application thereof.
Background
The scale, position and viewing-angle patterns of visual cues (such as semantics and objects) in a video evolve along the time axis, and the discriminative motion patterns obtained by aggregating this dynamic change information are important for classifying video content. To capture these features, the following methods currently exist:
Using optical flow as external information can enhance the dynamic modeling of motion in a video. The most representative work is the two-stream network, which represents motion in the form of optical flow, feeds the static (RGB) and dynamic (optical flow) information into two independent convolutional neural networks, and then fuses the classification results of the two streams to obtain the final video classification result. Although the two-stream network is effective at learning dynamic features, acquiring optical flow and adding an extra convolutional neural network branch make it computationally expensive.
It was subsequently found that dynamic information can be modeled well by one-dimensional temporal convolution, which aggregates features at the same spatial location across adjacent times. In particular, a two-dimensional convolutional neural network gains temporal awareness when one-dimensional temporal convolution and two-dimensional spatial convolution are combined in a cascaded or parallel manner, so this paradigm is widely favored in network design for video classification tasks. However, networks designed under this paradigm have limited temporal modeling capability if the time dimension is not given particular attention. At the same time, the potentially large visual displacement between adjacent frames makes rigid one-dimensional temporal convolution poorly suited to capturing motion patterns. For example, the action "picking up a table tennis ball and placing it on a table" involves the interaction of core objects such as the hand and the ball. Over time, the spatial semantics of a single frame gradually change from "picking up the ball" to "lifting the ball in the air" and then "putting the ball on the table". During this process, the scale, position and appearance of the hand and the ball all change. Rigid one-dimensional temporal convolution only considers dynamic changes at the same spatial position across time and ignores such large changes, so when the target object moves out of the receptive field between adjacent frames, the core visual cues of the object are easily lost.
Attention strategies, which represent motion patterns through similarities between spatio-temporal positions, can also model dynamically changing information effectively. However, because pairwise similarity is computationally inefficient, they carry a heavy computational burden, as do the optical-flow-based approaches.
In summary, current technical means applied to video classification have many shortcomings, which lead to poor classification performance and low accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a video motion classification method based on hierarchical dynamic modeling with hourglass convolution, and an application thereof, so that the hourglass convolution can better model video dynamics, and the frame-level and segment-level dynamic information capture networks built on it model video dynamic information hierarchically from multiple levels, thereby improving the accuracy of human action video recognition.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to a video motion classification method based on hierarchical dynamic modeling of hourglass convolution, which is characterized by comprising the following steps of:
Step 1, video data extraction and preprocessing:
uniformly sampling T key frame images from the human action video V at a fixed frame interval, denoted as F = [F_1, F_2, …, F_t, …, F_T], where F_t represents the t-th key frame and T represents the number of key frames;
sampling the two consecutive frames before and the two consecutive frames after the t-th key frame F_t in the human action video V, and denoting F_t together with these frames as the t-th segment C_t = [F_t^{-2}, F_t^{-1}, F_t, F_t^{+1}, F_t^{+2}], where F_t^{-2} denotes the frame two positions before F_t, F_t^{-1} the frame immediately before F_t, F_t^{+1} the frame immediately after F_t, and F_t^{+2} the frame two positions after F_t;
scaling the resolution of each frame of the t-th segment C_t, cropping an image block of resolution H×W from each frame, and applying normalization preprocessing to obtain the t-th input video data tensor C'_t, thereby obtaining the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V, where H and W respectively denote the height and width of C'_t and D denotes the number of channels of C'_t;
Step 2, constructing a hierarchical hourglass convolution network comprising: a frame-level dynamic information capture network, a segment-level dynamic information capture network and a classification network;
step 2.1, constructing hourglass convolution:
the hourglass convolution consists of a group of spatial convolutions with kernel size (p·|i|+1)×(p·|i|+1) and a temporal convolution with kernel size K, where p is a parameter and i is the time offset;
the hourglass convolution processes any input tensor X of dimension T'×H'×W'×D' to obtain an output feature HgC(X), where T' denotes the time dimension, H' the height, W' the width and D' the number of channels; the t-th feature HgC(X)_t of the output feature HgC(X) is obtained by formula (1):

HgC(X)_t = Σ_{i=-⌊K/2⌋..⌊K/2⌋} W^T_i · f(X_{t+i}; W_{p·|i|+1, p·|i|+1})    (1)

In formula (1), X_{t+i} is the (t+i)-th input feature of tensor X along the time dimension, W^T_i is the i-th parameter of the temporal convolution layer, f is the spatial convolution function, and W_{p·|i|+1, p·|i|+1} is the parameter of the spatial convolution layer; t ∈ [0, T'-1];
Step 2.2, the frame-level dynamic information capture network consists of the first convolution block of a ResNet50 network and a frame-level dynamic information capture module:
the first convolution block of the ResNet50 network is a spatial convolution with an a×a kernel;
the frame-level dynamic information capture module consists of a down-sampling layer, an hourglass convolution layer, a spatial convolution layer and an up-sampling layer:
the down-sampling layer is a spatial average pooling layer with kernel size b×b; the hourglass convolution layer consists of two serially connected hourglass convolutions; the spatial convolution layer is a spatial convolution with an a×a kernel; the up-sampling layer performs an up-sampling operation that copies each pixel into four adjacent pixels;
the key frame images F = [F_1, F_2, …, F_t, …, F_T] of the human action video V are input into the first convolution block of the ResNet50 network for processing, obtaining an output feature F_S;
the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V is input into the frame-level dynamic information capture module and processed sequentially by the down-sampling layer, the hourglass convolution layer, the spatial convolution layer and the up-sampling layer, obtaining an output feature F_fm;
F_S and F_fm are added to obtain the output M_fm of the frame-level dynamic information capture network;
Step 2.3, the segment-level dynamic information capture network consists of four serially connected convolution blocks; each convolution block consists of serially connected repeating units, and different convolution blocks contain different numbers of repeating units;
the repeating unit consists of a residual block and a segment-level dynamic information capture module; the residual block comprises two convolution layers with 1×1×1 kernels and one convolution layer with a 3×3 kernel; the segment-level dynamic information capture module comprises two 1×1×1 convolution layers, an hourglass convolution, a global average pooling layer and a Sigmoid activation function layer;
M_fm is input into the first 1×1×1 convolution layer of the first repeating unit in the first convolution block of the segment-level dynamic information capture network to obtain a feature Y; Y is input into the segment-level dynamic information capture module and processed sequentially by its first 1×1×1 convolution layer, the hourglass convolution layer, the global average pooling layer, the second 1×1×1 convolution layer and the Sigmoid activation function layer to obtain a feature A; A is multiplied with Y, and the product is input into the residual block of the first repeating unit of the first convolution block and processed sequentially by the 3×3 convolution layer and the second 1×1×1 convolution layer to obtain the output Z' of the first repeating unit of the first convolution block;
Z' is then input into the second repeating unit of the first convolution block, the result of the same processing is passed to the next repeating unit, and so on, so that the results of all repeating units in the first convolution block are input into the next convolution block for processing; the last repeating unit of the fourth convolution block finally produces the output Z of the hierarchical hourglass convolution network;
Step 3, the classification network is formed by connecting a global average pooling layer and a fully connected layer in series; Z is input into the classification network for processing to obtain the final action category;
Step 4, constructing a cross-entropy loss function as the loss function L of the hierarchical hourglass convolutional network, training the hierarchical hourglass convolutional network with an SGD optimizer, and computing the loss function L to adjust the network parameters, finally obtaining the trained hierarchical hourglass convolutional network as a video action classifier for video action classification.
The electronic device comprises a memory and a processor, and is characterized in that the memory is used for storing a program for supporting the processor to execute the video action classification method, and the processor is configured to execute the program stored in the memory.
The invention relates to a computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the video motion classification method.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention proposes a novel temporal convolution, the Hourglass Convolution (HgC), and constructs a video action recognition network hierarchically based on it, which effectively alleviates the target loss caused by visual displacement between different moments of a video and improves the accuracy of human action video recognition.
2. The hourglass convolution proposed by the invention has an hourglass-like receptive field: the spatial receptive field is enlarged at the preceding and following time points, so large visual displacements can be captured, which improves the spatio-temporal dynamic modeling capability of the hourglass convolution and ultimately the recognition accuracy of human action videos.
3. The human action video classification network (H2CN) constructed hierarchically from hourglass convolutions mines spatio-temporal dynamic information at two levels simultaneously, between adjacent frames and between adjacent segments, providing the network with rich spatio-temporal dynamic information and further improving its recognition accuracy on human action videos.
Drawings
FIG. 1 is a flow chart of a video classification method according to an embodiment of the present invention;
FIG. 2 is a schematic illustration of an hourglass convolution in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a video motion classification network based on a hierarchical dynamic modeling of hourglass convolution according to an embodiment of the present invention;
FIG. 4 is a diagram of a frame-level motion information capture network in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a segment-level dynamic information capture network according to an embodiment of the present invention.
Detailed Description
In this embodiment, as shown in fig. 1, a video motion classification method based on a hierarchical dynamic modeling of hourglass convolution is performed according to the following steps:
step 1, video data extraction and preprocessing:
uniformly sampling T key frame images from the human action video V at a fixed frame interval, denoted as F = [F_1, F_2, …, F_t, …, F_T], where F_t represents the t-th key frame and T represents the number of key frames, typically 8, 16 or 32;
sampling the two consecutive frames before and the two consecutive frames after the t-th key frame F_t in the human action video V, and denoting F_t together with these frames as the t-th segment C_t = [F_t^{-2}, F_t^{-1}, F_t, F_t^{+1}, F_t^{+2}], where F_t^{-2} denotes the frame two positions before F_t, F_t^{-1} the frame immediately before F_t, F_t^{+1} the frame immediately after F_t, and F_t^{+2} the frame two positions after F_t;
scaling the resolution of each frame of the t-th segment C_t, cropping an image block of resolution H×W from each frame, and applying normalization preprocessing to obtain the t-th input video data tensor C'_t, thereby obtaining the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V, where H and W respectively denote the height and width of C'_t and can each be taken as 224 to balance recognition accuracy and computational efficiency, and D denotes the number of channels of C'_t, which is 3 for the widely used RGB images;
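For illustration, a minimal Python sketch of step 1 is given below. The function name extract_segments, the center-crop strategy and the simple normalization to [-1, 1] are assumptions of this sketch rather than limitations of the invention, and frame scaling is assumed to have been performed during decoding:

import numpy as np

def extract_segments(video, T=8, crop=224):
    # video: decoded frames as a float array of shape (num_frames, H0, W0, 3)
    num_frames = video.shape[0]
    # Uniformly spaced key-frame indices F_1 ... F_T.
    key_idx = np.linspace(0, num_frames - 1, T).astype(int)
    segments = []
    for t in key_idx:
        # The two frames before and after the key frame (clamped at the borders).
        idx = np.clip(np.array([t - 2, t - 1, t, t + 1, t + 2]), 0, num_frames - 1)
        seg = video[idx]                                    # (5, H0, W0, 3)
        # Take an image block of resolution crop x crop from each frame (center crop here).
        h0, w0 = seg.shape[1], seg.shape[2]
        y, x = (h0 - crop) // 2, (w0 - crop) // 2
        seg = seg[:, y:y + crop, x:x + crop, :]
        # Simple normalization to [-1, 1] (per-channel mean/std statistics could be used instead).
        seg = seg / 255.0 * 2.0 - 1.0
        segments.append(seg)
    return np.stack(segments)                               # C': (T, 5, crop, crop, 3)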
step 2, as shown in fig. 3, constructing a hierarchical hourglass convolution network comprising: a frame-level dynamic information capture network, a segment-level dynamic information capture network and a classification network;
step 2.1, constructing hourglass convolution:
the hourglass convolution is made up of a set of spatial convolutions with kernel size (p·|i|+1)×(p·|i|+1) and a temporal convolution with kernel size K, where i is the time offset and p is the slope with which the receptive field grows as the time offset increases. For example, when K=3 and p=2, the spatial convolution kernel sizes corresponding to frames {t-1, t, t+1} are {(3,3), (1,1), (3,3)}.
The hourglass convolution processes any tensor X of dimension T'×H'×W'×D', where T' denotes the time dimension, H' the height, W' the width and D' the number of channels, and obtains the output feature HgC(X) as follows: the spatial convolution kernels described above first aggregate the spatial information of the frames at the corresponding time offsets, and the temporal convolution then aggregates information along the time axis. The t-th feature HgC(X)_t of the output feature HgC(X) is obtained as in formula (1):

HgC(X)_t = Σ_{i=-⌊K/2⌋..⌊K/2⌋} W^T_i · f(X_{t+i}; W_{p·|i|+1, p·|i|+1})    (1)

In formula (1), X_{t+i} is the (t+i)-th input feature of tensor X along the time dimension, W^T_i is the i-th parameter of the temporal convolution layer, f is the spatial convolution function, and W_{p·|i|+1, p·|i|+1} is the parameter of the spatial convolution layer; t ∈ [0, T'-1]. Compared with traditional video action recognition methods, the hourglass convolution additionally uses spatial convolution to first aggregate the spatial information of frames at different time offsets, giving it an hourglass-shaped receptive field (as shown in fig. 2). This helps the hourglass convolution aggregate spatio-temporal information at other time offsets that would otherwise be hard to aggregate because of visual displacement. Therefore, compared with the temporal convolution widely used in traditional methods, the hourglass convolution can capture spatio-temporal information that temporal convolution cannot, and better fits the spatio-temporal dynamics of video data.
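For concreteness, the following PyTorch sketch shows one way formula (1) could be implemented. The class name HourglassConv, the depthwise form of the spatial convolutions and the zero padding at the temporal borders are assumptions of this sketch and not details fixed by the invention:

import torch
import torch.nn as nn

class HourglassConv(nn.Module):
    # Hourglass convolution: offset-dependent spatial convolution followed by
    # temporal aggregation, as in formula (1).
    def __init__(self, channels, K=3, p=2):
        super().__init__()
        self.offsets = list(range(-(K // 2), K // 2 + 1))      # e.g. [-1, 0, 1] for K = 3
        # Spatial convolution f(.; W_{p|i|+1, p|i|+1}) for each time offset i
        # (depthwise here to keep the parameter count low).
        self.spatial = nn.ModuleList([
            nn.Conv2d(channels, channels,
                      kernel_size=p * abs(i) + 1,
                      padding=(p * abs(i)) // 2,
                      groups=channels, bias=False)
            for i in self.offsets])
        # Temporal weights W^T_i, one per offset and channel.
        self.temporal = nn.Parameter(torch.randn(len(self.offsets), channels) * 0.01)

    def forward(self, x):
        # x: (N, D', T', H', W')
        n, c, t, h, w = x.shape
        out = torch.zeros_like(x)
        for idx, i in enumerate(self.offsets):
            # Spatially convolve every frame with the kernel of size p*|i|+1.
            frames = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
            feat = self.spatial[idx](frames).reshape(n, t, c, h, w).permute(0, 2, 1, 3, 4)
            # Shift by i along the time axis (zero padding at the sequence borders),
            # so that position t receives the feature of frame t + i.
            shifted = torch.zeros_like(feat)
            if i == 0:
                shifted = feat
            elif i > 0:
                shifted[:, :, :-i] = feat[:, :, i:]
            else:
                shifted[:, :, -i:] = feat[:, :, :i]
            # Weight by W^T_i and accumulate over offsets.
            out = out + shifted * self.temporal[idx].view(1, c, 1, 1, 1)
        return out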
Step 2.2, the frame-level dynamic information capture network consists of the first convolution block of a ResNet50 network and a frame-level dynamic information capture module:
the first convolution block of the ResNet50 network is a spatial convolution with an a×a kernel, where a typically takes the value 7;
the frame-level dynamic information capture module consists of a down-sampling layer, an hourglass convolution layer, a spatial convolution layer and an up-sampling layer:
the down-sampling layer is a spatial average pooling layer with kernel size b×b; to balance recognition accuracy and computational efficiency, the classical value of b in the invention is 2; the hourglass convolution layer consists of two serially connected hourglass convolutions; the spatial convolution layer is a spatial convolution with an a×a kernel; the up-sampling layer performs an up-sampling operation that copies each pixel into four adjacent pixels;
the key frame images F = [F_1, F_2, …, F_t, …, F_T] of the human action video V are input into the first convolution block of the ResNet50 network for processing, obtaining an output feature F_S;
The process of obtaining frame-level dynamic information is shown in fig. 4: the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V is input into the frame-level dynamic information capture module and processed sequentially by the down-sampling layer, the hourglass convolution layer, the spatial convolution layer and the up-sampling layer to obtain the output feature F_fm. The invention uses the down-sampling layer to reduce the resolution of the input video data before the hourglass convolution layer, thereby reducing computation, and uses the up-sampling layer to restore the resolution after the hourglass convolution layer, so that subsequent computation is not affected.
F_S and F_fm are then added to obtain the output M_fm of the frame-level dynamic information capture network. Traditional methods obtain only F_S at this stage and therefore lack frame-level dynamic information, so the method proposed by the invention achieves stronger recognition accuracy.
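A minimal sketch of the frame-level dynamic information capture module follows, reusing the HourglassConv sketch given above. The channel counts, the stride used to match the shape of F_S, and the folding of the five segment frames into the time axis are assumptions of this sketch; a = 7 and b = 2 are the embodiment's values:

import torch.nn as nn

class FrameLevelCaptureModule(nn.Module):
    # Down-sampling -> two serial hourglass convolutions -> spatial convolution -> up-sampling.
    def __init__(self, in_ch=3, out_ch=64, a=7, b=2):
        super().__init__()
        self.down = nn.AvgPool3d(kernel_size=(1, b, b))                 # spatial average pooling (b x b)
        self.hgc = nn.Sequential(HourglassConv(in_ch), HourglassConv(in_ch))
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, a, a),
                                 stride=(1, 2, 2), padding=(0, a // 2, a // 2), bias=False)
        self.up = nn.Upsample(scale_factor=(1, 2, 2), mode='nearest')   # copy one pixel into four neighbours

    def forward(self, c):
        # c: (N, D, T*5, H, W), the five segment frames folded into the time axis (an assumption)
        x = self.down(c)      # reduce the spatial resolution before the costly hourglass layers
        x = self.hgc(x)
        x = self.spatial(x)
        return self.up(x)     # F_fm, added element-wise to F_S from the first ResNet50 block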
Step 2.3, the segment-level dynamic information capture network consists of four serially connected convolution blocks; each convolution block consists of serially connected repeating units, and different convolution blocks contain different numbers of repeating units;
the repeating unit consists of a residual block and a segment-level dynamic information capture module; the residual block comprises two convolution layers with 1×1×1 kernels and one convolution layer with a 3×3 kernel; the segment-level dynamic information capture module comprises two 1×1×1 convolution layers, an hourglass convolution, a global average pooling layer and a Sigmoid activation function layer;
the process of obtaining fragment-level dynamic information is shown in fig. 5: will M fm Inputting the characteristic Y into a segment-level dynamic information capture module after inputting the characteristic Y into a first 1 x 1 convolutional layer of a first repeating unit in a first convolutional block of a segment-level dynamic information capture network, and sequentially passing the first 1 x 1 convolutional layer, an hourglass convolutional layer, a global average pooling layer, a second 1 x 1 convolutional layer and the second 1 x 1 convolutional layerObtaining a characteristic A after the Sigmoid activation function layer is processed, multiplying the A and the Y, inputting the multiplied A and Y into a residual block of a first repeating unit in a first volume block, and obtaining an output Z' of the first repeating unit of the first volume block after the processing of a 3 x 3 volume layer and a second 1 x 1 volume layer in sequence; in the process, the hourglass convolution layer which needs to expend extra calculation amount is placed between two 1 × 1 × 1 convolution layers, the channel dimension reduction is carried out by using the first 1 × 1 × 1 convolution layer, the consumption of calculation resources is reduced, and then the second 1 × 1 × 1 convolution layer is restored to the channel dimension. The traditional network uses time convolution to carry out segment-level dynamic information modeling, and the invention can capture space-time information which cannot be captured by the time convolution by utilizing hourglass convolution. Meanwhile, by modeling the dynamic information at the frame level and the fragment level, the invention models the spatio-temporal dynamic information in the video data in a layering manner, so that compared with the traditional method, the method provided by the invention has higher identification precision.
Z' is then input into the second repeating unit of the first convolution block, the result of the same processing is passed to the next repeating unit, and so on, so that the results of all repeating units in the first convolution block are input into the next convolution block for processing; the last repeating unit of the fourth convolution block finally produces the output Z of the hierarchical hourglass convolution network;
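The following sketch illustrates one repeating unit, again reusing the HourglassConv sketch given above. The residual shortcut, the channel reduction ratio inside the capture branch and the exact channel counts are assumptions of this sketch:

import torch
import torch.nn as nn

class SegmentLevelUnit(nn.Module):
    # One repeating unit: first 1x1x1 convolution -> segment-level dynamic
    # information capture branch (1x1x1 conv, hourglass conv, global average
    # pooling, 1x1x1 conv, Sigmoid) -> channel attention A multiplied with Y
    # -> 3x3 convolution -> second 1x1x1 convolution.
    def __init__(self, in_ch, mid_ch, reduction=4):
        super().__init__()
        self.reduce = nn.Conv3d(in_ch, mid_ch, kernel_size=1, bias=False)          # first 1x1x1 conv
        self.cap_reduce = nn.Conv3d(mid_ch, mid_ch // reduction, kernel_size=1, bias=False)
        self.cap_hgc = HourglassConv(mid_ch // reduction)
        self.cap_pool = nn.AdaptiveAvgPool3d(1)                                     # global average pooling
        self.cap_expand = nn.Conv3d(mid_ch // reduction, mid_ch, kernel_size=1, bias=False)
        self.conv3 = nn.Conv3d(mid_ch, mid_ch, kernel_size=(1, 3, 3),
                               padding=(0, 1, 1), bias=False)                        # 3x3 spatial convolution
        self.expand = nn.Conv3d(mid_ch, in_ch, kernel_size=1, bias=False)            # second 1x1x1 conv

    def forward(self, m):
        y = self.reduce(m)                                                           # feature Y
        a = torch.sigmoid(self.cap_expand(
            self.cap_pool(self.cap_hgc(self.cap_reduce(y)))))                        # feature A
        z = self.expand(self.conv3(a * y))
        return z + m                                                                 # residual shortcut (assumed)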
step 3, the classification network is formed by connecting a global average pooling layer and a fully connected layer in series; Z is input into the classification network for processing to obtain the final action category;
step 4, constructing a cross-entropy loss function as the loss function L of the hierarchical hourglass convolutional network, training the hierarchical hourglass convolutional network with an SGD optimizer while computing the loss function L to adjust the network parameters, and finally obtaining the trained hierarchical hourglass convolutional network as a video action classifier for video action classification.
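A minimal training-loop sketch for step 4 follows; the wrapper class H2CN (assumed to combine the networks of step 2 with the classifier of step 3), the number of classes, the data loader and the optimizer hyper-parameters are illustrative assumptions:

import torch
import torch.nn as nn

model = H2CN(num_classes=174)        # assumed wrapper around the hierarchical hourglass convolution network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()    # loss function L

model.train()
for epoch in range(50):
    for segments, labels in train_loader:     # segments: preprocessed tensors C'; labels: action categories
        logits = model(segments)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                      # adjust the network parameters according to L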
In this embodiment, an electronic device includes a memory for storing a program that enables the processor to execute the video action classification method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program, and the computer program is executed by a processor to perform the steps of the video motion classification method.
To demonstrate the effectiveness of the present invention, the following experiments were performed for verification.
1) The hourglass convolution was inserted into a ResNet network, named HgC-ResNet, and compared on Something-Something V1 with TSN, which has no temporal convolution, and with R(2+1)D, which uses ordinary temporal convolution; the results are shown in Table 1.
Table 1. Performance comparison of the hourglass convolution with R(2+1)D and TSN
Method       Top-1   #P      FLOPS
TSN          19.7    23.9M   32.9G
R(2+1)D      46.0    23.9M   32.9G
HgC-ResNet   47.0    23.9M   33.1G
As can be observed from Table 1, both temporal convolution (R(2+1)D) and hourglass convolution (HgC-ResNet) significantly improve the performance of the two-dimensional convolutional neural network backbone (TSN), while HgC-ResNet exceeds R(2+1)D by a notable margin (1%) at nearly the same computational cost. This comparison mainly demonstrates the strong capability of the hourglass convolution in video motion modeling.
2) On Something-Something V1 & V2, the video motion classification method based on hierarchical dynamic modeling with hourglass convolution proposed by the invention (H2CN) was compared with other state-of-the-art action recognition models; the results are shown in Table 2.
Table 2. Performance comparison of H2CN and other models on Something-Something V1 & V2
Method BackBone #Pretrain Something V1 Something V2
GST ResNet-50 ImageNet 47.0 61.6
TSM+TPN ResNet-50 ImageNet 49.0 62.0
TEINeT ResNet-50 ImageNet 47.4 61.3
TAM ResNet-50 ImageNet 46.5 60.5
STM ResNet-50 ImageNet 49.2 62.3
TDN ResNet-50 ImageNet 52.3 64.0
SELFYNeT ResNet-50 ImageNet 52.5 64.5
SmallBig ResNet-50 ImageNet 48.3 61.6
TimeSformer-HR Transformer Kinetics -- 62.5
ECO ResNet-18 Kinetics 39.6 --
I3D 3DResNet-50 ImageNet 41.6 --
H2CN ResNet-50 ImageNet 53.6 65.2
As shown in Table 2, H2CN is compared with convolutional neural network based architectures, including classical methods such as I3D, GST and TSM, and more recent methods such as TDN and SELFYNet. H2CN achieves Top-1 accuracies of 53.6% and 65.2% on Something-Something V1 and V2, respectively, showing a significant advantage over the convolutional neural network based methods. These results demonstrate the ability of H2CN to capture a variety of dynamic information. Compared with more sophisticated Transformer-based methods such as TimeSformer-HR, the performance of H2CN remains competitive.
3) On Diving48, the action recognition accuracy of the invention was compared with other state-of-the-art action recognition models; the results are shown in Table 3.
Table 3. Performance comparison of H2CN with other state-of-the-art models on Diving48 (table provided only as an image in the original document).
From Table 3 it can be seen that, compared with the convolutional neural network baselines, H2CN achieves the best performance of 87.0%. More importantly, H2CN outperforms VIMPAC (85.5%), the best Transformer-based method.

Claims (3)

1. A video motion classification method based on hierarchical dynamic modeling of hourglass convolution is characterized by comprising the following steps:
step 1, video data extraction and preprocessing:
uniformly sampling T key frame images from the human action video V at a fixed frame interval, denoted as F = [F_1, F_2, …, F_t, …, F_T], where F_t represents the t-th key frame and T represents the number of key frames;
sampling the two consecutive frames before and the two consecutive frames after the t-th key frame F_t in the human action video V, and denoting F_t together with these frames as the t-th segment C_t = [F_t^{-2}, F_t^{-1}, F_t, F_t^{+1}, F_t^{+2}], where F_t^{-2} denotes the frame two positions before F_t, F_t^{-1} the frame immediately before F_t, F_t^{+1} the frame immediately after F_t, and F_t^{+2} the frame two positions after F_t;
scaling the resolution of each frame of the t-th segment C_t, cropping an image block of resolution H×W from each frame, and applying normalization preprocessing to obtain the t-th input video data tensor C'_t, thereby obtaining the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V, where H and W respectively denote the height and width of C'_t and D denotes the number of channels of C'_t;
Step 2, constructing a hierarchical hourglass convolution network comprising: a frame-level dynamic information capture network, a segment-level dynamic information capture network and a classification network;
step 2.1, constructing hourglass convolution:
the hourglass convolution consists of a group of spatial convolutions with kernel size (p·|i|+1)×(p·|i|+1) and a temporal convolution with kernel size K, where p is a parameter and i is the time offset;
the hourglass convolution processes any input tensor X of dimension T'×H'×W'×D' to obtain an output feature HgC(X), where T' denotes the time dimension, H' the height, W' the width and D' the number of channels; the t-th feature HgC(X)_t of the output feature HgC(X) is obtained by formula (1):

HgC(X)_t = Σ_{i=-⌊K/2⌋..⌊K/2⌋} W^T_i · f(X_{t+i}; W_{p·|i|+1, p·|i|+1})    (1)

In formula (1), X_{t+i} is the (t+i)-th input feature of tensor X along the time dimension, W^T_i is the i-th parameter of the temporal convolution layer, f is the spatial convolution function, and W_{p·|i|+1, p·|i|+1} is the parameter of the spatial convolution layer; t ∈ [0, T'-1];
Step 2.2, the frame-level dynamic information capture network consists of the first convolution block of a ResNet50 network and a frame-level dynamic information capture module:
the first convolution block of the ResNet50 network is a spatial convolution with an a×a kernel;
the frame-level dynamic information capture module consists of a down-sampling layer, an hourglass convolution layer, a spatial convolution layer and an up-sampling layer:
the down-sampling layer is a spatial average pooling layer with kernel size b×b; the hourglass convolution layer consists of two serially connected hourglass convolutions; the spatial convolution layer is a spatial convolution with an a×a kernel; the up-sampling layer performs an up-sampling operation that copies each pixel into four adjacent pixels;
the key frame images F = [F_1, F_2, …, F_t, …, F_T] of the human action video V are input into the first convolution block of the ResNet50 network for processing, obtaining an output feature F_S;
the input video data tensor C' = [C'_1, C'_2, …, C'_t, …, C'_T] of the human action video V is input into the frame-level dynamic information capture module and processed sequentially by the down-sampling layer, the hourglass convolution layer, the spatial convolution layer and the up-sampling layer, obtaining an output feature F_fm;
F_S and F_fm are added to obtain the output M_fm of the frame-level dynamic information capture network;
Step 2.3, the segment-level dynamic information capture network consists of four serially connected convolution blocks; each convolution block consists of serially connected repeating units, and different convolution blocks contain different numbers of repeating units;
the repeating unit consists of a residual block and a segment-level dynamic information capture module; the residual block comprises two convolution layers with 1×1×1 kernels and one convolution layer with a 3×3 kernel; the segment-level dynamic information capture module comprises two 1×1×1 convolution layers, an hourglass convolution, a global average pooling layer and a Sigmoid activation function layer;
M_fm is input into the first 1×1×1 convolution layer of the first repeating unit in the first convolution block of the segment-level dynamic information capture network to obtain a feature Y; Y is input into the segment-level dynamic information capture module and processed sequentially by its first 1×1×1 convolution layer, the hourglass convolution layer, the global average pooling layer, the second 1×1×1 convolution layer and the Sigmoid activation function layer to obtain a feature A; A is multiplied with Y, and the product is input into the residual block of the first repeating unit of the first convolution block and processed sequentially by the 3×3 convolution layer and the second 1×1×1 convolution layer to obtain the output Z' of the first repeating unit of the first convolution block;
Z' is then input into the second repeating unit of the first convolution block, the result of the same processing is passed to the next repeating unit, and so on, so that the results of all repeating units in the first convolution block are input into the next convolution block for processing; the last repeating unit of the fourth convolution block finally produces the output Z of the hierarchical hourglass convolution network;
Step 3, the classification network is formed by connecting a global average pooling layer and a fully connected layer in series; Z is input into the classification network for processing to obtain the final action category;
Step 4, constructing a cross-entropy loss function as the loss function L of the hierarchical hourglass convolutional network, training the hierarchical hourglass convolutional network with an SGD optimizer, and computing the loss function L to adjust the network parameters, finally obtaining the trained hierarchical hourglass convolutional network as a video action classifier for video action classification.
2. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that enables the processor to perform the video action classification method of claim 1, and the processor is configured to execute the program stored in the memory.
3. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the video action classification method according to claim 1.
CN202211053069.1A 2022-08-31 2022-08-31 Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application Pending CN115424175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211053069.1A CN115424175A (en) 2022-08-31 2022-08-31 Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211053069.1A CN115424175A (en) 2022-08-31 2022-08-31 Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application

Publications (1)

Publication Number Publication Date
CN115424175A true CN115424175A (en) 2022-12-02

Family

ID=84200282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211053069.1A Pending CN115424175A (en) 2022-08-31 2022-08-31 Video motion classification method based on hierarchical dynamic modeling of hourglass convolution and application

Country Status (1)

Country Link
CN (1) CN115424175A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination