CN113673559A - Video person spatio-temporal feature extraction method based on a residual network - Google Patents

Video person spatio-temporal feature extraction method based on a residual network

Info

Publication number
CN113673559A
Authority
CN
China
Prior art keywords: residual, convolution, network, video
Prior art date
Legal status: Granted
Application number
CN202110793379.6A
Other languages
Chinese (zh)
Other versions
CN113673559B (en)
Inventor
陈志 (Chen Zhi)
江婧 (Jiang Jing)
岳文静 (Yue Wenjing)
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202110793379.6A
Publication of CN113673559A
Application granted
Publication of CN113673559B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for extracting spatio-temporal features of people in video based on a residual network, which addresses the high computational cost and large memory requirements of extracting spatio-temporal features from video. The invention first decomposes the 3D filter into a spatial form and a temporal form, then designs three different residual blocks for the decomposed (2D+1D) convolution kernels on the basis of a residual network, and places each residual block at a different position of the overall ResNet structure. Finally, the residual blocks are combined with a designed hourglass structure, in which depth-wise convolutions are added at the ends of the residual path, to form a new 3D residual network that extracts spatio-temporal features of the people in the video. The invention enhances the structural diversity of the network, so that the whole network can be applied to a variety of video analysis tasks with improved performance and time efficiency.

Description

Video person spatio-temporal feature extraction method based on a residual network
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a method for extracting spatio-temporal features of people in video based on a residual network.
Background
At present, feature extraction is a major focus of image recognition and of computer vision more broadly. The quality of the extracted features strongly affects generalization ability; the aim of this work is to build, from an initial set of data, features that carry useful image information without redundancy, so as to support subsequent detection or classification tasks.
Most feature extraction methods operate on still images; the main ones are HOG (histogram of oriented gradients), SIFT (scale-invariant feature transform) and Haar features. Current methods that extract features directly from video include TSN (Temporal Segment Network) and C3D.
A TSN consists of a temporal-stream convolutional network and a spatial-stream convolutional network. It randomly samples several segments from a given video, each selected segment makes a preliminary class prediction from its own information, and the segment-level predictions are then fused into the final video-level prediction. TSN models long-range temporal structure and uses a sparse sampling strategy together with video-level supervision, which makes learning over a given video both efficient and effective.
C3D, by contrast, builds its network from 3D convolutions, which are better suited to extracting spatio-temporal features than 2D convolutions: a 2D convolution discards temporal information after each operation, whereas 3D convolution and pooling operations model temporal information directly. C3D is an effective spatio-temporal feature learner, and its most effective convolution kernel size is 3 × 3 × 3.
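By way of illustration only (this sketch is not part of the patent), the difference in cost between a 2D and a 3D convolution can be seen directly in code; PyTorch is assumed, and the shapes and channel counts are illustrative rather than those of C3D:

```python
# Minimal sketch (PyTorch assumed): comparing a 2D and a 3D convolution on a video clip
# whose shape follows the c x n x h x w convention used later in the description.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)   # batch, channels c, frames n, height h, width w

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)

# The 2D filter sees one frame at a time and ignores temporal context.
per_frame = torch.stack([conv2d(clip[:, :, t]) for t in range(clip.size(2))], dim=2)

# The 3D filter spans 3 neighbouring frames as well as a 3 x 3 spatial window.
spatio_temporal = conv3d(clip)

print(per_frame.shape, spatio_temporal.shape)        # both torch.Size([1, 64, 16, 112, 112])
print(sum(p.numel() for p in conv2d.parameters()),   # 1792 parameters
      sum(p.numel() for p in conv3d.parameters()))   # 5248 parameters, about 3x for this layer
```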
With the increasing intelligence of devices and the rapid growth of multimedia on the Internet, video is gradually becoming a new mode of communication between users, which both encourages and tests the development of cutting-edge technology. A video is composed of many frames in temporal sequence, is more complex than a single image, and often contains frequent shot changes, all of which makes it harder to train a general, powerful classifier for extracting spatio-temporal features. Spatio-temporal information can be extracted from video by training a new 3D convolutional neural network, which gives access to the temporal information between each frame and its neighbouring frames, but training a 3D CNN from scratch is computationally expensive and roughly doubles the model size compared with a 2D CNN. These are problems for which solutions are urgently sought.
Disclosure of Invention
The technical problem is as follows: the invention aims to solve the problems of high computational cost and large memory requirements when extracting spatio-temporal features from video, and provides a method for extracting spatio-temporal features of people in video based on a residual network.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme.
a video character space-time feature extraction method based on a residual error network comprises the following steps:
step 1) inputting a video V, wherein the video V is a multi-person video comprising two or more persons, the video size is c multiplied by n multiplied by h multiplied by w, wherein c is the number of channels, n is the number of frames in a single video, and h and w are the height and width of each frame;
step 2) decomposing the 3D convolution filter with size 3 x 3 into spatial and temporal (2D +1D) forms, i.e. a spatial 2D convolution filter and a temporal 1D convolution filter, using the 1 x 3 convolution filter and the 3 x 1 convolution filter instead of the 3 x 3 convolution filter;
step 3) combining the decoupled spatial 2-dimensional convolution filter and temporal 1-dimensional convolution filter with a residual error network, and designing 3 different 3D residual error blocks: the device comprises a 3D serial residual block, a 3D parallel residual block and a 3D serial-parallel residual block;
and 4) respectively combining the 3 kinds of residual blocks in the step 3) with the hourglass structure, positioning shortcuts to be connected with high-dimensional representation, and obtaining 3 kinds of hourglass residual structures: the sandglass residual error serial structure HRS-I, the sandglass residual error parallel structure HRS-II and the sandglass residual error serial and parallel structure HRS-III;
step 5) respectively fusing the 3 hourglass residual error structures in the step 4) into a residual error network to form three new residual error networks; combining the 3 hourglass residual error structures in the step 4) and then fusing the combined hourglass residual error structures into a residual error network to form another new residual error network; comparing the four obtained residual error networks to obtain a residual error network with the best performance;
step 6) training the residual error network with the best performance obtained in the step 5) on a gpu of 1080ti by using a data set, wherein 70% of the data set is used as a training set, 10% of the data set is used as a verification set, and 20% of the data set is used as a test set;
and 7) carrying out space-time feature extraction on the video V by using the trained new residual error network.
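By way of illustration only (not part of the claimed method), the factorization referred to in step 2) can be sketched in a few lines; PyTorch is assumed, and the intermediate channel width is an assumed choice rather than a value taken from the patent:

```python
# Minimal sketch (PyTorch assumed, intermediate width = output width by default): replace a
# 3 x 3 x 3 convolution with a spatial 1 x 3 x 3 convolution followed by a temporal
# 3 x 1 x 1 convolution, as described in step 2).
import torch
import torch.nn as nn

class Factorized2Plus1D(nn.Module):
    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch
        # spatial 2D filter: no temporal extent
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # temporal 1D filter: no spatial extent
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):          # x has shape (batch, c, n, h, w)
        return self.temporal(self.spatial(x))

clip = torch.randn(1, 3, 16, 112, 112)
print(Factorized2Plus1D(3, 64)(clip).shape)   # torch.Size([1, 64, 16, 112, 112])
```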
Further, the step 3) specifically comprises the following steps (an illustrative code sketch of the three block forms follows these steps):
step 31) letting the residual function be F(x_l) and H(x_l) = F(x_l) + x_l, where H(x_l) is the mapping learned by the residual network and x_{l+1} is the output of the l-th residual unit;
step 32) when F(x_l) = 0, H(x_l) = x_l; the output of the l-th residual unit can then be written as x_{l+1} = x_l + F'·x_l, where F'·x_l denotes the result of applying the residual function F to x_l;
step 33) designing the serial residual block, which connects the temporal 1D convolution filter and the spatial 2D convolution filter in series; letting the residual function be T(S(x_l)), the output is expressed as x_{l+1} = x_l·(1 + T'S'), where T denotes the temporal 1D filter, S denotes the spatial 2D filter, and T', S' are the results of applying the residual functions T and S respectively;
step 34) designing the parallel residual block, which places the two convolution filters on separate parallel paths so that they influence each other only indirectly and their outputs are summed into the final output; letting the residual function be T(x_l) + S(x_l), the output is expressed as x_{l+1} = x_l·(1 + T' + S');
step 35) designing the serial-parallel residual block, which gives both the spatial filter and the serially connected temporal filter a direct connection to the final output, i.e. adds a shortcut from the spatial branch of the serial block; letting the residual function be S(x_l) + T(S(x_l)), the output is expressed as x_{l+1} = x_l·(1 + T'S' + S').
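By way of illustration only, the three residual block forms of steps 33)–35) can be sketched as follows; PyTorch is assumed, normalization and activation layers are omitted and the channel widths are illustrative, so this only mirrors the residual functions above rather than the patented network itself:

```python
# Minimal sketch (PyTorch assumed) of the serial, parallel and serial-parallel residual blocks.
# S is the spatial 1 x 3 x 3 filter, T is the temporal 3 x 1 x 1 filter.
import torch
import torch.nn as nn

def spatial_conv(ch):
    return nn.Conv3d(ch, ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))

def temporal_conv(ch):
    return nn.Conv3d(ch, ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

class SerialBlock(nn.Module):             # x_{l+1} = x_l + T(S(x_l))
    def __init__(self, ch):
        super().__init__()
        self.S, self.T = spatial_conv(ch), temporal_conv(ch)
    def forward(self, x):
        return x + self.T(self.S(x))

class ParallelBlock(nn.Module):           # x_{l+1} = x_l + T(x_l) + S(x_l)
    def __init__(self, ch):
        super().__init__()
        self.S, self.T = spatial_conv(ch), temporal_conv(ch)
    def forward(self, x):
        return x + self.T(x) + self.S(x)

class SerialParallelBlock(nn.Module):     # x_{l+1} = x_l + S(x_l) + T(S(x_l))
    def __init__(self, ch):
        super().__init__()
        self.S, self.T = spatial_conv(ch), temporal_conv(ch)
    def forward(self, x):
        s = self.S(x)
        return x + s + self.T(s)

x = torch.randn(1, 64, 16, 56, 56)
for Block in (SerialBlock, ParallelBlock, SerialParallelBlock):
    print(Block(64)(x).shape)             # (1, 64, 16, 56, 56) in each case
```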
Further, the step 4) specifically includes the following steps (an illustrative code sketch of the resulting hourglass residual block follows these steps):
step 41) in order to ensure that the shortcut connects high-dimensional representations, reversing the order of the two point-wise convolutions, a point-wise convolution being a 1 × 1 convolution that extracts features at a single position and yields a feature map;
step 42) letting F ∈ R^(D_f × D_f × M) be the input tensor and G the output tensor of the residual structure, where D_f × D_f × M is the size of the feature map obtained in step 41); ignoring the depth-wise convolution layers and the activation layers, the hourglass structure is expressed as
G = φ_e(φ_r(F)) + F,
where φ_e is the channel-expanding point-wise convolution and φ_r is the channel-reducing point-wise convolution;
step 43) adding depth-wise convolutions at the two ends of the residual path, with the point-wise convolutions placed between the depth-wise convolutions; the hourglass structure can then be expressed as
G = φ_{2,d}(φ_{2,p}(φ_{1,p}(φ_{1,d}(F)))) + F,
where φ_{1,p} is the 1st point-wise convolution, φ_{1,d} is the 1st depth-wise convolution, φ_{2,p} is the 2nd point-wise convolution and φ_{2,d} is the 2nd depth-wise convolution.
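By way of illustration only, the hourglass residual block of steps 41)–43) can be sketched as follows; PyTorch is assumed, the depth-wise kernel size and the channel-reduction ratio are assumed values, and normalization and activation layers are omitted:

```python
# Minimal sketch (PyTorch assumed; 3 x 3 x 3 depth-wise kernels and reduction ratio r = 4 are
# assumptions): depth-wise convolutions at the two ends of the residual path with the two
# point-wise convolutions (reduce, then expand) between them, so the shortcut connects the
# high-dimensional representation.
import torch
import torch.nn as nn

class HourglassResidual(nn.Module):
    def __init__(self, ch, r=4):
        super().__init__()
        mid = ch // r
        self.dw1 = nn.Conv3d(ch, ch, kernel_size=3, padding=1, groups=ch)  # 1st depth-wise convolution
        self.pw1 = nn.Conv3d(ch, mid, kernel_size=1)                       # 1st point-wise convolution (channel reduction)
        self.pw2 = nn.Conv3d(mid, ch, kernel_size=1)                       # 2nd point-wise convolution (channel expansion)
        self.dw2 = nn.Conv3d(ch, ch, kernel_size=3, padding=1, groups=ch)  # 2nd depth-wise convolution

    def forward(self, x):
        # G = dw2(pw2(pw1(dw1(F)))) + F
        return x + self.dw2(self.pw2(self.pw1(self.dw1(x))))

x = torch.randn(1, 64, 16, 56, 56)
print(HourglassResidual(64)(x).shape)   # torch.Size([1, 64, 16, 56, 56])
```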
Further, the step 5) specifically comprises the following steps (an illustrative sketch of the replacement scheme follows these steps):
step 51) calling the three residual blocks, after combination with the hourglass structure, the serial hourglass residual structure HRS-I, the parallel hourglass residual structure HRS-II and the serial-parallel hourglass residual structure HRS-III respectively, and replacing all residual units in ResNet-50 with HRS-I, HRS-II and HRS-III respectively to form three new residual networks;
step 52) forming a new hourglass residual structure chain from HRS-I, HRS-II and HRS-III in sequence and replacing all residual units in ResNet-50 with this chain to obtain a fourth new residual network;
step 53) comparing the three new residual networks formed in step 51) with the residual network obtained in step 52) to obtain the residual network with the best performance.
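By way of illustration only, the replacement scheme of steps 51)–52) can be sketched as follows; PyTorch is assumed, the HRS classes here are simple stand-ins rather than the hourglass residual structures of step 4), and only the ResNet-50 stage layout of 3+4+6+3 residual units is taken from the description:

```python
# Minimal sketch (PyTorch assumed, stand-in blocks): fill the ResNet-50 stage layout either
# with a single HRS variant everywhere (step 51) or with a repeating I -> II -> III chain (step 52).
import torch
import torch.nn as nn
from itertools import cycle

class HRS_I(nn.Module):
    """Stand-in for the serial hourglass residual structure (HRS-I)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Conv3d(ch, ch, kernel_size=3, padding=1)
    def forward(self, x):
        return x + self.body(x)

class HRS_II(HRS_I):
    """Stand-in for the parallel hourglass residual structure (HRS-II)."""

class HRS_III(HRS_I):
    """Stand-in for the serial-parallel hourglass residual structure (HRS-III)."""

def build_network(block_types, stage_depths=(3, 4, 6, 3), ch=64):
    """Fill the ResNet-50 stage layout (3+4+6+3 = 16 residual units) with blocks
    drawn cyclically from block_types: one type everywhere, or the I -> II -> III chain."""
    pick = cycle(block_types)
    blocks = [next(pick)(ch) for depth in stage_depths for _ in range(depth)]
    return nn.Sequential(*blocks)

candidates = {
    "HRS-I only":       build_network([HRS_I]),                        # step 51), first variant
    "HRS-II only":      build_network([HRS_II]),
    "HRS-III only":     build_network([HRS_III]),
    "I->II->III chain": build_network([HRS_I, HRS_II, HRS_III]),       # step 52)
}
x = torch.randn(1, 64, 8, 28, 28)
for name, net in candidates.items():
    print(name, net(x).shape)
```

In step 53) the four candidates would then be compared on the validation split and the best-performing one retained.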
Further, in step 6), the best-performing residual network obtained in step 5) is trained efficiently by randomly selecting 5 short clips of 5 seconds each from every video.
Further, in step 6), when training the new residual network, the dropout rate is empirically set to 0.1.
Further, in step 6), when training the new residual network, the learning rate is empirically initialized to 0.001.
Beneficial effects: compared with the prior art, the invention has the following beneficial effects:
The method decomposes the 3D filter into spatial and temporal forms, designs three forms of residual block for the decomposed (2D+1D) convolution kernels, and then combines these residual blocks with the designed hourglass structure, in which depth-wise convolutions are added at the ends of the residual path, to form a new 3D residual network for spatio-temporal feature extraction. This enhances the structural diversity of the network, so that the whole network can be applied to a variety of video analysis tasks with improved performance and time efficiency.
Drawings
FIG. 1 is a flow chart of the method for extracting spatio-temporal features of people in video based on a residual network.
FIG. 2 shows the decomposed (2D+1D) form combined with the residual network.
FIG. 3 shows the combination of a residual block with the hourglass structure.
Detailed Description
The technical scheme of the invention is further explained in detail below with reference to the accompanying drawings.
a video character space-time feature extraction method based on a residual error network comprises the following steps:
FIG. 1 is a flow chart of the method. First, a clipped video of size c × n × h × w is input, where c is the number of channels, n is the number of frames in the video, and h and w are the height and width of each frame. The videos are taken from the large Sports-1M data set. A 3D convolution filter of size 3 × 3 × 3 is then decomposed into a spatial plus temporal (2D+1D) form, replacing the 3 × 3 × 3 convolution filter with a 1 × 3 × 3 convolution filter and a 3 × 1 × 1 convolution filter. The decoupled (2D+1D) form is then combined with the residual network; the combined network is shown in FIG. 2. Three different 3D residual blocks are designed: a 3D serial residual block, a 3D parallel residual block and a 3D serial-parallel residual block.
A residual network with an hourglass structure, similar to the classic bottleneck structure, is then designed; unlike the bottleneck structure, the hourglass residual structure adds depth-wise convolutions at the ends of the residual path. The decomposed structure shown in FIG. 2 is combined with the hourglass structure, and the shortcut is placed so that it connects high-dimensional representations. To guarantee this, the order of the two point-wise convolutions is reversed, a point-wise convolution being a 1 × 1 convolution that extracts features at a single position; depth-wise convolutions are then added at the ends of the residual path, with the point-wise convolutions placed between them. Because the two depth-wise convolutions operate in a high-dimensional space, the loss incurred during feature extraction in the module is reduced and richer feature representations are extracted.
The hourglass residual structure is then fused into the residual network to form a new 3D residual network, and the new network is trained on the Sports-1M data set, with 5 short clips of 5 seconds each selected at random from every video. During training, the mini-batch size is set to 128 clips and the dropout rate to 0.1. The learning rate is initialized to 0.001 and divided by 10 every 60K iterations.
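By way of illustration only, the training settings described above can be collected into a short configuration sketch; PyTorch is assumed, and the frame rate, the stand-in model and the tiny batch are assumptions rather than values from the patent, apart from the dropout rate, the learning-rate schedule and the 5 clips of 5 seconds per video:

```python
# Minimal sketch (PyTorch assumed): 5 random 5-second clips per video, dropout 0.1,
# initial learning rate 0.001 divided by 10 every 60K iterations. FPS, the model and the
# two-clip batch are placeholders; the patent uses mini-batches of 128 clips.
import random
import torch
import torch.nn as nn
import torch.optim as optim

CLIPS_PER_VIDEO, CLIP_SECONDS, FPS = 5, 5, 25      # FPS = 25 is an assumed frame rate

def sample_clip_starts(num_frames_in_video):
    """Return start frames for 5 randomly placed 5-second clips of one video."""
    clip_len = CLIP_SECONDS * FPS
    return [random.randint(0, max(0, num_frames_in_video - clip_len))
            for _ in range(CLIPS_PER_VIDEO)]

model = nn.Sequential(                             # placeholder standing in for the 3D residual network
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=0.1),                             # dropout rate 0.1
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(64, 487),                            # Sports-1M has 487 classes
)
optimizer = optim.SGD(model.parameters(), lr=0.001)                              # initial lr 0.001
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=60_000, gamma=0.1)    # lr / 10 every 60K iterations

clips = torch.randn(2, 3, 32, 56, 56)              # tiny stand-in batch
labels = torch.randint(0, 487, (2,))
loss = nn.CrossEntropyLoss()(model(clips), labels)
loss.backward()
optimizer.step()
scheduler.step()
print(sample_clip_starts(3000), float(loss))
```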
Finally, spatio-temporal features are extracted from the video with the trained network.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (7)

1. A method for extracting spatio-temporal features of people in video based on a residual network, characterized by comprising the following steps:
step 1) inputting a video V, wherein V is a multi-person video containing two or more people, of size c × n × h × w, where c is the number of channels, n is the number of frames in the video, and h and w are the height and width of each frame;
step 2) decomposing the 3D convolution filter of size 3 × 3 × 3 into a spatial plus temporal (2D+1D) form, i.e. a spatial 2D convolution filter and a temporal 1D convolution filter, replacing the 3 × 3 × 3 filter with a 1 × 3 × 3 filter and a 3 × 1 × 1 filter;
step 3) combining the decoupled spatial 2D convolution filter and temporal 1D convolution filter with a residual network and designing 3 different 3D residual blocks: a 3D serial residual block, a 3D parallel residual block and a 3D serial-parallel residual block;
step 4) combining each of the 3 residual blocks of step 3) with the hourglass structure and placing the shortcut so that it connects high-dimensional representations, which yields 3 hourglass residual structures: the hourglass residual serial structure HRS-I, the hourglass residual parallel structure HRS-II and the hourglass residual serial-parallel structure HRS-III;
step 5) fusing each of the 3 hourglass residual structures of step 4) into a residual network to form three new residual networks, and also combining the 3 hourglass residual structures of step 4) into a chain and fusing it into a residual network to form a fourth new residual network; comparing the four resulting residual networks and keeping the one with the best performance;
step 6) training the best-performing residual network obtained in step 5) on a 1080 Ti GPU using a data set, with 70% of the data set used as the training set, 10% as the validation set and 20% as the test set;
step 7) extracting spatio-temporal features from the video V with the trained residual network.
2. The method as claimed in claim 1, wherein the step 3) comprises the following steps:
step 31) letting the residual function be F(x_l) and H(x_l) = F(x_l) + x_l, where H(x_l) is the mapping learned by the residual network and x_{l+1} is the output of the l-th residual unit;
step 32) when F(x_l) = 0, H(x_l) = x_l; the output of the l-th residual unit can then be written as x_{l+1} = x_l + F'·x_l, where F'·x_l denotes the result of applying the residual function F to x_l;
step 33) designing the serial residual block, which connects the temporal 1D convolution filter and the spatial 2D convolution filter in series; letting the residual function be T(S(x_l)), the output is expressed as x_{l+1} = x_l·(1 + T'S'), where T denotes the temporal 1D filter, S denotes the spatial 2D filter, and T', S' are the results of applying the residual functions T and S respectively;
step 34) designing the parallel residual block, which places the two convolution filters on separate parallel paths so that they influence each other only indirectly and their outputs are summed into the final output; letting the residual function be T(x_l) + S(x_l), the output is expressed as x_{l+1} = x_l·(1 + T' + S');
step 35) designing the serial-parallel residual block, which gives both the spatial filter and the serially connected temporal filter a direct connection to the final output, i.e. adds a shortcut from the spatial branch of the serial block; letting the residual function be S(x_l) + T(S(x_l)), the output is expressed as x_{l+1} = x_l·(1 + T'S' + S').
3. The method for extracting spatio-temporal features of people in video based on a residual network as claimed in claim 1, wherein said step 4) comprises the following steps:
step 41) in order to ensure that the shortcut connects high-dimensional representations, reversing the order of the two point-wise convolutions, a point-wise convolution being a 1 × 1 convolution that extracts features at a single position and yields a feature map;
step 42) letting F ∈ R^(D_f × D_f × M) be the input tensor and G the output tensor of the residual structure, where D_f × D_f × M is the size of the feature map obtained in step 41); ignoring the depth-wise convolution layers and the activation layers, the hourglass structure is expressed as
G = φ_e(φ_r(F)) + F,
where φ_e is the channel-expanding point-wise convolution and φ_r is the channel-reducing point-wise convolution;
step 43) adding depth-wise convolutions at the two ends of the residual path, with the point-wise convolutions placed between the depth-wise convolutions; the hourglass structure can then be expressed as
G = φ_{2,d}(φ_{2,p}(φ_{1,p}(φ_{1,d}(F)))) + F,
where φ_{1,p} is the 1st point-wise convolution, φ_{1,d} is the 1st depth-wise convolution, φ_{2,p} is the 2nd point-wise convolution and φ_{2,d} is the 2nd depth-wise convolution.
4. The method as claimed in claim 1, wherein the step 5) comprises the following steps:
step 51) calling the three residual blocks, after combination with the hourglass structure, the serial hourglass residual structure HRS-I, the parallel hourglass residual structure HRS-II and the serial-parallel hourglass residual structure HRS-III respectively, and replacing all residual units in ResNet-50 with HRS-I, HRS-II and HRS-III respectively to form three new residual networks;
step 52) forming a new hourglass residual structure chain from HRS-I, HRS-II and HRS-III in sequence and replacing all residual units in ResNet-50 with this chain to obtain a fourth new residual network;
step 53) comparing the three new residual networks formed in step 51) with the residual network obtained in step 52) to obtain the residual network with the best performance.
5. The method as claimed in claim 1, wherein in step 6) the best-performing residual network obtained in step 5) is trained efficiently by randomly selecting 5 short clips of 5 seconds each from every video.
6. The method as claimed in claim 1, wherein in step 6), when training the new residual network, the dropout rate is empirically set to 0.1.
7. The method as claimed in claim 1, wherein in step 6), when training the new residual network, the learning rate is empirically initialized to 0.001.
CN202110793379.6A 2021-07-14 2021-07-14 Video character space-time characteristic extraction method based on residual error network Active CN113673559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110793379.6A CN113673559B (en) 2021-07-14 2021-07-14 Video character space-time characteristic extraction method based on residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110793379.6A CN113673559B (en) 2021-07-14 2021-07-14 Video character space-time characteristic extraction method based on residual error network

Publications (2)

Publication Number Publication Date
CN113673559A 2021-11-19
CN113673559B (en) 2023-08-25

Family

ID=78539265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110793379.6A Active CN113673559B (en) 2021-07-14 2021-07-14 Video character space-time characteristic extraction method based on residual error network

Country Status (1)

Country Link
CN (1) CN113673559B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137388A1 (en) * 2016-11-14 2018-05-17 Samsung Electronics Co., Ltd. Method and apparatus for analyzing facial image
CN112149504A (en) * 2020-08-21 2020-12-29 浙江理工大学 Motion video identification method combining residual error network and attention of mixed convolution
CN112348766A (en) * 2020-11-06 2021-02-09 天津大学 Progressive feature stream depth fusion network for surveillance video enhancement
CN112883929A (en) * 2021-03-26 2021-06-01 全球能源互联网研究院有限公司 Online video abnormal behavior detection model training and abnormal detection method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
谈咏东; 王永雄; 陈姝意; 缪银龙: "A (2+1)D multi-spatio-temporal information fusion model and its application in action recognition" ((2+1)D多时空信息融合模型及在行为识别的应用), 信息与控制 (Information and Control), no. 06
郭明祥; 宋全军; 徐湛楠; 董俊; 谢成军: "Human action recognition algorithm based on a 3D residual dense network" (基于三维残差稠密网络的人体行为识别算法), 计算机应用 (Journal of Computer Applications), no. 12

Also Published As

Publication number Publication date
CN113673559B (en) 2023-08-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant