CN112653899B - Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene


Info

Publication number
CN112653899B
Authority
CN
China
Prior art keywords
attention
feature
video
resnest
pooling
Prior art date
Legal status
Active
Application number
CN202011509545.7A
Other languages
Chinese (zh)
Other versions
CN112653899A (en)
Inventor
张菁
康俊鹏
张广朋
卓力
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Application filed by Beijing University of Technology
Priority to CN202011509545.7A
Publication of CN112653899A
Application granted
Publication of CN112653899B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to a method for extracting the features of network live broadcast video in complex scenes based on joint-attention ResNeSt. First, key frames are extracted from the live video to obtain key frame data. To exploit the multi-scale features of the video frames, a parallel path is designed according to the multi-scale structure of the feature pyramid network. The parallel path is constructed from bottom to top and exchanges information with the original main path through transverse and oblique connections, both of which are convolution operations. Considering that the main subjects of most live webcast pictures are people, while the pictures also contain a large amount of redundant information, spatial-channel joint attention is introduced to help focus on the subject features of the picture. Finally, the parallel feature pyramid fused with joint attention is combined with convolutional and pooling layers to construct a ResNeSt feature extraction module, and feature extraction of network live broadcast video in complex scenes is realized by stacking multiple such modules.

Description

Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
Technical Field
The method takes live webcast video in complex scenes as its research object and extracts live webcast video features by combining attention with the ResNeSt network, thereby forming an efficient feature expression of the live webcast video. First, feature convolution is performed on the video key frames using a parallel feature pyramid; during the convolution of the feature pyramid, low-level visual information and high-level semantic information of the video are obtained by introducing a joint attention mechanism; finally, a Split-Attention Residual Network (ResNeSt) is combined to form an efficient feature expression of the network live broadcast video.
Background
With the advent of the we-media era, more and more people have begun to share their lives on the network in the form of live video, and the number of such videos is growing geometrically. Network live broadcasting has a strong ability to attract followers and retain users, brings great convenience for people to acquire information, and brings considerable economic benefits to operators. However, the huge amount of webcast video also poses serious challenges to network information security and supervision. Network live broadcasting has a low entry threshold, a diverse and uneven group of practitioners, a wide variety of broadcast scenes, and complex backgrounds containing large numbers of people, objects and logos. Therefore, how to quickly and efficiently extract and express the features of live webcast video in complex scenes is a fundamental prerequisite for classifying and monitoring such video.
Generally, video feature extraction methods fall into two categories. The first directly uses the low-level visual features of video key frames, including static features such as color, texture and shape, as well as dynamic features such as camera motion and object motion. Because of the diversity of video content, it is difficult to describe all videos with simple low-level visual features and to form a robust visual feature expression. The second extracts and mines high-level semantic features of the video layer by layer from the low-level features by means of deep learning, i.e., the original data space of the video is reduced in dimension through convolution and similar operations, and suitable semantic expression features are selected. Existing research shows that deep learning achieves excellent performance in video feature expression, far superior to traditional methods, because the features extracted by deep learning are more semantic and more abstract, and deep learning is particularly suitable for learning the nonlinear factors in video scenes.
In recent years, many deep learning networks have emerged, among which the Convolutional Neural Network (CNN) is the most popular, with models such as GoogLeNet, VGG-19 and Inception using CNN as the network structure. A milestone in CNN history was the emergence of the ResNet model, which made it possible to train deeper CNN models and thus achieve higher accuracy. A large body of research shows that using ResNet to extract deeper video features yields performance far exceeding that of traditional methods. Building on ResNet, some researchers combined network components of ResNet and Inception to propose the aggregated-transformation deep residual network ResNeXt, which extracts video features with stronger expressive power by adding network branches. When the video scene is simple, the number of people is small and object contours are clear, existing deep learning networks already perform well; however, in complex live video scenes constrained by many scene types, an uncertain number of people and varying lighting conditions, directly applying a deep network makes it difficult to effectively learn the spatio-temporal context information, which limits the accuracy. Recently, a novel Split-Attention Residual Network (Residual Networks with Split-Attention, ResNeSt) was proposed; it incorporates a split attention mechanism that enhances convolutional features across multiple branches and multiple scales, so that its performance exceeds that of its predecessors ResNet and ResNeXt. What remains to be supplemented is the introduction of joint attention.
Therefore, the invention provides a method for extracting the features of network live broadcast video in complex scenes based on joint-attention ResNeSt. First, feature convolution is performed on the video key frames using the constructed parallel feature pyramid, and low-level visual information and high-level semantic information of the video are acquired by introducing a joint attention mechanism; to facilitate modular construction and multi-scale feature fusion, based on the split-attention supervision concept of ResNeSt, the feature pyramid is embedded into ResNeSt, and the extraction and expression of network live video features is finally realized through the convolution operations and feature fusion of ResNeSt.
Disclosure of Invention
Different from existing video feature extraction methods, a pyramid feature extraction model with parallel paths driven by an attention mechanism is designed for live webcast video scenes. To strengthen the multi-scale property of the parallel feature pyramid, a new method for extracting network live broadcast video features in complex scenes is proposed in combination with the ResNeSt network. First, key frames are extracted from the network live video to obtain key frame data. To exploit the multi-scale features of a video frame, a parallel path is designed according to the multi-scale structure of the Feature Pyramid Network (FPN). The parallel path is constructed from bottom to top and exchanges information with the original main path through transverse and oblique connections, both of which are convolution operations. Considering that the main subjects of most live webcast pictures are people, while the pictures also contain a large amount of redundant information, spatial-channel joint attention is introduced to facilitate focusing on the subject features of the picture. Finally, the parallel feature pyramid fused with joint attention is combined with convolutional and pooling layers to construct a ResNeSt feature extraction module, and feature extraction of network live video in complex scenes is realized by stacking multiple such modules.
The main process of the method is shown in accompanying figure 1 and can be divided into the following steps: a parallel structure of the feature pyramid is constructed, and down-sampling results of different scales are connected by horizontal and oblique connections to fuse the multi-scale features of a video frame and improve feature extraction; at the same time, a joint attention mechanism (comprising spatial attention and channel attention) is introduced into each convolution module of the feature pyramid to optimize the weight distribution of the key information in the video; finally, the end of the parallel feature pyramid with joint attention is connected to a pooling layer to complete the construction of the ResNeSt feature extraction module. Through multi-layer stacking of the ResNeSt network, the extraction of video frame features is realized.
1. Parallel feature pyramid construction of live webcast video
The invention constructs a multi-scale parallel feature pyramid structure aimed at the characteristics of network live video, such as the variety of scenes and the high complexity of content. Through parallel connections within the feature pyramid (comprising transverse connections of 1×1 convolution and oblique connections that vary with the number of layers), feature maps of different scales are fused, so that the feature pyramid shortens the paths between shallow and deep information while using shallow spatial localization information to complement deep semantic information, strengthening the effective extraction of video frame features.
As shown in fig. 2, paths C and D correspond to the backbone structure and the parallel structure of the feature pyramid, in which the size of the video frame gradually decreases. Although the semantic information of the video frame is gradually enriched as the network hierarchy deepens, the corresponding low-level visual information is gradually lost. The parallel connections erected between the trunk structure and the parallel structure can fuse spatial information of different scales, thereby enriching the feature extraction capability of the feature pyramid. The parallel connections can be divided into transverse connections and oblique connections: the transverse connections use 1×1 convolution, while the oblique connections use 3×3 convolution to down-sample the feature map, and their step size can be dynamically adjusted to 2(m−n) at different hierarchical depths, where n and m are the level subscripts of the trunk structure C_n and the parallel structure D_m.
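The following is a minimal PyTorch sketch of how such a trunk/parallel pyramid with 1×1 lateral and 3×3 oblique (diagonal) connections could be assembled; the class name ParallelPyramid, the channel widths, and the nearest-neighbour size alignment are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelPyramid(nn.Module):
    """Sketch of the parallel feature pyramid: a trunk path C1..C4 built by
    3x3 stride-2 convolutions, and a parallel path D2..D4 that fuses a 1x1
    lateral connection from C_m with a 3x3 diagonal connection from C1 whose
    stride is 2*(m - n), with n = 1 (the trunk's first level)."""

    def __init__(self, in_ch=3, level_ch=(16, 32, 64, 128)):  # channel widths are assumed
        super().__init__()
        chs = (in_ch,) + tuple(level_ch)
        self.trunk = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1),
                          nn.BatchNorm2d(chs[i + 1]), nn.ReLU(inplace=True))
            for i in range(len(level_ch))])
        # Lateral 1x1 convolutions C_m -> D_m for m = 2..4.
        self.lateral = nn.ModuleList([
            nn.Conv2d(level_ch[m - 1], level_ch[m - 1], 1)
            for m in range(2, len(level_ch) + 1)])
        # Diagonal 3x3 convolutions C1 -> D_m with stride 2*(m - 1), per Table 1.
        self.diagonal = nn.ModuleList([
            nn.Conv2d(level_ch[0], level_ch[m - 1], 3, stride=2 * (m - 1), padding=1)
            for m in range(2, len(level_ch) + 1)])

    def forward(self, x):
        c = []
        for block in self.trunk:
            x = block(x)
            c.append(x)                              # c[0] = C1, ..., c[3] = C4
        d = []
        for idx, (lat, diag) in enumerate(zip(self.lateral, self.diagonal)):
            main = lat(c[idx + 1])                   # lateral connection from C_m
            skip = diag(c[0])                        # diagonal connection from C1
            skip = F.interpolate(skip, size=main.shape[-2:], mode="nearest")
            d.append(main + skip)                    # exchange information between paths
        return c, d

if __name__ == "__main__":
    feats_c, feats_d = ParallelPyramid()(torch.randn(1, 3, 224, 224))
    print([tuple(t.shape) for t in feats_c], [tuple(t.shape) for t in feats_d])
```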
2. Joint attention-driven feature pyramid design
The attention mechanism (Attention) is a signal processing mechanism built to mimic human visual characteristics. Humans can quickly scan a whole scene with their eyes and focus on certain special regions, thereby obtaining more details of the target. Likewise, an attention mechanism can focus on a target region according to human visual characteristics, investing more computing resources near the focus while suppressing redundant information from other regions. In the invention, considering that network live broadcast scenes are relatively complex, a spatial-channel joint attention mechanism is introduced into the parallel feature pyramid in order to make the greatest use of low-level visual features while highlighting high-level semantic features. The spatial attention mechanism simply gives different spatial attention weights to feature maps of different levels, so as to enhance feature extraction of the core objects in a complex scene. Specifically, in the trunk structure and the parallel structure of the parallel feature pyramid, different weights are assigned to different layers. Shallow features are mostly low-level visual information and are less important than deep semantic features, so the weight assignment favors the deep semantic features and attenuates the shallow features.
Based on the above analysis, the invention obtains the attention weight corresponding to each feature map by constructing a weight mapping function. First, average pooling and max pooling operations are applied along the channel axis and their results are concatenated to generate an effective feature descriptor; through this operation, the regions of the video frame that need attention are found, and finally the spatial attention is obtained using a standard convolution. The input feature map F, according to the results of global max pooling and global average pooling over each channel, is fed in turn into a Multi-Layer Perceptron (MLP); the output results are added directly, and the feature map M_s(F) of the spatial attention module is then obtained through a ReLU activation function. The whole process is as follows:
M_s(F) = σ(f^{3×3}([AvgPool(F); MaxPool(F)])) = σ(f^{3×3}([F_avg^s; F_max^s]))    (1)
where M_s represents spatial attention, σ represents the ReLU nonlinear function, and f^{3×3} denotes convolution with a 3×3 kernel. AvgPool denotes average pooling of the previous layer and MaxPool denotes max pooling of the previous layer; F_avg^s is the result of the average pooling and F_max^s is the result of the max pooling. After the spatial attention weight is obtained, channel attention must also be considered, because each video frame contains multiple channels and each channel generates a new signal after passing through a different convolution kernel, producing a large amount of redundant information. The channel attention is computed as follows:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))    (2)
where M_c(F) represents channel attention, σ represents the ReLU nonlinear function, MLP is the multi-layer perceptron matrix operation, and W_0 and W_1 are the weights of the multi-layer perceptron. AvgPool denotes average pooling of the previous layer and MaxPool denotes max pooling of the previous layer; F_avg^c is the result of the average pooling and F_max^c is the result of the max pooling.
The spatial attention and the channel attention weight obtained by calculation are retained to the next layer, and are continuously transmitted and overlapped through the feature extraction module until the end, and the module schematic diagram is shown in fig. 3.
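A minimal sketch of the spatial-channel joint attention of formulas (1) and (2) might look as follows; the module name, the reduction ratio of the shared MLP, and the way the two weights are applied to the feature map are assumptions (note that the patent uses a ReLU, denoted σ, rather than a sigmoid as the final activation).

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """Sketch of the spatial-channel joint attention of formulas (1) and (2).
    Layer names and the reduction ratio are assumptions."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Spatial attention: 3x3 convolution over [AvgPool; MaxPool] along the channel axis.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        # Channel attention: shared MLP (W0, W1) applied to pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # W0
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))   # W1
        self.act = nn.ReLU(inplace=True)                  # sigma in the formulas

    def forward(self, feat):
        # --- spatial attention, formula (1) ---
        avg_s = feat.mean(dim=1, keepdim=True)            # F_avg^s
        max_s = feat.max(dim=1, keepdim=True).values      # F_max^s
        m_s = self.act(self.spatial_conv(torch.cat([avg_s, max_s], dim=1)))
        # --- channel attention, formula (2) ---
        b, c, _, _ = feat.shape
        avg_c = feat.mean(dim=(2, 3))                     # F_avg^c (global average pooling)
        max_c = feat.amax(dim=(2, 3))                     # F_max^c (global max pooling)
        m_c = self.act(self.mlp(avg_c) + self.mlp(max_c)).view(b, c, 1, 1)
        # Apply both weights to the input feature map before passing it on.
        return feat * m_s * m_c

if __name__ == "__main__":
    x = torch.randn(2, 64, 56, 56)
    print(JointAttention(64)(x).shape)    # torch.Size([2, 64, 56, 56])
```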
In the invention, the joint attention formed by spatial attention and channel attention is, for the first time, accumulated and used throughout the parallel feature pyramid, and the weight coefficient of each channel is learned during the whole process, so that the model has better discrimination of the spatial features within each channel. Compared with the traditional approach of applying an attention mechanism only after the last pooling layer of the network, the joint attention mechanism applied layer by layer saves more computational resources and is more conducive to focusing the network on the target regions in a video frame that deserve attention.
3. Live video scene feature expression based on joint attention and ResNeSt network
The split-attention residual network ResNeSt is a novel neural network architecture proposed by the Amazon team in 2020. Unlike other architectures of the ResNet family, the biggest highlight of ResNeSt is its multi-branch split attention module (Split Attention): by introducing the hyper-parameter Radix, the module divides the channels of the input video frame into R groups and applies different feature transformations (Conv + BN + ReLU, etc.) to them, further strengthening the attention weight redistribution mechanism. The ResNet family started from basic residual learning and gradually developed into the ResNeXt network, which integrates the Inception branch structure. The ResNeSt used in the invention adds a Split-Attention module on top of the branch structure of ResNeXt, with the following mathematical expression:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)    (3)
s = F_ex(z, W) = sigmoid(W_3 σ(W_4 z))    (4)
where z represents the feature compressed along the spatial dimensions, F_sq represents the compression (squeeze) function, u_c represents the output of the c-th convolution kernel, i and j index the corresponding elements of the matrix, and H, W and C are the height, width and number of channels of the image; s represents the deep feature map, F_ex represents the feature extraction (excitation) function, σ represents the ReLU nonlinear function, and W_3 and W_4 are dimension-reduction coefficients, two hyper-parameters that can be set manually. The compressed global descriptor z is restored to its original dimension through a bottleneck structure of two fully connected layers with ReLU and sigmoid activations, giving an output of size 1 × 1 × C.
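Formulas (3) and (4) correspond to a squeeze-and-excitation style gate; a brief sketch under that reading is given below, where the sizes of the two fully connected layers standing in for W_4 and W_3 are assumptions.

```python
import torch
import torch.nn as nn

def squeeze_excite(u, w4, w3):
    """Sketch of formulas (3) and (4): squeeze the feature map u (B x C x H x W)
    along its spatial dimensions, then restore the channel dimension through a
    two-layer bottleneck with ReLU and sigmoid. w4 and w3 play the roles of
    W_4 and W_3; their sizes are assumptions."""
    z = u.mean(dim=(2, 3))                        # formula (3): z_c = (1/HW) * sum u_c(i, j)
    s = torch.sigmoid(w3(torch.relu(w4(z))))      # formula (4): sigmoid(W_3 * sigma(W_4 * z))
    return s.unsqueeze(-1).unsqueeze(-1)          # per-sample 1 x 1 x C attention weights

if __name__ == "__main__":
    c, r = 64, 4                                  # channel count and reduction ratio (assumed)
    w4 = nn.Linear(c, c // r)                     # dimension-reduction layer W_4
    w3 = nn.Linear(c // r, c)                     # dimension-restoring layer W_3
    u = torch.randn(2, c, 28, 28)
    print(squeeze_excite(u, w4, w3).shape)        # torch.Size([2, 64, 1, 1])
```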
The invention makes relevant improvements based on the ResNeSt network architecture and the split attention supervision mechanism within it. Through standard convolutions with different step sizes and superposition operations, joint attention weights of different scales are accumulated, realizing cross-feature-map connection of the attention mechanism. In addition, the shallow spatial position information of the trunk structure can also be exploited in the deeper semantic feature maps of the parallel structure, which is beneficial to the extraction and expression of effective features. The parallel feature pyramid integrated with the joint attention module is embedded into ResNeSt, and, combined with the convolutional layer and the pooling layer, the ResNeSt feature extraction module is constructed and the corresponding residual is learned. Image feature extraction of the live webcast video is realized through continuous down-sampling and final feature fusion and convolution calculation; a schematic diagram of the module is shown in fig. 4.
Description of the drawings:
FIG. 1 is a flow chart of a joint attention feature pyramid video feature extraction;
FIG. 2 parallel feature pyramid structure
FIG. 3 is a schematic view of a spatial channel joint attention module
FIG. 4 pyramid module embedded in ResNeSt
Detailed Description
In light of the above description, a specific implementation flow is as follows, but the scope of protection of this patent is not limited to this implementation flow. The following is a specific workflow of the invention:
The video data used by the invention come from several network video platforms, and key frames are extracted from the various downloaded live videos. In the experiments, key frames are sampled at 5 fps, only one continuous 16-frame clip is taken to represent a video, and 224 × 224 pixel video frame data are obtained through Resize pre-processing. The video frame data are fed into the feature pyramid for down-sampling to obtain feature maps of different scales; then the attention weight distribution of the multi-scale features is obtained by computing the joint attention mechanism; finally, convolution and pooling operations are combined to build the ResNeSt module, and a ResNeSt50 feature extraction network is constructed through superposition of 50 ResNeSt modules, realizing feature extraction of the live webcast video.
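A simple sketch of this pre-processing step (5 fps sampling, one 16-frame clip, 224 × 224 resize) could look as follows; the use of OpenCV and the function name extract_keyframes are assumptions, not the patent's specified tooling.

```python
import cv2
import numpy as np

def extract_keyframes(video_path, fps_sample=5, clip_len=16, size=(224, 224)):
    """Sketch of the pre-processing described above: sample key frames at 5 fps,
    keep one continuous 16-frame clip to represent the video, and resize each
    frame to 224 x 224 pixels. Function name and return layout are assumptions."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or fps_sample
    step = max(int(round(native_fps / fps_sample)), 1)   # frame interval for 5 fps sampling
    frames, idx = [], 0
    while len(frames) < clip_len:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                               # keep one frame every `step` frames
            frames.append(cv2.resize(frame, size))        # Resize pre-processing
        idx += 1
    cap.release()
    return np.stack(frames) if frames else np.empty((0, *size, 3), dtype=np.uint8)
```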
1. Parallel feature pyramid construction of live webcast video
A parallel multi-scale feature pyramid structure is constructed based on the multi-scale structure of the feature pyramid. After the bottom-layer image is input into the main structure of the feature pyramid, the main path is convolved and down-sampled from bottom to top, and convolution kernels of different sizes and step lengths can be used according to different requirements. The invention constructs a multi-scale parallel feature pyramid structure and performs down-sampling in the trunk path with 3×3 convolution of stride 2, obtaining feature maps of gradually decreasing scale. At the same time, a parallel structure similar to the trunk structure is obtained using parallel connections, which comprise 1×1 transverse convolutions and dynamic oblique connections that automatically adjust their step size according to the level. Taking 4 convolutional layers as an example, after a 224 × 224 pixel video frame is input, the oblique connection step sizes between the first-layer image of the main structure and the other layers of the parallel structure can be obtained from the parallel connections and calculated according to the above formula, as shown in Table 1.
TABLE 1 Step sizes corresponding to different layers

Hierarchy level    Step size
C1-D2              2
C1-D3              4
C1-D4              6
After the corresponding convolution step length in table 1 is obtained, the convolution feature map between the trunk structure and the parallel structure can be calculated by the transverse convolution of 1 × 1 and the oblique convolution of 3 × 3.
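As a quick numerical check of Table 1, the following snippet (assuming C1 is the 112 × 112 map obtained from a 224 × 224 frame, with an illustrative channel count of 16) applies 3×3 oblique convolutions with strides 2, 4 and 6 to C1 and prints the resulting spatial sizes.

```python
import torch
import torch.nn as nn

# Illustrative check that the diagonal 3x3 convolutions with the strides of
# Table 1 bring C1 down to scales comparable to D2, D3 and D4.
c1 = torch.randn(1, 16, 112, 112)                  # channel count assumed
for level, stride in {"C1-D2": 2, "C1-D3": 4, "C1-D4": 6}.items():
    out = nn.Conv2d(16, 16, kernel_size=3, stride=stride, padding=1)(c1)
    print(level, "->", tuple(out.shape[-2:]))      # (56, 56), (28, 28), (19, 19)
```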
2. Joint attention-driven feature pyramid design
After the parallel feature pyramid is constructed, a joint attention mechanism may be introduced to save resources by focusing on the regions of interest in the feature maps. First, consider spatial attention, which is used to find the regions of interest in the image in both the backbone structure and the parallel structure of the invention. For the first layer in this example, a 224 × 224 pixel image is convolved with a 3×3 convolution of stride 2 to obtain a 112 × 112 image. Average pooling and max pooling operations are applied layer by layer, from bottom to top, along the channel axis and concatenated to generate an effective feature descriptor, as in formula (1); this operation combines the channel information. The input feature map F, according to the results of max pooling and average pooling over each channel, is fed in turn into the multi-layer perceptron and the outputs are added, and the spatial attention information M_s of the layer is then obtained by means of a ReLU activation function.
The channel attention is then computed. According to the relevant empirical references, the number of channels r of each layer in this example is set to 3, 16, 64 and 128, respectively. After the input first-layer feature map F obtains its 3 channels, the max pooling and average pooling results of each channel are taken in turn and the outputs of the multi-layer perceptron are added according to formula (2). Finally, the feature map M_c of the corresponding channel attention module is obtained through a ReLU activation function.
The spatial attention and channel attention weights obtained above are retained down to the last layer of the ResNeSt module, and a feature map integrating multi-scale spatial attention and the attention of each channel is finally obtained through the modular stacking of the ResNeSt network. In the invention, the joint attention formed by spatial attention and channel attention is, for the first time, accumulated and used throughout the parallel feature pyramid, and the weight coefficient of each channel is learned during the whole process, so that the model has better discrimination of the spatial features within each channel. Compared with the traditional approach of applying an attention mechanism only after the last pooling layer of the network, the joint attention mechanism applied layer by layer saves more computational resources and is more conducive to focusing the network on the target regions in a video frame that deserve attention.
3. Live video scene feature expression based on joint attention and ResNeSt network
When embedding into the ResNeSt network, the joint attention feature pyramid is embedded according to the analogous structure of the split attention module inside ResNeSt. Using the branch structure of ResNeSt, the hyper-parameter Radix is set to 2, i.e., ResNeSt is divided into two branches to match the main structure and the parallel structure of the feature pyramid. The spatial attention of the trunk structure is extracted layer by layer, and each layer of the parallel structure first superimposes the oblique convolution of the trunk structure and then extracts the channel attention and the feature attention in turn. Through the different feature transformation operations of the two branches, feature maps of different scales and different channel characteristics are finally obtained. The feature map is taken as z in formula (4) and fed into the split attention module, with the dimension-reduction coefficients W_3 and W_4 set to 1, so the final feature map containing the attention weights can be obtained. The calculation process is as follows:
s = F_ex(z, W) = sigmoid(σ(z))    (5)
where z is the feature map output by the feature pyramid, and W is the weight obtained by full connection after the ReLU activation layer, with dimension (C × C)/r; the ReLU nonlinear function σ and the sigmoid function are then applied, and the feature map s containing the split attention weights is finally obtained.
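A compact sketch of this radix-2 embedding with the simplified gate of formula (5) is shown below; the two-branch layout (Conv + BN + ReLU per branch) follows the text, while the class name and channel width are assumptions.

```python
import torch
import torch.nn as nn

class RadixTwoSplitAttention(nn.Module):
    """Sketch of the radix-2 split attention embedding described above: two
    branches (matching the trunk and parallel structures of the pyramid) are
    fused and weighted with the simplified gate of formula (5),
    s = sigmoid(sigma(z)), since W_3 and W_4 are set to 1."""

    def __init__(self, channels):
        super().__init__()
        # Two feature transformations (Conv + BN + ReLU), one per branch.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for _ in range(2)])

    def forward(self, trunk_feat, parallel_feat):
        u = self.branches[0](trunk_feat) + self.branches[1](parallel_feat)
        z = u.mean(dim=(2, 3), keepdim=True)        # squeeze, formula (3)
        s = torch.sigmoid(torch.relu(z))            # simplified gate, formula (5)
        return u * s                                # feature map weighted by split attention

if __name__ == "__main__":
    block = RadixTwoSplitAttention(64)
    out = block(torch.randn(1, 64, 28, 28), torch.randn(1, 64, 28, 28))
    print(out.shape)                                # torch.Size([1, 64, 28, 28])
```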
The parallel feature pyramid integrated with the joint attention module is embedded into ResNeSt, and, combined with the convolutional layer and the pooling layer, the ResNeSt feature extraction module is constructed and the corresponding residual is learned. Through continuous down-sampling and the final feature fusion and convolution calculation, the deep feature map s is obtained, thereby realizing the extraction of the frame-level features of the final live webcast video.

Claims (1)

1. A method for extracting network live broadcast video features under a complex scene based on joint attention ResNeSt is characterized by comprising the following steps:
1) parallel feature pyramid construction of live webcast video
Constructing a multi-scale parallel feature pyramid structure; fusing feature maps of different scales through parallel connections in the feature pyramid, wherein the parallel connections comprise transverse connections of 1×1 convolution and oblique connections that change with the number of layers;
the transverse connections in the parallel connections use 1×1 convolution; the oblique connections use 3×3 convolution to down-sample the feature map, and their step size can be dynamically adjusted to 2(m−n), where n and m are the level subscripts of the trunk structure C_n and the parallel structure D_m;
2) joint attention-driven feature pyramid design
firstly, applying average pooling and max pooling operations along the channel axis and concatenating them to generate an effective feature descriptor, finding the regions of the video frame information that need attention through these operations, and finally obtaining the spatial attention using a standard convolution; the input feature map F, according to the max pooling result and the average pooling result of each channel, is fed in turn into a multi-layer perceptron MLP, the output results are added directly, and the feature map M_s(F) of the spatial attention module is then obtained through a ReLU activation function, the whole process being as follows:
M_s(F) = σ(f^{3×3}([AvgPool(F); MaxPool(F)])) = σ(f^{3×3}([F_avg^s; F_max^s]))    (1)
where M_s represents spatial attention, σ represents the ReLU nonlinear function, and f^{3×3} denotes convolution with a 3×3 kernel; AvgPool represents average pooling of the previous layer, and MaxPool represents max pooling of the previous layer; F_avg^s is the result of the average pooling, and F_max^s is the result of the max pooling;
after obtaining the spatial attention weight, the channel attention is further considered as follows:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))    (2)
where M_c(F) represents channel attention, σ represents the ReLU nonlinear function, MLP is the multi-layer perceptron matrix operation, and W_0 and W_1 are the weights of the multi-layer perceptron; AvgPool represents average pooling of the previous layer, and MaxPool represents max pooling of the previous layer; F_avg^c is the result of the average pooling, and F_max^c is the result of the max pooling;
the space attention and the channel attention weight obtained by calculation are reserved to the next layer, and are continuously transmitted and superposed until the end through a feature extraction module;
3) live video scene feature expression based on joint attention and ResNeSt network
ResNeSt adds a split attention module on the basis of the branch structure of ResNeXt, and the mathematical expression is as follows:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)    (3)
s = F_ex(z, W) = sigmoid(W_3 σ(W_4 z))    (4)
where z represents the feature compressed along the spatial dimensions, F_sq represents the compression function, u_c represents the output of the c-th convolution kernel, i and j index the corresponding elements of the matrix, and H, W and C are the height, width and number of channels of the image; s represents the deep feature map, F_ex represents the feature extraction function, σ represents the ReLU nonlinear function, and W_3 and W_4 are dimension-reduction coefficients, two hyper-parameters that can be set manually; the compressed global descriptor z is restored to its original dimension through a bottleneck structure of two fully connected layers with ReLU and sigmoid activations, giving an output of size 1 × 1 × C.
CN202011509545.7A 2020-12-18 2020-12-18 Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene Active CN112653899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011509545.7A CN112653899B (en) 2020-12-18 2020-12-18 Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011509545.7A CN112653899B (en) 2020-12-18 2020-12-18 Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene

Publications (2)

Publication Number Publication Date
CN112653899A CN112653899A (en) 2021-04-13
CN112653899B true CN112653899B (en) 2022-07-12

Family

ID=75355082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011509545.7A Active CN112653899B (en) 2020-12-18 2020-12-18 Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene

Country Status (1)

Country Link
CN (1) CN112653899B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113284093A (en) * 2021-04-29 2021-08-20 安徽省皖北煤电集团有限责任公司 Satellite image cloud detection method based on improved D-LinkNet
CN113539297A (en) * 2021-07-08 2021-10-22 中国海洋大学 Combined attention mechanism model and method for sound classification and application
CN113378791B (en) * 2021-07-09 2022-08-05 合肥工业大学 Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion
CN113673340B (en) * 2021-07-16 2024-05-10 北京农业信息技术研究中心 Pest type image identification method and system
CN113658114A (en) * 2021-07-29 2021-11-16 南京理工大学 Contact net opening pin defect target detection method based on multi-scale cross attention
CN113592718A (en) * 2021-08-12 2021-11-02 中国矿业大学 Mine image super-resolution reconstruction method and system based on multi-scale residual error network
CN114005025A (en) * 2021-09-28 2022-02-01 山东农业大学 Automatic monitoring method, device and system for high-point panoramic color-changing tree
CN114237089A (en) * 2021-11-15 2022-03-25 成都壹为新能源汽车有限公司 Intelligent loading control system for washing and sweeping operation vehicle
CN114092833B (en) * 2022-01-24 2022-05-27 长沙理工大学 Remote sensing image classification method and device, computer equipment and storage medium
CN114627086B (en) * 2022-03-18 2023-04-28 江苏省特种设备安全监督检验研究院 Crane surface damage detection method based on characteristic pyramid network
CN117612168B (en) * 2023-11-29 2024-11-05 湖南工商大学 Recognition method and device based on feature pyramid and attention mechanism
CN117468085B (en) * 2023-12-27 2024-05-28 浙江晶盛机电股份有限公司 Crystal bar growth control method and device, crystal growth furnace system and computer equipment
CN117496160B (en) * 2023-12-29 2024-03-19 中国民用航空飞行学院 Indoor scene-oriented semantic segmentation method for low-illumination image shot by unmanned aerial vehicle

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902693A (en) * 2019-02-16 2019-06-18 太原理工大学 One kind being based on more attention spatial pyramid characteristic image recognition methods
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
WO2020113886A1 (en) * 2018-12-07 2020-06-11 中国科学院自动化研究所 Behavior feature extraction method, system and apparatus based on time-space/frequency domain hybrid learning
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111626159A (en) * 2020-05-15 2020-09-04 南京邮电大学 Human body key point detection method based on attention residual error module and branch fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
WO2020113886A1 (en) * 2018-12-07 2020-06-11 中国科学院自动化研究所 Behavior feature extraction method, system and apparatus based on time-space/frequency domain hybrid learning
CN109902693A (en) * 2019-02-16 2019-06-18 太原理工大学 One kind being based on more attention spatial pyramid characteristic image recognition methods
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111626159A (en) * 2020-05-15 2020-09-04 南京邮电大学 Human body key point detection method based on attention residual error module and branch fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Large-scale video semantic recognition algorithm based on long-short-term prediction consistency; Wang Zheng et al.; Scientia Sinica Informationis; 2020-06-12; Vol. 50, No. 6; full text *

Also Published As

Publication number Publication date
CN112653899A (en) 2021-04-13

Similar Documents

Publication Publication Date Title
CN112653899B (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
He et al. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline
CN113362223B (en) Image super-resolution reconstruction method based on attention mechanism and two-channel network
CN110555434B (en) Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
Liu et al. Learning temporal dynamics for video super-resolution: A deep learning approach
CN110059598B (en) Long-term fast-slow network fusion behavior identification method based on attitude joint points
CN109410239A (en) A kind of text image super resolution ratio reconstruction method generating confrontation network based on condition
CN108537754B (en) Face image restoration system based on deformation guide picture
CN110443842A (en) Depth map prediction technique based on visual angle fusion
CN111835983B (en) Multi-exposure-image high-dynamic-range imaging method and system based on generation countermeasure network
CN112750201B (en) Three-dimensional reconstruction method, related device and equipment
CN114820341A (en) Image blind denoising method and system based on enhanced transform
CN113538243B (en) Super-resolution image reconstruction method based on multi-parallax attention module combination
Li et al. Uphdr-gan: Generative adversarial network for high dynamic range imaging with unpaired data
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN111950477A (en) Single-image three-dimensional face reconstruction method based on video surveillance
CN113065645A (en) Twin attention network, image processing method and device
CN111242181B (en) RGB-D saliency object detector based on image semantics and detail
CN117173024B (en) Mine image super-resolution reconstruction system and method based on overall attention
CN112017116A (en) Image super-resolution reconstruction network based on asymmetric convolution and construction method thereof
CN108875751A (en) Image processing method and device, the training method of neural network, storage medium
CN107067452A (en) A kind of film 2D based on full convolutional neural networks turns 3D methods
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
CN105374010A (en) A panoramic image generation method
Huang et al. Temporally-aggregating multiple-discontinuous-image saliency prediction with transformer-based attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant