CN112653899B - Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene


Info

Publication number
CN112653899B
Authority
CN
China
Prior art keywords
attention
feature
video
resnest
pooling
Prior art date
Legal status
Active
Application number
CN202011509545.7A
Other languages
Chinese (zh)
Other versions
CN112653899A (en)
Inventor
张菁
康俊鹏
张广朋
卓力
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Application filed by Beijing University of Technology
Priority to CN202011509545.7A
Publication of CN112653899A
Application granted
Publication of CN112653899B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to a method for extracting the features of network live broadcast video in complex scenes based on joint-attention ResNeSt. First, key frames are extracted from the live video to obtain key frame data. To exploit the multi-scale features of the video frames, a parallel path is designed according to the multi-scale structure of the feature pyramid network. The parallel path is constructed from bottom to top and exchanges information with the original main path through transverse and oblique connections, both of which are convolution operations. Considering that the main subjects of most live webcast pictures are people, while the pictures also contain a large amount of redundant information, spatial-channel joint attention is introduced to help focus on the subject features of the picture. Finally, the parallel feature pyramid fused with joint attention is combined with convolutional and pooling layers to construct a ResNeSt feature extraction module, and feature extraction of network live broadcast video in complex scenes is realized by stacking multiple such modules.

Description

Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
Technical Field
The method takes live webcast video in complex scenes as its research object and extracts live webcast video features by combining attention with the ResNeSt network, thereby forming an efficient feature expression of the live webcast video. First, feature convolution is performed on the video key frames using a parallel feature pyramid; during the convolution of the feature pyramid, low-level visual information and high-level semantic information of the video are obtained by introducing a joint attention mechanism; finally, a Split-Attention Residual Network (ResNeSt) is combined to form an efficient feature expression of the network live broadcast video.
Background
With the advent of the we-media era, more and more people have begun to share their lives on the network in the form of live video, and the number of such videos is growing geometrically. Network live broadcasting has a strong ability to attract followers and retain users, brings great convenience for people to acquire information, and brings considerable economic benefits to operators. However, the huge amount of webcast video also poses serious challenges to network information security and supervision. Network live broadcasting has a low entry threshold, a diverse and uneven group of practitioners, a wide variety of broadcast scenes, and complex backgrounds containing large numbers of people, objects and logos. Therefore, how to quickly and efficiently extract and express the features of live webcast video in complex scenes is a fundamental prerequisite for classifying and monitoring such video.
Generally, video feature extraction methods fall into two categories. The first directly uses the low-level visual features of video key frames, including static features such as color, texture and shape, as well as dynamic features such as camera motion and object motion. Because of the diversity of video content, it is difficult to describe all videos with simple low-level visual features and to form a robust visual feature expression. The second extracts and mines high-level semantic features of the video layer by layer from the low-level features by means of deep learning, i.e., the original data space of the video is reduced in dimension through convolution and similar operations, and suitable semantic expression features are selected. Existing research shows that deep learning achieves excellent performance in video feature expression, far superior to traditional methods, because the features extracted by deep learning are more semantic and more abstract, and deep learning is particularly suitable for learning the nonlinear factors in video scenes.
In recent years, many deep learning networks have emerged, among which the Convolutional Neural Network (CNN) is the most popular, with models such as GoogLeNet, VGG-19 and Inception using CNN as the network structure. A milestone in CNN history was the emergence of the ResNet model, which made it possible to train deeper CNN models and thus achieve higher accuracy. A large body of research shows that using ResNet to extract deeper video features yields performance far exceeding that of traditional methods. Building on ResNet, some researchers combined network components of ResNet and Inception to propose the aggregated-transformation deep residual network ResNeXt, which extracts video features with stronger expressive power by adding network branches. When the video scene is simple, the number of people is small and object contours are clear, existing deep learning networks already perform well; however, in complex live video scenes constrained by many scene types, an uncertain number of people and varying lighting conditions, directly applying a deep network makes it difficult to effectively learn the spatio-temporal context information, which limits the accuracy. Recently, a novel Split-Attention Residual Network (Residual Networks with Split-Attention, ResNeSt) was proposed; it incorporates a split attention mechanism that enhances convolutional features across multiple branches and multiple scales, so that its performance exceeds that of its predecessors ResNet and ResNeXt. What remains to be supplemented is the introduction of joint attention.
Therefore, the invention provides a method for extracting the features of network live broadcast video in complex scenes based on joint-attention ResNeSt. First, feature convolution is performed on the video key frames using the constructed parallel feature pyramid, and low-level visual information and high-level semantic information of the video are acquired by introducing a joint attention mechanism; to facilitate modular construction and multi-scale feature fusion, based on the split-attention supervision concept of ResNeSt, the feature pyramid is embedded into ResNeSt, and the extraction and expression of network live video features is finally realized through the convolution operations and feature fusion of ResNeSt.
Disclosure of Invention
Different from existing video feature extraction methods, a pyramid feature extraction model with parallel paths driven by an attention mechanism is designed for live webcast video scenes. To strengthen the multi-scale property of the parallel feature pyramid, a new method for extracting network live broadcast video features in complex scenes is proposed in combination with the ResNeSt network. First, key frames are extracted from the network live video to obtain key frame data. To exploit the multi-scale features of a video frame, a parallel path is designed according to the multi-scale structure of the Feature Pyramid Network (FPN). The parallel path is constructed from bottom to top and exchanges information with the original main path through transverse and oblique connections, both of which are convolution operations. Considering that the main subjects of most live webcast pictures are people, while the pictures also contain a large amount of redundant information, spatial-channel joint attention is introduced to facilitate focusing on the subject features of the picture. Finally, the parallel feature pyramid fused with joint attention is combined with convolutional and pooling layers to construct a ResNeSt feature extraction module, and feature extraction of network live video in complex scenes is realized by stacking multiple such modules.
The main process of the method is shown in accompanying figure 1 and can be divided into the following steps: a parallel structure of the feature pyramid is constructed, and down-sampling results of different scales are connected by horizontal and oblique connections to fuse the multi-scale features of a video frame and improve feature extraction; at the same time, a joint attention mechanism (comprising spatial attention and channel attention) is introduced into each convolution module of the feature pyramid to optimize the weight distribution of the key information in the video; finally, the end of the parallel feature pyramid with joint attention is connected to a pooling layer to complete the construction of the ResNeSt feature extraction module. Through multi-layer stacking of the ResNeSt network, the extraction of video frame features is realized.
1. Parallel feature pyramid construction of live webcast video
The invention constructs a multi-scale parallel feature pyramid structure aimed at the characteristics of network live video, such as the variety of scenes and the high complexity of content. Through parallel connections within the feature pyramid (comprising transverse connections of 1×1 convolution and oblique connections that vary with the number of layers), feature maps of different scales are fused, so that the feature pyramid shortens the paths between shallow and deep information while using shallow spatial localization information to complement deep semantic information, strengthening the effective extraction of video frame features.
As shown in fig. 2, paths C and D correspond to the backbone structure and the parallel structure of the feature pyramid, in which the size of the video frame gradually decreases. Although the semantic information of the video frame is gradually enriched as the network hierarchy deepens, the corresponding low-level visual information is gradually lost. The parallel connections erected between the trunk structure and the parallel structure can fuse spatial information of different scales, thereby enriching the feature extraction capability of the feature pyramid. The parallel connections can be divided into transverse connections and oblique connections: the transverse connections use 1×1 convolution, while the oblique connections use 3×3 convolution to down-sample the feature map, and their step size can be dynamically adjusted to 2(m−n) at different hierarchical depths, where n and m are the level subscripts of the trunk structure C_n and the parallel structure D_m.
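The following is a minimal PyTorch sketch of how such a trunk/parallel pyramid with 1×1 lateral and 3×3 oblique (diagonal) connections could be assembled; the class name ParallelPyramid, the channel widths, and the nearest-neighbour size alignment are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelPyramid(nn.Module):
    """Sketch of the parallel feature pyramid: a trunk path C1..C4 built by
    3x3 stride-2 convolutions, and a parallel path D2..D4 that fuses a 1x1
    lateral connection from C_m with a 3x3 diagonal connection from C1 whose
    stride is 2*(m - n), with n = 1 (the trunk's first level)."""

    def __init__(self, in_ch=3, level_ch=(16, 32, 64, 128)):  # channel widths are assumed
        super().__init__()
        chs = (in_ch,) + tuple(level_ch)
        self.trunk = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1),
                          nn.BatchNorm2d(chs[i + 1]), nn.ReLU(inplace=True))
            for i in range(len(level_ch))])
        # Lateral 1x1 convolutions C_m -> D_m for m = 2..4.
        self.lateral = nn.ModuleList([
            nn.Conv2d(level_ch[m - 1], level_ch[m - 1], 1)
            for m in range(2, len(level_ch) + 1)])
        # Diagonal 3x3 convolutions C1 -> D_m with stride 2*(m - 1), per Table 1.
        self.diagonal = nn.ModuleList([
            nn.Conv2d(level_ch[0], level_ch[m - 1], 3, stride=2 * (m - 1), padding=1)
            for m in range(2, len(level_ch) + 1)])

    def forward(self, x):
        c = []
        for block in self.trunk:
            x = block(x)
            c.append(x)                              # c[0] = C1, ..., c[3] = C4
        d = []
        for idx, (lat, diag) in enumerate(zip(self.lateral, self.diagonal)):
            main = lat(c[idx + 1])                   # lateral connection from C_m
            skip = diag(c[0])                        # diagonal connection from C1
            skip = F.interpolate(skip, size=main.shape[-2:], mode="nearest")
            d.append(main + skip)                    # exchange information between paths
        return c, d

if __name__ == "__main__":
    feats_c, feats_d = ParallelPyramid()(torch.randn(1, 3, 224, 224))
    print([tuple(t.shape) for t in feats_c], [tuple(t.shape) for t in feats_d])
```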
2. Joint attention-driven feature pyramid design
The attention mechanism (Attention) is a signal processing mechanism built to mimic human visual characteristics. Humans can quickly scan a whole scene with their eyes and focus on certain special regions, thereby obtaining more details of the target. Likewise, an attention mechanism can focus on a target region according to human visual characteristics, investing more computing resources near the focus while suppressing redundant information from other regions. In the invention, considering that network live broadcast scenes are relatively complex, a spatial-channel joint attention mechanism is introduced into the parallel feature pyramid in order to make the greatest use of low-level visual features while highlighting high-level semantic features. The spatial attention mechanism simply gives different spatial attention weights to feature maps of different levels, so as to enhance feature extraction of the core objects in a complex scene. Specifically, in the trunk structure and the parallel structure of the parallel feature pyramid, different weights are assigned to different layers. Shallow features are mostly low-level visual information and are less important than deep semantic features, so the weight assignment favors the deep semantic features and attenuates the shallow features.
Based on the above analysis, the invention obtains the attention weight corresponding to each feature map by constructing a weight mapping function. First, average pooling and max pooling operations are applied along the channel axis and their results are concatenated to generate an effective feature descriptor; through this operation, the regions of the video frame that need attention are found, and finally the spatial attention is obtained using a standard convolution. The input feature map F, according to the results of global max pooling and global average pooling over each channel, is fed in turn into a Multi-Layer Perceptron (MLP); the output results are added directly, and the feature map M_s(F) of the spatial attention module is then obtained through a ReLU activation function. The whole process is as follows:
M_s(F) = σ(f^{3×3}([AvgPool(F); MaxPool(F)])) = σ(f^{3×3}([F_avg^s; F_max^s]))    (1)
where M_s represents spatial attention, σ represents the ReLU nonlinear function, and f^{3×3} denotes convolution with a 3×3 kernel. AvgPool denotes average pooling of the previous layer and MaxPool denotes max pooling of the previous layer; F_avg^s is the result of the average pooling and F_max^s is the result of the max pooling. After the spatial attention weight is obtained, channel attention must also be considered, because each video frame contains multiple channels and each channel generates a new signal after passing through a different convolution kernel, producing a large amount of redundant information. The channel attention is computed as follows:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))    (2)
where M_c(F) represents channel attention, σ represents the ReLU nonlinear function, MLP is the multi-layer perceptron matrix operation, and W_0 and W_1 are the weights of the multi-layer perceptron. AvgPool denotes average pooling of the previous layer and MaxPool denotes max pooling of the previous layer; F_avg^c is the result of the average pooling and F_max^c is the result of the max pooling.
The spatial attention and the channel attention weight obtained by calculation are retained to the next layer, and are continuously transmitted and overlapped through the feature extraction module until the end, and the module schematic diagram is shown in fig. 3.
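A minimal sketch of the spatial-channel joint attention of formulas (1) and (2) might look as follows; the module name, the reduction ratio of the shared MLP, and the way the two weights are applied to the feature map are assumptions (note that the patent uses a ReLU, denoted σ, rather than a sigmoid as the final activation).

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """Sketch of the spatial-channel joint attention of formulas (1) and (2).
    Layer names and the reduction ratio are assumptions."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Spatial attention: 3x3 convolution over [AvgPool; MaxPool] along the channel axis.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        # Channel attention: shared MLP (W0, W1) applied to pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # W0
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))   # W1
        self.act = nn.ReLU(inplace=True)                  # sigma in the formulas

    def forward(self, feat):
        # --- spatial attention, formula (1) ---
        avg_s = feat.mean(dim=1, keepdim=True)            # F_avg^s
        max_s = feat.max(dim=1, keepdim=True).values      # F_max^s
        m_s = self.act(self.spatial_conv(torch.cat([avg_s, max_s], dim=1)))
        # --- channel attention, formula (2) ---
        b, c, _, _ = feat.shape
        avg_c = feat.mean(dim=(2, 3))                     # F_avg^c (global average pooling)
        max_c = feat.amax(dim=(2, 3))                     # F_max^c (global max pooling)
        m_c = self.act(self.mlp(avg_c) + self.mlp(max_c)).view(b, c, 1, 1)
        # Apply both weights to the input feature map before passing it on.
        return feat * m_s * m_c

if __name__ == "__main__":
    x = torch.randn(2, 64, 56, 56)
    print(JointAttention(64)(x).shape)    # torch.Size([2, 64, 56, 56])
```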
In the invention, the joint attention formed by spatial attention and channel attention is, for the first time, accumulated and used throughout the parallel feature pyramid, and the weight coefficient of each channel is learned during the whole process, so that the model has better discrimination of the spatial features within each channel. Compared with the traditional approach of applying an attention mechanism only after the last pooling layer of the network, the joint attention mechanism applied layer by layer saves more computational resources and is more conducive to focusing the network on the target regions in a video frame that deserve attention.
3. Live video scene feature expression based on joint attention and ResNeSt network
The split-attention residual network ResNeSt is a novel neural network architecture proposed by the Amazon team in 2020. Unlike other architectures of the ResNet family, the biggest highlight of ResNeSt is its multi-branch split attention module (Split Attention): by introducing the hyper-parameter Radix, the module divides the channels of the input video frame into R groups and applies different feature transformations (Conv + BN + ReLU, etc.) to them, further strengthening the attention weight redistribution mechanism. The ResNet family started from basic residual learning and gradually developed into the ResNeXt network, which integrates the Inception branch structure. The ResNeSt used in the invention adds a Split-Attention module on top of the branch structure of ResNeXt, with the following mathematical expression:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)    (3)
s = F_ex(z, W) = sigmoid(W_3 σ(W_4 z))    (4)
where z represents the feature compressed along the spatial dimensions, F_sq represents the compression (squeeze) function, u_c represents the output of the c-th convolution kernel, i and j index the corresponding elements of the matrix, and H, W and C are the height, width and number of channels of the image; s represents the deep feature map, F_ex represents the feature extraction (excitation) function, σ represents the ReLU nonlinear function, and W_3 and W_4 are dimension-reduction coefficients, two hyper-parameters that can be set manually. The compressed global descriptor z is restored to its original dimension through a bottleneck structure of two fully connected layers with ReLU and sigmoid activations, giving an output of size 1 × 1 × C.
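Formulas (3) and (4) correspond to a squeeze-and-excitation style gate; a brief sketch under that reading is given below, where the sizes of the two fully connected layers standing in for W_4 and W_3 are assumptions.

```python
import torch
import torch.nn as nn

def squeeze_excite(u, w4, w3):
    """Sketch of formulas (3) and (4): squeeze the feature map u (B x C x H x W)
    along its spatial dimensions, then restore the channel dimension through a
    two-layer bottleneck with ReLU and sigmoid. w4 and w3 play the roles of
    W_4 and W_3; their sizes are assumptions."""
    z = u.mean(dim=(2, 3))                        # formula (3): z_c = (1/HW) * sum u_c(i, j)
    s = torch.sigmoid(w3(torch.relu(w4(z))))      # formula (4): sigmoid(W_3 * sigma(W_4 * z))
    return s.unsqueeze(-1).unsqueeze(-1)          # per-sample 1 x 1 x C attention weights

if __name__ == "__main__":
    c, r = 64, 4                                  # channel count and reduction ratio (assumed)
    w4 = nn.Linear(c, c // r)                     # dimension-reduction layer W_4
    w3 = nn.Linear(c // r, c)                     # dimension-restoring layer W_3
    u = torch.randn(2, c, 28, 28)
    print(squeeze_excite(u, w4, w3).shape)        # torch.Size([2, 64, 1, 1])
```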
The invention makes relevant improvements based on the ResNeSt network architecture and the split attention supervision mechanism within it. Through standard convolutions with different step sizes and superposition operations, joint attention weights of different scales are accumulated, realizing cross-feature-map connection of the attention mechanism. In addition, the shallow spatial position information of the trunk structure can also be exploited in the deeper semantic feature maps of the parallel structure, which is beneficial to the extraction and expression of effective features. The parallel feature pyramid integrated with the joint attention module is embedded into ResNeSt, and, combined with the convolutional layer and the pooling layer, the ResNeSt feature extraction module is constructed and the corresponding residual is learned. Image feature extraction of the live webcast video is realized through continuous down-sampling and final feature fusion and convolution calculation; a schematic diagram of the module is shown in fig. 4.
Description of the drawings:
FIG. 1 is a flow chart of a joint attention feature pyramid video feature extraction;
FIG. 2 parallel feature pyramid structure
FIG. 3 is a schematic view of a spatial channel joint attention module
FIG. 4 pyramid module embedded in ResNeSt
Detailed Description
In light of the above description, a specific implementation flow is as follows, but the scope of protection of this patent is not limited to this implementation flow. The following is a specific workflow of the invention:
The video data used by the invention come from several network video platforms, and key frames are extracted from the various downloaded live videos. In the experiments, key frames are sampled at 5 fps, only one continuous 16-frame clip is taken to represent a video, and 224 × 224 pixel video frame data are obtained through Resize pre-processing. The video frame data are fed into the feature pyramid for down-sampling to obtain feature maps of different scales; then the attention weight distribution of the multi-scale features is obtained by computing the joint attention mechanism; finally, convolution and pooling operations are combined to build the ResNeSt module, and a ResNeSt50 feature extraction network is constructed through superposition of 50 ResNeSt modules, realizing feature extraction of the live webcast video.
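A simple sketch of this pre-processing step (5 fps sampling, one 16-frame clip, 224 × 224 resize) could look as follows; the use of OpenCV and the function name extract_keyframes are assumptions, not the patent's specified tooling.

```python
import cv2
import numpy as np

def extract_keyframes(video_path, fps_sample=5, clip_len=16, size=(224, 224)):
    """Sketch of the pre-processing described above: sample key frames at 5 fps,
    keep one continuous 16-frame clip to represent the video, and resize each
    frame to 224 x 224 pixels. Function name and return layout are assumptions."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or fps_sample
    step = max(int(round(native_fps / fps_sample)), 1)   # frame interval for 5 fps sampling
    frames, idx = [], 0
    while len(frames) < clip_len:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                               # keep one frame every `step` frames
            frames.append(cv2.resize(frame, size))        # Resize pre-processing
        idx += 1
    cap.release()
    return np.stack(frames) if frames else np.empty((0, *size, 3), dtype=np.uint8)
```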
1. Parallel feature pyramid construction of live webcast video
A parallel multi-scale feature pyramid structure is constructed based on the multi-scale structure of the feature pyramid. After the bottom-layer image is input into the main structure of the feature pyramid, the main path is convolved and down-sampled from bottom to top, and convolution kernels of different sizes and step lengths can be used according to different requirements. The invention constructs a multi-scale parallel feature pyramid structure and performs down-sampling in the trunk path with 3×3 convolution of stride 2, obtaining feature maps of gradually decreasing scale. At the same time, a parallel structure similar to the trunk structure is obtained using parallel connections, which comprise 1×1 transverse convolutions and dynamic oblique connections that automatically adjust their step size according to the level. Taking 4 convolutional layers as an example, after a 224 × 224 pixel video frame is input, the oblique connection step sizes between the first-layer image of the main structure and the other layers of the parallel structure can be obtained from the parallel connections and calculated according to the above formula, as shown in Table 1.
TABLE 1 Step sizes corresponding to different layers

Hierarchy level    Step size
C1-D2              2
C1-D3              4
C1-D4              6
After the corresponding convolution step length in table 1 is obtained, the convolution feature map between the trunk structure and the parallel structure can be calculated by the transverse convolution of 1 × 1 and the oblique convolution of 3 × 3.
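As a quick numerical check of Table 1, the following snippet (assuming C1 is the 112 × 112 map obtained from a 224 × 224 frame, with an illustrative channel count of 16) applies 3×3 oblique convolutions with strides 2, 4 and 6 to C1 and prints the resulting spatial sizes.

```python
import torch
import torch.nn as nn

# Illustrative check that the diagonal 3x3 convolutions with the strides of
# Table 1 bring C1 down to scales comparable to D2, D3 and D4.
c1 = torch.randn(1, 16, 112, 112)                  # channel count assumed
for level, stride in {"C1-D2": 2, "C1-D3": 4, "C1-D4": 6}.items():
    out = nn.Conv2d(16, 16, kernel_size=3, stride=stride, padding=1)(c1)
    print(level, "->", tuple(out.shape[-2:]))      # (56, 56), (28, 28), (19, 19)
```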
2. Joint attention-driven feature pyramid design
After the parallel feature pyramid is constructed, a joint attention mechanism may be introduced to save resources by focusing on the regions of interest in the feature maps. First, consider spatial attention, which is used to find the regions of interest in the image in both the backbone structure and the parallel structure of the invention. For the first layer in this example, a 224 × 224 pixel image is convolved with a 3×3 convolution of stride 2 to obtain a 112 × 112 image. Average pooling and max pooling operations are applied layer by layer, from bottom to top, along the channel axis and concatenated to generate an effective feature descriptor, as in formula (1); this operation combines the channel information. The input feature map F, according to the results of max pooling and average pooling over each channel, is fed in turn into the multi-layer perceptron and the outputs are added, and the spatial attention information M_s of the layer is then obtained by means of a ReLU activation function.
The channel attention is then computed. According to the relevant empirical references, the number of channels r of each layer in this example is set to 3, 16, 64 and 128, respectively. After the input first-layer feature map F obtains its 3 channels, the max pooling and average pooling results of each channel are taken in turn and the outputs of the multi-layer perceptron are added according to formula (2). Finally, the feature map M_c of the corresponding channel attention module is obtained through a ReLU activation function.
The spatial attention and channel attention weights obtained above are retained down to the last layer of the ResNeSt module, and a feature map integrating multi-scale spatial attention and the attention of each channel is finally obtained through the modular stacking of the ResNeSt network. In the invention, the joint attention formed by spatial attention and channel attention is, for the first time, accumulated and used throughout the parallel feature pyramid, and the weight coefficient of each channel is learned during the whole process, so that the model has better discrimination of the spatial features within each channel. Compared with the traditional approach of applying an attention mechanism only after the last pooling layer of the network, the joint attention mechanism applied layer by layer saves more computational resources and is more conducive to focusing the network on the target regions in a video frame that deserve attention.
3. Live video scene feature expression based on joint attention and ResNeSt network
When embedding into the ResNeSt network, the joint attention feature pyramid is embedded according to the analogous structure of the split attention module inside ResNeSt. Using the branch structure of ResNeSt, the hyper-parameter Radix is set to 2, i.e., ResNeSt is divided into two branches to match the main structure and the parallel structure of the feature pyramid. The spatial attention of the trunk structure is extracted layer by layer, and each layer of the parallel structure first superimposes the oblique convolution of the trunk structure and then extracts the channel attention and the feature attention in turn. Through the different feature transformation operations of the two branches, feature maps of different scales and different channel characteristics are finally obtained. The feature map is taken as z in formula (4) and fed into the split attention module, with the dimension-reduction coefficients W_3 and W_4 set to 1, so the final feature map containing the attention weights can be obtained. The calculation process is as follows:
s = F_ex(z, W) = sigmoid(σ(z))    (5)
where z is the feature map output by the feature pyramid, and W is the weight obtained by full connection after the ReLU activation layer, with dimension (C × C)/r; the ReLU nonlinear function σ and the sigmoid function are then applied, and the feature map s containing the split attention weights is finally obtained.
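A compact sketch of this radix-2 embedding with the simplified gate of formula (5) is shown below; the two-branch layout (Conv + BN + ReLU per branch) follows the text, while the class name and channel width are assumptions.

```python
import torch
import torch.nn as nn

class RadixTwoSplitAttention(nn.Module):
    """Sketch of the radix-2 split attention embedding described above: two
    branches (matching the trunk and parallel structures of the pyramid) are
    fused and weighted with the simplified gate of formula (5),
    s = sigmoid(sigma(z)), since W_3 and W_4 are set to 1."""

    def __init__(self, channels):
        super().__init__()
        # Two feature transformations (Conv + BN + ReLU), one per branch.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for _ in range(2)])

    def forward(self, trunk_feat, parallel_feat):
        u = self.branches[0](trunk_feat) + self.branches[1](parallel_feat)
        z = u.mean(dim=(2, 3), keepdim=True)        # squeeze, formula (3)
        s = torch.sigmoid(torch.relu(z))            # simplified gate, formula (5)
        return u * s                                # feature map weighted by split attention

if __name__ == "__main__":
    block = RadixTwoSplitAttention(64)
    out = block(torch.randn(1, 64, 28, 28), torch.randn(1, 64, 28, 28))
    print(out.shape)                                # torch.Size([1, 64, 28, 28])
```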
The parallel feature pyramid integrated with the joint attention module is embedded into ResNeSt, and, combined with the convolutional layer and the pooling layer, the ResNeSt feature extraction module is constructed and the corresponding residual is learned. Through continuous down-sampling and the final feature fusion and convolution calculation, the deep feature map s is obtained, thereby realizing the extraction of the frame-level features of the final live webcast video.

Claims (1)

1. A method for extracting network live broadcast video features under a complex scene based on joint attention ResNeSt is characterized by comprising the following steps:
1) parallel feature pyramid construction of live webcast video
Constructing a multi-scale parallel feature pyramid structure; fusing feature maps of different scales through parallel connections in the feature pyramid, wherein the parallel connections comprise transverse connections of 1×1 convolution and oblique connections that change with the number of layers;
the transverse connections in the parallel connections use 1×1 convolution; the oblique connections use 3×3 convolution to down-sample the feature map, and their step size can be dynamically adjusted to 2(m−n), where n and m are the level subscripts of the trunk structure C_n and the parallel structure D_m;
2) joint attention-driven feature pyramid design
firstly, applying average pooling and max pooling operations along the channel axis and concatenating them to generate an effective feature descriptor, finding the regions of the video frame information that need attention through these operations, and finally obtaining the spatial attention using a standard convolution; the input feature map F, according to the max pooling result and the average pooling result of each channel, is fed in turn into a multi-layer perceptron MLP, the output results are added directly, and the feature map M_s(F) of the spatial attention module is then obtained through a ReLU activation function, the whole process being as follows:
M_s(F) = σ(f^{3×3}([AvgPool(F); MaxPool(F)])) = σ(f^{3×3}([F_avg^s; F_max^s]))    (1)
where M_s represents spatial attention, σ represents the ReLU nonlinear function, and f^{3×3} denotes convolution with a 3×3 kernel; AvgPool represents average pooling of the previous layer, and MaxPool represents max pooling of the previous layer; F_avg^s is the result of the average pooling, and F_max^s is the result of the max pooling;
after obtaining the spatial attention weight, the channel attention is further considered as follows:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))    (2)
where M_c(F) represents channel attention, σ represents the ReLU nonlinear function, MLP is the multi-layer perceptron matrix operation, and W_0 and W_1 are the weights of the multi-layer perceptron; AvgPool represents average pooling of the previous layer, and MaxPool represents max pooling of the previous layer; F_avg^c is the result of the average pooling, and F_max^c is the result of the max pooling;
the space attention and the channel attention weight obtained by calculation are reserved to the next layer, and are continuously transmitted and superposed until the end through a feature extraction module;
3) live video scene feature expression based on joint attention and ResNeSt network
ResNeSt adds a split attention module on the basis of the branch structure of ResNeXt, and the mathematical expression is as follows:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)    (3)
s = F_ex(z, W) = sigmoid(W_3 σ(W_4 z))    (4)
where z represents the feature compressed along the spatial dimensions, F_sq represents the compression function, u_c represents the output of the c-th convolution kernel, i and j index the corresponding elements of the matrix, and H, W and C are the height, width and number of channels of the image; s represents the deep feature map, F_ex represents the feature extraction function, σ represents the ReLU nonlinear function, and W_3 and W_4 are dimension-reduction coefficients, two hyper-parameters that can be set manually; the compressed global descriptor z is restored to its original dimension through a bottleneck structure of two fully connected layers with ReLU and sigmoid activations, giving an output of size 1 × 1 × C.
CN202011509545.7A 2020-12-18 2020-12-18 Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene Active CN112653899B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011509545.7A CN112653899B (en) 2020-12-18 2020-12-18 Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011509545.7A CN112653899B (en) 2020-12-18 2020-12-18 Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene

Publications (2)

Publication Number Publication Date
CN112653899A CN112653899A (en) 2021-04-13
CN112653899B true CN112653899B (en) 2022-07-12

Family

ID=75355082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011509545.7A Active CN112653899B (en) 2020-12-18 2020-12-18 Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene

Country Status (1)

Country Link
CN (1) CN112653899B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113284093A (en) * 2021-04-29 2021-08-20 安徽省皖北煤电集团有限责任公司 Satellite image cloud detection method based on improved D-LinkNet
CN113539297A (en) * 2021-07-08 2021-10-22 中国海洋大学 Combined attention mechanism model and method for sound classification and application
CN113378791B (en) * 2021-07-09 2022-08-05 合肥工业大学 Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion
CN113673340B (en) * 2021-07-16 2024-05-10 北京农业信息技术研究中心 Pest type image identification method and system
CN113658114A (en) * 2021-07-29 2021-11-16 南京理工大学 Contact net opening pin defect target detection method based on multi-scale cross attention
CN113592718A (en) * 2021-08-12 2021-11-02 中国矿业大学 Mine image super-resolution reconstruction method and system based on multi-scale residual error network
CN114005025A (en) * 2021-09-28 2022-02-01 山东农业大学 Automatic monitoring method, device and system for high-point panoramic color-changing tree
CN114237089A (en) * 2021-11-15 2022-03-25 成都壹为新能源汽车有限公司 Intelligent loading control system for washing and sweeping operation vehicle
CN114092833B (en) * 2022-01-24 2022-05-27 长沙理工大学 Remote sensing image classification method and device, computer equipment and storage medium
CN114627086B (en) * 2022-03-18 2023-04-28 江苏省特种设备安全监督检验研究院 Crane surface damage detection method based on characteristic pyramid network
CN117612168B (en) * 2023-11-29 2024-11-05 湖南工商大学 Recognition method and device based on feature pyramid and attention mechanism
CN117468085B (en) * 2023-12-27 2024-05-28 浙江晶盛机电股份有限公司 Crystal bar growth control method and device, crystal growth furnace system and computer equipment
CN117496160B (en) * 2023-12-29 2024-03-19 中国民用航空飞行学院 Indoor scene-oriented semantic segmentation method for low-illumination image shot by unmanned aerial vehicle

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902693A (en) * 2019-02-16 2019-06-18 太原理工大学 One kind being based on more attention spatial pyramid characteristic image recognition methods
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
WO2020113886A1 (en) * 2018-12-07 2020-06-11 中国科学院自动化研究所 Behavior feature extraction method, system and apparatus based on time-space/frequency domain hybrid learning
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111626159A (en) * 2020-05-15 2020-09-04 南京邮电大学 Human body key point detection method based on attention residual error module and branch fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
WO2020113886A1 (en) * 2018-12-07 2020-06-11 中国科学院自动化研究所 Behavior feature extraction method, system and apparatus based on time-space/frequency domain hybrid learning
CN109902693A (en) * 2019-02-16 2019-06-18 太原理工大学 One kind being based on more attention spatial pyramid characteristic image recognition methods
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111626159A (en) * 2020-05-15 2020-09-04 南京邮电大学 Human body key point detection method based on attention residual error module and branch fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Large-scale video semantic recognition algorithm based on long-short-term prediction consistency; Wang Zheng et al.; Scientia Sinica Informationis; 2020-06-12; Vol. 50, No. 6; full text *

Also Published As

Publication number Publication date
CN112653899A (en) 2021-04-13

Similar Documents

Publication Publication Date Title
CN112653899B (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
He et al. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline
CN113362223B (en) Image super-resolution reconstruction method based on attention mechanism and two-channel network
CN110555434B (en) Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
Liu et al. Learning temporal dynamics for video super-resolution: A deep learning approach
CN110059598B (en) Long-term fast-slow network fusion behavior identification method based on attitude joint points
CN109410239A (en) A kind of text image super resolution ratio reconstruction method generating confrontation network based on condition
CN108537754B (en) Face image restoration system based on deformation guide picture
CN110443842A (en) Depth map prediction technique based on visual angle fusion
CN111835983B (en) Multi-exposure-image high-dynamic-range imaging method and system based on generation countermeasure network
CN112750201B (en) Three-dimensional reconstruction method, related device and equipment
CN114820341A (en) Image blind denoising method and system based on enhanced transform
CN113538243B (en) Super-resolution image reconstruction method based on multi-parallax attention module combination
Li et al. Uphdr-gan: Generative adversarial network for high dynamic range imaging with unpaired data
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN111950477A (en) Single-image three-dimensional face reconstruction method based on video surveillance
CN113065645A (en) Twin attention network, image processing method and device
CN111242181B (en) RGB-D saliency object detector based on image semantics and detail
CN117173024B (en) Mine image super-resolution reconstruction system and method based on overall attention
CN112017116A (en) Image super-resolution reconstruction network based on asymmetric convolution and construction method thereof
CN108875751A (en) Image processing method and device, the training method of neural network, storage medium
CN107067452A (en) A kind of film 2D based on full convolutional neural networks turns 3D methods
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
CN105374010A (en) A panoramic image generation method
Huang et al. Temporally-aggregating multiple-discontinuous-image saliency prediction with transformer-based attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant