CN113936235A - Video saliency target detection method based on quality evaluation - Google Patents
Video saliency target detection method based on quality evaluation
- Publication number
- CN113936235A (application CN202111075792.5A)
- Authority
- CN
- China
- Prior art keywords
- features
- rgb
- input
- module
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F18/253—Fusion techniques of extracted features
- G—PHYSICS; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/045—Combinations of networks
- G—PHYSICS; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/08—Learning methods
Abstract
The invention discloses a video saliency target detection method based on quality evaluation. A video saliency target detection network framework is first constructed on a two-stream encoder-decoder structure with a ResNet-101 backbone: one branch takes an RGB image as input and extracts the spatial features of the image (the RGB branch), while the other branch takes an optical flow image as input and extracts temporal information between frames (the motion branch). Two consecutive video frames are processed by the RAFT algorithm to compute an optical flow map that reflects the motion of objects in the video. Compared with existing two-stream video saliency target detection methods, the proposed framework can adaptively capture accurate spatial and temporal information and thereby obtain accurate prediction results.
Description
Technical Field
The invention belongs to the field of computer vision, and aims to locate and segment the most attention-attracting object by utilizing spatial clues and temporal clues hidden in a video sequence. This task stems from the human visual attention behavior in cognitive studies, i.e. a rapid shift of attention to the most informative regions of the visual scene.
Background
Existing methods can partially solve this problem and can be roughly divided into four categories: Video Salient Object Detection (VSOD) methods based on feature extraction, on long short-term memory, on attention mechanisms, and on parallel networks.
Feature-extraction-based VSOD methods attempt to combine spatial information with motion cues using prior knowledge, such as spatio-temporal background priors and low-rank consistency; the performance of such methods is limited by the quality of the extracted features. VSOD methods based on long short-term memory networks extract spatial information from individual frames of a video sequence and model temporal information with a convolutional memory unit such as ConvLSTM. Attention-based VSOD methods use a non-local mechanism to capture temporal information across several consecutive frames. Parallel-network VSOD methods typically adopt a two-stream framework, in which one tributary extracts spatial features of the RGB image and the other extracts temporal features from optical flow images generated by an optical flow algorithm. These methods are limited by the quality of the optical flow images and by whether the output features properly fuse spatial and temporal information.
The problems and challenges of current VSOD methods are mainly the following. First, the spatial cues hidden in each frame are often difficult to exploit when the foreground and background share similar features: RGB images with low contrast between salient objects and the background introduce misleading information that interferes with the prediction. Second, the temporal cues hidden between different frames may be disturbed by fast motion, large displacements and illumination changes: noise in the optical flow images leads to erroneous predictions, and even temporal information from accurate optical flow images can confuse the spatial information of several moving objects in the scene. Third, the predicted edges are often rough: spatio-temporal information can usually determine the location of salient objects, but insufficient emphasis on shallow features blurs the edge information.
VSOD has a wide range of application scenarios: as an effective preprocessing technique, it has been widely applied to many computer vision tasks such as retrieval, recognition, segmentation, retargeting, enhancement, pedestrian detection, evaluation and compression.
Disclosure of Invention
Two-stream VSOD methods are limited by the quality of the optical flow images, by the quality of the extracted RGB features, and by whether the output features properly integrate spatial and temporal information. The present invention therefore provides a video saliency target detection method based on quality evaluation: a new framework that contains a module for evaluating the quality of optical flow features (temporal information) and RGB features (spatial information), so that the framework can adaptively capture accurate spatial and temporal information to predict saliency maps. Specifically, an adaptive gate module for quality evaluation (the quality evaluation module) is introduced into both the encoding and the decoding parts of the framework. This module estimates the quality of input features by computing their MAE value; high-quality features receive larger weights and are retained, while low-quality features receive smaller weights and are suppressed, which acts as a screening step and passes on effective information. Secondly, considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced. In addition, to better fuse spatial and temporal information, an attention-based module (the spatio-temporal information interaction module) is introduced, so that spatial and temporal information guide and promote each other and better spatio-temporal features are learned. Finally, a double difference enhancement module is proposed that focuses on capturing the difference information between spatial and temporal cues and generating fused features.
A video saliency target detection method based on quality assessment comprises the following steps:
Step (1): constructing a video saliency target detection network framework;
The video saliency target detection network framework is based on a two-stream encoder-decoder structure with a ResNet-101 backbone. One branch takes an RGB image as input and extracts the spatial features of the image (the RGB branch); the other branch takes an optical flow image as input and extracts temporal information between frames (the motion branch). Two consecutive video frames are processed by the RAFT algorithm to compute an optical flow map that reflects the motion of objects in the video.
Step (2): In the encoding part, the two branches extract features separately; the quality evaluation module evaluates the quality of the output features of each layer and screens the features that provide guidance, and the spatio-temporal information interaction module enhances the features so that the spatio-temporal features of the two branches guide each other. In the decoding part, the quality evaluation module again evaluates the quality of each layer's encoder output features, the double difference enhancement module fuses the spatio-temporal features, deep features are passed to shallow features in a cascaded manner, and the prediction map is finally obtained. Considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced into the encoding part, and its output is fused with the output of the RGB tributary to obtain the multi-scale fused RGB features.
To reduce the number of model parameters, a convolution is applied in the decoding part to each layer of ResNet output features to reduce the number of channels: the first four layers are reduced to 48 channels and the fifth layer to 256 channels. Meanwhile, the backbone networks of the two branches share parameters, and the quality evaluation modules of the layers share parameters.
Step (3): Loss function:
For the final predicted saliency map, the loss between it and the correct label is calculated according to the definition of the loss function in BASNet, as shown in Equation 9.
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)    (9)
For quality assessment, the intermediate saliency maps are supervised by the correct labels with the same loss function as Equation 9; the mean absolute error (MAE) between each intermediate saliency map and the correct label is calculated, and the MAE supervises the quality score through Equation 10.
Step (4): The video saliency target detection network is pre-trained with the training set of the DUTS dataset; after one round of pre-training, the whole network is trained for another round with the training set of the DAVIS dataset. Overfitting is prevented by augmenting the training data with random horizontal flipping and random rotation of the input images. The model is trained with the Adam optimizer at an initial learning rate of 1e-5 until convergence.
Further, the multi-scale perception module:
The multi-scale perception module enhances the features of the final output of the RGB tributary in the encoding stage. Structurally, it consists of two ResNet-101 networks whose parameters are shared with the ResNet-101 backbone of the RGB tributary. The input image of the RGB tributary is first down-sampled to 1/2 and 1/4 of its size, and the down-sampled images are fed into the two ResNet-101 networks. After five convolutional layers, two output features of different scales are obtained; these two features are fused with the final-layer output features of the RGB tributary by step-by-step up-sampling and concatenation, and the fused features are combined with the features finally obtained by the RGB tributary to yield the multi-scale fused RGB features, as shown in Equation 1:
Here, concat(·,·) denotes concatenation along the channel dimension and the up-sampling operation doubles the spatial resolution; the two scale-specific features correspond to the 1/4- and 1/2-scale input images, E_in denotes the features of the RGB tributary, and E_out is the fused output feature.
Further, the quality evaluation module:
The quality evaluation module consists of two sub-networks: a prediction sub-network and an evaluation sub-network. The prediction sub-network consists of three convolutional layers and predicts a saliency map that is supervised by the correct label. The evaluation sub-network consists of three convolutional layers, a global average pooling layer and a Sigmoid activation function, and computes a quality score that is supervised by the mean absolute error (MAE) between the predicted map and the correct label. The predicted saliency map is concatenated with the input features and used as the input of the evaluation sub-network. The quality score is multiplied with the input features to obtain the output features of the quality evaluation module, as shown in Equations 2 and 3. Input features with higher quality scores are retained, while input features with lower scores, regarded as containing much noise, are removed.
In these formulas, the first two symbols denote the convolution operations of the evaluation sub-network and of the prediction sub-network, s_i is the quality score, E_i is the input feature, and the last symbol is the feature after quality evaluation; the multiplication is element-wise and σ is the Sigmoid activation function.
Further, the spatio-temporal information interaction module:
The spatio-temporal information interaction module keeps semantic consistency among different features. Taking the motion tributary features as input, an attention operation is first applied along the channel dimension (Equation 4) and then along the spatial dimension (Equation 5); the enhanced features are added to the features of the RGB tributary to obtain spatial information guided by temporal information (Equations 6 and 7). Similarly, the RGB tributary features are enhanced by attention operations along the channel and spatial dimensions and added to the motion tributary features to obtain temporal information guided by spatial information.
Here, the channel attention function applies max pooling over the spatial dimension followed by a fully connected operation, with σ the Sigmoid activation function and the corresponding multiplication performed along the channel dimension; the spatial attention function applies max pooling over the channel dimension followed by a convolution operation, with the corresponding multiplication performed along the spatial dimension; the remaining symbols denote the input motion features, the motion features enhanced by the attention operations, the output RGB features, and the element-wise addition operation.
Further, the double difference enhancement module:
The double difference enhancement module mines the difference information between the RGB and optical flow features. For the quality-evaluated RGB and optical flow features, difference information is extracted by subtraction and convolution operations, and the original information is enhanced by using the difference information as a supplement, as shown in Equation 8.
Here, one symbol denotes the convolution operation applied to the difference feature and the other denotes the enhanced (spatial or temporal) feature.
The invention has the following beneficial effects:
the invention provides a novel video saliency target detection framework, wherein a multi-scale perception module, a quality evaluation module, a temporal-spatial information interaction module and a double-difference enhancement module are introduced, and compared with the existing video saliency target detection method based on double flow, the framework provided by the invention can capture accurate spatial and temporal information in a self-adaptive manner, so that an accurate prediction result is obtained.
Drawings
FIG. 1 shows the encoding part of the framework;
FIG. 2 shows the decoding part of the framework;
FIG. 3 is a block diagram of the quality evaluation module;
FIG. 4 is a diagram of the double difference enhancement module.
Detailed Description
The method of the invention is further described below with reference to the accompanying drawings and examples.
A video saliency target detection method based on quality assessment comprises the following steps:
Step (1): constructing a video saliency target detection network framework;
The video saliency target detection network framework is based on a two-stream encoder-decoder structure with a ResNet-101 backbone. One branch takes an RGB image as input and extracts the spatial features of the image (the RGB branch); the other branch takes an optical flow image as input and extracts temporal information between frames (the motion branch). Two consecutive video frames are processed by the RAFT algorithm to compute an optical flow map that reflects the motion of objects in the video.
Step (2): In the encoding part, the two branches extract features separately; the quality evaluation module evaluates the quality of the output features of each layer and screens the features that provide guidance, and the spatio-temporal information interaction module enhances the features so that the spatio-temporal features of the two branches guide each other. In the decoding part, the quality evaluation module again evaluates the quality of each layer's encoder output features, the double difference enhancement module fuses the spatio-temporal features, deep features are passed to shallow features in a cascaded manner, and the prediction map is finally obtained. Considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced into the encoding part, and its output is fused with the output of the RGB tributary to obtain the multi-scale fused RGB features.
To reduce the number of model parameters, a convolution is applied in the decoding part to each layer of ResNet output features to reduce the number of channels: the first four layers are reduced to 48 channels and the fifth layer to 256 channels. Meanwhile, the backbone networks of the two branches share parameters, and the quality evaluation modules of the layers share parameters.
Step (3): Loss function:
For the final predicted saliency map, the loss between it and the correct label is calculated according to the definition of the loss function in BASNet, as shown in Equation 9.
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)    (9)
For quality assessment, the intermediate saliency maps are supervised by the correct labels with the same loss function as Equation 9; the mean absolute error (MAE) between each intermediate saliency map and the correct label is calculated, and the MAE supervises the quality score through Equation 10.
Step (4): The video saliency target detection network is pre-trained with the training set of the DUTS dataset; after one round of pre-training, the whole network is trained for another round with the training set of the DAVIS dataset. Overfitting is prevented by augmenting the training data with random horizontal flipping and random rotation of the input images. The model is trained with the Adam optimizer at an initial learning rate of 1e-5 until convergence.
FIGS. 1 and 2 show the structure of the framework according to the present invention. The framework is based on a two-stream encoder-decoder structure with a ResNet-101 backbone. One branch takes an RGB image as input and extracts the spatial features of the image (the RGB branch); the other branch takes an optical flow image as input and extracts temporal information between frames (the motion branch). Two consecutive video frames are processed by the RAFT algorithm to compute an optical flow map that reflects the motion of objects in the video. In the encoding part, the two branches extract features separately; the quality of each layer's output features is evaluated, features that provide guidance are screened, features are enhanced by an attention-based module, and the spatio-temporal features of the two branches guide each other. In the decoding part, the output features of each encoder layer are evaluated again, the spatio-temporal features are fused by the double difference enhancement module, deep features are passed to shallow features in a cascaded manner, and the prediction map is finally obtained.
To reduce the number of model parameters, a convolution is applied in the decoding part to each layer of ResNet output features to reduce the number of channels: the first four layers are reduced to 48 channels and the fifth layer to 256 channels. Meanwhile, the backbone networks of the two branches share parameters, and the quality evaluation modules of the layers share parameters.
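For illustration only, the two sketches below show how the pieces described so far could be assembled. The first computes the optical flow map for two consecutive frames, assuming the RAFT implementation shipped with torchvision (raft_large, torchvision >= 0.13); the compute_flow helper, the padding to multiples of 8 and the uint8 frame format are conveniences of this example rather than part of the claimed method.

```python
# Hedged sketch: optical flow for two consecutive frames via torchvision's RAFT.
# Assumes frames are HxWx3 uint8 numpy arrays and torchvision >= 0.13 is installed.
import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()  # normalizes both frames the way RAFT expects

@torch.no_grad()
def compute_flow(frame_t, frame_t1):
    """Return a (2, H, W) optical flow tensor describing motion from frame_t to frame_t1."""
    img1 = torch.from_numpy(frame_t).permute(2, 0, 1).unsqueeze(0)
    img2 = torch.from_numpy(frame_t1).permute(2, 0, 1).unsqueeze(0)
    img1, img2 = preprocess(img1, img2)
    # RAFT expects spatial sizes divisible by 8, so pad on the right/bottom if needed.
    h, w = img1.shape[-2:]
    pad_h, pad_w = (-h) % 8, (-w) % 8
    img1 = F.pad(img1, (0, pad_w, 0, pad_h))
    img2 = F.pad(img2, (0, pad_w, 0, pad_h))
    flow = raft(img1, img2)[-1]    # the last element of the list is the refined estimate
    return flow[0, :, :h, :w]      # crop the padding away
```

The flow field is typically rendered as a color optical flow image before being fed to the motion branch. The second sketch is a rough layout of the two-stream encoder, assuming torchvision's ResNet-101 as the shared backbone; the placement of the 1x1 channel-reduction convolutions and the decision to share them between the two branches are illustrative choices of this example, not details taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101, ResNet101_Weights

class TwoStreamEncoder(nn.Module):
    """Shared ResNet-101 backbone applied to both the RGB image and the optical flow image,
    with 1x1 convolutions reducing the five stages to 48/48/48/48/256 channels."""
    def __init__(self):
        super().__init__()
        backbone = resnet101(weights=ResNet101_Weights.IMAGENET1K_V1)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        in_ch, out_ch = [64, 256, 512, 1024, 2048], [48, 48, 48, 48, 256]
        self.reduce = nn.ModuleList(nn.Conv2d(i, o, kernel_size=1) for i, o in zip(in_ch, out_ch))

    def _features(self, x):
        x = self.stem[:-1](x)            # conv1 + bn + relu (64 channels)
        feats = [x]
        x = self.stem[-1](x)             # max pooling
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                     # five feature maps, shallow to deep

    def forward(self, rgb, flow):
        rgb_feats = [r(f) for r, f in zip(self.reduce, self._features(rgb))]
        flow_feats = [r(f) for r, f in zip(self.reduce, self._features(flow))]
        return rgb_feats, flow_feats
```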
Introduction to the specific modules:
(1) The multi-scale perception module:
Multi-scale analysis is essentially the sampling of a signal at different granularities: different features can be observed at different scales, and thus different tasks can be accomplished. In general, finer-grained (denser) sampling reveals more detail, while coarser-grained (sparser) sampling reveals the overall trend. The multi-scale perception module enhances the features of the final output of the RGB tributary in the encoding stage. Structurally, it consists of two ResNet-101 networks whose parameters are shared with the ResNet-101 backbone of the RGB tributary. The input image of the RGB tributary is first down-sampled to 1/2 and 1/4 of its size, and the down-sampled images are fed into the two ResNet-101 networks. After five convolutional layers, two output features of different scales are obtained; these two features are fused with the final-layer output features of the RGB tributary by step-by-step up-sampling and concatenation, and the fused features are combined with the features finally obtained by the RGB tributary to yield the multi-scale fused RGB features, as shown in Equation 1:
Here, concat(·,·) denotes concatenation along the channel dimension and the up-sampling operation doubles the spatial resolution; the two scale-specific features correspond to the 1/4- and 1/2-scale input images, E_in denotes the features of the RGB tributary, and E_out is the fused output feature.
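The sketch below illustrates the step-by-step up-sample-and-concatenate fusion that Equation 1 refers to; the channel count, kernel sizes and the three fusion convolutions are assumptions made for this example, since only the qualitative description is given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePerception(nn.Module):
    """Fuse the last-stage features of the full-resolution, 1/2 and 1/4 inputs
    by progressive up-sampling and channel concatenation (cf. Equation 1)."""
    def __init__(self, channels=256):
        super().__init__()
        self.fuse_quarter_half = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse_half_full = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse_out = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, e_in, e_half, e_quarter):
        # e_in: last-stage RGB feature of the original image; e_half / e_quarter:
        # last-stage features of the 1/2 and 1/4 down-sampled inputs (shared backbone).
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.fuse_quarter_half(torch.cat([up(e_quarter), e_half], dim=1))
        x = self.fuse_half_full(torch.cat([up(x), e_in], dim=1))
        return self.fuse_out(torch.cat([x, e_in], dim=1))   # multi-scale fused RGB feature
```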
(2) The quality evaluation module:
The quality evaluation module evaluates and supervises the extracted features and removes noise information, as shown in FIG. 3. It predicts a quality score that represents the reliability of a feature and recalibrates that feature. The module consists of two sub-networks: a prediction sub-network and an evaluation sub-network. The prediction sub-network consists of three convolutional layers and predicts a saliency map that is supervised by the correct label. The evaluation sub-network consists of three convolutional layers, a global average pooling layer and a Sigmoid activation function, and computes a quality score that is supervised by the mean absolute error (MAE) between the predicted map and the correct label. The predicted saliency map is concatenated with the input features and used as the input of the evaluation sub-network. The quality score is multiplied with the input features to obtain the output features of the quality evaluation module, as shown in Equations 2 and 3. Input features with higher quality scores are retained, while input features with lower scores, regarded as containing much noise, are removed.
In these formulas, the first two symbols denote the convolution operations of the evaluation sub-network and of the prediction sub-network, s_i is the quality score, E_i is the input feature, and the last symbol is the feature after quality evaluation; the multiplication is element-wise and σ is the Sigmoid activation function.
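A minimal sketch of the quality evaluation module as described above follows, with a three-layer prediction sub-network, a three-layer evaluation sub-network ending in global average pooling and a Sigmoid, and the score multiplied with the input feature; kernel sizes and hidden widths are assumptions, and the supervision of the intermediate map and of the score (Equation 10) is left out.

```python
import torch
import torch.nn as nn

class QualityEvaluation(nn.Module):
    """Predict an intermediate saliency map, score its reliability, and re-weight the
    input feature by that score (cf. Equations 2 and 3)."""
    def __init__(self, channels):
        super().__init__()
        self.predict = nn.Sequential(                  # prediction sub-network: 3 conv layers
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1))
        self.evaluate = nn.Sequential(                 # evaluation sub-network: 3 conv layers,
            nn.Conv2d(channels + 1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Sigmoid())     # global average pooling + Sigmoid

    def forward(self, feat):
        pred = torch.sigmoid(self.predict(feat))               # intermediate saliency map
        score = self.evaluate(torch.cat([feat, pred], dim=1))  # quality score in (0, 1)
        return score * feat, pred, score                       # re-weighted feature
```

During training, pred would be supervised by the correct label and score by the MAE between pred and that label, as stated in the text.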
(3) The spatio-temporal information interaction module:
The spatio-temporal information interaction module keeps semantic consistency among different features: spatial and temporal information guide and promote each other through this module. Taking the motion tributary features as input, an attention operation is first applied along the channel dimension (Equation 4) and then along the spatial dimension (Equation 5); the enhanced features are added to the features of the RGB tributary to obtain spatial information guided by temporal information (Equations 6 and 7). Similarly, the RGB tributary features are enhanced by attention operations along the channel and spatial dimensions and added to the motion tributary features to obtain temporal information guided by spatial information.
Here, the channel attention function applies max pooling over the spatial dimension followed by a fully connected operation, with σ the Sigmoid activation function and the corresponding multiplication performed along the channel dimension; the spatial attention function applies max pooling over the channel dimension followed by a convolution operation, with the corresponding multiplication performed along the spatial dimension; the remaining symbols denote the input motion features, the motion features enhanced by the attention operations, the output RGB features, and the element-wise addition operation.
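One direction of this interaction could be sketched as below (channel attention from spatial max pooling and a fully connected layer, spatial attention from channel max pooling and a convolution, then addition to the other tributary); the reduction ratio, the 7x7 kernel and the sharing of the attention layers between the two directions are assumptions of this example.

```python
import torch
import torch.nn as nn

class SpatioTemporalInteraction(nn.Module):
    """Enhance one tributary with channel and spatial attention and add it to the other
    (cf. Equations 4-7); applied symmetrically in both directions."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                                nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)

    def enhance(self, x):
        b, c, _, _ = x.shape
        # channel attention: max pooling over the spatial dimension, then FC + Sigmoid
        ca = torch.sigmoid(self.fc(x.amax(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca
        # spatial attention: max pooling over the channel dimension, then conv + Sigmoid
        sa = torch.sigmoid(self.spatial_conv(x.amax(dim=1, keepdim=True)))
        return x * sa

    def forward(self, rgb_feat, motion_feat):
        rgb_guided = rgb_feat + self.enhance(motion_feat)      # spatial info guided by time
        motion_guided = motion_feat + self.enhance(rgb_feat)   # temporal info guided by space
        return rgb_guided, motion_guided
```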
(4) The double difference enhancement module:
The color saliency obtained from the RGB features and the motion saliency obtained from the optical flow features are complementary, and fusing the two yields an information-rich saliency map. However, most of the complementary information is hidden in the difference between the RGB and optical flow features. To fully exploit this complementarity, a double difference enhancement module is proposed to mine the difference information between the RGB and optical flow features, as shown in FIG. 4. For the quality-evaluated RGB and optical flow features, difference information is extracted by subtraction and convolution operations, and the original information is enhanced by using the difference information as a supplement, as shown in Equation 8.
Here, one symbol denotes the convolution operation applied to the difference feature and the other denotes the enhanced (spatial or temporal) feature.
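A small sketch of the double difference idea follows; reading "double" as taking the difference in both directions, each followed by its own convolution, is an interpretation made for this example rather than a detail stated in the text.

```python
import torch.nn as nn

class DoubleDifferenceEnhancement(nn.Module):
    """Mine the difference between the quality-evaluated RGB and optical flow features and
    feed it back as a supplement to each branch (cf. Equation 8)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_rgb = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_flow = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, rgb_feat, flow_feat):
        enhanced_rgb = rgb_feat + self.conv_rgb(rgb_feat - flow_feat)     # enhanced spatial feature
        enhanced_flow = flow_feat + self.conv_flow(flow_feat - rgb_feat)  # enhanced temporal feature
        return enhanced_rgb, enhanced_flow
```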
II. Experimental details:
(1) Loss function:
For the final predicted saliency map, the loss between it and the correct label is calculated according to the definition of the loss function in BASNet, as shown in Equation 9.
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)    (9)
For quality assessment, the intermediate saliency maps are supervised with the same loss function as Equation 9; the mean absolute error (MAE) between each intermediate saliency map and the correct annotation is calculated, and the MAE supervises the quality score through Equation 10.
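The hybrid loss of Equation 9 could be assembled roughly as follows; only the BCE + SSIM + IoU combination comes from the text, while the simplified single-scale SSIM window and the soft IoU formulation are common conventions assumed here, not the exact BASNet implementation.

```python
import torch.nn.functional as F

def ssim_loss(pred, target, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM, computed with an average-pooling window (simplified, single scale)."""
    pad = window // 2
    mu_p = F.avg_pool2d(pred, window, 1, pad)
    mu_t = F.avg_pool2d(target, window, 1, pad)
    var_p = F.avg_pool2d(pred * pred, window, 1, pad) - mu_p ** 2
    var_t = F.avg_pool2d(target * target, window, 1, pad) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, window, 1, pad) - mu_p * mu_t
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return 1 - ssim.clamp(0, 1).mean()

def iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss averaged over the batch."""
    inter = (pred * target).sum(dim=(2, 3))
    union = (pred + target - pred * target).sum(dim=(2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def hybrid_loss(pred, target):
    """L_f = L_BCE + L_SSIM + L_IoU as in Equation 9; pred and target lie in [0, 1]."""
    return F.binary_cross_entropy(pred, target) + ssim_loss(pred, target) + iou_loss(pred, target)
```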
(2) Datasets:
In the experiments, an image salient object detection dataset (DUTS) and a video salient object detection dataset (DAVIS) are used for model training, and the video salient object detection datasets DAVIS and DAVSOD are used to test the performance of the model.
The DUTS dataset, which contains 5019 test images and 10553 training images, is currently the largest image saliency detection dataset. The DAVIS dataset contains 50 high-quality video sequences with 3455 frames in total. DAVSOD is currently the largest VSOD dataset, with 226 videos and 23938 frames, covering different real scenes, objects, instances and actions.
(3) Evaluation metrics:
The evaluation of video saliency target detection mainly uses three indices: F-measure, S-measure and Mean Absolute Error (MAE).
The F-measure, defined in Equation 11, is the weighted harmonic mean of precision and recall under a non-negative weight β: F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall) (11), where β² is empirically set to 0.3.
The S-measure, defined in Equation 12, evaluates the structural similarity between the predicted saliency map and the corresponding correct annotation map: S = μ · S_o + (1 - μ) · S_r (12), where μ is typically set to 0.5 and S_o and S_r denote the object-aware and region-aware structural similarity, respectively.
The Mean Absolute Error (MAE), defined in Equation 13, is MAE = (1 / (W · H)) Σ_x Σ_y |S(x, y) - G(x, y)| (13), where S(x, y) is the pixel value of the predicted saliency map, G(x, y) is the pixel value of the correct annotation map, and W and H are the width and height of the image.
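To make two of these indices concrete, a small NumPy sketch is given below; the adaptive threshold of twice the map's mean used for the F-measure is a common convention assumed here, and the S-measure is not sketched because its region- and object-aware terms are not spelled out in the text.

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error (Equation 13); pred and gt are HxW maps with values in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3):
    """F-measure (Equation 11), binarizing pred at twice its mean value (assumed convention)."""
    threshold = min(2.0 * pred.mean(), 1.0)
    binary = pred >= threshold
    positives = gt > 0.5
    tp = np.logical_and(binary, positives).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (positives.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```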
(4) Experimental steps:
We implemented our method in PyTorch. We use ResNet-101 pre-trained on ImageNet as the initial backbone and RAFT to generate the optical flow images. First, we pre-train our model with the training set of the DUTS dataset; after one round of pre-training, we train the whole network with the training set of the DAVIS dataset for another round. We prevent overfitting by augmenting the training data with random horizontal flipping and random rotation of the input images. We train the model using the Adam optimizer with an initial learning rate of 1e-5 until convergence.
Claims (5)
1. A video saliency target detection method based on quality assessment is characterized by comprising the following steps:
step (1): constructing a video saliency target detection network framework;
the video saliency target detection network framework is based on a two-stream encoder-decoder structure with a ResNet-101 backbone; one tributary takes an RGB image as input and extracts the spatial features of the image, and is called the RGB tributary; the other tributary takes an optical flow image as input and extracts temporal information between frames, and is called the motion tributary; two consecutive video frames are processed by the RAFT algorithm to compute an optical flow map that reflects the motion of objects in the video;
step (2): in the encoding part, the two tributaries extract features separately; the quality evaluation module evaluates the quality of the output features of each layer and screens the features that provide guidance, and the spatio-temporal information interaction module enhances the features so that the spatio-temporal features of the two tributaries guide each other; in the decoding part, the quality evaluation module again evaluates the quality of each layer's encoder output features, the double difference enhancement module fuses the spatio-temporal features, deep features are passed to shallow features in a cascaded manner, and the prediction map is finally obtained; considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced into the encoding part, and its output is fused with the output of the RGB tributary to obtain the multi-scale fused RGB features;
to reduce the number of model parameters, a convolution is applied in the decoding part to each layer of ResNet output features to reduce the number of channels: the first four layers are reduced to 48 channels and the fifth layer to 256 channels; meanwhile, the backbone networks of the two tributaries share parameters, and the quality evaluation modules of the layers share parameters;
step (3): loss function:
for the final predicted saliency map, calculating the loss between it and the correct label according to the definition of the loss function in BASNet, as shown in formula 9;
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)    (9)
for quality evaluation, the intermediate saliency maps are supervised by the correct labels using the same loss function as formula 9; the mean absolute error MAE between each intermediate saliency map and the correct label is calculated, and the MAE supervises the quality score through formula 10;
step (4): pre-training the video saliency target detection network with the training set of the DUTS dataset and, after one round of pre-training is finished, training the whole network for another round with the training set of the DAVIS dataset; the amount of training data is augmented by applying random horizontal flipping and random rotation to the input images, so that overfitting is prevented; the model is trained with an Adam optimizer at an initial learning rate of 1e-5 until convergence.
2. The method according to claim 1, wherein the multi-scale perception module:
the multi-scale perception module enhances the features of the final output of the RGB tributary in the encoding stage; structurally, it consists of two ResNet-101 networks whose parameters are shared with the ResNet-101 backbone of the RGB tributary; the input image of the RGB tributary is first down-sampled to 1/2 and 1/4 of its size, and the down-sampled images are fed into the two ResNet-101 networks; after five convolutional layers, two output features of different scales are obtained; these two features are fused with the final-layer output features of the RGB tributary by step-by-step up-sampling and concatenation, and the fused features are combined with the features finally obtained by the RGB tributary to yield the multi-scale fused RGB features, as shown in formula 1:
3. The method according to claim 1, wherein the quality evaluation module:
the quality evaluation module consists of two sub-networks, a prediction sub-network and an evaluation sub-network; the prediction sub-network consists of three convolutional layers and predicts a saliency map that is supervised by the correct label; the evaluation sub-network consists of three convolutional layers, a global average pooling layer and a Sigmoid activation function, and computes a quality score that is supervised by the mean absolute error MAE between the predicted map and the correct label; the predicted saliency map is concatenated with the input features and used as the input features of the evaluation sub-network; the quality score is multiplied with the input features to obtain the output features of the quality evaluation module, as shown in formulas 2 and 3; input features with higher quality scores are retained, while input features with lower scores, regarded as containing much noise, are removed;
in these formulas, the first two symbols denote the convolution operations of the evaluation sub-network and of the prediction sub-network, s_i is the quality score, E_i is the input feature, and the last symbol is the feature after quality evaluation; the multiplication is element-wise and σ is the Sigmoid activation function.
4. The method as claimed in claim 1, wherein the spatio-temporal information interaction module:
the spatio-temporal information interaction module keeps semantic consistency among different features; taking the motion tributary features as input, an attention operation is first applied along the channel dimension, as shown in formula 4, and then along the spatial dimension, as shown in formula 5; the enhanced features are added to the features of the RGB tributary to obtain spatial information guided by temporal information, as shown in formulas 6 and 7; similarly, the RGB tributary features are enhanced by attention operations along the channel and spatial dimensions and added to the motion tributary features to obtain temporal information guided by spatial information;
here, the channel attention function applies max pooling over the spatial dimension followed by a fully connected operation, with σ the Sigmoid activation function and the corresponding multiplication performed along the channel dimension; the spatial attention function applies max pooling over the channel dimension followed by a convolution operation, with the corresponding multiplication performed along the spatial dimension; the remaining symbols denote the input motion features, the motion features enhanced by the attention operations, the output RGB features, and the element-wise addition operation.
5. The method according to claim 1, wherein the double difference enhancement module:
the double difference enhancement module mines the difference information between the RGB and optical flow features; for the quality-evaluated RGB and optical flow features, difference information is extracted by subtraction and convolution operations, and the original information is enhanced by using the difference information as a supplement, as shown in formula 8;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111075792.5A CN113936235A (en) | 2021-09-14 | 2021-09-14 | Video saliency target detection method based on quality evaluation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111075792.5A CN113936235A (en) | 2021-09-14 | 2021-09-14 | Video saliency target detection method based on quality evaluation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113936235A true CN113936235A (en) | 2022-01-14 |
Family
ID=79275690
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111075792.5A Withdrawn CN113936235A (en) | 2021-09-14 | 2021-09-14 | Video saliency target detection method based on quality evaluation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113936235A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114612979A (en) * | 2022-03-09 | 2022-06-10 | 平安科技(深圳)有限公司 | Living body detection method and device, electronic equipment and storage medium |
CN116994006A (en) * | 2023-09-27 | 2023-11-03 | 江苏源驶科技有限公司 | Collaborative saliency detection method and system for fusing image saliency information |
CN117173394A (en) * | 2023-08-07 | 2023-12-05 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114612979A (en) * | 2022-03-09 | 2022-06-10 | 平安科技(深圳)有限公司 | Living body detection method and device, electronic equipment and storage medium |
CN114612979B (en) * | 2022-03-09 | 2024-05-31 | 平安科技(深圳)有限公司 | Living body detection method and device, electronic equipment and storage medium |
CN117173394A (en) * | 2023-08-07 | 2023-12-05 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
CN117173394B (en) * | 2023-08-07 | 2024-04-02 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
CN116994006A (en) * | 2023-09-27 | 2023-11-03 | 江苏源驶科技有限公司 | Collaborative saliency detection method and system for fusing image saliency information |
CN116994006B (en) * | 2023-09-27 | 2023-12-08 | 江苏源驶科技有限公司 | Collaborative saliency detection method and system for fusing image saliency information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112132156B (en) | Image saliency target detection method and system based on multi-depth feature fusion | |
Zhou et al. | HFNet: Hierarchical feedback network with multilevel atrous spatial pyramid pooling for RGB-D saliency detection | |
CN113936235A (en) | Video saliency target detection method based on quality evaluation | |
CN111738054B (en) | Behavior anomaly detection method based on space-time self-encoder network and space-time CNN | |
CN112150450B (en) | Image tampering detection method and device based on dual-channel U-Net model | |
Lin et al. | Image manipulation detection by multiple tampering traces and edge artifact enhancement | |
CN110020658B (en) | Salient object detection method based on multitask deep learning | |
CN113313810A (en) | 6D attitude parameter calculation method for transparent object | |
CN114339362B (en) | Video bullet screen matching method, device, computer equipment and storage medium | |
Kang et al. | SdBAN: Salient object detection using bilateral attention network with dice coefficient loss | |
CN111652181B (en) | Target tracking method and device and electronic equipment | |
CN113065551A (en) | Method for performing image segmentation using a deep neural network model | |
Xia et al. | Pedestrian detection algorithm based on multi-scale feature extraction and attention feature fusion | |
Kompella et al. | A semi-supervised recurrent neural network for video salient object detection | |
CN114693952A (en) | RGB-D significance target detection method based on multi-modal difference fusion network | |
Aldhaheri et al. | MACC Net: Multi-task attention crowd counting network | |
CN111242068A (en) | Behavior recognition method and device based on video, electronic equipment and storage medium | |
CN112784745B (en) | Confidence self-adaption and difference enhancement based video salient object detection method | |
CN117351487A (en) | Medical image segmentation method and system for fusing adjacent area and edge information | |
CN110942463B (en) | Video target segmentation method based on generation countermeasure network | |
CN116758449A (en) | Video salient target detection method and system based on deep learning | |
WO2023185074A1 (en) | Group behavior recognition method based on complementary spatio-temporal information modeling | |
Gowda et al. | Foreground segmentation network using transposed convolutional neural networks and up sampling for multiscale feature encoding | |
Huang et al. | Deep Multimodal Fusion Autoencoder for Saliency Prediction of RGB‐D Images | |
CN110211146B (en) | Video foreground segmentation method and device for cross-view simulation |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20220114 |