CN113936235A - Video saliency target detection method based on quality evaluation - Google Patents
Video saliency target detection method based on quality evaluation
- Publication number
- CN113936235A (application CN202111075792.5A)
- Authority
- CN
- China
- Prior art keywords
- features
- rgb
- input
- module
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F18/253—Fusion techniques of extracted features
- G—PHYSICS; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/045—Combinations of networks
- G—PHYSICS; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/08—Learning methods
Abstract
The invention discloses a video saliency target detection method based on quality evaluation. A video saliency target detection network framework is first constructed on a two-stream encoder-decoder structure with a ResNet-101 backbone: one branch takes an RGB image as input and extracts the spatial features of the image (the RGB branch), while the other branch takes an optical flow image as input and extracts temporal information between frames (the motion branch). Two consecutive video frames are processed by the RAFT algorithm to compute an optical flow map that reflects the motion of objects in the video. Compared with existing two-stream video saliency target detection methods, the proposed framework can adaptively capture accurate spatial and temporal information and thereby obtain accurate prediction results.
Description
Technical Field
The invention belongs to the field of computer vision, and aims to locate and segment the most attention-attracting object by utilizing spatial clues and temporal clues hidden in a video sequence. This task stems from the human visual attention behavior in cognitive studies, i.e. a rapid shift of attention to the most informative regions of the visual scene.
Background
Existing methods can partially solve this problem and can be roughly divided into four categories: Video Salient Object Detection (VSOD) methods based on feature extraction, on long short-term memory, on attention mechanisms, and on parallel networks.
Feature-extraction-based VSOD methods attempt to combine spatial information with motion cues using prior knowledge, such as spatio-temporal background priors and low-rank consistency; the performance of such methods is limited by the quality of the extracted features. VSOD methods based on long short-term memory networks extract spatial information from individual frames of a video sequence and model temporal information with a convolutional memory unit such as ConvLSTM. Attention-based VSOD methods use a non-local mechanism to capture temporal information across several consecutive frames. Parallel-network VSOD methods typically adopt a two-stream framework, in which one tributary extracts spatial features of the RGB image and the other extracts temporal features from optical flow images generated by an optical flow algorithm. These methods are limited by the quality of the optical flow images and by whether the output features properly fuse spatial and temporal information.
The problems and challenges of current VSOD methods are mainly the following. First, the spatial cues hidden in each frame are often difficult to exploit when the foreground and background share similar features: RGB images with low contrast between salient objects and the background introduce misleading information that interferes with the prediction. Second, the temporal cues hidden between different frames may be disturbed by fast motion, large displacements and illumination changes: noise in the optical flow images leads to erroneous predictions, and even temporal information from accurate optical flow images can confuse the spatial information of several moving objects in the scene. Third, the predicted edges are often rough: spatio-temporal information can usually determine the location of salient objects, but insufficient emphasis on shallow features blurs the edge information.
VSOD has a wide range of application scenarios: as an effective preprocessing technique, it has been widely applied to many computer vision tasks such as retrieval, recognition, segmentation, retargeting, enhancement, pedestrian detection, evaluation and compression.
Disclosure of Invention
Two-stream VSOD methods are limited by the quality of the optical flow images, by the quality of the extracted RGB features, and by whether the output features properly integrate spatial and temporal information. The present invention therefore provides a video saliency target detection method based on quality evaluation: a new framework that contains a module for evaluating the quality of optical flow features (temporal information) and RGB features (spatial information), so that the framework can adaptively capture accurate spatial and temporal information to predict saliency maps. Specifically, an adaptive gate module for quality evaluation (the quality evaluation module) is introduced into both the encoding and the decoding parts of the framework. This module estimates the quality of input features by computing their MAE value; high-quality features receive larger weights and are retained, while low-quality features receive smaller weights and are suppressed, which acts as a screening step and passes on effective information. Secondly, considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced. In addition, to better fuse spatial and temporal information, an attention-based module (the spatio-temporal information interaction module) is introduced, so that spatial and temporal information guide and promote each other and better spatio-temporal features are learned. Finally, a double difference enhancement module is proposed that focuses on capturing the difference information between spatial and temporal cues and generating fused features.
A video saliency target detection method based on quality assessment comprises the following steps:
Step (1): constructing a video saliency target detection network framework;
The video saliency target detection network framework is based on a two-stream encoder-decoder structure with a ResNet-101 backbone. One branch takes an RGB image as input and extracts the spatial features of the image (the RGB branch); the other branch takes an optical flow image as input and extracts temporal information between frames (the motion branch). Two consecutive video frames are processed by the RAFT algorithm to compute an optical flow map that reflects the motion of objects in the video.
Step (2): In the encoding part, the two branches extract features separately; the quality evaluation module evaluates the quality of the output features of each layer and screens the features that provide guidance, and the spatio-temporal information interaction module enhances the features so that the spatio-temporal features of the two branches guide each other. In the decoding part, the quality evaluation module again evaluates the quality of each layer's encoder output features, the double difference enhancement module fuses the spatio-temporal features, deep features are passed to shallow features in a cascaded manner, and the prediction map is finally obtained. Considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced into the encoding part, and its output is fused with the output of the RGB tributary to obtain the multi-scale fused RGB features.
To reduce the number of model parameters, a convolution is applied in the decoding part to each layer of ResNet output features to reduce the number of channels: the first four layers are reduced to 48 channels and the fifth layer to 256 channels. Meanwhile, the backbone networks of the two branches share parameters, and the quality evaluation modules of the layers share parameters.
Step (3): Loss function:
For the final predicted saliency map, the loss between it and the correct label is calculated according to the definition of the loss function in BASNet, as shown in Equation 9.
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)    (9)
For quality assessment, the intermediate saliency maps are supervised by the correct labels with the same loss function as Equation 9; the mean absolute error (MAE) between each intermediate saliency map and the correct label is calculated, and the MAE supervises the quality score through Equation 10.
Step (4): The video saliency target detection network is pre-trained with the training set of the DUTS dataset; after one round of pre-training, the whole network is trained for another round with the training set of the DAVIS dataset. Overfitting is prevented by augmenting the training data with random horizontal flipping and random rotation of the input images. The model is trained with the Adam optimizer at an initial learning rate of 1e-5 until convergence.
Further, the multi-scale perception module:
The multi-scale perception module enhances the features of the final output of the RGB tributary in the encoding stage. Structurally, it consists of two ResNet-101 networks whose parameters are shared with the ResNet-101 backbone of the RGB tributary. The input image of the RGB tributary is first down-sampled to 1/2 and 1/4 of its size, and the down-sampled images are fed into the two ResNet-101 networks. After five convolutional layers, two output features of different scales are obtained; these two features are fused with the final-layer output features of the RGB tributary by step-by-step up-sampling and concatenation, and the fused features are combined with the features finally obtained by the RGB tributary to yield the multi-scale fused RGB features, as shown in Equation 1:
Here, concat(·,·) denotes concatenation along the channel dimension and the up-sampling operation doubles the spatial resolution; the two scale-specific features correspond to the 1/4- and 1/2-scale input images, E_in denotes the features of the RGB tributary, and E_out is the fused output feature.
Further, the quality evaluation module:
The quality evaluation module consists of two sub-networks: a prediction sub-network and an evaluation sub-network. The prediction sub-network consists of three convolutional layers and predicts a saliency map that is supervised by the correct label. The evaluation sub-network consists of three convolutional layers, a global average pooling layer and a Sigmoid activation function, and computes a quality score that is supervised by the mean absolute error (MAE) between the predicted map and the correct label. The predicted saliency map is concatenated with the input features and used as the input of the evaluation sub-network. The quality score is multiplied with the input features to obtain the output features of the quality evaluation module, as shown in Equations 2 and 3. Input features with higher quality scores are retained, while input features with lower scores, regarded as containing much noise, are removed.
In these formulas, the first two symbols denote the convolution operations of the evaluation sub-network and of the prediction sub-network, s_i is the quality score, E_i is the input feature, and the last symbol is the feature after quality evaluation; the multiplication is element-wise and σ is the Sigmoid activation function.
Further, the spatio-temporal information interaction module:
The spatio-temporal information interaction module keeps semantic consistency among different features. Taking the motion tributary features as input, an attention operation is first applied along the channel dimension (Equation 4) and then along the spatial dimension (Equation 5); the enhanced features are added to the features of the RGB tributary to obtain spatial information guided by temporal information (Equations 6 and 7). Similarly, the RGB tributary features are enhanced by attention operations along the channel and spatial dimensions and added to the motion tributary features to obtain temporal information guided by spatial information.
Here, the channel attention function applies max pooling over the spatial dimension followed by a fully connected operation, with σ the Sigmoid activation function and the corresponding multiplication performed along the channel dimension; the spatial attention function applies max pooling over the channel dimension followed by a convolution operation, with the corresponding multiplication performed along the spatial dimension; the remaining symbols denote the input motion features, the motion features enhanced by the attention operations, the output RGB features, and the element-wise addition operation.
Further, the double difference enhancement module:
The double difference enhancement module mines the difference information between the RGB and optical flow features. For the quality-evaluated RGB and optical flow features, difference information is extracted by subtraction and convolution operations, and the original information is enhanced by using the difference information as a supplement, as shown in Equation 8.
Here, one symbol denotes the convolution operation applied to the difference feature and the other denotes the enhanced (spatial or temporal) feature.
The invention has the following beneficial effects:
the invention provides a novel video saliency target detection framework, wherein a multi-scale perception module, a quality evaluation module, a temporal-spatial information interaction module and a double-difference enhancement module are introduced, and compared with the existing video saliency target detection method based on double flow, the framework provided by the invention can capture accurate spatial and temporal information in a self-adaptive manner, so that an accurate prediction result is obtained.
Drawings
FIG. 1 shows the encoding part of the framework;
FIG. 2 shows the decoding part of the framework;
FIG. 3 is a block diagram of the quality evaluation module;
FIG. 4 is a diagram of the double difference enhancement module.
Detailed Description
The method of the invention is further described below with reference to the accompanying drawings and examples.
A video saliency target detection method based on quality assessment comprises the following steps:
Step (1): constructing a video saliency target detection network framework;
The video saliency target detection network framework is based on a two-stream encoder-decoder structure with a ResNet-101 backbone. One branch takes an RGB image as input and extracts the spatial features of the image (the RGB branch); the other branch takes an optical flow image as input and extracts temporal information between frames (the motion branch). Two consecutive video frames are processed by the RAFT algorithm to compute an optical flow map that reflects the motion of objects in the video.
Step (2): In the encoding part, the two branches extract features separately; the quality evaluation module evaluates the quality of the output features of each layer and screens the features that provide guidance, and the spatio-temporal information interaction module enhances the features so that the spatio-temporal features of the two branches guide each other. In the decoding part, the quality evaluation module again evaluates the quality of each layer's encoder output features, the double difference enhancement module fuses the spatio-temporal features, deep features are passed to shallow features in a cascaded manner, and the prediction map is finally obtained. Considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced into the encoding part, and its output is fused with the output of the RGB tributary to obtain the multi-scale fused RGB features.
To reduce the number of model parameters, a convolution is applied in the decoding part to each layer of ResNet output features to reduce the number of channels: the first four layers are reduced to 48 channels and the fifth layer to 256 channels. Meanwhile, the backbone networks of the two branches share parameters, and the quality evaluation modules of the layers share parameters.
Step (3): Loss function:
For the final predicted saliency map, the loss between it and the correct label is calculated according to the definition of the loss function in BASNet, as shown in Equation 9.
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)    (9)
For quality assessment, the intermediate saliency maps are supervised by the correct labels with the same loss function as Equation 9; the mean absolute error (MAE) between each intermediate saliency map and the correct label is calculated, and the MAE supervises the quality score through Equation 10.
Step (4): The video saliency target detection network is pre-trained with the training set of the DUTS dataset; after one round of pre-training, the whole network is trained for another round with the training set of the DAVIS dataset. Overfitting is prevented by augmenting the training data with random horizontal flipping and random rotation of the input images. The model is trained with the Adam optimizer at an initial learning rate of 1e-5 until convergence.
FIGS. 1 and 2 show the structure of the framework according to the present invention. The framework is based on a two-stream encoder-decoder structure with a ResNet-101 backbone. One branch takes an RGB image as input and extracts the spatial features of the image (the RGB branch); the other branch takes an optical flow image as input and extracts temporal information between frames (the motion branch). Two consecutive video frames are processed by the RAFT algorithm to compute an optical flow map that reflects the motion of objects in the video. In the encoding part, the two branches extract features separately; the quality of each layer's output features is evaluated, features that provide guidance are screened, features are enhanced by an attention-based module, and the spatio-temporal features of the two branches guide each other. In the decoding part, the output features of each encoder layer are evaluated again, the spatio-temporal features are fused by the double difference enhancement module, deep features are passed to shallow features in a cascaded manner, and the prediction map is finally obtained.
To reduce the number of model parameters, a convolution is applied in the decoding part to each layer of ResNet output features to reduce the number of channels: the first four layers are reduced to 48 channels and the fifth layer to 256 channels. Meanwhile, the backbone networks of the two branches share parameters, and the quality evaluation modules of the layers share parameters.
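For illustration only, the two sketches below show how the pieces described so far could be assembled. The first computes the optical flow map for two consecutive frames, assuming the RAFT implementation shipped with torchvision (raft_large, torchvision >= 0.13); the compute_flow helper, the padding to multiples of 8 and the uint8 frame format are conveniences of this example rather than part of the claimed method.

```python
# Hedged sketch: optical flow for two consecutive frames via torchvision's RAFT.
# Assumes frames are HxWx3 uint8 numpy arrays and torchvision >= 0.13 is installed.
import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()  # normalizes both frames the way RAFT expects

@torch.no_grad()
def compute_flow(frame_t, frame_t1):
    """Return a (2, H, W) optical flow tensor describing motion from frame_t to frame_t1."""
    img1 = torch.from_numpy(frame_t).permute(2, 0, 1).unsqueeze(0)
    img2 = torch.from_numpy(frame_t1).permute(2, 0, 1).unsqueeze(0)
    img1, img2 = preprocess(img1, img2)
    # RAFT expects spatial sizes divisible by 8, so pad on the right/bottom if needed.
    h, w = img1.shape[-2:]
    pad_h, pad_w = (-h) % 8, (-w) % 8
    img1 = F.pad(img1, (0, pad_w, 0, pad_h))
    img2 = F.pad(img2, (0, pad_w, 0, pad_h))
    flow = raft(img1, img2)[-1]    # the last element of the list is the refined estimate
    return flow[0, :, :h, :w]      # crop the padding away
```

The flow field is typically rendered as a color optical flow image before being fed to the motion branch. The second sketch is a rough layout of the two-stream encoder, assuming torchvision's ResNet-101 as the shared backbone; the placement of the 1x1 channel-reduction convolutions and the decision to share them between the two branches are illustrative choices of this example, not details taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101, ResNet101_Weights

class TwoStreamEncoder(nn.Module):
    """Shared ResNet-101 backbone applied to both the RGB image and the optical flow image,
    with 1x1 convolutions reducing the five stages to 48/48/48/48/256 channels."""
    def __init__(self):
        super().__init__()
        backbone = resnet101(weights=ResNet101_Weights.IMAGENET1K_V1)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        in_ch, out_ch = [64, 256, 512, 1024, 2048], [48, 48, 48, 48, 256]
        self.reduce = nn.ModuleList(nn.Conv2d(i, o, kernel_size=1) for i, o in zip(in_ch, out_ch))

    def _features(self, x):
        x = self.stem[:-1](x)            # conv1 + bn + relu (64 channels)
        feats = [x]
        x = self.stem[-1](x)             # max pooling
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                     # five feature maps, shallow to deep

    def forward(self, rgb, flow):
        rgb_feats = [r(f) for r, f in zip(self.reduce, self._features(rgb))]
        flow_feats = [r(f) for r, f in zip(self.reduce, self._features(flow))]
        return rgb_feats, flow_feats
```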
Introduction to the specific modules:
(1) The multi-scale perception module:
Multi-scale analysis is essentially the sampling of a signal at different granularities: different features can be observed at different scales, and thus different tasks can be accomplished. In general, finer-grained (denser) sampling reveals more detail, while coarser-grained (sparser) sampling reveals the overall trend. The multi-scale perception module enhances the features of the final output of the RGB tributary in the encoding stage. Structurally, it consists of two ResNet-101 networks whose parameters are shared with the ResNet-101 backbone of the RGB tributary. The input image of the RGB tributary is first down-sampled to 1/2 and 1/4 of its size, and the down-sampled images are fed into the two ResNet-101 networks. After five convolutional layers, two output features of different scales are obtained; these two features are fused with the final-layer output features of the RGB tributary by step-by-step up-sampling and concatenation, and the fused features are combined with the features finally obtained by the RGB tributary to yield the multi-scale fused RGB features, as shown in Equation 1:
Here, concat(·,·) denotes concatenation along the channel dimension and the up-sampling operation doubles the spatial resolution; the two scale-specific features correspond to the 1/4- and 1/2-scale input images, E_in denotes the features of the RGB tributary, and E_out is the fused output feature.
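The sketch below illustrates the step-by-step up-sample-and-concatenate fusion that Equation 1 refers to; the channel count, kernel sizes and the three fusion convolutions are assumptions made for this example, since only the qualitative description is given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePerception(nn.Module):
    """Fuse the last-stage features of the full-resolution, 1/2 and 1/4 inputs
    by progressive up-sampling and channel concatenation (cf. Equation 1)."""
    def __init__(self, channels=256):
        super().__init__()
        self.fuse_quarter_half = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse_half_full = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse_out = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, e_in, e_half, e_quarter):
        # e_in: last-stage RGB feature of the original image; e_half / e_quarter:
        # last-stage features of the 1/2 and 1/4 down-sampled inputs (shared backbone).
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.fuse_quarter_half(torch.cat([up(e_quarter), e_half], dim=1))
        x = self.fuse_half_full(torch.cat([up(x), e_in], dim=1))
        return self.fuse_out(torch.cat([x, e_in], dim=1))   # multi-scale fused RGB feature
```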
(2) The quality evaluation module:
The quality evaluation module evaluates and supervises the extracted features and removes noise information, as shown in FIG. 3. It predicts a quality score that represents the reliability of a feature and recalibrates that feature. The module consists of two sub-networks: a prediction sub-network and an evaluation sub-network. The prediction sub-network consists of three convolutional layers and predicts a saliency map that is supervised by the correct label. The evaluation sub-network consists of three convolutional layers, a global average pooling layer and a Sigmoid activation function, and computes a quality score that is supervised by the mean absolute error (MAE) between the predicted map and the correct label. The predicted saliency map is concatenated with the input features and used as the input of the evaluation sub-network. The quality score is multiplied with the input features to obtain the output features of the quality evaluation module, as shown in Equations 2 and 3. Input features with higher quality scores are retained, while input features with lower scores, regarded as containing much noise, are removed.
In these formulas, the first two symbols denote the convolution operations of the evaluation sub-network and of the prediction sub-network, s_i is the quality score, E_i is the input feature, and the last symbol is the feature after quality evaluation; the multiplication is element-wise and σ is the Sigmoid activation function.
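A minimal sketch of the quality evaluation module as described above follows, with a three-layer prediction sub-network, a three-layer evaluation sub-network ending in global average pooling and a Sigmoid, and the score multiplied with the input feature; kernel sizes and hidden widths are assumptions, and the supervision of the intermediate map and of the score (Equation 10) is left out.

```python
import torch
import torch.nn as nn

class QualityEvaluation(nn.Module):
    """Predict an intermediate saliency map, score its reliability, and re-weight the
    input feature by that score (cf. Equations 2 and 3)."""
    def __init__(self, channels):
        super().__init__()
        self.predict = nn.Sequential(                  # prediction sub-network: 3 conv layers
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1))
        self.evaluate = nn.Sequential(                 # evaluation sub-network: 3 conv layers,
            nn.Conv2d(channels + 1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Sigmoid())     # global average pooling + Sigmoid

    def forward(self, feat):
        pred = torch.sigmoid(self.predict(feat))               # intermediate saliency map
        score = self.evaluate(torch.cat([feat, pred], dim=1))  # quality score in (0, 1)
        return score * feat, pred, score                       # re-weighted feature
```

During training, pred would be supervised by the correct label and score by the MAE between pred and that label, as stated in the text.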
(3) The spatio-temporal information interaction module:
The spatio-temporal information interaction module keeps semantic consistency among different features: spatial and temporal information guide and promote each other through this module. Taking the motion tributary features as input, an attention operation is first applied along the channel dimension (Equation 4) and then along the spatial dimension (Equation 5); the enhanced features are added to the features of the RGB tributary to obtain spatial information guided by temporal information (Equations 6 and 7). Similarly, the RGB tributary features are enhanced by attention operations along the channel and spatial dimensions and added to the motion tributary features to obtain temporal information guided by spatial information.
Here, the channel attention function applies max pooling over the spatial dimension followed by a fully connected operation, with σ the Sigmoid activation function and the corresponding multiplication performed along the channel dimension; the spatial attention function applies max pooling over the channel dimension followed by a convolution operation, with the corresponding multiplication performed along the spatial dimension; the remaining symbols denote the input motion features, the motion features enhanced by the attention operations, the output RGB features, and the element-wise addition operation.
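One direction of this interaction could be sketched as below (channel attention from spatial max pooling and a fully connected layer, spatial attention from channel max pooling and a convolution, then addition to the other tributary); the reduction ratio, the 7x7 kernel and the sharing of the attention layers between the two directions are assumptions of this example.

```python
import torch
import torch.nn as nn

class SpatioTemporalInteraction(nn.Module):
    """Enhance one tributary with channel and spatial attention and add it to the other
    (cf. Equations 4-7); applied symmetrically in both directions."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                                nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)

    def enhance(self, x):
        b, c, _, _ = x.shape
        # channel attention: max pooling over the spatial dimension, then FC + Sigmoid
        ca = torch.sigmoid(self.fc(x.amax(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca
        # spatial attention: max pooling over the channel dimension, then conv + Sigmoid
        sa = torch.sigmoid(self.spatial_conv(x.amax(dim=1, keepdim=True)))
        return x * sa

    def forward(self, rgb_feat, motion_feat):
        rgb_guided = rgb_feat + self.enhance(motion_feat)      # spatial info guided by time
        motion_guided = motion_feat + self.enhance(rgb_feat)   # temporal info guided by space
        return rgb_guided, motion_guided
```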
(4) The double difference enhancement module:
The color saliency obtained from the RGB features and the motion saliency obtained from the optical flow features are complementary, and fusing the two yields an information-rich saliency map. However, most of the complementary information is hidden in the difference between the RGB and optical flow features. To fully exploit this complementarity, a double difference enhancement module is proposed to mine the difference information between the RGB and optical flow features, as shown in FIG. 4. For the quality-evaluated RGB and optical flow features, difference information is extracted by subtraction and convolution operations, and the original information is enhanced by using the difference information as a supplement, as shown in Equation 8.
Here, one symbol denotes the convolution operation applied to the difference feature and the other denotes the enhanced (spatial or temporal) feature.
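A small sketch of the double difference idea follows; reading "double" as taking the difference in both directions, each followed by its own convolution, is an interpretation made for this example rather than a detail stated in the text.

```python
import torch.nn as nn

class DoubleDifferenceEnhancement(nn.Module):
    """Mine the difference between the quality-evaluated RGB and optical flow features and
    feed it back as a supplement to each branch (cf. Equation 8)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_rgb = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_flow = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, rgb_feat, flow_feat):
        enhanced_rgb = rgb_feat + self.conv_rgb(rgb_feat - flow_feat)     # enhanced spatial feature
        enhanced_flow = flow_feat + self.conv_flow(flow_feat - rgb_feat)  # enhanced temporal feature
        return enhanced_rgb, enhanced_flow
```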
II. Experimental details:
(1) Loss function:
For the final predicted saliency map, the loss between it and the correct label is calculated according to the definition of the loss function in BASNet, as shown in Equation 9.
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)    (9)
For quality assessment, the intermediate saliency maps are supervised with the same loss function as Equation 9; the mean absolute error (MAE) between each intermediate saliency map and the correct annotation is calculated, and the MAE supervises the quality score through Equation 10.
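The hybrid loss of Equation 9 could be assembled roughly as follows; only the BCE + SSIM + IoU combination comes from the text, while the simplified single-scale SSIM window and the soft IoU formulation are common conventions assumed here, not the exact BASNet implementation.

```python
import torch.nn.functional as F

def ssim_loss(pred, target, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM, computed with an average-pooling window (simplified, single scale)."""
    pad = window // 2
    mu_p = F.avg_pool2d(pred, window, 1, pad)
    mu_t = F.avg_pool2d(target, window, 1, pad)
    var_p = F.avg_pool2d(pred * pred, window, 1, pad) - mu_p ** 2
    var_t = F.avg_pool2d(target * target, window, 1, pad) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, window, 1, pad) - mu_p * mu_t
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return 1 - ssim.clamp(0, 1).mean()

def iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss averaged over the batch."""
    inter = (pred * target).sum(dim=(2, 3))
    union = (pred + target - pred * target).sum(dim=(2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def hybrid_loss(pred, target):
    """L_f = L_BCE + L_SSIM + L_IoU as in Equation 9; pred and target lie in [0, 1]."""
    return F.binary_cross_entropy(pred, target) + ssim_loss(pred, target) + iou_loss(pred, target)
```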
(2) Datasets:
In the experiments, an image salient object detection dataset (DUTS) and a video salient object detection dataset (DAVIS) are used for model training, and the video salient object detection datasets DAVIS and DAVSOD are used to test the performance of the model.
The DUTS dataset, which contains 5019 test images and 10553 training images, is currently the largest image saliency detection dataset. The DAVIS dataset contains 50 high-quality video sequences with 3455 frames in total. DAVSOD is currently the largest VSOD dataset, with 226 videos and 23938 frames, covering different real scenes, objects, instances and actions.
(3) Evaluation metrics:
The evaluation of video saliency target detection mainly uses three indices: F-measure, S-measure and Mean Absolute Error (MAE).
The F-measure, defined in Equation 11, is the weighted harmonic mean of precision and recall under a non-negative weight β: F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall) (11), where β² is empirically set to 0.3.
The S-measure, defined in Equation 12, evaluates the structural similarity between the predicted saliency map and the corresponding correct annotation map: S = μ · S_o + (1 - μ) · S_r (12), where μ is typically set to 0.5 and S_o and S_r denote the object-aware and region-aware structural similarity, respectively.
The Mean Absolute Error (MAE), defined in Equation 13, is MAE = (1 / (W · H)) Σ_x Σ_y |S(x, y) - G(x, y)| (13), where S(x, y) is the pixel value of the predicted saliency map, G(x, y) is the pixel value of the correct annotation map, and W and H are the width and height of the image.
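To make two of these indices concrete, a small NumPy sketch is given below; the adaptive threshold of twice the map's mean used for the F-measure is a common convention assumed here, and the S-measure is not sketched because its region- and object-aware terms are not spelled out in the text.

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error (Equation 13); pred and gt are HxW maps with values in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3):
    """F-measure (Equation 11), binarizing pred at twice its mean value (assumed convention)."""
    threshold = min(2.0 * pred.mean(), 1.0)
    binary = pred >= threshold
    positives = gt > 0.5
    tp = np.logical_and(binary, positives).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (positives.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```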
(4) Experimental steps:
We implemented our method in PyTorch. We use ResNet-101 pre-trained on ImageNet as the initial backbone and RAFT to generate the optical flow images. First, we pre-train our model with the training set of the DUTS dataset; after one round of pre-training, we train the whole network with the training set of the DAVIS dataset for another round. We prevent overfitting by augmenting the training data with random horizontal flipping and random rotation of the input images. We train the model using the Adam optimizer with an initial learning rate of 1e-5 until convergence.
Claims (5)
1. A video saliency target detection method based on quality assessment is characterized by comprising the following steps:
step (1): constructing a video saliency target detection network framework;
the video saliency target detection network framework is based on a two-stream encoder-decoder structure with a ResNet-101 backbone; one tributary takes an RGB image as input and extracts the spatial features of the image, and is called the RGB tributary; the other tributary takes an optical flow image as input and extracts temporal information between frames, and is called the motion tributary; two consecutive video frames are processed by the RAFT algorithm to compute an optical flow map that reflects the motion of objects in the video;
step (2): in the encoding part, the two tributaries extract features separately; the quality evaluation module evaluates the quality of the output features of each layer and screens the features that provide guidance, and the spatio-temporal information interaction module enhances the features so that the spatio-temporal features of the two tributaries guide each other; in the decoding part, the quality evaluation module again evaluates the quality of each layer's encoder output features, the double difference enhancement module fuses the spatio-temporal features, deep features are passed to shallow features in a cascaded manner, and the prediction map is finally obtained; considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced into the encoding part, and its output is fused with the output of the RGB tributary to obtain the multi-scale fused RGB features;
to reduce the number of model parameters, a convolution is applied in the decoding part to each layer of ResNet output features to reduce the number of channels: the first four layers are reduced to 48 channels and the fifth layer to 256 channels; meanwhile, the backbone networks of the two tributaries share parameters, and the quality evaluation modules of the layers share parameters;
step (3): loss function:
for the final predicted saliency map, calculating the loss between it and the correct label according to the definition of the loss function in BASNet, as shown in formula 9;
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)    (9)
for quality evaluation, the intermediate saliency maps are supervised by the correct labels using the same loss function as formula 9; the mean absolute error MAE between each intermediate saliency map and the correct label is calculated, and the MAE supervises the quality score through formula 10;
step (4): pre-training the video saliency target detection network with the training set of the DUTS dataset and, after one round of pre-training is finished, training the whole network for another round with the training set of the DAVIS dataset; the amount of training data is augmented by applying random horizontal flipping and random rotation to the input images, so that overfitting is prevented; the model is trained with an Adam optimizer at an initial learning rate of 1e-5 until convergence.
2. The method according to claim 1, wherein the multi-scale perception module:
the multi-scale perception module enhances the features of the final output of the RGB tributary in the encoding stage; structurally, it consists of two ResNet-101 networks whose parameters are shared with the ResNet-101 backbone of the RGB tributary; the input image of the RGB tributary is first down-sampled to 1/2 and 1/4 of its size, and the down-sampled images are fed into the two ResNet-101 networks; after five convolutional layers, two output features of different scales are obtained; these two features are fused with the final-layer output features of the RGB tributary by step-by-step up-sampling and concatenation, and the fused features are combined with the features finally obtained by the RGB tributary to yield the multi-scale fused RGB features, as shown in formula 1:
3. The method according to claim 1, wherein the quality evaluation module:
the quality evaluation module consists of two sub-networks, a prediction sub-network and an evaluation sub-network; the prediction sub-network consists of three convolutional layers and predicts a saliency map that is supervised by the correct label; the evaluation sub-network consists of three convolutional layers, a global average pooling layer and a Sigmoid activation function, and computes a quality score that is supervised by the mean absolute error MAE between the predicted map and the correct label; the predicted saliency map is concatenated with the input features and used as the input features of the evaluation sub-network; the quality score is multiplied with the input features to obtain the output features of the quality evaluation module, as shown in formulas 2 and 3; input features with higher quality scores are retained, while input features with lower scores, regarded as containing much noise, are removed;
in these formulas, the first two symbols denote the convolution operations of the evaluation sub-network and of the prediction sub-network, s_i is the quality score, E_i is the input feature, and the last symbol is the feature after quality evaluation; the multiplication is element-wise and σ is the Sigmoid activation function.
4. The method as claimed in claim 1, wherein the spatio-temporal information interaction module:
the spatio-temporal information interaction module keeps semantic consistency among different features; taking the motion tributary features as input, an attention operation is first applied along the channel dimension, as shown in formula 4, and then along the spatial dimension, as shown in formula 5; the enhanced features are added to the features of the RGB tributary to obtain spatial information guided by temporal information, as shown in formulas 6 and 7; similarly, the RGB tributary features are enhanced by attention operations along the channel and spatial dimensions and added to the motion tributary features to obtain temporal information guided by spatial information;
here, the channel attention function applies max pooling over the spatial dimension followed by a fully connected operation, with σ the Sigmoid activation function and the corresponding multiplication performed along the channel dimension; the spatial attention function applies max pooling over the channel dimension followed by a convolution operation, with the corresponding multiplication performed along the spatial dimension; the remaining symbols denote the input motion features, the motion features enhanced by the attention operations, the output RGB features, and the element-wise addition operation.
5. The method according to claim 1, wherein the double difference enhancement module:
the double difference enhancement module mines the difference information between the RGB and optical flow features; for the quality-evaluated RGB and optical flow features, difference information is extracted by subtraction and convolution operations, and the original information is enhanced by using the difference information as a supplement, as shown in formula 8;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111075792.5A CN113936235A (en) | 2021-09-14 | 2021-09-14 | Video saliency target detection method based on quality evaluation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111075792.5A CN113936235A (en) | 2021-09-14 | 2021-09-14 | Video saliency target detection method based on quality evaluation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113936235A true CN113936235A (en) | 2022-01-14 |
Family
ID=79275690
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111075792.5A Withdrawn CN113936235A (en) | 2021-09-14 | 2021-09-14 | Video saliency target detection method based on quality evaluation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113936235A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114612979A (en) * | 2022-03-09 | 2022-06-10 | 平安科技(深圳)有限公司 | Living body detection method and device, electronic equipment and storage medium |
CN116994006A (en) * | 2023-09-27 | 2023-11-03 | 江苏源驶科技有限公司 | Collaborative saliency detection method and system for fusing image saliency information |
CN117173394A (en) * | 2023-08-07 | 2023-12-05 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114612979A (en) * | 2022-03-09 | 2022-06-10 | 平安科技(深圳)有限公司 | Living body detection method and device, electronic equipment and storage medium |
CN114612979B (en) * | 2022-03-09 | 2024-05-31 | 平安科技(深圳)有限公司 | Living body detection method and device, electronic equipment and storage medium |
CN117173394A (en) * | 2023-08-07 | 2023-12-05 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
CN117173394B (en) * | 2023-08-07 | 2024-04-02 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
CN116994006A (en) * | 2023-09-27 | 2023-11-03 | 江苏源驶科技有限公司 | Collaborative saliency detection method and system for fusing image saliency information |
CN116994006B (en) * | 2023-09-27 | 2023-12-08 | 江苏源驶科技有限公司 | Collaborative saliency detection method and system for fusing image saliency information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112132156B (en) | Image saliency target detection method and system based on multi-depth feature fusion | |
Zhou et al. | HFNet: Hierarchical feedback network with multilevel atrous spatial pyramid pooling for RGB-D saliency detection | |
CN113936235A (en) | Video saliency target detection method based on quality evaluation | |
CN111738054B (en) | Behavior anomaly detection method based on space-time self-encoder network and space-time CNN | |
CN112150450B (en) | Image tampering detection method and device based on dual-channel U-Net model | |
Lin et al. | Image manipulation detection by multiple tampering traces and edge artifact enhancement | |
CN110020658B (en) | Salient object detection method based on multitask deep learning | |
CN113313810A (en) | 6D attitude parameter calculation method for transparent object | |
CN114339362B (en) | Video bullet screen matching method, device, computer equipment and storage medium | |
Kang et al. | SdBAN: Salient object detection using bilateral attention network with dice coefficient loss | |
CN111652181B (en) | Target tracking method and device and electronic equipment | |
CN113065551A (en) | Method for performing image segmentation using a deep neural network model | |
Xia et al. | Pedestrian detection algorithm based on multi-scale feature extraction and attention feature fusion | |
Kompella et al. | A semi-supervised recurrent neural network for video salient object detection | |
CN114693952A (en) | RGB-D significance target detection method based on multi-modal difference fusion network | |
Aldhaheri et al. | MACC Net: Multi-task attention crowd counting network | |
CN111242068A (en) | Behavior recognition method and device based on video, electronic equipment and storage medium | |
CN112784745B (en) | Confidence self-adaption and difference enhancement based video salient object detection method | |
CN117351487A (en) | Medical image segmentation method and system for fusing adjacent area and edge information | |
CN110942463B (en) | Video target segmentation method based on generation countermeasure network | |
CN116758449A (en) | Video salient target detection method and system based on deep learning | |
WO2023185074A1 (en) | Group behavior recognition method based on complementary spatio-temporal information modeling | |
Gowda et al. | Foreground segmentation network using transposed convolutional neural networks and up sampling for multiscale feature encoding | |
Huang et al. | Deep Multimodal Fusion Autoencoder for Saliency Prediction of RGB‐D Images | |
CN110211146B (en) | Video foreground segmentation method and device for cross-view simulation |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20220114 |