CN113936235A - Video saliency target detection method based on quality evaluation - Google Patents

Video saliency target detection method based on quality evaluation

Info

Publication number
CN113936235A
Authority
CN
China
Prior art keywords
features
rgb
input
module
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111075792.5A
Other languages
Chinese (zh)
Inventor
颜成钢
高含笑
王超怡
孙垚棋
张继勇
李宗鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202111075792.5A priority Critical patent/CN113936235A/en
Publication of CN113936235A publication Critical patent/CN113936235A/en
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video saliency target detection method based on quality evaluation. A video saliency target detection network framework is first constructed on a dual-stream encoding-decoding structure with a ResNet101 backbone network. One tributary takes an RGB image as input and extracts the spatial features of the image (the RGB tributary); the other tributary takes an optical flow image as input and extracts the temporal information between images (the motion tributary). Two consecutive frames of the video are processed by the RAFT algorithm to compute an optical flow map that reflects the movement of objects in the video. Compared with existing dual-stream video saliency target detection methods, the proposed framework can adaptively capture accurate spatial and temporal information and thus obtain accurate prediction results.

Description

Video saliency target detection method based on quality evaluation
Technical Field
The invention belongs to the field of computer vision and aims to locate and segment the most attention-attracting objects by exploiting the spatial and temporal cues hidden in a video sequence. The task stems from human visual attention behavior studied in cognitive research, i.e., the rapid shift of attention to the most informative regions of a visual scene.
Background
Existing methods can partially solve this problem and can be roughly divided into four categories: video salient object detection (VSOD) methods based on feature extraction, on long short-term memory, on attention mechanisms, and on parallel networks.
Feature-extraction-based VSOD methods attempt to combine spatial information with motion cues on the basis of prior knowledge, such as spatio-temporal background priors and low-rank consistency; the performance of such methods is limited by the quality of the extracted features. VSOD methods based on long short-term memory networks extract spatial information from the individual images of a video sequence and model temporal information with a convolutional memory unit such as ConvLSTM. Attention-based VSOD methods use a non-local mechanism to capture temporal information across several consecutive frames. Parallel-network VSOD methods typically adopt a dual-stream framework in which one tributary extracts the spatial features of the image and the other extracts the temporal features of an optical flow image produced by an optical flow algorithm. These methods are limited by the quality of the optical flow images and by whether the output features fuse spatial and temporal information well.
The main problems and challenges of current VSOD methods are the following. First, the spatial cues hidden in each frame are often difficult to exploit when the foreground and background share similar features; RGB images with low contrast between salient objects and the background introduce misleading information that interferes with the prediction target. Second, the temporal cues hidden between different frames may be disturbed by fast motion, large displacements, and illumination variations; noise in the optical flow images leads to erroneous predictions, and even the temporal information from accurate optical flow images can confuse the spatial information of several moving objects in a scene. Third, the predicted edges are coarse: spatio-temporal information can usually determine the location of salient objects, but insufficient emphasis on shallow features blurs the edge information.
VSOD has a wide range of application scenarios; as an effective preprocessing technique, video saliency target detection has been widely applied to many computer vision tasks such as retrieval, recognition, segmentation, retargeting, enhancement, pedestrian detection, evaluation, and compression.
Disclosure of Invention
VSOD methods based on a dual-stream framework are limited by the quality of the optical flow images, the quality of the extracted RGB image features, and whether the output features can properly fuse spatial and temporal information. The present invention therefore provides a video saliency target detection method based on quality evaluation. A new framework is proposed that contains a module for quality evaluation of the optical flow features (temporal information) and the RGB features (spatial information), so that the framework can adaptively capture accurate spatial and temporal information to predict saliency maps. Specifically, an adaptive gate module for quality evaluation (the quality evaluation module) is introduced into both the encoding and decoding parts of the framework. This module estimates the quality of the input features through the MAE value: features of high quality receive a larger weight and are retained, while features of low quality receive a smaller weight and are suppressed, which acts as a screening step and passes on effective information. Secondly, considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced. In addition, to fuse spatial and temporal information better, an attention-based module (the spatio-temporal information interaction module) is introduced so that spatial and temporal information guide and promote each other, and better spatio-temporal features are learned. Finally, a double difference enhancement module is proposed that focuses on capturing the difference information between spatial and temporal cues and generating fused features.
A video saliency target detection method based on quality assessment comprises the following steps:
step (1): constructing a video saliency target detection network framework;
The video saliency target detection network framework is based on a dual-stream encoding-decoding structure with a ResNet101 backbone network. The input of one tributary is an RGB image, from which the spatial features of the image are extracted; this is called the RGB tributary. The input of the other tributary is an optical flow image, from which the temporal information between images is extracted; this is called the motion tributary. Two consecutive frames of the video are processed by the RAFT algorithm to compute an optical flow map that reflects the movement of objects in the video.
Step (2): in the encoding part, the two tributaries extract features separately; the quality evaluation module evaluates the quality of the output features of each layer and screens the features with guiding significance, and the spatio-temporal information interaction module enhances the features so that the spatio-temporal features of the two tributaries guide each other. In the decoding part, the quality evaluation module again evaluates the quality of the output features of each encoding layer, the double difference enhancement module fuses the spatio-temporal features, and the deep-level features are propagated to the shallow-level features in a cascaded manner to finally obtain the prediction map. Considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced into the encoding part, and its output is fused with the output of the RGB tributary to obtain the multi-scale fused RGB features.
To reduce the number of model parameters, in the decoding part a convolution operation is applied to each layer of output features of ResNet to reduce the number of channels; specifically, the features of the first four layers are reduced to 48 dimensions and the features of the fifth layer are reduced to 256 dimensions. Meanwhile, the backbone networks of the two tributaries share parameters, and the quality evaluation modules of all layers share parameters.
Step (3): loss function:
For the final predicted saliency map, the loss between the prediction and the ground-truth annotation is computed according to the definition of the loss function in BASNet, as shown in Equation 9.
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)   (9)
For quality evaluation, the intermediate saliency maps are supervised by the ground-truth annotation with the same loss function as Equation 9; the mean absolute error (MAE) between each intermediate saliency map and the ground truth is computed, and the MAE supervises the quality score through Equation 10 (reproduced as an image in the original publication).
Step (4): the video saliency target detection network is pre-trained with the training set of the DUTS dataset; after one round of pre-training, the whole network is trained for another round with the training set of the DAVIS dataset. Overfitting is prevented by augmenting the training data with random horizontal flipping and random rotation of the input images. The model is trained with the Adam optimizer at an initial learning rate of 1e-5 until convergence.
Further, the multi-scale perception module:
The multi-scale perception module is used to enhance the features of the final output of the RGB tributary in the encoding stage. Structurally, it consists of two ResNet101 networks whose parameters are shared with the ResNet101 backbone of the RGB tributary. The input image of the RGB tributary is first downsampled to 1/2 and 1/4 of its original size, and the downsampled images are fed into the two ResNet101 networks. After the five convolution stages, two output features of different scales are obtained; these two features are fused with the final-layer output feature of the RGB tributary by stepwise upsampling and concatenation, and the fused feature is then combined with the feature finally obtained by the RGB tributary to yield the multi-scale fused RGB feature, as shown in Equation 1 (reproduced as an image in the original publication), where Concat(·,·) is the concatenation operation along the channel dimension, Up(·) is the 2x upsampling operation, E^1/4 is the feature corresponding to the 1/4-scale input image, E^1/2 is the feature corresponding to the 1/2-scale input image, E_in is the feature of the RGB tributary, and E_out is the fused output feature.
Further, the quality evaluation module:
The quality evaluation module consists of two sub-networks, a prediction sub-network and an evaluation sub-network. The prediction sub-network consists of three convolutional layers and predicts a saliency map, which is supervised by the ground-truth annotation. The evaluation sub-network consists of three convolutional layers, a global average pooling layer, and a Sigmoid activation function, and computes a quality score, which is supervised by the mean absolute error (MAE) between the prediction map and the ground truth. The predicted saliency map is concatenated with the input feature to form the input of the evaluation sub-network, and the quality score is multiplied with the input feature to give the output feature of the quality evaluation module, as shown in Equations 2 and 3 (reproduced as images in the original publication), where Conv_e denotes the convolution operations of the evaluation sub-network, Conv_p denotes the convolution operations of the prediction sub-network, s_i is the quality score, E_i is the input feature, Ê_i is the feature after quality evaluation, ⊗ is element-wise multiplication, and σ is the Sigmoid activation function. Input features with higher quality scores are retained, while input features with lower scores, which are regarded as containing a large amount of noise, are suppressed.
Further, the spatio-temporal information interaction module:
The spatio-temporal information interaction module is used to keep semantic consistency among different features. The motion-tributary feature, taken as the input feature, is first enhanced by attention in the channel dimension, as shown in Equation 4, and then by attention in the spatial dimension, as shown in Equation 5; the enhanced feature is added to the feature of the RGB tributary to obtain spatial information under the guidance of temporal information, as shown in Equations 6 and 7 (Equations 4 to 7 are reproduced as images in the original publication). Similarly, the RGB-tributary feature is enhanced by attention in the channel and spatial dimensions and added to the motion-tributary feature to obtain temporal information under the guidance of spatial information. In these equations, Att_c(·) is the channel attention function, MaxPool_s(·) is max pooling over the spatial dimension, FC(·) is a fully connected operation, σ is the Sigmoid activation function, and ⊗_c is element-wise multiplication along the channel dimension; Att_s(·) is the spatial attention function, MaxPool_c(·) is max pooling over the channel dimension, Conv(·) is a convolution operation, and ⊗_s is element-wise multiplication along the spatial dimension; F_m is the input motion feature, F̂_m is the enhanced motion feature after the attention operations, F_r^out is the output RGB feature, and ⊕ is element-wise addition.
Further, the double difference enhancement module:
The double difference enhancement module mines the difference information between the RGB features and the optical flow features. For the RGB and optical flow features after quality evaluation, the difference information is extracted by subtraction and convolution operations, and the original information is enhanced by using the difference information as a supplement, as shown in Equation 8 (reproduced as an image in the original publication), where Conv_d(·) denotes the convolution operation applied to the difference feature and F̂ denotes the enhanced (spatial or temporal) feature.
The invention has the following beneficial effects:
The invention provides a novel video saliency target detection framework that introduces a multi-scale perception module, a quality evaluation module, a spatio-temporal information interaction module, and a double difference enhancement module. Compared with existing dual-stream video saliency target detection methods, the proposed framework can adaptively capture accurate spatial and temporal information and thus obtain accurate prediction results.
Drawings
FIG. 1 shows the encoding part of the framework structure;
FIG. 2 shows the decoding part of the framework structure;
FIG. 3 shows the structure of the quality evaluation module;
FIG. 4 shows the structure of the double difference enhancement module.
Detailed Description
The method of the invention is further described below with reference to the accompanying drawings and examples.
A video saliency target detection method based on quality assessment comprises the following steps:
step (1): constructing a video saliency target detection network framework;
The video saliency target detection network framework is based on a dual-stream encoding-decoding structure with a ResNet101 backbone network. The input of one tributary is an RGB image, from which the spatial features of the image are extracted; this is called the RGB tributary. The input of the other tributary is an optical flow image, from which the temporal information between images is extracted; this is called the motion tributary. Two consecutive frames of the video are processed by the RAFT algorithm to compute an optical flow map that reflects the movement of objects in the video.
Step (2): in the encoding part, the two tributaries extract features separately; the quality evaluation module evaluates the quality of the output features of each layer and screens the features with guiding significance, and the spatio-temporal information interaction module enhances the features so that the spatio-temporal features of the two tributaries guide each other. In the decoding part, the quality evaluation module again evaluates the quality of the output features of each encoding layer, the double difference enhancement module fuses the spatio-temporal features, and the deep-level features are propagated to the shallow-level features in a cascaded manner to finally obtain the prediction map. Considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced into the encoding part, and its output is fused with the output of the RGB tributary to obtain the multi-scale fused RGB features.
To reduce the number of model parameters, in the decoding part a convolution operation is applied to each layer of output features of ResNet to reduce the number of channels; specifically, the features of the first four layers are reduced to 48 dimensions and the features of the fifth layer are reduced to 256 dimensions. Meanwhile, the backbone networks of the two tributaries share parameters, and the quality evaluation modules of all layers share parameters.
Step (3): loss function:
For the final predicted saliency map, the loss between the prediction and the ground-truth annotation is computed according to the definition of the loss function in BASNet, as shown in Equation 9.
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)   (9)
For quality evaluation, the intermediate saliency maps are supervised by the ground-truth annotation with the same loss function as Equation 9; the mean absolute error (MAE) between each intermediate saliency map and the ground truth is computed, and the MAE supervises the quality score through Equation 10 (reproduced as an image in the original publication).
Step (4): the video saliency target detection network is pre-trained with the training set of the DUTS dataset; after one round of pre-training, the whole network is trained for another round with the training set of the DAVIS dataset. Overfitting is prevented by augmenting the training data with random horizontal flipping and random rotation of the input images. The model is trained with the Adam optimizer at an initial learning rate of 1e-5 until convergence.
FIG. 1 and FIG. 2 show the structure of the framework according to the present invention. The framework of the method is based on a dual-stream encoding-decoding structure whose main line is a ResNet101 network. The input of one tributary is an RGB image, from which the spatial features of the image are extracted; this is called the RGB tributary. The input of the other tributary is an optical flow image, from which the temporal information between images is extracted; this is called the motion tributary. Two consecutive frames of the video are processed by the RAFT algorithm to compute an optical flow map that reflects the movement of objects in the video. In the encoding part, the two tributaries extract features separately; the quality of the output features of each layer is evaluated, the features with guiding significance are screened, the features are enhanced by an attention-based module, and the spatio-temporal features of the two tributaries guide each other. In the decoding part, the output features of each encoding layer undergo quality evaluation again, the spatio-temporal features are fused by the double difference enhancement module, and the deep-level features are propagated to the shallow-level features in a cascaded manner to finally obtain the prediction map.
To reduce the number of model parameters, in the decoding part a convolution operation is applied to each layer of output features of ResNet to reduce the number of channels; specifically, the features of the first four layers are reduced to 48 dimensions and the features of the fifth layer are reduced to 256 dimensions. Meanwhile, the backbone networks of the two tributaries share parameters, and the quality evaluation modules of all layers share parameters.
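As an illustrative sketch of this dual-stream layout (not taken from the original disclosure; the class names, the placement of the reduction convolutions, and the kernel sizes are assumptions), the following PyTorch code shows an encoder in which one ResNet-101 backbone is shared by the RGB and motion tributaries and the per-stage output channels are reduced to 48 for the first four stages and 256 for the fifth.

```python
import torch
import torch.nn as nn
import torchvision


class DualStreamEncoder(nn.Module):
    """Minimal sketch: one ResNet-101 backbone shared by the RGB and motion tributaries,
    followed by channel-reduction convolutions (48-d for stages 1-4, 256-d for stage 5)."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)  # ImageNet weights would be loaded in practice
        # Five feature stages of ResNet-101 (stem, layer1 .. layer4), shared by both tributaries.
        self.stages = nn.ModuleList([
            nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool),
            backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4,
        ])
        stage_channels = [64, 256, 512, 1024, 2048]
        reduced_channels = [48, 48, 48, 48, 256]
        self.reduce = nn.ModuleList([
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
            for c_in, c_out in zip(stage_channels, reduced_channels)
        ])

    def extract(self, x):
        feats = []
        for stage, red in zip(self.stages, self.reduce):
            x = stage(x)
            feats.append(red(x))
        return feats  # five channel-reduced feature maps

    def forward(self, rgb, flow):
        # The parameter-shared backbone processes the RGB frame and its optical flow image.
        return self.extract(rgb), self.extract(flow)


if __name__ == "__main__":
    encoder = DualStreamEncoder()
    rgb = torch.randn(1, 3, 256, 256)
    flow = torch.randn(1, 3, 256, 256)
    rgb_feats, flow_feats = encoder(rgb, flow)
    print([f.shape for f in rgb_feats])
```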
Introduction of a specific module:
(1) the multi-scale perception module:
So-called multi-scale in fact means sampling a signal at different granularities: we can observe different features at different scales and thus accomplish different tasks. In general, finer-grained (denser) sampling reveals more details, while coarser-grained (sparser) sampling reveals the overall trend. The multi-scale perception module is designed to enhance the features of the final output of the RGB tributary in the encoding stage. Structurally, it consists of two ResNet101 networks whose parameters are shared with the ResNet101 backbone of the RGB tributary. The input image of the RGB tributary is first downsampled to 1/2 and 1/4 of its original size, and the downsampled images are fed into the two ResNet101 networks. After the five convolution stages, two output features of different scales are obtained; these two features are fused with the final-layer output feature of the RGB tributary by stepwise upsampling and concatenation, and the fused feature is then combined with the feature finally obtained by the RGB tributary to yield the multi-scale fused RGB feature, as shown in Equation 1 (reproduced as an image in the original publication), where Concat(·,·) is the concatenation operation along the channel dimension, Up(·) is the 2x upsampling operation, E^1/4 is the feature corresponding to the 1/4-scale input image, E^1/2 is the feature corresponding to the 1/2-scale input image, E_in is the feature of the RGB tributary, and E_out is the fused output feature.
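A minimal PyTorch sketch of this multi-scale perception idea follows. It is not the original implementation: the fusion convolutions, the channel count, and the stand-in backbone used in the demo are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScalePerception(nn.Module):
    """Sketch: downsample the RGB input to 1/2 and 1/4, pass both through a shared
    backbone, then fuse the resulting features with the full-scale RGB feature
    by stepwise upsampling and concatenation."""

    def __init__(self, backbone, channels=256):
        super().__init__()
        self.backbone = backbone  # shared feature extractor returning one feature map
        self.fuse_quarter_half = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse_all = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, image, e_in):
        # e_in: final-layer RGB-tributary feature at full scale
        img_half = F.interpolate(image, scale_factor=0.5, mode="bilinear", align_corners=False)
        img_quarter = F.interpolate(image, scale_factor=0.25, mode="bilinear", align_corners=False)
        e_half = self.backbone(img_half)
        e_quarter = self.backbone(img_quarter)
        # Step-wise upsampling and concatenation: 1/4 scale -> 1/2 scale -> full scale.
        e_quarter_up = F.interpolate(e_quarter, size=e_half.shape[2:], mode="bilinear", align_corners=False)
        merged = self.fuse_quarter_half(torch.cat([e_half, e_quarter_up], dim=1))
        merged_up = F.interpolate(merged, size=e_in.shape[2:], mode="bilinear", align_corners=False)
        return self.fuse_all(torch.cat([e_in, merged_up], dim=1))  # multi-scale fused RGB feature


if __name__ == "__main__":
    dummy_backbone = nn.Conv2d(3, 256, 3, stride=32, padding=1)  # stand-in for the shared ResNet-101 trunk
    msp = MultiScalePerception(dummy_backbone)
    img = torch.randn(1, 3, 256, 256)
    e_in = torch.randn(1, 256, 8, 8)
    print(msp(img, e_in).shape)  # torch.Size([1, 256, 8, 8])
```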
(2) A quality evaluation module:
The quality evaluation module is designed to evaluate and supervise the extracted features and remove noisy information, as shown in FIG. 3. It predicts a quality score that represents the reliability of a feature and recalibrates the feature accordingly. The module consists of two sub-networks, a prediction sub-network and an evaluation sub-network. The prediction sub-network consists of three convolutional layers and predicts a saliency map, which is supervised by the ground-truth annotation. The evaluation sub-network consists of three convolutional layers, a global average pooling layer, and a Sigmoid activation function, and computes a quality score, which is supervised by the mean absolute error (MAE) between the prediction map and the ground truth. The predicted saliency map is concatenated with the input feature to form the input of the evaluation sub-network, and the quality score is multiplied with the input feature to give the output feature of the quality evaluation module, as shown in Equations 2 and 3 (reproduced as images in the original publication), where Conv_e denotes the convolution operations of the evaluation sub-network, Conv_p denotes the convolution operations of the prediction sub-network, s_i is the quality score, E_i is the input feature, Ê_i is the feature after quality evaluation, ⊗ is element-wise multiplication, and σ is the Sigmoid activation function. Input features with higher quality scores are retained, while input features with lower scores, which are regarded as containing a large amount of noise, are suppressed.
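The PyTorch sketch below illustrates such a quality gate under the structure described above (three convolutional layers per sub-network, global average pooling, Sigmoid). The intermediate channel widths, ReLU placement, and names are assumptions, and the supervision of the score by the MAE (Equation 10) is only indicated in comments.

```python
import torch
import torch.nn as nn


class QualityGate(nn.Module):
    """Sketch of the quality evaluation (adaptive gate) module: a prediction sub-network
    produces an intermediate saliency map, an evaluation sub-network scores the input
    feature, and the score re-weights (gates) the feature."""

    def __init__(self, channels=48):
        super().__init__()
        # Prediction sub-network: three convolutional layers -> 1-channel saliency map.
        self.predict = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )
        # Evaluation sub-network: three convolutional layers, global average pooling, Sigmoid.
        self.evaluate = nn.Sequential(
            nn.Conv2d(channels + 1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Sigmoid(),
        )

    def forward(self, feat):
        pred = torch.sigmoid(self.predict(feat))               # intermediate saliency map, supervised by the ground truth
        score = self.evaluate(torch.cat([feat, pred], dim=1))  # quality score, supervised via the MAE of pred (Equation 10)
        gated = score * feat                                    # high-quality features kept, noisy ones suppressed
        return gated, pred, score.flatten(1)


if __name__ == "__main__":
    gate = QualityGate(channels=48)
    f = torch.randn(2, 48, 32, 32)
    gated, pred, score = gate(f)
    print(gated.shape, pred.shape, score.shape)
```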
(3) The space-time information interaction module:
The spatio-temporal information interaction module is used to keep semantic consistency among different features; through it, spatial and temporal information guide and promote each other. The motion-tributary feature, taken as the input feature, is first enhanced by attention in the channel dimension, as shown in Equation 4, and then by attention in the spatial dimension, as shown in Equation 5; the enhanced feature is added to the feature of the RGB tributary to obtain spatial information under the guidance of temporal information, as shown in Equations 6 and 7 (Equations 4 to 7 are reproduced as images in the original publication). Similarly, the RGB-tributary feature is enhanced by attention in the channel and spatial dimensions and added to the motion-tributary feature to obtain temporal information under the guidance of spatial information. In these equations, Att_c(·) is the channel attention function, MaxPool_s(·) is max pooling over the spatial dimension, FC(·) is a fully connected operation, σ is the Sigmoid activation function, and ⊗_c is element-wise multiplication along the channel dimension; Att_s(·) is the spatial attention function, MaxPool_c(·) is max pooling over the channel dimension, Conv(·) is a convolution operation, and ⊗_s is element-wise multiplication along the spatial dimension; F_m is the input motion feature, F̂_m is the enhanced motion feature after the attention operations, F_r^out is the output RGB feature, and ⊕ is element-wise addition.
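A compact PyTorch sketch of this interaction is given below. It assumes, following the symbol definitions above, that the channel attention is built from spatial max pooling, a fully connected layer, and a Sigmoid, and the spatial attention from channel max pooling, a convolution (the kernel size of 7 is an assumption), and a Sigmoid; it is an illustrative reading, not the original code.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention: spatial max pooling, fully connected layer, Sigmoid,
    then channel-wise re-weighting of the input feature."""

    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):
        w = torch.amax(x, dim=(2, 3))            # max pooling over the spatial dimensions
        w = torch.sigmoid(self.fc(w))            # fully connected + Sigmoid
        return x * w[:, :, None, None]           # multiplication along the channel dimension


class SpatialAttention(nn.Module):
    """Spatial attention: channel max pooling, convolution, Sigmoid,
    then spatial re-weighting of the input feature."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)

    def forward(self, x):
        w = torch.amax(x, dim=1, keepdim=True)   # max pooling over the channel dimension
        w = torch.sigmoid(self.conv(w))          # convolution + Sigmoid
        return x * w                             # multiplication along the spatial dimensions


class SpatioTemporalInteraction(nn.Module):
    """Sketch: each tributary's feature is enhanced by channel and spatial attention
    and added to the other tributary's feature."""

    def __init__(self, channels):
        super().__init__()
        self.ca_rgb, self.sa_rgb = ChannelAttention(channels), SpatialAttention()
        self.ca_mot, self.sa_mot = ChannelAttention(channels), SpatialAttention()

    def forward(self, f_rgb, f_motion):
        mot_enh = self.sa_mot(self.ca_mot(f_motion))  # enhanced motion feature
        rgb_enh = self.sa_rgb(self.ca_rgb(f_rgb))     # enhanced RGB feature
        f_rgb_out = f_rgb + mot_enh                   # spatial information guided by temporal cues
        f_motion_out = f_motion + rgb_enh             # temporal information guided by spatial cues
        return f_rgb_out, f_motion_out


if __name__ == "__main__":
    sti = SpatioTemporalInteraction(channels=48)
    a, b = torch.randn(2, 48, 32, 32), torch.randn(2, 48, 32, 32)
    out_rgb, out_mot = sti(a, b)
    print(out_rgb.shape, out_mot.shape)
```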
(4) A double differential enhancement module:
The color saliency obtained from the RGB features and the motion saliency obtained from the optical flow features are complementary, and fusing the two yields an information-rich saliency map. However, most of the complementary information is hidden in the difference between the RGB and optical flow features. To fully exploit this complementarity, a double difference enhancement module is proposed to mine the difference information between the RGB and optical flow features, as shown in FIG. 4. For the RGB and optical flow features after quality evaluation, the difference information is extracted by subtraction and convolution operations, and the original information is enhanced by using the difference information as a supplement, as shown in Equation 8 (reproduced as an image in the original publication), where Conv_d(·) denotes the convolution operation applied to the difference feature and F̂ denotes the enhanced (spatial or temporal) feature.
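The following PyTorch sketch shows one way to realize this double difference enhancement (subtract, convolve, add back as a supplement) in both directions; the layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn


class DoubleDifferenceEnhancement(nn.Module):
    """Sketch: the difference between the RGB and optical flow features is extracted by
    subtraction and convolution, and added back to each original feature as a supplement."""

    def __init__(self, channels):
        super().__init__()
        self.conv_rgb_diff = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_mot_diff = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_rgb, f_motion):
        rgb_enhanced = f_rgb + self.conv_rgb_diff(f_rgb - f_motion)        # spatial feature enhanced by the difference
        motion_enhanced = f_motion + self.conv_mot_diff(f_motion - f_rgb)  # temporal feature enhanced by the difference
        return rgb_enhanced, motion_enhanced


if __name__ == "__main__":
    dde = DoubleDifferenceEnhancement(channels=48)
    r, m = torch.randn(1, 48, 32, 32), torch.randn(1, 48, 32, 32)
    er, em = dde(r, m)
    print(er.shape, em.shape)
```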
II. Experimental details:
(1) Loss function:
For the final predicted saliency map, the loss between the prediction and the ground-truth annotation is computed according to the definition of the loss function in BASNet, as shown in Equation 9.
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)   (9)
For quality evaluation, the intermediate saliency maps are supervised by the ground-truth annotation with the same loss function as Equation 9; the mean absolute error (MAE) between each intermediate saliency map and the ground truth is computed, and the MAE supervises the quality score through Equation 10 (reproduced as an image in the original publication).
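As a hedged illustration of Equation 9, the sketch below combines binary cross-entropy, a simplified uniform-window SSIM term (the original BASNet loss uses a Gaussian-window SSIM), and a soft IoU term; it is an approximation for reference, not the exact loss of the original implementation.

```python
import torch
import torch.nn.functional as F


def hybrid_saliency_loss(pred, gt, window=11, eps=1e-7):
    """BASNet-style hybrid loss L_f = L_BCE + L_SSIM + L_IoU (Equation 9).
    `pred` holds probabilities in [0, 1]; the SSIM term uses a uniform window
    rather than the Gaussian window of the original BASNet definition."""
    # Binary cross-entropy term.
    bce = F.binary_cross_entropy(pred, gt)

    # Simplified local SSIM with a uniform window.
    pad = window // 2
    mu_p = F.avg_pool2d(pred, window, stride=1, padding=pad)
    mu_g = F.avg_pool2d(gt, window, stride=1, padding=pad)
    var_p = F.avg_pool2d(pred * pred, window, stride=1, padding=pad) - mu_p ** 2
    var_g = F.avg_pool2d(gt * gt, window, stride=1, padding=pad) - mu_g ** 2
    cov = F.avg_pool2d(pred * gt, window, stride=1, padding=pad) - mu_p * mu_g
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim_map = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    ssim_loss = 1.0 - ssim_map.mean()

    # Soft IoU term.
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + gt.sum(dim=(1, 2, 3)) - inter
    iou_loss = (1.0 - (inter + eps) / (union + eps)).mean()

    return bce + ssim_loss + iou_loss


if __name__ == "__main__":
    p = torch.rand(2, 1, 64, 64)
    g = (torch.rand(2, 1, 64, 64) > 0.5).float()
    print(hybrid_saliency_loss(p, g).item())
```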
(2) Data set:
In the experiments, the image salient object detection dataset DUTS and the video salient object detection dataset DAVIS are used for model training, and the video salient object detection datasets DAVIS and DAVSOD are used to test the performance of the model.
The DUTS dataset contains 5019 test images and 10553 training images and is currently the largest image saliency detection dataset. The DAVIS dataset contains 50 high-quality video sequences with 3455 frames in total. DAVSOD is currently the largest VSOD dataset, with 226 videos and 23938 frames covering diverse realistic scenes, objects, instances, and motions.
(3) Evaluation metrics:
The evaluation of video saliency target detection mainly uses three metrics: F-measure, S-measure, and mean absolute error (MAE).
The F-measure, defined in Equation 11, is the weighted harmonic mean of precision and recall under a non-negative weight β; empirically, β² is usually set to 0.3.
F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)   (11)
The S-measure, defined in Equation 12, evaluates the structural similarity between the predicted saliency map and the corresponding ground-truth map. Here μ is usually set to 0.5, and S_o and S_r denote the object-aware and region-aware structural similarity, respectively.
S = μ × S_o + (1 − μ) × S_r   (12)
The mean absolute error (MAE), defined in Equation 13, measures the average per-pixel difference, where S(x, y) is the pixel value of the predicted saliency map, G(x, y) is the pixel value of the ground-truth map, and W and H are the width and height of the image.
MAE = (1 / (W × H)) Σ_{x=1..W} Σ_{y=1..H} |S(x, y) − G(x, y)|   (13)
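For reference, a simple NumPy sketch of the MAE (Equation 13) and a fixed-threshold variant of the F-measure (Equation 11) follows; benchmark evaluations usually report the maximum or adaptive-threshold F-measure, so the threshold of 0.5 here is an illustrative assumption, and the S-measure is omitted.

```python
import numpy as np


def mae(pred, gt):
    """Mean absolute error between a predicted saliency map and the ground truth
    (Equation 13); both maps are expected in [0, 1] with the same shape."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()


def f_measure(pred, gt, beta2=0.3, threshold=0.5, eps=1e-8):
    """F-measure (Equation 11) at a fixed binarization threshold, with beta^2 = 0.3."""
    binary = (pred >= threshold).astype(np.float64)
    gt_bin = (gt >= 0.5).astype(np.float64)
    tp = (binary * gt_bin).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / (gt_bin.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)


if __name__ == "__main__":
    pred = np.random.rand(240, 320)
    gt = (np.random.rand(240, 320) > 0.5).astype(np.float64)
    print("MAE:", mae(pred, gt), "F-measure:", f_measure(pred, gt))
```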
(4) The experimental steps are as follows:
we have implemented our method on a pytorech. We used ResNet-101, previously trained on ImageNet, as our initial backbone. We use RAFT to generate optical flow images. First, we pre-train our model with the training set of DUTS dataset, and after one round of pre-training is over, we train the whole network with the training set of DAVIS dataset for another round. We prevent overfitting by augmenting the amount of training data by applying random horizontal flipping and random rotation to the input image. We trained the model using the Adam optimizer with an initial learning rate of 1e-5 until convergence.

Claims (5)

1. A video saliency target detection method based on quality assessment is characterized by comprising the following steps:
step (1): constructing a video saliency target detection network framework;
the video saliency target detection network framework is based on a dual-stream encoding-decoding structure; a ResNet101 network is adopted as the backbone network; the input of one tributary is an RGB image, from which the spatial features of the image are extracted, called the RGB tributary; the input of the other tributary is an optical flow image, from which the temporal information between images is extracted, called the motion tributary; two consecutive frames of the video are processed by the RAFT algorithm to compute an optical flow map that reflects the movement of objects in the video;
step (2): in the encoding part, the two tributaries extract features separately, the quality evaluation module evaluates the quality of the output features of each layer and screens the features with guiding significance, and the spatio-temporal information interaction module enhances the features so that the spatio-temporal features of the two tributaries guide each other; in the decoding part, the quality evaluation module again evaluates the quality of the output features of each encoding layer, the double difference enhancement module fuses the spatio-temporal features, and the deep-level features are propagated to the shallow-level features in a cascaded manner to finally obtain the prediction map; considering that multi-scale information helps to determine the overall target location and to segment target details, a multi-scale perception module is introduced into the encoding part, and its output is fused with the output of the RGB tributary to obtain the multi-scale fused RGB features;
in order to reduce the number of model parameters, in the decoding part a convolution operation is applied to each layer of output features of ResNet to reduce the number of channels; specifically, the features of the first four layers are reduced to 48 dimensions and the features of the fifth layer are reduced to 256 dimensions; meanwhile, the backbone networks of the two tributaries share parameters, and the quality evaluation modules of all layers share parameters;
step (3): loss function:
for the final predicted saliency map, the loss between the prediction and the ground-truth annotation is computed according to the definition of the loss function in BASNet, as shown in Equation 9;
L_f = L_BCE(P_f, G) + L_SSIM(P_f, G) + L_IoU(P_f, G)   (9)
for quality evaluation, the intermediate saliency maps are supervised by the ground-truth annotation with the same loss function as Equation 9; the mean absolute error MAE between each intermediate saliency map and the ground truth is computed, and the MAE supervises the quality score through Equation 10 (reproduced as an image in the original publication);
step (4): the video saliency target detection network is pre-trained with the training set of the DUTS dataset, and after one round of pre-training, the whole network is trained for another round with the training set of the DAVIS dataset; the training data are augmented by applying random horizontal flipping and random rotation to the input images to prevent overfitting; the model is trained with the Adam optimizer at an initial learning rate of 1e-5 until convergence.
2. The method according to claim 1, wherein the multi-scale perception module:
the multi-scale perception module is used to enhance the features of the final output of the RGB tributary in the encoding stage; structurally, it consists of two ResNet101 networks whose parameters are shared with the ResNet101 backbone of the RGB tributary; the input image of the RGB tributary is first downsampled to 1/2 and 1/4 of its original size, and the downsampled images are fed into the two ResNet101 networks; after the five convolution stages, two output features of different scales are obtained; these two features are fused with the final-layer output feature of the RGB tributary by stepwise upsampling and concatenation, and the fused feature is then combined with the feature finally obtained by the RGB tributary to yield the multi-scale fused RGB feature, as shown in Equation 1 (reproduced as an image in the original publication), wherein Concat(·,·) is the concatenation operation along the channel dimension, Up(·) is the 2x upsampling operation, E^1/4 is the feature corresponding to the 1/4-scale input image, E^1/2 is the feature corresponding to the 1/2-scale input image, E_in is the feature of the RGB tributary, and E_out is the fused output feature.
3. The method according to claim 1, wherein the quality evaluation module:
the quality evaluation module consists of two sub-networks, a prediction sub-network and an evaluation sub-network; the prediction sub-network consists of three convolutional layers and predicts a saliency map, which is supervised by the ground-truth annotation; the evaluation sub-network consists of three convolutional layers, a global average pooling layer, and a Sigmoid activation function, and computes a quality score, which is supervised by the mean absolute error MAE between the prediction map and the ground truth; the predicted saliency map is concatenated with the input feature to form the input of the evaluation sub-network, and the quality score is multiplied with the input feature to give the output feature of the quality evaluation module, as shown in Equations 2 and 3 (reproduced as images in the original publication), wherein Conv_e denotes the convolution operations of the evaluation sub-network, Conv_p denotes the convolution operations of the prediction sub-network, s_i is the quality score, E_i is the input feature, Ê_i is the feature after quality evaluation, ⊗ is element-wise multiplication, and σ is the Sigmoid activation function; input features with higher quality scores are retained, while input features with lower scores, which are regarded as containing a large amount of noise, are suppressed.
4. The method as claimed in claim 1, wherein the spatiotemporal information interaction module:
the spatio-temporal information interaction module is used to keep semantic consistency among different features; the motion-tributary feature, taken as the input feature, is first enhanced by attention in the channel dimension, as shown in Equation 4, and then by attention in the spatial dimension, as shown in Equation 5; the enhanced feature is added to the feature of the RGB tributary to obtain spatial information under the guidance of temporal information, as shown in Equations 6 and 7 (Equations 4 to 7 are reproduced as images in the original publication); similarly, the RGB-tributary feature is enhanced by attention in the channel and spatial dimensions and added to the motion-tributary feature to obtain temporal information under the guidance of spatial information; wherein Att_c(·) is the channel attention function, MaxPool_s(·) is max pooling over the spatial dimension, FC(·) is a fully connected operation, σ is the Sigmoid activation function, and ⊗_c is element-wise multiplication along the channel dimension; Att_s(·) is the spatial attention function, MaxPool_c(·) is max pooling over the channel dimension, Conv(·) is a convolution operation, and ⊗_s is element-wise multiplication along the spatial dimension; F_m is the input motion feature, F̂_m is the enhanced motion feature after the attention operations, F_r^out is the output RGB feature, and ⊕ is element-wise addition.
5. The method according to claim 1, wherein the double difference enhancement module:
the double difference enhancement module is used to mine the difference information between the RGB and optical flow features; for the RGB and optical flow features after quality evaluation, the difference information is extracted by subtraction and convolution operations, and the original information is enhanced by using the difference information as a supplement, as shown in Equation 8 (reproduced as an image in the original publication), wherein Conv_d(·) denotes the convolution operation applied to the difference feature and F̂ denotes the enhanced (spatial or temporal) feature.
CN202111075792.5A 2021-09-14 2021-09-14 Video saliency target detection method based on quality evaluation Withdrawn CN113936235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111075792.5A CN113936235A (en) 2021-09-14 2021-09-14 Video saliency target detection method based on quality evaluation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111075792.5A CN113936235A (en) 2021-09-14 2021-09-14 Video saliency target detection method based on quality evaluation

Publications (1)

Publication Number Publication Date
CN113936235A true CN113936235A (en) 2022-01-14

Family

ID=79275690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111075792.5A Withdrawn CN113936235A (en) 2021-09-14 2021-09-14 Video saliency target detection method based on quality evaluation

Country Status (1)

Country Link
CN (1) CN113936235A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612979A (en) * 2022-03-09 2022-06-10 平安科技(深圳)有限公司 Living body detection method and device, electronic equipment and storage medium
CN114612979B (en) * 2022-03-09 2024-05-31 平安科技(深圳)有限公司 Living body detection method and device, electronic equipment and storage medium
CN117173394A (en) * 2023-08-07 2023-12-05 山东大学 Weak supervision salient object detection method and system for unmanned aerial vehicle video data
CN117173394B (en) * 2023-08-07 2024-04-02 山东大学 Weak supervision salient object detection method and system for unmanned aerial vehicle video data
CN116994006A (en) * 2023-09-27 2023-11-03 江苏源驶科技有限公司 Collaborative saliency detection method and system for fusing image saliency information
CN116994006B (en) * 2023-09-27 2023-12-08 江苏源驶科技有限公司 Collaborative saliency detection method and system for fusing image saliency information

Similar Documents

Publication Publication Date Title
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
Zhou et al. HFNet: Hierarchical feedback network with multilevel atrous spatial pyramid pooling for RGB-D saliency detection
CN113936235A (en) Video saliency target detection method based on quality evaluation
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
Lin et al. Image manipulation detection by multiple tampering traces and edge artifact enhancement
CN110020658B (en) Salient object detection method based on multitask deep learning
CN113313810A (en) 6D attitude parameter calculation method for transparent object
CN114339362B (en) Video bullet screen matching method, device, computer equipment and storage medium
Kang et al. SdBAN: Salient object detection using bilateral attention network with dice coefficient loss
CN111652181B (en) Target tracking method and device and electronic equipment
CN113065551A (en) Method for performing image segmentation using a deep neural network model
Xia et al. Pedestrian detection algorithm based on multi-scale feature extraction and attention feature fusion
Kompella et al. A semi-supervised recurrent neural network for video salient object detection
CN114693952A (en) RGB-D significance target detection method based on multi-modal difference fusion network
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
CN111242068A (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN112784745B (en) Confidence self-adaption and difference enhancement based video salient object detection method
CN117351487A (en) Medical image segmentation method and system for fusing adjacent area and edge information
CN110942463B (en) Video target segmentation method based on generation countermeasure network
CN116758449A (en) Video salient target detection method and system based on deep learning
WO2023185074A1 (en) Group behavior recognition method based on complementary spatio-temporal information modeling
Gowda et al. Foreground segmentation network using transposed convolutional neural networks and up sampling for multiscale feature encoding
Huang et al. Deep Multimodal Fusion Autoencoder for Saliency Prediction of RGB‐D Images
CN110211146B (en) Video foreground segmentation method and device for cross-view simulation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220114

WW01 Invention patent application withdrawn after publication