CN113313077A - Salient object detection method based on multi-strategy and cross feature fusion - Google Patents

Salient object detection method based on multi-strategy and cross feature fusion

Info

Publication number
CN113313077A
CN113313077A CN202110743443.XA CN202110743443A
Authority
CN
China
Prior art keywords
neural network
convolutional neural
strategy
fusion
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110743443.XA
Other languages
Chinese (zh)
Inventor
周武杰
孙帆
强芳芳
许彩娥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202110743443.XA priority Critical patent/CN113313077A/en
Publication of CN113313077A publication Critical patent/CN113313077A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a salient object detection method based on multi-strategy and cross feature fusion, and relates to the field of deep learning. In the training stage, a convolutional neural network is constructed whose hidden layers comprise 10 neural network convolution blocks, 5 multi-strategy fusion blocks and 4 cross feature fusion blocks. The original RGB color image and Depth image are input into the convolutional neural network for training to obtain the corresponding salient object detection image; loss function values between the original prediction maps and the corresponding real saliency label maps (Ground Truth) are then computed to obtain the optimal weight vectors and bias terms of the convolutional neural network classification training model. In the testing stage, the RGB color image of the salient object to be detected and the corresponding Depth image are input together into the convolutional neural network classification training model to obtain the predicted salient object detection image. The method has the advantage of improving the efficiency and accuracy of RGB-D salient object detection.

Description

Salient object detection method based on multi-strategy and cross feature fusion
Technical Field
The invention relates to the field of deep learning, in particular to a salient object detection method based on multi-strategy and cross feature fusion.
Background
Salient object detection (SOD) plays an important role in many computer vision tasks as a powerful preprocessing tool that mimics the human visual attention mechanism to identify the objects in natural images that attract attention. It has many applications, such as autonomous driving, robot navigation, visual tracking, image retrieval, aesthetic assessment and content-aware image editing. Inspired by progress in perceptual psychology, early models used heuristic priors and hand-crafted features such as contrast and distance transforms. However, in complex scenes their detection performance is severely limited. Recent studies have demonstrated that deep learning techniques, particularly convolutional neural networks (CNNs), are especially good at extracting semantic features from image regions to understand visual concepts and achieve remarkable results.
The method adopts deep learning semantic segmentation to perform end-to-end, pixel-level prediction directly: the images in the training set are input into the model framework for training to obtain the weights and the model, which can then make predictions on the test set. The power of the convolutional neural network lies in its multi-layer structure, which can automatically learn features at multiple levels. Currently, methods based on deep learning semantic segmentation fall into two types. The first is the encoding-decoding architecture: in the encoding process, position information is gradually reduced and abstract features are extracted through pooling layers, while the decoding process gradually recovers the position information; there are usually direct connections between decoding and encoding. The second framework is dilated (atrous) convolution, which enlarges the receptive field by inserting holes into the convolution kernels: a smaller dilation rate gives a smaller receptive field and learns specific local features, while a larger dilation rate gives a larger receptive field and learns more abstract features, which are more robust to the size, position and orientation of objects.
Most existing salient object detection methods adopt deep learning, combining a large number of models built from convolution layers and pooling layers. Depth information can provide important supplementary cues for identifying salient objects in complex scenes. With the rapid development of imaging technology, the acquisition of depth maps has become more convenient, which has promoted research on RGB-D saliency detection. Furthermore, depth maps contain many useful attributes, such as object shape, contours and geometric spatial information, which can be regarded as relevant cues for RGB-D saliency.
Disclosure of Invention
In view of the above, the present invention provides a method for detecting a salient object based on multi-strategy and cross feature fusion.
In order to achieve the purpose, the invention adopts the following technical scheme:
a salient object detection method based on multi-strategy and cross feature fusion comprises the following steps:
selecting RGB color images, Depth images and Ground Truth label images of a plurality of data sets to form a training set;
constructing a convolutional neural network, wherein the convolutional neural network adopts a top-down high-level feature supervision low-level feature fusion mode;
inputting the training set into the convolutional neural network, and training the convolutional neural network;
and training for multiple times to obtain a convolutional neural network model.
Preferably, the convolutional neural network introduces a depth optimization module to improve the quality of the depth features, and the feature maps obtained by the multi-strategy fusion modules are cross-fused by the cross fusion modules to capture joint features.
Preferably, the depth optimization module has the following structure:
the first maximum pooling layer, the first rolling block, the first activation layer, the second rolling block and the second activation layer are sequentially connected and then are subjected to pixel multiplication with the first maximum pooling layer and then are input into the second maximum pooling layer, the second maximum pooling layer is sequentially connected with the third rolling block and the third activation layer, the output of the third activation layer is subjected to pixel multiplication with the second maximum pooling layer and then is input into the third maximum pooling layer, and the output of the third maximum pooling layer and the output of the first maximum pooling layer are subjected to pixel addition to form final output.
Preferably, the multi-strategy fusion module performs pixel subtraction, pixel addition and pixel multiplication on the depth feature and the RGB feature, and also takes the average value and the maximum value over the channel dimension; the pixel subtraction result, the pixel addition result, the pixel multiplication result, the channel-wise average and the channel-wise maximum are added pixel-wise to obtain a first output; and the fusion feature of the upper layer is upsampled and then added pixel-wise with the first output to give the final output.
Preferably, the structure of the cross-fusion module is as follows:
second input
Figure BDA0003142110230000031
By feature extraction and first input
Figure BDA0003142110230000032
The result of the pixel addition is recorded as
Figure BDA0003142110230000033
Figure BDA0003142110230000034
Output via the first convolution block and
Figure BDA0003142110230000035
performing pixel addition to obtain M, performing pixel addition on M and M, using the result of pixel addition as the input of pixel multiplication with M, using the result of pixel multiplication as the input of pixel subtraction with M, using the result of pixel subtraction as the input of channel superposition with M, and performing second convolution on the output of channel superpositionAnd finally outputting the block.
Compared with the prior art, the salient object detection method based on multi-strategy and cross feature fusion has the following beneficial effects:
1) the method comprises the steps of constructing a convolutional neural network, inputting RGB-D images in a training set into the convolutional neural network for training, and obtaining a convolutional neural network classification training model; and inputting the image to be subjected to significance detection into a convolutional neural network classification training model, and predicting to obtain a predicted significance image corresponding to the RGB image.
2) The method adopts a cross feature fusion module to cross-fuse the feature maps of the multi-strategy fusion modules, capturing joint features and providing supplementary information for the single-modality features.
3) The method adopts the depth optimization module to eliminate the influence of noise in the depth information on the network, so that the obtained depth information better expresses the position information of the salient object.
4) The method adopts a bidirectional cooperation structure, adopts top-down supervision and bottom-up decoding, and refines global features to regional features for final prediction.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a cross-fusion module architecture of the present invention;
FIG. 3 is a block diagram of a depth optimization module according to the present invention;
FIG. 4 is a block diagram of a multi-policy fusion module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a salient object detection method based on multi-strategy and cross feature fusion, whose overall implementation block diagram is shown in FIG. 1; the method comprises a training stage and a testing stage.
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q NJU2K and RGB color images, Depth images and Ground Truth label images of an NLPR data set, forming a training set, and recording the Q-th original obvious detection image in the training set as { I }q(I, j) }, the training set is summed with { I }q(i, j) } the corresponding real label image is recorded as
Figure BDA0003142110230000051
Then, the real significance detection image corresponding to each original significance image in the training set is processed into 1 single-hot coding image by adopting the existing single-hot coding technology (one-hot), and the 1 single-hot coding image is obtained
Figure BDA0003142110230000052
The processed set of 1 one-hot coded image is denoted as
Figure BDA0003142110230000053
Wherein, the road scene image is an RGB color image, Q is a positive integer, Q is more than or equal to 200, if Q is 2185, Q is a positive integer, Q is more than or equal to 1 and less than or equal to Q, i is more than or equal to 1 and less than or equal to W, j is more than or equal to 1 and less than or equal to H, W represents a hard faceIq(I, j) }, H denotes { I }q(I, j) } e.g. take W224, H224, Iq(I, j) represents { IqThe pixel value of the pixel point with the coordinate position (i, j) in (i, j),
Figure BDA0003142110230000054
to represent
Figure BDA0003142110230000055
The middle coordinate position is the pixel value of the pixel point of (i, j); here, 2185 images in the saliency detection image database NJU2K and the NLPR training set were selected directly.
Step 1_ 2: constructing a convolutional neural network: the convolutional neural network is divided into an encoding (Encode) part and a decoding (Decode) part, and respectively corresponds to Feature extraction (Feature Extract) and Feature Fusion (Feature Fusion) of an image. Fig. 2 is a cross fusion module structure diagram, fig. 3 is a depth optimization module structure diagram, and fig. 4 is a multi-strategy fusion module structure diagram.
The input combines two modalities, RGB (three channels) and Depth (single channel), so the network input is split into two streams that encode RGB and Depth separately. Since depth information contains spatial information between image regions, it plays an important role in salient object detection; however, depth maps are usually of low quality and may introduce feature noise and redundancy into the network, so a depth optimization module (Depth Optimization Module, DOM) is introduced. The backbone network is ResNet-50, and the RGB and Depth encoders each consist of 5 convolution blocks. In the RGB stream, the 1st, 2nd and 3rd convolution blocks are defined as low-level features and the 4th and 5th convolution blocks as high-level features; likewise, in the Depth stream, the 6th, 7th and 8th convolution blocks are defined as low-level features and the 9th and 10th convolution blocks as high-level features. Between the two encoding streams there are 5 multi-strategy fusion modules (Multi-Strategy Fusion, MSF), which use high-level features to supervise low-level feature fusion in a top-down manner. Each MSF has a supervision output obtained by upsampling (Upsample), used as a supervision loss during training. The output of the 1st MSF module is cross-fused (Cross Feature Fusion, CFF) with the outputs of the 2nd, 3rd, 4th and 5th multi-strategy fusion modules. The input pictures of both encoding streams have width W and height H.
For the RGB color image pre-training layers and the Depth single-channel image pre-training layers, ResNet-50 pre-trained on ImageNet is adopted, with five outputs each. The first output layer of the RGB color image pre-training layers has width W/2 and height H/2 with 64 feature maps, denoted R1; the second output layer has width W/4 and height H/4 with 256 feature maps, denoted R2; the third output layer has width W/8 and height H/8 with 512 feature maps, denoted R3; the fourth output layer has width W/16 and height H/16 with 1024 feature maps, denoted R4; the fifth output layer has width W/32 and height H/32 with 2048 feature maps, denoted R5. The Depth image pre-training layers likewise have five outputs, denoted D1, D2, D3, D4 and D5, whose structures are the same as R1, R2, R3, R4 and R5 respectively.
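As an illustration of the two-stream ResNet-50 encoder described above, the following PyTorch sketch extracts five feature stages from an RGB stream and a Depth stream; the handling of the single-channel depth input (replicating it to three channels) and all names are assumptions, and the stage widths shown in the comments are those of the standard torchvision ResNet-50.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TwoStreamEncoder(nn.Module):
    """Two ResNet-50 streams, one for RGB and one for Depth (sketch)."""
    def __init__(self):
        super().__init__()
        # In practice both backbones would load ImageNet-pretrained weights.
        self.rgb_net = resnet50()
        self.depth_net = resnet50()

    @staticmethod
    def _stages(net, x):
        f1 = net.relu(net.bn1(net.conv1(x)))   # W/2  x H/2,  64 channels
        f2 = net.layer1(net.maxpool(f1))       # W/4  x H/4,  256 channels
        f3 = net.layer2(f2)                    # W/8  x H/8,  512 channels
        f4 = net.layer3(f3)                    # W/16 x H/16, 1024 channels
        f5 = net.layer4(f4)                    # W/32 x H/32, 2048 channels
        return f1, f2, f3, f4, f5

    def forward(self, rgb, depth):
        # The single-channel depth map is replicated to 3 channels here
        # (an assumption; the patent does not say how this is handled).
        depth3 = depth.repeat(1, 3, 1, 1)
        return self._stages(self.rgb_net, rgb), self._stages(self.depth_net, depth3)

# usage
enc = TwoStreamEncoder()
R, D = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
```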
For the 6th, 7th, 8th, 9th and 10th convolution blocks, the output of each convolution block is passed through the depth optimization modules DOM1, DOM2, DOM3, DOM4 and DOM5 before entering the next convolution block, giving the optimized depth features D1, D2, D3, D4 and D5.
The input of the depth optimization module DOM is Di (Ci × Hi × Wi) (i = 1, 2, 3, 4, 5), where Ci denotes the number of channels and Hi, Wi denote the height and width of the feature map. Channel attention (Channel Attention) is performed first: the main branch consists of a first maximum pooling layer whose output depth map has size 1 × 1, a first convolution block (kernel size 1 × 1, stride 1, Ci channels), a first activation layer (ReLU), a second convolution block (kernel size 1 × 1, stride 1, Ci channels) and a second activation layer (Sigmoid); the main branch and the shortcut branch are then multiplied pixel-wise to obtain the channel-attention feature. Spatial attention (Spatial Attention) is performed next: the main branch consists of a first maximization layer (Maximize), a third convolution block (kernel size 7 × 7, stride 1, padding 3) and a third activation layer (Sigmoid); the channel-attention feature is multiplied by the spatial-attention output to obtain the attention-refined feature. Finally, the original input Di and the attention-refined feature are added pixel-wise as the input of the next convolution block.
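A minimal PyTorch sketch of the depth optimization module as described above (channel attention from a global max pooling and two 1 × 1 convolutions, spatial attention from a channel-wise maximum and a 7 × 7 convolution, then a residual addition); module and variable names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class DepthOptimizationModule(nn.Module):
    """DOM sketch: channel attention -> spatial attention -> residual add."""
    def __init__(self, channels):
        super().__init__()
        # Channel attention branch: global max pooling + two 1x1 convolutions
        self.ca = nn.Sequential(
            nn.AdaptiveMaxPool2d(1),              # (B, C, 1, 1)
            nn.Conv2d(channels, channels, 1),     # first 1x1 convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),     # second 1x1 convolution
            nn.Sigmoid(),
        )
        # Spatial attention branch: channel-wise maximum + 7x7 convolution
        self.sa_conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, stride=1, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, d):
        # Channel attention: reweight the input depth feature
        dc = d * self.ca(d)
        # Spatial attention: maximum over the channel dimension, then 7x7 conv
        smax, _ = dc.max(dim=1, keepdim=True)     # (B, 1, H, W)
        ds = dc * self.sa_conv(smax)
        # Residual connection with the original input
        return d + ds

# usage
dom = DepthOptimizationModule(channels=256)
out = dom(torch.randn(2, 256, 56, 56))
```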
Step 1_ 3: for the fifth multi-strategy fusion module, the outputs of the fifth convolution module (RGB color feature R5) and the 5 th Depth optimization module (Depth feature D5) are used as inputs, pixel subtraction, pixel addition and pixel multiplication are respectively carried out, the maximum value of the channel and the average value of the channel are taken, and Q is obtained1,Q2,Q3,Q4,Q5Then respectively adding Qi(i ═ 1,2,3,4, 5) are added as the fusion features input by the next-layer multi-strategy fusion module, and for the 4 th multi-strategy fusion module, the 3 rd multi-strategy fusion module, the 2 nd multi-strategy fusion module, and the 1 st multi-strategy fusion module, the 4 th convolution block, the 3 rd convolution block, the 2 nd convolution block, the 1 st convolution block (R4, R3, R2, R1) and the 4 th depth optimization module, the 3 rd depth optimization module, the 2 nd depth optimization module, the 1 st depth optimization module (D4, D3, D2, D1) and the fusion features of the previous-layer multi-strategy fusion feature module are input, respectively. Will Di(i ═ 1,2,3,4) and Ri(i ═ 1,2,3,4), pixel subtraction, pixel addition, pixel multiplication, channel maximization, and channel leveling, respectivelyMean value to obtain Q1,Q2,Q3,Q4,Q5Then, the fusion characteristics of the multi-strategy fusion module of the upper layer are sampled by 2 times to obtain Fi(i ═ 1,2,3,4) and finally Q1,Q2,Q3,Q4,Q5And FiAnd adding the fusion characteristics as the input fusion characteristics of the next layer of multi-strategy fusion module.
For the 4th, 3rd, 2nd and 1st cross fusion modules, the inputs are the output of the 1st multi-strategy fusion module and the outputs of the 5th, 4th, 3rd and 2nd multi-strategy fusion modules, respectively. The i-th (i = 2, 3, 4, 5) multi-strategy fusion output is first upsampled by a factor of 2^(i-1) and passed through feature extraction: a convolution layer with a 3 × 3 kernel, stride 1, padding 1 and 64 output channels, followed by normalization (Batch Norm) and activation (Rectified Linear Unit, ReLU). The extracted feature and the output of the 1st multi-strategy fusion module are added pixel-wise; the sum is passed through a first convolution (3 × 3 kernel, stride 1, padding 1) and added pixel-wise to the extracted feature to obtain M. M is then added to itself; the addition result is multiplied pixel-wise with M; the multiplication result is subtracted pixel-wise from M; the subtraction result is concatenated (Concat) with M along the channel dimension; and the concatenation is passed through a second convolution block with a 1 × 1 kernel, stride 1 and 64 output channels to give the final output.
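A sketch of the cross feature fusion module as described above; because the exact operand of the pixel addition that produces M is not fully recoverable from the published text, the version below adds the first-convolution output to the extracted feature, which should be read as an assumption, as should the 64-channel width of the first input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFeatureFusion(nn.Module):
    """CFF sketch: add -> conv -> (M+M) -> *M -> -M -> concat(M) -> 1x1 conv."""
    def __init__(self, in_channels, up_factor):
        super().__init__()
        self.up_factor = up_factor
        # Feature extraction of the i-th MSF output: 3x3 conv -> BN -> ReLU, 64 channels
        self.extract = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.conv1 = nn.Conv2d(64, 64, 3, stride=1, padding=1)   # first convolution block
        self.conv2 = nn.Conv2d(128, 64, 1, stride=1)             # second convolution block

    def forward(self, f1, fi):
        # fi: i-th MSF output, upsampled by 2^(i-1); f1: 1st MSF output (64 channels assumed)
        fi = F.interpolate(fi, scale_factor=self.up_factor, mode='bilinear',
                           align_corners=False)
        fi = self.extract(fi)
        m = self.conv1(f1 + fi) + fi       # assumption: skip-add with the extracted feature
        x = m + m                          # pixel addition with itself
        x = x * m                          # pixel multiplication with M
        x = x - m                          # pixel subtraction with M
        x = torch.cat([x, m], dim=1)       # channel concatenation with M
        return self.conv2(x)

# usage: cross-fuse the 1st and 5th MSF outputs (channel counts are assumptions)
cff = CrossFeatureFusion(in_channels=2048, up_factor=16)
out = cff(torch.randn(1, 64, 112, 112), torch.randn(1, 2048, 7, 7))
```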
Step 1_ 4: and performing data enhancement on each original RGB color image and Depth image in the training set by means of random cutting, rotation, color enhancement, overturning and the like, and then taking the images as initial input images, wherein the batch size is 4. Inputting the prediction images into a deep convolution neural network for training to obtain a prediction image with each original saliency image in a training set equal to the original size, and in addition, in order to assist the training, outputting 5 multi-strategy fusion modules during the training
Figure BDA0003142110230000081
The sizes are W/2H/2, W/4H/4, W/8H/8, W/16H/16 and W/32H/32 in turn, and 2 is subjected to upsamplingiMultiplying to obtain the characteristics with H x W and the final output M of the modeloutSupervise training together, will
Figure BDA0003142110230000082
MoutAnd MGTThe LOSS function between (true values) is noted LOSS (M)pre,MGT) The LOSS adopts a Binary Cross Entropy LOSS function (Binary Cross Entropy LOSS) and finally sums 6 losses to obtain a final LOSS value.
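A sketch of this deep-supervision loss, assuming each of the five side outputs has already been reduced to a single-channel logit map by its prediction head (the heads themselves are not detailed in the patent) and that the binary cross-entropy is applied in its with-logits form.

```python
import torch
import torch.nn.functional as F

def total_loss(side_outputs, final_output, gt):
    """Sum of binary cross-entropy losses over 5 side outputs + final output.

    side_outputs: list of 5 single-channel logit maps at W/2 .. W/32 resolution
    final_output: single-channel logit map at full resolution (H x W)
    gt:           ground-truth saliency map in [0, 1], shape (B, 1, H, W)
    """
    loss = F.binary_cross_entropy_with_logits(final_output, gt)
    for s in side_outputs:
        # upsample each side output to the ground-truth resolution
        s = F.interpolate(s, size=gt.shape[-2:], mode='bilinear',
                          align_corners=False)
        loss = loss + F.binary_cross_entropy_with_logits(s, gt)
    return loss  # 6 BCE terms summed

# usage
gt = torch.rand(4, 1, 224, 224)
sides = [torch.randn(4, 1, 224 // s, 224 // s) for s in (2, 4, 8, 16, 32)]
final = torch.randn(4, 1, 224, 224)
print(total_loss(sides, final, gt))
```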
Step 1_ 5: repeatedly executing the step 1_4 for N times until the neural network converges on the training set, and taking 800 original RGB color images and Depth images as a verification set during the training period to obtain N loss function values in total; then finding out the loss function value with the minimum value from the N loss function values; and then, correspondingly taking the weight vector and the bias item corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias item of the convolutional neural network classification training model, and correspondingly marking as WbestAnd bbest(ii) a Where N > 1, in this example, N is 300.
The test stage process comprises the following specific steps:
step 2_ 1: the set of NJU2K data sets for 500 original RGB color images and Depth images and the set of NLPR data for 300 original RGB color images and Depth images were taken as the test set. Order to
Figure BDA0003142110230000083
Representing a saliency image to be detected; wherein, i ' is more than or equal to 1 and less than or equal to W ', j ' is more than or equal to 1 and less than or equal to H ', and W ' represents
Figure BDA0003142110230000084
Width of (A), H' represents
Figure BDA0003142110230000085
The height of (a) of (b),
Figure BDA0003142110230000086
to represent
Figure BDA0003142110230000087
And the middle coordinate position is the pixel value of the pixel point of (i, j). No data enhancement was performed at the time of testing.
Step 2_ 2: will be provided with
Figure BDA0003142110230000091
The R channel component, the G channel component and the B channel component are input into a convolutional neural network classification training model and are subjected to W-based classificationbestAnd bbestMaking a prediction to obtain
Figure BDA0003142110230000092
Corresponding predictive semantic segmentation image, denoted
Figure BDA0003142110230000093
Wherein the content of the first and second substances,
Figure BDA0003142110230000094
to represent
Figure BDA0003142110230000095
Middle coordinate positionAnd setting the pixel value of the pixel point of (i ', j').
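For illustration, step 2_2 could be carried out as in the following sketch; the model class, checkpoint file name and preprocessing here are assumptions, since the patent only specifies that the R, G, B components and the Depth image are fed to the trained model.

```python
import torch
import numpy as np
from PIL import Image

@torch.no_grad()
def predict_saliency(model, rgb_path, depth_path, size=224, device='cpu'):
    """Run the trained network on one RGB-D pair and return a saliency map in [0, 1]."""
    rgb = Image.open(rgb_path).convert('RGB').resize((size, size))
    dep = Image.open(depth_path).convert('L').resize((size, size))
    rgb_t = torch.from_numpy(np.asarray(rgb)).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    dep_t = torch.from_numpy(np.asarray(dep)).float().unsqueeze(0).unsqueeze(0) / 255.0
    model.eval()
    logits = model(rgb_t.to(device), dep_t.to(device))   # expected shape (1, 1, H, W)
    return torch.sigmoid(logits).squeeze().cpu().numpy()

# usage (model class and checkpoint name are hypothetical)
# model = MultiStrategyCrossFusionNet()
# model.load_state_dict(torch.load('best_weights.pth', map_location='cpu'))
# sal = predict_saliency(model, 'test_rgb.png', 'test_depth.png')
```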
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The convolutional neural network architecture was built with the Python-based deep learning library PyTorch. The test sets of the saliency detection databases NJU2K and NLPR (500 NJU2K images and 300 NLPR images) were used to evaluate the quality of the saliency detection images predicted by the method. The mean absolute error (MAE), F1 score (F1), structure measure (S-measure) and enhanced alignment measure (E-measure) of the detection results are used to evaluate the detection performance, as listed in Table 1. The data in Table 1 show that the salient object images obtained by the method of the invention are good, indicating that it is feasible and effective to use the method to obtain salient object images of various scenes.
TABLE 1 evaluation results on test sets using the method of the invention
ours S↑ adpE↑ adpF↑ MaxF↑ MAE↓
NJU2K 0.912 0.932 0.915 0.917 0.032
NLPR 0.920 0.958 0.904 0.912 0.022
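For reference, the MAE and an adaptive-threshold F-measure of the kind reported in Table 1 can be computed as sketched below (the S-measure and E-measure require the full structure-measure and enhanced-alignment formulations and are omitted); the adaptive threshold of twice the mean prediction and beta² = 0.3 follow common saliency-evaluation practice and are assumptions here, not values stated in the patent.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted map and the GT, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3):
    """F-measure at an adaptive threshold (twice the mean prediction value)."""
    thr = min(2.0 * pred.mean(), 1.0)
    binary = pred >= thr
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

# usage
pred = np.random.rand(224, 224)
gt = (np.random.rand(224, 224) > 0.5).astype(np.float64)
print(mae(pred, gt), f_measure(pred, gt))
```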
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A salient object detection method based on multi-strategy and cross feature fusion is characterized by comprising the following steps:
selecting RGB color images, Depth images and Ground Truth label images of a plurality of data sets to form a training set;
constructing a convolutional neural network, wherein the convolutional neural network adopts a top-down high-level feature supervision low-level feature fusion mode;
inputting the training set into the convolutional neural network, and training the convolutional neural network;
and training for multiple times to obtain a convolutional neural network model.
2. The method for detecting the salient object based on the multi-strategy and cross-feature fusion as claimed in claim 1, wherein the convolutional neural network introduces a depth optimization module to improve the quality of the depth features, and the feature maps obtained by the multi-strategy fusion modules are cross-fused by the cross fusion modules to capture joint features.
3. The method for detecting the salient object based on the multi-strategy and cross feature fusion as claimed in claim 2, wherein the depth optimization module has the following structure:
the first maximum pooling layer, the first rolling block, the first activation layer, the second rolling block and the second activation layer are sequentially connected and then are subjected to pixel multiplication with the first maximum pooling layer and then are input into the second maximum pooling layer, the second maximum pooling layer is sequentially connected with the third rolling block and the third activation layer, the output of the third activation layer is subjected to pixel multiplication with the second maximum pooling layer and then is input into the third maximum pooling layer, and the output of the third maximum pooling layer and the output of the first maximum pooling layer are subjected to pixel addition to form final output.
4. The method for detecting the salient object based on the multi-strategy and cross-feature fusion as claimed in claim 2, wherein the multi-strategy fusion module performs pixel subtraction, pixel addition and pixel multiplication on the depth feature and the RGB feature, and also takes the average value and the maximum value over the channel dimension; the pixel subtraction result, the pixel addition result, the pixel multiplication result, the channel-wise average and the channel-wise maximum are added pixel-wise to obtain a first output; and the fusion feature of the upper layer is upsampled and then added pixel-wise with the first output to give the final output.
5. The method for detecting the salient object based on the multi-strategy and cross feature fusion as claimed in claim 2, wherein the structure of the cross fusion module is as follows:
second input
Figure FDA0003142110220000021
By feature extraction and first input
Figure FDA0003142110220000022
The result of the pixel addition is recorded as
Figure FDA0003142110220000023
Figure FDA0003142110220000024
Output via the first convolution block and
Figure FDA0003142110220000025
and performing pixel addition to obtain M, performing pixel addition on the M and the M, using a pixel addition result as an input for performing pixel multiplication on the M, using a pixel multiplication result as an input for performing pixel subtraction on the M, using a pixel subtraction result as an input for performing channel superposition on the M, and using an output of the channel superposition as a final output after passing through a second convolution block.
CN202110743443.XA 2021-06-30 2021-06-30 Salient object detection method based on multi-strategy and cross feature fusion Withdrawn CN113313077A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110743443.XA CN113313077A (en) 2021-06-30 2021-06-30 Salient object detection method based on multi-strategy and cross feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110743443.XA CN113313077A (en) 2021-06-30 2021-06-30 Salient object detection method based on multi-strategy and cross feature fusion

Publications (1)

Publication Number Publication Date
CN113313077A true CN113313077A (en) 2021-08-27

Family

ID=77381578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110743443.XA Withdrawn CN113313077A (en) 2021-06-30 2021-06-30 Salient object detection method based on multi-strategy and cross feature fusion

Country Status (1)

Country Link
CN (1) CN113313077A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445442A (en) * 2022-01-28 2022-05-06 杭州电子科技大学 Multispectral image semantic segmentation method based on asymmetric cross fusion
CN115796244A (en) * 2022-12-20 2023-03-14 广东石油化工学院 CFF-based parameter identification method for super-nonlinear input/output system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619638A (en) * 2019-08-22 2019-12-27 浙江科技学院 Multi-mode fusion significance detection method based on convolution block attention module
CN111242181A (en) * 2020-01-03 2020-06-05 大连民族大学 RGB-D salient object detector based on image semantics and details
CN112149662A (en) * 2020-08-21 2020-12-29 浙江科技学院 Multi-mode fusion significance detection method based on expansion volume block
CN112529862A (en) * 2020-12-07 2021-03-19 浙江科技学院 Significance image detection method for interactive cycle characteristic remodeling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619638A (en) * 2019-08-22 2019-12-27 浙江科技学院 Multi-mode fusion significance detection method based on convolution block attention module
CN111242181A (en) * 2020-01-03 2020-06-05 大连民族大学 RGB-D salient object detector based on image semantics and details
CN112149662A (en) * 2020-08-21 2020-12-29 浙江科技学院 Multi-mode fusion significance detection method based on expansion volume block
CN112529862A (en) * 2020-12-07 2021-03-19 浙江科技学院 Significance image detection method for interactive cycle characteristic remodeling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SANGHYUN WOO et al.: "CBAM: Convolutional Block Attention Module", Computer Vision - ECCV 2018 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445442A (en) * 2022-01-28 2022-05-06 杭州电子科技大学 Multispectral image semantic segmentation method based on asymmetric cross fusion
CN115796244A (en) * 2022-12-20 2023-03-14 广东石油化工学院 CFF-based parameter identification method for super-nonlinear input/output system
CN115796244B (en) * 2022-12-20 2023-07-21 广东石油化工学院 Parameter identification method based on CFF for ultra-nonlinear input/output system

Similar Documents

Publication Publication Date Title
US20210390700A1 (en) Referring image segmentation
CN111723732B (en) Optical remote sensing image change detection method, storage medium and computing equipment
CN110889449A (en) Edge-enhanced multi-scale remote sensing image building semantic feature extraction method
CN113850825A (en) Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN107871014A (en) A kind of big data cross-module state search method and system based on depth integration Hash
CN112966684A (en) Cooperative learning character recognition method under attention mechanism
CN110246148B (en) Multi-modal significance detection method for depth information fusion and attention learning
CN112418212B (en) YOLOv3 algorithm based on EIoU improvement
CN113780149A (en) Method for efficiently extracting building target of remote sensing image based on attention mechanism
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN109461177B (en) Monocular image depth prediction method based on neural network
CN116994140A (en) Cultivated land extraction method, device, equipment and medium based on remote sensing image
CN113313077A (en) Salient object detection method based on multi-strategy and cross feature fusion
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
CN111915618B (en) Peak response enhancement-based instance segmentation algorithm and computing device
CN111523463B (en) Target tracking method and training method based on matching-regression network
CN113192073A (en) Clothing semantic segmentation method based on cross fusion network
CN113487600B (en) Feature enhancement scale self-adaptive perception ship detection method
CN113269224A (en) Scene image classification method, system and storage medium
CN112529862A (en) Significance image detection method for interactive cycle characteristic remodeling
CN114170623A (en) Human interaction detection equipment and method and device thereof, and readable storage medium
CN114926734B (en) Solid waste detection device and method based on feature aggregation and attention fusion
Chen et al. MSF-Net: A multiscale supervised fusion network for building change detection in high-resolution remote sensing images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210827

WW01 Invention patent application withdrawn after publication