CN113313077A - Salient object detection method based on multi-strategy and cross feature fusion - Google Patents

Salient object detection method based on multi-strategy and cross feature fusion

Info

Publication number
CN113313077A
CN113313077A CN202110743443.XA CN202110743443A
Authority
CN
China
Prior art keywords
neural network
convolutional neural
strategy
fusion
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110743443.XA
Other languages
Chinese (zh)
Inventor
周武杰
孙帆
强芳芳
许彩娥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202110743443.XA priority Critical patent/CN113313077A/en
Publication of CN113313077A publication Critical patent/CN113313077A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a salient object detection method based on multi-strategy and cross feature fusion, and relates to the field of deep learning. In the training stage, a convolutional neural network is constructed whose hidden layers comprise 10 neural network convolution blocks, 5 multi-strategy fusion blocks and 4 cross feature fusion blocks. The original RGB color image and Depth image are input into the convolutional neural network for training to obtain the corresponding salient object detection image; loss function values between the original prediction maps and the corresponding real saliency label maps (Ground Truth) are then computed to obtain the optimal weight vectors and bias terms of the convolutional neural network classification training model. In the testing stage, the RGB color image of the salient object to be detected and the corresponding Depth image are input together into the convolutional neural network classification training model to obtain the predicted salient object detection image. The method has the advantage of improving the efficiency and accuracy of RGB-D salient object detection.

Description

Salient object detection method based on multi-strategy and cross feature fusion
Technical Field
The invention relates to the field of deep learning, in particular to a salient object detection method based on multi-strategy and cross feature fusion.
Background
Salient object detection (SOD) plays an important role in many computer vision tasks as a powerful preprocessing tool that mimics the human visual attention mechanism to identify the objects in natural images that attract attention. It has many applications, such as autonomous driving, robot navigation, visual tracking, image retrieval, aesthetic assessment and content-aware image editing. Inspired by progress in perceptual psychology, early models used heuristic priors and hand-crafted features such as contrast and distance transforms. However, in complex scenes their detection performance is severely limited. Recent studies have demonstrated that deep learning techniques, particularly convolutional neural networks (CNNs), are especially good at extracting semantic features from image regions to understand visual concepts and achieve remarkable results.
The method adopts deep learning semantic segmentation to perform end-to-end, pixel-level prediction directly: the images in the training set are input into the model framework for training to obtain the weights and the model, which can then make predictions on the test set. The power of the convolutional neural network lies in its multi-layer structure, which can automatically learn features at multiple levels. Currently, methods based on deep learning semantic segmentation fall into two types. The first is the encoding-decoding architecture: in the encoding process, position information is gradually reduced and abstract features are extracted through pooling layers, while the decoding process gradually recovers the position information; there are usually direct connections between decoding and encoding. The second framework is dilated (atrous) convolution, which enlarges the receptive field by inserting holes into the convolution kernels: a smaller dilation rate gives a smaller receptive field and learns specific local features, while a larger dilation rate gives a larger receptive field and learns more abstract features, which are more robust to the size, position and orientation of objects.
Most existing salient object detection methods adopt deep learning, combining a large number of models built from convolution layers and pooling layers. Depth information can provide important supplementary cues for identifying salient objects in complex scenes. With the rapid development of imaging technology, the acquisition of depth maps has become more convenient, which has promoted research on RGB-D saliency detection. Furthermore, depth maps contain many useful attributes, such as object shape, contours and geometric spatial information, which can be regarded as relevant cues for RGB-D saliency.
Disclosure of Invention
In view of the above, the present invention provides a method for detecting a salient object based on multi-strategy and cross feature fusion.
In order to achieve the purpose, the invention adopts the following technical scheme:
a salient object detection method based on multi-strategy and cross feature fusion comprises the following steps:
selecting RGB color images, Depth images and Ground Truth label images of a plurality of data sets to form a training set;
constructing a convolutional neural network, wherein the convolutional neural network adopts a top-down high-level feature supervision low-level feature fusion mode;
inputting the training set into the convolutional neural network, and training the convolutional neural network;
and training for multiple times to obtain a convolutional neural network model.
Preferably, the convolutional neural network introduces a depth optimization module to improve the quality of the depth features, and the feature maps obtained by the multi-strategy fusion modules are cross-fused by the cross fusion modules to capture joint features.
Preferably, the depth optimization module has the following structure:
the first maximum pooling layer, the first rolling block, the first activation layer, the second rolling block and the second activation layer are sequentially connected and then are subjected to pixel multiplication with the first maximum pooling layer and then are input into the second maximum pooling layer, the second maximum pooling layer is sequentially connected with the third rolling block and the third activation layer, the output of the third activation layer is subjected to pixel multiplication with the second maximum pooling layer and then is input into the third maximum pooling layer, and the output of the third maximum pooling layer and the output of the first maximum pooling layer are subjected to pixel addition to form final output.
Preferably, the multi-strategy fusion module performs pixel subtraction, pixel addition and pixel multiplication on the depth feature and the RGB feature, and also takes the average value and the maximum value over the channel dimension; the pixel subtraction result, the pixel addition result, the pixel multiplication result, the channel-wise average and the channel-wise maximum are added pixel-wise to obtain a first output; and the fusion feature of the upper layer is upsampled and then added pixel-wise with the first output to give the final output.
Preferably, the structure of the cross-fusion module is as follows:
second input
Figure BDA0003142110230000031
By feature extraction and first input
Figure BDA0003142110230000032
The result of the pixel addition is recorded as
Figure BDA0003142110230000033
Figure BDA0003142110230000034
Output via the first convolution block and
Figure BDA0003142110230000035
performing pixel addition to obtain M, performing pixel addition on M and M, using the result of pixel addition as the input of pixel multiplication with M, using the result of pixel multiplication as the input of pixel subtraction with M, using the result of pixel subtraction as the input of channel superposition with M, and performing second convolution on the output of channel superpositionAnd finally outputting the block.
Compared with the prior art, the salient object detection method based on multi-strategy and cross feature fusion has the following beneficial effects:
1) the method comprises the steps of constructing a convolutional neural network, inputting RGB-D images in a training set into the convolutional neural network for training, and obtaining a convolutional neural network classification training model; and inputting the image to be subjected to significance detection into a convolutional neural network classification training model, and predicting to obtain a predicted significance image corresponding to the RGB image.
2) The method adopts a cross feature fusion module to cross-fuse the feature maps of the multi-strategy fusion modules, capturing joint features and providing supplementary information for the single-modality features.
3) The method adopts the depth optimization module to eliminate the influence of noise in the depth information on the network, so that the obtained depth information better expresses the position information of the salient object.
4) The method adopts a bidirectional cooperation structure, adopts top-down supervision and bottom-up decoding, and refines global features to regional features for final prediction.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a cross-fusion module architecture of the present invention;
FIG. 3 is a block diagram of a depth optimization module according to the present invention;
FIG. 4 is a block diagram of a multi-policy fusion module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a salient object detection method based on multi-strategy and cross feature fusion, whose overall implementation block diagram is shown in FIG. 1; the method comprises a training stage and a testing stage.
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q NJU2K and RGB color images, Depth images and Ground Truth label images of an NLPR data set, forming a training set, and recording the Q-th original obvious detection image in the training set as { I }q(I, j) }, the training set is summed with { I }q(i, j) } the corresponding real label image is recorded as
Figure BDA0003142110230000051
Then, the real significance detection image corresponding to each original significance image in the training set is processed into 1 single-hot coding image by adopting the existing single-hot coding technology (one-hot), and the 1 single-hot coding image is obtained
Figure BDA0003142110230000052
The processed set of 1 one-hot coded image is denoted as
Figure BDA0003142110230000053
Wherein, the road scene image is an RGB color image, Q is a positive integer, Q is more than or equal to 200, if Q is 2185, Q is a positive integer, Q is more than or equal to 1 and less than or equal to Q, i is more than or equal to 1 and less than or equal to W, j is more than or equal to 1 and less than or equal to H, W represents a hard faceIq(I, j) }, H denotes { I }q(I, j) } e.g. take W224, H224, Iq(I, j) represents { IqThe pixel value of the pixel point with the coordinate position (i, j) in (i, j),
Figure BDA0003142110230000054
to represent
Figure BDA0003142110230000055
The middle coordinate position is the pixel value of the pixel point of (i, j); here, 2185 images in the saliency detection image database NJU2K and the NLPR training set were selected directly.
Step 1_ 2: constructing a convolutional neural network: the convolutional neural network is divided into an encoding (Encode) part and a decoding (Decode) part, and respectively corresponds to Feature extraction (Feature Extract) and Feature Fusion (Feature Fusion) of an image. Fig. 2 is a cross fusion module structure diagram, fig. 3 is a depth optimization module structure diagram, and fig. 4 is a multi-strategy fusion module structure diagram.
The input combines two modalities, RGB (three channels) and Depth (single channel), so the network input is split into two streams that encode RGB and Depth separately. Since depth information contains spatial information between image regions, it plays an important role in salient object detection; however, depth maps are usually of low quality and may introduce feature noise and redundancy into the network, so a depth optimization module (Depth Optimization Module, DOM) is introduced. The backbone network is ResNet-50, and the RGB and Depth encoders each consist of 5 convolution blocks. In the RGB stream, the 1st, 2nd and 3rd convolution blocks are defined as low-level features and the 4th and 5th convolution blocks as high-level features; likewise, in the Depth stream, the 6th, 7th and 8th convolution blocks are defined as low-level features and the 9th and 10th convolution blocks as high-level features. Between the two encoding streams there are 5 multi-strategy fusion modules (Multi-Strategy Fusion, MSF), which use high-level features to supervise low-level feature fusion in a top-down manner. Each MSF has a supervision output obtained by upsampling (Upsample), used as a supervision loss during training. The output of the 1st MSF module is cross-fused (Cross Feature Fusion, CFF) with the outputs of the 2nd, 3rd, 4th and 5th multi-strategy fusion modules. The input pictures of both encoding streams have width W and height H.
For the RGB color image pre-training layers and the Depth single-channel image pre-training layers, ResNet-50 pre-trained on ImageNet is adopted, with five outputs each. The first output layer of the RGB color image pre-training layers has width W/2 and height H/2 with 64 feature maps, denoted R1; the second output layer has width W/4 and height H/4 with 256 feature maps, denoted R2; the third output layer has width W/8 and height H/8 with 512 feature maps, denoted R3; the fourth output layer has width W/16 and height H/16 with 1024 feature maps, denoted R4; the fifth output layer has width W/32 and height H/32 with 2048 feature maps, denoted R5. The Depth image pre-training layers likewise have five outputs, denoted D1, D2, D3, D4 and D5, whose structures are the same as R1, R2, R3, R4 and R5 respectively.
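As an illustration of the two-stream ResNet-50 encoder described above, the following PyTorch sketch extracts five feature stages from an RGB stream and a Depth stream; the handling of the single-channel depth input (replicating it to three channels) and all names are assumptions, and the stage widths shown in the comments are those of the standard torchvision ResNet-50.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TwoStreamEncoder(nn.Module):
    """Two ResNet-50 streams, one for RGB and one for Depth (sketch)."""
    def __init__(self):
        super().__init__()
        # In practice both backbones would load ImageNet-pretrained weights.
        self.rgb_net = resnet50()
        self.depth_net = resnet50()

    @staticmethod
    def _stages(net, x):
        f1 = net.relu(net.bn1(net.conv1(x)))   # W/2  x H/2,  64 channels
        f2 = net.layer1(net.maxpool(f1))       # W/4  x H/4,  256 channels
        f3 = net.layer2(f2)                    # W/8  x H/8,  512 channels
        f4 = net.layer3(f3)                    # W/16 x H/16, 1024 channels
        f5 = net.layer4(f4)                    # W/32 x H/32, 2048 channels
        return f1, f2, f3, f4, f5

    def forward(self, rgb, depth):
        # The single-channel depth map is replicated to 3 channels here
        # (an assumption; the patent does not say how this is handled).
        depth3 = depth.repeat(1, 3, 1, 1)
        return self._stages(self.rgb_net, rgb), self._stages(self.depth_net, depth3)

# usage
enc = TwoStreamEncoder()
R, D = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
```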
For the 6th, 7th, 8th, 9th and 10th convolution blocks, the output of each convolution block is passed through the depth optimization modules DOM1, DOM2, DOM3, DOM4 and DOM5 before entering the next convolution block, giving the optimized depth features D1, D2, D3, D4 and D5.
The input of the depth optimization module DOM is Di (Ci × Hi × Wi) (i = 1, 2, 3, 4, 5), where Ci denotes the number of channels and Hi, Wi denote the height and width of the feature map. Channel attention (Channel Attention) is performed first: the main branch consists of a first maximum pooling layer whose output depth map has size 1 × 1, a first convolution block (kernel size 1 × 1, stride 1, Ci channels), a first activation layer (ReLU), a second convolution block (kernel size 1 × 1, stride 1, Ci channels) and a second activation layer (Sigmoid); the main branch and the shortcut branch are then multiplied pixel-wise to obtain the channel-attention feature. Spatial attention (Spatial Attention) is performed next: the main branch consists of a first maximization layer (Maximize), a third convolution block (kernel size 7 × 7, stride 1, padding 3) and a third activation layer (Sigmoid); the channel-attention feature is multiplied by the spatial-attention output to obtain the attention-refined feature. Finally, the original input Di and the attention-refined feature are added pixel-wise as the input of the next convolution block.
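A minimal PyTorch sketch of the depth optimization module as described above (channel attention from a global max pooling and two 1 × 1 convolutions, spatial attention from a channel-wise maximum and a 7 × 7 convolution, then a residual addition); module and variable names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class DepthOptimizationModule(nn.Module):
    """DOM sketch: channel attention -> spatial attention -> residual add."""
    def __init__(self, channels):
        super().__init__()
        # Channel attention branch: global max pooling + two 1x1 convolutions
        self.ca = nn.Sequential(
            nn.AdaptiveMaxPool2d(1),              # (B, C, 1, 1)
            nn.Conv2d(channels, channels, 1),     # first 1x1 convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),     # second 1x1 convolution
            nn.Sigmoid(),
        )
        # Spatial attention branch: channel-wise maximum + 7x7 convolution
        self.sa_conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, stride=1, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, d):
        # Channel attention: reweight the input depth feature
        dc = d * self.ca(d)
        # Spatial attention: maximum over the channel dimension, then 7x7 conv
        smax, _ = dc.max(dim=1, keepdim=True)     # (B, 1, H, W)
        ds = dc * self.sa_conv(smax)
        # Residual connection with the original input
        return d + ds

# usage
dom = DepthOptimizationModule(channels=256)
out = dom(torch.randn(2, 256, 56, 56))
```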
Step 1_ 3: for the fifth multi-strategy fusion module, the outputs of the fifth convolution module (RGB color feature R5) and the 5 th Depth optimization module (Depth feature D5) are used as inputs, pixel subtraction, pixel addition and pixel multiplication are respectively carried out, the maximum value of the channel and the average value of the channel are taken, and Q is obtained1,Q2,Q3,Q4,Q5Then respectively adding Qi(i ═ 1,2,3,4, 5) are added as the fusion features input by the next-layer multi-strategy fusion module, and for the 4 th multi-strategy fusion module, the 3 rd multi-strategy fusion module, the 2 nd multi-strategy fusion module, and the 1 st multi-strategy fusion module, the 4 th convolution block, the 3 rd convolution block, the 2 nd convolution block, the 1 st convolution block (R4, R3, R2, R1) and the 4 th depth optimization module, the 3 rd depth optimization module, the 2 nd depth optimization module, the 1 st depth optimization module (D4, D3, D2, D1) and the fusion features of the previous-layer multi-strategy fusion feature module are input, respectively. Will Di(i ═ 1,2,3,4) and Ri(i ═ 1,2,3,4), pixel subtraction, pixel addition, pixel multiplication, channel maximization, and channel leveling, respectivelyMean value to obtain Q1,Q2,Q3,Q4,Q5Then, the fusion characteristics of the multi-strategy fusion module of the upper layer are sampled by 2 times to obtain Fi(i ═ 1,2,3,4) and finally Q1,Q2,Q3,Q4,Q5And FiAnd adding the fusion characteristics as the input fusion characteristics of the next layer of multi-strategy fusion module.
For the 4th, 3rd, 2nd and 1st cross fusion modules, the inputs are the output of the 1st multi-strategy fusion module and the outputs of the 5th, 4th, 3rd and 2nd multi-strategy fusion modules, respectively. The i-th (i = 2, 3, 4, 5) multi-strategy fusion output is first upsampled by a factor of 2^(i-1) and passed through feature extraction: a convolution layer with a 3 × 3 kernel, stride 1, padding 1 and 64 output channels, followed by normalization (Batch Norm) and activation (Rectified Linear Unit, ReLU). The extracted feature and the output of the 1st multi-strategy fusion module are added pixel-wise; the sum is passed through a first convolution (3 × 3 kernel, stride 1, padding 1) and added pixel-wise to the extracted feature to obtain M. M is then added to itself; the addition result is multiplied pixel-wise with M; the multiplication result is subtracted pixel-wise from M; the subtraction result is concatenated (Concat) with M along the channel dimension; and the concatenation is passed through a second convolution block with a 1 × 1 kernel, stride 1 and 64 output channels to give the final output.
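A sketch of the cross feature fusion module as described above; because the exact operand of the pixel addition that produces M is not fully recoverable from the published text, the version below adds the first-convolution output to the extracted feature, which should be read as an assumption, as should the 64-channel width of the first input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFeatureFusion(nn.Module):
    """CFF sketch: add -> conv -> (M+M) -> *M -> -M -> concat(M) -> 1x1 conv."""
    def __init__(self, in_channels, up_factor):
        super().__init__()
        self.up_factor = up_factor
        # Feature extraction of the i-th MSF output: 3x3 conv -> BN -> ReLU, 64 channels
        self.extract = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.conv1 = nn.Conv2d(64, 64, 3, stride=1, padding=1)   # first convolution block
        self.conv2 = nn.Conv2d(128, 64, 1, stride=1)             # second convolution block

    def forward(self, f1, fi):
        # fi: i-th MSF output, upsampled by 2^(i-1); f1: 1st MSF output (64 channels assumed)
        fi = F.interpolate(fi, scale_factor=self.up_factor, mode='bilinear',
                           align_corners=False)
        fi = self.extract(fi)
        m = self.conv1(f1 + fi) + fi       # assumption: skip-add with the extracted feature
        x = m + m                          # pixel addition with itself
        x = x * m                          # pixel multiplication with M
        x = x - m                          # pixel subtraction with M
        x = torch.cat([x, m], dim=1)       # channel concatenation with M
        return self.conv2(x)

# usage: cross-fuse the 1st and 5th MSF outputs (channel counts are assumptions)
cff = CrossFeatureFusion(in_channels=2048, up_factor=16)
out = cff(torch.randn(1, 64, 112, 112), torch.randn(1, 2048, 7, 7))
```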
Step 1_ 4: and performing data enhancement on each original RGB color image and Depth image in the training set by means of random cutting, rotation, color enhancement, overturning and the like, and then taking the images as initial input images, wherein the batch size is 4. Inputting the prediction images into a deep convolution neural network for training to obtain a prediction image with each original saliency image in a training set equal to the original size, and in addition, in order to assist the training, outputting 5 multi-strategy fusion modules during the training
Figure BDA0003142110230000081
The sizes are W/2H/2, W/4H/4, W/8H/8, W/16H/16 and W/32H/32 in turn, and 2 is subjected to upsamplingiMultiplying to obtain the characteristics with H x W and the final output M of the modeloutSupervise training together, will
Figure BDA0003142110230000082
MoutAnd MGTThe LOSS function between (true values) is noted LOSS (M)pre,MGT) The LOSS adopts a Binary Cross Entropy LOSS function (Binary Cross Entropy LOSS) and finally sums 6 losses to obtain a final LOSS value.
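A sketch of this deep-supervision loss, assuming each of the five side outputs has already been reduced to a single-channel logit map by its prediction head (the heads themselves are not detailed in the patent) and that the binary cross-entropy is applied in its with-logits form.

```python
import torch
import torch.nn.functional as F

def total_loss(side_outputs, final_output, gt):
    """Sum of binary cross-entropy losses over 5 side outputs + final output.

    side_outputs: list of 5 single-channel logit maps at W/2 .. W/32 resolution
    final_output: single-channel logit map at full resolution (H x W)
    gt:           ground-truth saliency map in [0, 1], shape (B, 1, H, W)
    """
    loss = F.binary_cross_entropy_with_logits(final_output, gt)
    for s in side_outputs:
        # upsample each side output to the ground-truth resolution
        s = F.interpolate(s, size=gt.shape[-2:], mode='bilinear',
                          align_corners=False)
        loss = loss + F.binary_cross_entropy_with_logits(s, gt)
    return loss  # 6 BCE terms summed

# usage
gt = torch.rand(4, 1, 224, 224)
sides = [torch.randn(4, 1, 224 // s, 224 // s) for s in (2, 4, 8, 16, 32)]
final = torch.randn(4, 1, 224, 224)
print(total_loss(sides, final, gt))
```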
Step 1_ 5: repeatedly executing the step 1_4 for N times until the neural network converges on the training set, and taking 800 original RGB color images and Depth images as a verification set during the training period to obtain N loss function values in total; then finding out the loss function value with the minimum value from the N loss function values; and then, correspondingly taking the weight vector and the bias item corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias item of the convolutional neural network classification training model, and correspondingly marking as WbestAnd bbest(ii) a Where N > 1, in this example, N is 300.
The test stage process comprises the following specific steps:
step 2_ 1: the set of NJU2K data sets for 500 original RGB color images and Depth images and the set of NLPR data for 300 original RGB color images and Depth images were taken as the test set. Order to
Figure BDA0003142110230000083
Representing a saliency image to be detected; wherein, i ' is more than or equal to 1 and less than or equal to W ', j ' is more than or equal to 1 and less than or equal to H ', and W ' represents
Figure BDA0003142110230000084
Width of (A), H' represents
Figure BDA0003142110230000085
The height of (a) of (b),
Figure BDA0003142110230000086
to represent
Figure BDA0003142110230000087
And the middle coordinate position is the pixel value of the pixel point of (i, j). No data enhancement was performed at the time of testing.
Step 2_ 2: will be provided with
Figure BDA0003142110230000091
The R channel component, the G channel component and the B channel component are input into a convolutional neural network classification training model and are subjected to W-based classificationbestAnd bbestMaking a prediction to obtain
Figure BDA0003142110230000092
Corresponding predictive semantic segmentation image, denoted
Figure BDA0003142110230000093
Wherein the content of the first and second substances,
Figure BDA0003142110230000094
to represent
Figure BDA0003142110230000095
Middle coordinate positionAnd setting the pixel value of the pixel point of (i ', j').
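For illustration, step 2_2 could be carried out as in the following sketch; the model class, checkpoint file name and preprocessing here are assumptions, since the patent only specifies that the R, G, B components and the Depth image are fed to the trained model.

```python
import torch
import numpy as np
from PIL import Image

@torch.no_grad()
def predict_saliency(model, rgb_path, depth_path, size=224, device='cpu'):
    """Run the trained network on one RGB-D pair and return a saliency map in [0, 1]."""
    rgb = Image.open(rgb_path).convert('RGB').resize((size, size))
    dep = Image.open(depth_path).convert('L').resize((size, size))
    rgb_t = torch.from_numpy(np.asarray(rgb)).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    dep_t = torch.from_numpy(np.asarray(dep)).float().unsqueeze(0).unsqueeze(0) / 255.0
    model.eval()
    logits = model(rgb_t.to(device), dep_t.to(device))   # expected shape (1, 1, H, W)
    return torch.sigmoid(logits).squeeze().cpu().numpy()

# usage (model class and checkpoint name are hypothetical)
# model = MultiStrategyCrossFusionNet()
# model.load_state_dict(torch.load('best_weights.pth', map_location='cpu'))
# sal = predict_saliency(model, 'test_rgb.png', 'test_depth.png')
```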
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The convolutional neural network architecture was built with the Python-based deep learning library PyTorch. The test sets of the saliency detection databases NJU2K and NLPR (500 NJU2K images and 300 NLPR images) were used to evaluate the quality of the saliency detection images predicted by the method. The mean absolute error (MAE), F1 score (F1), structure measure (S-measure) and enhanced alignment measure (E-measure) of the detection results are used to evaluate the detection performance, as listed in Table 1. The data in Table 1 show that the salient object images obtained by the method of the invention are good, indicating that it is feasible and effective to use the method to obtain salient object images of various scenes.
TABLE 1 evaluation results on test sets using the method of the invention
ours S↑ adpE↑ adpF↑ MaxF↑ MAE↓
NJU2K 0.912 0.932 0.915 0.917 0.032
NLPR 0.920 0.958 0.904 0.912 0.022
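For reference, the MAE and an adaptive-threshold F-measure of the kind reported in Table 1 can be computed as sketched below (the S-measure and E-measure require the full structure-measure and enhanced-alignment formulations and are omitted); the adaptive threshold of twice the mean prediction and beta² = 0.3 follow common saliency-evaluation practice and are assumptions here, not values stated in the patent.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted map and the GT, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3):
    """F-measure at an adaptive threshold (twice the mean prediction value)."""
    thr = min(2.0 * pred.mean(), 1.0)
    binary = pred >= thr
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

# usage
pred = np.random.rand(224, 224)
gt = (np.random.rand(224, 224) > 0.5).astype(np.float64)
print(mae(pred, gt), f_measure(pred, gt))
```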
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A salient object detection method based on multi-strategy and cross feature fusion is characterized by comprising the following steps:
selecting RGB color images, Depth images and Ground Truth label images of a plurality of data sets to form a training set;
constructing a convolutional neural network, wherein the convolutional neural network adopts a top-down high-level feature supervision low-level feature fusion mode;
inputting the training set into the convolutional neural network, and training the convolutional neural network;
and training for multiple times to obtain a convolutional neural network model.
2. The method for detecting the salient object based on the multi-strategy and cross-feature fusion as claimed in claim 1, wherein the convolutional neural network introduces a depth optimization module to improve the quality of the depth features, and the feature maps obtained by the multi-strategy fusion modules are cross-fused by the cross fusion modules to capture joint features.
3. The method for detecting the salient object based on the multi-strategy and cross feature fusion as claimed in claim 2, wherein the depth optimization module has the following structure:
the first maximum pooling layer, the first rolling block, the first activation layer, the second rolling block and the second activation layer are sequentially connected and then are subjected to pixel multiplication with the first maximum pooling layer and then are input into the second maximum pooling layer, the second maximum pooling layer is sequentially connected with the third rolling block and the third activation layer, the output of the third activation layer is subjected to pixel multiplication with the second maximum pooling layer and then is input into the third maximum pooling layer, and the output of the third maximum pooling layer and the output of the first maximum pooling layer are subjected to pixel addition to form final output.
4. The method for detecting the salient object based on the multi-strategy and cross-feature fusion as claimed in claim 2, wherein the multi-strategy fusion module performs pixel subtraction, pixel addition and pixel multiplication on the depth feature and the RGB feature, and also takes the average value and the maximum value over the channel dimension; the pixel subtraction result, the pixel addition result, the pixel multiplication result, the channel-wise average and the channel-wise maximum are added pixel-wise to obtain a first output; and the fusion feature of the upper layer is upsampled and then added pixel-wise with the first output to give the final output.
5. The method for detecting the salient object based on the multi-strategy and cross feature fusion as claimed in claim 2, wherein the structure of the cross fusion module is as follows:
second input
Figure FDA0003142110220000021
By feature extraction and first input
Figure FDA0003142110220000022
The result of the pixel addition is recorded as
Figure FDA0003142110220000023
Figure FDA0003142110220000024
Output via the first convolution block and
Figure FDA0003142110220000025
and performing pixel addition to obtain M, performing pixel addition on the M and the M, using a pixel addition result as an input for performing pixel multiplication on the M, using a pixel multiplication result as an input for performing pixel subtraction on the M, using a pixel subtraction result as an input for performing channel superposition on the M, and using an output of the channel superposition as a final output after passing through a second convolution block.
CN202110743443.XA 2021-06-30 2021-06-30 Salient object detection method based on multi-strategy and cross feature fusion Withdrawn CN113313077A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110743443.XA CN113313077A (en) 2021-06-30 2021-06-30 Salient object detection method based on multi-strategy and cross feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110743443.XA CN113313077A (en) 2021-06-30 2021-06-30 Salient object detection method based on multi-strategy and cross feature fusion

Publications (1)

Publication Number Publication Date
CN113313077A true CN113313077A (en) 2021-08-27

Family

ID=77381578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110743443.XA Withdrawn CN113313077A (en) 2021-06-30 2021-06-30 Salient object detection method based on multi-strategy and cross feature fusion

Country Status (1)

Country Link
CN (1) CN113313077A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445442A (en) * 2022-01-28 2022-05-06 杭州电子科技大学 Multispectral image semantic segmentation method based on asymmetric cross fusion
CN115796244A (en) * 2022-12-20 2023-03-14 广东石油化工学院 CFF-based parameter identification method for super-nonlinear input/output system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619638A (en) * 2019-08-22 2019-12-27 浙江科技学院 Multi-mode fusion significance detection method based on convolution block attention module
CN111242181A (en) * 2020-01-03 2020-06-05 大连民族大学 RGB-D salient object detector based on image semantics and details
CN112149662A (en) * 2020-08-21 2020-12-29 浙江科技学院 Multi-mode fusion significance detection method based on expansion volume block
CN112529862A (en) * 2020-12-07 2021-03-19 浙江科技学院 Significance image detection method for interactive cycle characteristic remodeling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619638A (en) * 2019-08-22 2019-12-27 浙江科技学院 Multi-mode fusion significance detection method based on convolution block attention module
CN111242181A (en) * 2020-01-03 2020-06-05 大连民族大学 RGB-D salient object detector based on image semantics and details
CN112149662A (en) * 2020-08-21 2020-12-29 浙江科技学院 Multi-mode fusion significance detection method based on expansion volume block
CN112529862A (en) * 2020-12-07 2021-03-19 浙江科技学院 Significance image detection method for interactive cycle characteristic remodeling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SANGHYUN WOO et al.: "CBAM: Convolutional Block Attention Module", Computer Vision - ECCV 2018 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445442A (en) * 2022-01-28 2022-05-06 杭州电子科技大学 Multispectral image semantic segmentation method based on asymmetric cross fusion
CN115796244A (en) * 2022-12-20 2023-03-14 广东石油化工学院 CFF-based parameter identification method for super-nonlinear input/output system
CN115796244B (en) * 2022-12-20 2023-07-21 广东石油化工学院 Parameter identification method based on CFF for ultra-nonlinear input/output system

Similar Documents

Publication Publication Date Title
US20210390700A1 (en) Referring image segmentation
CN111723732B (en) Optical remote sensing image change detection method, storage medium and computing equipment
CN110889449A (en) Edge-enhanced multi-scale remote sensing image building semantic feature extraction method
CN113850825A (en) Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN107871014A (en) A kind of big data cross-module state search method and system based on depth integration Hash
CN112966684A (en) Cooperative learning character recognition method under attention mechanism
CN110246148B (en) Multi-modal significance detection method for depth information fusion and attention learning
CN112418212B (en) YOLOv3 algorithm based on EIoU improvement
CN113780149A (en) Method for efficiently extracting building target of remote sensing image based on attention mechanism
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN109461177B (en) Monocular image depth prediction method based on neural network
CN116994140A (en) Cultivated land extraction method, device, equipment and medium based on remote sensing image
CN113313077A (en) Salient object detection method based on multi-strategy and cross feature fusion
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
CN111915618B (en) Peak response enhancement-based instance segmentation algorithm and computing device
CN111523463B (en) Target tracking method and training method based on matching-regression network
CN113192073A (en) Clothing semantic segmentation method based on cross fusion network
CN113487600B (en) Feature enhancement scale self-adaptive perception ship detection method
CN113269224A (en) Scene image classification method, system and storage medium
CN112529862A (en) Significance image detection method for interactive cycle characteristic remodeling
CN114170623A (en) Human interaction detection equipment and method and device thereof, and readable storage medium
CN114926734B (en) Solid waste detection device and method based on feature aggregation and attention fusion
Chen et al. MSF-Net: A multiscale supervised fusion network for building change detection in high-resolution remote sensing images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210827

WW01 Invention patent application withdrawn after publication