CN112348870B - Significance target detection method based on residual error fusion - Google Patents


Info

Publication number
CN112348870B
CN112348870B (application CN202011235626.2A)
Authority
CN
China
Prior art keywords
feature
module
features
rgb
fusion
Prior art date
Legal status
Active
Application number
CN202011235626.2A
Other languages
Chinese (zh)
Other versions
CN112348870A (en)
Inventor
张立和
金玉
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202011235626.2A priority Critical patent/CN112348870B/en
Publication of CN112348870A publication Critical patent/CN112348870A/en
Application granted granted Critical
Publication of CN112348870B publication Critical patent/CN112348870B/en

Classifications

    • G06T 7/55: Depth or shape recovery from multiple images (G06T 7/00 Image analysis; G06T 7/50 Depth or shape recovery)
    • G06F 18/253: Fusion techniques of extracted features (G06F 18/00 Pattern recognition; G06F 18/20 Analysing; G06F 18/25 Fusion techniques)
    • G06N 3/045: Combinations of networks (G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/08: Learning methods (G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks)
    • G06T 3/4007: Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation (G06T 3/00 Geometric image transformations in the plane of the image; G06T 3/40 Scaling of whole images or parts thereof)
    • G06V 10/40: Extraction of image or video features (G06V 10/00 Arrangements for image or video recognition or understanding)
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform (G06T 2207/00 Indexing scheme for image analysis or image enhancement; G06T 2207/20 Special algorithmic details)
    • G06V 2201/07: Target detection (G06V 2201/00 Indexing scheme relating to image or video recognition or understanding)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of artificial intelligence and provides a salient object detection method based on residual fusion. The method first constructs a saliency detection model, extracts multi-level RGB image features and depth image features through a dual-stream feature extraction module, further extracts deep features with residual modules, and uses fusion modules to gradually fuse the features from the RGB feature extraction branch with those from the corresponding previous level, so as to train and obtain the final model. The invention realizes end-to-end saliency prediction with low model complexity and can fully and effectively exploit RGB image information and depth image information to predict salient regions.

Description

Significance target detection method based on residual error fusion
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to deep learning, and particularly relates to an image saliency detection method.
Background
Salient object detection is an important first step for a computer to understand its surrounding environment. Its task is to enable a computer to mimic the human attention mechanism and detect the regions of an image that attract attention. These attractive regions contain most of the visual information in the image. By screening out the foreground regions that carry the main visual information, subsequent image-understanding steps can obtain cleaner and more accurate content from the image, and the computation and storage spent on the image background can be reduced, so that the overall performance of subsequent image understanding improves. Usually, one only focuses on the areas of an image that are most attractive to human eyes, i.e. the foreground areas or salient objects, while ignoring the background areas. Therefore, computers are used to simulate the human visual system for saliency detection.
However, most existing deep-learning-based saliency detection methods target RGB images only and rely solely on color images while ignoring the corresponding depth information; this limits the accuracy and efficiency of saliency detection, especially when foreground and background are hard to distinguish, which is why RGB-D saliency detection emerged. RGB-D saliency detection aims to accurately detect salient objects from an image with the aid of its depth image. Although some progress has been made in RGB-D saliency detection, there is still large room for improvement. First, although devices such as the Kinect and light-field cameras make depth images easy to acquire, they introduce a certain amount of noise, and how to design a better algorithm to fit a model under this condition deserves careful consideration. Second, deep-learning-based saliency detection algorithms commonly face the problem of how to better fuse RGB information and depth information: the RGB image contains a large amount of information such as color and texture, while the depth image contains rich geometric and edge information that complements the RGB image; how to better combine the two so that they complement each other and highlight salient regions more accurately is a problem worth considering at present.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to make up for the shortcomings of existing methods, an RGB-D image saliency detection method based on residual fusion is provided, with the aim of achieving higher model accuracy.
The technical scheme of the invention is as follows:
a salient object detection method based on residual fusion comprises the following steps:
(1) constructing a significance detection model
The saliency detection model comprises a dual-stream feature extraction module, a multi-scale feature pyramid pooling module, a residual fusion module and a parallel upsampling module;
(2) carrying out channel copying on an original depth image corresponding to the RGB image I to obtain a depth image D;
(3) as shown in fig. 1, an RGB image I and its depth image D are input into the saliency detection model, and multi-level RGB image features {Ii, i = 1,2,3,4,5} and multi-level depth image features {Di, i = 1,2,3,4,5} are extracted by the RGB image feature extraction branch (the top row in fig. 1) and the depth image feature extraction branch (the bottom row in fig. 1) of the dual-stream feature extraction module, respectively;
(4) a multi-scale feature pyramid pooling module is added at the final stage of the RGB image feature extraction branch, deeper features are further extracted through five residual fusion modules (res-fuse in fig. 1), and parallel upsampling is performed step by step to obtain the final saliency prediction result P_final;
(5) The multi-scale feature pyramid pooling module comprises six sub-branches, as shown in fig. 2, and is used to obtain context information of the input feature data: the first sub-branch adopts a 1 × 1 convolutional layer; the second, third, and fourth sub-branches adopt dilated (hole) convolutions with dilation rates of 3, 5, and 7, respectively; the fifth sub-branch adopts global average pooling to obtain a 1 × 1 feature representation; and the sixth sub-branch connects the input feature data directly to the output through a skip connection; the first four branches further strengthen the feature expression with the 1 × 1 convolutional layer and the dilated convolutions while keeping the feature size and the number of channels unchanged; the feature representations obtained by these convolutions are then upsampled to the size of the input feature using bilinear interpolation; finally, the six sub-branches are combined by channel-wise concatenation to obtain the multi-scale feature pyramid pooled representation of the input feature data;
(6) the residual fusion module is used to fuse the branch features {Ii, Di, i = 1,2,3,4,5} from the feature extraction module and is defined as follows:
res_fuse(I5, D5) = F3(P(I5)) + F1(D5)
res_fuse(Ii, Di) = F3(C(Ii, Up(res_fuse(Ii+1, Di+1)))) + F1(Di),  i = 1, 2, 3, 4
where P(·) denotes the multi-scale feature pyramid pooling module, F3(·) denotes three consecutive convolution and ReLU operations, F1(·) denotes one convolution and ReLU operation, and + denotes element-wise addition;
wherein res_fuse(·) denotes residual fusion, Up(·) denotes parallel upsampling, and C(·) denotes fusion of two inputs along the channel direction; for residual fusion, two cases are distinguished: when the input is the last stage of the feature extraction module, I5 is passed directly through the multi-scale feature pyramid pooling module and used as the RGB feature; this feature, after three consecutive convolution and ReLU operations, is added element-wise to the feature obtained from D5 after one convolution and ReLU operation, giving the residual fusion result; otherwise, for the i-th level RGB image feature Ii and the i-th level depth image feature Di of the feature extraction module, the residual fusion result obtained from Ii+1 and Di+1 is first parallel-upsampled and then fused with Ii along the channel direction to serve as the RGB feature; this feature, after three consecutive convolution and ReLU operations, is added element-wise to the feature obtained from Di after one convolution and ReLU operation, giving the residual fusion result;
(7) the parallel upsampling comprises four sub-branches, as shown in fig. 4, which use convolutional layers with different receptive fields and are intended to capture different local structures; the responses produced by the four convolutional layers are then concatenated into a tensor feature of size H × W × 2C; to rearrange this tensor into a spatially larger feature, the 2C channels are split in units of C/2 and re-stitched along the H direction and the W direction respectively, finally yielding a feature of size 2H × 2W × C/2;
(8) the weight parameters are initialized with a VGG-16 model pre-trained on ImageNet; in the model training phase, the cross-entropy loss function is used as the objective and optimized with the Adam algorithm, with momentum set to 0.9, weight decay set to 0.1, base learning rate set to 1 × 10^-6, and batch size set to 1.
The invention has the beneficial effects that: the method makes full use of the complementary information contained in the RGB image and the corresponding depth image, and accurately predicts the salient regions in an RGB-D image by means of residual fusion. In addition, the feature aggregation module and a reasonable upsampling scheme aggregate features of different scales, so that end-to-end saliency prediction can be realized while fully and effectively utilizing RGB image information and depth image information.
Drawings
Fig. 1 is a framework diagram of the RGB-D saliency detection method based on residual fusion, where the top row represents the RGB image feature extraction branch and the bottom row represents the depth image feature extraction branch;
FIG. 2 is a schematic diagram of a multi-scale feature pyramid pooling module;
FIG. 3 is a schematic diagram of a residual fusion module;
fig. 4 is a schematic diagram of a parallel upsampling module.
Detailed Description
The following further describes the specific embodiments of the present invention with reference to the drawings and technical solutions.
The invention is implemented as follows:
(1) constructing significance detection model
The saliency detection model comprises a dual-stream feature extraction module, a multi-scale feature pyramid pooling module, a residual fusion module and a parallel upsampling module.
(2) The depth image corresponding to the RGB image is channel-copied into a three-channel image, giving the depth image D.
(3) The RGB image I and its depth image D are input into the saliency detection model, and multi-level RGB image features {Ii, i = 1,2,3,4,5} and depth image features {Di, i = 1,2,3,4,5} are extracted by the RGB image feature extraction branch (RGB encoder) and the depth image feature extraction branch (depth encoder) of the dual-stream feature extraction module, respectively.
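Steps (2) and (3) can be illustrated with the following minimal PyTorch sketch. It assumes a torchvision VGG-16 backbone split into five convolutional stages; the framework, the stage boundaries, and the input resolution are illustrative assumptions, not details fixed by the patent text.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGGStream(nn.Module):
    # One encoder branch: VGG-16 convolutional layers split into five stages.
    def __init__(self):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features   # ImageNet-pretrained initialization
        # assumed stage split: conv1_x .. conv5_x, final max-pool dropped
        self.stages = nn.ModuleList([feats[:4], feats[4:9], feats[9:16], feats[16:23], feats[23:30]])

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = stage(x)
            outs.append(x)
        return outs                                        # five feature levels

rgb_encoder, depth_encoder = VGGStream(), VGGStream()
rgb = torch.randn(1, 3, 256, 256)                          # RGB image I
depth_raw = torch.randn(1, 1, 256, 256)                    # original single-channel depth map
depth = depth_raw.repeat(1, 3, 1, 1)                       # step (2): channel copy -> three-channel depth image D
I_feats = rgb_encoder(rgb)                                 # {Ii, i = 1..5}
D_feats = depth_encoder(depth)                             # {Di, i = 1..5}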
(4) A multi-scale feature pyramid pooling module is added at the final stage of the RGB encoder branch; deeper features are further extracted through the residual fusion modules and upsampled step by step, with stage-wise supervision applied during extraction to achieve global optimization.
(5) As shown in fig. 2, the multi-scale feature pyramid pooling module comprises six sub-branches to obtain context information of the input RGB image feature data: the first sub-branch adopts a 1 × 1 convolutional layer; the second, third, and fourth sub-branches adopt dilated (hole) convolutions with dilation rates of 3, 5, and 7, respectively; and the fifth sub-branch adopts global average pooling to obtain a 1 × 1 feature representation; the sixth sub-branch connects the input feature directly to the output through a skip connection. The first four branches further enhance feature expression using the 1 × 1 convolutional layer and the dilated convolutions while keeping the feature size and the number of channels unchanged. The feature representations obtained by these convolutions are upsampled to the size of the input feature using bilinear interpolation, and the globally average-pooled feature of the fifth branch is likewise upsampled and cascaded into the final feature map. Finally, the features of the six sub-branches are combined by channel-wise concatenation to obtain the feature representation fused with multi-scale pooling.
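For illustration, a PyTorch sketch of the module in fig. 2 is given below; the 3 × 3 kernel used inside the dilated branches and the point at which the globally pooled branch is upsampled are assumptions where the text leaves them open.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))                  # 1 x 1 convolution
        self.b2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=3, dilation=3), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=5, dilation=5), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=7, dilation=7), nn.ReLU(inplace=True))
        self.b5 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[2:]
        up = lambda t: F.interpolate(t, size=(h, w), mode='bilinear', align_corners=False)
        # sixth sub-branch: direct skip connection of the input feature
        outs = [self.b1(x), self.b2(x), self.b3(x), self.b4(x), up(self.b5(x)), x]
        return torch.cat(outs, dim=1)       # channel-wise concatenation of the six sub-branches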
(6) The residual fusion module is used to fuse the branch features {Ii, Di, i = 1,2,3,4,5} extracted by the two encoder branches and is defined as follows:
res_fuse(I5, D5) = F3(P(I5)) + F1(D5)
res_fuse(Ii, Di) = F3(C(Ii, Up(res_fuse(Ii+1, Di+1)))) + F1(Di),  i = 1, 2, 3, 4
where P(·) denotes the multi-scale feature pyramid pooling module, F3(·) denotes three consecutive convolution and ReLU operations, F1(·) denotes one convolution and ReLU operation, and + denotes element-wise addition.
wherein res_fuse(·) denotes residual fusion, Up(·) denotes parallel upsampling, and C(·) denotes fusion of two features along the channel direction, as shown in fig. 3. For residual fusion there are thus three input sources: the RGB encoder feature of the corresponding size, the depth encoder feature, and the output feature of the previous residual fusion module (this last input does not exist when the residual block is the rightmost one in fig. 1, i.e. i = 5). The RGB encoder feature of the corresponding size is first concatenated along the channel direction with the feature from the previous residual fusion module, and the result, together with the depth encoder feature, is fed into the residual fusion module to obtain the residual fusion feature.
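The residual fusion module described above can be sketched as follows (PyTorch, for illustration; the 3 × 3 kernel size and the channel widths are assumptions). The caller is responsible for making rgb_ch match the concatenated width when a previous fusion result is present.

import torch
import torch.nn as nn

class ResFuse(nn.Module):
    def __init__(self, rgb_ch, depth_ch, out_ch):
        super().__init__()
        # F3: three consecutive convolution + ReLU operations on the rgb-side feature
        self.f3 = nn.Sequential(
            nn.Conv2d(rgb_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        # F1: one convolution + ReLU operation on the depth encoder feature Di
        self.f1 = nn.Sequential(nn.Conv2d(depth_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, rgb_feat, depth_feat, prev=None):
        # prev: parallel-upsampled output of the previous residual fusion module; None when i = 5
        if prev is not None:
            rgb_feat = torch.cat([rgb_feat, prev], dim=1)   # C(Ii, Up(res_fuse(Ii+1, Di+1)))
        return self.f3(rgb_feat) + self.f1(depth_feat)      # element-wise residual addition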
(7) As shown in fig. 4, the parallel upsampling comprises four sub-branches, which use convolutional layers with different receptive fields and are intended to capture different local structures. The responses produced by the four convolutional layers are then concatenated into a tensor feature of size H × W × 2C; to rearrange this tensor into a spatially larger feature, the 2C channels are split in units of C/2 and re-stitched along the H direction and the W direction respectively, finally yielding a feature of size 2H × 2W × C/2.
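A sketch of the parallel upsampling of fig. 4 follows. The splitting of the H × W × 2C tensor into C/2-channel groups and its re-stitching along H and W is realized here with a pixel-shuffle rearrangement, which yields the stated 2H × 2W × C/2 output; the four kernel sizes chosen for the "different receptive fields" are an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelUpsample(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # four parallel convolutions with different receptive fields, each producing ch/2 channels
        self.branches = nn.ModuleList(
            [nn.Conv2d(ch, ch // 2, k, padding=k // 2) for k in (1, 3, 5, 7)])

    def forward(self, x):                                     # x: N x C x H x W
        y = torch.cat([b(x) for b in self.branches], dim=1)   # N x 2C x H x W
        # regroup the 2C channels in units of C/2 and re-stitch along H and W
        return F.pixel_shuffle(y, 2)                          # N x C/2 x 2H x 2W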
(8) The weight parameters are initialized with a VGG-16 model pre-trained on ImageNet; in the model training phase, the cross-entropy loss function is used as the objective and optimized with the Adam algorithm, with momentum set to 0.9, weight decay set to 0.1, base learning rate set to 1 × 10^-6, and batch size set to 1.
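Step (8) might be wired up as in the sketch below (PyTorch, assumed): binary cross-entropy on the predicted saliency map stands in for the cross-entropy objective, Adam's first-moment coefficient plays the role of the stated momentum, and the model is a placeholder supplied by the caller.

import torch
import torch.nn.functional as F

def make_optimizer(model):
    # Adam with base learning rate 1e-6, "momentum" 0.9 as beta1, weight decay 0.1
    return torch.optim.Adam(model.parameters(), lr=1e-6, betas=(0.9, 0.999), weight_decay=0.1)

def train_step(model, optimizer, rgb, depth, gt):
    # one optimization step on a single RGB-D pair (batch size 1)
    pred = model(rgb, depth)                     # final saliency prediction P_final in [0, 1]
    loss = F.binary_cross_entropy(pred, gt)      # cross-entropy loss as the objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()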

Claims (1)

1. A salient object detection method based on residual fusion is characterized by comprising the following steps:
(1) constructing a significance detection model
The saliency detection model comprises a dual-stream feature extraction module, a multi-scale feature pyramid pooling module, a residual fusion module and a parallel upsampling module;
(2) carrying out channel copying on an original depth image corresponding to the RGB image I to obtain a depth image D;
(3) an RGB image I and its depth image D are input into the saliency detection model, and multi-level RGB image features {Ii, i = 1,2,3,4,5} and multi-level depth image features {Di, i = 1,2,3,4,5} are extracted by the RGB image feature extraction branch and the depth image feature extraction branch of the dual-stream feature extraction module, respectively;
(4) a multi-scale feature pyramid pooling module is added at the final stage of the RGB image feature extraction branch, deeper features are further extracted through five residual fusion modules, and parallel upsampling is performed step by step to obtain the final saliency prediction result P_final;
(5) the multi-scale feature pyramid pooling module comprises six sub-branches and is used to obtain context information of the input feature data: the first sub-branch adopts a 1 × 1 convolutional layer; the second, third, and fourth sub-branches adopt dilated (hole) convolutions with dilation rates of 3, 5, and 7, respectively; the fifth sub-branch adopts global average pooling to obtain a 1 × 1 feature representation; and the sixth sub-branch connects the input feature data directly to the output through a skip connection; the first four branches further strengthen the feature expression with the 1 × 1 convolutional layer and the dilated convolutions while keeping the feature size and the number of channels unchanged; the feature representations obtained by these convolutions are then upsampled to the size of the input feature using bilinear interpolation; finally, the six sub-branches are combined by channel-wise concatenation to obtain the multi-scale feature pyramid pooled representation of the input feature data;
(6) the residual fusion module is used to fuse the branch features {Ii, Di, i = 1,2,3,4,5} from the dual-stream feature extraction module and is defined as follows:
res_fuse(I5, D5) = F3(P(I5)) + F1(D5)
res_fuse(Ii, Di) = F3(C(Ii, Up(res_fuse(Ii+1, Di+1)))) + F1(Di),  i = 1, 2, 3, 4
where P(·) denotes the multi-scale feature pyramid pooling module, F3(·) denotes three consecutive convolution and ReLU operations, F1(·) denotes one convolution and ReLU operation, and + denotes element-wise addition;
wherein res_fuse(·) denotes residual fusion, Up(·) denotes parallel upsampling, and C(·) denotes fusion of two inputs along the channel direction; for residual fusion, two cases are distinguished: when the input is the last stage of the feature extraction module, I5 is passed directly through the multi-scale feature pyramid pooling module and used as the RGB feature; this feature, after three consecutive convolution and ReLU operations, is added element-wise to the feature obtained from D5 after one convolution and ReLU operation, giving the residual fusion result; otherwise, for the i-th level RGB image feature Ii and the i-th level depth image feature Di of the feature extraction module, the residual fusion result obtained from Ii+1 and Di+1 is first parallel-upsampled and then fused with Ii along the channel direction to serve as the RGB feature; this feature, after three consecutive convolution and ReLU operations, is added element-wise to the feature obtained from Di after one convolution and ReLU operation, giving the residual fusion result;
(7) the parallel upsampling comprises four sub-branches, which use convolutional layers with different receptive fields and are intended to capture different local structures; the responses produced by the four convolutional layers are then concatenated into a tensor feature of size H × W × 2C; to rearrange this tensor into a spatially larger feature, the 2C channels are split in units of C/2 and re-stitched along the H direction and the W direction respectively, finally yielding a feature of size 2H × 2W × C/2;
(8) the weight parameters are initialized with a VGG-16 model pre-trained on ImageNet; in the model training phase, the cross-entropy loss function is used as the objective and optimized with the Adam algorithm, with momentum set to 0.9, weight decay set to 0.1, base learning rate set to 1 × 10^-6, and batch size set to 1.
CN202011235626.2A 2020-11-06 2020-11-06 Significance target detection method based on residual error fusion Active CN112348870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011235626.2A CN112348870B (en) 2020-11-06 2020-11-06 Significance target detection method based on residual error fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011235626.2A CN112348870B (en) 2020-11-06 2020-11-06 Significance target detection method based on residual error fusion

Publications (2)

Publication Number Publication Date
CN112348870A CN112348870A (en) 2021-02-09
CN112348870B (en) 2022-09-30

Family

ID=74429671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011235626.2A Active CN112348870B (en) 2020-11-06 2020-11-06 Significance target detection method based on residual error fusion

Country Status (1)

Country Link
CN (1) CN112348870B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205481A (en) * 2021-03-19 2021-08-03 浙江科技学院 Salient object detection method based on stepped progressive neural network
CN113344844A (en) * 2021-04-14 2021-09-03 山东师范大学 Target fruit detection method and system based on RGB-D multimode image information
CN113408350B (en) * 2021-05-17 2023-09-19 杭州电子科技大学 Remote sensing image significance detection method based on edge feature extraction
CN113486899B (en) * 2021-05-26 2023-01-24 南开大学 Saliency target detection method based on complementary branch network
CN113298154B (en) * 2021-05-27 2022-11-11 安徽大学 RGB-D image salient object detection method
CN113536973B (en) * 2021-06-28 2023-08-18 杭州电子科技大学 Traffic sign detection method based on saliency
CN113763447B (en) * 2021-08-24 2022-08-26 合肥的卢深视科技有限公司 Method for completing depth map, electronic device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210539B (en) * 2019-05-22 2022-12-30 西安电子科技大学 RGB-T image saliency target detection method based on multi-level depth feature fusion
CN110909594A (en) * 2019-10-12 2020-03-24 杭州电子科技大学 Video significance detection method based on depth fusion
CN111242138B (en) * 2020-01-11 2022-04-01 杭州电子科技大学 RGBD significance detection method based on multi-scale feature fusion
CN111582316B (en) * 2020-04-10 2022-06-28 天津大学 RGB-D significance target detection method
CN111798436A (en) * 2020-07-07 2020-10-20 浙江科技学院 Salient object detection method based on attention expansion convolution feature fusion

Also Published As

Publication number Publication date
CN112348870A (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN112348870B (en) Significance target detection method based on residual error fusion
CN111242138B (en) RGBD significance detection method based on multi-scale feature fusion
CN111582316B (en) RGB-D significance target detection method
CN111931787A (en) RGBD significance detection method based on feature polymerization
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN109005398B (en) Stereo image parallax matching method based on convolutional neural network
CN110929735B (en) Rapid significance detection method based on multi-scale feature attention mechanism
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
CN114612832A (en) Real-time gesture detection method and device
CN112767418A (en) Mirror image segmentation method based on depth perception
CN114926734B (en) Solid waste detection device and method based on feature aggregation and attention fusion
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN114283315A (en) RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN113888505B (en) Natural scene text detection method based on semantic segmentation
CN115908793A (en) Coding and decoding structure semantic segmentation model based on position attention mechanism
Zhang et al. Spatial-information guided adaptive context-aware network for efficient RGB-D semantic segmentation
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
Wang et al. A multi-scale attentive recurrent network for image dehazing
CN113222016B (en) Change detection method and device based on cross enhancement of high-level and low-level features
CN115471718A (en) Construction and detection method of lightweight significance target detection model based on multi-scale learning
CN113962332B (en) Salient target identification method based on self-optimizing fusion feedback
CN113920317A (en) Semantic segmentation method based on visible light image and low-resolution depth image

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant