CN113538379B - Dual-stream encoding fusion saliency detection method based on RGB and grayscale images

Publication number: CN113538379B
Authority: CN (China)
Prior art keywords: module, layer, output end, encode2, image
Legal status: Active
Application number: CN202110805754.4A
Other languages: Chinese (zh)
Other versions: CN113538379A
Inventors: 徐涛, 赵未硕, 史增勇, 周纪勇, 蔡磊, 马玉琨, 柴豪杰
Assignee (current and original): Henan Institute of Science and Technology
Application filed by: Henan Institute of Science and Technology
Priority to: CN202110805754.4A
Publication of: CN113538379A (application), CN113538379B (grant)

Classifications

    • G06T 7/0002 - Image analysis; inspection of images, e.g. flaw detection
    • G06N 3/045 - Neural networks; combinations of networks
    • G06N 3/048 - Neural networks; activation functions
    • G06N 3/08 - Neural networks; learning methods
    • G06T 9/002 - Image coding using neural networks
    • G06T 2207/10024 - Image acquisition modality; color image
    • G06T 2207/20221 - Image combination; image fusion; image merging

Abstract

The invention provides a dual-stream encoding fusion saliency detection method based on RGB and grayscale images, which comprises the following steps: first, an RGB image and its corresponding ground-truth map are obtained, the grayscale image corresponding to the RGB image is generated, and the grayscale image is copied and combined to obtain a three-channel grayscale map; second, the three-channel grayscale map and the RGB image are input separately into an encoder network to obtain multi-scale feature maps; then the multi-scale feature maps are decoded with a decoder network and a predicted image is output; the loss value between the predicted image and the ground-truth map is calculated with a loss function, and whether training of the encoder-decoder network is finished is judged from the loss value; finally, an image to be detected is obtained, the image to be detected and its corresponding three-channel grayscale map are input separately into the encoder-decoder network, and the prediction result for the image to be detected is output. With the designed dual-stream encoder and multi-scale decoder, the invention optimizes the edge part of the saliency map and highlights the salient objects.

Description

Dual-stream encoding fusion saliency detection method based on RGB and grayscale images
Technical Field
The invention relates to the technical field of image processing, and in particular to a dual-stream encoding fusion saliency detection method based on RGB and grayscale images.
Background
Salient Object Detection (SOD) aims to highlight the objects or regions in a scene that are most interesting to human vision. It has wide application in computer vision, including image segmentation, image retrieval, object detection, visual tracking, image compression and scene classification. Traditional methods rely mainly on hand-crafted low-level features such as color, shape and texture, and on heuristic priors such as the center prior and the background prior, but the lack of high-level semantic information leads to unsatisfactory detection results. Recently, thanks to the unprecedented success of Convolutional Neural Networks (CNNs), particularly Fully Convolutional Networks (FCNs), FCN-based approaches have greatly improved SOD performance. Most of them use RGB images for saliency prediction. In recent years some methods have used an RGB image together with a depth image for saliency prediction (RGB-D), and RGB-D saliency detection effectively improves the quality of the predicted map.
However, RGB-D saliency detection requires an RGB image and its depth map to be input together for prediction. Although it produces high-quality predictions, the cost of acquiring depth maps is high and most devices that carry depth cameras are too expensive, so the application scenarios of this approach are currently limited. Methods that perform saliency prediction with RGB images alone have achieved good results, but the existing methods still have many problems. First, in the encoding of existing network models, using only the RGB image makes some feature information difficult to distinguish, so the edges of the salient object in the predicted image are not clear enough, the interior of the salient object is displayed unevenly, and the contour of the salient object is predicted inaccurately. In addition, feature extraction in these networks is insufficient: most of them focus only on feature fusion in the decoding stage and neglect feature extraction in the encoding stage.
Disclosure of Invention
Aiming at the technical problems that, because only RGB images are used in the encoding of existing network models, some feature information is difficult to distinguish, so that the edges of the salient object in the predicted image are not clear enough, the interior of the salient object is displayed unevenly and the contour of the salient object is predicted inaccurately, the invention provides a dual-stream encoding fusion saliency detection method based on RGB and grayscale images. The method comprises a dual-stream encoder and a multi-scale decoder, and designs an encoding fusion module and a feature fusion module that combine the respective advantages of RGB features and grayscale-image features; at the same time, considering that the salient objects in saliency maps differ in size, multi-scale side-output fusion is adopted during decoding. The method can therefore better optimize the edge part of the saliency map, highlight the salient object more uniformly, and extract more salient features when the background or the salient object is complex.
The technical scheme of the invention is realized as follows:
A dual-stream encoding fusion saliency detection method based on RGB and grayscale images comprises the following steps:
S1, obtaining an RGB image and its corresponding ground-truth map from the DUTS-TR dataset, and processing the RGB image to generate the corresponding grayscale image;
S2, copying and combining the grayscale image from step S1 to obtain a three-channel grayscale map;
S3, inputting the three-channel grayscale map and the RGB image separately into an encoder network to obtain multi-scale feature maps;
S4, decoding the multi-scale feature maps with a decoder network and outputting a predicted image;
S5, calculating the loss value between the predicted image and the ground-truth map with a loss function and judging whether the loss value reaches the threshold; if it does, the trained encoder-decoder network is obtained and step S6 is executed; otherwise, the weight parameters of all layers of the encoder network and the decoder network are modified automatically according to the loss value and the procedure returns to step S3;
S6, obtaining the image to be detected, generating its three-channel grayscale map, inputting the image to be detected and its three-channel grayscale map separately into the encoder-decoder network, and outputting the prediction result for the image to be detected.
The grayscale image corresponding to the RGB image is generated as follows:
Gray = R×0.299 + G×0.587 + B×0.114;
wherein Gray is the grayscale image, and R, G and B are the red, green and blue channel pixel values of the RGB image, respectively.
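For example, a pixel with R = 200, G = 100 and B = 50 gives Gray = 200×0.299 + 100×0.587 + 50×0.114 = 59.8 + 58.7 + 5.7 = 124.2. A minimal preprocessing sketch of this conversion and of the three-channel duplication of step S2 is given below, assuming Python with OpenCV and NumPy; the file name, the variable names and the 224×224 size used in the embodiment described later are illustrative:

```python
import cv2
import numpy as np

# Read an RGB image (OpenCV loads BGR, so convert) and scale it to 224 x 224.
rgb = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
rgb = cv2.resize(rgb, (224, 224)).astype(np.float32)

# S1: weighted conversion to a single-channel grayscale image.
r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
gray = 0.299 * r + 0.587 * g + 0.114 * b

# S2: copy the grayscale image into three identical channels.
gray3 = np.stack([gray, gray, gray], axis=-1)   # shape (224, 224, 3)
```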
The encoder network comprises an encode1-I module, an encode1-II module, an encode1-III module, an encode1-IV module, an encode1-V module, an encode2-I module, an encode2-II module, an encode2-III module, an encode2-IV module, an encode2-V module, dual-stream fusion modules TSF_1, TSF_2, TSF_3, TSF_4 and TSF_5, and a bridge module;
The input end of the encode1-I module is connected with a first input layer whose input is the three-channel grayscale map; the output end of the encode1-I module is connected with the input end of the encode1-II module, the output end of the encode1-II module is connected with the input end of the encode1-III module, the output end of the encode1-III module is connected with the input end of the encode1-IV module, and the output end of the encode1-IV module is connected with the input end of the encode1-V module;
The input end of the encode2-I module is connected with a second input layer whose input is the RGB image; the second input layer, the output end of the encode1-I module and the output end of the encode2-I module are all connected with the input end of the dual-stream fusion module TSF_1, and the output end of TSF_1 is connected with the input end of the encode2-II module; the output ends of the encode1-II, encode2-I and encode2-II modules are all connected with the input end of TSF_2, and the output end of TSF_2 is connected with the input end of the encode2-III module; the output ends of the encode1-III, encode2-II and encode2-III modules are all connected with the input end of TSF_3, and the output end of TSF_3 is connected with the input end of the encode2-IV module; the output ends of the encode1-IV, encode2-III and encode2-IV modules are all connected with the input end of TSF_4, and the output end of TSF_4 is connected with the input end of the encode2-V module; the output ends of the encode1-V, encode2-IV and encode2-V modules are all connected with the input end of TSF_5, and the output end of TSF_5 is connected with the input end of the bridge module;
The second input layer, the output ends of the encode2-I, encode2-II, encode2-III, encode2-IV and encode2-V modules, and the output end of the bridge module are all connected with the decoder network.
The decoder network comprises feature fusion modules FF_1, FF_2, FF_3, FF_4 and FF_5, decoding fusion modules DF_1, DF_2, DF_3, DF_4 and DF_5, and a decode-I module, a decode-II module, a decode-III module, a decode-IV module and a decode-V module;
The input end of the feature fusion module FF_1 is connected with the second input layer, the output end of the encode2-I module and the output end of the encode2-II module, and the output end of FF_1 is connected with the input end of the decoding fusion module DF_1; the input end of FF_2 is connected with the output ends of the encode2-I, encode2-II and encode2-III modules, and the output end of FF_2 is connected with the input end of DF_2; the input end of FF_3 is connected with the output ends of the encode2-II, encode2-III and encode2-IV modules, and the output end of FF_3 is connected with the input end of DF_3; the input end of FF_4 is connected with the output ends of the encode2-III, encode2-IV and encode2-V modules, and the output end of FF_4 is connected with the input end of DF_4; the input end of FF_5 is connected with the output ends of the encode2-IV and encode2-V modules and the output end of the bridge module, and the output end of FF_5 and the output end of the bridge module are both connected with the input end of DF_5; the output end of DF_5 is connected with the input end of the decode-V module, the output end of the decode-V module is connected with the input end of DF_4, the output end of DF_4 is connected with the input end of the decode-IV module, the output end of the decode-IV module is connected with the input end of DF_3, the output end of DF_3 is connected with the input end of the decode-III module, the output end of the decode-III module is connected with the input end of DF_2, the output end of DF_2 is connected with the input end of the decode-II module, the output end of the decode-II module is connected with the input end of DF_1, the output end of DF_1 is connected with the input end of the decode-I module, the output end of the decode-I module is connected with the input end of the output layer, and the output layer outputs the predicted image.
The first input layer and the second input layer each have the structure convolution layer I - batch normalization layer I - activation layer I; convolution layer I has a 3×3 kernel, stride 2, padding 1, 3 input channels and 64 output channels;
The encode1-I and encode2-I modules each have the structure convolution layer II - batch normalization layer II - activation layer II - convolution layer III - batch normalization layer III; convolution layers II and III have 3×3 kernels, stride 1, padding 1, 64 input channels and 64 output channels;
The encode1-II and encode2-II modules each have the structure convolution layer IV - batch normalization layer IV - activation layer IV - convolution layer V - batch normalization layer V; convolution layer IV has a 3×3 kernel, stride 2, padding 1, 64 input channels and 128 output channels; convolution layer V has a 3×3 kernel, stride 1, padding 1, 128 input channels and 128 output channels;
The encode1-III and encode2-III modules each have the structure convolution layer VI - batch normalization layer VI - activation layer VI - convolution layer VII - batch normalization layer VII; convolution layer VI has a 3×3 kernel, stride 2, padding 1, 128 input channels and 256 output channels; convolution layer VII has a 3×3 kernel, stride 1, padding 1, 256 input channels and 256 output channels;
The encode1-IV and encode2-IV modules each have the structure convolution layer VIII - batch normalization layer VIII - activation layer VIII - convolution layer IX - batch normalization layer IX; convolution layer VIII has a 3×3 kernel, stride 2, padding 1, 256 input channels and 512 output channels; convolution layer IX has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels;
The encode1-V and encode2-V modules each have the structure convolution layer X - batch normalization layer X - activation layer X - convolution layer XI - batch normalization layer XI; convolution layer X has a 3×3 kernel, stride 2, padding 1, 512 input channels and 512 output channels; convolution layer XI has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels;
The bridge module has the structure convolution layer XII - batch normalization layer XII - activation layer XII - convolution layer XIII - batch normalization layer XIII - activation layer XIII; convolution layer XII has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels; convolution layer XIII has a 3×3 kernel, stride 1, padding 1, 512 input channels and 64 output channels;
The decode-I, decode-II, decode-III, decode-IV and decode-V modules each have the structure first convolution layer - first batch normalization layer - first activation layer - second convolution layer - second batch normalization layer - second activation layer; the first convolution layer has a 3×3 kernel, stride 1, padding 1, 128 input channels and 64 output channels; the second convolution layer has a 3×3 kernel, stride 1, padding 1, 64 input channels and 64 output channels;
The output layer has the structure third convolution layer - third activation layer; the third convolution layer has a 3×3 kernel, stride 1, padding 1, 64 input channels and 1 output channel;
Activation layers I, II, IV, VI, VIII, X, XII and XIII and the first and second activation layers are all ReLU activation functions; the third activation layer is a Sigmoid activation function.
The dual-stream fusion modules TSF_1-TSF_5 are calculated as follows:
TSF_i = Relu(Bn(Conv(Concat(encode^R_{i-1}, encode^R_i ⊕ encode^G_i))));
wherein TSF_i ∈ {TSF_1, TSF_2, TSF_3, TSF_4, TSF_5}; encode^R_{i-1} and encode^R_i are the results produced by the encode2-(i-1) and encode2-i modules in the RGB stream, and when i = 1, encode^R_0 is the result of the second input layer; encode^G_i is the result produced by the encode1-i module in the Gray stream; ⊕ denotes element-by-element addition, Concat(·) denotes the joining operation along the channel dimension, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function.
The bridge module is calculated as follows:
bridge_out = Relu(Bn(Conv(Relu(Bn(Conv(TSF_5))))));
wherein bridge_out is the output result of the bridge module, TSF_5 is the output result of the dual-stream fusion module TSF_5, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function.
The feature fusion modules FF_1-FF_5 are calculated as follows:
FF_i = Concat(Relu(Bn(Conv(encode'_{i-1}))), Relu(Bn(Conv(encode_i))), Relu(Bn(Conv(encode'_{i+1}))));
wherein FF_i ∈ {FF_1, FF_2, FF_3, FF_4, FF_5}; encode'_{i-1} is the size-converted output of the encode2-(i-1) module in the RGB stream, encode'_{i+1} is the size-converted output of the encode2-(i+1) module in the RGB stream, and encode_i is the output of the encode2-i module in the RGB stream; when i = 1, encode'_0 is the size-converted result of the second input layer, and when i = 5, encode'_6 is the size-converted output of the bridge module.
The decoding fusion modules DF_1-DF_5 are calculated as follows:
DF_i = Concat(Relu(Bn(Conv(FF_i))), Relu(Bn(Conv(decode_{i+1}))));
wherein DF_i ∈ {DF_1, DF_2, DF_3, DF_4, DF_5}; decode_{i+1} is the decoding result of the decode-(i+1) module, and when i = 5, decode_6 is the output result of the bridge module.
The loss function is:
L = Σ_{p=1}^{P} l^(p);
wherein L is the total loss value and l^(p) is the loss value corresponding to the p-th output, with P = 5, corresponding respectively to the outputs of the decode-I, decode-II, decode-III, decode-IV and decode-V modules;
l^(p) = w_bce·l_bce + w_ssim·l_ssim;
wherein l_bce is the BCE loss, l_ssim is the SSIM loss, w_bce is the weight of the BCE loss and w_ssim is the weight of the SSIM loss;
l_bce = -Σ_{(x,y)}[g(x,y)·log(p(x,y)) + (1-g(x,y))·log(1-p(x,y))];
l_ssim = 1 - ((2μ_xμ_y + C_1)(2σ_xy + C_2)) / ((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2));
wherein g(x,y) is the pixel value of pixel (x,y) of the ground-truth map, p(x,y) is the pixel value of pixel (x,y) of the predicted image, x' = {x_j : j = 1, ..., N²} and y' = {y_j : j = 1, ..., N²} are the pixel values of corresponding N×N regions of the predicted image and the ground-truth map, μ_x and μ_y are the means of x' and y', σ_x and σ_y are their standard deviations, σ_xy is their covariance, and C_1 and C_2 are bias parameters.
Compared with the prior art, the invention has the following beneficial effects:
1) The invention takes ResNet-34 as the backbone and keeps only its feature-extraction (encoding) part. During encoding, a dual-stream model extracts features from the RGB image and from the grayscale image at the same time, exploiting the advantage of the grayscale image that brightness and contour features are easier to extract. Compared with methods that only use an RGB image, this extracts image features more effectively.
2) To address the insufficient image feature extraction of most networks during encoding, the invention provides an encoding fusion module: each encoding layer combines the information of the previous encoding layer when encoding the current layer, which makes the whole encoding process smoother, retains more effective features, and lets the encoding result of each layer fuse better with the corresponding decoding layer to guide the decoding process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a diagram of a dual stream fusion module of the present invention.
FIG. 3 is a diagram of a feature fusion module of the present invention.
FIG. 4 is a decoding fusion module diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a dual-stream encoding fusion saliency detection method based on RGB and grayscale images, which comprises the following steps:
S1, obtaining an RGB image and its corresponding ground-truth map from the DUTS-TR dataset, and processing the RGB image to generate the corresponding grayscale image;
The grayscale image corresponding to the RGB image is generated as follows:
Gray = R×0.299 + G×0.587 + B×0.114;
wherein Gray is the grayscale image, and R, G and B are the red, green and blue channel pixel values of the RGB image, respectively.
S2, copying and combining the grayscale image from step S1 to obtain a three-channel grayscale map; the grayscale image is a single-channel image, which is duplicated and stacked into a three-channel grayscale map, and the RGB image and the three-channel grayscale map are both scaled to a size of 224×224.
S3, inputting the three-channel grayscale map and the RGB image separately into the encoder network to obtain multi-scale feature maps; the three-channel grayscale map and the RGB image are fed into two parallel streams of the encoder network (as shown in Fig. 1). The encoder network comprises an encode1-I module, an encode1-II module, an encode1-III module, an encode1-IV module, an encode1-V module, an encode2-I module, an encode2-II module, an encode2-III module, an encode2-IV module, an encode2-V module, dual-stream fusion modules TSF_1, TSF_2, TSF_3, TSF_4 and TSF_5, and a bridge module. The input end of the encode1-I module is connected with a first input layer whose input is the three-channel grayscale map; the output end of the encode1-I module is connected with the input end of the encode1-II module, the output end of the encode1-II module is connected with the input end of the encode1-III module, the output end of the encode1-III module is connected with the input end of the encode1-IV module, and the output end of the encode1-IV module is connected with the input end of the encode1-V module. The input end of the encode2-I module is connected with a second input layer whose input is the RGB image. The second input layer, the output end of the encode1-I module and the output end of the encode2-I module are all connected with the input end of the dual-stream fusion module TSF_1, and the output end of TSF_1 is connected with the input end of the encode2-II module. The output ends of the encode1-II, encode2-I and encode2-II modules are all connected with the input end of TSF_2, and the output end of TSF_2 is connected with the input end of the encode2-III module. The output ends of the encode1-III, encode2-II and encode2-III modules are all connected with the input end of TSF_3, and the output end of TSF_3 is connected with the input end of the encode2-IV module. The output ends of the encode1-IV, encode2-III and encode2-IV modules are all connected with the input end of TSF_4, and the output end of TSF_4 is connected with the input end of the encode2-V module. The output ends of the encode1-V, encode2-IV and encode2-V modules are all connected with the input end of TSF_5, and the output end of TSF_5 is connected with the input end of the bridge module. The second input layer, the output ends of the encode2-I, encode2-II, encode2-III, encode2-IV and encode2-V modules, and the output end of the bridge module are all connected with the decoder network.
The first input layer and the second input layer each have the structure convolution layer I - batch normalization layer I - activation layer I; as shown in Table 1, convolution layer I has a 3×3 kernel, stride 2, padding 1, 3 input channels and 64 output channels.
Table 1. Input layer structure
1. 3×3 convolution, stride 2, padding 1, 3 input channels, 64 output channels
2. Batch normalization
3. ReLU activation function
The encode1-I and encode2-I modules each have the structure convolution layer II - batch normalization layer II - activation layer II - convolution layer III - batch normalization layer III; as shown in Table 2, convolution layers II and III have 3×3 kernels, stride 1, padding 1, 64 input channels and 64 output channels.
Table 2. Encode1 structure
1. 3×3 convolution, stride 1, padding 1, 64 input channels, 64 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 64 input channels, 64 output channels
5. Batch normalization
The encode1-II and encode2-II modules each have the structure convolution layer IV - batch normalization layer IV - activation layer IV - convolution layer V - batch normalization layer V; as shown in Table 3, convolution layer IV has a 3×3 kernel, stride 2, padding 1, 64 input channels and 128 output channels, and convolution layer V has a 3×3 kernel, stride 1, padding 1, 128 input channels and 128 output channels.
Table 3. Encode2 structure
1. 3×3 convolution, stride 2, padding 1, 64 input channels, 128 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 128 input channels, 128 output channels
5. Batch normalization
The encode1-III and encode2-III modules each have the structure convolution layer VI - batch normalization layer VI - activation layer VI - convolution layer VII - batch normalization layer VII; as shown in Table 4, convolution layer VI has a 3×3 kernel, stride 2, padding 1, 128 input channels and 256 output channels, and convolution layer VII has a 3×3 kernel, stride 1, padding 1, 256 input channels and 256 output channels.
Table 4. Encode3 structure
1. 3×3 convolution, stride 2, padding 1, 128 input channels, 256 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 256 input channels, 256 output channels
5. Batch normalization
The encode1-IV and encode2-IV modules each have the structure convolution layer VIII - batch normalization layer VIII - activation layer VIII - convolution layer IX - batch normalization layer IX; as shown in Table 5, convolution layer VIII has a 3×3 kernel, stride 2, padding 1, 256 input channels and 512 output channels, and convolution layer IX has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels.
Table 5. Encode4 structure
1. 3×3 convolution, stride 2, padding 1, 256 input channels, 512 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 512 input channels, 512 output channels
5. Batch normalization
The encode1-V and encode2-V modules each have the structure convolution layer X - batch normalization layer X - activation layer X - convolution layer XI - batch normalization layer XI; as shown in Table 6, convolution layer X has a 3×3 kernel, stride 2, padding 1, 512 input channels and 512 output channels, and convolution layer XI has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels.
Table 6. Encode5 structure
1. 3×3 convolution, stride 2, padding 1, 512 input channels, 512 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 512 input channels, 512 output channels
5. Batch normalization
The bridge module has the structure convolution layer XII - batch normalization layer XII - activation layer XII - convolution layer XIII - batch normalization layer XIII - activation layer XIII; as shown in Table 7, convolution layer XII has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels, and convolution layer XIII has a 3×3 kernel, stride 1, padding 1, 512 input channels and 64 output channels. Activation layers I, II, IV, VI, VIII, X, XII and XIII are all ReLU activation functions.
Table 7. Bridge structure
1. 3×3 convolution, stride 1, padding 1, 512 input channels, 512 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 512 input channels, 64 output channels
5. Batch normalization
6. ReLU activation function
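As a reference for Tables 1-7, the following is a minimal sketch of the encoding building blocks, assuming a PyTorch implementation; the class and variable names are illustrative and are not taken from the patent:

```python
import torch.nn as nn

class EncodeBlock(nn.Module):
    """Conv-Bn-ReLU-Conv-Bn block used for the encode1-x / encode2-x modules (Tables 2-6)."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.body(x)

# Input layer (Table 1): three-channel image -> 64 channels, resolution halved.
input_layer = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True))

encode_1 = EncodeBlock(64, 64, stride=1)    # Table 2
encode_2 = EncodeBlock(64, 128, stride=2)   # Table 3
encode_3 = EncodeBlock(128, 256, stride=2)  # Table 4
encode_4 = EncodeBlock(256, 512, stride=2)  # Table 5
encode_5 = EncodeBlock(512, 512, stride=2)  # Table 6

# Bridge (Table 7): two Conv-Bn-ReLU layers that reduce 512 channels to 64.
bridge = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(512), nn.ReLU(inplace=True),
    nn.Conv2d(512, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True))
```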
In a feature extraction network, convolutional layers at different levels correspond to different degrees of feature extraction. Multi-level integration improves the representation of features at different resolutions, and aggregating shallow features further strengthens detail information and suppresses noise. To make the feature extraction stage smoother, extract multi-level features more fully and enhance the feature extraction capability, a TSF (Two-Stream Fusion) module is designed for the encoding stage. Unlike other networks of the same type, the features aggregated by the TSF module are used not only to guide the corresponding decoding process but also to guide the next encoding step. The TSF module is calculated as follows (see Fig. 2):
TSF_i = Relu(Bn(Conv(Concat(encode^R_{i-1}, encode^R_i ⊕ encode^G_i))));
wherein TSF_i ∈ {TSF_1, TSF_2, TSF_3, TSF_4, TSF_5}; encode^R_{i-1} and encode^R_i are the results produced by the encode2-(i-1) and encode2-i modules in the RGB stream, and when i = 1, encode^R_0 is the result of the second input layer; encode^G_i is the result produced by the encode1-i module in the Gray stream; ⊕ denotes element-by-element addition, Concat(·) denotes the joining operation along the channel dimension, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function.
Starting from encode2-II, the input of each encoding operation in the RGB stream is the aggregation result of the TSF module of the previous layer; this applies only to the RGB stream, while the input of each layer of the Gray stream is the output of the layer above it in the same stream. The grayscale-stream features only assist the RGB-stream features: a grayscale image is helpful for extracting contour information but contains fewer features than an RGB image, so the information of the Gray stream does not take part in the encoding fusion.
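A sketch of one possible TSF implementation follows, assuming PyTorch. The exact order of the element-wise addition and the concatenation, the channel counts and the use of bilinear interpolation to align resolutions are assumptions made for illustration and are not specified by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSF(nn.Module):
    """Dual-stream fusion: add the RGB and Gray features of the same level element-wise,
    concatenate the previous RGB-stream result, then apply Conv-Bn-ReLU."""
    def __init__(self, prev_ch, cur_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(prev_ch + cur_ch, cur_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(cur_ch), nn.ReLU(inplace=True))

    def forward(self, prev_rgb, rgb_feat, gray_feat):
        added = rgb_feat + gray_feat                        # element-by-element addition
        prev = F.interpolate(prev_rgb, size=added.shape[2:],
                             mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([prev, added], dim=1))   # Concat -> Conv -> Bn -> ReLU
```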
At the end of the encoder, a bridge module is added in order to further enlarge the receptive field and reduce the number of channels passed to the decoder, which improves the execution efficiency of the network. The bridge module is calculated as follows:
bridge_out = Relu(Bn(Conv(Relu(Bn(Conv(TSF_5))))));
wherein bridge_out is the output result of the bridge module, TSF_5 is the output result of the dual-stream fusion module TSF_5, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function. The bridge module is used to reduce the number of channels and parameters.
S4, decoding the multi-scale feature maps with the decoder network and outputting a predicted image. Each decoding stage aggregates side-output content from the corresponding encoding stage. To better acquire the context information of the encoding stage, an FF (Feature Fusion) module is designed for the side outputs of the decoder to fuse this content (see Fig. 3). Each layer of the decoder network aggregates the output of the previous layer with the output of the FF module of the corresponding decoding layer, and a DF (Decoding Fusion) module is designed to aggregate these features during decoding, as shown in Fig. 4. The decoder keeps 64 channels in every layer; the final output layer reduces the number of channels to 1 with a 3×3 filter and outputs a 224×224 single-channel predicted image.
The decoder network comprises feature fusion modules FF_1, FF_2, FF_3, FF_4 and FF_5, decoding fusion modules DF_1, DF_2, DF_3, DF_4 and DF_5, and a decode-I module, a decode-II module, a decode-III module, a decode-IV module and a decode-V module. The input end of the feature fusion module FF_1 is connected with the second input layer, the output end of the encode2-I module and the output end of the encode2-II module, and the output end of FF_1 is connected with the input end of the decoding fusion module DF_1; the input end of FF_2 is connected with the output ends of the encode2-I, encode2-II and encode2-III modules, and the output end of FF_2 is connected with the input end of DF_2; the input end of FF_3 is connected with the output ends of the encode2-II, encode2-III and encode2-IV modules, and the output end of FF_3 is connected with the input end of DF_3; the input end of FF_4 is connected with the output ends of the encode2-III, encode2-IV and encode2-V modules, and the output end of FF_4 is connected with the input end of DF_4; the input end of FF_5 is connected with the output ends of the encode2-IV and encode2-V modules and the output end of the bridge module, and the output end of FF_5 and the output end of the bridge module are both connected with the input end of DF_5; the output end of DF_5 is connected with the input end of the decode-V module, the output end of the decode-V module is connected with the input end of DF_4, the output end of DF_4 is connected with the input end of the decode-IV module, the output end of the decode-IV module is connected with the input end of DF_3, the output end of DF_3 is connected with the input end of the decode-III module, the output end of the decode-III module is connected with the input end of DF_2, the output end of DF_2 is connected with the input end of the decode-II module, the output end of the decode-II module is connected with the input end of DF_1, the output end of DF_1 is connected with the input end of the decode-I module, the output end of the decode-I module is connected with the input end of the output layer, and the output layer outputs the predicted image.
The decode-I, decode-II, decode-III, decode-IV and decode-V modules each have the structure first convolution layer - first batch normalization layer - first activation layer - second convolution layer - second batch normalization layer - second activation layer; as shown in Table 8, the first convolution layer has a 3×3 kernel, stride 1, padding 1, 128 input channels and 64 output channels, and the second convolution layer has a 3×3 kernel, stride 1, padding 1, 64 input channels and 64 output channels. The output layer has the structure third convolution layer - third activation layer; as shown in Table 9, the third convolution layer has a 3×3 kernel, stride 1, padding 1, 64 input channels and 1 output channel. The first and second activation layers are ReLU activation functions; the third activation layer is a Sigmoid activation function.
Table 8. Decode5-Decode1 structure
1. 3×3 convolution, stride 1, padding 1, 128 input channels, 64 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 64 input channels, 64 output channels
5. Batch normalization
6. ReLU activation function
7. Bilinear-interpolation upsampling (resolution doubled)
Table 9. Output structure
1. 3×3 convolution, stride 1, padding 1, 64 input channels, 1 output channel
2. Sigmoid activation function
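The decoding blocks of Tables 8 and 9 can be written as the following sketch, again assuming PyTorch with illustrative names:

```python
import torch.nn as nn

# Decode5-Decode1 (Table 8): two Conv-Bn-ReLU stages followed by bilinear upsampling.
decode_block = nn.Sequential(
    nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))

# Output layer (Table 9): reduce 64 channels to a single-channel saliency map in [0, 1].
output_layer = nn.Sequential(
    nn.Conv2d(64, 1, kernel_size=3, stride=1, padding=1),
    nn.Sigmoid())
```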
The feature fusion module is calculated as follows:
FF_i = Concat(Relu(Bn(Conv(encode'_{i-1}))), Relu(Bn(Conv(encode_i))), Relu(Bn(Conv(encode'_{i+1}))));
wherein FF_i ∈ {FF_1, FF_2, FF_3, FF_4, FF_5}; encode'_{i-1} is the size-converted output of the encode2-(i-1) module in the RGB stream, encode'_{i+1} is the size-converted output of the encode2-(i+1) module in the RGB stream, and encode_i is the output of the encode2-i module in the RGB stream; when i = 1, encode'_0 is the size-converted result of the second input layer, and when i = 5, encode'_6 is the size-converted output of the bridge module.
The decoding fusion module is calculated as follows:
DF_i = Concat(Relu(Bn(Conv(FF_i))), Relu(Bn(Conv(decode_{i+1}))));
wherein DF_i ∈ {DF_1, DF_2, DF_3, DF_4, DF_5}; decode_{i+1} is the decoding result of the decode-(i+1) module, and when i = 5, decode_6 is the output result of the bridge module.
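A combined sketch of the FF and DF modules is given below, assuming PyTorch; the size conversion is implemented with bilinear interpolation and the per-branch channel counts are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(in_ch, out_ch=64):
    """Conv-Bn-ReLU branch used inside the FF and DF modules."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class FF(nn.Module):
    """Feature fusion of three neighbouring encoder outputs (Fig. 3)."""
    def __init__(self, prev_ch, cur_ch, next_ch):
        super().__init__()
        self.b_prev, self.b_cur, self.b_next = cbr(prev_ch), cbr(cur_ch), cbr(next_ch)

    def forward(self, f_prev, f_cur, f_next):
        size = f_cur.shape[2:]
        p = F.interpolate(self.b_prev(f_prev), size=size, mode="bilinear", align_corners=False)
        n = F.interpolate(self.b_next(f_next), size=size, mode="bilinear", align_corners=False)
        return torch.cat([p, self.b_cur(f_cur), n], dim=1)

class DF(nn.Module):
    """Decoding fusion of the FF output with the previous decoding result (Fig. 4)."""
    def __init__(self, ff_ch, dec_ch):
        super().__init__()
        self.b_ff, self.b_dec = cbr(ff_ch), cbr(dec_ch)

    def forward(self, ff, dec):
        d = F.interpolate(self.b_dec(dec), size=ff.shape[2:], mode="bilinear", align_corners=False)
        return torch.cat([self.b_ff(ff), d], dim=1)   # 64 + 64 = 128 channels for the decode block
```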
S5, calculating the loss value between the predicted image and the ground-truth map with the loss function and judging whether the loss value reaches the threshold; if it does, the trained encoder-decoder network is obtained and step S6 is executed; otherwise, the weight parameters of all layers of the encoder network and the decoder network are modified automatically according to the loss value and the procedure returns to step S3;
the loss function is defined as the sum of the losses of all output layers:
Figure BDA0003166470510000121
wherein L is a loss value, L (p) The P =5 loss value corresponds to the P-th module, and corresponds to the outputs of the decode-I module, the decode-II module, the decode-III module, the decode-IV module and the decode-V module, respectively.
In most saliency detection tasks the BCE (Binary Cross-Entropy) loss function is widely used, but BCE only considers the loss of each pixel globally and cannot uniformly highlight the salient region and its boundary, so a weighted hybrid loss function is designed:
l^(p) = w_bce·l_bce + w_ssim·l_ssim;
wherein l_bce is the BCE loss, l_ssim is the SSIM loss, w_bce is the weight of the BCE loss and w_ssim is the weight of the SSIM loss.
The BCE loss is the most common loss function for binary classification and image segmentation problems and is defined as:
l_bce = -Σ_{(x,y)}[g(x,y)·log(p(x,y)) + (1-g(x,y))·log(1-p(x,y))];
SSIM was originally proposed for image quality assessment and captures the structural information in an image. It is therefore incorporated into the loss function to highlight the structural information of the salient object, and is defined as follows:
l_ssim = 1 - ((2μ_xμ_y + C_1)(2σ_xy + C_2)) / ((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2));
wherein g(x,y) ∈ [0,1] is the pixel value of pixel (x,y) of the ground-truth map, p(x,y) ∈ [0,1] is the pixel value of pixel (x,y) of the predicted image, x' = {x_j : j = 1, ..., N²} and y' = {y_j : j = 1, ..., N²} are the pixel values of corresponding N×N regions of the predicted image and the ground-truth map, μ_x and μ_y are the means of x' and y', σ_x and σ_y are their standard deviations, σ_xy is their covariance, and C_1 and C_2 are bias parameters, with C_1 = 0.01² and C_2 = 0.03² to avoid division by zero. The embodiment of the present invention uses a local SSIM index instead of a global SSIM index.
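A sketch of the weighted hybrid loss follows, assuming PyTorch; the weights w_bce and w_ssim and the 11×11 uniform window are assumptions, since the patent does not state concrete values:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, w_bce=1.0, w_ssim=1.0, win=11, C1=0.01 ** 2, C2=0.03 ** 2):
    """BCE + local SSIM loss for one side output; pred and target lie in [0, 1]."""
    bce = F.binary_cross_entropy(pred, target)
    # Local statistics over N x N windows (a uniform window is used instead of a Gaussian one).
    mu_p = F.avg_pool2d(pred, win, stride=1, padding=win // 2)
    mu_t = F.avg_pool2d(target, win, stride=1, padding=win // 2)
    var_p = F.avg_pool2d(pred * pred, win, stride=1, padding=win // 2) - mu_p ** 2
    var_t = F.avg_pool2d(target * target, win, stride=1, padding=win // 2) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, win, stride=1, padding=win // 2) - mu_p * mu_t
    ssim = ((2 * mu_p * mu_t + C1) * (2 * cov + C2)) / \
           ((mu_p ** 2 + mu_t ** 2 + C1) * (var_p + var_t + C2))
    return w_bce * bce + w_ssim * (1 - ssim.mean())

def total_loss(side_outputs, target):
    """Sum of the losses of the five decoder side outputs (decode-I ... decode-V)."""
    return sum(hybrid_loss(p, target) for p in side_outputs)
```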
S6, obtaining the image to be detected, generating its three-channel grayscale map, inputting the image to be detected and its three-channel grayscale map separately into the trained encoder-decoder network, and outputting the prediction result for the image to be detected.
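A minimal inference sketch of step S6, assuming a trained PyTorch model `model` that takes the RGB tensor and the three-channel grayscale tensor as its two inputs, and a helper `to_tensor` that turns a 224×224×3 array into a 1×3×224×224 float tensor; both names are illustrative:

```python
import torch

model.eval()
with torch.no_grad():
    # rgb and gray3 are the preprocessed arrays from the sketch after step S2.
    pred = model(to_tensor(rgb), to_tensor(gray3))           # 1 x 1 x 224 x 224 prediction
saliency = (pred.squeeze().cpu().numpy() * 255).astype("uint8")
```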
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A dual-stream encoding fusion saliency detection method based on RGB and grayscale images, characterized by comprising the following steps:
S1, obtaining an RGB image and its corresponding ground-truth map from the DUTS-TR dataset, and processing the RGB image to generate the corresponding grayscale image;
S2, copying and combining the grayscale image from step S1 to obtain a three-channel grayscale map;
S3, inputting the three-channel grayscale map and the RGB image separately into an encoder network to obtain multi-scale feature maps;
the encoder network comprises an encode1-I module, an encode1-II module, an encode1-III module, an encode1-IV module, an encode1-V module, an encode2-I module, an encode2-II module, an encode2-III module, an encode2-IV module, an encode2-V module, dual-stream fusion modules TSF_1, TSF_2, TSF_3, TSF_4 and TSF_5, and a bridge module;
the input end of the encode1-I module is connected with a first input layer whose input is the three-channel grayscale map; the output end of the encode1-I module is connected with the input end of the encode1-II module, the output end of the encode1-II module is connected with the input end of the encode1-III module, the output end of the encode1-III module is connected with the input end of the encode1-IV module, and the output end of the encode1-IV module is connected with the input end of the encode1-V module;
the input end of the encode2-I module is connected with a second input layer whose input is the RGB image; the second input layer, the output end of the encode1-I module and the output end of the encode2-I module are all connected with the input end of the dual-stream fusion module TSF_1, and the output end of TSF_1 is connected with the input end of the encode2-II module; the output ends of the encode1-II, encode2-I and encode2-II modules are all connected with the input end of TSF_2, and the output end of TSF_2 is connected with the input end of the encode2-III module; the output ends of the encode1-III, encode2-II and encode2-III modules are all connected with the input end of TSF_3, and the output end of TSF_3 is connected with the input end of the encode2-IV module; the output ends of the encode1-IV, encode2-III and encode2-IV modules are all connected with the input end of TSF_4, and the output end of TSF_4 is connected with the input end of the encode2-V module; the output ends of the encode1-V, encode2-IV and encode2-V modules are all connected with the input end of TSF_5, and the output end of TSF_5 is connected with the input end of the bridge module;
the second input layer, the output ends of the encode2-I, encode2-II, encode2-III, encode2-IV and encode2-V modules, and the output end of the bridge module are all connected with a decoder network;
the double-flow fusion module TSF 1 -TSF 5 The calculation method comprises the following steps:
TSF_i = Relu(Bn(Conv(Concat(encode^R_{i-1}, encode^R_i ⊕ encode^G_i))));
wherein TSF_i ∈ {TSF_1, TSF_2, TSF_3, TSF_4, TSF_5}; encode^R_{i-1} and encode^R_i are the results produced by the encode2-(i-1) and encode2-i modules in the RGB stream, and when i = 1, encode^R_0 is the result of the second input layer; encode^G_i is the result produced by the encode1-i module in the Gray stream; ⊕ denotes element-by-element addition, Concat(·) denotes the joining operation along the channel dimension, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function;
the method for calculating the bridge module comprises the following steps:
bridge_out = Relu(Bn(Conv(Relu(Bn(Conv(TSF_5))))));
wherein bridge_out is the output result of the bridge module, TSF_5 is the output result of the dual-stream fusion module TSF_5, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function;
s4, decoding the multi-scale feature map by using a decoder network, and outputting a predicted image;
the decoder network comprises a feature fusion module FF 1 Feature fusion module FF 2 And a feature fusion module FF 3 And a feature fusion module FF 4 Feature fusion module FF 5 Decoding fusion module DF 1 Decoding fusion module DF 2 Decoding fusion module DF 3 Decoding fusion module DF 4 Decoding fusion module DF 5 A decode-I module, a decode-II module, a decode-III module, a decode-IV module and a decode-V module;
feature fusion module FF 1 The input end of the first input layer is respectively connected with the output end of the second input layer, the output end of the encode2-I module and the output end of the encode2-II module, and the feature fusion module FF 1 Output end and decoding fusion module DF 1 Are connected with the input end of the power supply; feature fusion module FF 2 The input end of the module is respectively connected with the output end of the encode2-I module, the output end of the encode2-II module and the output end of the encode2-III module, and the characteristic fusion module FF 2 Output end and decoding fusion module DF 2 Are connected with the input end of the power supply; feature fusion module FF 3 The input end of the module is respectively connected with the output end of the encode2-II module, the output end of the encode2-III module and the output end of the encode2-IV module, and the feature fusion module FF 3 Output end and decoding fusion module DF 3 Are connected with the input end of the power supply; feature fusion module FF 4 The input end of the module is respectively connected with the output end of the encode2-III module, the output end of the encode2-IV module and the output end of the encode2-V module, and the characteristic fusion module FF 4 Output end and decoding fusion module DF 4 Are connected with each other; feature(s)Fusion module FF 5 The input end of the character fusion module FF is respectively connected with the output end of the encode2-IV module, the output end of the encode2-V module and the output end of the bridge module 5 The output end of the bridge module and the decoding fusion module DF are respectively arranged at the output end of the bridge module 5 Are connected with the input end of the power supply; decoding fusion module DF 5 The output end of the decoder is connected with the input end of a decoder-V module, and the output end of the decoder-V module is connected with a decoding fusion module DF 4 Is connected to a decoding fusion module DF 4 Is connected with the input end of the decode-IV module, and the output end of the decode-IV module is connected with the decoding fusion module DF 3 Is connected to the input end of the decoding fusion module DF 3 Is connected with the input end of the decode-III module, and the output end of the decode-III module is connected with the decoding fusion module DF 2 Is connected to the input end of the decoding fusion module DF 2 Is connected with the input end of the decode-II module, and the output end of the decode-II module is connected with the decoding fusion module DF 1 Is connected to a decoding fusion module DF 1 The output end of the decoder-I module is connected with the input end of the output layer, and the output end of the output layer outputs a predicted image;
the feature fusion modules FF_1 to FF_5 are calculated as follows:
FF_i = Concat(Relu(Bn(Conv(encode′_{i-1}))), Relu(Bn(Conv(encode_i))), Relu(Bn(Conv(encode′_{i+1}))));
wherein FF_i ∈ {FF_1, FF_2, FF_3, FF_4, FF_5}; encode′_{i-1} is the result of size conversion of the output of the encode2-(i-1) module in the RGB stream, encode′_{i+1} is the result of size conversion of the output of the encode2-(i+1) module in the RGB stream, and encode_i is the output of the encode2-i module in the RGB stream; when i = 1, encode′_0 is the result of size conversion of the output of the second input layer, and when i = 5, encode′_6 is the result of size conversion of the output of the bridge module;
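For illustration only, the feature fusion computation above can be sketched in PyTorch as follows. This is a minimal sketch assuming bilinear interpolation for the size conversion and 64 output channels per branch (neither is fixed by the claim); class and argument names are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBnRelu(nn.Module):
    """One Conv -> Bn -> Relu branch, as used in the FF_i formula."""
    def __init__(self, in_ch, out_ch, k=3, s=1, p=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, stride=s, padding=p)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class FeatureFusion(nn.Module):
    """FF_i = Concat(branch(encode'_{i-1}), branch(encode_i), branch(encode'_{i+1})).

    The per-branch output width and the use of bilinear interpolation for the
    size conversion are assumptions made for this sketch.
    """
    def __init__(self, ch_prev, ch_cur, ch_next, out_ch=64):
        super().__init__()
        self.branch_prev = ConvBnRelu(ch_prev, out_ch)
        self.branch_cur = ConvBnRelu(ch_cur, out_ch)
        self.branch_next = ConvBnRelu(ch_next, out_ch)

    def forward(self, enc_prev, enc_cur, enc_next):
        size = enc_cur.shape[-2:]
        # size conversion of the neighbouring scales to the scale of encode_i
        enc_prev = F.interpolate(enc_prev, size=size, mode="bilinear", align_corners=False)
        enc_next = F.interpolate(enc_next, size=size, mode="bilinear", align_corners=False)
        return torch.cat([self.branch_prev(enc_prev),
                          self.branch_cur(enc_cur),
                          self.branch_next(enc_next)], dim=1)
```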
the decoding fusion modules DF_1 to DF_5 are calculated as follows:
DF_i = Concat(Relu(Bn(Conv(FF_i))), Relu(Bn(Conv(decode_{i+1}))));
wherein DF_i ∈ {DF_1, DF_2, DF_3, DF_4, DF_5} and decode_{i+1} is the decoding result of the decode-(i+1) module; when i = 5, decode_6 is the output of the bridge module;
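Continuing the same illustrative sketch (and reusing the ConvBnRelu helper defined above), the decoding fusion computation and the top-down decoder wiring described in this claim might look roughly as follows. The 64-channel branch width follows the 128 input channels stated for the decode modules; the upsampling of decode_{i+1} is an assumption.

```python
class DecodeFusion(nn.Module):
    """DF_i = Concat(ConvBnRelu(FF_i), ConvBnRelu(decode_{i+1})).

    Branch widths of 64 each (giving a 128-channel result) follow the 128 input
    channels stated for the decode modules; resizing decode_{i+1} to the FF_i
    resolution is an assumption of this sketch.
    """
    def __init__(self, ff_ch, dec_ch, out_ch=64):
        super().__init__()
        self.branch_ff = ConvBnRelu(ff_ch, out_ch)
        self.branch_dec = ConvBnRelu(dec_ch, out_ch)

    def forward(self, ff, dec):
        dec = F.interpolate(dec, size=ff.shape[-2:], mode="bilinear", align_corners=False)
        return torch.cat([self.branch_ff(ff), self.branch_dec(dec)], dim=1)

def decoder_forward(ff_feats, bridge_out, df_modules, decode_modules):
    """Top-down pass over the five scales: decode_6 is the bridge output,
    DF_i fuses FF_i with decode_{i+1}, and decode-i refines DF_i.

    ff_feats, df_modules, decode_modules are dicts keyed 1..5 (illustrative API).
    """
    decode = {6: bridge_out}
    for i in range(5, 0, -1):
        df = df_modules[i](ff_feats[i], decode[i + 1])
        decode[i] = decode_modules[i](df)
    return decode[1]  # fed to the output layer
```

The loop mirrors the chain stated in the claim: DF_5 feeds decode-V, whose result enters DF_4, and so on down to DF_1 and decode-I, whose output goes to the output layer.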
the first input layer and the second input layer each have the structure convolution layer I - batch normalization layer I - activation layer I; the convolution kernel of the convolution layer I is 3×3, the stride is 2, the padding is 1, the number of input channels is 3, and the number of output channels is 64;
the encode1-I module and the encode2-I module each have the structure convolution layer II - batch normalization layer II - activation layer II - convolution layer III - batch normalization layer III; the convolution kernels of the convolution layers II and III are 3×3, the stride is 1, the padding is 1, the number of input channels is 64, and the number of output channels is 64;
the encode1-II module and the encode2-II module each have the structure convolution layer IV - batch normalization layer IV - activation layer IV - convolution layer V - batch normalization layer V; the convolution kernel of the convolution layer IV is 3×3, the stride is 2, the padding is 1, the number of input channels is 64, and the number of output channels is 128; the convolution kernel of the convolution layer V is 3×3, the stride is 1, the padding is 1, the number of input channels is 128, and the number of output channels is 128;
the encode1-III module and the encode2-III module each have the structure convolution layer VI - batch normalization layer VI - activation layer VI - convolution layer VII - batch normalization layer VII; the convolution kernel of the convolution layer VI is 3×3, the stride is 2, the padding is 1, the number of input channels is 128, and the number of output channels is 256; the convolution kernel of the convolution layer VII is 3×3, the stride is 1, the padding is 1, the number of input channels is 256, and the number of output channels is 256;
the encode1-IV module and the encode2-IV module each have the structure convolution layer VIII - batch normalization layer VIII - activation layer VIII - convolution layer IX - batch normalization layer IX; the convolution kernel of the convolution layer VIII is 3×3, the stride is 2, the padding is 1, the number of input channels is 256, and the number of output channels is 512; the convolution kernel of the convolution layer IX is 3×3, the stride is 1, the padding is 1, the number of input channels is 512, and the number of output channels is 512;
the encode1-V module and the encode2-V module each have the structure convolution layer X - batch normalization layer X - activation layer X - convolution layer XI - batch normalization layer XI; the convolution kernel of the convolution layer X is 3×3, the stride is 2, the padding is 1, the number of input channels is 512, and the number of output channels is 512; the convolution kernel of the convolution layer XI is 3×3, the stride is 1, the padding is 1, the number of input channels is 512, and the number of output channels is 512;
the bridge module has the structure convolution layer XII - batch normalization layer XII - activation layer XII - convolution layer XIII - batch normalization layer XIII - activation layer XIII; the convolution kernel of the convolution layer XII is 3×3, the stride is 1, the padding is 1, the number of input channels is 512, and the number of output channels is 512; the convolution kernel of the convolution layer XIII is 3×3, the stride is 1, the padding is 1, the number of input channels is 512, and the number of output channels is 64;
the decode-I module, the decode-II module, the decode-III module, the decode-IV module and the decode-V module each have the structure first convolution layer - first batch normalization layer - first activation layer - second convolution layer - second batch normalization layer - second activation layer; the convolution kernel of the first convolution layer is 3×3, the stride is 1, the padding is 1, the number of input channels is 128, and the number of output channels is 64; the convolution kernel of the second convolution layer is 3×3, the stride is 1, the padding is 1, the number of input channels is 64, and the number of output channels is 64;
the output layer has the structure third convolution layer - third activation layer; the convolution kernel of the third convolution layer is 3×3, the stride is 1, the padding is 1, the number of input channels is 64, and the number of output channels is 1;
the activation layer I, the activation layer II, the activation layer IV, the activation layer VI, the activation layer VIII, the activation layer X, the activation layer XII, the activation layer XIII, the first activation layer and the second activation layer all use the ReLU activation function; the third activation layer uses the Sigmoid activation function;
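As a further non-authoritative illustration, the layer blocks specified above can be assembled with the stated kernel sizes, strides, paddings and channel widths roughly as follows. The 3-channel input width matches the three-channel images fed to the two streams; any residual connection or trailing activation inside the encode blocks is not specified in this part of the claim and is therefore not assumed here.

```python
import torch.nn as nn

def encode_block(in_ch, out_ch, first_stride):
    """Conv-Bn-Relu-Conv-Bn sequence as specified for the encode1/encode2 modules."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=first_stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
    )

input_layer = nn.Sequential(              # one such stem per stream (first and second input layers)
    nn.Conv2d(3, 64, 3, stride=2, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
stages = [encode_block(64, 64, 1),        # encode*-I
          encode_block(64, 128, 2),       # encode*-II
          encode_block(128, 256, 2),      # encode*-III
          encode_block(256, 512, 2),      # encode*-IV
          encode_block(512, 512, 2)]      # encode*-V
bridge = nn.Sequential(                   # two Conv-Bn-Relu stages, 512 -> 512 -> 64 channels
    nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
    nn.Conv2d(512, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
)
decode_block = nn.Sequential(             # decode-I .. decode-V: 128 -> 64 -> 64 channels
    nn.Conv2d(128, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
)
output_layer = nn.Sequential(nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid())
```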
S5, calculating the loss value between the predicted image and the ground-truth image by using a loss function, and judging whether the loss value reaches the threshold; if so, obtaining the trained encoder-decoder network and executing step S6; otherwise, automatically adjusting the weight parameters of each layer of the encoder network and the decoder network according to the loss value, and returning to step S3;
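A minimal sketch of the training loop of steps S3 to S5, continuing the PyTorch conventions of the sketches above; the two-stream model signature, its (prediction, side outputs) return value, the exact stopping rule and the use of a gradient-based optimizer are assumptions for illustration only.

```python
def train(model, loader, loss_fn, threshold, optimizer):
    """Forward both streams, compute the loss against the ground-truth map,
    stop once the loss reaches the threshold, otherwise update the layer
    weights according to the loss value and continue (steps S3-S5)."""
    while True:
        for rgb, gray3, gt in loader:
            _, side_outputs = model(rgb, gray3)   # S3/S4: encode and decode
            loss = loss_fn(side_outputs, gt)      # S5: loss over the P = 5 decode outputs
            if loss.item() < threshold:
                return model                      # trained encoder-decoder network
            optimizer.zero_grad()
            loss.backward()                       # adjust weights from the loss value
            optimizer.step()
```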
S6, obtaining the image to be detected, generating the three-channel gray-level image of the image to be detected, inputting the image to be detected and its three-channel gray-level image into the trained encoder-decoder network respectively, and outputting the prediction result of the image to be detected.
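Step S6 could then be sketched as follows, again under the assumed model interface; the tensor layout and the [0, 1] value range of the input image are illustrative choices, not requirements taken from the patent.

```python
import numpy as np
import torch

def predict(model, rgb):
    """Saliency prediction for one RGB image given as an (H, W, 3) float array
    in [0, 1]; the (prediction, side outputs) return convention is an assumption
    carried over from the sketches above."""
    gray = rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114
    gray3 = np.repeat(gray[..., None], 3, axis=-1)   # three-channel gray-level image
    to_tensor = lambda a: torch.from_numpy(a.astype(np.float32)).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        pred, _ = model(to_tensor(rgb), to_tensor(gray3))
    return pred.squeeze().cpu().numpy()              # predicted saliency map
```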
2. The double-stream coding fusion significance detection method based on RGB and gray-level images as claimed in claim 1, wherein the gray-level image corresponding to the RGB image is generated as:
Gray=R×0.299+G×0.587+B×0.114;
wherein Gray is the gray-level image, R is the red-channel pixel value of the RGB image, G is the green-channel pixel value of the RGB image, and B is the blue-channel pixel value of the RGB image.
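As an illustrative calculation (not taken from the patent), a pixel with R = 200, G = 100 and B = 50 gives Gray = 200×0.299 + 100×0.587 + 50×0.114 = 59.8 + 58.7 + 5.7 = 124.2; replicating this single-channel value across three channels, as assumed in the sketches above, yields the three-channel gray-level image referred to in step S6.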
3. The double-stream coding fusion significance detection method based on RGB and gray-level images as claimed in claim 1, wherein the loss function is:
L = ∑_{p=1}^{P} L^{(p)};
wherein L is the total loss value, L^{(p)} is the loss value corresponding to the pth output, P = 5, and the outputs of the decode-I, decode-II, decode-III, decode-IV and decode-V modules respectively correspond to the loss values L^{(1)} to L^{(5)};
L^{(p)} = w_bce · l_bce + w_ssim · l_ssim;
wherein l_bce is the BCE loss, l_ssim is the SSIM loss, w_bce is the weight of the BCE loss, and w_ssim is the weight of the SSIM loss;
l_bce = -∑_{(x,y)} [g(x,y)·log(p(x,y)) + (1 - g(x,y))·log(1 - p(x,y))];
l_ssim = 1 - ((2·μ_x·μ_y + C_1)(2·σ_xy + C_2)) / ((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2));
wherein g(x,y) is the pixel value of the pixel (x,y) in the ground-truth image, p(x,y) is the pixel value of the pixel (x,y) in the predicted image, μ_x is the mean of the pixel values x′, μ_y is the mean of the pixel values y′, σ_x is the standard deviation of the pixel values x′, σ_y is the standard deviation of the pixel values y′, σ_xy is the covariance of x′ and y′, C_1 and C_2 are bias parameters, x′ = {x_j : j = 1, ..., N²} are the pixel values of the predicted image, y′ = {y_j : j = 1, ..., N²} are the pixel values of the ground-truth image, and N×N is the patch size over which the predicted image and the ground-truth image are compared.
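To make the composite loss concrete, a minimal PyTorch sketch following the BCE and SSIM definitions above is given below. The 11×11 window, the C1/C2 constants, the equal weights w_bce = w_ssim = 1 and the mean reduction of the SSIM term are common defaults rather than values from the patent, and each decode output is assumed to be compared against a ground-truth map already resized to its resolution.

```python
import torch
import torch.nn.functional as F

def bce_loss(pred, gt, eps=1e-7):
    """l_bce = -sum over pixels of [ g*log(p) + (1 - g)*log(1 - p) ]."""
    pred = pred.clamp(eps, 1 - eps)
    return -(gt * pred.log() + (1 - gt) * (1 - pred).log()).sum()

def ssim_loss(pred, gt, n=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """l_ssim = 1 - SSIM, with means, variances and covariance taken over local
    N x N patches; the window size and C1/C2 are assumed defaults."""
    mu_x = F.avg_pool2d(pred, n, 1, n // 2)
    mu_y = F.avg_pool2d(gt, n, 1, n // 2)
    var_x = F.avg_pool2d(pred * pred, n, 1, n // 2) - mu_x ** 2
    var_y = F.avg_pool2d(gt * gt, n, 1, n // 2) - mu_y ** 2
    cov_xy = F.avg_pool2d(pred * gt, n, 1, n // 2) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return (1 - ssim).mean()

def total_loss(side_outputs, gt, w_bce=1.0, w_ssim=1.0):
    """L = sum over the P = 5 decode outputs of (w_bce*l_bce + w_ssim*l_ssim);
    gt is assumed to be resized to each output's resolution beforehand."""
    return sum(w_bce * bce_loss(p, gt) + w_ssim * ssim_loss(p, gt) for p in side_outputs)
```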