CN113538379B - Dual-stream encoding fusion saliency detection method based on RGB and grayscale images

Publication number: CN113538379B
Authority: CN (China)
Prior art keywords: module, layer, output end, encode2, image
Legal status: Active
Application number: CN202110805754.4A
Other languages: Chinese (zh)
Other versions: CN113538379A
Inventors: 徐涛, 赵未硕, 史增勇, 周纪勇, 蔡磊, 马玉琨, 柴豪杰
Assignee (current and original): Henan Institute of Science and Technology
Application filed by: Henan Institute of Science and Technology
Priority to: CN202110805754.4A
Publication of: CN113538379A (application), CN113538379B (grant)

Classifications

    • G06T 7/0002 - Image analysis; inspection of images, e.g. flaw detection
    • G06N 3/045 - Neural networks; combinations of networks
    • G06N 3/048 - Neural networks; activation functions
    • G06N 3/08 - Neural networks; learning methods
    • G06T 9/002 - Image coding using neural networks
    • G06T 2207/10024 - Image acquisition modality; color image
    • G06T 2207/20221 - Image combination; image fusion; image merging

Abstract

The invention provides a dual-stream encoding fusion saliency detection method based on RGB and grayscale images, which comprises the following steps: first, an RGB image and its corresponding ground-truth map are obtained, the grayscale image corresponding to the RGB image is generated, and the grayscale image is copied and combined to obtain a three-channel grayscale map; second, the three-channel grayscale map and the RGB image are input separately into an encoder network to obtain multi-scale feature maps; then the multi-scale feature maps are decoded with a decoder network and a predicted image is output; the loss value between the predicted image and the ground-truth map is calculated with a loss function, and whether training of the encoder-decoder network is finished is judged from the loss value; finally, an image to be detected is obtained, the image to be detected and its corresponding three-channel grayscale map are input separately into the encoder-decoder network, and the prediction result for the image to be detected is output. With the designed dual-stream encoder and multi-scale decoder, the invention optimizes the edge part of the saliency map and highlights the salient objects.

Description

Dual-stream encoding fusion saliency detection method based on RGB and grayscale images
Technical Field
The invention relates to the technical field of image processing, and in particular to a dual-stream encoding fusion saliency detection method based on RGB and grayscale images.
Background
Salient Object Detection (SOD) aims to highlight the objects or regions in a scene that are most interesting to human vision. It has wide application in computer vision, including image segmentation, image retrieval, object detection, visual tracking, image compression and scene classification. Traditional methods rely mainly on hand-crafted low-level features such as color, shape and texture, and on heuristic priors such as the center prior and the background prior, but the lack of high-level semantic information leads to unsatisfactory detection results. Recently, thanks to the unprecedented success of Convolutional Neural Networks (CNNs), particularly Fully Convolutional Networks (FCNs), FCN-based approaches have greatly improved SOD performance. Most of them use RGB images for saliency prediction. In recent years some methods have used an RGB image together with a depth image for saliency prediction (RGB-D), and RGB-D saliency detection effectively improves the quality of the predicted map.
However, RGB-D saliency detection requires an RGB image and its depth map to be input together for prediction. Although it produces high-quality predictions, the cost of acquiring depth maps is high and most devices that carry depth cameras are too expensive, so the application scenarios of this approach are currently limited. Methods that perform saliency prediction with RGB images alone have achieved good results, but the existing methods still have many problems. First, in the encoding of existing network models, using only the RGB image makes some feature information difficult to distinguish, so the edges of the salient object in the predicted image are not clear enough, the interior of the salient object is displayed unevenly, and the contour of the salient object is predicted inaccurately. In addition, feature extraction in these networks is insufficient: most of them focus only on feature fusion in the decoding stage and neglect feature extraction in the encoding stage.
Disclosure of Invention
Aiming at the technical problems that, because only RGB images are used in the encoding of existing network models, some feature information is difficult to distinguish, so that the edges of the salient object in the predicted image are not clear enough, the interior of the salient object is displayed unevenly and the contour of the salient object is predicted inaccurately, the invention provides a dual-stream encoding fusion saliency detection method based on RGB and grayscale images. The method comprises a dual-stream encoder and a multi-scale decoder, and designs an encoding fusion module and a feature fusion module that combine the respective advantages of RGB features and grayscale-image features; at the same time, considering that the salient objects in saliency maps differ in size, multi-scale side-output fusion is adopted during decoding. The method can therefore better optimize the edge part of the saliency map, highlight the salient object more uniformly, and extract more salient features when the background or the salient object is complex.
The technical scheme of the invention is realized as follows:
A dual-stream encoding fusion saliency detection method based on RGB and grayscale images comprises the following steps:
S1, obtaining an RGB image and its corresponding ground-truth map from the DUTS-TR dataset, and processing the RGB image to generate the corresponding grayscale image;
S2, copying and combining the grayscale image from step S1 to obtain a three-channel grayscale map;
S3, inputting the three-channel grayscale map and the RGB image separately into an encoder network to obtain multi-scale feature maps;
S4, decoding the multi-scale feature maps with a decoder network and outputting a predicted image;
S5, calculating the loss value between the predicted image and the ground-truth map with a loss function and judging whether the loss value reaches the threshold; if it does, the trained encoder-decoder network is obtained and step S6 is executed; otherwise, the weight parameters of all layers of the encoder network and the decoder network are modified automatically according to the loss value and the procedure returns to step S3;
S6, obtaining the image to be detected, generating its three-channel grayscale map, inputting the image to be detected and its three-channel grayscale map separately into the encoder-decoder network, and outputting the prediction result for the image to be detected.
The grayscale image corresponding to the RGB image is generated as follows:
Gray = R×0.299 + G×0.587 + B×0.114;
wherein Gray is the grayscale image, and R, G and B are the red, green and blue channel pixel values of the RGB image, respectively.
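For example, a pixel with R = 200, G = 100 and B = 50 gives Gray = 200×0.299 + 100×0.587 + 50×0.114 = 59.8 + 58.7 + 5.7 = 124.2. A minimal preprocessing sketch of this conversion and of the three-channel duplication of step S2 is given below, assuming Python with OpenCV and NumPy; the file name, the variable names and the 224×224 size used in the embodiment described later are illustrative:

```python
import cv2
import numpy as np

# Read an RGB image (OpenCV loads BGR, so convert) and scale it to 224 x 224.
rgb = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
rgb = cv2.resize(rgb, (224, 224)).astype(np.float32)

# S1: weighted conversion to a single-channel grayscale image.
r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
gray = 0.299 * r + 0.587 * g + 0.114 * b

# S2: copy the grayscale image into three identical channels.
gray3 = np.stack([gray, gray, gray], axis=-1)   # shape (224, 224, 3)
```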
The encoder network comprises an encode1-I module, an encode1-II module, an encode1-III module, an encode1-IV module, an encode1-V module, an encode2-I module, an encode2-II module, an encode2-III module, an encode2-IV module, an encode2-V module, dual-stream fusion modules TSF_1, TSF_2, TSF_3, TSF_4 and TSF_5, and a bridge module;
The input end of the encode1-I module is connected with a first input layer whose input is the three-channel grayscale map; the output end of the encode1-I module is connected with the input end of the encode1-II module, the output end of the encode1-II module is connected with the input end of the encode1-III module, the output end of the encode1-III module is connected with the input end of the encode1-IV module, and the output end of the encode1-IV module is connected with the input end of the encode1-V module;
The input end of the encode2-I module is connected with a second input layer whose input is the RGB image; the second input layer, the output end of the encode1-I module and the output end of the encode2-I module are all connected with the input end of the dual-stream fusion module TSF_1, and the output end of TSF_1 is connected with the input end of the encode2-II module; the output ends of the encode1-II, encode2-I and encode2-II modules are all connected with the input end of TSF_2, and the output end of TSF_2 is connected with the input end of the encode2-III module; the output ends of the encode1-III, encode2-II and encode2-III modules are all connected with the input end of TSF_3, and the output end of TSF_3 is connected with the input end of the encode2-IV module; the output ends of the encode1-IV, encode2-III and encode2-IV modules are all connected with the input end of TSF_4, and the output end of TSF_4 is connected with the input end of the encode2-V module; the output ends of the encode1-V, encode2-IV and encode2-V modules are all connected with the input end of TSF_5, and the output end of TSF_5 is connected with the input end of the bridge module;
The second input layer, the output ends of the encode2-I, encode2-II, encode2-III, encode2-IV and encode2-V modules, and the output end of the bridge module are all connected with the decoder network.
The decoder network comprises feature fusion modules FF_1, FF_2, FF_3, FF_4 and FF_5, decoding fusion modules DF_1, DF_2, DF_3, DF_4 and DF_5, and a decode-I module, a decode-II module, a decode-III module, a decode-IV module and a decode-V module;
The input end of the feature fusion module FF_1 is connected with the second input layer, the output end of the encode2-I module and the output end of the encode2-II module, and the output end of FF_1 is connected with the input end of the decoding fusion module DF_1; the input end of FF_2 is connected with the output ends of the encode2-I, encode2-II and encode2-III modules, and the output end of FF_2 is connected with the input end of DF_2; the input end of FF_3 is connected with the output ends of the encode2-II, encode2-III and encode2-IV modules, and the output end of FF_3 is connected with the input end of DF_3; the input end of FF_4 is connected with the output ends of the encode2-III, encode2-IV and encode2-V modules, and the output end of FF_4 is connected with the input end of DF_4; the input end of FF_5 is connected with the output ends of the encode2-IV and encode2-V modules and the output end of the bridge module, and the output end of FF_5 and the output end of the bridge module are both connected with the input end of DF_5; the output end of DF_5 is connected with the input end of the decode-V module, the output end of the decode-V module is connected with the input end of DF_4, the output end of DF_4 is connected with the input end of the decode-IV module, the output end of the decode-IV module is connected with the input end of DF_3, the output end of DF_3 is connected with the input end of the decode-III module, the output end of the decode-III module is connected with the input end of DF_2, the output end of DF_2 is connected with the input end of the decode-II module, the output end of the decode-II module is connected with the input end of DF_1, the output end of DF_1 is connected with the input end of the decode-I module, the output end of the decode-I module is connected with the input end of the output layer, and the output layer outputs the predicted image.
The first input layer and the second input layer each have the structure convolution layer I - batch normalization layer I - activation layer I; convolution layer I has a 3×3 kernel, stride 2, padding 1, 3 input channels and 64 output channels;
The encode1-I and encode2-I modules each have the structure convolution layer II - batch normalization layer II - activation layer II - convolution layer III - batch normalization layer III; convolution layers II and III have 3×3 kernels, stride 1, padding 1, 64 input channels and 64 output channels;
The encode1-II and encode2-II modules each have the structure convolution layer IV - batch normalization layer IV - activation layer IV - convolution layer V - batch normalization layer V; convolution layer IV has a 3×3 kernel, stride 2, padding 1, 64 input channels and 128 output channels; convolution layer V has a 3×3 kernel, stride 1, padding 1, 128 input channels and 128 output channels;
The encode1-III and encode2-III modules each have the structure convolution layer VI - batch normalization layer VI - activation layer VI - convolution layer VII - batch normalization layer VII; convolution layer VI has a 3×3 kernel, stride 2, padding 1, 128 input channels and 256 output channels; convolution layer VII has a 3×3 kernel, stride 1, padding 1, 256 input channels and 256 output channels;
The encode1-IV and encode2-IV modules each have the structure convolution layer VIII - batch normalization layer VIII - activation layer VIII - convolution layer IX - batch normalization layer IX; convolution layer VIII has a 3×3 kernel, stride 2, padding 1, 256 input channels and 512 output channels; convolution layer IX has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels;
The encode1-V and encode2-V modules each have the structure convolution layer X - batch normalization layer X - activation layer X - convolution layer XI - batch normalization layer XI; convolution layer X has a 3×3 kernel, stride 2, padding 1, 512 input channels and 512 output channels; convolution layer XI has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels;
The bridge module has the structure convolution layer XII - batch normalization layer XII - activation layer XII - convolution layer XIII - batch normalization layer XIII - activation layer XIII; convolution layer XII has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels; convolution layer XIII has a 3×3 kernel, stride 1, padding 1, 512 input channels and 64 output channels;
The decode-I, decode-II, decode-III, decode-IV and decode-V modules each have the structure first convolution layer - first batch normalization layer - first activation layer - second convolution layer - second batch normalization layer - second activation layer; the first convolution layer has a 3×3 kernel, stride 1, padding 1, 128 input channels and 64 output channels; the second convolution layer has a 3×3 kernel, stride 1, padding 1, 64 input channels and 64 output channels;
The output layer has the structure third convolution layer - third activation layer; the third convolution layer has a 3×3 kernel, stride 1, padding 1, 64 input channels and 1 output channel;
Activation layers I, II, IV, VI, VIII, X, XII and XIII and the first and second activation layers are all ReLU activation functions; the third activation layer is a Sigmoid activation function.
The dual-stream fusion modules TSF_1-TSF_5 are calculated as follows:
TSF_i = Relu(Bn(Conv(Concat(encode^R_{i-1}, encode^R_i ⊕ encode^G_i))));
wherein TSF_i ∈ {TSF_1, TSF_2, TSF_3, TSF_4, TSF_5}; encode^R_{i-1} and encode^R_i are the results produced by the encode2-(i-1) and encode2-i modules in the RGB stream, and when i = 1, encode^R_0 is the result of the second input layer; encode^G_i is the result produced by the encode1-i module in the Gray stream; ⊕ denotes element-by-element addition, Concat(·) denotes the joining operation along the channel dimension, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function.
The bridge module is calculated as follows:
bridge_out = Relu(Bn(Conv(Relu(Bn(Conv(TSF_5))))));
wherein bridge_out is the output result of the bridge module, TSF_5 is the output result of the dual-stream fusion module TSF_5, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function.
The feature fusion modules FF_1-FF_5 are calculated as follows:
FF_i = Concat(Relu(Bn(Conv(encode'_{i-1}))), Relu(Bn(Conv(encode_i))), Relu(Bn(Conv(encode'_{i+1}))));
wherein FF_i ∈ {FF_1, FF_2, FF_3, FF_4, FF_5}; encode'_{i-1} is the size-converted output of the encode2-(i-1) module in the RGB stream, encode'_{i+1} is the size-converted output of the encode2-(i+1) module in the RGB stream, and encode_i is the output of the encode2-i module in the RGB stream; when i = 1, encode'_0 is the size-converted result of the second input layer, and when i = 5, encode'_6 is the size-converted output of the bridge module.
The decoding fusion modules DF_1-DF_5 are calculated as follows:
DF_i = Concat(Relu(Bn(Conv(FF_i))), Relu(Bn(Conv(decode_{i+1}))));
wherein DF_i ∈ {DF_1, DF_2, DF_3, DF_4, DF_5}; decode_{i+1} is the decoding result of the decode-(i+1) module, and when i = 5, decode_6 is the output result of the bridge module.
The loss function is:
L = Σ_{p=1}^{P} l^(p);
wherein L is the total loss value and l^(p) is the loss value corresponding to the p-th output, with P = 5, corresponding respectively to the outputs of the decode-I, decode-II, decode-III, decode-IV and decode-V modules;
l^(p) = w_bce·l_bce + w_ssim·l_ssim;
wherein l_bce is the BCE loss, l_ssim is the SSIM loss, w_bce is the weight of the BCE loss and w_ssim is the weight of the SSIM loss;
l_bce = -Σ_{(x,y)}[g(x,y)·log(p(x,y)) + (1-g(x,y))·log(1-p(x,y))];
l_ssim = 1 - ((2μ_xμ_y + C_1)(2σ_xy + C_2)) / ((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2));
wherein g(x,y) is the pixel value of pixel (x,y) of the ground-truth map, p(x,y) is the pixel value of pixel (x,y) of the predicted image, x' = {x_j : j = 1, ..., N²} and y' = {y_j : j = 1, ..., N²} are the pixel values of corresponding N×N regions of the predicted image and the ground-truth map, μ_x and μ_y are the means of x' and y', σ_x and σ_y are their standard deviations, σ_xy is their covariance, and C_1 and C_2 are bias parameters.
Compared with the prior art, the invention has the following beneficial effects:
1) The invention takes ResNet-34 as the backbone and keeps only its feature-extraction (encoding) part. During encoding, a dual-stream model extracts features from the RGB image and from the grayscale image at the same time, exploiting the advantage of the grayscale image that brightness and contour features are easier to extract. Compared with methods that only use an RGB image, this extracts image features more effectively.
2) To address the insufficient image feature extraction of most networks during encoding, the invention provides an encoding fusion module: each encoding layer combines the information of the previous encoding layer when encoding the current layer, which makes the whole encoding process smoother, retains more effective features, and lets the encoding result of each layer fuse better with the corresponding decoding layer to guide the decoding process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a diagram of a dual stream fusion module of the present invention.
FIG. 3 is a diagram of a feature fusion module of the present invention.
FIG. 4 is a decoding fusion module diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a dual-stream encoding fusion saliency detection method based on RGB and grayscale images, which comprises the following steps:
S1, obtaining an RGB image and its corresponding ground-truth map from the DUTS-TR dataset, and processing the RGB image to generate the corresponding grayscale image;
The grayscale image corresponding to the RGB image is generated as follows:
Gray = R×0.299 + G×0.587 + B×0.114;
wherein Gray is the grayscale image, and R, G and B are the red, green and blue channel pixel values of the RGB image, respectively.
S2, copying and combining the grayscale image from step S1 to obtain a three-channel grayscale map; the grayscale image is a single-channel image, which is duplicated and stacked into a three-channel grayscale map, and the RGB image and the three-channel grayscale map are both scaled to a size of 224×224.
S3, inputting the three-channel grayscale map and the RGB image separately into the encoder network to obtain multi-scale feature maps; the three-channel grayscale map and the RGB image are fed into two parallel streams of the encoder network (as shown in Fig. 1). The encoder network comprises an encode1-I module, an encode1-II module, an encode1-III module, an encode1-IV module, an encode1-V module, an encode2-I module, an encode2-II module, an encode2-III module, an encode2-IV module, an encode2-V module, dual-stream fusion modules TSF_1, TSF_2, TSF_3, TSF_4 and TSF_5, and a bridge module. The input end of the encode1-I module is connected with a first input layer whose input is the three-channel grayscale map; the output end of the encode1-I module is connected with the input end of the encode1-II module, the output end of the encode1-II module is connected with the input end of the encode1-III module, the output end of the encode1-III module is connected with the input end of the encode1-IV module, and the output end of the encode1-IV module is connected with the input end of the encode1-V module. The input end of the encode2-I module is connected with a second input layer whose input is the RGB image. The second input layer, the output end of the encode1-I module and the output end of the encode2-I module are all connected with the input end of the dual-stream fusion module TSF_1, and the output end of TSF_1 is connected with the input end of the encode2-II module. The output ends of the encode1-II, encode2-I and encode2-II modules are all connected with the input end of TSF_2, and the output end of TSF_2 is connected with the input end of the encode2-III module. The output ends of the encode1-III, encode2-II and encode2-III modules are all connected with the input end of TSF_3, and the output end of TSF_3 is connected with the input end of the encode2-IV module. The output ends of the encode1-IV, encode2-III and encode2-IV modules are all connected with the input end of TSF_4, and the output end of TSF_4 is connected with the input end of the encode2-V module. The output ends of the encode1-V, encode2-IV and encode2-V modules are all connected with the input end of TSF_5, and the output end of TSF_5 is connected with the input end of the bridge module. The second input layer, the output ends of the encode2-I, encode2-II, encode2-III, encode2-IV and encode2-V modules, and the output end of the bridge module are all connected with the decoder network.
The first input layer and the second input layer each have the structure convolution layer I - batch normalization layer I - activation layer I; as shown in Table 1, convolution layer I has a 3×3 kernel, stride 2, padding 1, 3 input channels and 64 output channels.
Table 1. Input layer structure
1. 3×3 convolution, stride 2, padding 1, 3 input channels, 64 output channels
2. Batch normalization
3. ReLU activation function
The encode1-I and encode2-I modules each have the structure convolution layer II - batch normalization layer II - activation layer II - convolution layer III - batch normalization layer III; as shown in Table 2, convolution layers II and III have 3×3 kernels, stride 1, padding 1, 64 input channels and 64 output channels.
Table 2. Encode1 structure
1. 3×3 convolution, stride 1, padding 1, 64 input channels, 64 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 64 input channels, 64 output channels
5. Batch normalization
The encode1-II and encode2-II modules each have the structure convolution layer IV - batch normalization layer IV - activation layer IV - convolution layer V - batch normalization layer V; as shown in Table 3, convolution layer IV has a 3×3 kernel, stride 2, padding 1, 64 input channels and 128 output channels, and convolution layer V has a 3×3 kernel, stride 1, padding 1, 128 input channels and 128 output channels.
Table 3. Encode2 structure
1. 3×3 convolution, stride 2, padding 1, 64 input channels, 128 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 128 input channels, 128 output channels
5. Batch normalization
The encode1-III and encode2-III modules each have the structure convolution layer VI - batch normalization layer VI - activation layer VI - convolution layer VII - batch normalization layer VII; as shown in Table 4, convolution layer VI has a 3×3 kernel, stride 2, padding 1, 128 input channels and 256 output channels, and convolution layer VII has a 3×3 kernel, stride 1, padding 1, 256 input channels and 256 output channels.
Table 4. Encode3 structure
1. 3×3 convolution, stride 2, padding 1, 128 input channels, 256 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 256 input channels, 256 output channels
5. Batch normalization
The encode1-IV and encode2-IV modules each have the structure convolution layer VIII - batch normalization layer VIII - activation layer VIII - convolution layer IX - batch normalization layer IX; as shown in Table 5, convolution layer VIII has a 3×3 kernel, stride 2, padding 1, 256 input channels and 512 output channels, and convolution layer IX has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels.
Table 5. Encode4 structure
1. 3×3 convolution, stride 2, padding 1, 256 input channels, 512 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 512 input channels, 512 output channels
5. Batch normalization
The encode1-V and encode2-V modules each have the structure convolution layer X - batch normalization layer X - activation layer X - convolution layer XI - batch normalization layer XI; as shown in Table 6, convolution layer X has a 3×3 kernel, stride 2, padding 1, 512 input channels and 512 output channels, and convolution layer XI has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels.
Table 6. Encode5 structure
1. 3×3 convolution, stride 2, padding 1, 512 input channels, 512 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 512 input channels, 512 output channels
5. Batch normalization
The bridge module has the structure convolution layer XII - batch normalization layer XII - activation layer XII - convolution layer XIII - batch normalization layer XIII - activation layer XIII; as shown in Table 7, convolution layer XII has a 3×3 kernel, stride 1, padding 1, 512 input channels and 512 output channels, and convolution layer XIII has a 3×3 kernel, stride 1, padding 1, 512 input channels and 64 output channels. Activation layers I, II, IV, VI, VIII, X, XII and XIII are all ReLU activation functions.
Table 7. Bridge structure
1. 3×3 convolution, stride 1, padding 1, 512 input channels, 512 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 512 input channels, 64 output channels
5. Batch normalization
6. ReLU activation function
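As a reference for Tables 1-7, the following is a minimal sketch of the encoding building blocks, assuming a PyTorch implementation; the class and variable names are illustrative and are not taken from the patent:

```python
import torch.nn as nn

class EncodeBlock(nn.Module):
    """Conv-Bn-ReLU-Conv-Bn block used for the encode1-x / encode2-x modules (Tables 2-6)."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.body(x)

# Input layer (Table 1): three-channel image -> 64 channels, resolution halved.
input_layer = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True))

encode_1 = EncodeBlock(64, 64, stride=1)    # Table 2
encode_2 = EncodeBlock(64, 128, stride=2)   # Table 3
encode_3 = EncodeBlock(128, 256, stride=2)  # Table 4
encode_4 = EncodeBlock(256, 512, stride=2)  # Table 5
encode_5 = EncodeBlock(512, 512, stride=2)  # Table 6

# Bridge (Table 7): two Conv-Bn-ReLU layers that reduce 512 channels to 64.
bridge = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(512), nn.ReLU(inplace=True),
    nn.Conv2d(512, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True))
```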
In a feature extraction network, convolutional layers at different levels correspond to different degrees of feature extraction. Multi-level integration improves the representation of features at different resolutions, and aggregating shallow features further strengthens detail information and suppresses noise. To make the feature extraction stage smoother, extract multi-level features more fully and enhance the feature extraction capability, a TSF (Two-Stream Fusion) module is designed for the encoding stage. Unlike other networks of the same type, the features aggregated by the TSF module are used not only to guide the corresponding decoding process but also to guide the next encoding step. The TSF module is calculated as follows (see Fig. 2):
TSF_i = Relu(Bn(Conv(Concat(encode^R_{i-1}, encode^R_i ⊕ encode^G_i))));
wherein TSF_i ∈ {TSF_1, TSF_2, TSF_3, TSF_4, TSF_5}; encode^R_{i-1} and encode^R_i are the results produced by the encode2-(i-1) and encode2-i modules in the RGB stream, and when i = 1, encode^R_0 is the result of the second input layer; encode^G_i is the result produced by the encode1-i module in the Gray stream; ⊕ denotes element-by-element addition, Concat(·) denotes the joining operation along the channel dimension, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function.
Starting from encode2-II, the input of each encoding operation in the RGB stream is the aggregation result of the TSF module of the previous layer; this applies only to the RGB stream, while the input of each layer of the Gray stream is the output of the layer above it in the same stream. The grayscale-stream features only assist the RGB-stream features: a grayscale image is helpful for extracting contour information but contains fewer features than an RGB image, so the information of the Gray stream does not take part in the encoding fusion.
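A sketch of one possible TSF implementation follows, assuming PyTorch. The exact order of the element-wise addition and the concatenation, the channel counts and the use of bilinear interpolation to align resolutions are assumptions made for illustration and are not specified by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSF(nn.Module):
    """Dual-stream fusion: add the RGB and Gray features of the same level element-wise,
    concatenate the previous RGB-stream result, then apply Conv-Bn-ReLU."""
    def __init__(self, prev_ch, cur_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(prev_ch + cur_ch, cur_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(cur_ch), nn.ReLU(inplace=True))

    def forward(self, prev_rgb, rgb_feat, gray_feat):
        added = rgb_feat + gray_feat                        # element-by-element addition
        prev = F.interpolate(prev_rgb, size=added.shape[2:],
                             mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([prev, added], dim=1))   # Concat -> Conv -> Bn -> ReLU
```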
At the end of the encoder, a bridge module is added in order to further enlarge the receptive field and reduce the number of channels passed to the decoder, which improves the execution efficiency of the network. The bridge module is calculated as follows:
bridge_out = Relu(Bn(Conv(Relu(Bn(Conv(TSF_5))))));
wherein bridge_out is the output result of the bridge module, TSF_5 is the output result of the dual-stream fusion module TSF_5, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function. The bridge module is used to reduce the number of channels and parameters.
S4, decoding the multi-scale feature maps with the decoder network and outputting a predicted image. Each decoding stage aggregates side-output content from the corresponding encoding stage. To better acquire the context information of the encoding stage, an FF (Feature Fusion) module is designed for the side outputs of the decoder to fuse this content (see Fig. 3). Each layer of the decoder network aggregates the output of the previous layer with the output of the FF module of the corresponding decoding layer, and a DF (Decoding Fusion) module is designed to aggregate these features during decoding, as shown in Fig. 4. The decoder keeps 64 channels in every layer; the final output layer reduces the number of channels to 1 with a 3×3 filter and outputs a 224×224 single-channel predicted image.
The decoder network comprises feature fusion modules FF_1, FF_2, FF_3, FF_4 and FF_5, decoding fusion modules DF_1, DF_2, DF_3, DF_4 and DF_5, and a decode-I module, a decode-II module, a decode-III module, a decode-IV module and a decode-V module. The input end of the feature fusion module FF_1 is connected with the second input layer, the output end of the encode2-I module and the output end of the encode2-II module, and the output end of FF_1 is connected with the input end of the decoding fusion module DF_1; the input end of FF_2 is connected with the output ends of the encode2-I, encode2-II and encode2-III modules, and the output end of FF_2 is connected with the input end of DF_2; the input end of FF_3 is connected with the output ends of the encode2-II, encode2-III and encode2-IV modules, and the output end of FF_3 is connected with the input end of DF_3; the input end of FF_4 is connected with the output ends of the encode2-III, encode2-IV and encode2-V modules, and the output end of FF_4 is connected with the input end of DF_4; the input end of FF_5 is connected with the output ends of the encode2-IV and encode2-V modules and the output end of the bridge module, and the output end of FF_5 and the output end of the bridge module are both connected with the input end of DF_5; the output end of DF_5 is connected with the input end of the decode-V module, the output end of the decode-V module is connected with the input end of DF_4, the output end of DF_4 is connected with the input end of the decode-IV module, the output end of the decode-IV module is connected with the input end of DF_3, the output end of DF_3 is connected with the input end of the decode-III module, the output end of the decode-III module is connected with the input end of DF_2, the output end of DF_2 is connected with the input end of the decode-II module, the output end of the decode-II module is connected with the input end of DF_1, the output end of DF_1 is connected with the input end of the decode-I module, the output end of the decode-I module is connected with the input end of the output layer, and the output layer outputs the predicted image.
The decode-I, decode-II, decode-III, decode-IV and decode-V modules each have the structure first convolution layer - first batch normalization layer - first activation layer - second convolution layer - second batch normalization layer - second activation layer; as shown in Table 8, the first convolution layer has a 3×3 kernel, stride 1, padding 1, 128 input channels and 64 output channels, and the second convolution layer has a 3×3 kernel, stride 1, padding 1, 64 input channels and 64 output channels. The output layer has the structure third convolution layer - third activation layer; as shown in Table 9, the third convolution layer has a 3×3 kernel, stride 1, padding 1, 64 input channels and 1 output channel. The first and second activation layers are ReLU activation functions; the third activation layer is a Sigmoid activation function.
Table 8. Decode5-Decode1 structure
1. 3×3 convolution, stride 1, padding 1, 128 input channels, 64 output channels
2. Batch normalization
3. ReLU activation function
4. 3×3 convolution, stride 1, padding 1, 64 input channels, 64 output channels
5. Batch normalization
6. ReLU activation function
7. Bilinear-interpolation upsampling (resolution doubled)
Table 9. Output structure
1. 3×3 convolution, stride 1, padding 1, 64 input channels, 1 output channel
2. Sigmoid activation function
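The decoding blocks of Tables 8 and 9 can be written as the following sketch, again assuming PyTorch with illustrative names:

```python
import torch.nn as nn

# Decode5-Decode1 (Table 8): two Conv-Bn-ReLU stages followed by bilinear upsampling.
decode_block = nn.Sequential(
    nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))

# Output layer (Table 9): reduce 64 channels to a single-channel saliency map in [0, 1].
output_layer = nn.Sequential(
    nn.Conv2d(64, 1, kernel_size=3, stride=1, padding=1),
    nn.Sigmoid())
```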
The feature fusion module is calculated as follows:
FF_i = Concat(Relu(Bn(Conv(encode'_{i-1}))), Relu(Bn(Conv(encode_i))), Relu(Bn(Conv(encode'_{i+1}))));
wherein FF_i ∈ {FF_1, FF_2, FF_3, FF_4, FF_5}; encode'_{i-1} is the size-converted output of the encode2-(i-1) module in the RGB stream, encode'_{i+1} is the size-converted output of the encode2-(i+1) module in the RGB stream, and encode_i is the output of the encode2-i module in the RGB stream; when i = 1, encode'_0 is the size-converted result of the second input layer, and when i = 5, encode'_6 is the size-converted output of the bridge module.
The decoding fusion module is calculated as follows:
DF_i = Concat(Relu(Bn(Conv(FF_i))), Relu(Bn(Conv(decode_{i+1}))));
wherein DF_i ∈ {DF_1, DF_2, DF_3, DF_4, DF_5}; decode_{i+1} is the decoding result of the decode-(i+1) module, and when i = 5, decode_6 is the output result of the bridge module.
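A combined sketch of the FF and DF modules is given below, assuming PyTorch; the size conversion is implemented with bilinear interpolation and the per-branch channel counts are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(in_ch, out_ch=64):
    """Conv-Bn-ReLU branch used inside the FF and DF modules."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class FF(nn.Module):
    """Feature fusion of three neighbouring encoder outputs (Fig. 3)."""
    def __init__(self, prev_ch, cur_ch, next_ch):
        super().__init__()
        self.b_prev, self.b_cur, self.b_next = cbr(prev_ch), cbr(cur_ch), cbr(next_ch)

    def forward(self, f_prev, f_cur, f_next):
        size = f_cur.shape[2:]
        p = F.interpolate(self.b_prev(f_prev), size=size, mode="bilinear", align_corners=False)
        n = F.interpolate(self.b_next(f_next), size=size, mode="bilinear", align_corners=False)
        return torch.cat([p, self.b_cur(f_cur), n], dim=1)

class DF(nn.Module):
    """Decoding fusion of the FF output with the previous decoding result (Fig. 4)."""
    def __init__(self, ff_ch, dec_ch):
        super().__init__()
        self.b_ff, self.b_dec = cbr(ff_ch), cbr(dec_ch)

    def forward(self, ff, dec):
        d = F.interpolate(self.b_dec(dec), size=ff.shape[2:], mode="bilinear", align_corners=False)
        return torch.cat([self.b_ff(ff), d], dim=1)   # 64 + 64 = 128 channels for the decode block
```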
S5, calculating the loss value between the predicted image and the ground-truth map with the loss function and judging whether the loss value reaches the threshold; if it does, the trained encoder-decoder network is obtained and step S6 is executed; otherwise, the weight parameters of all layers of the encoder network and the decoder network are modified automatically according to the loss value and the procedure returns to step S3;
the loss function is defined as the sum of the losses of all output layers:
Figure BDA0003166470510000121
wherein L is a loss value, L (p) The P =5 loss value corresponds to the P-th module, and corresponds to the outputs of the decode-I module, the decode-II module, the decode-III module, the decode-IV module and the decode-V module, respectively.
In most saliency detection tasks the BCE (Binary Cross-Entropy) loss function is widely used, but BCE only considers the loss of each pixel globally and cannot uniformly highlight the salient region and its boundary, so a weighted hybrid loss function is designed:
l^(p) = w_bce·l_bce + w_ssim·l_ssim;
wherein l_bce is the BCE loss, l_ssim is the SSIM loss, w_bce is the weight of the BCE loss and w_ssim is the weight of the SSIM loss.
The BCE loss is the most common loss function for binary classification and image segmentation problems and is defined as:
l_bce = -Σ_{(x,y)}[g(x,y)·log(p(x,y)) + (1-g(x,y))·log(1-p(x,y))];
SSIM was originally proposed for image quality assessment and captures the structural information in an image. It is therefore incorporated into the loss function to highlight the structural information of the salient object, and is defined as follows:
l_ssim = 1 - ((2μ_xμ_y + C_1)(2σ_xy + C_2)) / ((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2));
wherein g(x,y) ∈ [0,1] is the pixel value of pixel (x,y) of the ground-truth map, p(x,y) ∈ [0,1] is the pixel value of pixel (x,y) of the predicted image, x' = {x_j : j = 1, ..., N²} and y' = {y_j : j = 1, ..., N²} are the pixel values of corresponding N×N regions of the predicted image and the ground-truth map, μ_x and μ_y are the means of x' and y', σ_x and σ_y are their standard deviations, σ_xy is their covariance, and C_1 and C_2 are bias parameters, with C_1 = 0.01² and C_2 = 0.03² to avoid division by zero. The embodiment of the present invention uses a local SSIM index instead of a global SSIM index.
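A sketch of the weighted hybrid loss follows, assuming PyTorch; the weights w_bce and w_ssim and the 11×11 uniform window are assumptions, since the patent does not state concrete values:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, w_bce=1.0, w_ssim=1.0, win=11, C1=0.01 ** 2, C2=0.03 ** 2):
    """BCE + local SSIM loss for one side output; pred and target lie in [0, 1]."""
    bce = F.binary_cross_entropy(pred, target)
    # Local statistics over N x N windows (a uniform window is used instead of a Gaussian one).
    mu_p = F.avg_pool2d(pred, win, stride=1, padding=win // 2)
    mu_t = F.avg_pool2d(target, win, stride=1, padding=win // 2)
    var_p = F.avg_pool2d(pred * pred, win, stride=1, padding=win // 2) - mu_p ** 2
    var_t = F.avg_pool2d(target * target, win, stride=1, padding=win // 2) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, win, stride=1, padding=win // 2) - mu_p * mu_t
    ssim = ((2 * mu_p * mu_t + C1) * (2 * cov + C2)) / \
           ((mu_p ** 2 + mu_t ** 2 + C1) * (var_p + var_t + C2))
    return w_bce * bce + w_ssim * (1 - ssim.mean())

def total_loss(side_outputs, target):
    """Sum of the losses of the five decoder side outputs (decode-I ... decode-V)."""
    return sum(hybrid_loss(p, target) for p in side_outputs)
```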
S6, obtaining the image to be detected, generating its three-channel grayscale map, inputting the image to be detected and its three-channel grayscale map separately into the trained encoder-decoder network, and outputting the prediction result for the image to be detected.
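A minimal inference sketch of step S6, assuming a trained PyTorch model `model` that takes the RGB tensor and the three-channel grayscale tensor as its two inputs, and a helper `to_tensor` that turns a 224×224×3 array into a 1×3×224×224 float tensor; both names are illustrative:

```python
import torch

model.eval()
with torch.no_grad():
    # rgb and gray3 are the preprocessed arrays from the sketch after step S2.
    pred = model(to_tensor(rgb), to_tensor(gray3))           # 1 x 1 x 224 x 224 prediction
saliency = (pred.squeeze().cpu().numpy() * 255).astype("uint8")
```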
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A dual-stream encoding fusion saliency detection method based on RGB and grayscale images, characterized by comprising the following steps:
S1, obtaining an RGB image and its corresponding ground-truth map from the DUTS-TR dataset, and processing the RGB image to generate the corresponding grayscale image;
S2, copying and combining the grayscale image from step S1 to obtain a three-channel grayscale map;
S3, inputting the three-channel grayscale map and the RGB image separately into an encoder network to obtain multi-scale feature maps;
the encoder network comprises an encode1-I module, an encode1-II module, an encode1-III module, an encode1-IV module, an encode1-V module, an encode2-I module, an encode2-II module, an encode2-III module, an encode2-IV module, an encode2-V module, dual-stream fusion modules TSF_1, TSF_2, TSF_3, TSF_4 and TSF_5, and a bridge module;
the input end of the encode1-I module is connected with a first input layer whose input is the three-channel grayscale map; the output end of the encode1-I module is connected with the input end of the encode1-II module, the output end of the encode1-II module is connected with the input end of the encode1-III module, the output end of the encode1-III module is connected with the input end of the encode1-IV module, and the output end of the encode1-IV module is connected with the input end of the encode1-V module;
the input end of the encode2-I module is connected with a second input layer whose input is the RGB image; the second input layer, the output end of the encode1-I module and the output end of the encode2-I module are all connected with the input end of the dual-stream fusion module TSF_1, and the output end of TSF_1 is connected with the input end of the encode2-II module; the output ends of the encode1-II, encode2-I and encode2-II modules are all connected with the input end of TSF_2, and the output end of TSF_2 is connected with the input end of the encode2-III module; the output ends of the encode1-III, encode2-II and encode2-III modules are all connected with the input end of TSF_3, and the output end of TSF_3 is connected with the input end of the encode2-IV module; the output ends of the encode1-IV, encode2-III and encode2-IV modules are all connected with the input end of TSF_4, and the output end of TSF_4 is connected with the input end of the encode2-V module; the output ends of the encode1-V, encode2-IV and encode2-V modules are all connected with the input end of TSF_5, and the output end of TSF_5 is connected with the input end of the bridge module;
the second input layer, the output ends of the encode2-I, encode2-II, encode2-III, encode2-IV and encode2-V modules, and the output end of the bridge module are all connected with a decoder network;
the double-flow fusion module TSF 1 -TSF 5 The calculation method comprises the following steps:
TSF_i = Relu(Bn(Conv(Concat(encode^R_{i-1}, encode^R_i ⊕ encode^G_i))));
wherein TSF_i ∈ {TSF_1, TSF_2, TSF_3, TSF_4, TSF_5}; encode^R_{i-1} and encode^R_i are the results produced by the encode2-(i-1) and encode2-i modules in the RGB stream, and when i = 1, encode^R_0 is the result of the second input layer; encode^G_i is the result produced by the encode1-i module in the Gray stream; ⊕ denotes element-by-element addition, Concat(·) denotes the joining operation along the channel dimension, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function;
the method for calculating the bridge module comprises the following steps:
bridge_out = Relu(Bn(Conv(Relu(Bn(Conv(TSF_5))))));
wherein bridge_out is the output result of the bridge module, TSF_5 is the output result of the dual-stream fusion module TSF_5, Conv(·) denotes the convolution operation, Bn(·) denotes the batch normalization operation, and Relu(·) denotes the activation function;
s4, decoding the multi-scale feature map by using a decoder network, and outputting a predicted image;
the decoder network comprises a feature fusion module FF 1 Feature fusion module FF 2 And a feature fusion module FF 3 And a feature fusion module FF 4 Feature fusion module FF 5 Decoding fusion module DF 1 Decoding fusion module DF 2 Decoding fusion module DF 3 Decoding fusion module DF 4 Decoding fusion module DF 5 A decode-I module, a decode-II module, a decode-III module, a decode-IV module and a decode-V module;
feature fusion module FF 1 The input end of the first input layer is respectively connected with the output end of the second input layer, the output end of the encode2-I module and the output end of the encode2-II module, and the feature fusion module FF 1 Output end and decoding fusion module DF 1 Are connected with the input end of the power supply; feature fusion module FF 2 The input end of the module is respectively connected with the output end of the encode2-I module, the output end of the encode2-II module and the output end of the encode2-III module, and the characteristic fusion module FF 2 Output end and decoding fusion module DF 2 Are connected with the input end of the power supply; feature fusion module FF 3 The input end of the module is respectively connected with the output end of the encode2-II module, the output end of the encode2-III module and the output end of the encode2-IV module, and the feature fusion module FF 3 Output end and decoding fusion module DF 3 Are connected with the input end of the power supply; feature fusion module FF 4 The input end of the module is respectively connected with the output end of the encode2-III module, the output end of the encode2-IV module and the output end of the encode2-V module, and the characteristic fusion module FF 4 Output end and decoding fusion module DF 4 Are connected with each other; feature(s)Fusion module FF 5 The input end of the character fusion module FF is respectively connected with the output end of the encode2-IV module, the output end of the encode2-V module and the output end of the bridge module 5 The output end of the bridge module and the decoding fusion module DF are respectively arranged at the output end of the bridge module 5 Are connected with the input end of the power supply; decoding fusion module DF 5 The output end of the decoder is connected with the input end of a decoder-V module, and the output end of the decoder-V module is connected with a decoding fusion module DF 4 Is connected to a decoding fusion module DF 4 Is connected with the input end of the decode-IV module, and the output end of the decode-IV module is connected with the decoding fusion module DF 3 Is connected to the input end of the decoding fusion module DF 3 Is connected with the input end of the decode-III module, and the output end of the decode-III module is connected with the decoding fusion module DF 2 Is connected to the input end of the decoding fusion module DF 2 Is connected with the input end of the decode-II module, and the output end of the decode-II module is connected with the decoding fusion module DF 1 Is connected to a decoding fusion module DF 1 The output end of the decoder-I module is connected with the input end of the output layer, and the output end of the output layer outputs a predicted image;
the feature fusion modules FF_1 to FF_5 are calculated as follows:
FF_i = Concat(Relu(Bn(Conv(encode′_{i-1}))), Relu(Bn(Conv(encode_i))), Relu(Bn(Conv(encode′_{i+1}))));
wherein FF_i ∈ {FF_1, FF_2, FF_3, FF_4, FF_5}; encode′_{i-1} is the result of size conversion of the output of the encode2-(i-1) module in the RGB stream, encode′_{i+1} is the result of size conversion of the output of the encode2-(i+1) module in the RGB stream, and encode_i is the output of the encode2-i module in the RGB stream; when i = 1, encode′_0 is the result of size conversion of the output of the second input layer, and when i = 5, encode′_6 is the result of size conversion of the output of the bridge module;
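For illustration only, the feature fusion computation above can be sketched in PyTorch as follows. This is a minimal sketch assuming bilinear interpolation for the size conversion and 64 output channels per branch (neither is fixed by the claim); class and argument names are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBnRelu(nn.Module):
    """One Conv -> Bn -> Relu branch, as used in the FF_i formula."""
    def __init__(self, in_ch, out_ch, k=3, s=1, p=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, stride=s, padding=p)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class FeatureFusion(nn.Module):
    """FF_i = Concat(branch(encode'_{i-1}), branch(encode_i), branch(encode'_{i+1})).

    The per-branch output width and the use of bilinear interpolation for the
    size conversion are assumptions made for this sketch.
    """
    def __init__(self, ch_prev, ch_cur, ch_next, out_ch=64):
        super().__init__()
        self.branch_prev = ConvBnRelu(ch_prev, out_ch)
        self.branch_cur = ConvBnRelu(ch_cur, out_ch)
        self.branch_next = ConvBnRelu(ch_next, out_ch)

    def forward(self, enc_prev, enc_cur, enc_next):
        size = enc_cur.shape[-2:]
        # size conversion of the neighbouring scales to the scale of encode_i
        enc_prev = F.interpolate(enc_prev, size=size, mode="bilinear", align_corners=False)
        enc_next = F.interpolate(enc_next, size=size, mode="bilinear", align_corners=False)
        return torch.cat([self.branch_prev(enc_prev),
                          self.branch_cur(enc_cur),
                          self.branch_next(enc_next)], dim=1)
```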
the decoding fusion modules DF_1 to DF_5 are calculated as follows:
DF_i = Concat(Relu(Bn(Conv(FF_i))), Relu(Bn(Conv(decode_{i+1}))));
wherein DF_i ∈ {DF_1, DF_2, DF_3, DF_4, DF_5} and decode_{i+1} is the decoding result of the decode-(i+1) module; when i = 5, decode_6 is the output of the bridge module;
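Continuing the same illustrative sketch (and reusing the ConvBnRelu helper defined above), the decoding fusion computation and the top-down decoder wiring described in this claim might look roughly as follows. The 64-channel branch width follows the 128 input channels stated for the decode modules; the upsampling of decode_{i+1} is an assumption.

```python
class DecodeFusion(nn.Module):
    """DF_i = Concat(ConvBnRelu(FF_i), ConvBnRelu(decode_{i+1})).

    Branch widths of 64 each (giving a 128-channel result) follow the 128 input
    channels stated for the decode modules; resizing decode_{i+1} to the FF_i
    resolution is an assumption of this sketch.
    """
    def __init__(self, ff_ch, dec_ch, out_ch=64):
        super().__init__()
        self.branch_ff = ConvBnRelu(ff_ch, out_ch)
        self.branch_dec = ConvBnRelu(dec_ch, out_ch)

    def forward(self, ff, dec):
        dec = F.interpolate(dec, size=ff.shape[-2:], mode="bilinear", align_corners=False)
        return torch.cat([self.branch_ff(ff), self.branch_dec(dec)], dim=1)

def decoder_forward(ff_feats, bridge_out, df_modules, decode_modules):
    """Top-down pass over the five scales: decode_6 is the bridge output,
    DF_i fuses FF_i with decode_{i+1}, and decode-i refines DF_i.

    ff_feats, df_modules, decode_modules are dicts keyed 1..5 (illustrative API).
    """
    decode = {6: bridge_out}
    for i in range(5, 0, -1):
        df = df_modules[i](ff_feats[i], decode[i + 1])
        decode[i] = decode_modules[i](df)
    return decode[1]  # fed to the output layer
```

The loop mirrors the chain stated in the claim: DF_5 feeds decode-V, whose result enters DF_4, and so on down to DF_1 and decode-I, whose output goes to the output layer.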
the first input layer and the second input layer each have the structure convolution layer I - batch normalization layer I - activation layer I; the convolution kernel of the convolution layer I is 3×3, the stride is 2, the padding is 1, the number of input channels is 3, and the number of output channels is 64;
the encode1-I module and the encode2-I module each have the structure convolution layer II - batch normalization layer II - activation layer II - convolution layer III - batch normalization layer III; the convolution kernels of the convolution layers II and III are 3×3, the stride is 1, the padding is 1, the number of input channels is 64, and the number of output channels is 64;
the encode1-II module and the encode2-II module each have the structure convolution layer IV - batch normalization layer IV - activation layer IV - convolution layer V - batch normalization layer V; the convolution kernel of the convolution layer IV is 3×3, the stride is 2, the padding is 1, the number of input channels is 64, and the number of output channels is 128; the convolution kernel of the convolution layer V is 3×3, the stride is 1, the padding is 1, the number of input channels is 128, and the number of output channels is 128;
the encode1-III module and the encode2-III module each have the structure convolution layer VI - batch normalization layer VI - activation layer VI - convolution layer VII - batch normalization layer VII; the convolution kernel of the convolution layer VI is 3×3, the stride is 2, the padding is 1, the number of input channels is 128, and the number of output channels is 256; the convolution kernel of the convolution layer VII is 3×3, the stride is 1, the padding is 1, the number of input channels is 256, and the number of output channels is 256;
the encode1-IV module and the encode2-IV module each have the structure convolution layer VIII - batch normalization layer VIII - activation layer VIII - convolution layer IX - batch normalization layer IX; the convolution kernel of the convolution layer VIII is 3×3, the stride is 2, the padding is 1, the number of input channels is 256, and the number of output channels is 512; the convolution kernel of the convolution layer IX is 3×3, the stride is 1, the padding is 1, the number of input channels is 512, and the number of output channels is 512;
the encode1-V module and the encode2-V module each have the structure convolution layer X - batch normalization layer X - activation layer X - convolution layer XI - batch normalization layer XI; the convolution kernel of the convolution layer X is 3×3, the stride is 2, the padding is 1, the number of input channels is 512, and the number of output channels is 512; the convolution kernel of the convolution layer XI is 3×3, the stride is 1, the padding is 1, the number of input channels is 512, and the number of output channels is 512;
the bridge module has the structure convolution layer XII - batch normalization layer XII - activation layer XII - convolution layer XIII - batch normalization layer XIII - activation layer XIII; the convolution kernel of the convolution layer XII is 3×3, the stride is 1, the padding is 1, the number of input channels is 512, and the number of output channels is 512; the convolution kernel of the convolution layer XIII is 3×3, the stride is 1, the padding is 1, the number of input channels is 512, and the number of output channels is 64;
the decode-I module, the decode-II module, the decode-III module, the decode-IV module and the decode-V module each have the structure first convolution layer - first batch normalization layer - first activation layer - second convolution layer - second batch normalization layer - second activation layer; the convolution kernel of the first convolution layer is 3×3, the stride is 1, the padding is 1, the number of input channels is 128, and the number of output channels is 64; the convolution kernel of the second convolution layer is 3×3, the stride is 1, the padding is 1, the number of input channels is 64, and the number of output channels is 64;
the output layer has the structure third convolution layer - third activation layer; the convolution kernel of the third convolution layer is 3×3, the stride is 1, the padding is 1, the number of input channels is 64, and the number of output channels is 1;
the activation layer I, the activation layer II, the activation layer IV, the activation layer VI, the activation layer VIII, the activation layer X, the activation layer XII, the activation layer XIII, the first activation layer and the second activation layer all use the ReLU activation function; the third activation layer uses the Sigmoid activation function;
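As a further non-authoritative illustration, the layer blocks specified above can be assembled with the stated kernel sizes, strides, paddings and channel widths roughly as follows. The 3-channel input width matches the three-channel images fed to the two streams; any residual connection or trailing activation inside the encode blocks is not specified in this part of the claim and is therefore not assumed here.

```python
import torch.nn as nn

def encode_block(in_ch, out_ch, first_stride):
    """Conv-Bn-Relu-Conv-Bn sequence as specified for the encode1/encode2 modules."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=first_stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
    )

input_layer = nn.Sequential(              # one such stem per stream (first and second input layers)
    nn.Conv2d(3, 64, 3, stride=2, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
stages = [encode_block(64, 64, 1),        # encode*-I
          encode_block(64, 128, 2),       # encode*-II
          encode_block(128, 256, 2),      # encode*-III
          encode_block(256, 512, 2),      # encode*-IV
          encode_block(512, 512, 2)]      # encode*-V
bridge = nn.Sequential(                   # two Conv-Bn-Relu stages, 512 -> 512 -> 64 channels
    nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
    nn.Conv2d(512, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
)
decode_block = nn.Sequential(             # decode-I .. decode-V: 128 -> 64 -> 64 channels
    nn.Conv2d(128, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
)
output_layer = nn.Sequential(nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid())
```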
S5, calculating the loss value between the predicted image and the ground-truth image by using a loss function, and judging whether the loss value reaches the threshold; if so, obtaining the trained encoder-decoder network and executing step S6; otherwise, automatically adjusting the weight parameters of each layer of the encoder network and the decoder network according to the loss value, and returning to step S3;
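A minimal sketch of the training loop of steps S3 to S5, continuing the PyTorch conventions of the sketches above; the two-stream model signature, its (prediction, side outputs) return value, the exact stopping rule and the use of a gradient-based optimizer are assumptions for illustration only.

```python
def train(model, loader, loss_fn, threshold, optimizer):
    """Forward both streams, compute the loss against the ground-truth map,
    stop once the loss reaches the threshold, otherwise update the layer
    weights according to the loss value and continue (steps S3-S5)."""
    while True:
        for rgb, gray3, gt in loader:
            _, side_outputs = model(rgb, gray3)   # S3/S4: encode and decode
            loss = loss_fn(side_outputs, gt)      # S5: loss over the P = 5 decode outputs
            if loss.item() < threshold:
                return model                      # trained encoder-decoder network
            optimizer.zero_grad()
            loss.backward()                       # adjust weights from the loss value
            optimizer.step()
```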
S6, obtaining the image to be detected, generating the three-channel gray-level image of the image to be detected, inputting the image to be detected and its three-channel gray-level image into the trained encoder-decoder network respectively, and outputting the prediction result of the image to be detected.
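Step S6 could then be sketched as follows, again under the assumed model interface; the tensor layout and the [0, 1] value range of the input image are illustrative choices, not requirements taken from the patent.

```python
import numpy as np
import torch

def predict(model, rgb):
    """Saliency prediction for one RGB image given as an (H, W, 3) float array
    in [0, 1]; the (prediction, side outputs) return convention is an assumption
    carried over from the sketches above."""
    gray = rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114
    gray3 = np.repeat(gray[..., None], 3, axis=-1)   # three-channel gray-level image
    to_tensor = lambda a: torch.from_numpy(a.astype(np.float32)).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        pred, _ = model(to_tensor(rgb), to_tensor(gray3))
    return pred.squeeze().cpu().numpy()              # predicted saliency map
```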
2. The double-stream coding fusion significance detection method based on RGB and gray-level images as claimed in claim 1, wherein the gray-level image corresponding to the RGB image is generated as:
Gray=R×0.299+G×0.587+B×0.114;
wherein Gray is the gray-level image, R is the red-channel pixel value of the RGB image, G is the green-channel pixel value of the RGB image, and B is the blue-channel pixel value of the RGB image.
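As an illustrative calculation (not taken from the patent), a pixel with R = 200, G = 100 and B = 50 gives Gray = 200×0.299 + 100×0.587 + 50×0.114 = 59.8 + 58.7 + 5.7 = 124.2; replicating this single-channel value across three channels, as assumed in the sketches above, yields the three-channel gray-level image referred to in step S6.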
3. The double-stream coding fusion significance detection method based on RGB and gray-level images as claimed in claim 1, wherein the loss function is:
L = ∑_{p=1}^{P} L^{(p)};
wherein L is the total loss value, L^{(p)} is the loss value corresponding to the pth output, P = 5, and the outputs of the decode-I, decode-II, decode-III, decode-IV and decode-V modules respectively correspond to the loss values L^{(1)} to L^{(5)};
L^{(p)} = w_bce · l_bce + w_ssim · l_ssim;
wherein l_bce is the BCE loss, l_ssim is the SSIM loss, w_bce is the weight of the BCE loss, and w_ssim is the weight of the SSIM loss;
l_bce = -∑_{(x,y)} [g(x,y)·log(p(x,y)) + (1 - g(x,y))·log(1 - p(x,y))];
l_ssim = 1 - ((2·μ_x·μ_y + C_1)(2·σ_xy + C_2)) / ((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2));
wherein g(x,y) is the pixel value of the pixel (x,y) in the ground-truth image, p(x,y) is the pixel value of the pixel (x,y) in the predicted image, μ_x is the mean of the pixel values x′, μ_y is the mean of the pixel values y′, σ_x is the standard deviation of the pixel values x′, σ_y is the standard deviation of the pixel values y′, σ_xy is the covariance of x′ and y′, C_1 and C_2 are bias parameters, x′ = {x_j : j = 1, ..., N²} are the pixel values of the predicted image, y′ = {y_j : j = 1, ..., N²} are the pixel values of the ground-truth image, and N×N is the patch size over which the predicted image and the ground-truth image are compared.
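To make the composite loss concrete, a minimal PyTorch sketch following the BCE and SSIM definitions above is given below. The 11×11 window, the C1/C2 constants, the equal weights w_bce = w_ssim = 1 and the mean reduction of the SSIM term are common defaults rather than values from the patent, and each decode output is assumed to be compared against a ground-truth map already resized to its resolution.

```python
import torch
import torch.nn.functional as F

def bce_loss(pred, gt, eps=1e-7):
    """l_bce = -sum over pixels of [ g*log(p) + (1 - g)*log(1 - p) ]."""
    pred = pred.clamp(eps, 1 - eps)
    return -(gt * pred.log() + (1 - gt) * (1 - pred).log()).sum()

def ssim_loss(pred, gt, n=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """l_ssim = 1 - SSIM, with means, variances and covariance taken over local
    N x N patches; the window size and C1/C2 are assumed defaults."""
    mu_x = F.avg_pool2d(pred, n, 1, n // 2)
    mu_y = F.avg_pool2d(gt, n, 1, n // 2)
    var_x = F.avg_pool2d(pred * pred, n, 1, n // 2) - mu_x ** 2
    var_y = F.avg_pool2d(gt * gt, n, 1, n // 2) - mu_y ** 2
    cov_xy = F.avg_pool2d(pred * gt, n, 1, n // 2) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return (1 - ssim).mean()

def total_loss(side_outputs, gt, w_bce=1.0, w_ssim=1.0):
    """L = sum over the P = 5 decode outputs of (w_bce*l_bce + w_ssim*l_ssim);
    gt is assumed to be resized to each output's resolution beforehand."""
    return sum(w_bce * bce_loss(p, gt) + w_ssim * ssim_loss(p, gt) for p in side_outputs)
```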