CN112529862A - Significance image detection method for interactive cycle characteristic remodeling - Google Patents
Significance image detection method for interactive cycle characteristic remodeling
- Publication number
- CN112529862A CN112529862A CN202011413838.5A CN202011413838A CN112529862A CN 112529862 A CN112529862 A CN 112529862A CN 202011413838 A CN202011413838 A CN 202011413838A CN 112529862 A CN112529862 A CN 112529862A
- Authority
- CN
- China
- Prior art keywords
- feature maps
- block
- feature
- layer
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4007—Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/20—Image enhancement or restoration using local operators
- G06T5/30—Erosion or dilatation, e.g. thinning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Abstract
The invention discloses a significance image detection method with interactive cyclic feature remodeling. In the training stage a convolutional neural network is constructed which comprises an input layer, an encoding part, a decoding part and an output layer, wherein the encoding part comprises neural network blocks and the decoding part comprises information extraction blocks, feature remodeling blocks, information remodeling blocks, expansion (dilated) convolution blocks and feature aggregation blocks; the three channels of the RGB image of each 3D image and the three-channel depth map obtained by processing the corresponding depth image are input into the convolutional neural network for training to obtain a significance detection map; an optimal weight vector and an optimal bias term are obtained by calculating the loss function value between the significance detection map and the label image; in the testing stage the three channels of the RGB image of the 3D image to be detected and the three-channel depth map corresponding to its depth image are input into the convolutional neural network training model, and a significance prediction image is obtained by predicting with the optimal weight vector and the optimal bias term. The method has the advantages of a clear significance detection result and high detection precision.
Description
Technical Field
The invention relates to deep-learning-based significance image detection technology, and in particular to a significance image detection method with interactive cyclic feature remodeling.
Background
With the rapid development of artificial intelligence in the field of computing, saliency detection in images has become a research area of growing interest. Salient object detection (SOD) aims at distinguishing the visually most distinctive objects in an input image. Over the last decades, hundreds of methods have been developed for this task, which serves as an effective pre-processing step in many image processing and computer vision tasks such as object segmentation and tracking, video compression, image editing and texture smoothing. Recent work learns deep features of salient objects using convolutional neural networks (CNN); these models adopt an encoding-decoding structure with a simple architecture and high computational efficiency. In such a structure, the encoder usually extracts features of different semantic levels and resolutions with a pre-trained classification model (such as ResNet or VGG), and the decoder combines the extracted features to generate a saliency map. Existing saliency detection methods based on encoder-decoder convolutional neural networks are quite effective, but their accuracy remains challenging. Features of different semantic levels and resolutions have different distribution characteristics: high-level features carry rich semantic information but lack accurate position information, while low-level features carry rich detail but are full of background noise, so methods that simply fuse high-level and low-level features still do not reach ideal detection accuracy. For features of different modalities, cluttered background information is present in both the RGB information and the depth information, and further research is still needed to effectively distinguish the background from the foreground and generate better saliency images.
Disclosure of Invention
The invention aims to solve the technical problem of providing a significance image detection method for interactive cycle characteristic remodeling, which has clear significance detection result and high detection precision.
The technical scheme adopted by the invention for solving the technical problems is as follows: a method for detecting a significance image of interactive cycle feature remodeling is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting N pairs of original 3D images and the label image corresponding to each pair of original 3D images, recording the RGB image of the k-th pair of original 3D images as {I_k^RGB(x,y)}, recording the depth image of the k-th pair of original 3D images as {I_k^D(x,y)}, and taking the real salient detection image corresponding to the k-th pair of original 3D images as the label image, recorded as {G_k(x,y)}; then forming a training set by the RGB images, the depth images and the corresponding label images of all the original 3D images; wherein N is a positive integer, N is not less than 200, k is a positive integer, k is not less than 1 and not more than N, x is not less than 1 and not more than W, y is not less than 1 and not more than H, W represents the width of the original 3D image and of its RGB image, depth image and corresponding label image, H represents the height of the original 3D image and of its RGB image, depth image and corresponding label image, I_k^RGB(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {I_k^RGB(x,y)}, I_k^D(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {I_k^D(x,y)}, and G_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {G_k(x,y)};
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, an encoding part, a decoding part and an output layer, wherein the input layer comprises an RGB (red, green and blue) image input layer and a depth image input layer, the encoding part comprises 10 neural network blocks, and the decoding part comprises 2 information extraction blocks, 5 characteristic reconstruction blocks, 4 information reconstruction blocks, 5 expansion convolution blocks and 5 characteristic aggregation blocks; the output layer comprises an output convolution layer, the size of convolution kernels of the output convolution layer is 3 multiplied by 3, the number of the convolution kernels is 1, and the step length is 1;
for an RGB image input layer in the input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original RGB image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original RGB image to the encoding part; wherein, the width of the original RGB image is W, and the height of the original RGB image is H;
for a depth map input layer in the input layer, the input end of the input layer receives a three-channel depth map processed by an original depth image by adopting a copying method, and the output end of the input layer outputs the three-channel depth map to a coding part; wherein the width of the original depth image is W, and the height of the original depth image is H;
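For illustration, the copying method above amounts to replicating the single depth channel three times so that the depth map has the same shape as the RGB input. A minimal sketch (PyTorch is assumed here; the patent does not prescribe a framework):

```python
import torch

def depth_to_three_channels(depth: torch.Tensor) -> torch.Tensor:
    """Replicate a single-channel depth image (H x W or 1 x H x W) into a
    three-channel depth map (3 x H x W) by copying the channel three times."""
    if depth.dim() == 2:            # H x W  ->  1 x H x W
        depth = depth.unsqueeze(0)
    return depth.repeat(3, 1, 1)    # 1 x H x W  ->  3 x H x W
```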
for the coding part, the 1 st neural network block, the 2 nd neural network block, the 3 rd neural network block, the 4 th neural network block and the 5 th neural network block are sequentially connected to form a color coding stream, and the 6 th neural network block, the 7 th neural network block, the 8 th neural network block, the 9 th neural network block and the 10 th neural network block are sequentially connected to form a depth coding stream; the input end of the 1 st neural network block receives the R channel component, the G channel component and the B channel component of the original RGB image output by the output end of the RGB image input layer, the output end of the 1 st neural network block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as S1, and each feature map in S1 has a width of W and a height of H; the input end of the 2 nd neural network block receives all the feature maps in S1, the output end of the 2 nd neural network block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as S2, and each feature map in S2 has a width of W/2 and a height of H/2; the input end of the 3 rd neural network block receives all the feature maps in S2, the output end of the 3 rd neural network block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as S3, and each feature map in S3 has a width of W/4 and a height of H/4; the input end of the 4 th neural network block receives all the feature maps in S3, the output end of the 4 th neural network block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as S4, and each feature map in S4 has a width of W/8 and a height of H/8; the input end of the 5 th neural network block receives all the feature maps in S4, the output end of the 5 th neural network block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as S5, and each feature map in S5 has a width of W/16 and a height of H/16; the input end of the 6 th neural network block receives the three-channel depth map output by the output end of the depth map input layer, the output end of the 6 th neural network block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as D1, and each feature map in D1 has a width of W and a height of H; the input end of the 7 th neural network block receives all the feature maps in D1, the output end of the 7 th neural network block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as D2, and each feature map in D2 has a width of W/2 and a height of H/2; the input end of the 8 th neural network block receives all the feature maps in D2, the output end of the 8 th neural network block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as D3, and each feature map in D3 has a width of W/4 and a height of H/4; the input end of the 9 th neural network block receives all the feature maps in D3, the output end of the 9 th neural network block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as D4, and each feature map in D4 has a width of W/8 and a height of H/8; the input end of the 10 th neural network block receives all the feature maps in D4, the output end of the 10 th neural network block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as D5, and each feature map in D5 has a width of W/16 and a height of H/16; the encoding part provides all the feature maps of S1, S2, S3, S4, S5, D1, D2, D3, D4 and D5 to the decoding part;
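The channel counts and resolutions of S1-S5 and D1-D5 (64, 128, 256, 512 and 512 maps at scales 1, 1/2, 1/4, 1/8 and 1/16 of the input) match the five convolutional stages of a VGG-16 backbone, one copy per modality. The following sketch of such a dual-stream encoder is illustrative only; the use of torchvision's VGG-16 is an assumption, as the patent only fixes the shapes of the stage outputs:

```python
import torch.nn as nn
from torchvision.models import vgg16

class DualStreamEncoder(nn.Module):
    """Color coding stream (blocks 1-5) and depth coding stream (blocks 6-10).
    Stage i returns the feature sets S_i (color) and D_i (depth)."""
    def __init__(self):
        super().__init__()
        cuts = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]   # VGG-16 stages
        rgb_layers = list(vgg16().features.children())
        dep_layers = list(vgg16().features.children())
        self.rgb_blocks = nn.ModuleList(nn.Sequential(*rgb_layers[a:b]) for a, b in cuts)
        self.dep_blocks = nn.ModuleList(nn.Sequential(*dep_layers[a:b]) for a, b in cuts)

    def forward(self, rgb, depth3):
        s_maps, d_maps = [], []              # [S1..S5], [D1..D5]
        x, y = rgb, depth3
        for rb, db in zip(self.rgb_blocks, self.dep_blocks):
            x, y = rb(x), db(y)
            s_maps.append(x)
            d_maps.append(y)
        return s_maps, d_maps
```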
for the decoding part, the input end of the 1 st information extraction block receives all the feature maps in D1, the output end of the 1 st information extraction block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as F1, and each feature map in F1 has a width of W and a height of H; the first input end of the 1 st feature reconstruction block receives all the feature maps in S1, the second input end of the 1 st feature reconstruction block receives all the feature maps in F1, the output end of the 1 st feature reconstruction block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as F2, and each feature map in F2 has a width of W and a height of H; the first input end of the 1 st information reconstruction block receives all the feature maps in F2, the second input end of the 1 st information reconstruction block receives all the feature maps in D2, the output end of the 1 st information reconstruction block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as F3, and each feature map in F3 has a width of W/2 and a height of H/2; the first input end of the 2 nd feature reconstruction block receives all the feature maps in S2, the second input end of the 2 nd feature reconstruction block receives all the feature maps in F3, the output end of the 2 nd feature reconstruction block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as F4, and each feature map in F4 has a width of W/2 and a height of H/2; the first input end of the 2 nd information reconstruction block receives all the feature maps in F4, the second input end of the 2 nd information reconstruction block receives all the feature maps in D3, the output end of the 2 nd information reconstruction block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as F5, and each feature map in F5 has a width of W/4 and a height of H/4; the first input end of the 3 rd feature reconstruction block receives all the feature maps in S3, the second input end of the 3 rd feature reconstruction block receives all the feature maps in F5, the output end of the 3 rd feature reconstruction block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as F6, and each feature map in F6 has a width of W/4 and a height of H/4; the first input end of the 3 rd information reconstruction block receives all the feature maps in F6, the second input end of the 3 rd information reconstruction block receives all the feature maps in D4, the output end of the 3 rd information reconstruction block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as F7, and each feature map in F7 has a width of W/8 and a height of H/8; the first input end of the 4 th feature reconstruction block receives all the feature maps in S4, the second input end of the 4 th feature reconstruction block receives all the feature maps in F7, the output end of the 4 th feature reconstruction block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as F8, and each feature map in F8 has a width of W/8 and a height of H/8; the first input end of the 4 th information reconstruction block receives all the feature maps in F8, the second input end of the 4 th information reconstruction block receives all the feature maps in D5, the output end of the 4 th information reconstruction block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as F9, and each feature map in F9 has a width of W/16 and a height of H/16; the first input end of the 5 th feature reconstruction block receives all the feature maps in S5, the second input end of the 5 th feature reconstruction block receives all the feature maps in F9, the output end of the 5 th feature reconstruction block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as F10, and each feature map in F10 has a width of W/16 and a height of H/16; the input end of the 2 nd information extraction block receives all the feature maps in S5, the output end of the 2 nd information extraction block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as F11, and each feature map in F11 has a width of W/16 and a height of H/16; the input end of the 1 st expansion convolution block receives all the feature maps in D1, the output end of the 1 st expansion convolution block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as P1, and each feature map in P1 has a width of W and a height of H; the input end of the 2 nd expansion convolution block receives all the feature maps in D2, the output end of the 2 nd expansion convolution block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as P2, and each feature map in P2 has a width of W/2 and a height of H/2; the input end of the 3 rd expansion convolution block receives all the feature maps in D3, the output end of the 3 rd expansion convolution block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as P3, and each feature map in P3 has a width of W/4 and a height of H/4; the input end of the 4 th expansion convolution block receives all the feature maps in D4, the output end of the 4 th expansion convolution block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as P4, and each feature map in P4 has a width of W/8 and a height of H/8; the input end of the 5 th expansion convolution block receives all the feature maps in D5, the output end of the 5 th expansion convolution block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as P5, and each feature map in P5 has a width of W/16 and a height of H/16; the first input end of the 1 st feature aggregation block receives all the feature maps in F10, the second input end of the 1 st feature aggregation block receives all the feature maps in P5, the third input end of the 1 st feature aggregation block receives all the feature maps in F11, the output end of the 1 st feature aggregation block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as A1, and each feature map in A1 has a width of W/16 and a height of H/16; the first input end of the 2 nd feature aggregation block receives all the feature maps in F8, the second input end of the 2 nd feature aggregation block receives all the feature maps in P4, the third input end of the 2 nd feature aggregation block receives all the feature maps in A1, the output end of the 2 nd feature aggregation block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as A2, and each feature map in A2 has a width of W/8 and a height of H/8; the first input end of the 3 rd feature aggregation block receives all the feature maps in F6, the second input end of the 3 rd feature aggregation block receives all the feature maps in P3, the third input end of the 3 rd feature aggregation block receives all the feature maps in A2, the output end of the 3 rd feature aggregation block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as A3, and each feature map in A3 has a width of W/4 and a height of H/4; the first input end of the 4 th feature aggregation block receives all the feature maps in F4, the second input end of the 4 th feature aggregation block receives all the feature maps in P2, the third input end of the 4 th feature aggregation block receives all the feature maps in A3, the output end of the 4 th feature aggregation block outputs 32 feature maps, the set formed by the 32 feature maps is denoted as A4, and each feature map in A4 has a width of W/2 and a height of H/2; the first input end of the 5 th feature aggregation block receives all the feature maps in F2, the second input end of the 5 th feature aggregation block receives all the feature maps in P1, the third input end of the 5 th feature aggregation block receives all the feature maps in A4, the output end of the 5 th feature aggregation block outputs 16 feature maps, the set formed by the 16 feature maps is denoted as A5, and each feature map in A5 has a width of W and a height of H; the decoding part supplies all the feature maps in A5 to the output layer;
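The decoding connections described above can be summarised as an alternating chain of feature remodeling (reconstruction) and information remodeling (reconstruction) blocks, a parallel set of expansion (dilated) convolution blocks, and a coarse-to-fine chain of feature aggregation blocks. A structural sketch of the wiring only, in which IE, FR, IR, DC and FA stand for the information extraction, feature remodeling, information remodeling, expansion convolution and feature aggregation sub-modules (assumed to be implemented separately, as detailed later in the description):

```python
def decode(blocks, s, d):
    """Wiring of the decoding part.  s = [S1..S5], d = [D1..D5]; `blocks`
    maps names to the sub-modules IE1-IE2, FR1-FR5, IR1-IR4, DC1-DC5, FA1-FA5."""
    F1  = blocks["IE1"](d[0])            # from D1
    F2  = blocks["FR1"](s[0], F1)        # from S1, F1
    F3  = blocks["IR1"](F2, d[1])        # from F2, D2
    F4  = blocks["FR2"](s[1], F3)
    F5  = blocks["IR2"](F4, d[2])
    F6  = blocks["FR3"](s[2], F5)
    F7  = blocks["IR3"](F6, d[3])
    F8  = blocks["FR4"](s[3], F7)
    F9  = blocks["IR4"](F8, d[4])
    F10 = blocks["FR5"](s[4], F9)
    F11 = blocks["IE2"](s[4])            # from S5
    P   = [blocks[f"DC{i}"](d[i - 1]) for i in range(1, 6)]   # P1..P5 from D1..D5
    A1  = blocks["FA1"](F10, P[4], F11)
    A2  = blocks["FA2"](F8,  P[3], A1)
    A3  = blocks["FA3"](F6,  P[2], A2)
    A4  = blocks["FA4"](F4,  P[1], A3)
    A5  = blocks["FA5"](F2,  P[0], A4)
    return A5                            # handed to the output layer
```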
for the output layer, the input end of the output convolutional layer receives all the feature maps in A5, and the output end of the output convolutional layer outputs a feature map with the width W and the height H as a significance detection map;
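Under these hyper-parameters the output layer reduces to a single convolution; a zero padding of 1, which is not stated in the text, is assumed below so that the output keeps the width W and height H:

```python
import torch.nn as nn

# Output layer: one 3 x 3 convolution (stride 1) mapping the 16 feature maps
# in A5 to a single W x H significance detection map.  padding=1 is assumed.
output_conv = nn.Conv2d(in_channels=16, out_channels=1, kernel_size=3, stride=1, padding=1)
```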
step 1_ 3: inputting the R channel component, G channel component and B channel component of the RGB image of every original 3D image in the training set, together with the three-channel depth map obtained by copying the corresponding depth image, into the convolutional neural network for training to obtain the significance detection map corresponding to each pair of original 3D images, and recording the significance detection map corresponding to the k-th pair of original 3D images as {S_k(x,y)}; wherein S_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {S_k(x,y)};
step 1_ 4: calculating the loss function value between the significance detection map corresponding to each pair of original 3D images and the corresponding label image, and recording the loss function value between {S_k(x,y)} and {G_k(x,y)} as L_k;
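The patent does not name the loss function used here; a per-pixel binary cross-entropy between the significance detection map and the label image is a common choice and is assumed in this sketch:

```python
import torch
import torch.nn.functional as F

def saliency_loss(pred_logits: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Loss L_k between one predicted significance detection map {S_k(x,y)}
    and its label image {G_k(x,y)}, both of size H x W with values in [0, 1]."""
    return F.binary_cross_entropy_with_logits(pred_logits, label)
```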
Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for M times to obtain a convolutional neural network training model, and obtaining N multiplied by M loss function values; dividing the sum of the N loss function values obtained by each execution by N to obtain final loss function values obtained by the execution, and obtaining M final loss function values in total; finding out the final loss function value with the minimum value from the M final loss function values, and correspondingly taking the weight vector and the bias item corresponding to the minimum final loss function value as the optimal weight vector and the optimal bias item of the convolutional neural network training model; wherein M is greater than 1;
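Steps 1_3 to 1_5 amount to an ordinary training loop over M passes of the N training pairs, keeping the weight vector and bias terms of the pass with the smallest final (averaged) loss. A sketch reusing the saliency_loss function above; the optimizer, learning rate and one-pair-per-batch loading are assumptions:

```python
import copy
import torch

def train(model, loader, m_passes: int, device: str = "cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed optimizer
    best_loss, best_state = float("inf"), None
    for _ in range(m_passes):                       # step 1_5: M repetitions
        total = 0.0
        for rgb, depth3, label in loader:           # step 1_3: the N training pairs
            rgb, depth3, label = rgb.to(device), depth3.to(device), label.to(device)
            pred = model(rgb, depth3)
            loss = saliency_loss(pred, label)       # step 1_4: loss per pair
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        final_loss = total / len(loader)            # sum of the N losses divided by N
        if final_loss < best_loss:                  # keep the optimal weights and biases
            best_loss = final_loss
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```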
the test stage process comprises the following specific steps:
step 2_ 1: and inputting a three-channel depth map obtained by copying R channel components, G channel components, B channel components and depth images of the RGB images of the 3D images to be subjected to significance detection into a convolutional neural network training model, predicting by using the optimal weight vector and the optimal bias term, and predicting to obtain a corresponding significance prediction image.
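The test stage in the same notation; the sigmoid that squashes the network output into a [0, 1] significance prediction image is an assumption:

```python
import torch

@torch.no_grad()
def predict(model, rgb: torch.Tensor, depth3: torch.Tensor) -> torch.Tensor:
    """Step 2_1: run the trained model (loaded with the optimal weight vector
    and bias terms) on one RGB image and its three-channel depth map and
    return the predicted significance image of size H x W."""
    model.eval()
    logits = model(rgb.unsqueeze(0), depth3.unsqueeze(0))   # add batch dimension
    return torch.sigmoid(logits).squeeze()
```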
In step 1_2, the 2 information extraction blocks have the same structure and are composed of a1 st convolution block, a first maximum pooling layer, a first average pooling layer, a2 nd convolution block, a3 rd convolution block and a first up-sampling layer, wherein the 1 st convolution block comprises a first convolution layer, a first convolution layer and a first up-sampling layer which are sequentially connectedThe active layer, the second convolution layer and the second active layer, the 2 nd convolution block comprises a third convolution layer and a third active layer which are connected in sequence, the 3 rd convolution block comprises a fourth convolution layer and a fourth active layer which are connected in sequence, the input end of the first convolution layer in the 1 st information extraction block receives all feature maps in D1, the input end of the first convolution layer in the 2 nd information extraction block receives all feature maps in S5, the input end of the first maximum pooling layer, the input end of the first average pooling layer and the input end of the third convolution layer all receive all feature maps output by the output end of the fourth active layer, the channel number superposition operation is carried out on all feature maps output by the output end of the first maximum pooling layer and all feature maps output by the output end of the first average pooling layer, the input end of the fourth convolution layer receives all feature maps obtained after the channel number superposition operation, the input end of the first up-sampling layer receives all the feature maps output by the output end of the fourth active layer, element multiplication operation is carried out on all the feature maps output by the output end of the first up-sampling layer and all the feature maps output by the output end of the third active layer, element addition operation is carried out on all the feature maps output by the output end of the first up-sampling layer and all the feature maps obtained after the element multiplication operation, for the 1 st information extraction block, the set formed by all the feature maps obtained after the element addition operation is F1, and for the 2 nd information extraction block, the set formed by all the feature maps obtained after the element addition operation is F11; wherein, the number of input channels of the ith information extraction block is set as niThen the number n of input channels of the 1 st information extraction block164, the number n of input channels of the 2 nd information extraction block2512, the convolution kernel size of the first convolution layer and the fourth convolution layer in the ith information extraction block is 1 × 1, and the number of convolution kernels is niThe step length is 1, the value of the zero padding parameter is 0, the size of the convolution kernel of the second convolution layer in the ith information extraction block is 3 multiplied by 3, the number of the convolution kernels is niThe step length is 1, the value of the zero padding parameter is 0, the size of the convolution kernel of the third convolution layer in the ith information extraction block is 3 multiplied by 3, the number of the convolution kernels is niThe step length is 1, the value of the zero-filling parameter is 1, i is 1,2, and the first active layer, the second active layer, the third active layer and the fourth active layer are arranged in sequenceThe activation mode of the first upsampling layer is 'Relu', the 
convolution kernel size of the first maximum pooling layer and the first average pooling layer is 2 multiplied by 2, the step length is 2, the value of the zero padding parameter is 0, the amplification factor of the first up-sampling layer is 2, and the interpolation method is bilinear interpolation.
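A hedged sketch of the information extraction block described above. Where the text is ambiguous, the sketch assumes that the two pooling layers and the third convolution layer take the output of the 1st convolution block, and it uses a zero padding of 1 for the 3 × 3 convolutions so that all branches keep the input resolution:

```python
import torch
import torch.nn as nn

class InformationExtractionBlock(nn.Module):
    """Sketch of the information extraction block; n_ch is 64 for the 1st
    block (input D1) and 512 for the 2nd block (input S5)."""
    def __init__(self, n_ch: int):
        super().__init__()
        self.conv_block1 = nn.Sequential(                 # 1st convolution block
            nn.Conv2d(n_ch, n_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(n_ch, n_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.conv_block2 = nn.Sequential(                 # 2nd convolution block
            nn.Conv2d(n_ch, n_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.conv_block3 = nn.Sequential(                 # 3rd convolution block
            nn.Conv2d(2 * n_ch, n_ch, kernel_size=1), nn.ReLU(inplace=True))
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        x = self.conv_block1(x)
        gate = self.conv_block2(x)                        # full-resolution branch
        pooled = torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1)
        up = self.up(self.conv_block3(pooled))            # pooled branch, size restored
        return up + up * gate                             # element-wise multiply, then add
```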
In step 1_2, the 5 feature reconstruction blocks have the same structure and are composed of a context attention block and a channel attention block, and for the 1 st feature reconstruction block, which performs a first element addition operation on all feature maps in S1 and all feature maps in F1, the input terminal of the context attention block receives all feature maps obtained after the first element addition operation, the input terminal of the channel attention block receives all feature maps output from the output terminal of the context attention block, element multiplication operation is carried out on all feature maps output by the output end of the channel attention block and all feature maps obtained after the first element addition operation, performing element addition operation for a second time on all feature maps obtained by multiplying all feature maps in the S1 by elements, wherein a set formed by all feature maps obtained by the element addition operation for the second time is F2; for the 2 nd feature reconstruction block, performing a first element addition operation on all feature maps in S2 and all feature maps in F3, receiving all feature maps obtained after the first element addition operation at an input end of the context attention block, receiving all feature maps output by an output end of the context attention block at an input end of the channel attention block, performing an element multiplication operation on all feature maps output by an output end of the channel attention block and all feature maps obtained after the first element addition operation, performing a second element addition operation on all feature maps obtained after the element multiplication operation and all feature maps obtained after the second element addition operation, wherein a set formed by all feature maps obtained after the second element addition operation is F4; for the 3 rd feature reconstruction block, the first element addition operation is performed on all feature maps in S3 and all feature maps in F5, the input end of the context attention block receives all feature maps obtained after the first element addition operation, the input end of the channel attention block receives all feature maps output by the output end of the context attention block, the element multiplication operation is performed on all feature maps output by the output end of the channel attention block and all feature maps obtained after the first element addition operation, the second element addition operation is performed on all feature maps in S3 and all feature maps obtained after the element multiplication operation, and the set formed by all feature maps obtained after the second element addition operation is F6; for the 4 th feature reconstruction block, performing a first element addition operation on all feature maps in S4 and all feature maps in F7, receiving all feature maps obtained after the first element addition operation at an input end of the context attention block, receiving all feature maps output by an output end of the context attention block at an input end of the channel attention block, performing an element multiplication operation on all feature maps output by an output end of the channel attention block and all feature maps obtained after the first element addition operation, performing a second element addition operation on all feature maps obtained after the element multiplication operation and all feature maps obtained after the second element addition operation, 
wherein a set formed by all feature maps obtained after the second element addition operation is F8; for the 5 th feature reconstruction block, the first element addition operation is performed on all feature maps in S5 and all feature maps in F9, the input end of the context attention block receives all feature maps obtained after the first element addition operation, the input end of the channel attention block receives all feature maps output by the output end of the context attention block, the element multiplication operation is performed on all feature maps output by the output end of the channel attention block and all feature maps obtained after the first element addition operation, the second element addition operation is performed on all feature maps in S5 and all feature maps obtained after the element multiplication operation, and the set formed by all feature maps obtained after the second element addition operation is F10.
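A structural sketch of the feature reconstruction (remodeling) block; the internals of the context attention block and the channel attention block are not given in this passage, so they are passed in here as ready-made modules:

```python
import torch.nn as nn

class FeatureRemodelingBlock(nn.Module):
    """The color features S and the depth-guided features F are added, passed
    through a context attention block followed by a channel attention block,
    and the resulting attention re-weights the summed features before a final
    residual addition with S."""
    def __init__(self, context_attention: nn.Module, channel_attention: nn.Module):
        super().__init__()
        self.context_attention = context_attention
        self.channel_attention = channel_attention

    def forward(self, s, f):
        summed = s + f                            # first element-wise addition
        att = self.channel_attention(self.context_attention(summed))
        return s + att * summed                   # multiplication, then second addition
```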
In step 1_2, the 4 information reconstruction blocks have the same structure and are composed of a second maximum pooling layer, a second average pooling layer, a4 th convolution block and a5 th convolution block, the 4 th convolution block comprises a fifth convolution layer and a fifth active layer which are sequentially connected, the 5 th convolution block comprises a sixth convolution layer, a sixth active layer, a seventh convolution layer and a seventh active layer which are sequentially connected, and the input end of the second maximum pooling layer and the input end of the second average pooling layer in the 1 st information reconstruction block both receive all the feature maps and the sixth convolution layer in F2The inputs of the stacks receive all the feature maps in D2, the inputs of the second largest pooling layer and the second average pooling layer in the 2 nd information re-modeling block receive all the feature maps in F4, the input of the sixth pooling layer receives all the feature maps in D3, the input of the second largest pooling layer and the input of the second average pooling layer in the 3 rd information re-modeling block receive all the feature maps in F6, the input of the sixth pooling layer receives all the feature maps in D4, the input of the second largest pooling layer and the input of the second average pooling layer in the 4 th information re-modeling block receive all the feature maps in F8, the input of the sixth pooling layer receives all the feature maps in D5, the all feature maps output at the output of the second largest subtracting pooling layer and all the feature maps output at the output of the second average pooling layer are operated by elements, receiving all feature maps obtained after the element subtraction operation by an input end of the fifth convolutional layer, performing element multiplication operation on all feature maps output by an output end of the fifth active layer and all feature maps output by an output end of the seventh active layer, performing element addition operation on all feature maps output by an output end of the fifth active layer and all feature maps obtained after the element multiplication operation, wherein for a1 st information reconstruction block, a set formed by all feature maps obtained after the element addition operation is F3, for a2 nd information reconstruction block, a set formed by all feature maps obtained after the element addition operation is F5, for a3 rd information reconstruction block, a set formed by all feature maps obtained after the element addition operation is F7, and for a4 th information reconstruction block, a set formed by all feature maps obtained after the element addition operation is F9; wherein the number of input channels of the first input terminal of the jth information reproduction block is set to n1jThe number of input channels of the second input end is n2jThe number n1 of input channels at the first input of the 1 st information reproduction block164, number of input channels n2 of second input end1128, the number of input channels n1 at the first input of the 2 nd information reconstruction block2128, number of input channels n2 of second input end2256, number n1 of input channels at the first input of the 3 rd information reconstruction block3256, number of input channels n of second input end23512, the number of input channels n1 at the first input of the 4 th information reconstruction block4512, number of input channels n2 of second input end4512, j is 
1,2,3,4, the convolution kernel size of the fifth convolution layer in the jth information reconstruction block is 1 × 1, and the number of convolution kernels is n2jThe step size is 1, the value of the zero padding parameter is 0, the convolution kernel size of the sixth convolution layer in the jth information reconstruction block is 1 × 1, and the number of convolution kernels is n2jThe step size is 1, the value of the zero padding parameter is 0, the convolution kernel size of the seventh convolution layer in the jth information reconstruction block is 3 x 3, and the number of convolution kernels is n2jThe step length is 1, the value of the zero padding parameter is 1, the activation mode of the fifth activation layer, the sixth activation layer and the seventh activation layer is 'Relu', the convolution kernel size of the second maximum pooling layer and the second average pooling layer is 2 x 2, the step length is 2, the value of the zero padding parameter is 0, and when element subtraction operation is performed on all feature maps output by the output end of the second maximum pooling layer and all feature maps output by the output end of the second average pooling layer, corresponding elements in the corresponding feature maps output by the output end of the second average pooling layer are subtracted from elements in the feature maps output by the output end of the second maximum pooling layer.
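A hedged sketch of the information reconstruction (remodeling) block; the channel counts correspond to (n1_j, n2_j), i.e. (64, 128), (128, 256), (256, 512) and (512, 512) for the four blocks:

```python
import torch.nn as nn

class InformationRemodelingBlock(nn.Module):
    """`in_ch` is the channel count of the fused features F from the previous
    level and `out_ch` the channel count of the depth features D of the
    current level."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.conv_block4 = nn.Sequential(                      # fifth conv + ReLU
            nn.Conv2d(in_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True))
        self.conv_block5 = nn.Sequential(                      # sixth/seventh conv + ReLU
            nn.Conv2d(out_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f, d):
        diff = self.max_pool(f) - self.avg_pool(f)   # element-wise subtraction
        a = self.conv_block4(diff)                   # pooled fused-feature branch
        b = self.conv_block5(d)                      # depth-feature branch
        return a + a * b                             # multiplication, then addition
```

For example, under these assumptions InformationRemodelingBlock(64, 128) would map F2 and D2 to the 128 maps of F3.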
In step 1_2, the 5 feature aggregation blocks have the same structure and are composed of a 6 th convolution block, a 7 th convolution block, a 8 th convolution block, a 9 th convolution block, a 10 th convolution block, a 11 th convolution block, a 12 th convolution block, a 13 th convolution block, a second upsampling layer and a residual fusion block, wherein the 6 th convolution block includes an eighth convolution layer and an eighth active layer which are sequentially connected, the 7 th convolution block includes a ninth convolution layer and a ninth active layer which are sequentially connected, the 8 th convolution block includes a tenth convolution layer and a tenth active layer which are sequentially connected, the 9 th convolution block includes an eleventh convolution layer and an eleventh active layer which are sequentially connected, the 10 th convolution block includes a twelfth convolution layer and a twelfth active layer which are sequentially connected, the 11 th convolution block includes a thirteenth convolution layer and a thirteenth active layer which are sequentially connected, the 12 th convolution block includes a fourteenth convolution layer and a fourteenth active layer which are sequentially connected, the 13 th convolution block comprises a fifteenth convolution layer and a fifteenth activation layer which are connected in sequence, and the residual error fusion block comprises a fifteenth convolution layer and a fifteenth activation layer which are connected in sequenceThe input terminal of the eighth convolutional layer in the 1 st feature aggregation block receives all the feature maps in F10, the input terminal of the ninth convolutional layer receives all the feature maps in P5, the input terminal of the second upsampling layer receives all the feature maps in F11, the input terminal of the eighth convolutional layer in the 2 nd feature aggregation block receives all the feature maps in F8, the input terminal of the ninth convolutional layer receives all the feature maps in P4, the input terminal of the second upsampling layer receives all the feature maps in A1, the input terminal of the eighth convolutional layer in the 3 rd feature aggregation block receives all the feature maps in F6, the input terminal of the ninth convolutional layer receives all the feature maps in P3, the input terminal of the second upsampling layer receives all the feature maps in A2, the input terminal of the eighth convolutional layer in the 4 th feature aggregation block receives all the feature maps in F4, The input end of the ninth convolutional layer receives all the feature maps in P2, the input end of the second upsampling layer receives all the feature maps in A3, the input end of the eighth convolutional layer of the 5 th feature aggregation block receives all the feature maps in F2, the input end of the ninth convolutional layer receives all the feature maps in P1, the input end of the second upsampling layer receives all the feature maps in A4, all the feature maps output by the output end of the eighth active layer and all the feature maps output by the output end of the ninth active layer are respectively subjected to channel quartering, the channel quartering is respectively divided into four parts in sequence, the first channel number superposition operation is carried out on the 1 st part of all the feature maps output by the output end of the eighth active layer and the 1 st part of all the feature maps output by the output end of the ninth active layer, the 
second channel number superposition operation is carried out on the 2 nd part of all the feature maps output by the output end of the eighth active layer and the 2 nd part of all the feature maps output by the output end of the ninth active layer, performing a third channel number superposition operation on the 3 rd parts of all the characteristic diagrams output by the output end of the eighth active layer and the 3 rd parts of all the characteristic diagrams output by the output end of the ninth active layer, performing a fourth channel number superposition operation on the 4 th parts of all the characteristic diagrams output by the output end of the eighth active layer and the 4 th parts of all the characteristic diagrams output by the output end of the ninth active layer, receiving all the characteristic diagrams output by the output end of the second upsampling layer by the input end of the tenth convolutional layer, and terminating the input end of the eleventh convolutional layer by the output end of the eleventh convolutional layerReceiving all feature maps obtained after the first channel number superposition operation, receiving all feature maps obtained after the second channel number superposition operation by an input end of a twelfth convolution layer, receiving all feature maps obtained after the third channel number superposition operation by an input end of a thirteenth convolution layer, receiving all feature maps obtained after the fourth channel number superposition operation by an input end of a fourteenth convolution layer, performing fifth channel number superposition operation on all feature maps output by an output end of an eleventh active layer, all feature maps output by an output end of the twelfth active layer, all feature maps output by an output end of the thirteenth active layer and all feature maps output by an output end of the fourteenth active layer, receiving all feature maps obtained after the fifth channel number superposition operation by an input end of a fifteenth convolution layer, performing element multiplication operation on all feature maps output by an output end of the tenth active layer and all feature maps output by an output end of the fifteenth active layer, performing a first element addition operation on all feature maps output by an output end of a tenth active layer and all feature maps obtained after the element multiplication operation, receiving all feature maps obtained after the first element addition operation by an input end of a sixteenth active layer, performing a second element addition operation on all feature maps output by an output end of the sixteenth convolutional layer and all feature maps obtained after the first element addition operation, wherein a set formed by all feature maps obtained after the second element addition operation is A1 for a1 st feature aggregation block, a set formed by all feature maps obtained after the second element addition operation is A2 for a2 nd feature aggregation block, a set formed by all feature maps obtained after the second element addition operation is A3 for A3 rd feature aggregation block, and a set formed by all feature maps obtained after the second element addition operation is A4 for a4 th feature aggregation block, for the 5 th feature aggregation block, a set formed by all feature maps obtained after the second element addition operation is A5; wherein the number of input channels of the first input end of the mth feature aggregation block is set to be n1mThe number of 
input channels of the second input end is n2mThe number of input channels of the third input end is n3mNumber n1 of input channels at the first input of the 1 st feature aggregation block1512, number of input channels n2 of second input end1512, number of input channels n3 of third input end1512, the number of input channels n1 at the first input of the 2 nd feature aggregation block2512, number of input channels n2 of second input end2512, number of input channels n3 of third input end2256, input channel number n1 at the first input of the 3 rd feature aggregation block3256, number of input channels n2 of second input end3256, number of input channels n3 of third input end3128, the number of input channels n1 at the first input of the 4 th feature aggregation block4128, number of input channels n2 of second input end4128, number of input channels n3 of third input end464, the number n1 of input channels at the first input of the 5 th feature aggregation block564, number of input channels n2 of second input end564, number of input channels n3 of third input end532, the convolution kernel size of the eighth convolution layer in the mth feature aggregation block is 3 × 3, and the number of convolution kernels is n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of ninth convolution layer in mth characteristic aggregation block of 3 x 3, convolution kernel number of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of tenth convolution layer in mth characteristic aggregation block of 3 x 3, convolution kernel number of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of eleventh convolution layer in mth characteristic aggregation block of 3 x 3, convolution kernel number of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of the twelfth convolution layer in the mth characteristic aggregation block of 3 x 3, number of convolution kernels of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of thirteenth convolution layer in mth characteristic aggregation block of 3 x 3, convolution kernel number of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of fourteenth convolution layer in mth characteristic aggregation block of 3 x 3, convolution kernel number of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of fifteenth convolution layer in mth characteristic aggregation block of 3 x 3, number of convolution kernels of n3mStep size of 1, zero-filled parameter value of 0, m characteristic aggregation blockThe size of the convolution kernel of the sixteenth convolution layer of (2) is 3 × 3, and the number of convolution kernels is n3mAnd/2, the step length is 1, the value of the zero padding parameter is 0, m is 1,2,3,4,5, the activation modes of an eighth activation layer, a ninth activation layer, a tenth activation layer, an eleventh activation layer, a twelfth activation layer, a thirteenth activation layer, a fourteenth activation layer, a fifteenth activation layer and a sixteenth activation layer are Relu, the convolution kernel size of the third maximum pooling layer is 5 x 5, the step length is 1, the value of the zero padding parameter is 2, the amplification factor of the second up-sampling layer is 2, and the interpolation method is bilinear interpolation.
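A hedged sketch of the feature aggregation block. The group-wise channel split and cross-concatenation of the two convolved inputs follows the text; the 3 × 3 convolutions use a zero padding of 1 so that spatial sizes are preserved, and the 1 × 1 projection before the final residual addition is an assumption added so that the stated output channel count (n3_m/2) lines up. For the 1st aggregation block the three listed inputs are already at the same scale, so the ×2 up-sampling of the third input may have to be omitted there:

```python
import torch
import torch.nn as nn

def conv3x3(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class FeatureAggregationBlock(nn.Module):
    """x: remodeled color feature (first input), y: dilated depth feature
    (second input), z: coarser aggregated feature (third input, upsampled).
    `ch` corresponds to n3_m; the block outputs ch // 2 feature maps."""
    def __init__(self, x_ch: int, y_ch: int, ch: int):
        super().__init__()
        self.conv_x = conv3x3(x_ch, ch)              # eighth convolution layer
        self.conv_y = conv3x3(y_ch, ch)              # ninth convolution layer
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv_z = conv3x3(ch, ch)                # tenth convolution layer
        self.group_convs = nn.ModuleList(conv3x3(ch // 2, ch) for _ in range(4))
        self.conv_merge = conv3x3(4 * ch, ch)        # fifteenth convolution layer
        self.res_pool = nn.MaxPool2d(5, stride=1, padding=2)   # third max pooling layer
        self.conv_out = conv3x3(ch, ch // 2)         # sixteenth convolution layer
        self.project = nn.Conv2d(ch, ch // 2, 1)     # assumed channel projection

    def forward(self, x, y, z):
        xg = torch.chunk(self.conv_x(x), 4, dim=1)   # split channels into quarters
        yg = torch.chunk(self.conv_y(y), 4, dim=1)
        groups = [gc(torch.cat([a, b], dim=1)) for gc, a, b in zip(self.group_convs, xg, yg)]
        w = self.conv_merge(torch.cat(groups, dim=1))
        z = self.conv_z(self.up(z))
        r = z + z * w                                # multiply, then first addition
        return self.project(r) + self.conv_out(self.res_pool(r))   # residual fusion
```

For example, under these assumptions FeatureAggregationBlock(512, 512, 256) would map F8, P4 and A1 to the 128 maps of A2.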
Compared with the prior art, the invention has the advantages that:
1) the convolutional neural network constructed by the method is a double-flow end-to-end interaction cycle characteristic remodeling network system structure, information flows of two modes are mutually communicated so as to extract enough complementary information, and meanwhile, the background noise of the two modes is inhibited, so that a convolutional neural network training model obtained by training has better significance detection performance.
2) The convolutional neural network constructed by the method of the invention is provided with the information extraction block, and the information extraction block can further extract the foreground information of the shallow depth map and the foreground information of the deep color map through the pooling operation, thereby being beneficial to the full extraction of the information and leading the trained convolutional neural network training model to be capable of effectively detecting the significant objects.
3) The convolutional neural network constructed by the method of the invention is designed with the characteristic remolding block and the information remolding block, the characteristic remolding block fuses color information by taking depth information as weight, and the information remolding block fuses the fusion information of the characteristic remolding block and adjacent depth information again to obtain complementary context characteristics, so that the convolutional neural network training model obtained by training can effectively detect a significant object.
4) The feature aggregation block is designed in the convolutional neural network constructed by the method, and the local features and the global features of the two modes are fully fused, so that the convolutional neural network training model obtained by training can effectively detect the significant object.
Drawings
FIG. 1 is a schematic diagram of the structure of an end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 2 is a schematic diagram of the structure of an information extraction block in an end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 3 is a schematic diagram of the structure of the feature reconstruction block in the end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 4 is a schematic diagram of the structure of the information reconstruction block in the end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 5 is a schematic diagram of a structure of a feature aggregation block in an end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 6a is an RGB image of the 1 st pair of 3D images to be saliency detected;
FIG. 6b is a depth image of the 1 st pair of 3D images to be saliency detected;
FIG. 6c is a predicted salient image obtained by processing FIGS. 6a and 6b using the method of the present invention;
FIG. 6D is a label image corresponding to the 1 st pair of 3D images to be detected for saliency;
FIG. 7a is an RGB image of the 2 nd pair of 3D images to be saliency detected;
FIG. 7b is a depth image of the 2 nd pair of 3D images to be saliency detected;
FIG. 7c is a predicted salient image obtained by processing FIGS. 7a and 7b using the method of the present invention;
FIG. 7D is a label image corresponding to the 2 nd pair of 3D images to be saliency detected;
FIG. 8a is an RGB image of a3 rd pair of 3D images to be saliency detected;
FIG. 8b is a depth image of the 3 rd pair of 3D images to be saliency detected;
FIG. 8c is a predicted salient image obtained by processing FIGS. 8a and 8b using the method of the present invention;
FIG. 8D is a label image corresponding to the 3 rd pair of 3D images to be saliency detected;
FIG. 9a is an RGB image of the 4 th pair of 3D images to be saliency detected;
FIG. 9b is a depth image of the 4 th pair of 3D images to be saliency detected;
FIG. 9c is a predicted salient image obtained by processing FIGS. 9a and 9b using the method of the present invention;
FIG. 9D is a label image corresponding to the 4 th pair of 3D images to be saliency detected;
FIG. 10a is a PR (precision-recall) plot obtained by processing the 3D images for testing in the NJU2K dataset using the method of the present invention;
fig. 10b is a PR (precision-recall) plot obtained by processing a 3D image for detection in an NLPR dataset using the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The invention provides a significance image detection method for interactive cycle characteristic remodeling.
The specific steps of the training phase process are as follows:
step 1_1: selecting N pairs of original 3D images and the label image corresponding to each pair of original 3D images, recording the RGB image of the k-th pair of original 3D images as {I_k(x,y)}, recording the depth image of the k-th pair of original 3D images as {D_k(x,y)}, and taking the real saliency detection image corresponding to the k-th pair of original 3D images as the label image, recorded as {G_k(x,y)}; then the RGB images, the depth images and the corresponding label images of all the original 3D images form a training set. Each pair of original 3D images comprises an RGB image and a depth image, N is a positive integer, N ≥ 200 (for example, N = 600), k is a positive integer, 1 ≤ k ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, W represents the width of the original 3D image and of its RGB image, depth image and corresponding label image, H represents the height of the original 3D image and of its RGB image, depth image and corresponding label image, and W = H = 224 in this embodiment; I_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {I_k(x,y)}, D_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {D_k(x,y)}, and G_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {G_k(x,y)}.
Step 1_ 2: constructing an end-to-end convolutional neural network: as shown in fig. 1, the convolutional neural network comprises an input layer, an encoding part, a decoding part and an output layer, wherein the input layer comprises an RGB map input layer and a depth map input layer, the encoding part comprises 10 neural network blocks, and the decoding part comprises 2 information extraction blocks, 5 feature reconstruction blocks, 4 information reconstruction blocks, 5 expansion convolution blocks and 5 feature aggregation blocks; the output layer comprises an output convolutional layer, the size of a convolution kernel of the output convolutional layer is 3 multiplied by 3, the number of the convolution kernels is 1, the step length is 1, and the output convolutional layer is a commonly used convolutional layer.
For an RGB image input layer in the input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original RGB image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original RGB image to the encoding part; wherein, the width of the original RGB image is W, and the height is H.
For a depth map input layer in the input layer, the input end of the input layer receives a three-channel depth map processed by an original depth image by adopting a copying method, and the output end of the input layer outputs the three-channel depth map to a coding part; wherein, the width of the original depth image is W, and the height is H.
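As an illustration of the copying method mentioned above, the following minimal sketch (not taken from the patent itself; the tensor name `depth` is hypothetical) shows one way of replicating a single-channel depth map into a three-channel input in PyTorch:

```python
import torch

# A single-channel depth image of height H and width W (values here are arbitrary).
H, W = 224, 224
depth = torch.rand(1, 1, H, W)        # shape: (batch, 1, H, W)

# Replicate the single channel three times so the depth coding stream can reuse a
# three-channel backbone, as the copying method in the text suggests.
depth_3ch = depth.repeat(1, 3, 1, 1)  # shape: (batch, 3, H, W)
print(depth_3ch.shape)                # torch.Size([1, 3, 224, 224])
```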
For the coding part, the 1st neural network block, the 2nd neural network block, the 3rd neural network block, the 4th neural network block and the 5th neural network block are sequentially connected to form the color coding stream, and the 6th neural network block, the 7th neural network block, the 8th neural network block, the 9th neural network block and the 10th neural network block are sequentially connected to form the depth coding stream. The input end of the 1st neural network block receives the R channel component, the G channel component and the B channel component of the original RGB image output by the output end of the RGB image input layer; the output end of the 1st neural network block outputs 64 feature maps, the set formed by these 64 feature maps is recorded as S1, and each feature map in S1 has a width of W and a height of H. The input end of the 2nd neural network block receives all the feature maps in S1; its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as S2, and each feature map in S2 has a width of W/2 and a height of H/2. The input end of the 3rd neural network block receives all the feature maps in S2; its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as S3, and each feature map in S3 has a width of W/4 and a height of H/4. The input end of the 4th neural network block receives all the feature maps in S3; its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as S4, and each feature map in S4 has a width of W/8 and a height of H/8. The input end of the 5th neural network block receives all the feature maps in S4; its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as S5, and each feature map in S5 has a width of W/16 and a height of H/16. The input end of the 6th neural network block receives the three-channel depth map output by the output end of the depth map input layer; its output end outputs 64 feature maps, the set formed by these 64 feature maps is recorded as D1, and each feature map in D1 has a width of W and a height of H. The input end of the 7th neural network block receives all the feature maps in D1; its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as D2, and each feature map in D2 has a width of W/2 and a height of H/2. The input end of the 8th neural network block receives all the feature maps in D2; its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as D3, and each feature map in D3 has a width of W/4 and a height of H/4. The input end of the 9th neural network block receives all the feature maps in D3; its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as D4, and each feature map in D4 has a width of W/8 and a height of H/8. The input end of the 10th neural network block receives all the feature maps in D4; its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as D5, and each feature map in D5 has a width of W/16 and a height of H/16. The encoding part provides all the feature maps of S1, S2, S3, S4, S5, D1, D2, D3, D4 and D5 to the decoding part.
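Since the neural network blocks adopt the structure of the VGG-16 model (see below), the two coding streams described above can be sketched with torchvision's VGG-16 feature extractor split at its five convolutional stages. The split indices, class names and the decision whether to load pretrained weights are illustrative assumptions, not the patent's code:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

def make_vgg_stages():
    # Indices 0-3, 4-8, 9-15, 16-22, 23-29 of VGG-16 'features' correspond to the five
    # convolutional stages with 64, 128, 256, 512, 512 channels at resolutions
    # W x H, W/2 x H/2, W/4 x H/4, W/8 x H/8, W/16 x H/16.
    feats = vgg16().features           # ImageNet weights could be loaded here if desired
    cuts = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]
    return nn.ModuleList([nn.Sequential(*[feats[i] for i in range(a, b)]) for a, b in cuts])

class TwoStreamEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.color_stages = make_vgg_stages()   # blocks 1-5: color coding stream
        self.depth_stages = make_vgg_stages()   # blocks 6-10: depth coding stream

    def forward(self, rgb, depth3):
        s_feats, d_feats, x, y = [], [], rgb, depth3
        for cs, ds in zip(self.color_stages, self.depth_stages):
            x, y = cs(x), ds(y)
            s_feats.append(x)   # S1..S5
            d_feats.append(y)   # D1..D5
        return s_feats, d_feats
```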
For the decoding part, the input end of the 1st information extraction block receives all the feature maps in D1; its output end outputs 64 feature maps, the set formed by these 64 feature maps is recorded as F1, and each feature map in F1 has a width of W and a height of H. The first input end of the 1st feature reconstruction block receives all the feature maps in S1, its second input end receives all the feature maps in F1, its output end outputs 64 feature maps, the set formed by these 64 feature maps is recorded as F2, and each feature map in F2 has a width of W and a height of H. The first input end of the 1st information reconstruction block receives all the feature maps in F2, its second input end receives all the feature maps in D2, its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as F3, and each feature map in F3 has a width of W/2 and a height of H/2. The first input end of the 2nd feature reconstruction block receives all the feature maps in S2, its second input end receives all the feature maps in F3, its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as F4, and each feature map in F4 has a width of W/2 and a height of H/2. The first input end of the 2nd information reconstruction block receives all the feature maps in F4, its second input end receives all the feature maps in D3, its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as F5, and each feature map in F5 has a width of W/4 and a height of H/4. The first input end of the 3rd feature reconstruction block receives all the feature maps in S3, its second input end receives all the feature maps in F5, its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as F6, and each feature map in F6 has a width of W/4 and a height of H/4. The first input end of the 3rd information reconstruction block receives all the feature maps in F6, its second input end receives all the feature maps in D4, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as F7, and each feature map in F7 has a width of W/8 and a height of H/8. The first input end of the 4th feature reconstruction block receives all the feature maps in S4, its second input end receives all the feature maps in F7, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as F8, and each feature map in F8 has a width of W/8 and a height of H/8. The first input end of the 4th information reconstruction block receives all the feature maps in F8, its second input end receives all the feature maps in D5, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as F9, and each feature map in F9 has a width of W/16 and a height of H/16. The first input end of the 5th feature reconstruction block receives all the feature maps in S5, its second input end receives all the feature maps in F9, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as F10, and each feature map in F10 has a width of W/16 and a height of H/16. The input end of the 2nd information extraction block receives all the feature maps in S5, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as F11, and each feature map in F11 has a width of W/16 and a height of H/16. The input end of the 1st expansion convolution block receives all the feature maps in D1, its output end outputs 64 feature maps, the set formed by these 64 feature maps is recorded as P1, and each feature map in P1 has a width of W and a height of H. The input end of the 2nd expansion convolution block receives all the feature maps in D2, its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as P2, and each feature map in P2 has a width of W/2 and a height of H/2. The input end of the 3rd expansion convolution block receives all the feature maps in D3, its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as P3, and each feature map in P3 has a width of W/4 and a height of H/4. The input end of the 4th expansion convolution block receives all the feature maps in D4, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as P4, and each feature map in P4 has a width of W/8 and a height of H/8. The input end of the 5th expansion convolution block receives all the feature maps in D5, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as P5, and each feature map in P5 has a width of W/16 and a height of H/16. The first input end of the 1st feature aggregation block receives all the feature maps in F10, its second input end receives all the feature maps in P5, its third input end receives all the feature maps in F11, its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as A1, and each feature map in A1 has a width of W/16 and a height of H/16. The first input end of the 2nd feature aggregation block receives all the feature maps in F8, its second input end receives all the feature maps in P4, its third input end receives all the feature maps in A1, its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as A2, and each feature map in A2 has a width of W/8 and a height of H/8. The first input end of the 3rd feature aggregation block receives all the feature maps in F6, its second input end receives all the feature maps in P3, its third input end receives all the feature maps in A2, its output end outputs 64 feature maps, the set formed by these 64 feature maps is recorded as A3, and each feature map in A3 has a width of W/4 and a height of H/4. The first input end of the 4th feature aggregation block receives all the feature maps in F4, its second input end receives all the feature maps in P2, its third input end receives all the feature maps in A3, its output end outputs 32 feature maps, the set formed by these 32 feature maps is recorded as A4, and each feature map in A4 has a width of W/2 and a height of H/2. The first input end of the 5th feature aggregation block receives all the feature maps in F2, its second input end receives all the feature maps in P1, its third input end receives all the feature maps in A4, its output end outputs 16 feature maps, the set formed by these 16 feature maps is recorded as A5, and each feature map in A5 has a width of W and a height of H. The decoding part supplies all the feature maps in A5 to the output layer.
For the output layer, the input end of the output convolutional layer receives all the feature maps in A5, and the output end of the output convolutional layer outputs a feature map with the width W and the height H as a significance detection map.
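A minimal sketch of the output layer as described (3 × 3 kernel, a single convolution kernel, stride 1); the padding of 1 is assumed here so that the W × H spatial size is preserved, and the sigmoid is only one plausible way of mapping the response to a saliency probability map:

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    def __init__(self, in_channels=16):
        super().__init__()
        # 3x3 convolution, 1 output channel, stride 1; padding=1 keeps width W and height H.
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, stride=1, padding=1)

    def forward(self, a5):
        return torch.sigmoid(self.conv(a5))   # one W x H saliency detection map

# A5 holds 16 feature maps of size W x H, so in_channels = 16 here.
saliency = OutputLayer(16)(torch.rand(1, 16, 224, 224))
print(saliency.shape)   # torch.Size([1, 1, 224, 224])
```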
Step 1_3: input the R channel component, the G channel component and the B channel component of the RGB image of each pair of original 3D images in the training set, together with the three-channel depth map obtained by copying its depth image, into the convolutional neural network for training, obtain the saliency detection map corresponding to each pair of original 3D images, and record the saliency detection map corresponding to the k-th pair of original 3D images as {S_k(x,y)}; wherein S_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {S_k(x,y)}.
Step 1_4: calculate the loss function value between the saliency detection map corresponding to each pair of original 3D images and the corresponding label image; the loss function value between {S_k(x,y)} and {G_k(x,y)} is recorded as Loss_k. In this embodiment, the loss function value is obtained using the conventional binary cross-entropy.
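The binary cross-entropy mentioned above compares each predicted pixel with the corresponding label pixel; a minimal sketch (the function and variable names are illustrative, and the network output is assumed to already lie in [0, 1]):

```python
import torch
import torch.nn.functional as F

def saliency_loss(pred, label):
    # pred:  predicted saliency detection map in [0, 1], shape (B, 1, H, W)
    # label: ground-truth label image with values in {0, 1}, shape (B, 1, H, W)
    return F.binary_cross_entropy(pred, label)

pred = torch.rand(4, 1, 224, 224)
label = (torch.rand(4, 1, 224, 224) > 0.5).float()
print(saliency_loss(pred, label).item())
```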
Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for M times to obtain a convolutional neural network training model, and obtaining N multiplied by M loss function values; dividing the sum of the N loss function values obtained by each execution by N to obtain final loss function values obtained by the execution, and obtaining M final loss function values in total; finding out the final loss function value with the minimum value from the M final loss function values, and correspondingly taking the weight vector and the bias item corresponding to the minimum final loss function value as the optimal weight vector and the optimal bias item of the convolutional neural network training model; where M > 1, M is 1025 in this example.
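Steps 1_3 to 1_5 amount to a standard training loop that keeps the weights of the epoch with the smallest mean loss; the sketch below assumes that `model`, `train_loader`, an optimizer and a loss function already exist (none of these names come from the patent):

```python
import copy
import torch

def train(model, train_loader, optimizer, loss_fn, M, device="cuda"):
    best_loss, best_state = float("inf"), None
    for epoch in range(M):                       # repeat step 1_3 and step 1_4 M times
        total, count = 0.0, 0
        for rgb, depth3, label in train_loader:  # the N training pairs of one epoch
            rgb, depth3, label = rgb.to(device), depth3.to(device), label.to(device)
            pred = model(rgb, depth3)
            loss = loss_fn(pred, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item() * rgb.size(0)
            count += rgb.size(0)
        mean_loss = total / count                # final loss function value of this epoch
        if mean_loss < best_loss:                # keep the weights with the minimum final loss
            best_loss = mean_loss
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)            # optimal weight vector and bias terms
    return model
```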
The test stage process comprises the following specific steps:
step 2_1: input the R channel component, the G channel component and the B channel component of the RGB image of the 3D image to be saliency detected, together with the three-channel depth map obtained by copying its depth image, into the convolutional neural network training model, predict using the optimal weight vector and the optimal bias terms, and obtain the corresponding saliency prediction image.
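The test stage thus reduces to a single forward pass with the stored optimal weights; a sketch under assumed file and helper names:

```python
import torch

def predict(model, rgb, depth, weights_path="best_model.pth", device="cuda"):
    # rgb:   (1, 3, H, W) RGB image of the 3D image pair to be saliency detected
    # depth: (1, 1, H, W) depth image; copied to three channels as in the training stage
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.eval().to(device)
    with torch.no_grad():
        depth3 = depth.repeat(1, 3, 1, 1).to(device)
        return model(rgb.to(device), depth3)     # predicted saliency image
```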
In this embodiment, in step 1_2, the 2 information extraction blocks have the same structure, and as shown in fig. 2, each information extraction block consists of a 1st convolution block, a first maximum pooling layer (Max pooling), a first average pooling layer (Average pooling), a 2nd convolution block, a 3rd convolution block and a first up-sampling layer. The 1st convolution block comprises a first convolution layer (Conv), a first activation layer (Act), a second convolution layer and a second activation layer which are connected in sequence, the 2nd convolution block comprises a third convolution layer and a third activation layer which are connected in sequence, and the 3rd convolution block comprises a fourth convolution layer and a fourth activation layer which are connected in sequence. The input end of the first convolution layer in the 1st information extraction block receives all the feature maps in D1, and the input end of the first convolution layer in the 2nd information extraction block receives all the feature maps in S5. The input end of the first maximum pooling layer, the input end of the first average pooling layer and the input end of the third convolution layer all receive all the feature maps output by the output end of the second activation layer; a channel-number superposition operation is performed on all the feature maps output by the output end of the first maximum pooling layer and all the feature maps output by the output end of the first average pooling layer; the input end of the fourth convolution layer receives all the feature maps obtained after the channel-number superposition operation; the input end of the first up-sampling layer receives all the feature maps output by the output end of the fourth activation layer; an element multiplication operation is performed on all the feature maps output by the output end of the first up-sampling layer and all the feature maps output by the output end of the third activation layer; and an element addition operation is performed on all the feature maps output by the output end of the first up-sampling layer and all the feature maps obtained after the element multiplication operation. For the 1st information extraction block, the set formed by all the feature maps obtained after the element addition operation is F1; for the 2nd information extraction block, the set formed by all the feature maps obtained after the element addition operation is F11. The number of input channels of the i-th information extraction block is denoted n_i; then the number of input channels of the 1st information extraction block is n_1 = 64 and the number of input channels of the 2nd information extraction block is n_2 = 512. The convolution kernel size (kernel_size) of the first convolution layer and of the fourth convolution layer in the i-th information extraction block is 1 × 1, the number of convolution kernels (filters) is n_i, the step size (stride) is 1 and the zero padding parameter (padding) value is 0; the convolution kernel size of the second convolution layer in the i-th information extraction block is 3 × 3, the number of convolution kernels is n_i, the step size is 1 and the zero padding parameter value is 0; the convolution kernel size of the third convolution layer in the i-th information extraction block is 3 × 3, the number of convolution kernels is n_i, the step size is 1 and the zero padding parameter value is 1; i = 1, 2. The activation mode of the first activation layer, the second activation layer, the third activation layer and the fourth activation layer is "Relu"; the kernel size of the first maximum pooling layer and of the first average pooling layer is 2 × 2, the step size is 2 and the zero padding parameter value is 0; the amplification factor (scale factor) of the first up-sampling layer is 2 and the interpolation method is bilinear interpolation (bilinear). Here, the channel-number superposition operation, the element multiplication operation and the element addition operation are all existing techniques. C in fig. 2 denotes the channel-number superposition operation, + denotes the element addition operation, and × denotes the element multiplication operation.
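One plausible reading of the information extraction block is sketched below: a pooled (max plus average) branch produces a gating map that is upsampled and used to reweight the full-resolution branch. Padding values are chosen here so that all spatial sizes match and are assumptions, not the exact settings listed in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoExtractionBlock(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.block1 = nn.Sequential(                       # 1st convolution block
            nn.Conv2d(n, n, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n, n, 3, padding=1), nn.ReLU(inplace=True))
        self.block2 = nn.Sequential(                       # 2nd convolution block
            nn.Conv2d(n, n, 3, padding=1), nn.ReLU(inplace=True))
        self.block3 = nn.Sequential(                       # 3rd convolution block
            nn.Conv2d(2 * n, n, 1), nn.ReLU(inplace=True))
        self.maxpool = nn.MaxPool2d(2, stride=2)
        self.avgpool = nn.AvgPool2d(2, stride=2)

    def forward(self, x):
        x0 = self.block1(x)
        pooled = torch.cat([self.maxpool(x0), self.avgpool(x0)], dim=1)  # channel superposition
        gate = F.interpolate(self.block3(pooled), scale_factor=2,
                             mode="bilinear", align_corners=False)       # first up-sampling layer
        branch = self.block2(x0)
        return gate * branch + gate      # element multiplication, then element addition

# 1st information extraction block: D1 has 64 channels at 224 x 224.
f1 = InfoExtractionBlock(64)(torch.rand(1, 64, 224, 224))
print(f1.shape)   # torch.Size([1, 64, 224, 224])
```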
In this embodiment, in step 1_2, the 5 feature reconstruction blocks have the same structure, and as shown in fig. 3, each feature reconstruction block consists of a context attention block and a channel attention block. For the 1st feature reconstruction block, a first element addition operation is performed on all the feature maps in S1 and all the feature maps in F1; the input end of the context attention block receives all the feature maps obtained after the first element addition operation; the input end of the channel attention block receives all the feature maps output by the output end of the context attention block; an element multiplication operation is performed on all the feature maps output by the output end of the channel attention block and all the feature maps obtained after the first element addition operation; a second element addition operation is performed on all the feature maps in S1 and all the feature maps obtained after the element multiplication operation; and the set formed by all the feature maps obtained after the second element addition operation is F2. For the 2nd feature reconstruction block, the same operations are performed on all the feature maps in S2 and all the feature maps in F3, and the set formed by all the feature maps obtained after the second element addition operation is F4. For the 3rd feature reconstruction block, the same operations are performed on all the feature maps in S3 and all the feature maps in F5, and the set formed by all the feature maps obtained after the second element addition operation is F6. For the 4th feature reconstruction block, the same operations are performed on all the feature maps in S4 and all the feature maps in F7, and the set formed by all the feature maps obtained after the second element addition operation is F8. For the 5th feature reconstruction block, the same operations are performed on all the feature maps in S5 and all the feature maps in F9, and the set formed by all the feature maps obtained after the second element addition operation is F10. Here, the context attention block and the channel attention block refer to the DAM module from M. Zhang, S.-X. Fei, J. Liu, S. Xu, Y. Piao, and H. Lu, "Asymmetric two-stream architecture for accurate RGB-D saliency detection", in Proceedings of the European Conference on Computer Vision, 2020. + in fig. 3 denotes the element addition operation and × denotes the element multiplication operation.
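A sketch of the feature reconstruction block follows: the color features S and the previous fused features F are added, the sum is passed through context and channel attention, and the attention output re-modulates the sum before a residual connection back to S. The attention sub-blocks here are simple placeholders, not the DAM module cited above:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style placeholder for the channel attention block."""
    def __init__(self, n, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(n, n // r), nn.ReLU(inplace=True),
                                nn.Linear(n // r, n), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))      # global average pooling over H, W
        return x * w[:, :, None, None]

class FeatureReconstructionBlock(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.context_att = nn.Sequential(    # placeholder context attention block
            nn.Conv2d(n, n, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.channel_att = ChannelAttention(n)

    def forward(self, s, f):
        t = s + f                            # first element addition
        a = self.channel_att(self.context_att(t))
        return s + a * t                     # multiplication, then second element addition

# 1st feature reconstruction block: S1 and F1 both have 64 channels at 224 x 224.
f2 = FeatureReconstructionBlock(64)(torch.rand(1, 64, 224, 224), torch.rand(1, 64, 224, 224))
print(f2.shape)   # torch.Size([1, 64, 224, 224])
```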
In this embodiment, in step 1_2, the 4 information reconstruction blocks have the same structure, and as shown in fig. 4, each information reconstruction block consists of a second maximum pooling layer, a second average pooling layer, a 4th convolution block and a 5th convolution block; the 4th convolution block comprises a fifth convolution layer and a fifth activation layer which are connected in sequence, and the 5th convolution block comprises a sixth convolution layer, a sixth activation layer, a seventh convolution layer and a seventh activation layer which are connected in sequence. The input end of the second maximum pooling layer and the input end of the second average pooling layer in the 1st information reconstruction block both receive all the feature maps in F2, and the input end of the sixth convolution layer receives all the feature maps in D2; in the 2nd information reconstruction block they receive all the feature maps in F4 and the input end of the sixth convolution layer receives all the feature maps in D3; in the 3rd information reconstruction block they receive all the feature maps in F6 and the input end of the sixth convolution layer receives all the feature maps in D4; in the 4th information reconstruction block they receive all the feature maps in F8 and the input end of the sixth convolution layer receives all the feature maps in D5. An element subtraction operation is performed on all the feature maps output by the output end of the second maximum pooling layer and all the feature maps output by the output end of the second average pooling layer; the input end of the fifth convolution layer receives all the feature maps obtained after the element subtraction operation; an element multiplication operation is performed on all the feature maps output by the output end of the fifth activation layer and all the feature maps output by the output end of the seventh activation layer; and an element addition operation is performed on all the feature maps output by the output end of the fifth activation layer and all the feature maps obtained after the element multiplication operation. For the 1st information reconstruction block, the set formed by all the feature maps obtained after the element addition operation is F3; for the 2nd information reconstruction block it is F5; for the 3rd information reconstruction block it is F7; and for the 4th information reconstruction block it is F9. The number of input channels of the first input end of the j-th information reconstruction block is denoted n1_j and the number of input channels of its second input end is denoted n2_j; then n1_1 = 64 and n2_1 = 128 for the 1st information reconstruction block, n1_2 = 128 and n2_2 = 256 for the 2nd, n1_3 = 256 and n2_3 = 512 for the 3rd, and n1_4 = 512 and n2_4 = 512 for the 4th, with j = 1, 2, 3, 4. The convolution kernel size of the fifth convolution layer in the j-th information reconstruction block is 1 × 1, the number of convolution kernels is n2_j, the step size is 1 and the zero padding parameter value is 0; the convolution kernel size of the sixth convolution layer is 1 × 1, the number of convolution kernels is n2_j, the step size is 1 and the zero padding parameter value is 0; the convolution kernel size of the seventh convolution layer is 3 × 3, the number of convolution kernels is n2_j, the step size is 1 and the zero padding parameter value is 1. The activation mode of the fifth activation layer, the sixth activation layer and the seventh activation layer is "Relu"; the kernel size of the second maximum pooling layer and of the second average pooling layer is 2 × 2, the step size is 2 and the zero padding parameter value is 0. When the element subtraction operation is performed on all the feature maps output by the output end of the second maximum pooling layer and all the feature maps output by the output end of the second average pooling layer, each element in a feature map output by the second maximum pooling layer is subtracted by the corresponding element in the corresponding feature map output by the second average pooling layer. Here, the element subtraction operation, the element multiplication operation and the element addition operation are all existing techniques. In fig. 4, "-" denotes the element subtraction operation, + denotes the element addition operation, and × denotes the element multiplication operation.
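A sketch of the information reconstruction block as described: the difference of max- and average-pooled fused features gates the adjacent depth features, followed by a residual addition. The class and variable names are assumptions, and channel/spatial bookkeeping follows the figures reconstructed above:

```python
import torch
import torch.nn as nn

class InfoReconstructionBlock(nn.Module):
    def __init__(self, n1, n2):
        super().__init__()
        self.maxpool = nn.MaxPool2d(2, stride=2)
        self.avgpool = nn.AvgPool2d(2, stride=2)
        self.block4 = nn.Sequential(                  # 4th convolution block
            nn.Conv2d(n1, n2, 1), nn.ReLU(inplace=True))
        self.block5 = nn.Sequential(                  # 5th convolution block
            nn.Conv2d(n2, n2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n2, n2, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f, d):
        diff = self.maxpool(f) - self.avgpool(f)      # element subtraction
        gate = self.block4(diff)                      # fifth convolution layer + activation
        depth = self.block5(d)                        # sixth/seventh convolution layers
        return gate + gate * depth                    # multiplication, then addition

# 1st information reconstruction block: F2 has 64 channels at 224 x 224, D2 has 128 at 112 x 112.
f3 = InfoReconstructionBlock(64, 128)(torch.rand(1, 64, 224, 224), torch.rand(1, 128, 112, 112))
print(f3.shape)   # torch.Size([1, 128, 112, 112])
```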
In this embodiment, in step 1_2, the 5 feature aggregation blocks have the same structure, and as shown in fig. 5, each feature aggregation block consists of a 6th convolution block, a 7th convolution block, an 8th convolution block, a 9th convolution block, a 10th convolution block, an 11th convolution block, a 12th convolution block, a 13th convolution block, a second up-sampling layer and a residual fusion block. The 6th to 13th convolution blocks respectively comprise the eighth to fifteenth convolution layers, each followed by the corresponding eighth to fifteenth activation layer and connected in sequence; the residual fusion block comprises a sixteenth activation layer, a third maximum pooling layer and a sixteenth convolution layer which are connected in sequence. The input end of the eighth convolution layer receives all the feature maps of the first input end of the feature aggregation block (F10, F8, F6, F4 and F2 for the 1st to 5th feature aggregation blocks respectively), the input end of the ninth convolution layer receives all the feature maps of the second input end (P5, P4, P3, P2 and P1 respectively), and the input end of the second up-sampling layer receives all the feature maps of the third input end (F11, A1, A2, A3 and A4 respectively). All the feature maps output by the output end of the eighth activation layer and all the feature maps output by the output end of the ninth activation layer are each cut into four equal groups along the channel dimension; the 1st group of the eighth activation layer output and the 1st group of the ninth activation layer output are subjected to a first channel-number superposition operation, the 2nd groups to a second channel-number superposition operation, the 3rd groups to a third channel-number superposition operation, and the 4th groups to a fourth channel-number superposition operation. The input end of the tenth convolution layer receives all the feature maps output by the output end of the second up-sampling layer; the input ends of the eleventh, twelfth, thirteenth and fourteenth convolution layers respectively receive all the feature maps obtained after the first, second, third and fourth channel-number superposition operations; a fifth channel-number superposition operation is performed on all the feature maps output by the output ends of the eleventh, twelfth, thirteenth and fourteenth activation layers, and the input end of the fifteenth convolution layer receives all the feature maps obtained after the fifth channel-number superposition operation. An element multiplication operation is performed on all the feature maps output by the output end of the tenth activation layer and all the feature maps output by the output end of the fifteenth activation layer, and a first element addition operation is performed on all the feature maps output by the output end of the tenth activation layer and all the feature maps obtained after the element multiplication operation. The input end of the sixteenth activation layer receives all the feature maps obtained after the first element addition operation, and a second element addition operation is performed on all the feature maps output by the output end of the sixteenth convolution layer and all the feature maps obtained after the first element addition operation. For the 1st to 5th feature aggregation blocks, the sets formed by all the feature maps obtained after the second element addition operation are A1, A2, A3, A4 and A5 respectively. The numbers of input channels of the first, second and third input ends of the m-th feature aggregation block are denoted n1_m, n2_m and n3_m; then n1_1 = n2_1 = n3_1 = 512 for the 1st feature aggregation block, n1_2 = n2_2 = 512 and n3_2 = 256 for the 2nd, n1_3 = n2_3 = 256 and n3_3 = 128 for the 3rd, n1_4 = n2_4 = 128 and n3_4 = 64 for the 4th, and n1_5 = n2_5 = 64 and n3_5 = 32 for the 5th, with m = 1, 2, 3, 4, 5. The convolution kernel size of the eighth to fourteenth convolution layers in the m-th feature aggregation block is 3 × 3, the number of convolution kernels is n3_m, the step size is 1 and the zero padding parameter value is 1; the convolution kernel size of the fifteenth convolution layer is 3 × 3, the number of convolution kernels is n3_m, the step size is 1 and the zero padding parameter value is 0; the convolution kernel size of the sixteenth convolution layer is 3 × 3, the number of convolution kernels is n3_m/2, the step size is 1 and the zero padding parameter value is 0. The activation mode of the eighth to sixteenth activation layers is "Relu"; the kernel size of the third maximum pooling layer is 5 × 5, the step size is 1 and the zero padding parameter value is 2; the amplification factor of the second up-sampling layer is 2 and the interpolation method is bilinear interpolation. Here, the channel-number superposition operation, the element multiplication operation and the element addition operation are all existing techniques. C in fig. 5 denotes the channel-number superposition operation, + denotes the element addition operation, and × denotes the element multiplication operation.
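A simplified sketch of the feature aggregation block follows: the fused features F and the dilated-convolution features P are each split into four channel groups, the groups are concatenated pairwise and re-convolved (local/global mixing), the coarser aggregation result is upsampled and used as a gate, and a residual fusion (activation, 5 × 5 max pooling, convolution) refines the sum. The channel counts are simplified so that all shapes are consistent and do not reproduce every kernel count listed in the text; all names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3x3(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class FeatureAggregationBlock(nn.Module):
    def __init__(self, n, n_prev, n_out):
        super().__init__()
        self.conv_f = conv3x3(n, n)            # 6th convolution block (on F)
        self.conv_p = conv3x3(n, n)            # 7th convolution block (on P)
        self.conv_a = conv3x3(n_prev, n)       # 8th convolution block (on upsampled A_prev)
        self.mix = nn.ModuleList([conv3x3(n // 2, n // 4) for _ in range(4)])
        self.conv_mix = conv3x3(n, n)          # after re-concatenating the four mixed groups
        self.residual = nn.Sequential(nn.ReLU(inplace=True),
                                      nn.MaxPool2d(5, stride=1, padding=2),
                                      nn.Conv2d(n, n, 3, padding=1))
        self.reduce = conv3x3(n, n_out)        # bring the result to the block's output width

    def forward(self, f, p, a_prev):
        a_up = self.conv_a(F.interpolate(a_prev, scale_factor=2,
                                         mode="bilinear", align_corners=False))
        fs = torch.chunk(self.conv_f(f), 4, dim=1)       # channel quartering of the F branch
        ps = torch.chunk(self.conv_p(p), 4, dim=1)       # channel quartering of the P branch
        mixed = torch.cat([m(torch.cat([a_, b_], dim=1))
                           for m, a_, b_ in zip(self.mix, fs, ps)], dim=1)
        fused = a_up * self.conv_mix(mixed) + a_up       # gating plus first addition
        return self.reduce(fused + self.residual(fused)) # residual fusion, second addition

# 2nd feature aggregation block: F8 and P4 have 512 channels at 28 x 28, A1 has 256 at 14 x 14.
a2 = FeatureAggregationBlock(512, 256, 128)(torch.rand(1, 512, 28, 28),
                                            torch.rand(1, 512, 28, 28),
                                            torch.rand(1, 256, 14, 14))
print(a2.shape)   # torch.Size([1, 128, 28, 28])
```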
In this embodiment, the structures of the 10 neural network blocks are the same and adopt the structure of the neural network blocks in the existing VGG-16 model; the 5 expansion (dilated) convolution blocks have the same structure and adopt the RFB module from S. Liu and D. Huang, "Receptive field block net for accurate and fast object detection", In Proceedings of the European Conference on Computer Vision, 2018, pp. 385-400.
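The following minimal stand-in only illustrates the core idea of the expansion convolution block, enlarging the receptive field with parallel dilated 3 × 3 convolutions; it is not the cited RFB module, and the dilation rates are assumptions:

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    def __init__(self, n, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(n, n, 3, padding=d, dilation=d), nn.ReLU(inplace=True))
            for d in dilations])
        self.fuse = nn.Conv2d(len(dilations) * n, n, 1)   # fuse the multi-dilation branches

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# 1st expansion convolution block: D1 has 64 channels at 224 x 224, P1 keeps the same size.
p1 = DilatedConvBlock(64)(torch.rand(1, 64, 224, 224))
print(p1.shape)   # torch.Size([1, 64, 224, 224])
```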
To further illustrate the feasibility and effectiveness of the method of the present invention, experiments were conducted on the method of the present invention.
The method is tested with code written in Python using the PyTorch library; the experimental equipment is an Intel i5-7500 processor, and CUDA acceleration is used with an NVIDIA TITAN XP 12GB graphics card. In order to ensure the rigor of the experiments, the data sets selected are NJU2K and NLPR, which are well-known public data sets. NJU2K contains 1485 pairs of 3D images, with 1400 pairs used for training and 85 pairs used for detection; NLPR contains 730 pairs of 3D images, with 650 pairs used for training and 80 pairs used for detection.
In this experiment, 4 objective parameters commonly used for evaluating saliency detection methods are used as evaluation indexes: S↑ (Structure-measure), used for evaluating the structural similarity between the salient regions of the saliency detection map and of the label image; adpE↑ (adaptive E-measure) and adpF↑ (adaptive F-measure), used for evaluating the detection performance of the saliency detection map, where adpF is an important index for judging the quality of a detection method and is computed from the precision rate and the recall rate; and MAE↓ (Mean Absolute Error).
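For reference, MAE and the F-measure underlying adpF↑ can be computed as below; the adaptive threshold (twice the mean of the saliency map) and β² = 0.3 follow common practice in the saliency detection literature and are assumptions here, not values stated in the patent:

```python
import numpy as np

def mae(pred, gt):
    # Mean absolute error between a saliency map and its label image, both in [0, 1].
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def adaptive_f_measure(pred, gt, beta2=0.3):
    # Binarise the prediction with an adaptive threshold (twice its mean value),
    # then combine precision and recall into the F-measure.
    binary = pred >= min(2.0 * pred.mean(), 1.0)
    gt = gt > 0.5
    tp = np.logical_and(binary, gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

pred = np.random.rand(224, 224)
gt = (np.random.rand(224, 224) > 0.5).astype(np.float64)
print(mae(pred, gt), adaptive_f_measure(pred, gt))
```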
The saliency detection maps generated by the method of the invention are compared with the label images, and the method is evaluated with S↑, adpE↑, adpF↑ and MAE↓ as evaluation indexes. The evaluation indexes on the two data sets are listed in Table 1; the data listed in Table 1 show that the method of the invention performs well on both data sets.
TABLE 1 evaluation results of the method of the invention on two data sets
Fig. 6a is an RGB image of the 1 st pair of 3D images to be subjected to saliency detection, fig. 6b is a depth image of the 1 st pair of 3D images to be subjected to saliency detection, fig. 6c is a saliency prediction image obtained by processing the fig. 6a and 6b by using the method of the present invention, and fig. 6D is a label image corresponding to the 1 st pair of 3D images to be subjected to saliency detection; fig. 7a is an RGB image of the 2 nd pair of 3D images to be subjected to saliency detection, fig. 7b is a depth image of the 2 nd pair of 3D images to be subjected to saliency detection, fig. 7c is a saliency prediction image obtained by processing fig. 7a and 7b by using the method of the present invention, and fig. 7D is a label image corresponding to the 2 nd pair of 3D images to be subjected to saliency detection; fig. 8a is an RGB image of a3 rd pair of 3D images to be subjected to saliency detection, fig. 8b is a depth image of the 3 rd pair of 3D images to be subjected to saliency detection, fig. 8c is a saliency prediction image obtained by processing fig. 8a and 8b by using the method of the present invention, and fig. 8D is a label image corresponding to the 3 rd pair of 3D images to be subjected to saliency detection; fig. 9a is an RGB image of a4 th pair of 3D images to be subjected to saliency detection, fig. 9b is a depth image of the 4 th pair of 3D images to be subjected to saliency detection, fig. 9c is a saliency prediction image obtained by processing fig. 9a and 9b by using the method of the present invention, and fig. 9D is a label image corresponding to the 4 th pair of 3D images to be subjected to saliency detection. Fig. 6a and 6b, fig. 7a and 7b, fig. 8a and 8b, and fig. 9a and 9b are representative 3D images containing a plurality of objects, small objects, and complex salient objects, and these representative 3D images are processed by the method of the present invention, and the salient predictive images are correspondingly shown in fig. 6c, fig. 7c, fig. 8c, and fig. 9c, and compared with fig. 6D, fig. 7D, fig. 8D, and fig. 9D, it can be found that the salient regions in these 3D images can be accurately captured by the method of the present invention.
Fig. 10a is a PR (precision-recall) plot obtained by processing the 3D images for detection in the NJU2K dataset using the method of the present invention, and fig. 10b is a PR (precision-recall) plot obtained by processing the 3D images for detection in the NLPR dataset using the method of the present invention. As can be seen from fig. 10a and fig. 10b, the area under the PR curve is large, which indicates that the method of the present invention has good detection performance. Precision in fig. 10a and fig. 10b represents the "precision rate" and Recall represents the "recall rate".
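A PR curve like those in fig. 10a and fig. 10b is obtained by sweeping a threshold over the predicted saliency maps and accumulating precision and recall; a sketch with 256 threshold levels (the level count and function name are assumptions):

```python
import numpy as np

def pr_curve(preds, gts, levels=256):
    # preds, gts: lists of saliency maps and label images with values in [0, 1].
    thresholds = np.linspace(0, 1, levels)
    precision = np.zeros(levels)
    recall = np.zeros(levels)
    for pred, gt in zip(preds, gts):
        gt = gt > 0.5
        for i, t in enumerate(thresholds):
            binary = pred >= t
            tp = np.logical_and(binary, gt).sum()
            precision[i] += tp / (binary.sum() + 1e-8)
            recall[i] += tp / (gt.sum() + 1e-8)
    return precision / len(preds), recall / len(preds)

preds = [np.random.rand(224, 224) for _ in range(4)]
gts = [(np.random.rand(224, 224) > 0.5).astype(float) for _ in range(4)]
p, r = pr_curve(preds, gts)
print(p.shape, r.shape)   # (256,) (256,)
```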
Claims (5)
1. A method for detecting a significance image of interactive cycle feature remodeling is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting N pairs of original 3D images and label images corresponding to each pair of original 3D images, and recording RGB images of the kth pair of original 3D images asDenote the depth image of the k-th pair of original 3D images asTaking the real salient detection image corresponding to the k-th original 3D image as a label image and recording the label imageThen, forming a training set by the RGB images, the depth images and the corresponding label images of all the original 3D images; wherein N is a positive integer, N is not less than 200, k is a positive integer, k is not less than 1 and not more than 200, x is not less than 1 and not more than W, y is not less than 1 and not more than H, W represents the width of the original 3D image and the RGB image thereof, the depth image and the corresponding label image, H represents the height of the original 3D image and the RGB image thereof, the depth image and the corresponding label image,to representThe middle coordinate position is the pixel value of the pixel point of (x, y),to representThe middle coordinate position is the pixel value of the pixel point of (x, y),to representThe middle coordinate position is the pixel value of the pixel point of (x, y);
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, an encoding part, a decoding part and an output layer, wherein the input layer comprises an RGB (red, green and blue) image input layer and a depth image input layer, the encoding part comprises 10 neural network blocks, and the decoding part comprises 2 information extraction blocks, 5 characteristic reconstruction blocks, 4 information reconstruction blocks, 5 expansion convolution blocks and 5 characteristic aggregation blocks; the output layer comprises an output convolution layer, the size of convolution kernels of the output convolution layer is 3 multiplied by 3, the number of the convolution kernels is 1, and the step length is 1;
for an RGB image input layer in the input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original RGB image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original RGB image to the encoding part; wherein, the width of the original RGB image is W, and the height of the original RGB image is H;
for a depth map input layer in the input layer, the input end of the input layer receives a three-channel depth map processed by an original depth image by adopting a copying method, and the output end of the input layer outputs the three-channel depth map to a coding part; wherein the width of the original depth image is W, and the height of the original depth image is H;
for the coding part, the 1 st neural network block, the 2 nd neural network block, the 3 rd neural network block, the 4 th neural network block and the 5 th neural network block are connected in sequence to form a color coding stream, and the 6 th neural network block, the 7 th neural network block and the 8 th neural network block are connected in sequence to form a color coding streamThe network block, the 9 th neural network block and the 10 th neural network block are sequentially connected to form a depth coding stream; the input end of the 1 st neural network block receives an R channel component, a G channel component and a B channel component of an original RGB image output by the output end of the RGB image input layer, the output end of the 1 st neural network block outputs 64 feature maps, a set formed by the 64 feature maps is recorded as S1, and each feature map in S1 has a width W and a height H; the input end of the 2 nd neural network block receives all the feature maps in S1, the output end of the 2 nd neural network block outputs 128 feature maps, the set of the 128 feature maps is marked as S2, and the width of each feature map in S2 isHas a height ofThe input end of the 3 rd neural network block receives all the feature maps in S2, the output end of the 3 rd neural network block outputs 256 feature maps, the set of the 256 feature maps is marked as S3, and the width of each feature map in S3 isHas a height ofThe input end of the 4 th neural network block receives all the characteristic maps in S3, the output end of the 4 th neural network block outputs 512 characteristic maps, the set of the 512 characteristic maps is marked as S4, and the width of each characteristic map in S4 isHas a height ofThe input end of the 5 th neural network block receives all the feature maps in S4, the output end of the 5 th neural network block outputs 512 feature maps, and the 512 feature maps are outputThe set of constructs is denoted S5, and the width of each feature map in S5 isHas a height ofThe input end of the 6 th neural network block receives the three-channel depth map output by the output end of the depth map input layer, the output end of the 6 th neural network block outputs 64 feature maps, a set formed by the 64 feature maps is recorded as D1, and each feature map in D1 has the width of W and the height of H; the input end of the 7 th neural network block receives all the feature maps in D1, the output end of the 7 th neural network block outputs 128 feature maps, the set of the 128 feature maps is marked as D2, and the width of each feature map in D2 is D2Has a height ofThe input end of the 8 th neural network block receives all the feature maps in D2, the output end of the 8 th neural network block outputs 256 feature maps, the set of the 256 feature maps is marked as D3, and the width of each feature map in D3 is D3Has a height ofThe input end of the 9 th neural network block receives all the feature maps in D3, the output end of the 9 th neural network block outputs 512 feature maps, the set of the 512 feature maps is marked as D4, and the width of each feature map in D4 is D4Has a height ofThe input end of the 10 th neural network block receives all the characteristic maps in D4, the output end of the 10 th neural network block outputs 512 characteristic maps, the set of the 512 characteristic maps is marked as D5, and the width of each characteristic map in D5 is D5Has a height ofThe encoding part provides all the feature maps of S1, S2, S3, S4, S5, D1, D2, D3, D4 and 
D5 to the decoding part;
for the decoding part, the input end of the 1st information extraction block receives all the feature maps in D1, and its output end outputs 64 feature maps whose set is denoted F1; each feature map in F1 has a width W and a height H; the first input end of the 1st feature reconstruction block receives all the feature maps in S1, its second input end receives all the feature maps in F1, and its output end outputs 64 feature maps whose set is denoted F2; each feature map in F2 has a width W and a height H; the first input end of the 1st information reconstruction block receives all the feature maps in F2, its second input end receives all the feature maps in D2, and its output end outputs 128 feature maps whose set is denoted F3; each feature map in F3 has a width W/2 and a height H/2; the first input end of the 2nd feature reconstruction block receives all the feature maps in S2, its second input end receives all the feature maps in F3, and its output end outputs 128 feature maps whose set is denoted F4; each feature map in F4 has a width W/2 and a height H/2; the first input end of the 2nd information reconstruction block receives all the feature maps in F4, its second input end receives all the feature maps in D3, and its output end outputs 256 feature maps whose set is denoted F5; each feature map in F5 has a width W/4 and a height H/4; the first input end of the 3rd feature reconstruction block receives all the feature maps in S3, its second input end receives all the feature maps in F5, and its output end outputs 256 feature maps whose set is denoted F6; each feature map in F6 has a width W/4 and a height H/4; the first input end of the 3rd information reconstruction block receives all the feature maps in F6, its second input end receives all the feature maps in D4, and its output end outputs 512 feature maps whose set is denoted F7; each feature map in F7 has a width W/8 and a height H/8; the first input end of the 4th feature reconstruction block receives all the feature maps in S4, its second input end receives all the feature maps in F7, and its output end outputs 512 feature maps whose set is denoted F8; each feature map in F8 has a width W/8 and a height H/8; the first input end of the 4th information reconstruction block receives all the feature maps in F8, its second input end receives all the feature maps in D5, and its output end outputs 512 feature maps whose set is denoted F9; each feature map in F9 has a width W/16 and a height H/16; the first input end of the 5th feature reconstruction block receives all the feature maps in S5, its second input end receives all the feature maps in F9, and its output end outputs 512 feature maps whose set is denoted F10; each feature map in F10 has a width W/16 and a height H/16; the input end of the 2nd information extraction block receives all the feature maps in S5, and its output end outputs 512 feature maps whose set is denoted F11; each feature map in F11 has a width W/16 and a height H/16;
the input end of the 1st expansion convolution block (dilated convolution block) receives all the feature maps in D1, and its output end outputs 64 feature maps whose set is denoted P1; each feature map in P1 has a width W and a height H; the input end of the 2nd expansion convolution block receives all the feature maps in D2, and its output end outputs 128 feature maps whose set is denoted P2; each feature map in P2 has a width W/2 and a height H/2; the input end of the 3rd expansion convolution block receives all the feature maps in D3, and its output end outputs 256 feature maps whose set is denoted P3; each feature map in P3 has a width W/4 and a height H/4; the input end of the 4th expansion convolution block receives all the feature maps in D4, and its output end outputs 512 feature maps whose set is denoted P4; each feature map in P4 has a width W/8 and a height H/8; the input end of the 5th expansion convolution block receives all the feature maps in D5, and its output end outputs 512 feature maps whose set is denoted P5; each feature map in P5 has a width W/16 and a height H/16;
the first input end of the 1st feature aggregation block receives all the feature maps in F10, its second input end receives all the feature maps in P5, its third input end receives all the feature maps in F11, and its output end outputs 256 feature maps whose set is denoted A1; each feature map in A1 has a width W/16 and a height H/16; the first input end of the 2nd feature aggregation block receives all the feature maps in F8, its second input end receives all the feature maps in P4, its third input end receives all the feature maps in A1, and its output end outputs 128 feature maps whose set is denoted A2; each feature map in A2 has a width W/8 and a height H/8; the first input end of the 3rd feature aggregation block receives all the feature maps in F6, its second input end receives all the feature maps in P3, its third input end receives all the feature maps in A2, and its output end outputs 64 feature maps whose set is denoted A3; each feature map in A3 has a width W/4 and a height H/4; the first input end of the 4th feature aggregation block receives all the feature maps in F4, its second input end receives all the feature maps in P2, its third input end receives all the feature maps in A3, and its output end outputs 32 feature maps whose set is denoted A4; each feature map in A4 has a width W/2 and a height H/2; the first input end of the 5th feature aggregation block receives all the feature maps in F2, its second input end receives all the feature maps in P1, its third input end receives all the feature maps in A4, and its output end outputs 16 feature maps whose set is denoted A5; each feature map in A5 has a width W and a height H; the decoding part supplies all the feature maps in A5 to the output layer;
for the output layer, the input end of the output convolutional layer receives all the feature maps in A5, and the output end of the output convolutional layer outputs one feature map with a width W and a height H as the saliency detection map;
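For illustration, a minimal PyTorch-style sketch of such an output layer is given below; the 3x3 kernel, the padding and the sigmoid used here are assumptions rather than claim text, chosen only so that the 16 maps of A5 are reduced to a single W x H saliency map.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Hypothetical output layer: maps the 16 feature maps in A5 (W x H)
    to a single-channel saliency detection map of the same size."""
    def __init__(self, in_channels: int = 16):
        super().__init__()
        # 3x3 convolution with padding 1 keeps the W x H resolution (assumed).
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, a5: torch.Tensor) -> torch.Tensor:
        # a5: (batch, 16, H, W) -> (batch, 1, H, W), squashed to [0, 1].
        return torch.sigmoid(self.conv(a5))
```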
step 1_3: inputting, for each original 3D image in the training set, the R channel component, the G channel component and the B channel component of its RGB image, together with the three-channel depth map obtained by copying its depth image, into the convolutional neural network for training, so as to obtain the saliency detection map corresponding to each pair of original 3D images; the saliency detection map corresponding to the kth pair of original 3D images is denoted S_k, where S_k(x, y) represents the pixel value of the pixel point whose coordinate position in S_k is (x, y);
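A small sketch of how the two network inputs could be assembled for one pair, assuming PyTorch tensors; the function name and tensor layout are assumptions, not taken from the patent.

```python
import torch

def build_inputs(rgb: torch.Tensor, depth: torch.Tensor):
    """rgb: (3, H, W) tensor holding the R, G and B channel components.
    depth: (H, W) single-channel depth image.
    Returns the RGB input and the three-channel depth map obtained by
    copying the depth image, each with a batch dimension added."""
    depth3 = depth.unsqueeze(0).repeat(3, 1, 1)   # copy depth into 3 channels
    return rgb.unsqueeze(0), depth3.unsqueeze(0)  # (1, 3, H, W) each
```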
step 1_4: calculating the loss function value between the saliency detection map corresponding to each pair of original 3D images and the corresponding real label image, and recording the loss function value obtained for the kth pair;
step 1_5: repeatedly executing step 1_3 and step 1_4 M times to obtain a convolutional neural network training model, obtaining N x M loss function values in total; the sum of the N loss function values obtained in each execution is divided by N to give the final loss function value of that execution, yielding M final loss function values in total; the final loss function value with the smallest value is then found among the M final loss function values, and the weight vector and the bias term corresponding to this smallest final loss function value are taken as the optimal weight vector and the optimal bias term of the convolutional neural network training model; wherein M is greater than 1;
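The following sketch illustrates steps 1_3 to 1_5 as a training loop. It assumes a PyTorch model `net` taking the RGB and three-channel depth inputs, an `optimizer`, a list `pairs` of (rgb, depth3, label) tensors, and binary cross-entropy as the loss; the loss choice and all names are assumptions, since the claim does not fix them here.

```python
import copy
import torch
import torch.nn.functional as F

def train_and_select(net, pairs, optimizer, M: int):
    """Runs steps 1_3 and 1_4 M times, averages the N per-pair losses of each
    pass, and keeps the weights of the pass with the smallest average loss."""
    best_loss, best_state = float("inf"), None
    for _ in range(M):
        total = 0.0
        for rgb, depth3, label in pairs:                 # N training pairs
            pred = net(rgb, depth3)                      # saliency detection map
            loss = F.binary_cross_entropy(pred, label)   # assumed loss function
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        avg = total / len(pairs)                         # final loss of this pass
        if avg < best_loss:                              # keep the best weights
            best_loss = avg
            best_state = copy.deepcopy(net.state_dict())
    net.load_state_dict(best_state)
    return net
```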
the test stage process comprises the following specific steps:
step 2_1: inputting the R channel component, the G channel component and the B channel component of the RGB image of the 3D image to be subjected to saliency detection, together with the three-channel depth map obtained by copying its depth image, into the convolutional neural network training model, and predicting with the optimal weight vector and the optimal bias term to obtain the corresponding saliency prediction map.
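Under the same assumptions as the training sketch above (model `net` with the optimal weights already loaded), prediction in the test stage reduces to a single forward pass:

```python
import torch

@torch.no_grad()
def predict_saliency(net, rgb, depth3):
    """rgb, depth3: (1, 3, H, W) tensors for the 3D image under test.
    Returns a (1, 1, H, W) saliency prediction map."""
    net.eval()
    return net(rgb, depth3)
```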
2. The saliency image detection method with interactive cyclic feature remodeling according to claim 1, wherein in step 1_2 the 2 information extraction blocks have the same structure, each consisting of a 1st convolution block, a first maximum pooling layer, a first average pooling layer, a 2nd convolution block, a 3rd convolution block and a first up-sampling layer; the 1st convolution block comprises a first convolution layer, a first activation layer, a second convolution layer and a second activation layer connected in sequence, the 2nd convolution block comprises a third convolution layer and a third activation layer connected in sequence, and the 3rd convolution block comprises a fourth convolution layer and a fourth activation layer connected in sequence; the input end of the first convolution layer in the 1st information extraction block receives all the feature maps in D1, and the input end of the first convolution layer in the 2nd information extraction block receives all the feature maps in S5; the input end of the first maximum pooling layer, the input end of the first average pooling layer and the input end of the third convolution layer all receive the feature maps output by the second activation layer; a channel-number superposition operation is performed on the feature maps output by the first maximum pooling layer and the feature maps output by the first average pooling layer, and the input end of the fourth convolution layer receives the feature maps obtained after this channel-number superposition operation; the input end of the first up-sampling layer receives the feature maps output by the fourth activation layer; an element-wise multiplication operation is performed on the feature maps output by the first up-sampling layer and the feature maps output by the third activation layer, and an element-wise addition operation is then performed on the feature maps output by the first up-sampling layer and the feature maps obtained after the element-wise multiplication; for the 1st information extraction block the set formed by all the feature maps obtained after the element-wise addition is F1, and for the 2nd information extraction block it is F11; wherein, denoting the number of input channels of the ith information extraction block by n_i, n_1 = 64 and n_2 = 512; in the ith information extraction block, the first convolution layer and the fourth convolution layer each have a convolution kernel size of 1 x 1, n_i convolution kernels, a step length of 1 and a zero-padding parameter of 0; the second convolution layer has a convolution kernel size of 3 x 3, n_i convolution kernels, a step length of 1 and a zero-padding parameter of 0; the third convolution layer has a convolution kernel size of 3 x 3, n_i convolution kernels, a step length of 1 and a zero-padding parameter of 1; i = 1, 2; the activation mode of the first, second, third and fourth activation layers is 'ReLU'; the first maximum pooling layer and the first average pooling layer have a pooling kernel size of 2 x 2, a step length of 2 and a zero-padding parameter of 0; and the amplification factor of the first up-sampling layer is 2 with bilinear interpolation.
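A minimal PyTorch-style sketch of this information extraction block follows. To keep the sketch runnable at any even input size, the 3 x 3 convolutions use a zero-padding of 1 (the claim lists 0 for the second convolution layer); the class and variable names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationExtractionBlock(nn.Module):
    """Sketch of one information extraction block with n input channels."""
    def __init__(self, n: int):
        super().__init__()
        # 1st convolution block: 1x1 conv -> ReLU -> 3x3 conv -> ReLU
        self.conv_block1 = nn.Sequential(
            nn.Conv2d(n, n, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n, n, 3, padding=1), nn.ReLU(inplace=True))
        # 2nd convolution block (local branch): 3x3 conv -> ReLU
        self.conv_block2 = nn.Sequential(
            nn.Conv2d(n, n, 3, padding=1), nn.ReLU(inplace=True))
        # pooling branch: 2x2 max pool and average pool, outputs concatenated
        self.max_pool = nn.MaxPool2d(2, stride=2)
        self.avg_pool = nn.AvgPool2d(2, stride=2)
        # 3rd convolution block on the concatenated pooled maps: 1x1 conv -> ReLU
        self.conv_block3 = nn.Sequential(
            nn.Conv2d(2 * n, n, 1), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # assumes even spatial dimensions so pooling then x2 upsampling restores the size
        feats = self.conv_block1(x)
        local = self.conv_block2(feats)
        pooled = torch.cat([self.max_pool(feats), self.avg_pool(feats)], dim=1)
        gathered = self.conv_block3(pooled)
        up = F.interpolate(gathered, scale_factor=2, mode="bilinear",
                           align_corners=False)
        return up + up * local  # element-wise multiplication, then addition
```

With these assumptions, InformationExtractionBlock(64) would act on D1 and InformationExtractionBlock(512) on S5.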
3. The saliency image detection method with interactive cyclic feature remodeling according to claim 1, wherein in step 1_2 the 5 feature reconstruction blocks have the same structure, each consisting of a context attention block and a channel attention block; the first input end of the mth feature reconstruction block receives all the feature maps in S1, S2, S3, S4 and S5 for m = 1 to 5 respectively, and its second input end receives all the feature maps in F1, F3, F5, F7 and F9 respectively; for each feature reconstruction block, a first element-wise addition operation is performed on all the feature maps received at its first input end and all the feature maps received at its second input end; the input end of the context attention block receives the feature maps obtained after the first element-wise addition, and the input end of the channel attention block receives the feature maps output by the context attention block; an element-wise multiplication operation is performed on the feature maps output by the channel attention block and the feature maps obtained after the first element-wise addition; a second element-wise addition operation is then performed on the feature maps received at the first input end and the feature maps obtained after the element-wise multiplication; the set formed by all the feature maps obtained after the second element-wise addition is F2, F4, F6, F8 and F10 for the 1st to 5th feature reconstruction blocks respectively.
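The claim does not spell out the internals of the context attention block or the channel attention block, so the sketch below treats them as interchangeable modules and supplies a squeeze-and-excitation-style channel attention purely as a placeholder; everything except the addition, attention, multiplication and second addition data flow is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Placeholder channel attention (squeeze-and-excitation style);
    the patent's actual channel attention block may differ."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)  # per-channel weights in [0, 1]

class FeatureReconstructionBlock(nn.Module):
    """Sketch of one feature reconstruction (remodeling) block."""
    def __init__(self, context_attention: nn.Module, channel_attention: nn.Module):
        super().__init__()
        self.context_attention = context_attention   # internals unspecified in the claim
        self.channel_attention = channel_attention   # internals unspecified in the claim

    def forward(self, s: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        fused = s + f                                   # first element-wise addition
        attn = self.channel_attention(self.context_attention(fused))
        return s + attn * fused                         # multiplication, then second addition
```

For example, FeatureReconstructionBlock(nn.Identity(), ChannelAttention(64)) would stand in for the 1st feature reconstruction block acting on S1 and F1.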
4. The saliency image detection method with interactive cyclic feature remodeling according to claim 1, wherein in step 1_2 the 4 information reconstruction blocks have the same structure, each consisting of a second maximum pooling layer, a second average pooling layer, a 4th convolution block and a 5th convolution block; the 4th convolution block comprises a fifth convolution layer and a fifth activation layer connected in sequence, and the 5th convolution block comprises a sixth convolution layer, a sixth activation layer, a seventh convolution layer and a seventh activation layer connected in sequence; in the 1st information reconstruction block the input ends of the second maximum pooling layer and of the second average pooling layer both receive all the feature maps in F2 and the input end of the sixth convolution layer receives all the feature maps in D2; in the 2nd information reconstruction block these input ends respectively receive all the feature maps in F4 and D3; in the 3rd, F6 and D4; and in the 4th, F8 and D5; an element-wise subtraction operation is performed on the feature maps output by the second maximum pooling layer and the feature maps output by the second average pooling layer, the corresponding elements output by the second average pooling layer being subtracted from the elements output by the second maximum pooling layer; the input end of the fifth convolution layer receives the feature maps obtained after the element-wise subtraction; an element-wise multiplication operation is performed on the feature maps output by the fifth activation layer and the feature maps output by the seventh activation layer, and an element-wise addition operation is then performed on the feature maps output by the fifth activation layer and the feature maps obtained after the element-wise multiplication; the set formed by all the feature maps obtained after the element-wise addition is F3, F5, F7 and F9 for the 1st to 4th information reconstruction blocks respectively; wherein, denoting the numbers of input channels of the first and second input ends of the jth information reconstruction block by n1_j and n2_j, n1_1 = 64, n2_1 = 128; n1_2 = 128, n2_2 = 256; n1_3 = 256, n2_3 = 512; n1_4 = 512, n2_4 = 512; in the jth information reconstruction block, the fifth convolution layer has a convolution kernel size of 1 x 1, n2_j convolution kernels, a step length of 1 and a zero-padding parameter of 0; the sixth convolution layer has a convolution kernel size of 1 x 1, n2_j convolution kernels, a step length of 1 and a zero-padding parameter of 0; the seventh convolution layer has a convolution kernel size of 3 x 3, n2_j convolution kernels, a step length of 1 and a zero-padding parameter of 1; j = 1, 2, 3, 4; the activation mode of the fifth, sixth and seventh activation layers is 'ReLU'; and the second maximum pooling layer and the second average pooling layer have a pooling kernel size of 2 x 2, a step length of 2 and a zero-padding parameter of 0.
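A PyTorch-style sketch of one information reconstruction block under these parameters; the class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class InformationReconstructionBlock(nn.Module):
    """Sketch of one information reconstruction block.
    n1: channels of the decoder-side input (e.g. F2); n2: channels of the
    encoder-side input (e.g. D2), which is assumed to be at half the
    decoder-side resolution. The output has n2 channels at that half resolution."""
    def __init__(self, n1: int, n2: int):
        super().__init__()
        self.max_pool = nn.MaxPool2d(2, stride=2)
        self.avg_pool = nn.AvgPool2d(2, stride=2)
        # 4th convolution block on the (max - avg) difference: 1x1 conv -> ReLU
        self.conv_block4 = nn.Sequential(
            nn.Conv2d(n1, n2, 1), nn.ReLU(inplace=True))
        # 5th convolution block on the encoder features: 1x1 conv -> ReLU -> 3x3 conv -> ReLU
        self.conv_block5 = nn.Sequential(
            nn.Conv2d(n2, n2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n2, n2, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        diff = self.max_pool(f) - self.avg_pool(f)   # element-wise subtraction
        contrast = self.conv_block4(diff)
        encoded = self.conv_block5(d)
        return contrast + contrast * encoded         # multiplication, then addition
```

With these assumptions, InformationReconstructionBlock(64, 128) would correspond to the 1st information reconstruction block acting on F2 and D2.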
5. The saliency image detection method with interactive cyclic feature remodeling according to claim 1, wherein in step 1_2 the 5 feature aggregation blocks have the same structure, each consisting of a 6th convolution block, a 7th convolution block, an 8th convolution block, a 9th convolution block, a 10th convolution block, an 11th convolution block, a 12th convolution block, a 13th convolution block, a second up-sampling layer and a residual fusion block; the 6th to 13th convolution blocks respectively comprise, connected in sequence, an eighth convolution layer and an eighth activation layer, a ninth convolution layer and a ninth activation layer, a tenth convolution layer and a tenth activation layer, an eleventh convolution layer and an eleventh activation layer, a twelfth convolution layer and a twelfth activation layer, a thirteenth convolution layer and a thirteenth activation layer, a fourteenth convolution layer and a fourteenth activation layer, and a fifteenth convolution layer and a fifteenth activation layer; the residual fusion block comprises a sixteenth activation layer, a third maximum pooling layer and a sixteenth convolution layer connected in sequence; in the 1st feature aggregation block the input end of the eighth convolution layer receives all the feature maps in F10, the input end of the ninth convolution layer receives all the feature maps in P5, and the input end of the second up-sampling layer receives all the feature maps in F11; in the 2nd feature aggregation block these three input ends respectively receive all the feature maps in F8, P4 and A1; in the 3rd, F6, P3 and A2; in the 4th, F4, P2 and A3; and in the 5th, F2, P1 and A4; the feature maps output by the eighth activation layer and the feature maps output by the ninth activation layer are each cut into four equal parts along the channel dimension, in order; a first channel-number superposition operation is performed on the 1st part of the eighth activation layer's output and the 1st part of the ninth activation layer's output, and a second, third and fourth channel-number superposition operation are likewise performed on the corresponding 2nd, 3rd and 4th parts; the input end of the tenth convolution layer receives the feature maps output by the second up-sampling layer; the input ends of the eleventh, twelfth, thirteenth and fourteenth convolution layers respectively receive the feature maps obtained after the first, second, third and fourth channel-number superposition operations; a fifth channel-number superposition operation is performed on the feature maps output by the eleventh, twelfth, thirteenth and fourteenth activation layers, and the input end of the fifteenth convolution layer receives the feature maps obtained after the fifth channel-number superposition operation; an element-wise multiplication operation is performed on the feature maps output by the tenth activation layer and the feature maps output by the fifteenth activation layer, and a first element-wise addition operation is performed on the feature maps output by the tenth activation layer and the feature maps obtained after the element-wise multiplication; the input end of the sixteenth activation layer receives the feature maps obtained after the first element-wise addition, and a second element-wise addition operation is performed on the feature maps output by the sixteenth convolution layer and the feature maps obtained after the first element-wise addition; the set formed by all the feature maps obtained after the second element-wise addition is A1, A2, A3, A4 and A5 for the 1st to 5th feature aggregation blocks respectively; wherein, denoting the numbers of input channels of the first, second and third input ends of the mth feature aggregation block by n1_m, n2_m and n3_m, n1_1 = 512, n2_1 = 512, n3_1 = 512; n1_2 = 512, n2_2 = 512, n3_2 = 256; n1_3 = 256, n2_3 = 256, n3_3 = 128; n1_4 = 128, n2_4 = 128, n3_4 = 64; n1_5 = 64, n2_5 = 64, n3_5 = 32; in the mth feature aggregation block, the eighth to fourteenth convolution layers each have a convolution kernel size of 3 x 3, n3_m convolution kernels, a step length of 1 and a zero-padding parameter of 1; the fifteenth convolution layer has a convolution kernel size of 3 x 3, n3_m convolution kernels, a step length of 1 and a zero-padding parameter of 0; the sixteenth convolution layer has a convolution kernel size of 3 x 3, n3_m/2 convolution kernels, a step length of 1 and a zero-padding parameter of 0; m = 1, 2, 3, 4, 5; the activation mode of the eighth to sixteenth activation layers is 'ReLU'; the third maximum pooling layer has a pooling kernel size of 5 x 5, a step length of 1 and a zero-padding parameter of 2; and the amplification factor of the second up-sampling layer is 2 with bilinear interpolation.
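A sketch of one feature aggregation block is given below with several reconciliations that are assumptions rather than claim text: the tenth and fifteenth convolution layers are given n3/2 output channels (the claim gives them n3, which would leave the final residual addition with the n3/2-channel sixteenth convolution layer ill-defined), and the fifteenth and sixteenth convolutions use a zero-padding of 1 instead of 0 so that all feature maps keep the same spatial size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3x3(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class FeatureAggregationBlock(nn.Module):
    """Sketch of one feature aggregation block.
    n: channels of the first and second inputs; n3: channels of the third input.
    The output has n3 // 2 channels at twice the third input's resolution."""
    def __init__(self, n: int, n3: int):
        super().__init__()
        out = n3 // 2
        self.conv8 = conv3x3(n, n3)      # on the first input (e.g. F10)
        self.conv9 = conv3x3(n, n3)      # on the second input (e.g. P5)
        self.conv10 = conv3x3(n3, out)   # on the up-sampled third input
        # one 3x3 convolution per cross-concatenated channel quarter
        self.group_convs = nn.ModuleList([conv3x3(n3 // 2, n3) for _ in range(4)])
        self.conv15 = conv3x3(4 * n3, out)
        # residual fusion block: ReLU -> 5x5 max pool (stride 1) -> 3x3 conv
        self.residual = nn.Sequential(
            nn.ReLU(inplace=True), nn.MaxPool2d(5, stride=1, padding=2),
            nn.Conv2d(out, out, 3, padding=1))

    def forward(self, f: torch.Tensor, p: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        x8, x9 = self.conv8(f), self.conv9(p)
        # cut both tensors into four channel groups and concatenate them group-wise
        g8, g9 = torch.chunk(x8, 4, dim=1), torch.chunk(x9, 4, dim=1)
        groups = [conv(torch.cat([u, v], dim=1))
                  for conv, u, v in zip(self.group_convs, g8, g9)]
        mixed = self.conv15(torch.cat(groups, dim=1))
        up = F.interpolate(a, scale_factor=2, mode="bilinear", align_corners=False)
        guide = self.conv10(up)
        fused = guide + guide * mixed        # multiplication, then first addition
        return fused + self.residual(fused)  # residual fusion, second addition
```

With these choices, FeatureAggregationBlock(512, 512) yields the 256 output maps of the 1st feature aggregation block and FeatureAggregationBlock(64, 32) the 16 maps of the 5th.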
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011413838.5A CN112529862A (en) | 2020-12-07 | 2020-12-07 | Significance image detection method for interactive cycle characteristic remodeling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112529862A true CN112529862A (en) | 2021-03-19 |
Family
ID=74997830
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011413838.5A Withdrawn CN112529862A (en) | 2020-12-07 | 2020-12-07 | Significance image detection method for interactive cycle characteristic remodeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112529862A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113192073A (en) * | 2021-04-06 | 2021-07-30 | 浙江科技学院 | Clothing semantic segmentation method based on cross fusion network |
CN113538442A (en) * | 2021-06-04 | 2021-10-22 | 杭州电子科技大学 | RGB-D significant target detection method using adaptive feature fusion |
CN113538442B (en) * | 2021-06-04 | 2024-04-09 | 杭州电子科技大学 | RGB-D significant target detection method using self-adaptive feature fusion |
CN113313077A (en) * | 2021-06-30 | 2021-08-27 | 浙江科技学院 | Salient object detection method based on multi-strategy and cross feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20210319 |