CN112529862A - Saliency image detection method based on interactive cyclic feature remodeling


Info

Publication number
CN112529862A
Authority
CN
China
Prior art keywords
feature maps
block
feature
layer
input
Prior art date
Legal status
Withdrawn
Application number
CN202011413838.5A
Other languages
Chinese (zh)
Inventor
周武杰
郭沁玲
雷景生
万健
钱小鸿
叶宁
甘兴利
Current Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202011413838.5A
Publication of CN112529862A

Classifications

    • G06T7/0002 Image analysis; inspection of images, e.g. flaw detection
    • G06N3/045 Neural network architectures; combinations of networks
    • G06N3/08 Neural network learning methods
    • G06T3/4007 Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G06T5/30 Image enhancement or restoration using local operators; erosion or dilatation, e.g. thinning
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T9/002 Image coding using neural networks
    • G06T2207/10024 Color image
    • G06T2207/10028 Range image; depth image; 3D point clouds
    • G06T2207/20081 Training; learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20221 Image fusion; image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a saliency image detection method based on interactive cyclic feature remodeling. In the training stage a convolutional neural network is constructed that comprises an input layer, an encoding part, a decoding part and an output layer; the encoding part comprises neural network blocks, and the decoding part comprises information extraction blocks, feature remodeling blocks, information remodeling blocks, expansion convolution (dilated convolution) blocks and feature aggregation blocks. The three channels of the RGB image of each 3D image, together with a three-channel depth map obtained by processing the corresponding depth image, are input into the convolutional neural network for training to obtain a saliency detection map, and an optimal weight vector and optimal bias terms are obtained by computing the loss function value between the saliency detection map and the label image. In the testing stage, the three channels of the RGB image of the 3D image to be detected and the three-channel depth map corresponding to its depth image are input into the trained convolutional neural network model, and a saliency prediction image is obtained by prediction with the optimal weight vector and the optimal bias terms. The method has the advantages of a clear saliency detection result and high detection accuracy.

Description

Saliency image detection method based on interactive cyclic feature remodeling
Technical Field
The invention relates to a deep-learning-based saliency image detection technique, and in particular to a saliency image detection method based on interactive cyclic feature remodeling.
Background
With the rapid development of artificial intelligence in the computer field, saliency detection of images has become a research field of growing interest. Salient object detection (SOD) aims at distinguishing the visually most distinctive objects in an input image, and over the last decades hundreds of methods have been developed for this task, which serves as an effective pre-processing step in many image processing and computer vision tasks such as object segmentation and tracking, video compression, image editing and texture smoothing. Recent work learns and detects deep features of salient objects with convolutional neural networks (CNN); these models adopt an encoder-decoder structure, which is simple and computationally efficient. In the encoder-decoder structure, the encoder usually extracts features of different semantic levels and resolutions with a pre-trained classification model (such as ResNet or VGG), and the decoder combines the extracted features to generate a saliency map. Existing saliency detection methods based on encoder-decoder convolutional neural networks are quite effective, but their accuracy is still challenging. Features of different semantic levels and resolutions have different distribution characteristics: high-level features carry rich semantic information but lack accurate position information, while low-level features carry rich details but are full of background noise, so the detection accuracy of methods that simply fuse high-level and low-level features is still not ideal. Moreover, for features of different modalities, cluttered background information exists in both the RGB information and the depth information, and further intensive research is still needed to effectively distinguish the background from the foreground and thereby generate a better saliency image.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a saliency image detection method based on interactive cyclic feature remodeling that produces a clear saliency detection result with high detection accuracy.
The technical scheme adopted by the invention for solving the above technical problem is as follows: a saliency image detection method based on interactive cyclic feature remodeling, characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting N pairs of original 3D images and label images corresponding to each pair of original 3D images, and recording RGB images of the kth pair of original 3D images as
{I_k(x,y)}, denoting the depth image of the k-th pair of original 3D images as {D_k(x,y)}, and taking the real saliency detection image corresponding to the k-th pair of original 3D images as the label image, denoted {G_k(x,y)}; then forming a training set from the RGB images, the depth images and the corresponding label images of all the original 3D images; wherein N is a positive integer with N ≥ 200, k is a positive integer with 1 ≤ k ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, W represents the width of the original 3D images and of their RGB images, depth images and corresponding label images, H represents the height of these images, I_k(x,y) represents the pixel value of the pixel whose coordinate position is (x,y) in {I_k(x,y)}, D_k(x,y) represents the pixel value of the pixel whose coordinate position is (x,y) in {D_k(x,y)}, and G_k(x,y) represents the pixel value of the pixel whose coordinate position is (x,y) in {G_k(x,y)};
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, an encoding part, a decoding part and an output layer, wherein the input layer comprises an RGB (red, green and blue) image input layer and a depth image input layer, the encoding part comprises 10 neural network blocks, and the decoding part comprises 2 information extraction blocks, 5 characteristic reconstruction blocks, 4 information reconstruction blocks, 5 expansion convolution blocks and 5 characteristic aggregation blocks; the output layer comprises an output convolution layer, the size of convolution kernels of the output convolution layer is 3 multiplied by 3, the number of the convolution kernels is 1, and the step length is 1;
for an RGB image input layer in the input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original RGB image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original RGB image to the encoding part; wherein, the width of the original RGB image is W, and the height of the original RGB image is H;
for a depth map input layer in the input layer, the input end of the input layer receives a three-channel depth map processed by an original depth image by adopting a copying method, and the output end of the input layer outputs the three-channel depth map to a coding part; wherein the width of the original depth image is W, and the height of the original depth image is H;
for the coding part, a1 st neural network block, a2 nd neural network block, a3 rd neural network block, a4 th neural network block and a5 th neural network block are sequentially connected to form a color coding stream, and a 6 th neural network block, a 7 th neural network block, an 8 th neural network block, a 9 th neural network block and a 10 th neural network block are sequentially connected to form a depth coding stream; the input end of the 1 st neural network block receives an R channel component, a G channel component and a B channel component of an original RGB image output by the output end of the RGB image input layer, the output end of the 1 st neural network block outputs 64 feature maps, a set formed by the 64 feature maps is recorded as S1, and each feature map in S1 has a width W and a height H; the input end of the 2 nd neural network block receives all the feature maps in S1, the output end of the 2 nd neural network block outputs 128 feature maps, the set of the 128 feature maps is marked as S2, and the width of each feature map in S2 is
W/2 and the height is H/2; the input end of the 3rd neural network block receives all the feature maps in S2, the output end of the 3rd neural network block outputs 256 feature maps, whose set is denoted S3, and each feature map in S3 has width W/4 and height H/4; the input end of the 4th neural network block receives all the feature maps in S3, the output end of the 4th neural network block outputs 512 feature maps, whose set is denoted S4, and each feature map in S4 has width W/8 and height H/8; the input end of the 5th neural network block receives all the feature maps in S4, the output end of the 5th neural network block outputs 512 feature maps, whose set is denoted S5, and each feature map in S5 has width W/16 and height H/16; the input end of the 6th neural network block receives the three-channel depth map output by the output end of the depth map input layer, the output end of the 6th neural network block outputs 64 feature maps, whose set is denoted D1, and each feature map in D1 has width W and height H; the input end of the 7th neural network block receives all the feature maps in D1, the output end of the 7th neural network block outputs 128 feature maps, whose set is denoted D2, and each feature map in D2 has width W/2 and height H/2; the input end of the 8th neural network block receives all the feature maps in D2, the output end of the 8th neural network block outputs 256 feature maps, whose set is denoted D3, and each feature map in D3 has width W/4 and height H/4; the input end of the 9th neural network block receives all the feature maps in D3, the output end of the 9th neural network block outputs 512 feature maps, whose set is denoted D4, and each feature map in D4 has width W/8 and height H/8; the input end of the 10th neural network block receives all the feature maps in D4, the output end of the 10th neural network block outputs 512 feature maps, whose set is denoted D5, and each feature map in D5 has width W/16 and height H/16;
The encoding part provides all the feature maps of S1, S2, S3, S4, S5, D1, D2, D3, D4 and D5 to the decoding part;
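The channel counts (64, 128, 256, 512, 512) and the successive halving of resolution from S1/D1 to S5/D5 correspond to the five convolutional stages of a VGG-16-style backbone. The following is only a minimal sketch of how such a two-stream encoder could be assembled in PyTorch; the use of torchvision's VGG-16 and all names here are illustrative assumptions, not details taken from the patent text:

```python
import torch.nn as nn
from torchvision import models

class TwoStreamEncoder(nn.Module):
    """Two parallel five-stage encoders: one for the RGB input and one for the
    three-channel depth map (a VGG-16-style sketch, not the exact patent blocks)."""
    def __init__(self):
        super().__init__()
        def make_stream():
            feats = models.vgg16().features
            # Split the VGG-16 feature extractor into 5 stages whose outputs have
            # 64/128/256/512/512 channels at full, 1/2, 1/4, 1/8 and 1/16 resolution,
            # matching S1..S5 (or D1..D5).
            return nn.ModuleList([feats[:4], feats[4:9], feats[9:16],
                                  feats[16:23], feats[23:30]])
        self.rgb_stream = make_stream()     # color coding stream (blocks 1..5)
        self.depth_stream = make_stream()   # depth coding stream (blocks 6..10)

    def forward(self, rgb, depth3):
        s, d = [], []
        x = rgb
        for stage in self.rgb_stream:
            x = stage(x)
            s.append(x)                     # S1..S5
        y = depth3
        for stage in self.depth_stream:
            y = stage(y)
            d.append(y)                     # D1..D5
        return s, d
```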
for the decoding part, the input end of the 1 st information extraction block receives all the feature maps in D1, the output end of the 1 st information extraction block outputs 64 feature maps, the set of the 64 feature maps is marked as F1, and each feature map in F1 has a width W and a height H; a first input end of the 1 st feature reconstruction block receives all the feature maps in the S1, a second input end of the 1 st feature reconstruction block receives all the feature maps in the F1, an output end of the 1 st feature reconstruction block outputs 64 feature maps, a set of the 64 feature maps is marked as F2, and each feature map in the F2 has a width W and a height H; the first input terminal of the 1 st information reconstruction block receives all the feature maps in F2, the second input terminal of the 1 st information reconstruction block receives all the feature maps in D2, the output terminal of the 1 st information reconstruction block outputs 128 feature maps, the 128 feature maps are recorded as a set of F3, and the width of each feature map in F3 is F3
W/2 and the height is H/2. The first input end of the 2nd feature reconstruction block receives all the feature maps in S2 and its second input end receives all the feature maps in F3; its output end outputs 128 feature maps, whose set is denoted F4, and each feature map in F4 has width W/2 and height H/2. The first input end of the 2nd information reconstruction block receives all the feature maps in F4 and its second input end receives all the feature maps in D3; its output end outputs 256 feature maps, whose set is denoted F5, and each feature map in F5 has width W/4 and height H/4. The first input end of the 3rd feature reconstruction block receives all the feature maps in S3 and its second input end receives all the feature maps in F5; its output end outputs 256 feature maps, whose set is denoted F6, and each feature map in F6 has width W/4 and height H/4. The first input end of the 3rd information reconstruction block receives all the feature maps in F6 and its second input end receives all the feature maps in D4; its output end outputs 512 feature maps, whose set is denoted F7, and each feature map in F7 has width W/8 and height H/8. The first input end of the 4th feature reconstruction block receives all the feature maps in S4 and its second input end receives all the feature maps in F7; its output end outputs 512 feature maps, whose set is denoted F8, and each feature map in F8 has width W/8 and height H/8. The first input end of the 4th information reconstruction block receives all the feature maps in F8 and its second input end receives all the feature maps in D5; its output end outputs 512 feature maps, whose set is denoted F9, and each feature map in F9 has width W/16 and height H/16. The first input end of the 5th feature reconstruction block receives all the feature maps in S5 and its second input end receives all the feature maps in F9; its output end outputs 512 feature maps, whose set is denoted F10, and each feature map in F10 has width W/16 and height H/16. The input end of the 2nd information extraction block receives all the feature maps in S5; its output end outputs 512 feature maps, whose set is denoted F11, and each feature map in F11 has width W/16 and height H/16. The input end of the 1st expansion convolution block receives all the feature maps in D1; its output end outputs 64 feature maps, whose set is denoted P1, and each feature map in P1 has width W and height H. The input end of the 2nd expansion convolution block receives all the feature maps in D2; its output end outputs 128 feature maps, whose set is denoted P2, and each feature map in P2 has width W/2 and height H/2. The input end of the 3rd expansion convolution block receives all the feature maps in D3; its output end outputs 256 feature maps, whose set is denoted P3, and each feature map in P3 has width W/4 and height H/4. The input end of the 4th expansion convolution block receives all the feature maps in D4; its output end outputs 512 feature maps, whose set is denoted P4, and each feature map in P4 has width W/8 and height H/8. The input end of the 5th expansion convolution block receives all the feature maps in D5; its output end outputs 512 feature maps, whose set is denoted P5, and each feature map in P5 has width W/16 and height H/16. The first input end of the 1st feature aggregation block receives all the feature maps in F10, its second input end receives all the feature maps in P5 and its third input end receives all the feature maps in F11; its output end outputs 256 feature maps, whose set is denoted A1, and each feature map in A1 has width W/16 and height H/16. The first input end of the 2nd feature aggregation block receives all the feature maps in F8, its second input end receives all the feature maps in P4 and its third input end receives all the feature maps in A1; its output end outputs 128 feature maps, whose set is denoted A2, and each feature map in A2 has width W/8 and height H/8. The first input end of the 3rd feature aggregation block receives all the feature maps in F6, its second input end receives all the feature maps in P3 and its third input end receives all the feature maps in A2; its output end outputs 64 feature maps, whose set is denoted A3, and each feature map in A3 has width W/4 and height H/4. The first input end of the 4th feature aggregation block receives all the feature maps in F4, its second input end receives all the feature maps in P2 and its third input end receives all the feature maps in A3; its output end outputs 32 feature maps, whose set is denoted A4, and each feature map in A4 has width W/2 and height H/2. The first input end of the 5th feature aggregation block receives all the feature maps in F2, its second input end receives all the feature maps in P1 and its third input end receives all the feature maps in A4; its output end outputs 16 feature maps, whose set is denoted A5, and each feature map in A5 has width W and height H. The decoding part provides all the feature maps in A5 to the output layer.
For the output layer, the input end of the output convolution layer receives all the feature maps in A5, and the output end of the output convolution layer outputs one feature map with width W and height H as the saliency detection map;
step 1_ 3: inputting three-channel depth maps obtained by copying R channel components, G channel components, B channel components and depth images of RGB images of all original 3D images in a training set into a convolutional neural network for training to obtain significance detection maps corresponding to each pair of original 3D images, and marking the significance detection map corresponding to the kth pair of original 3D images as
{S_k(x,y)}, where S_k(x,y) represents the pixel value of the pixel whose coordinate position is (x,y) in {S_k(x,y)};
Step 1_ 4: calculating the loss function value between the saliency detection map and the label image corresponding to each pair of original 3D images, and recording the loss function value between {S_k(x,y)} and {G_k(x,y)} as Loss_k;
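The training procedure of steps 1_ 3 and 1_ 4, together with the repetition and best-weight selection of step 1_ 5 described next, can be sketched as follows. The binary cross-entropy loss, the Adam optimizer, the batch size and all names are illustrative assumptions, since this passage does not specify them:

```python
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, M=100, lr=1e-4, device="cuda"):
    loader = DataLoader(train_set, batch_size=4, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()           # assumed pixel-wise loss
    best_loss, best_state = float("inf"), None
    model.to(device)
    for epoch in range(M):                       # step 1_5: repeat M times
        epoch_loss = 0.0
        for rgb, depth3, label in loader:        # step 1_3: forward pass
            rgb, depth3, label = rgb.to(device), depth3.to(device), label.to(device)
            pred = model(rgb, depth3)            # saliency detection map
            loss = criterion(pred, label)        # step 1_4: loss vs. label image
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= len(loader)                # final loss value for this pass
        if epoch_loss < best_loss:               # keep the weights with minimum loss
            best_loss = epoch_loss
            best_state = copy.deepcopy(model.state_dict())
    return best_state                            # optimal weight vector and bias terms
```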
Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for M times to obtain a convolutional neural network training model, and obtaining N multiplied by M loss function values; dividing the sum of the N loss function values obtained by each execution by N to obtain final loss function values obtained by the execution, and obtaining M final loss function values in total; finding out the final loss function value with the minimum value from the M final loss function values, and correspondingly taking the weight vector and the bias item corresponding to the minimum final loss function value as the optimal weight vector and the optimal bias item of the convolutional neural network training model; wherein M is greater than 1;
the test stage process comprises the following specific steps:
step 2_ 1: and inputting a three-channel depth map obtained by copying R channel components, G channel components, B channel components and depth images of the RGB images of the 3D images to be subjected to significance detection into a convolutional neural network training model, predicting by using the optimal weight vector and the optimal bias term, and predicting to obtain a corresponding significance prediction image.
In step 1_2, the 2 information extraction blocks have the same structure and each consists of a 1st convolution block, a first maximum pooling layer, a first average pooling layer, a 2nd convolution block, a 3rd convolution block and a first up-sampling layer; the 1st convolution block comprises a first convolution layer, a first active layer, a second convolution layer and a second active layer which are sequentially connected, the 2nd convolution block comprises a third convolution layer and a third active layer which are sequentially connected, and the 3rd convolution block comprises a fourth convolution layer and a fourth active layer which are sequentially connected. The input end of the first convolution layer in the 1st information extraction block receives all the feature maps in D1, and the input end of the first convolution layer in the 2nd information extraction block receives all the feature maps in S5. The input end of the first maximum pooling layer, the input end of the first average pooling layer and the input end of the third convolution layer all receive all the feature maps output by the output end of the second active layer; a channel-number superposition (concatenation) operation is performed on all the feature maps output by the output end of the first maximum pooling layer and all the feature maps output by the output end of the first average pooling layer, and the input end of the fourth convolution layer receives all the feature maps obtained after the channel-number superposition operation; the input end of the first up-sampling layer receives all the feature maps output by the output end of the fourth active layer; an element-wise multiplication operation is performed on all the feature maps output by the output end of the first up-sampling layer and all the feature maps output by the output end of the third active layer, and an element-wise addition operation is then performed on all the feature maps output by the output end of the first up-sampling layer and all the feature maps obtained after the element-wise multiplication operation. For the 1st information extraction block, the set formed by all the feature maps obtained after the element-wise addition operation is F1; for the 2nd information extraction block, the set formed by all the feature maps obtained after the element-wise addition operation is F11. Let the number of input channels of the i-th information extraction block be n_i; then n_1 = 64 for the 1st information extraction block and n_2 = 512 for the 2nd information extraction block. In the i-th information extraction block, the convolution kernel size of the first convolution layer and of the fourth convolution layer is 1 × 1 with n_i convolution kernels, step size 1 and zero-padding parameter 0; the convolution kernel size of the second convolution layer is 3 × 3 with n_i convolution kernels, step size 1 and zero-padding parameter 0; the convolution kernel size of the third convolution layer is 3 × 3 with n_i convolution kernels, step size 1 and zero-padding parameter 1, where i = 1, 2. The activation mode of the first, second, third and fourth active layers is 'ReLU'; the pooling kernel size of the first maximum pooling layer and the first average pooling layer is 2 × 2 with step size 2 and zero-padding parameter 0; the magnification of the first up-sampling layer is 2 and its interpolation method is bilinear interpolation.
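The information extraction block described above can be sketched as follows in PyTorch. The names are illustrative, and the convolution paddings are chosen here so that the block output keeps the input resolution (the text leaves some padding values ambiguous), so this is an assumption-laden sketch rather than the exact block:

```python
import torch
import torch.nn as nn

class InformationExtractionBlock(nn.Module):
    """Sketch of the information extraction block of step 1_2.
    n_ch is n_i (64 for the 1st block, 512 for the 2nd block)."""
    def __init__(self, n_ch: int):
        super().__init__()
        self.conv_block1 = nn.Sequential(              # 1st convolution block
            nn.Conv2d(n_ch, n_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n_ch, n_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.max_pool = nn.MaxPool2d(2, stride=2)      # first maximum pooling layer
        self.avg_pool = nn.AvgPool2d(2, stride=2)      # first average pooling layer
        self.conv_block2 = nn.Sequential(              # 2nd convolution block
            nn.Conv2d(n_ch, n_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.conv_block3 = nn.Sequential(              # 3rd convolution block
            nn.Conv2d(2 * n_ch, n_ch, 1), nn.ReLU(inplace=True))
        self.upsample = nn.Upsample(scale_factor=2, mode="bilinear",
                                    align_corners=False)

    def forward(self, x):
        x = self.conv_block1(x)
        pooled = torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1)
        up = self.upsample(self.conv_block3(pooled))   # pooled branch, size restored
        direct = self.conv_block2(x)                   # direct branch
        return up + up * direct                        # element-wise multiply, then add
```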
In step 1_2, the 5 feature reconstruction blocks have the same structure and each consists of a context attention block and a channel attention block. For the 1st feature reconstruction block, a first element-wise addition operation is performed on all the feature maps in S1 and all the feature maps in F1; the input end of the context attention block receives all the feature maps obtained after the first element-wise addition operation, and the input end of the channel attention block receives all the feature maps output by the output end of the context attention block; an element-wise multiplication operation is performed on all the feature maps output by the output end of the channel attention block and all the feature maps obtained after the first element-wise addition operation, and a second element-wise addition operation is performed on all the feature maps in S1 and all the feature maps obtained after the element-wise multiplication operation; the set formed by all the feature maps obtained after the second element-wise addition operation is F2. In the same way, the 2nd feature reconstruction block operates on all the feature maps in S2 and all the feature maps in F3, and the set obtained after its second element-wise addition operation is F4; the 3rd feature reconstruction block operates on all the feature maps in S3 and all the feature maps in F5, and the set obtained after its second element-wise addition operation is F6; the 4th feature reconstruction block operates on all the feature maps in S4 and all the feature maps in F7, and the set obtained after its second element-wise addition operation is F8; the 5th feature reconstruction block operates on all the feature maps in S5 and all the feature maps in F9, and the set obtained after its second element-wise addition operation is F10.
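A sketch of the feature reconstruction block wiring. The internal structure of the context attention block and the channel attention block is not detailed in this passage, so simple stand-ins (a convolutional gate and squeeze-and-excitation-style channel gating) are assumed here purely to make the wiring concrete:

```python
import torch.nn as nn

class FeatureReconstructionBlock(nn.Module):
    """Sketch of the feature reconstruction (remodeling) block of step 1_2.
    s is the color-stream input (S1..S5), f the incoming decoder input (F1, F3, ...)."""
    def __init__(self, n_ch: int):
        super().__init__()
        # stand-in context attention: a 3x3 convolutional gate
        self.context_attention = nn.Sequential(
            nn.Conv2d(n_ch, n_ch, 3, padding=1), nn.Sigmoid())
        # stand-in channel attention: squeeze-and-excitation style gating
        self.channel_attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(n_ch, n_ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n_ch // 4, n_ch, 1), nn.Sigmoid())

    def forward(self, s, f):
        fused = s + f                                   # first element-wise addition
        att = self.channel_attention(self.context_attention(fused))
        return s + fused * att                          # multiply, second addition
```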
In step 1_2, the 4 information reconstruction blocks have the same structure and each consists of a second maximum pooling layer, a second average pooling layer, a 4th convolution block and a 5th convolution block; the 4th convolution block comprises a fifth convolution layer and a fifth active layer which are sequentially connected, and the 5th convolution block comprises a sixth convolution layer, a sixth active layer, a seventh convolution layer and a seventh active layer which are sequentially connected. In the 1st information reconstruction block, the input ends of the second maximum pooling layer and the second average pooling layer both receive all the feature maps in F2 and the input end of the sixth convolution layer receives all the feature maps in D2; in the 2nd information reconstruction block the two pooling layers receive all the feature maps in F4 and the sixth convolution layer receives all the feature maps in D3; in the 3rd information reconstruction block the two pooling layers receive all the feature maps in F6 and the sixth convolution layer receives all the feature maps in D4; in the 4th information reconstruction block the two pooling layers receive all the feature maps in F8 and the sixth convolution layer receives all the feature maps in D5. An element-wise subtraction operation is performed on all the feature maps output by the output end of the second maximum pooling layer and all the feature maps output by the output end of the second average pooling layer, and the input end of the fifth convolution layer receives all the feature maps obtained after the element-wise subtraction operation; an element-wise multiplication operation is performed on all the feature maps output by the output end of the fifth active layer and all the feature maps output by the output end of the seventh active layer, and an element-wise addition operation is then performed on all the feature maps output by the output end of the fifth active layer and all the feature maps obtained after the element-wise multiplication operation. For the 1st information reconstruction block the set formed by all the feature maps obtained after the element-wise addition operation is F3, for the 2nd it is F5, for the 3rd it is F7, and for the 4th it is F9. Let the number of input channels at the first input end of the j-th information reconstruction block be n1_j and the number of input channels at its second input end be n2_j; then n1_1 = 64 and n2_1 = 128 for the 1st information reconstruction block, n1_2 = 128 and n2_2 = 256 for the 2nd, n1_3 = 256 and n2_3 = 512 for the 3rd, and n1_4 = 512 and n2_4 = 512 for the 4th, where j = 1, 2, 3, 4. In the j-th information reconstruction block, the convolution kernel size of the fifth convolution layer is 1 × 1 with n2_j convolution kernels, step size 1 and zero-padding parameter 0; the convolution kernel size of the sixth convolution layer is 1 × 1 with n2_j convolution kernels, step size 1 and zero-padding parameter 0; and the convolution kernel size of the seventh convolution layer is 3 × 3 with n2_j convolution kernels, step size 1 and zero-padding parameter 1. The activation mode of the fifth, sixth and seventh active layers is 'ReLU', and the pooling kernel size of the second maximum pooling layer and the second average pooling layer is 2 × 2 with step size 2 and zero-padding parameter 0. When the element-wise subtraction operation is performed on all the feature maps output by the output end of the second maximum pooling layer and all the feature maps output by the output end of the second average pooling layer, the corresponding elements of the feature maps output by the output end of the second average pooling layer are subtracted from the elements of the feature maps output by the output end of the second maximum pooling layer.
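A sketch of the information reconstruction block described above; in_ch corresponds to n1_j, out_ch to n2_j, and all names are illustrative assumptions:

```python
import torch.nn as nn

class InformationReconstructionBlock(nn.Module):
    """Sketch of the information reconstruction (remodeling) block of step 1_2.
    f is the feature-reconstruction input (F2, F4, ...), d the depth-stream input (D2, D3, ...)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.max_pool = nn.MaxPool2d(2, stride=2)      # second maximum pooling layer
        self.avg_pool = nn.AvgPool2d(2, stride=2)      # second average pooling layer
        self.conv_block4 = nn.Sequential(              # 4th convolution block
            nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True))
        self.conv_block5 = nn.Sequential(              # 5th convolution block
            nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f, d):
        diff = self.max_pool(f) - self.avg_pool(f)     # element-wise subtraction
        a = self.conv_block4(diff)                     # fifth convolution + activation
        b = self.conv_block5(d)                        # processed depth features
        return a + a * b                               # multiply, then add
```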
In step 1_2, the 5 feature aggregation blocks have the same structure and each consists of a 6th convolution block, a 7th convolution block, an 8th convolution block, a 9th convolution block, a 10th convolution block, an 11th convolution block, a 12th convolution block, a 13th convolution block, a second up-sampling layer and a residual fusion block; the 6th convolution block comprises an eighth convolution layer and an eighth active layer which are sequentially connected, the 7th convolution block comprises a ninth convolution layer and a ninth active layer, the 8th convolution block comprises a tenth convolution layer and a tenth active layer, the 9th convolution block comprises an eleventh convolution layer and an eleventh active layer, the 10th convolution block comprises a twelfth convolution layer and a twelfth active layer, the 11th convolution block comprises a thirteenth convolution layer and a thirteenth active layer, the 12th convolution block comprises a fourteenth convolution layer and a fourteenth active layer, the 13th convolution block comprises a fifteenth convolution layer and a fifteenth active layer, and the residual fusion block comprises a third maximum pooling layer, a sixteenth convolution layer and a sixteenth active layer which are sequentially connected. In the 1st feature aggregation block, the input end of the eighth convolution layer receives all the feature maps in F10, the input end of the ninth convolution layer receives all the feature maps in P5 and the input end of the second up-sampling layer receives all the feature maps in F11; in the 2nd feature aggregation block these three inputs receive all the feature maps in F8, P4 and A1 respectively; in the 3rd feature aggregation block they receive all the feature maps in F6, P3 and A2 respectively; in the 4th feature aggregation block they receive all the feature maps in F4, P2 and A3 respectively; and in the 5th feature aggregation block they receive all the feature maps in F2, P1 and A4 respectively. All the feature maps output by the output end of the eighth active layer and all the feature maps output by the output end of the ninth active layer are each divided in sequence into four equal parts along the channel dimension; a first channel-number superposition operation is performed on the 1st part of the feature maps output by the eighth active layer and the 1st part of the feature maps output by the ninth active layer, a second channel-number superposition operation is performed on the two 2nd parts, a third channel-number superposition operation is performed on the two 3rd parts, and a fourth channel-number superposition operation is performed on the two 4th parts. The input end of the tenth convolution layer receives all the feature maps output by the output end of the second up-sampling layer; the input ends of the eleventh, twelfth, thirteenth and fourteenth convolution layers receive all the feature maps obtained after the first, second, third and fourth channel-number superposition operations respectively; a fifth channel-number superposition operation is performed on all the feature maps output by the output ends of the eleventh, twelfth, thirteenth and fourteenth active layers, and the input end of the fifteenth convolution layer receives all the feature maps obtained after the fifth channel-number superposition operation. An element-wise multiplication operation is performed on all the feature maps output by the output end of the tenth active layer and all the feature maps output by the output end of the fifteenth active layer, and a first element-wise addition operation is performed on all the feature maps output by the output end of the tenth active layer and all the feature maps obtained after the element-wise multiplication operation; the input end of the residual fusion block receives all the feature maps obtained after the first element-wise addition operation, and a second element-wise addition operation is performed on all the feature maps output by the output end of the residual fusion block and all the feature maps obtained after the first element-wise addition operation. For the 1st feature aggregation block the set formed by all the feature maps obtained after the second element-wise addition operation is A1, for the 2nd it is A2, for the 3rd it is A3, for the 4th it is A4, and for the 5th it is A5. Let the numbers of input channels at the first, second and third input ends of the m-th feature aggregation block be n1_m, n2_m and n3_m; then n1_1 = 512, n2_1 = 512 and n3_1 = 512 for the 1st feature aggregation block, n1_2 = 512, n2_2 = 512 and n3_2 = 256 for the 2nd, n1_3 = 256, n2_3 = 256 and n3_3 = 128 for the 3rd, n1_4 = 128, n2_4 = 128 and n3_4 = 64 for the 4th, and n1_5 = 64, n2_5 = 64 and n3_5 = 32 for the 5th, where m = 1, 2, 3, 4, 5. In the m-th feature aggregation block, the convolution kernel size of the eighth, ninth, tenth, eleventh, twelfth, thirteenth and fourteenth convolution layers is 3 × 3 with n3_m convolution kernels, step size 1 and zero-padding parameter 1; the convolution kernel size of the fifteenth convolution layer is 3 × 3 with n3_m convolution kernels, step size 1 and zero-padding parameter 0; and the convolution kernel size of the sixteenth convolution layer is 3 × 3 with n3_m/2 convolution kernels, step size 1 and zero-padding parameter 0. The activation mode of the eighth to sixteenth active layers is 'ReLU'; the pooling kernel size of the third maximum pooling layer is 5 × 5 with step size 1 and zero-padding parameter 2; the magnification of the second up-sampling layer is 2 and its interpolation method is bilinear interpolation.
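A sketch of the feature aggregation block wiring. The channel bookkeeping in the text is difficult to follow exactly, so this sketch uses out_ch (the declared number of output feature maps, i.e. n3_m/2) for every internal branch purely so that the element-wise operations line up; that simplification, and all names, are assumptions:

```python
import torch
import torch.nn as nn

def conv_relu(in_ch, out_ch, k=3, p=1):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=p),
                         nn.ReLU(inplace=True))

class FeatureAggregationBlock(nn.Module):
    """Sketch of the feature aggregation block of step 1_2."""
    def __init__(self, f_ch, p_ch, prev_ch, out_ch):
        super().__init__()
        self.conv8 = conv_relu(f_ch, out_ch)           # first input (F maps)
        self.conv9 = conv_relu(p_ch, out_ch)           # second input (P maps)
        self.upsample = nn.Upsample(scale_factor=2, mode="bilinear",
                                    align_corners=False)
        self.conv10 = conv_relu(prev_ch, out_ch)       # up-sampled third input
        self.group_convs = nn.ModuleList(              # 11th .. 14th convolution blocks
            [conv_relu(out_ch // 2, out_ch // 2) for _ in range(4)])
        self.conv15 = conv_relu(2 * out_ch, out_ch, k=1, p=0)
        self.residual = nn.Sequential(                 # residual fusion block
            nn.MaxPool2d(5, stride=1, padding=2),
            conv_relu(out_ch, out_ch))

    def forward(self, f, p, prev):
        a = self.conv8(f)
        b = self.conv9(p)
        g = self.conv10(self.upsample(prev))           # coarser aggregation result, x2
        # split a and b into four channel quarters and pair them up
        qa, qb = torch.chunk(a, 4, dim=1), torch.chunk(b, 4, dim=1)
        groups = [conv(torch.cat([qa[i], qb[i]], dim=1))
                  for i, conv in enumerate(self.group_convs)]
        mixed = self.conv15(torch.cat(groups, dim=1))  # fifth channel concatenation
        fused = g + g * mixed                          # multiply, first addition
        return fused + self.residual(fused)            # second element-wise addition
```

Under this reading, the 2nd feature aggregation block, for example, would be instantiated as FeatureAggregationBlock(512, 512, 256, 128) and applied to F8, P4 and A1.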
Compared with the prior art, the invention has the advantages that:
1) The convolutional neural network constructed by the method of the invention is a two-stream, end-to-end, interactive cyclic feature remodeling network architecture; the information flows of the two modalities communicate with each other so as to extract sufficient complementary information while the background noise of both modalities is suppressed, so that the convolutional neural network training model obtained by training has better saliency detection performance.
2) Information extraction blocks are provided in the convolutional neural network constructed by the method of the invention; through pooling operations they further extract the foreground information of the shallow depth maps and of the deep color maps, which facilitates full extraction of the information and enables the trained convolutional neural network training model to detect salient objects effectively.
3) Feature remodeling blocks and information remodeling blocks are designed in the convolutional neural network constructed by the method of the invention; the feature remodeling block fuses the color information using the depth information as weights, and the information remodeling block fuses the fused information of the feature remodeling block with the adjacent depth information again to obtain complementary context features, so that the convolutional neural network training model obtained by training can detect salient objects effectively.
4) Feature aggregation blocks are designed in the convolutional neural network constructed by the method of the invention, which fully fuse the local features and the global features of the two modalities, so that the convolutional neural network training model obtained by training can detect salient objects effectively.
Drawings
FIG. 1 is a schematic diagram of the structure of an end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 2 is a schematic diagram of the structure of an information extraction block in an end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 3 is a schematic diagram of the structure of the feature reconstruction block in the end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 4 is a schematic diagram of the structure of the information reconstruction block in the end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 5 is a schematic diagram of a structure of a feature aggregation block in an end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 6a is an RGB image of the 1 st pair of 3D images to be saliency detected;
FIG. 6b is a depth image of the 1 st pair of 3D images to be saliency detected;
FIG. 6c is a predicted salient image obtained by processing FIGS. 6a and 6b using the method of the present invention;
FIG. 6D is a label image corresponding to the 1 st pair of 3D images to be detected for saliency;
FIG. 7a is an RGB image of the 2 nd pair of 3D images to be saliency detected;
FIG. 7b is a depth image of the 2 nd pair of 3D images to be saliency detected;
FIG. 7c is a predicted salient image obtained by processing FIGS. 7a and 7b using the method of the present invention;
FIG. 7D is a label image corresponding to the 2 nd pair of 3D images to be saliency detected;
FIG. 8a is an RGB image of a3 rd pair of 3D images to be saliency detected;
FIG. 8b is a depth image of the 3 rd pair of 3D images to be saliency detected;
FIG. 8c is a predicted salient image obtained by processing FIGS. 8a and 8b using the method of the present invention;
FIG. 8D is a label image corresponding to the 3 rd pair of 3D images to be saliency detected;
FIG. 9a is an RGB image of the 4 th pair of 3D images to be saliency detected;
FIG. 9b is a depth image of the 4 th pair of 3D images to be saliency detected;
FIG. 9c is a predicted salient image obtained by processing FIGS. 9a and 9b using the method of the present invention;
FIG. 9D is a label image corresponding to the 4 th pair of 3D images to be saliency detected;
FIG. 10a is a PR (precision-recall) plot obtained by processing the 3D images for testing in the NJU2K dataset using the method of the present invention;
fig. 10b is a PR (precision-recall) plot obtained by processing a 3D image for detection in an NLPR dataset using the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The invention provides a significance image detection method for interactive cycle characteristic remodeling, which comprises a training stage and a testing stage.
The specific steps of the training phase process are as follows:
step 1_ 1: selecting N pairs of original 3D images and label images corresponding to each pair of original 3D images, and recording RGB images of the kth pair of original 3D images as
{I_k^RGB(x,y)}, denoting the depth image of the k-th pair of original 3D images as {I_k^D(x,y)}, and taking the real salient detection image corresponding to the k-th pair of original 3D images as its label image, denoted {G_k(x,y)}. Then the RGB images, the depth images and the corresponding label images of all the original 3D images form a training set. Each pair of original 3D images comprises an RGB image and a depth image; N is a positive integer, N ≥ 200, for example N = 600; k is a positive integer, 1 ≤ k ≤ N; 1 ≤ x ≤ W and 1 ≤ y ≤ H, where W represents the width and H represents the height of the original 3D image, of its RGB image and depth image, and of the corresponding label image, with W = H = 224 in this embodiment; I_k^RGB(x,y), I_k^D(x,y) and G_k(x,y) represent the pixel values of the pixel point whose coordinate position is (x,y) in {I_k^RGB(x,y)}, {I_k^D(x,y)} and {G_k(x,y)}, respectively.
Step 1_ 2: constructing an end-to-end convolutional neural network: as shown in fig. 1, the convolutional neural network comprises an input layer, an encoding part, a decoding part and an output layer, wherein the input layer comprises an RGB map input layer and a depth map input layer, the encoding part comprises 10 neural network blocks, and the decoding part comprises 2 information extraction blocks, 5 feature reconstruction blocks, 4 information reconstruction blocks, 5 expansion convolution blocks and 5 feature aggregation blocks; the output layer comprises an output convolutional layer, the size of a convolution kernel of the output convolutional layer is 3 multiplied by 3, the number of the convolution kernels is 1, the step length is 1, and the output convolutional layer is a commonly used convolutional layer.
For an RGB image input layer in the input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original RGB image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original RGB image to the encoding part; wherein, the width of the original RGB image is W, and the height is H.
For a depth map input layer in the input layer, the input end of the input layer receives a three-channel depth map processed by an original depth image by adopting a copying method, and the output end of the input layer outputs the three-channel depth map to a coding part; wherein, the width of the original depth image is W, and the height is H.
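The copying method mentioned above simply replicates the single depth channel three times. A minimal PyTorch sketch is given below; the tensor names and shapes are illustrative assumptions, not taken from the patent text.

import torch

def depth_to_three_channels(depth: torch.Tensor) -> torch.Tensor:
    # depth: (N, 1, H, W) single-channel depth maps -> (N, 3, H, W)
    return depth.repeat(1, 3, 1, 1)

if __name__ == "__main__":
    d = torch.rand(4, 1, 224, 224)          # batch of single-channel depth maps
    print(depth_to_three_channels(d).shape) # torch.Size([4, 3, 224, 224])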
For the coding part, the 1st, 2nd, 3rd, 4th and 5th neural network blocks are connected in sequence to form the color coding stream, and the 6th, 7th, 8th, 9th and 10th neural network blocks are connected in sequence to form the depth coding stream. The input end of the 1st neural network block receives the R channel component, G channel component and B channel component of the original RGB image output by the output end of the RGB map input layer; its output end outputs 64 feature maps, whose set is denoted S1; each feature map in S1 has a width of W and a height of H. The input end of the 2nd neural network block receives all the feature maps in S1; its output end outputs 128 feature maps, whose set is denoted S2; each feature map in S2 has a width of W/2 and a height of H/2. The input end of the 3rd neural network block receives all the feature maps in S2; its output end outputs 256 feature maps, whose set is denoted S3; each feature map in S3 has a width of W/4 and a height of H/4. The input end of the 4th neural network block receives all the feature maps in S3; its output end outputs 512 feature maps, whose set is denoted S4; each feature map in S4 has a width of W/8 and a height of H/8. The input end of the 5th neural network block receives all the feature maps in S4; its output end outputs 512 feature maps, whose set is denoted S5; each feature map in S5 has a width of W/16 and a height of H/16. The input end of the 6th neural network block receives the three-channel depth map output by the output end of the depth map input layer; its output end outputs 64 feature maps, whose set is denoted D1; each feature map in D1 has a width of W and a height of H. The input end of the 7th neural network block receives all the feature maps in D1; its output end outputs 128 feature maps, whose set is denoted D2; each feature map in D2 has a width of W/2 and a height of H/2. The input end of the 8th neural network block receives all the feature maps in D2; its output end outputs 256 feature maps, whose set is denoted D3; each feature map in D3 has a width of W/4 and a height of H/4. The input end of the 9th neural network block receives all the feature maps in D3; its output end outputs 512 feature maps, whose set is denoted D4; each feature map in D4 has a width of W/8 and a height of H/8. The input end of the 10th neural network block receives all the feature maps in D4; its output end outputs 512 feature maps, whose set is denoted D5; each feature map in D5 has a width of W/16 and a height of H/16. The encoding part provides all the feature maps in S1, S2, S3, S4, S5, D1, D2, D3, D4 and D5 to the decoding part.
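For reference, the two coding streams described above can be sketched in PyTorch using the VGG-16 feature blocks from torchvision. The class and variable names below are illustrative assumptions, not part of the patented method, but the block boundaries reproduce the stated channel counts and resolutions.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class TwoStreamVGGEncoder(nn.Module):
    # Five VGG-16 blocks per stream: S1..S5 for the RGB input, D1..D5 for the
    # copied three-channel depth input (64@W, 128@W/2, 256@W/4, 512@W/8, 512@W/16).
    def __init__(self):
        super().__init__()
        def make_blocks():
            f = vgg16().features   # pretrained weights may be loaded here if desired
            return nn.ModuleList([f[:4], f[4:9], f[9:16], f[16:23], f[23:30]])
        self.rgb_blocks = make_blocks()     # 1st..5th neural network blocks
        self.depth_blocks = make_blocks()   # 6th..10th neural network blocks

    def forward(self, rgb, depth3):
        s, d, x, y = [], [], rgb, depth3
        for blk in self.rgb_blocks:
            x = blk(x); s.append(x)          # S1..S5
        for blk in self.depth_blocks:
            y = blk(y); d.append(y)          # D1..D5
        return s, d

if __name__ == "__main__":
    s, d = TwoStreamVGGEncoder()(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
    print([tuple(t.shape[1:]) for t in s])   # (64,224,224), (128,112,112), ..., (512,14,14)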
For the decoding part, the input end of the 1st information extraction block receives all the feature maps in D1; its output end outputs 64 feature maps, whose set is denoted F1; each feature map in F1 has a width of W and a height of H. The first input end of the 1st feature reconstruction block receives all the feature maps in S1 and its second input end receives all the feature maps in F1; its output end outputs 64 feature maps, whose set is denoted F2; each feature map in F2 has a width of W and a height of H. The first input end of the 1st information reconstruction block receives all the feature maps in F2 and its second input end receives all the feature maps in D2; its output end outputs 128 feature maps, whose set is denoted F3; each feature map in F3 has a width of W/2 and a height of H/2. The first input end of the 2nd feature reconstruction block receives all the feature maps in S2 and its second input end receives all the feature maps in F3; its output end outputs 128 feature maps, whose set is denoted F4; each feature map in F4 has a width of W/2 and a height of H/2. The first input end of the 2nd information reconstruction block receives all the feature maps in F4 and its second input end receives all the feature maps in D3; its output end outputs 256 feature maps, whose set is denoted F5; each feature map in F5 has a width of W/4 and a height of H/4. The first input end of the 3rd feature reconstruction block receives all the feature maps in S3 and its second input end receives all the feature maps in F5; its output end outputs 256 feature maps, whose set is denoted F6; each feature map in F6 has a width of W/4 and a height of H/4. The first input end of the 3rd information reconstruction block receives all the feature maps in F6 and its second input end receives all the feature maps in D4; its output end outputs 512 feature maps, whose set is denoted F7; each feature map in F7 has a width of W/8 and a height of H/8. The first input end of the 4th feature reconstruction block receives all the feature maps in S4 and its second input end receives all the feature maps in F7; its output end outputs 512 feature maps, whose set is denoted F8; each feature map in F8 has a width of W/8 and a height of H/8. The first input end of the 4th information reconstruction block receives all the feature maps in F8 and its second input end receives all the feature maps in D5; its output end outputs 512 feature maps, whose set is denoted F9; each feature map in F9 has a width of W/16 and a height of H/16. The first input end of the 5th feature reconstruction block receives all the feature maps in S5 and its second input end receives all the feature maps in F9; its output end outputs 512 feature maps, whose set is denoted F10; each feature map in F10 has a width of W/16 and a height of H/16. The input end of the 2nd information extraction block receives all the feature maps in S5; its output end outputs 512 feature maps, whose set is denoted F11; each feature map in F11 has a width of W/16 and a height of H/16.
The input end of the 1st dilated convolution block receives all the feature maps in D1; its output end outputs 64 feature maps, whose set is denoted P1; each feature map in P1 has a width of W and a height of H. The input end of the 2nd dilated convolution block receives all the feature maps in D2; its output end outputs 128 feature maps, whose set is denoted P2; each feature map in P2 has a width of W/2 and a height of H/2. The input end of the 3rd dilated convolution block receives all the feature maps in D3; its output end outputs 256 feature maps, whose set is denoted P3; each feature map in P3 has a width of W/4 and a height of H/4. The input end of the 4th dilated convolution block receives all the feature maps in D4; its output end outputs 512 feature maps, whose set is denoted P4; each feature map in P4 has a width of W/8 and a height of H/8. The input end of the 5th dilated convolution block receives all the feature maps in D5; its output end outputs 512 feature maps, whose set is denoted P5; each feature map in P5 has a width of W/16 and a height of H/16.
The first input end of the 1st feature aggregation block receives all the feature maps in F10, its second input end receives all the feature maps in P5, and its third input end receives all the feature maps in F11; its output end outputs 256 feature maps, whose set is denoted A1; each feature map in A1 has a width of W/16 and a height of H/16. The first input end of the 2nd feature aggregation block receives all the feature maps in F8, its second input end receives all the feature maps in P4, and its third input end receives all the feature maps in A1; its output end outputs 128 feature maps, whose set is denoted A2; each feature map in A2 has a width of W/8 and a height of H/8. The first input end of the 3rd feature aggregation block receives all the feature maps in F6, its second input end receives all the feature maps in P3, and its third input end receives all the feature maps in A2; its output end outputs 64 feature maps, whose set is denoted A3; each feature map in A3 has a width of W/4 and a height of H/4. The first input end of the 4th feature aggregation block receives all the feature maps in F4, its second input end receives all the feature maps in P2, and its third input end receives all the feature maps in A3; its output end outputs 32 feature maps, whose set is denoted A4; each feature map in A4 has a width of W/2 and a height of H/2. The first input end of the 5th feature aggregation block receives all the feature maps in F2, its second input end receives all the feature maps in P1, and its third input end receives all the feature maps in A4; its output end outputs 16 feature maps, whose set is denoted A5; each feature map in A5 has a width of W and a height of H. The decoding part supplies all the feature maps in A5 to the output layer.
For the output layer, the input end of the output convolutional layer receives all the feature maps in A5, and the output end of the output convolutional layer outputs a feature map with the width W and the height H as a significance detection map.
Step 1_ 3: inputting three-channel depth maps obtained by copying R channel components, G channel components, B channel components and depth images of RGB images of all original 3D images in a training set into a convolutional neural network for training to obtain significance detection maps corresponding to each pair of original 3D images, and marking the significance detection map corresponding to the kth pair of original 3D images as
{S_k(x,y)}, where S_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {S_k(x,y)}.
Step 1_ 4: calculating a loss function value between a corresponding significance detection image and a corresponding label image of each pair of original 3D images
, i.e. the loss function value between {S_k(x,y)} and {G_k(x,y)}, which is recorded as Loss_k.
In this embodiment, the loss function value is obtained using the conventional two-class cross entropy.
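A compact PyTorch sketch of steps 1_3 to 1_5 is given below (the repetition and the selection of the optimal weights are described in step 1_5, which follows). It assumes that a model, a train_loader yielding (RGB, three-channel depth, label) batches, and an optimizer already exist; all names are illustrative assumptions, and the loss is the conventional binary (two-class) cross-entropy mentioned above.

import copy
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()           # two-class cross entropy on the raw output maps

def train(model, train_loader, optimizer, M, device="cpu"):
    best_loss, best_state = float("inf"), None
    model.to(device)
    for epoch in range(M):                   # repeat steps 1_3 and 1_4 M times
        epoch_loss, n_pairs = 0.0, 0
        for rgb, depth3, label in train_loader:
            rgb, depth3, label = rgb.to(device), depth3.to(device), label.to(device)
            pred = model(rgb, depth3)        # saliency detection maps
            loss = criterion(pred, label)    # loss against the label images
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * rgb.size(0)
            n_pairs += rgb.size(0)
        final_loss = epoch_loss / n_pairs    # sum of the N loss values divided by N
        if final_loss < best_loss:           # keep the weights and biases corresponding
            best_loss = final_loss           # to the smallest final loss function value
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model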
Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for M times to obtain a convolutional neural network training model, and obtaining N multiplied by M loss function values; dividing the sum of the N loss function values obtained by each execution by N to obtain final loss function values obtained by the execution, and obtaining M final loss function values in total; finding out the final loss function value with the minimum value from the M final loss function values, and correspondingly taking the weight vector and the bias item corresponding to the minimum final loss function value as the optimal weight vector and the optimal bias item of the convolutional neural network training model; where M > 1, M is 1025 in this example.
The test stage process comprises the following specific steps:
step 2_ 1: and inputting a three-channel depth map obtained by copying R channel components, G channel components, B channel components and depth images of the RGB images of the 3D images to be subjected to significance detection into a convolutional neural network training model, predicting by using the optimal weight vector and the optimal bias term, and predicting to obtain a corresponding significance prediction image.
In this embodiment, in step 1_2, the 2 information extraction blocks have the same structure; as shown in FIG. 2, each is composed of a 1st convolution block, a first maximum pooling layer (Max pooling), a first average pooling layer (Average pooling), a 2nd convolution block, a 3rd convolution block and a first up-sampling layer. The 1st convolution block comprises a first convolution layer (Conv), a first activation layer (Act), a second convolution layer and a second activation layer connected in sequence; the 2nd convolution block comprises a third convolution layer and a third activation layer connected in sequence; the 3rd convolution block comprises a fourth convolution layer and a fourth activation layer connected in sequence. The input end of the first convolution layer in the 1st information extraction block receives all the feature maps in D1, and the input end of the first convolution layer in the 2nd information extraction block receives all the feature maps in S5. The input end of the first maximum pooling layer, the input end of the first average pooling layer and the input end of the third convolution layer all receive all the feature maps output by the output end of the second activation layer; a channel number superposition operation is performed on all the feature maps output by the output end of the first maximum pooling layer and all the feature maps output by the output end of the first average pooling layer; the input end of the fourth convolution layer receives all the feature maps obtained after the channel number superposition operation; the input end of the first up-sampling layer receives all the feature maps output by the output end of the fourth activation layer; an element multiplication operation is performed on all the feature maps output by the output end of the first up-sampling layer and all the feature maps output by the output end of the third activation layer; and an element addition operation is performed on all the feature maps output by the output end of the first up-sampling layer and all the feature maps obtained after the element multiplication operation. For the 1st information extraction block, the set formed by all the feature maps obtained after the element addition operation is F1; for the 2nd information extraction block, the set formed by all the feature maps obtained after the element addition operation is F11. Let the number of input channels of the i-th information extraction block be n_i; then the number of input channels of the 1st information extraction block is n_1 = 64 and that of the 2nd information extraction block is n_2 = 512. In the i-th information extraction block, the convolution kernel size (kernel_size) of the first convolution layer and the fourth convolution layer is 1 × 1, the number of convolution kernels (filters) is n_i, the step length (stride) is 1 and the zero padding parameter (padding) value is 0; the convolution kernel size of the second convolution layer is 3 × 3, the number of convolution kernels is n_i, the step length is 1 and the zero padding parameter value is 0; the convolution kernel size of the third convolution layer is 3 × 3, the number of convolution kernels is n_i, the step length is 1 and the zero padding parameter value is 1; i = 1, 2. The activation mode of the first, second, third and fourth activation layers is "Relu"; the pooling kernel size of the first maximum pooling layer and the first average pooling layer is 2 × 2, the step length is 2 and the zero padding parameter value is 0; the magnification (scale_factor) of the first up-sampling layer is 2 and the interpolation method is bilinear interpolation (bilinear). Here, the channel number superposition operation, the element multiplication operation and the element addition operation are all prior art. In FIG. 2, C denotes the channel number superposition operation, + denotes the element addition operation and × denotes the element multiplication operation.
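A simplified PyTorch sketch of this block follows. It reproduces the wiring of FIG. 2 (two-layer convolution block, parallel 2 × 2 max/average pooling, channel concatenation, 1 × 1 convolution, ×2 bilinear up-sampling, a 3 × 3 branch, then element multiplication and addition), but the padding of every 3 × 3 convolution is set to 1 here so that all spatial sizes line up; that choice is an assumption of the sketch, not the quoted parameter values.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationExtractionBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv_block1 = nn.Sequential(               # 1st convolution block
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.conv_block2 = nn.Sequential(               # 2nd convolution block (3x3 branch)
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.conv_block3 = nn.Sequential(               # 3rd convolution block (fuses pooled copies)
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(inplace=True))
        self.max_pool = nn.MaxPool2d(2, stride=2)
        self.avg_pool = nn.AvgPool2d(2, stride=2)

    def forward(self, x):
        t = self.conv_block1(x)
        pooled = torch.cat([self.max_pool(t), self.avg_pool(t)], dim=1)
        up = F.interpolate(self.conv_block3(pooled), scale_factor=2,
                           mode="bilinear", align_corners=False)
        branch = self.conv_block2(t)
        return up + up * branch                         # element multiply, then element add

if __name__ == "__main__":
    f1 = InformationExtractionBlock(64)(torch.rand(1, 64, 224, 224))
    print(f1.shape)                                     # 1 x 64 x 224 x 224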
In this embodiment, in step 1_2, the 5 feature reconstruction blocks have the same structure; as shown in FIG. 3, each consists of a contextual attention block and a channel attention block. For the 1st feature reconstruction block, a first element addition operation is performed on all the feature maps in S1 and all the feature maps in F1; the input end of the contextual attention block receives all the feature maps obtained after the first element addition operation; the input end of the channel attention block receives all the feature maps output by the output end of the contextual attention block; an element multiplication operation is performed on all the feature maps output by the output end of the channel attention block and all the feature maps obtained after the first element addition operation; a second element addition operation is then performed on all the feature maps in S1 and all the feature maps obtained after the element multiplication operation, and the set formed by all the feature maps obtained after the second element addition operation is F2. For the 2nd feature reconstruction block, the same operations are performed with S2 and F3 in place of S1 and F1, and the set obtained after the second element addition operation is F4; for the 3rd feature reconstruction block, the same operations are performed with S3 and F5, and the set obtained is F6; for the 4th feature reconstruction block, the same operations are performed with S4 and F7, and the set obtained is F8; for the 5th feature reconstruction block, the same operations are performed with S5 and F9, and the set obtained is F10. Here, the contextual attention block and the channel attention block are the DAM module from M. Zhang, S.-X. Fei, J. Liu, S. Xu, Y. Piao, and H. Lu, "Asymmetric two-stream architecture for accurate RGB-D saliency detection," in Proceedings of the European Conference on Computer Vision, 2020. In FIG. 3, + denotes the element addition operation and × denotes the element multiplication operation.
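The sketch below shows only the add / attend / multiply / add wiring of this block; the contextual and channel attention sub-blocks are simple stand-ins (a 7 × 7 spatial attention and an SE-style channel attention), not the cited DAM module.

import torch
import torch.nn as nn

class FeatureReconstructionBlock(nn.Module):
    # F_out = S + Att(S + F) * (S + F), with Att = channel attention applied to
    # the output of a contextual attention step.
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.context_att = nn.Sequential(              # stand-in contextual attention
            nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())
        self.channel_att = nn.Sequential(              # stand-in channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, s, f):
        fused = s + f                                  # first element addition
        ctx = fused * self.context_att(fused)          # contextual attention output
        att = ctx * self.channel_att(ctx)              # channel attention output
        return s + att * fused                         # multiply, then second addition

if __name__ == "__main__":
    blk = FeatureReconstructionBlock(64)
    print(blk(torch.rand(1, 64, 224, 224), torch.rand(1, 64, 224, 224)).shape)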
In this embodiment, in step 1_2, the 4 information reconstruction blocks have the same structure; as shown in FIG. 4, each consists of a 4th convolution block, a 5th convolution block, a second maximum pooling layer and a second average pooling layer. The 4th convolution block comprises a fifth convolution layer and a fifth activation layer connected in sequence; the 5th convolution block comprises a sixth convolution layer, a sixth activation layer, a seventh convolution layer and a seventh activation layer connected in sequence. In the 1st information reconstruction block, the input ends of the second maximum pooling layer and the second average pooling layer both receive all the feature maps in F2 and the input end of the sixth convolution layer receives all the feature maps in D2; in the 2nd information reconstruction block, the corresponding inputs are F4 and D3; in the 3rd, F6 and D4; and in the 4th, F8 and D5. An element subtraction operation is performed on all the feature maps output by the output end of the second maximum pooling layer and all the feature maps output by the output end of the second average pooling layer; the input end of the fifth convolution layer receives all the feature maps obtained after the element subtraction operation; an element multiplication operation is performed on all the feature maps output by the output end of the fifth activation layer and all the feature maps output by the output end of the seventh activation layer; and an element addition operation is performed on all the feature maps output by the output end of the fifth activation layer and all the feature maps obtained after the element multiplication operation. For the 1st information reconstruction block, the set formed by all the feature maps obtained after the element addition operation is F3; for the 2nd it is F5; for the 3rd it is F7; and for the 4th it is F9. Let the number of input channels of the first input end of the j-th information reconstruction block be n1_j and that of its second input end be n2_j; then n1_1 = 64 and n2_1 = 128 for the 1st block, n1_2 = 128 and n2_2 = 256 for the 2nd, n1_3 = 256 and n2_3 = 512 for the 3rd, and n1_4 = 512 and n2_4 = 512 for the 4th; j = 1, 2, 3, 4. In the j-th information reconstruction block, the convolution kernel size of the fifth convolution layer is 1 × 1, the number of convolution kernels is n2_j, the step length is 1 and the zero padding parameter value is 0; the convolution kernel size of the sixth convolution layer is 1 × 1, the number of convolution kernels is n2_j, the step length is 1 and the zero padding parameter value is 0; the convolution kernel size of the seventh convolution layer is 3 × 3, the number of convolution kernels is n2_j, the step length is 1 and the zero padding parameter value is 1. The activation mode of the fifth, sixth and seventh activation layers is "Relu"; the pooling kernel size of the second maximum pooling layer and the second average pooling layer is 2 × 2, the step length is 2 and the zero padding parameter value is 0. When the element subtraction operation is performed, each element in a feature map output by the second maximum pooling layer has subtracted from it the corresponding element of the corresponding feature map output by the second average pooling layer. Here, the element subtraction operation, the element multiplication operation and the element addition operation are all prior art. In FIG. 4, - denotes the element subtraction operation, + denotes the element addition operation and × denotes the element multiplication operation.
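A sketch of this block under the same conventions; the channel counts follow the n1_j / n2_j values quoted above, and all names are illustrative assumptions.

import torch
import torch.nn as nn

class InformationReconstructionBlock(nn.Module):
    # out = a + a * b, where a is the 1x1-projected difference of max- and
    # average-pooled F, and b is the 1x1 + 3x3 convolution of the depth features D.
    def __init__(self, in_ch_f: int, in_ch_d: int):
        super().__init__()
        self.max_pool = nn.MaxPool2d(2, stride=2)
        self.avg_pool = nn.AvgPool2d(2, stride=2)
        self.conv_block4 = nn.Sequential(              # fifth convolution layer + ReLU
            nn.Conv2d(in_ch_f, in_ch_d, 1), nn.ReLU(inplace=True))
        self.conv_block5 = nn.Sequential(              # sixth + seventh convolution layers
            nn.Conv2d(in_ch_d, in_ch_d, 1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch_d, in_ch_d, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f, d):
        diff = self.max_pool(f) - self.avg_pool(f)     # element subtraction
        a = self.conv_block4(diff)
        b = self.conv_block5(d)
        return a + a * b                               # element multiply, then add

if __name__ == "__main__":
    blk = InformationReconstructionBlock(64, 128)      # e.g. the 1st block: F2 (64 ch) and D2 (128 ch)
    out = blk(torch.rand(1, 64, 224, 224), torch.rand(1, 128, 112, 112))
    print(out.shape)                                   # 1 x 128 x 112 x 112 (F3)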
In this embodiment, in step 1_2, the 5 feature aggregation blocks have the same structure; as shown in FIG. 5, each feature aggregation block consists of a 6th convolution block, a 7th convolution block, an 8th convolution block, a 9th convolution block, a 10th convolution block, an 11th convolution block, a 12th convolution block, a 13th convolution block, a second up-sampling layer and a residual fusion block. The 6th convolution block comprises an eighth convolution layer and an eighth activation layer connected in sequence; the 7th comprises a ninth convolution layer and a ninth activation layer; the 8th comprises a tenth convolution layer and a tenth activation layer; the 9th comprises an eleventh convolution layer and an eleventh activation layer; the 10th comprises a twelfth convolution layer and a twelfth activation layer; the 11th comprises a thirteenth convolution layer and a thirteenth activation layer; the 12th comprises a fourteenth convolution layer and a fourteenth activation layer; the 13th comprises a fifteenth convolution layer and a fifteenth activation layer; and the residual fusion block comprises a sixteenth activation layer, a third maximum pooling layer and a sixteenth convolution layer connected in sequence. In the 1st feature aggregation block, the input end of the eighth convolution layer receives all the feature maps in F10, the input end of the ninth convolution layer receives all the feature maps in P5, and the input end of the second up-sampling layer receives all the feature maps in F11; in the 2nd feature aggregation block the corresponding inputs are F8, P4 and A1; in the 3rd they are F6, P3 and A2; in the 4th they are F4, P2 and A3; and in the 5th they are F2, P1 and A4. All the feature maps output by the output end of the eighth activation layer and all the feature maps output by the output end of the ninth activation layer are each cut into four equal parts along the channel dimension; a first channel number superposition operation is performed on the 1st part from the eighth activation layer and the 1st part from the ninth activation layer, a second on the two 2nd parts, a third on the two 3rd parts, and a fourth on the two 4th parts. The input end of the tenth convolution layer receives all the feature maps output by the output end of the second up-sampling layer; the input ends of the eleventh, twelfth, thirteenth and fourteenth convolution layers receive all the feature maps obtained after the first, second, third and fourth channel number superposition operations, respectively. A fifth channel number superposition operation is performed on all the feature maps output by the output ends of the eleventh, twelfth, thirteenth and fourteenth activation layers, and the input end of the fifteenth convolution layer receives all the feature maps obtained after the fifth channel number superposition operation. An element multiplication operation is performed on all the feature maps output by the output end of the tenth activation layer and all the feature maps output by the output end of the fifteenth activation layer; a first element addition operation is performed on all the feature maps output by the output end of the tenth activation layer and all the feature maps obtained after the element multiplication operation; the input end of the sixteenth activation layer receives all the feature maps obtained after the first element addition operation; and a second element addition operation is performed on all the feature maps output by the output end of the sixteenth convolution layer and all the feature maps obtained after the first element addition operation. For the 1st feature aggregation block, the set formed by all the feature maps obtained after the second element addition operation is A1; for the 2nd it is A2; for the 3rd it is A3; for the 4th it is A4; and for the 5th it is A5. Let the numbers of input channels of the first, second and third input ends of the m-th feature aggregation block be n1_m, n2_m and n3_m, respectively; then for the 1st feature aggregation block n1_1 = 512, n2_1 = 512 and n3_1 = 512; for the 2nd, n1_2 = 512, n2_2 = 512 and n3_2 = 256; for the 3rd, n1_3 = 256, n2_3 = 256 and n3_3 = 128; for the 4th, n1_4 = 128, n2_4 = 128 and n3_4 = 64; and for the 5th, n1_5 = 64, n2_5 = 64 and n3_5 = 32. In the m-th feature aggregation block, the eighth, ninth, tenth, eleventh, twelfth, thirteenth and fourteenth convolution layers all have a convolution kernel size of 3 × 3, n3_m convolution kernels, a step length of 1 and a zero padding parameter value of 1; the fifteenth convolution layer has a convolution kernel size of 3 × 3, n3_m convolution kernels, a step length of 1 and a zero padding parameter value of 0; the sixteenth convolution layer has a convolution kernel size of 3 × 3, n3_m/2 convolution kernels, a step length of 1 and a zero padding parameter value of 0; m = 1, 2, 3, 4, 5. The activation mode of the eighth through sixteenth activation layers is "Relu"; the pooling kernel size of the third maximum pooling layer is 5 × 5, with a step length of 1 and a zero padding parameter value of 2; the magnification of the second up-sampling layer is 2 and the interpolation method is bilinear interpolation. Here, the channel number superposition operation, the element multiplication operation and the element addition operation are all prior art. In FIG. 5, C denotes the channel number superposition operation, + denotes the element addition operation and × denotes the element multiplication operation.
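A simplified sketch of the feature aggregation block follows. The wiring reproduces FIG. 5 (two 3 × 3 branches, channel quartering and pairwise concatenation, four 3 × 3 group convolutions, re-concatenation and fusion, gating by the up-sampled coarser result, and a ReLU → 5 × 5 max-pool → 3 × 3 convolution residual fusion block), but the paddings and the final channel reduction are chosen here so that every shape lines up, which departs slightly from the kernel counts quoted above; treat it as an illustration of the data flow, not the exact patented configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_relu(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class FeatureAggregationBlock(nn.Module):
    def __init__(self, ch_f: int, ch_p: int, ch_a: int, out_ch: int):
        super().__init__()
        mid = ch_a                                     # n3_m in the description
        self.conv_f = conv_relu(ch_f, mid)             # 6th convolution block (8th conv layer)
        self.conv_p = conv_relu(ch_p, mid)             # 7th convolution block (9th conv layer)
        self.conv_a = conv_relu(ch_a, mid)             # 8th convolution block (10th conv layer)
        self.group_convs = nn.ModuleList(              # 9th..12th conv blocks (11th..14th layers)
            [conv_relu(mid // 2, mid) for _ in range(4)])
        self.conv_fuse = conv_relu(4 * mid, mid)       # 13th convolution block (15th conv layer)
        self.residual = nn.Sequential(                 # residual fusion block
            nn.ReLU(inplace=True), nn.MaxPool2d(5, stride=1, padding=2),
            nn.Conv2d(mid, mid, 3, padding=1))
        self.reduce = nn.Conv2d(mid, out_ch, 1)        # assumed 1x1 reduction to the output width

    def forward(self, f, p, a_prev):
        f, p = self.conv_f(f), self.conv_p(p)
        g = self.conv_a(F.interpolate(a_prev, scale_factor=2, mode="bilinear",
                                      align_corners=False))
        f4, p4 = torch.chunk(f, 4, dim=1), torch.chunk(p, 4, dim=1)
        groups = [gc(torch.cat([fi, pi], dim=1))       # pairwise concat + 3x3 conv
                  for gc, fi, pi in zip(self.group_convs, f4, p4)]
        h = self.conv_fuse(torch.cat(groups, dim=1))
        r = g + g * h                                  # gate by the coarser features
        out = r + self.residual(r)                     # residual fusion
        return self.reduce(out)

if __name__ == "__main__":
    blk = FeatureAggregationBlock(ch_f=512, ch_p=512, ch_a=256, out_ch=128)  # e.g. the 2nd block
    a2 = blk(torch.rand(1, 512, 28, 28), torch.rand(1, 512, 28, 28), torch.rand(1, 256, 14, 14))
    print(a2.shape)                                    # 1 x 128 x 28 x 28 (A2)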
In this embodiment, the structures of the 10 neural network blocks are the same and adopt the structure of the neural network blocks in the existing VGG-16 model; the 5 dilated convolution blocks have the same structure and are the RFB module from S. Liu and D. Huang, "Receptive field block net for accurate and fast object detection," in Proceedings of the European Conference on Computer Vision, 2018, pp. 385-400.
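The dilated convolution blocks used here are therefore the cited RFB modules. The block below is not that module; it is only a generic multi-branch dilated convolution, included to illustrate how dilation enlarges the receptive field while keeping the spatial size unchanged.

import torch
import torch.nn as nn

class SimpleDilatedBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Three 3x3 branches with dilation 1, 3 and 5; padding equals the dilation,
        # so each branch preserves the spatial size of the input.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d) for d in (1, 3, 5)])
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

if __name__ == "__main__":
    print(SimpleDilatedBlock(64, 64)(torch.rand(1, 64, 224, 224)).shape)  # 1 x 64 x 224 x 224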
To further illustrate the feasibility and effectiveness of the method of the present invention, experiments were conducted on the method of the present invention.
The method of the invention was implemented in the Python language using the PyTorch library. The experimental equipment was an Intel i5-7500 processor, with CUDA acceleration on an NVIDIA TITAN XP 12GB graphics card. To ensure the rigor of the experiment, the data sets selected for the experiment were NJU2K and NLPR, which are well-known public data sets. NJU2K contains 1485 pairs of 3D images, of which 1400 pairs are used for training and 85 pairs for testing; NLPR contains 730 pairs of 3D images, of which 650 pairs are used for training and 80 pairs for testing.
In this experiment, 4 objective parameters commonly used to evaluate saliency detection methods were used as evaluation indexes: S↑ (Structure-measure), which evaluates the structural similarity between the salient regions in the saliency detection map and in the label image; adpE↑ and adpF↑, which evaluate the detection performance of the saliency detection map and are computed from the precision rate and the recall rate, and are therefore important indexes for judging the quality of a detection method; and MAE↓ (Mean Absolute Error).
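Of these indexes, MAE is the simplest to state. A sketch of its computation, assuming both maps are normalized to [0, 1] and have identical size:

import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    # mean absolute difference between the saliency prediction and the label image
    return float(np.mean(np.abs(pred.astype(np.float64) - gt.astype(np.float64))))

if __name__ == "__main__":
    pred = np.random.rand(224, 224)
    gt = (np.random.rand(224, 224) > 0.5).astype(np.float64)
    print(mae(pred, gt))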
The saliency detection maps generated by the method of the invention were compared with the label images, and S↑, adpE↑, adpF↑ and MAE↓ were used as the evaluation indexes. The evaluation results on the two data sets are listed in Table 1; the data in Table 1 show that the method of the invention performs well on both data sets.
TABLE 1 evaluation results of the method of the invention on two data sets
Fig. 6a is an RGB image of the 1 st pair of 3D images to be subjected to saliency detection, fig. 6b is a depth image of the 1 st pair of 3D images to be subjected to saliency detection, fig. 6c is a saliency prediction image obtained by processing the fig. 6a and 6b by using the method of the present invention, and fig. 6D is a label image corresponding to the 1 st pair of 3D images to be subjected to saliency detection; fig. 7a is an RGB image of the 2 nd pair of 3D images to be subjected to saliency detection, fig. 7b is a depth image of the 2 nd pair of 3D images to be subjected to saliency detection, fig. 7c is a saliency prediction image obtained by processing fig. 7a and 7b by using the method of the present invention, and fig. 7D is a label image corresponding to the 2 nd pair of 3D images to be subjected to saliency detection; fig. 8a is an RGB image of a3 rd pair of 3D images to be subjected to saliency detection, fig. 8b is a depth image of the 3 rd pair of 3D images to be subjected to saliency detection, fig. 8c is a saliency prediction image obtained by processing fig. 8a and 8b by using the method of the present invention, and fig. 8D is a label image corresponding to the 3 rd pair of 3D images to be subjected to saliency detection; fig. 9a is an RGB image of a4 th pair of 3D images to be subjected to saliency detection, fig. 9b is a depth image of the 4 th pair of 3D images to be subjected to saliency detection, fig. 9c is a saliency prediction image obtained by processing fig. 9a and 9b by using the method of the present invention, and fig. 9D is a label image corresponding to the 4 th pair of 3D images to be subjected to saliency detection. Fig. 6a and 6b, fig. 7a and 7b, fig. 8a and 8b, and fig. 9a and 9b are representative 3D images containing a plurality of objects, small objects, and complex salient objects, and these representative 3D images are processed by the method of the present invention, and the salient predictive images are correspondingly shown in fig. 6c, fig. 7c, fig. 8c, and fig. 9c, and compared with fig. 6D, fig. 7D, fig. 8D, and fig. 9D, it can be found that the salient regions in these 3D images can be accurately captured by the method of the present invention.
Fig. 10a is the PR (precision-recall) curve obtained by processing the 3D images for testing in the NJU2K data set using the method of the present invention, and fig. 10b is the PR (precision-recall) curve obtained by processing the 3D images for testing in the NLPR data set using the method of the present invention. As can be seen from fig. 10a and fig. 10b, the area under the PR curve is large, which indicates that the method of the present invention has good detection performance. In fig. 10a and fig. 10b, Precision represents the precision rate and Recall represents the recall rate.

Claims (5)

1. A method for detecting a significance image of interactive cycle feature remodeling is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting N pairs of original 3D images and label images corresponding to each pair of original 3D images, and recording RGB images of the kth pair of original 3D images as
{I_k^RGB(x,y)}, denoting the depth image of the k-th pair of original 3D images as {I_k^D(x,y)}, and taking the real salient detection image corresponding to the k-th pair of original 3D images as its label image, denoted {G_k(x,y)}; then forming a training set from the RGB images, the depth images and the corresponding label images of all the original 3D images; wherein N is a positive integer, N ≥ 200, k is a positive integer, 1 ≤ k ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, W represents the width of the original 3D image, of its RGB image and depth image, and of the corresponding label image, H represents the height of the original 3D image, of its RGB image and depth image, and of the corresponding label image, and I_k^RGB(x,y), I_k^D(x,y) and G_k(x,y) represent the pixel values of the pixel point whose coordinate position is (x,y) in {I_k^RGB(x,y)}, {I_k^D(x,y)} and {G_k(x,y)}, respectively;
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, an encoding part, a decoding part and an output layer, wherein the input layer comprises an RGB (red, green and blue) image input layer and a depth image input layer, the encoding part comprises 10 neural network blocks, and the decoding part comprises 2 information extraction blocks, 5 characteristic reconstruction blocks, 4 information reconstruction blocks, 5 expansion convolution blocks and 5 characteristic aggregation blocks; the output layer comprises an output convolution layer, the size of convolution kernels of the output convolution layer is 3 multiplied by 3, the number of the convolution kernels is 1, and the step length is 1;
for an RGB image input layer in the input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original RGB image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original RGB image to the encoding part; wherein, the width of the original RGB image is W, and the height of the original RGB image is H;
for a depth map input layer in the input layer, the input end of the input layer receives a three-channel depth map processed by an original depth image by adopting a copying method, and the output end of the input layer outputs the three-channel depth map to a coding part; wherein the width of the original depth image is W, and the height of the original depth image is H;
for the coding part, the 1st, 2nd, 3rd, 4th and 5th neural network blocks are connected in sequence to form a color coding stream, and the 6th, 7th, 8th, 9th and 10th neural network blocks are connected in sequence to form a depth coding stream; the input end of the 1st neural network block receives the R channel component, G channel component and B channel component of the original RGB image output by the output end of the RGB map input layer, the output end of the 1st neural network block outputs 64 feature maps, the set formed by the 64 feature maps is denoted S1, and each feature map in S1 has a width of W and a height of H; the input end of the 2nd neural network block receives all the feature maps in S1, its output end outputs 128 feature maps, the set formed by the 128 feature maps is denoted S2, and each feature map in S2 has a width of W/2 and a height of H/2; the input end of the 3rd neural network block receives all the feature maps in S2, its output end outputs 256 feature maps, the set formed by the 256 feature maps is denoted S3, and each feature map in S3 has a width of W/4 and a height of H/4; the input end of the 4th neural network block receives all the feature maps in S3, its output end outputs 512 feature maps, the set formed by the 512 feature maps is denoted S4, and each feature map in S4 has a width of W/8 and a height of H/8; the input end of the 5th neural network block receives all the feature maps in S4, its output end outputs 512 feature maps, the set formed by the 512 feature maps is denoted S5, and each feature map in S5 has a width of W/16 and a height of H/16; the input end of the 6th neural network block receives the three-channel depth map output by the output end of the depth map input layer, its output end outputs 64 feature maps, the set formed by the 64 feature maps is denoted D1, and each feature map in D1 has a width of W and a height of H; the input end of the 7th neural network block receives all the feature maps in D1, its output end outputs 128 feature maps, the set formed by the 128 feature maps is denoted D2, and each feature map in D2 has a width of W/2 and a height of H/2; the input end of the 8th neural network block receives all the feature maps in D2, its output end outputs 256 feature maps, the set formed by the 256 feature maps is denoted D3, and each feature map in D3 has a width of W/4 and a height of H/4; the input end of the 9th neural network block receives all the feature maps in D3, its output end outputs 512 feature maps, the set formed by the 512 feature maps is denoted D4, and each feature map in D4 has a width of W/8 and a height of H/8; the input end of the 10th neural network block receives all the feature maps in D4, its output end outputs 512 feature maps, the set formed by the 512 feature maps is denoted D5, and each feature map in D5 has a width of W/16 and a height of H/16;
The encoding part provides all the feature maps of S1, S2, S3, S4, S5, D1, D2, D3, D4 and D5 to the decoding part;
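A minimal PyTorch sketch of the two-stream coding part follows. The channel counts (64, 128, 256, 512, 512) and the halving of width and height at each of the 2nd to 5th blocks follow the claim; the internal composition of each neural network block (two or three 3 × 3 convolutions per stage, VGG-16 style) is an assumption made only for illustration.

import torch
import torch.nn as nn

def encoder_stage(in_ch, out_ch, num_convs, pool_first):
    # One neural network block: optional 2x2 max pooling (halves W and H),
    # followed by num_convs 3x3 convolution + ReLU pairs (assumed composition).
    layers = [nn.MaxPool2d(2, 2)] if pool_first else []
    for k in range(num_convs):
        layers += [nn.Conv2d(in_ch if k == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class TwoStreamEncoder(nn.Module):
    # Color stream (blocks 1-5) and depth stream (blocks 6-10); each stage outputs
    # 64, 128, 256, 512, 512 feature maps at scales 1, 1/2, 1/4, 1/8, 1/16 of W x H.
    def __init__(self):
        super().__init__()
        cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 3), (256, 512, 3), (512, 512, 3)]
        self.color_stream = nn.ModuleList(
            [encoder_stage(i, o, c, pool_first=(k > 0)) for k, (i, o, c) in enumerate(cfg)])
        self.depth_stream = nn.ModuleList(
            [encoder_stage(i, o, c, pool_first=(k > 0)) for k, (i, o, c) in enumerate(cfg)])

    def forward(self, rgb, depth3):
        S, D = [], []
        x, y = rgb, depth3
        for color_block, depth_block in zip(self.color_stream, self.depth_stream):
            x, y = color_block(x), depth_block(y)
            S.append(x)
            D.append(y)
        return S, D  # S1..S5 and D1..D5

For a W × H input, the two returned lists correspond to S1–S5 and D1–D5 at scales 1, 1/2, 1/4, 1/8 and 1/16, matching the sizes stated above.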
for the decoding part, the input end of the 1st information extraction block receives all the feature maps in D1, the output end of the 1st information extraction block outputs 64 feature maps, the set formed by these 64 feature maps is denoted as F1, and each feature map in F1 has a width of W and a height of H; the first input end of the 1st feature reconstruction block receives all the feature maps in S1, the second input end receives all the feature maps in F1, the output end of the 1st feature reconstruction block outputs 64 feature maps, the set formed by these 64 feature maps is denoted as F2, and each feature map in F2 has a width of W and a height of H; the first input end of the 1st information reconstruction block receives all the feature maps in F2, the second input end receives all the feature maps in D2, the output end of the 1st information reconstruction block outputs 128 feature maps, the set formed by these 128 feature maps is denoted as F3, and each feature map in F3 has a width of W/2 and a height of H/2;
the first input end of the 2nd feature reconstruction block receives all the feature maps in S2, the second input end receives all the feature maps in F3, the output end of the 2nd feature reconstruction block outputs 128 feature maps, the set formed by these 128 feature maps is denoted as F4, and each feature map in F4 has a width of W/2 and a height of H/2;
the first input end of the 2nd information reconstruction block receives all the feature maps in F4, the second input end receives all the feature maps in D3, the output end of the 2nd information reconstruction block outputs 256 feature maps, the set formed by these 256 feature maps is denoted as F5, and each feature map in F5 has a width of W/4 and a height of H/4;
the first input end of the 3rd feature reconstruction block receives all the feature maps in S3, the second input end receives all the feature maps in F5, the output end of the 3rd feature reconstruction block outputs 256 feature maps, the set formed by these 256 feature maps is denoted as F6, and each feature map in F6 has a width of W/4 and a height of H/4;
the first input end of the 3rd information reconstruction block receives all the feature maps in F6, the second input end receives all the feature maps in D4, the output end of the 3rd information reconstruction block outputs 512 feature maps, the set formed by these 512 feature maps is denoted as F7, and each feature map in F7 has a width of W/8 and a height of H/8;
the first input end of the 4th feature reconstruction block receives all the feature maps in S4, the second input end receives all the feature maps in F7, the output end of the 4th feature reconstruction block outputs 512 feature maps, the set formed by these 512 feature maps is denoted as F8, and each feature map in F8 has a width of W/8 and a height of H/8;
the first input end of the 4th information reconstruction block receives all the feature maps in F8, the second input end receives all the feature maps in D5, the output end of the 4th information reconstruction block outputs 512 feature maps, the set formed by these 512 feature maps is denoted as F9, and each feature map in F9 has a width of W/16 and a height of H/16;
the first input end of the 5th feature reconstruction block receives all the feature maps in S5, the second input end receives all the feature maps in F9, the output end of the 5th feature reconstruction block outputs 512 feature maps, the set formed by these 512 feature maps is denoted as F10, and each feature map in F10 has a width of W/16 and a height of H/16;
the input end of the 2nd information extraction block receives all the feature maps in S5, the output end of the 2nd information extraction block outputs 512 feature maps, the set formed by these 512 feature maps is denoted as F11, and each feature map in F11 has a width of W/16 and a height of H/16;
the input end of the 1st expansion convolution (dilated convolution) block receives all the feature maps in D1, the output end of the 1st expansion convolution block outputs 64 feature maps, the set formed by these 64 feature maps is denoted as P1, and each feature map in P1 has a width of W and a height of H; the input end of the 2nd expansion convolution block receives all the feature maps in D2, the output end of the 2nd expansion convolution block outputs 128 feature maps, the set formed by these 128 feature maps is denoted as P2, and each feature map in P2 has a width of W/2 and a height of H/2;
the input end of the 3rd expansion convolution block receives all the feature maps in D3, the output end of the 3rd expansion convolution block outputs 256 feature maps, the set formed by these 256 feature maps is denoted as P3, and each feature map in P3 has a width of W/4 and a height of H/4;
the input end of the 4th expansion convolution block receives all the feature maps in D4, the output end of the 4th expansion convolution block outputs 512 feature maps, the set formed by these 512 feature maps is denoted as P4, and each feature map in P4 has a width of W/8 and a height of H/8;
the input end of the 5th expansion convolution block receives all the feature maps in D5, the output end of the 5th expansion convolution block outputs 512 feature maps, the set formed by these 512 feature maps is denoted as P5, and each feature map in P5 has a width of W/16 and a height of H/16;
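The internal structure of the expansion convolution blocks is not specified in this excerpt; the sketch below only assumes a single 3 × 3 dilated convolution whose padding matches its dilation, so that each block keeps the channel count and spatial size of its input, as required for D1–D5 and P1–P5.

import torch.nn as nn

class DilatedConvBlock(nn.Module):
    # Hypothetical expansion (dilated) convolution block: the excerpt only fixes that the
    # k-th block keeps the channel count and spatial size of Dk when producing Pk, so a
    # single 3x3 dilated convolution with matching padding is assumed here.
    def __init__(self, channels, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

# Example: P3 = DilatedConvBlock(256)(D3_tensor)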
the first input end of the 1st feature aggregation block receives all the feature maps in F10, the second input end receives all the feature maps in P5, the third input end receives all the feature maps in F11, the output end of the 1st feature aggregation block outputs 256 feature maps, the set formed by these 256 feature maps is denoted as A1, and each feature map in A1 has a width of W/16 and a height of H/16;
the first input end of the 2nd feature aggregation block receives all the feature maps in F8, the second input end receives all the feature maps in P4, the third input end receives all the feature maps in A1, the output end of the 2nd feature aggregation block outputs 128 feature maps, the set formed by these 128 feature maps is denoted as A2, and each feature map in A2 has a width of W/8 and a height of H/8;
the first input end of the 3rd feature aggregation block receives all the feature maps in F6, the second input end receives all the feature maps in P3, the third input end receives all the feature maps in A2, the output end of the 3rd feature aggregation block outputs 64 feature maps, the set formed by these 64 feature maps is denoted as A3, and each feature map in A3 has a width of W/4 and a height of H/4;
the first input end of the 4th feature aggregation block receives all the feature maps in F4, the second input end receives all the feature maps in P2, the third input end receives all the feature maps in A3, the output end of the 4th feature aggregation block outputs 32 feature maps, the set formed by these 32 feature maps is denoted as A4, and each feature map in A4 has a width of W/2 and a height of H/2;
the first input end of the 5th feature aggregation block receives all the feature maps in F2, the second input end receives all the feature maps in P1, the third input end receives all the feature maps in A4, the output end of the 5th feature aggregation block outputs 16 feature maps, the set formed by these 16 feature maps is denoted as A5, and each feature map in A5 has a width of W and a height of H; the decoding part supplies all the feature maps in A5 to the output layer;
for the output layer, the input end of the output convolutional layer receives all the feature maps in A5, and the output end of the output convolutional layer outputs a feature map with the width W and the height H as a significance detection map;
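A sketch of the output layer is given below; the claim only fixes that the output convolutional layer maps the 16 feature maps in A5 to a single W × H significance detection map, so the 3 × 3 kernel and the sigmoid used here are assumptions.

import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    # Output convolutional layer: maps the 16 feature maps in A5 to a single W x H
    # significance detection map. The kernel size and the sigmoid are assumptions;
    # the claim only fixes the input (A5) and the single-map W x H output.
    def __init__(self, in_channels=16):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, a5):
        return torch.sigmoid(self.conv(a5))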
step 1_3: inputting the R channel component, the G channel component and the B channel component of the RGB image of every original 3D image in the training set, together with the three-channel depth map obtained by copying its depth image, into the convolutional neural network for training, so as to obtain the significance detection map corresponding to each pair of original 3D images; the significance detection map corresponding to the k-th pair of original 3D images is denoted accordingly, with its value at the coordinate position (x, y) being the pixel value of the pixel point at (x, y);
step 1_4: calculating the loss function value between the significance detection map corresponding to each pair of original 3D images and the corresponding label image, the loss function value between the significance detection map of the k-th pair of original 3D images and its label image being recorded as the k-th loss function value;
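The claim does not name the loss function; the sketch below assumes binary cross-entropy, a common choice for saliency detection, purely for illustration.

import torch
import torch.nn.functional as F

def saliency_loss(pred, label):
    # Loss between a predicted significance detection map and its label image.
    # The claim does not specify the loss; binary cross-entropy is assumed here,
    # with pred and label expected to lie in [0, 1].
    return F.binary_cross_entropy(pred, label)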
step 1_5: repeatedly executing step 1_3 and step 1_4 M times to obtain a convolutional neural network training model, thereby obtaining N × M loss function values; dividing the sum of the N loss function values obtained in each execution by N to obtain the final loss function value of that execution, giving M final loss function values in total; finding the smallest of the M final loss function values, and taking the weight vector and the bias term corresponding to this minimum final loss function value as the optimal weight vector and the optimal bias term of the convolutional neural network training model; wherein M is greater than 1;
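The selection rule of step 1_5 can be sketched as follows; model, optimizer, train_loader and loss_fn are hypothetical stand-ins for objects defined elsewhere, and the code is an illustrative sketch rather than the claimed procedure itself.

def train_and_select(model, optimizer, train_loader, loss_fn, M):
    # Steps 1_3 to 1_5 as a sketch: run M passes over the N training pairs, average the
    # N per-pair loss values of each pass, and keep the parameters (weight vectors and
    # bias terms) of the pass whose final loss function value is smallest.
    best_loss, best_state = float("inf"), None
    for _ in range(M):
        pass_losses = []
        for rgb, depth3, label in train_loader:
            pred = model(rgb, depth3)
            loss = loss_fn(pred, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            pass_losses.append(loss.item())
        final_loss = sum(pass_losses) / len(pass_losses)  # sum of N losses divided by N
        if final_loss < best_loss:
            best_loss = final_loss
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)  # optimal weight vector and optimal bias term
    return best_loss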
the test stage process comprises the following specific steps:
step 2_1: inputting the R channel component, the G channel component and the B channel component of the RGB image of the 3D image to be subjected to significance detection, together with the three-channel depth map obtained by copying its depth image, into the convolutional neural network training model, and predicting with the optimal weight vector and the optimal bias term to obtain the corresponding significance prediction image.
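A sketch of the test stage, assuming a trained model and tensors in N × C × H × W layout (the helper name is hypothetical):

import torch

def predict_saliency(model, rgb, depth):
    # Test stage (step 2_1) sketch: copy the single-channel depth image into three
    # channels, run the trained model with its selected best parameters, and return
    # the significance prediction image.
    model.eval()
    with torch.no_grad():
        depth3 = depth.repeat(1, 3, 1, 1) if depth.shape[1] == 1 else depth
        return model(rgb, depth3)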
2. The method for detecting a saliency image with interactive cyclic feature remodeling according to claim 1, wherein in the step 1_2, the 2 information extraction blocks have the same structure and each consist of a 1st convolution block, a first maximum pooling layer, a first average pooling layer, a 2nd convolution block, a 3rd convolution block and a first up-sampling layer; the 1st convolution block comprises a first convolution layer, a first active layer, a second convolution layer and a second active layer which are connected in sequence, the 2nd convolution block comprises a third convolution layer and a third active layer which are connected in sequence, and the 3rd convolution block comprises a fourth convolution layer and a fourth active layer which are connected in sequence; the input end of the first convolution layer in the 1st information extraction block receives all the feature maps in D1, and the input end of the first convolution layer in the 2nd information extraction block receives all the feature maps in S5; the input end of the first maximum pooling layer, the input end of the first average pooling layer and the input end of the third convolution layer all receive all the feature maps output by the output end of the second active layer; a channel number superposition operation is performed on all the feature maps output by the output end of the first maximum pooling layer and all the feature maps output by the output end of the first average pooling layer, the input end of the fourth convolution layer receives all the feature maps obtained after the channel number superposition operation, and the input end of the first up-sampling layer receives all the feature maps output by the output end of the fourth active layer; an element multiplication operation is performed on all the feature maps output by the output end of the first up-sampling layer and all the feature maps output by the output end of the third active layer, and an element addition operation is performed on all the feature maps output by the output end of the first up-sampling layer and all the feature maps obtained after the element multiplication operation; for the 1st information extraction block, the set formed by all the feature maps obtained after the element addition operation is F1, and for the 2nd information extraction block, the set formed by all the feature maps obtained after the element addition operation is F11; wherein the number of input channels of the i-th information extraction block is denoted as n_i, the number of input channels of the 1st information extraction block being n_1 = 64 and the number of input channels of the 2nd information extraction block being n_2 = 512; the first convolution layer and the fourth convolution layer in the i-th information extraction block have a convolution kernel size of 1 × 1, n_i convolution kernels, a step length of 1 and a zero padding parameter of 0; the second convolution layer in the i-th information extraction block has a convolution kernel size of 3 × 3, n_i convolution kernels, a step length of 1 and a zero padding parameter of 0; the third convolution layer in the i-th information extraction block has a convolution kernel size of 3 × 3, n_i convolution kernels, a step length of 1 and a zero padding parameter of 1, i = 1, 2; the activation mode of the first activation layer, the second activation layer, the third activation layer and the fourth activation layer is 'Relu'; the first maximum pooling layer and the first average pooling layer have a convolution kernel size of 2 × 2, a step length of 2 and a zero padding parameter of 0; and the magnification of the first up-sampling layer is 2, the interpolation method being bilinear interpolation.
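A PyTorch sketch of the information extraction block of claim 2 is given below; the wiring follows the claim, while the padding of the second convolution layer is changed from 0 to 1 here so that all feature-map sizes line up exactly.

import torch
import torch.nn as nn

class InformationExtractionBlock(nn.Module):
    # Sketch of the information extraction block of claim 2: a two-layer convolution block,
    # parallel 2x2 max and average pooling, a 3x3 convolution branch, a 1x1 convolution on
    # the channel-concatenated pooled maps, bilinear up-sampling, and fusion by element-wise
    # multiplication followed by element-wise addition. Padding of the second convolution
    # layer is set to 1 here (the claim uses 0) so that feature-map sizes match.
    def __init__(self, channels):
        super().__init__()
        self.conv_block1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.conv_block2 = nn.Sequential(  # third convolution layer + activation
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.conv_block3 = nn.Sequential(  # fourth convolution layer + activation
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(inplace=True))
        self.max_pool = nn.MaxPool2d(2, 2)
        self.avg_pool = nn.AvgPool2d(2, 2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        feat = self.conv_block1(x)
        branch = self.conv_block2(feat)
        pooled = torch.cat([self.max_pool(feat), self.avg_pool(feat)], dim=1)
        gate = self.up(self.conv_block3(pooled))
        return gate + gate * branch  # element multiplication, then element addition

# Example: F1 = InformationExtractionBlock(64)(D1_tensor); F11 = InformationExtractionBlock(512)(S5_tensor)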
3. The method for detecting a saliency image with interactive cyclic feature remodeling according to claim 1, wherein in the step 1_2, the 5 feature reconstruction blocks have the same structure and each consist of a context attention block and a channel attention block; for the g-th feature reconstruction block (g = 1, 2, 3, 4, 5), a first element addition operation is performed on all the feature maps received at its first input end and all the feature maps received at its second input end, the input end of the context attention block receives all the feature maps obtained after the first element addition operation, the input end of the channel attention block receives all the feature maps output by the output end of the context attention block, an element multiplication operation is performed on all the feature maps output by the output end of the channel attention block and all the feature maps obtained after the first element addition operation, and a second element addition operation is performed on all the feature maps received at the first input end and all the feature maps obtained after the element multiplication operation; the set formed by all the feature maps obtained after the second element addition operation is F2 for the 1st feature reconstruction block (first input S1, second input F1), F4 for the 2nd feature reconstruction block (first input S2, second input F3), F6 for the 3rd feature reconstruction block (first input S3, second input F5), F8 for the 4th feature reconstruction block (first input S4, second input F7) and F10 for the 5th feature reconstruction block (first input S5, second input F9).
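A sketch of the feature reconstruction block of claim 3; the context attention block and the channel attention block are not detailed in this excerpt, so simple stand-ins (a 3 × 3 convolution and a squeeze-and-excitation style gate) are assumed, and only the add–attend–multiply–add wiring follows the claim.

import torch.nn as nn

class FeatureReconstructionBlock(nn.Module):
    # Sketch of the feature reconstruction block of claim 3. The internals of the context
    # attention block and channel attention block are assumptions (a 3x3 convolution and an
    # SE-style gate); only the wiring (add, attend, multiply, add) follows the claim.
    def __init__(self, channels):
        super().__init__()
        self.context_attention = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.channel_attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, s, f):
        fused = s + f                                    # first element addition (S + F)
        attn = self.channel_attention(self.context_attention(fused))
        return s + fused * attn                          # element multiplication, second addition

# Example: F2 = FeatureReconstructionBlock(64)(S1_tensor, F1_tensor)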
4. The method for detecting a saliency image with interactive cyclic feature remodeling according to claim 1, wherein in the step 1_2, the 4 information reconstruction blocks have the same structure and each consist of a second maximum pooling layer, a second average pooling layer, a 4th convolution block and a 5th convolution block; the 4th convolution block comprises a fifth convolution layer and a fifth active layer which are connected in sequence, and the 5th convolution block comprises a sixth convolution layer, a sixth active layer, a seventh convolution layer and a seventh active layer which are connected in sequence; in the 1st information reconstruction block the input ends of the second maximum pooling layer and the second average pooling layer both receive all the feature maps in F2 and the input end of the sixth convolution layer receives all the feature maps in D2; in the 2nd information reconstruction block the input ends of the second maximum pooling layer and the second average pooling layer both receive all the feature maps in F4 and the input end of the sixth convolution layer receives all the feature maps in D3; in the 3rd information reconstruction block the input ends of the second maximum pooling layer and the second average pooling layer both receive all the feature maps in F6 and the input end of the sixth convolution layer receives all the feature maps in D4; in the 4th information reconstruction block the input ends of the second maximum pooling layer and the second average pooling layer both receive all the feature maps in F8 and the input end of the sixth convolution layer receives all the feature maps in D5; an element subtraction operation is performed on all the feature maps output by the output end of the second maximum pooling layer and all the feature maps output by the output end of the second average pooling layer, the input end of the fifth convolution layer receives all the feature maps obtained after the element subtraction operation, an element multiplication operation is performed on all the feature maps output by the output end of the fifth active layer and all the feature maps output by the output end of the seventh active layer, and an element addition operation is performed on all the feature maps output by the output end of the fifth active layer and all the feature maps obtained after the element multiplication operation; the set formed by all the feature maps obtained after the element addition operation is F3 for the 1st information reconstruction block, F5 for the 2nd information reconstruction block, F7 for the 3rd information reconstruction block and F9 for the 4th information reconstruction block; wherein the number of input channels of the first input end of the j-th information reconstruction block is denoted as n1_j and the number of input channels of its second input end as n2_j, with n1_1 = 64 and n2_1 = 128, n1_2 = 128 and n2_2 = 256, n1_3 = 256 and n2_3 = 512, and n1_4 = 512 and n2_4 = 512; the fifth convolution layer in the j-th information reconstruction block has a convolution kernel size of 1 × 1, n2_j convolution kernels, a step length of 1 and a zero padding parameter of 0; the sixth convolution layer has a convolution kernel size of 1 × 1, n2_j convolution kernels, a step length of 1 and a zero padding parameter of 0; the seventh convolution layer has a convolution kernel size of 3 × 3, n2_j convolution kernels, a step length of 1 and a zero padding parameter of 1, j = 1, 2, 3, 4; the activation mode of the fifth activation layer, the sixth activation layer and the seventh activation layer is 'Relu'; the second maximum pooling layer and the second average pooling layer have a convolution kernel size of 2 × 2, a step length of 2 and a zero padding parameter of 0; and in the element subtraction operation, the corresponding elements of the feature maps output by the output end of the second average pooling layer are subtracted from the elements of the feature maps output by the output end of the second maximum pooling layer.
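A sketch of the information reconstruction block of claim 4, with the 1 × 1 and 3 × 3 convolution layers, the paired 2 × 2 poolings, and the element subtraction, multiplication and addition as described; padding values are kept shape-consistent.

import torch.nn as nn

class InformationReconstructionBlock(nn.Module):
    # Sketch of the information reconstruction block of claim 4: the decoder feature F is
    # max-pooled and average-pooled (stride 2), the average-pooled maps are subtracted from
    # the max-pooled maps, the difference is projected by a 1x1 convolution to the channel
    # width of the deeper depth feature D, and the result gates a two-layer convolution of D
    # by element-wise multiplication followed by element-wise addition.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.max_pool = nn.MaxPool2d(2, 2)
        self.avg_pool = nn.AvgPool2d(2, 2)
        self.conv_block4 = nn.Sequential(  # fifth convolution layer (1x1) + activation
            nn.Conv2d(in_channels, out_channels, 1), nn.ReLU(inplace=True))
        self.conv_block5 = nn.Sequential(  # sixth (1x1) and seventh (3x3) layers + activations
            nn.Conv2d(out_channels, out_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f, d):
        diff = self.max_pool(f) - self.avg_pool(f)   # element subtraction of the two poolings
        gate = self.conv_block4(diff)
        resp = self.conv_block5(d)
        return gate + gate * resp                    # element multiplication, then addition

# Example: F3 = InformationReconstructionBlock(64, 128)(F2_tensor, D2_tensor)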
5. The method for detecting a saliency image with interactive cyclic feature remodeling according to claim 1, wherein in the step 1_2, the 5 feature aggregation blocks have the same structure and each consist of a 6th convolution block, a 7th convolution block, an 8th convolution block, a 9th convolution block, a 10th convolution block, an 11th convolution block, a 12th convolution block, a 13th convolution block, a second up-sampling layer and a residual fusion block; the 6th to 13th convolution blocks respectively comprise an eighth convolution layer and an eighth active layer, a ninth convolution layer and a ninth active layer, a tenth convolution layer and a tenth active layer, an eleventh convolution layer and an eleventh active layer, a twelfth convolution layer and a twelfth active layer, a thirteenth convolution layer and a thirteenth active layer, a fourteenth convolution layer and a fourteenth active layer, and a fifteenth convolution layer and a fifteenth active layer, each pair being connected in sequence, and the residual fusion block comprises a sixteenth active layer, a third maximum pooling layer and a sixteenth convolution layer which are connected in sequence; in the 1st feature aggregation block the input end of the eighth convolution layer receives all the feature maps in F10, the input end of the ninth convolution layer receives all the feature maps in P5 and the input end of the second up-sampling layer receives all the feature maps in F11; in the 2nd feature aggregation block the corresponding input ends receive all the feature maps in F8, P4 and A1 respectively; in the 3rd feature aggregation block they receive all the feature maps in F6, P3 and A2 respectively; in the 4th feature aggregation block they receive all the feature maps in F4, P2 and A3 respectively; and in the 5th feature aggregation block they receive all the feature maps in F2, P1 and A4 respectively; all the feature maps output by the output end of the eighth active layer and all the feature maps output by the output end of the ninth active layer are each cut into four equal parts along the channel dimension in order; a first channel number superposition operation is performed on the 1st part of the eighth active layer output and the 1st part of the ninth active layer output, a second channel number superposition operation on the two 2nd parts, a third channel number superposition operation on the two 3rd parts, and a fourth channel number superposition operation on the two 4th parts; the input end of the tenth convolution layer receives all the feature maps output by the output end of the second up-sampling layer, and the input ends of the eleventh, twelfth, thirteenth and fourteenth convolution layers receive all the feature maps obtained after the first, second, third and fourth channel number superposition operations respectively; a fifth channel number superposition operation is performed on all the feature maps output by the output ends of the eleventh, twelfth, thirteenth and fourteenth active layers, and the input end of the fifteenth convolution layer receives all the feature maps obtained after the fifth channel number superposition operation; an element multiplication operation is performed on all the feature maps output by the output end of the tenth active layer and all the feature maps output by the output end of the fifteenth active layer, and a first element addition operation is performed on all the feature maps output by the output end of the tenth active layer and all the feature maps obtained after the element multiplication operation; the input end of the sixteenth active layer receives all the feature maps obtained after the first element addition operation, and a second element addition operation is performed on all the feature maps output by the output end of the sixteenth convolution layer and all the feature maps obtained after the first element addition operation; the set formed by all the feature maps obtained after the second element addition operation is A1 for the 1st feature aggregation block, A2 for the 2nd feature aggregation block, A3 for the 3rd feature aggregation block, A4 for the 4th feature aggregation block and A5 for the 5th feature aggregation block; wherein the numbers of input channels of the first, second and third input ends of the m-th feature aggregation block are denoted as n1_m, n2_m and n3_m respectively, with (n1_1, n2_1, n3_1) = (512, 512, 512), (n1_2, n2_2, n3_2) = (512, 512, 256), (n1_3, n2_3, n3_3) = (256, 256, 128), (n1_4, n2_4, n3_4) = (128, 128, 64) and (n1_5, n2_5, n3_5) = (64, 64, 32); the eighth through fourteenth convolution layers in the m-th feature aggregation block each have a convolution kernel size of 3 × 3, n3_m convolution kernels, a step length of 1 and a zero padding parameter of 1; the fifteenth convolution layer has a convolution kernel size of 3 × 3, n3_m convolution kernels, a step length of 1 and a zero padding parameter of 0; the sixteenth convolution layer has a convolution kernel size of 3 × 3, n3_m/2 convolution kernels, a step length of 1 and a zero padding parameter of 0, m = 1, 2, 3, 4, 5; the activation mode of the eighth through sixteenth activation layers is 'Relu'; the third maximum pooling layer has a convolution kernel size of 5 × 5, a step length of 1 and a zero padding parameter of 2; and the magnification of the second up-sampling layer is 2, the interpolation method being bilinear interpolation.
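A simplified sketch of the feature aggregation block of claim 5; the channel widths and paddings are adjusted so that all tensor shapes stay consistent (the claim's exact kernel counts are not reproduced), while the four-way channel split, group-wise concatenation, gating by the up-sampled third input and residual fusion follow the claim.

import torch
import torch.nn as nn

class FeatureAggregationBlock(nn.Module):
    # Sketch of the feature aggregation block of claim 5: the reconstructed feature F and the
    # dilated depth feature P are projected, split into four channel groups, cross-concatenated
    # group by group, convolved and re-fused; the up-sampled deeper aggregation result gates the
    # fused maps by element-wise multiplication and addition, and a residual fusion block
    # (activation, 5x5 max pooling with stride 1, convolution) refines the result.
    # Channel widths and paddings are simplified assumptions, not the claim's exact values.
    def __init__(self, in_f, in_p, in_a, mid):
        super().__init__()
        out = mid // 2                       # the m-th block outputs n3_m / 2 feature maps
        self.conv_f = nn.Sequential(nn.Conv2d(in_f, mid, 3, padding=1), nn.ReLU(inplace=True))
        self.conv_p = nn.Sequential(nn.Conv2d(in_p, mid, 3, padding=1), nn.ReLU(inplace=True))
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv_a = nn.Sequential(nn.Conv2d(in_a, out, 3, padding=1), nn.ReLU(inplace=True))
        self.group_convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(mid // 2, out, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(4)])
        self.fuse = nn.Sequential(nn.Conv2d(4 * out, out, 3, padding=1), nn.ReLU(inplace=True))
        self.residual = nn.Sequential(
            nn.ReLU(), nn.MaxPool2d(5, stride=1, padding=2),
            nn.Conv2d(out, out, 3, padding=1))

    def forward(self, f, p, a_prev):
        xf, xp = self.conv_f(f), self.conv_p(p)
        f_parts, p_parts = torch.chunk(xf, 4, dim=1), torch.chunk(xp, 4, dim=1)
        groups = [conv(torch.cat([f_parts[k], p_parts[k]], dim=1))
                  for k, conv in enumerate(self.group_convs)]
        fused = self.fuse(torch.cat(groups, dim=1))
        gate = self.conv_a(self.up(a_prev))          # up-sampled third input
        mixed = gate + gate * fused                  # element multiplication, first addition
        return mixed + self.residual(mixed)          # residual fusion, second addition

# Example (2nd block): A2 = FeatureAggregationBlock(512, 512, 256, 256)(F8_t, P4_t, A1_t)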
CN202011413838.5A 2020-12-07 2020-12-07 Significance image detection method for interactive cycle characteristic remodeling Withdrawn CN112529862A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011413838.5A CN112529862A (en) 2020-12-07 2020-12-07 Significance image detection method for interactive cycle characteristic remodeling

Publications (1)

Publication Number Publication Date
CN112529862A true CN112529862A (en) 2021-03-19

Family

ID=74997830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011413838.5A Withdrawn CN112529862A (en) 2020-12-07 2020-12-07 Significance image detection method for interactive cycle characteristic remodeling

Country Status (1)

Country Link
CN (1) CN112529862A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192073A (en) * 2021-04-06 2021-07-30 浙江科技学院 Clothing semantic segmentation method based on cross fusion network
CN113538442A (en) * 2021-06-04 2021-10-22 杭州电子科技大学 RGB-D significant target detection method using adaptive feature fusion
CN113538442B (en) * 2021-06-04 2024-04-09 杭州电子科技大学 RGB-D significant target detection method using self-adaptive feature fusion
CN113313077A (en) * 2021-06-30 2021-08-27 浙江科技学院 Salient object detection method based on multi-strategy and cross feature fusion

Similar Documents

Publication Publication Date Title
CN110782462B (en) Semantic segmentation method based on double-flow feature fusion
Zhang et al. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning
Chen et al. Banet: Bidirectional aggregation network with occlusion handling for panoptic segmentation
CN112529862A (en) Significance image detection method for interactive cycle characteristic remodeling
CN110490082B (en) Road scene semantic segmentation method capable of effectively fusing neural network features
CN112597985B (en) Crowd counting method based on multi-scale feature fusion
Zeng et al. LEARD-Net: Semantic segmentation for large-scale point cloud scene
CN110246148B (en) Multi-modal significance detection method for depth information fusion and attention learning
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN110458178B (en) Multi-mode multi-spliced RGB-D significance target detection method
CN113192073A (en) Clothing semantic segmentation method based on cross fusion network
CN112801068B (en) Video multi-target tracking and segmenting system and method
CN112801029B (en) Attention mechanism-based multitask learning method
Ha et al. Deep neural networks using residual fast-slow refined highway and global atomic spatial attention for action recognition and detection
CN114037056A (en) Method and device for generating neural network, computer equipment and storage medium
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN112836602A (en) Behavior recognition method, device, equipment and medium based on space-time feature fusion
CN112700426A (en) Method for detecting salient object in complex environment
Zhang et al. LDD-Net: Lightweight printed circuit board defect detection network fusing multi-scale features
Yang et al. Xception-based general forensic method on small-size images
Zhu et al. MDAFormer: Multi-level difference aggregation transformer for change detection of VHR optical imagery
Park et al. Pyramid attention upsampling module for object detection
CN113313077A (en) Salient object detection method based on multi-strategy and cross feature fusion
CN112348011B (en) Vehicle damage assessment method and device and storage medium
Zheng et al. Transformer-based hierarchical dynamic decoders for salient object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210319