CN112529862A - Significance image detection method for interactive cycle characteristic remodeling - Google Patents
Significance image detection method for interactive cycle characteristic remodeling
- Publication number
- CN112529862A CN112529862A CN202011413838.5A CN202011413838A CN112529862A CN 112529862 A CN112529862 A CN 112529862A CN 202011413838 A CN202011413838 A CN 202011413838A CN 112529862 A CN112529862 A CN 112529862A
- Authority
- CN
- China
- Prior art keywords
- feature maps
- block
- feature
- layer
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4007—Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/20—Image enhancement or restoration using local operators
- G06T5/30—Erosion or dilatation, e.g. thinning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Abstract
The invention discloses a significance image detection method with interactive cyclic feature remodeling. In the training stage a convolutional neural network is constructed which comprises an input layer, an encoding part, a decoding part and an output layer, wherein the encoding part comprises neural network blocks and the decoding part comprises information extraction blocks, feature remodeling blocks, information remodeling blocks, expansion (dilated) convolution blocks and feature aggregation blocks; the three channels of the RGB image of each 3D image and the three-channel depth map obtained by processing the corresponding depth image are input into the convolutional neural network for training to obtain a significance detection map; an optimal weight vector and an optimal bias term are obtained by calculating the loss function value between the significance detection map and the label image; in the testing stage the three channels of the RGB image of the 3D image to be detected and the three-channel depth map corresponding to its depth image are input into the convolutional neural network training model, and a significance prediction image is obtained by predicting with the optimal weight vector and the optimal bias term. The method has the advantages of a clear significance detection result and high detection precision.
Description
Technical Field
The invention relates to deep-learning-based significance image detection technology, and in particular to a significance image detection method with interactive cyclic feature remodeling.
Background
With the rapid development of artificial intelligence in the field of computing, saliency detection in images has become a research area of growing interest. Salient object detection (SOD) aims at distinguishing the visually most distinctive objects in an input image. Over the last decades, hundreds of methods have been developed for this task, which serves as an effective pre-processing step in many image processing and computer vision tasks such as object segmentation and tracking, video compression, image editing and texture smoothing. Recent work learns deep features of salient objects using convolutional neural networks (CNN); these models adopt an encoding-decoding structure with a simple architecture and high computational efficiency. In such a structure, the encoder usually extracts features of different semantic levels and resolutions with a pre-trained classification model (such as ResNet or VGG), and the decoder combines the extracted features to generate a saliency map. Existing saliency detection methods based on encoder-decoder convolutional neural networks are quite effective, but their accuracy remains challenging. Features of different semantic levels and resolutions have different distribution characteristics: high-level features carry rich semantic information but lack accurate position information, while low-level features carry rich detail but are full of background noise, so methods that simply fuse high-level and low-level features still do not reach ideal detection accuracy. For features of different modalities, cluttered background information is present in both the RGB information and the depth information, and further research is still needed to effectively distinguish the background from the foreground and generate better saliency images.
Disclosure of Invention
The invention aims to solve the technical problem of providing a significance image detection method for interactive cycle characteristic remodeling, which has clear significance detection result and high detection precision.
The technical scheme adopted by the invention for solving the technical problems is as follows: a method for detecting a significance image of interactive cycle feature remodeling is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting N pairs of original 3D images and the label image corresponding to each pair of original 3D images, recording the RGB image of the k-th pair of original 3D images as {I_k^RGB(x,y)}, recording the depth image of the k-th pair of original 3D images as {I_k^D(x,y)}, and taking the real salient detection image corresponding to the k-th pair of original 3D images as the label image, recorded as {G_k(x,y)}; then forming a training set by the RGB images, the depth images and the corresponding label images of all the original 3D images; wherein N is a positive integer, N is not less than 200, k is a positive integer, k is not less than 1 and not more than N, x is not less than 1 and not more than W, y is not less than 1 and not more than H, W represents the width of the original 3D image and of its RGB image, depth image and corresponding label image, H represents the height of the original 3D image and of its RGB image, depth image and corresponding label image, I_k^RGB(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {I_k^RGB(x,y)}, I_k^D(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {I_k^D(x,y)}, and G_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {G_k(x,y)};
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, an encoding part, a decoding part and an output layer, wherein the input layer comprises an RGB (red, green and blue) image input layer and a depth image input layer, the encoding part comprises 10 neural network blocks, and the decoding part comprises 2 information extraction blocks, 5 characteristic reconstruction blocks, 4 information reconstruction blocks, 5 expansion convolution blocks and 5 characteristic aggregation blocks; the output layer comprises an output convolution layer, the size of convolution kernels of the output convolution layer is 3 multiplied by 3, the number of the convolution kernels is 1, and the step length is 1;
for an RGB image input layer in the input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original RGB image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original RGB image to the encoding part; wherein, the width of the original RGB image is W, and the height of the original RGB image is H;
for a depth map input layer in the input layer, the input end of the input layer receives a three-channel depth map processed by an original depth image by adopting a copying method, and the output end of the input layer outputs the three-channel depth map to a coding part; wherein the width of the original depth image is W, and the height of the original depth image is H;
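For illustration, the copying method above amounts to replicating the single depth channel three times so that the depth map has the same shape as the RGB input. A minimal sketch (PyTorch is assumed here; the patent does not prescribe a framework):

```python
import torch

def depth_to_three_channels(depth: torch.Tensor) -> torch.Tensor:
    """Replicate a single-channel depth image (H x W or 1 x H x W) into a
    three-channel depth map (3 x H x W) by copying the channel three times."""
    if depth.dim() == 2:            # H x W  ->  1 x H x W
        depth = depth.unsqueeze(0)
    return depth.repeat(3, 1, 1)    # 1 x H x W  ->  3 x H x W
```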
for the coding part, the 1 st neural network block, the 2 nd neural network block, the 3 rd neural network block, the 4 th neural network block and the 5 th neural network block are sequentially connected to form a color coding stream, and the 6 th neural network block, the 7 th neural network block, the 8 th neural network block, the 9 th neural network block and the 10 th neural network block are sequentially connected to form a depth coding stream; the input end of the 1 st neural network block receives the R channel component, the G channel component and the B channel component of the original RGB image output by the output end of the RGB image input layer, the output end of the 1 st neural network block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as S1, and each feature map in S1 has a width of W and a height of H; the input end of the 2 nd neural network block receives all the feature maps in S1, the output end of the 2 nd neural network block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as S2, and each feature map in S2 has a width of W/2 and a height of H/2; the input end of the 3 rd neural network block receives all the feature maps in S2, the output end of the 3 rd neural network block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as S3, and each feature map in S3 has a width of W/4 and a height of H/4; the input end of the 4 th neural network block receives all the feature maps in S3, the output end of the 4 th neural network block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as S4, and each feature map in S4 has a width of W/8 and a height of H/8; the input end of the 5 th neural network block receives all the feature maps in S4, the output end of the 5 th neural network block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as S5, and each feature map in S5 has a width of W/16 and a height of H/16; the input end of the 6 th neural network block receives the three-channel depth map output by the output end of the depth map input layer, the output end of the 6 th neural network block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as D1, and each feature map in D1 has a width of W and a height of H; the input end of the 7 th neural network block receives all the feature maps in D1, the output end of the 7 th neural network block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as D2, and each feature map in D2 has a width of W/2 and a height of H/2; the input end of the 8 th neural network block receives all the feature maps in D2, the output end of the 8 th neural network block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as D3, and each feature map in D3 has a width of W/4 and a height of H/4; the input end of the 9 th neural network block receives all the feature maps in D3, the output end of the 9 th neural network block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as D4, and each feature map in D4 has a width of W/8 and a height of H/8; the input end of the 10 th neural network block receives all the feature maps in D4, the output end of the 10 th neural network block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as D5, and each feature map in D5 has a width of W/16 and a height of H/16; the encoding part provides all the feature maps of S1, S2, S3, S4, S5, D1, D2, D3, D4 and D5 to the decoding part;
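The channel counts and resolutions of S1-S5 and D1-D5 (64, 128, 256, 512 and 512 maps at scales 1, 1/2, 1/4, 1/8 and 1/16 of the input) match the five convolutional stages of a VGG-16 backbone, one copy per modality. The following sketch of such a dual-stream encoder is illustrative only; the use of torchvision's VGG-16 is an assumption, as the patent only fixes the shapes of the stage outputs:

```python
import torch.nn as nn
from torchvision.models import vgg16

class DualStreamEncoder(nn.Module):
    """Color coding stream (blocks 1-5) and depth coding stream (blocks 6-10).
    Stage i returns the feature sets S_i (color) and D_i (depth)."""
    def __init__(self):
        super().__init__()
        cuts = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]   # VGG-16 stages
        rgb_layers = list(vgg16().features.children())
        dep_layers = list(vgg16().features.children())
        self.rgb_blocks = nn.ModuleList(nn.Sequential(*rgb_layers[a:b]) for a, b in cuts)
        self.dep_blocks = nn.ModuleList(nn.Sequential(*dep_layers[a:b]) for a, b in cuts)

    def forward(self, rgb, depth3):
        s_maps, d_maps = [], []              # [S1..S5], [D1..D5]
        x, y = rgb, depth3
        for rb, db in zip(self.rgb_blocks, self.dep_blocks):
            x, y = rb(x), db(y)
            s_maps.append(x)
            d_maps.append(y)
        return s_maps, d_maps
```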
for the decoding part, the input end of the 1 st information extraction block receives all the feature maps in D1, the output end of the 1 st information extraction block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as F1, and each feature map in F1 has a width of W and a height of H; the first input end of the 1 st feature reconstruction block receives all the feature maps in S1, the second input end of the 1 st feature reconstruction block receives all the feature maps in F1, the output end of the 1 st feature reconstruction block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as F2, and each feature map in F2 has a width of W and a height of H; the first input end of the 1 st information reconstruction block receives all the feature maps in F2, the second input end of the 1 st information reconstruction block receives all the feature maps in D2, the output end of the 1 st information reconstruction block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as F3, and each feature map in F3 has a width of W/2 and a height of H/2; the first input end of the 2 nd feature reconstruction block receives all the feature maps in S2, the second input end of the 2 nd feature reconstruction block receives all the feature maps in F3, the output end of the 2 nd feature reconstruction block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as F4, and each feature map in F4 has a width of W/2 and a height of H/2; the first input end of the 2 nd information reconstruction block receives all the feature maps in F4, the second input end of the 2 nd information reconstruction block receives all the feature maps in D3, the output end of the 2 nd information reconstruction block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as F5, and each feature map in F5 has a width of W/4 and a height of H/4; the first input end of the 3 rd feature reconstruction block receives all the feature maps in S3, the second input end of the 3 rd feature reconstruction block receives all the feature maps in F5, the output end of the 3 rd feature reconstruction block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as F6, and each feature map in F6 has a width of W/4 and a height of H/4; the first input end of the 3 rd information reconstruction block receives all the feature maps in F6, the second input end of the 3 rd information reconstruction block receives all the feature maps in D4, the output end of the 3 rd information reconstruction block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as F7, and each feature map in F7 has a width of W/8 and a height of H/8; the first input end of the 4 th feature reconstruction block receives all the feature maps in S4, the second input end of the 4 th feature reconstruction block receives all the feature maps in F7, the output end of the 4 th feature reconstruction block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as F8, and each feature map in F8 has a width of W/8 and a height of H/8; the first input end of the 4 th information reconstruction block receives all the feature maps in F8, the second input end of the 4 th information reconstruction block receives all the feature maps in D5, the output end of the 4 th information reconstruction block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as F9, and each feature map in F9 has a width of W/16 and a height of H/16; the first input end of the 5 th feature reconstruction block receives all the feature maps in S5, the second input end of the 5 th feature reconstruction block receives all the feature maps in F9, the output end of the 5 th feature reconstruction block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as F10, and each feature map in F10 has a width of W/16 and a height of H/16; the input end of the 2 nd information extraction block receives all the feature maps in S5, the output end of the 2 nd information extraction block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as F11, and each feature map in F11 has a width of W/16 and a height of H/16; the input end of the 1 st expansion convolution block receives all the feature maps in D1, the output end of the 1 st expansion convolution block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as P1, and each feature map in P1 has a width of W and a height of H; the input end of the 2 nd expansion convolution block receives all the feature maps in D2, the output end of the 2 nd expansion convolution block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as P2, and each feature map in P2 has a width of W/2 and a height of H/2; the input end of the 3 rd expansion convolution block receives all the feature maps in D3, the output end of the 3 rd expansion convolution block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as P3, and each feature map in P3 has a width of W/4 and a height of H/4; the input end of the 4 th expansion convolution block receives all the feature maps in D4, the output end of the 4 th expansion convolution block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as P4, and each feature map in P4 has a width of W/8 and a height of H/8; the input end of the 5 th expansion convolution block receives all the feature maps in D5, the output end of the 5 th expansion convolution block outputs 512 feature maps, the set formed by the 512 feature maps is denoted as P5, and each feature map in P5 has a width of W/16 and a height of H/16; the first input end of the 1 st feature aggregation block receives all the feature maps in F10, the second input end of the 1 st feature aggregation block receives all the feature maps in P5, the third input end of the 1 st feature aggregation block receives all the feature maps in F11, the output end of the 1 st feature aggregation block outputs 256 feature maps, the set formed by the 256 feature maps is denoted as A1, and each feature map in A1 has a width of W/16 and a height of H/16; the first input end of the 2 nd feature aggregation block receives all the feature maps in F8, the second input end of the 2 nd feature aggregation block receives all the feature maps in P4, the third input end of the 2 nd feature aggregation block receives all the feature maps in A1, the output end of the 2 nd feature aggregation block outputs 128 feature maps, the set formed by the 128 feature maps is denoted as A2, and each feature map in A2 has a width of W/8 and a height of H/8; the first input end of the 3 rd feature aggregation block receives all the feature maps in F6, the second input end of the 3 rd feature aggregation block receives all the feature maps in P3, the third input end of the 3 rd feature aggregation block receives all the feature maps in A2, the output end of the 3 rd feature aggregation block outputs 64 feature maps, the set formed by the 64 feature maps is denoted as A3, and each feature map in A3 has a width of W/4 and a height of H/4; the first input end of the 4 th feature aggregation block receives all the feature maps in F4, the second input end of the 4 th feature aggregation block receives all the feature maps in P2, the third input end of the 4 th feature aggregation block receives all the feature maps in A3, the output end of the 4 th feature aggregation block outputs 32 feature maps, the set formed by the 32 feature maps is denoted as A4, and each feature map in A4 has a width of W/2 and a height of H/2; the first input end of the 5 th feature aggregation block receives all the feature maps in F2, the second input end of the 5 th feature aggregation block receives all the feature maps in P1, the third input end of the 5 th feature aggregation block receives all the feature maps in A4, the output end of the 5 th feature aggregation block outputs 16 feature maps, the set formed by the 16 feature maps is denoted as A5, and each feature map in A5 has a width of W and a height of H; the decoding part supplies all the feature maps in A5 to the output layer;
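The decoding connections described above can be summarised as an alternating chain of feature remodeling (reconstruction) and information remodeling (reconstruction) blocks, a parallel set of expansion (dilated) convolution blocks, and a coarse-to-fine chain of feature aggregation blocks. A structural sketch of the wiring only, in which IE, FR, IR, DC and FA stand for the information extraction, feature remodeling, information remodeling, expansion convolution and feature aggregation sub-modules (assumed to be implemented separately, as detailed later in the description):

```python
def decode(blocks, s, d):
    """Wiring of the decoding part.  s = [S1..S5], d = [D1..D5]; `blocks`
    maps names to the sub-modules IE1-IE2, FR1-FR5, IR1-IR4, DC1-DC5, FA1-FA5."""
    F1  = blocks["IE1"](d[0])            # from D1
    F2  = blocks["FR1"](s[0], F1)        # from S1, F1
    F3  = blocks["IR1"](F2, d[1])        # from F2, D2
    F4  = blocks["FR2"](s[1], F3)
    F5  = blocks["IR2"](F4, d[2])
    F6  = blocks["FR3"](s[2], F5)
    F7  = blocks["IR3"](F6, d[3])
    F8  = blocks["FR4"](s[3], F7)
    F9  = blocks["IR4"](F8, d[4])
    F10 = blocks["FR5"](s[4], F9)
    F11 = blocks["IE2"](s[4])            # from S5
    P   = [blocks[f"DC{i}"](d[i - 1]) for i in range(1, 6)]   # P1..P5 from D1..D5
    A1  = blocks["FA1"](F10, P[4], F11)
    A2  = blocks["FA2"](F8,  P[3], A1)
    A3  = blocks["FA3"](F6,  P[2], A2)
    A4  = blocks["FA4"](F4,  P[1], A3)
    A5  = blocks["FA5"](F2,  P[0], A4)
    return A5                            # handed to the output layer
```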
for the output layer, the input end of the output convolutional layer receives all the feature maps in A5, and the output end of the output convolutional layer outputs a feature map with the width W and the height H as a significance detection map;
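Under these hyper-parameters the output layer reduces to a single convolution; a zero padding of 1, which is not stated in the text, is assumed below so that the output keeps the width W and height H:

```python
import torch.nn as nn

# Output layer: one 3 x 3 convolution (stride 1) mapping the 16 feature maps
# in A5 to a single W x H significance detection map.  padding=1 is assumed.
output_conv = nn.Conv2d(in_channels=16, out_channels=1, kernel_size=3, stride=1, padding=1)
```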
step 1_ 3: inputting the R channel component, G channel component and B channel component of the RGB image of every original 3D image in the training set, together with the three-channel depth map obtained by copying the corresponding depth image, into the convolutional neural network for training to obtain the significance detection map corresponding to each pair of original 3D images, and recording the significance detection map corresponding to the k-th pair of original 3D images as {S_k(x,y)}; wherein S_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {S_k(x,y)};
step 1_ 4: calculating the loss function value between the significance detection map corresponding to each pair of original 3D images and the corresponding label image, and recording the loss function value between {S_k(x,y)} and {G_k(x,y)} as L_k;
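The patent does not name the loss function used here; a per-pixel binary cross-entropy between the significance detection map and the label image is a common choice and is assumed in this sketch:

```python
import torch
import torch.nn.functional as F

def saliency_loss(pred_logits: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Loss L_k between one predicted significance detection map {S_k(x,y)}
    and its label image {G_k(x,y)}, both of size H x W with values in [0, 1]."""
    return F.binary_cross_entropy_with_logits(pred_logits, label)
```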
Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for M times to obtain a convolutional neural network training model, and obtaining N multiplied by M loss function values; dividing the sum of the N loss function values obtained by each execution by N to obtain final loss function values obtained by the execution, and obtaining M final loss function values in total; finding out the final loss function value with the minimum value from the M final loss function values, and correspondingly taking the weight vector and the bias item corresponding to the minimum final loss function value as the optimal weight vector and the optimal bias item of the convolutional neural network training model; wherein M is greater than 1;
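Steps 1_3 to 1_5 amount to an ordinary training loop over M passes of the N training pairs, keeping the weight vector and bias terms of the pass with the smallest final (averaged) loss. A sketch reusing the saliency_loss function above; the optimizer, learning rate and one-pair-per-batch loading are assumptions:

```python
import copy
import torch

def train(model, loader, m_passes: int, device: str = "cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed optimizer
    best_loss, best_state = float("inf"), None
    for _ in range(m_passes):                       # step 1_5: M repetitions
        total = 0.0
        for rgb, depth3, label in loader:           # step 1_3: the N training pairs
            rgb, depth3, label = rgb.to(device), depth3.to(device), label.to(device)
            pred = model(rgb, depth3)
            loss = saliency_loss(pred, label)       # step 1_4: loss per pair
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        final_loss = total / len(loader)            # sum of the N losses divided by N
        if final_loss < best_loss:                  # keep the optimal weights and biases
            best_loss = final_loss
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```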
the test stage process comprises the following specific steps:
step 2_ 1: and inputting a three-channel depth map obtained by copying R channel components, G channel components, B channel components and depth images of the RGB images of the 3D images to be subjected to significance detection into a convolutional neural network training model, predicting by using the optimal weight vector and the optimal bias term, and predicting to obtain a corresponding significance prediction image.
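The test stage in the same notation; the sigmoid that squashes the network output into a [0, 1] significance prediction image is an assumption:

```python
import torch

@torch.no_grad()
def predict(model, rgb: torch.Tensor, depth3: torch.Tensor) -> torch.Tensor:
    """Step 2_1: run the trained model (loaded with the optimal weight vector
    and bias terms) on one RGB image and its three-channel depth map and
    return the predicted significance image of size H x W."""
    model.eval()
    logits = model(rgb.unsqueeze(0), depth3.unsqueeze(0))   # add batch dimension
    return torch.sigmoid(logits).squeeze()
```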
In step 1_2, the 2 information extraction blocks have the same structure and are composed of a1 st convolution block, a first maximum pooling layer, a first average pooling layer, a2 nd convolution block, a3 rd convolution block and a first up-sampling layer, wherein the 1 st convolution block comprises a first convolution layer, a first convolution layer and a first up-sampling layer which are sequentially connectedThe active layer, the second convolution layer and the second active layer, the 2 nd convolution block comprises a third convolution layer and a third active layer which are connected in sequence, the 3 rd convolution block comprises a fourth convolution layer and a fourth active layer which are connected in sequence, the input end of the first convolution layer in the 1 st information extraction block receives all feature maps in D1, the input end of the first convolution layer in the 2 nd information extraction block receives all feature maps in S5, the input end of the first maximum pooling layer, the input end of the first average pooling layer and the input end of the third convolution layer all receive all feature maps output by the output end of the fourth active layer, the channel number superposition operation is carried out on all feature maps output by the output end of the first maximum pooling layer and all feature maps output by the output end of the first average pooling layer, the input end of the fourth convolution layer receives all feature maps obtained after the channel number superposition operation, the input end of the first up-sampling layer receives all the feature maps output by the output end of the fourth active layer, element multiplication operation is carried out on all the feature maps output by the output end of the first up-sampling layer and all the feature maps output by the output end of the third active layer, element addition operation is carried out on all the feature maps output by the output end of the first up-sampling layer and all the feature maps obtained after the element multiplication operation, for the 1 st information extraction block, the set formed by all the feature maps obtained after the element addition operation is F1, and for the 2 nd information extraction block, the set formed by all the feature maps obtained after the element addition operation is F11; wherein, the number of input channels of the ith information extraction block is set as niThen the number n of input channels of the 1 st information extraction block164, the number n of input channels of the 2 nd information extraction block2512, the convolution kernel size of the first convolution layer and the fourth convolution layer in the ith information extraction block is 1 × 1, and the number of convolution kernels is niThe step length is 1, the value of the zero padding parameter is 0, the size of the convolution kernel of the second convolution layer in the ith information extraction block is 3 multiplied by 3, the number of the convolution kernels is niThe step length is 1, the value of the zero padding parameter is 0, the size of the convolution kernel of the third convolution layer in the ith information extraction block is 3 multiplied by 3, the number of the convolution kernels is niThe step length is 1, the value of the zero-filling parameter is 1, i is 1,2, and the first active layer, the second active layer, the third active layer and the fourth active layer are arranged in sequenceThe activation mode of the first upsampling layer is 'Relu', the 
convolution kernel size of the first maximum pooling layer and the first average pooling layer is 2 multiplied by 2, the step length is 2, the value of the zero padding parameter is 0, the amplification factor of the first up-sampling layer is 2, and the interpolation method is bilinear interpolation.
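A hedged sketch of the information extraction block described above. Where the text is ambiguous, the sketch assumes that the two pooling layers and the third convolution layer take the output of the 1st convolution block, and it uses a zero padding of 1 for the 3 × 3 convolutions so that all branches keep the input resolution:

```python
import torch
import torch.nn as nn

class InformationExtractionBlock(nn.Module):
    """Sketch of the information extraction block; n_ch is 64 for the 1st
    block (input D1) and 512 for the 2nd block (input S5)."""
    def __init__(self, n_ch: int):
        super().__init__()
        self.conv_block1 = nn.Sequential(                 # 1st convolution block
            nn.Conv2d(n_ch, n_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(n_ch, n_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.conv_block2 = nn.Sequential(                 # 2nd convolution block
            nn.Conv2d(n_ch, n_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.conv_block3 = nn.Sequential(                 # 3rd convolution block
            nn.Conv2d(2 * n_ch, n_ch, kernel_size=1), nn.ReLU(inplace=True))
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        x = self.conv_block1(x)
        gate = self.conv_block2(x)                        # full-resolution branch
        pooled = torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1)
        up = self.up(self.conv_block3(pooled))            # pooled branch, size restored
        return up + up * gate                             # element-wise multiply, then add
```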
In step 1_2, the 5 feature reconstruction blocks have the same structure and are composed of a context attention block and a channel attention block, and for the 1 st feature reconstruction block, which performs a first element addition operation on all feature maps in S1 and all feature maps in F1, the input terminal of the context attention block receives all feature maps obtained after the first element addition operation, the input terminal of the channel attention block receives all feature maps output from the output terminal of the context attention block, element multiplication operation is carried out on all feature maps output by the output end of the channel attention block and all feature maps obtained after the first element addition operation, performing element addition operation for a second time on all feature maps obtained by multiplying all feature maps in the S1 by elements, wherein a set formed by all feature maps obtained by the element addition operation for the second time is F2; for the 2 nd feature reconstruction block, performing a first element addition operation on all feature maps in S2 and all feature maps in F3, receiving all feature maps obtained after the first element addition operation at an input end of the context attention block, receiving all feature maps output by an output end of the context attention block at an input end of the channel attention block, performing an element multiplication operation on all feature maps output by an output end of the channel attention block and all feature maps obtained after the first element addition operation, performing a second element addition operation on all feature maps obtained after the element multiplication operation and all feature maps obtained after the second element addition operation, wherein a set formed by all feature maps obtained after the second element addition operation is F4; for the 3 rd feature reconstruction block, the first element addition operation is performed on all feature maps in S3 and all feature maps in F5, the input end of the context attention block receives all feature maps obtained after the first element addition operation, the input end of the channel attention block receives all feature maps output by the output end of the context attention block, the element multiplication operation is performed on all feature maps output by the output end of the channel attention block and all feature maps obtained after the first element addition operation, the second element addition operation is performed on all feature maps in S3 and all feature maps obtained after the element multiplication operation, and the set formed by all feature maps obtained after the second element addition operation is F6; for the 4 th feature reconstruction block, performing a first element addition operation on all feature maps in S4 and all feature maps in F7, receiving all feature maps obtained after the first element addition operation at an input end of the context attention block, receiving all feature maps output by an output end of the context attention block at an input end of the channel attention block, performing an element multiplication operation on all feature maps output by an output end of the channel attention block and all feature maps obtained after the first element addition operation, performing a second element addition operation on all feature maps obtained after the element multiplication operation and all feature maps obtained after the second element addition operation, 
wherein a set formed by all feature maps obtained after the second element addition operation is F8; for the 5 th feature reconstruction block, the first element addition operation is performed on all feature maps in S5 and all feature maps in F9, the input end of the context attention block receives all feature maps obtained after the first element addition operation, the input end of the channel attention block receives all feature maps output by the output end of the context attention block, the element multiplication operation is performed on all feature maps output by the output end of the channel attention block and all feature maps obtained after the first element addition operation, the second element addition operation is performed on all feature maps in S5 and all feature maps obtained after the element multiplication operation, and the set formed by all feature maps obtained after the second element addition operation is F10.
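A structural sketch of the feature reconstruction (remodeling) block; the internals of the context attention block and the channel attention block are not given in this passage, so they are passed in here as ready-made modules:

```python
import torch.nn as nn

class FeatureRemodelingBlock(nn.Module):
    """The color features S and the depth-guided features F are added, passed
    through a context attention block followed by a channel attention block,
    and the resulting attention re-weights the summed features before a final
    residual addition with S."""
    def __init__(self, context_attention: nn.Module, channel_attention: nn.Module):
        super().__init__()
        self.context_attention = context_attention
        self.channel_attention = channel_attention

    def forward(self, s, f):
        summed = s + f                            # first element-wise addition
        att = self.channel_attention(self.context_attention(summed))
        return s + att * summed                   # multiplication, then second addition
```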
In step 1_2, the 4 information reconstruction blocks have the same structure and are composed of a second maximum pooling layer, a second average pooling layer, a4 th convolution block and a5 th convolution block, the 4 th convolution block comprises a fifth convolution layer and a fifth active layer which are sequentially connected, the 5 th convolution block comprises a sixth convolution layer, a sixth active layer, a seventh convolution layer and a seventh active layer which are sequentially connected, and the input end of the second maximum pooling layer and the input end of the second average pooling layer in the 1 st information reconstruction block both receive all the feature maps and the sixth convolution layer in F2The inputs of the stacks receive all the feature maps in D2, the inputs of the second largest pooling layer and the second average pooling layer in the 2 nd information re-modeling block receive all the feature maps in F4, the input of the sixth pooling layer receives all the feature maps in D3, the input of the second largest pooling layer and the input of the second average pooling layer in the 3 rd information re-modeling block receive all the feature maps in F6, the input of the sixth pooling layer receives all the feature maps in D4, the input of the second largest pooling layer and the input of the second average pooling layer in the 4 th information re-modeling block receive all the feature maps in F8, the input of the sixth pooling layer receives all the feature maps in D5, the all feature maps output at the output of the second largest subtracting pooling layer and all the feature maps output at the output of the second average pooling layer are operated by elements, receiving all feature maps obtained after the element subtraction operation by an input end of the fifth convolutional layer, performing element multiplication operation on all feature maps output by an output end of the fifth active layer and all feature maps output by an output end of the seventh active layer, performing element addition operation on all feature maps output by an output end of the fifth active layer and all feature maps obtained after the element multiplication operation, wherein for a1 st information reconstruction block, a set formed by all feature maps obtained after the element addition operation is F3, for a2 nd information reconstruction block, a set formed by all feature maps obtained after the element addition operation is F5, for a3 rd information reconstruction block, a set formed by all feature maps obtained after the element addition operation is F7, and for a4 th information reconstruction block, a set formed by all feature maps obtained after the element addition operation is F9; wherein the number of input channels of the first input terminal of the jth information reproduction block is set to n1jThe number of input channels of the second input end is n2jThe number n1 of input channels at the first input of the 1 st information reproduction block164, number of input channels n2 of second input end1128, the number of input channels n1 at the first input of the 2 nd information reconstruction block2128, number of input channels n2 of second input end2256, number n1 of input channels at the first input of the 3 rd information reconstruction block3256, number of input channels n of second input end23512, the number of input channels n1 at the first input of the 4 th information reconstruction block4512, number of input channels n2 of second input end4512, j is 
1,2,3,4, the convolution kernel size of the fifth convolution layer in the jth information reconstruction block is 1 × 1, and the number of convolution kernels is n2jThe step size is 1, the value of the zero padding parameter is 0, the convolution kernel size of the sixth convolution layer in the jth information reconstruction block is 1 × 1, and the number of convolution kernels is n2jThe step size is 1, the value of the zero padding parameter is 0, the convolution kernel size of the seventh convolution layer in the jth information reconstruction block is 3 x 3, and the number of convolution kernels is n2jThe step length is 1, the value of the zero padding parameter is 1, the activation mode of the fifth activation layer, the sixth activation layer and the seventh activation layer is 'Relu', the convolution kernel size of the second maximum pooling layer and the second average pooling layer is 2 x 2, the step length is 2, the value of the zero padding parameter is 0, and when element subtraction operation is performed on all feature maps output by the output end of the second maximum pooling layer and all feature maps output by the output end of the second average pooling layer, corresponding elements in the corresponding feature maps output by the output end of the second average pooling layer are subtracted from elements in the feature maps output by the output end of the second maximum pooling layer.
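A hedged sketch of the information reconstruction (remodeling) block; the channel counts correspond to (n1_j, n2_j), i.e. (64, 128), (128, 256), (256, 512) and (512, 512) for the four blocks:

```python
import torch.nn as nn

class InformationRemodelingBlock(nn.Module):
    """`in_ch` is the channel count of the fused features F from the previous
    level and `out_ch` the channel count of the depth features D of the
    current level."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.conv_block4 = nn.Sequential(                      # fifth conv + ReLU
            nn.Conv2d(in_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True))
        self.conv_block5 = nn.Sequential(                      # sixth/seventh conv + ReLU
            nn.Conv2d(out_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f, d):
        diff = self.max_pool(f) - self.avg_pool(f)   # element-wise subtraction
        a = self.conv_block4(diff)                   # pooled fused-feature branch
        b = self.conv_block5(d)                      # depth-feature branch
        return a + a * b                             # multiplication, then addition
```

For example, under these assumptions InformationRemodelingBlock(64, 128) would map F2 and D2 to the 128 maps of F3.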
In step 1_2, the 5 feature aggregation blocks have the same structure and are composed of a 6 th convolution block, a 7 th convolution block, a 8 th convolution block, a 9 th convolution block, a 10 th convolution block, a 11 th convolution block, a 12 th convolution block, a 13 th convolution block, a second upsampling layer and a residual fusion block, wherein the 6 th convolution block includes an eighth convolution layer and an eighth active layer which are sequentially connected, the 7 th convolution block includes a ninth convolution layer and a ninth active layer which are sequentially connected, the 8 th convolution block includes a tenth convolution layer and a tenth active layer which are sequentially connected, the 9 th convolution block includes an eleventh convolution layer and an eleventh active layer which are sequentially connected, the 10 th convolution block includes a twelfth convolution layer and a twelfth active layer which are sequentially connected, the 11 th convolution block includes a thirteenth convolution layer and a thirteenth active layer which are sequentially connected, the 12 th convolution block includes a fourteenth convolution layer and a fourteenth active layer which are sequentially connected, the 13 th convolution block comprises a fifteenth convolution layer and a fifteenth activation layer which are connected in sequence, and the residual error fusion block comprises a fifteenth convolution layer and a fifteenth activation layer which are connected in sequenceThe input terminal of the eighth convolutional layer in the 1 st feature aggregation block receives all the feature maps in F10, the input terminal of the ninth convolutional layer receives all the feature maps in P5, the input terminal of the second upsampling layer receives all the feature maps in F11, the input terminal of the eighth convolutional layer in the 2 nd feature aggregation block receives all the feature maps in F8, the input terminal of the ninth convolutional layer receives all the feature maps in P4, the input terminal of the second upsampling layer receives all the feature maps in A1, the input terminal of the eighth convolutional layer in the 3 rd feature aggregation block receives all the feature maps in F6, the input terminal of the ninth convolutional layer receives all the feature maps in P3, the input terminal of the second upsampling layer receives all the feature maps in A2, the input terminal of the eighth convolutional layer in the 4 th feature aggregation block receives all the feature maps in F4, The input end of the ninth convolutional layer receives all the feature maps in P2, the input end of the second upsampling layer receives all the feature maps in A3, the input end of the eighth convolutional layer of the 5 th feature aggregation block receives all the feature maps in F2, the input end of the ninth convolutional layer receives all the feature maps in P1, the input end of the second upsampling layer receives all the feature maps in A4, all the feature maps output by the output end of the eighth active layer and all the feature maps output by the output end of the ninth active layer are respectively subjected to channel quartering, the channel quartering is respectively divided into four parts in sequence, the first channel number superposition operation is carried out on the 1 st part of all the feature maps output by the output end of the eighth active layer and the 1 st part of all the feature maps output by the output end of the ninth active layer, the 
second channel number superposition operation is carried out on the 2 nd part of all the feature maps output by the output end of the eighth active layer and the 2 nd part of all the feature maps output by the output end of the ninth active layer, performing a third channel number superposition operation on the 3 rd parts of all the characteristic diagrams output by the output end of the eighth active layer and the 3 rd parts of all the characteristic diagrams output by the output end of the ninth active layer, performing a fourth channel number superposition operation on the 4 th parts of all the characteristic diagrams output by the output end of the eighth active layer and the 4 th parts of all the characteristic diagrams output by the output end of the ninth active layer, receiving all the characteristic diagrams output by the output end of the second upsampling layer by the input end of the tenth convolutional layer, and terminating the input end of the eleventh convolutional layer by the output end of the eleventh convolutional layerReceiving all feature maps obtained after the first channel number superposition operation, receiving all feature maps obtained after the second channel number superposition operation by an input end of a twelfth convolution layer, receiving all feature maps obtained after the third channel number superposition operation by an input end of a thirteenth convolution layer, receiving all feature maps obtained after the fourth channel number superposition operation by an input end of a fourteenth convolution layer, performing fifth channel number superposition operation on all feature maps output by an output end of an eleventh active layer, all feature maps output by an output end of the twelfth active layer, all feature maps output by an output end of the thirteenth active layer and all feature maps output by an output end of the fourteenth active layer, receiving all feature maps obtained after the fifth channel number superposition operation by an input end of a fifteenth convolution layer, performing element multiplication operation on all feature maps output by an output end of the tenth active layer and all feature maps output by an output end of the fifteenth active layer, performing a first element addition operation on all feature maps output by an output end of a tenth active layer and all feature maps obtained after the element multiplication operation, receiving all feature maps obtained after the first element addition operation by an input end of a sixteenth active layer, performing a second element addition operation on all feature maps output by an output end of the sixteenth convolutional layer and all feature maps obtained after the first element addition operation, wherein a set formed by all feature maps obtained after the second element addition operation is A1 for a1 st feature aggregation block, a set formed by all feature maps obtained after the second element addition operation is A2 for a2 nd feature aggregation block, a set formed by all feature maps obtained after the second element addition operation is A3 for A3 rd feature aggregation block, and a set formed by all feature maps obtained after the second element addition operation is A4 for a4 th feature aggregation block, for the 5 th feature aggregation block, a set formed by all feature maps obtained after the second element addition operation is A5; wherein the number of input channels of the first input end of the mth feature aggregation block is set to be n1mThe number of 
input channels of the second input end is n2mThe number of input channels of the third input end is n3mNumber n1 of input channels at the first input of the 1 st feature aggregation block1512, number of input channels n2 of second input end1512, number of input channels n3 of third input end1512, the number of input channels n1 at the first input of the 2 nd feature aggregation block2512, number of input channels n2 of second input end2512, number of input channels n3 of third input end2256, input channel number n1 at the first input of the 3 rd feature aggregation block3256, number of input channels n2 of second input end3256, number of input channels n3 of third input end3128, the number of input channels n1 at the first input of the 4 th feature aggregation block4128, number of input channels n2 of second input end4128, number of input channels n3 of third input end464, the number n1 of input channels at the first input of the 5 th feature aggregation block564, number of input channels n2 of second input end564, number of input channels n3 of third input end532, the convolution kernel size of the eighth convolution layer in the mth feature aggregation block is 3 × 3, and the number of convolution kernels is n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of ninth convolution layer in mth characteristic aggregation block of 3 x 3, convolution kernel number of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of tenth convolution layer in mth characteristic aggregation block of 3 x 3, convolution kernel number of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of eleventh convolution layer in mth characteristic aggregation block of 3 x 3, convolution kernel number of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of the twelfth convolution layer in the mth characteristic aggregation block of 3 x 3, number of convolution kernels of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of thirteenth convolution layer in mth characteristic aggregation block of 3 x 3, convolution kernel number of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of fourteenth convolution layer in mth characteristic aggregation block of 3 x 3, convolution kernel number of n3mStep size of 1, zero padding parameter value of 1, convolution kernel size of fifteenth convolution layer in mth characteristic aggregation block of 3 x 3, number of convolution kernels of n3mStep size of 1, zero-filled parameter value of 0, m characteristic aggregation blockThe size of the convolution kernel of the sixteenth convolution layer of (2) is 3 × 3, and the number of convolution kernels is n3mAnd/2, the step length is 1, the value of the zero padding parameter is 0, m is 1,2,3,4,5, the activation modes of an eighth activation layer, a ninth activation layer, a tenth activation layer, an eleventh activation layer, a twelfth activation layer, a thirteenth activation layer, a fourteenth activation layer, a fifteenth activation layer and a sixteenth activation layer are Relu, the convolution kernel size of the third maximum pooling layer is 5 x 5, the step length is 1, the value of the zero padding parameter is 2, the amplification factor of the second up-sampling layer is 2, and the interpolation method is bilinear interpolation.
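A hedged sketch of the feature aggregation block. The group-wise channel split and cross-concatenation of the two convolved inputs follows the text; the 3 × 3 convolutions use a zero padding of 1 so that spatial sizes are preserved, and the 1 × 1 projection before the final residual addition is an assumption added so that the stated output channel count (n3_m/2) lines up. For the 1st aggregation block the three listed inputs are already at the same scale, so the ×2 up-sampling of the third input may have to be omitted there:

```python
import torch
import torch.nn as nn

def conv3x3(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class FeatureAggregationBlock(nn.Module):
    """x: remodeled color feature (first input), y: dilated depth feature
    (second input), z: coarser aggregated feature (third input, upsampled).
    `ch` corresponds to n3_m; the block outputs ch // 2 feature maps."""
    def __init__(self, x_ch: int, y_ch: int, ch: int):
        super().__init__()
        self.conv_x = conv3x3(x_ch, ch)              # eighth convolution layer
        self.conv_y = conv3x3(y_ch, ch)              # ninth convolution layer
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv_z = conv3x3(ch, ch)                # tenth convolution layer
        self.group_convs = nn.ModuleList(conv3x3(ch // 2, ch) for _ in range(4))
        self.conv_merge = conv3x3(4 * ch, ch)        # fifteenth convolution layer
        self.res_pool = nn.MaxPool2d(5, stride=1, padding=2)   # third max pooling layer
        self.conv_out = conv3x3(ch, ch // 2)         # sixteenth convolution layer
        self.project = nn.Conv2d(ch, ch // 2, 1)     # assumed channel projection

    def forward(self, x, y, z):
        xg = torch.chunk(self.conv_x(x), 4, dim=1)   # split channels into quarters
        yg = torch.chunk(self.conv_y(y), 4, dim=1)
        groups = [gc(torch.cat([a, b], dim=1)) for gc, a, b in zip(self.group_convs, xg, yg)]
        w = self.conv_merge(torch.cat(groups, dim=1))
        z = self.conv_z(self.up(z))
        r = z + z * w                                # multiply, then first addition
        return self.project(r) + self.conv_out(self.res_pool(r))   # residual fusion
```

For example, under these assumptions FeatureAggregationBlock(512, 512, 256) would map F8, P4 and A1 to the 128 maps of A2.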
Compared with the prior art, the invention has the advantages that:
1) the convolutional neural network constructed by the method is a double-flow end-to-end interaction cycle characteristic remodeling network system structure, information flows of two modes are mutually communicated so as to extract enough complementary information, and meanwhile, the background noise of the two modes is inhibited, so that a convolutional neural network training model obtained by training has better significance detection performance.
2) The convolutional neural network constructed by the method of the invention is provided with the information extraction block, and the information extraction block can further extract the foreground information of the shallow depth map and the foreground information of the deep color map through the pooling operation, thereby being beneficial to the full extraction of the information and leading the trained convolutional neural network training model to be capable of effectively detecting the significant objects.
3) The convolutional neural network constructed by the method of the invention is designed with the characteristic remolding block and the information remolding block, the characteristic remolding block fuses color information by taking depth information as weight, and the information remolding block fuses the fusion information of the characteristic remolding block and adjacent depth information again to obtain complementary context characteristics, so that the convolutional neural network training model obtained by training can effectively detect a significant object.
4) The feature aggregation block is designed in the convolutional neural network constructed by the method, and the local features and the global features of the two modes are fully fused, so that the convolutional neural network training model obtained by training can effectively detect the significant object.
Drawings
FIG. 1 is a schematic diagram of the structure of an end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 2 is a schematic diagram of the structure of an information extraction block in an end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 3 is a schematic diagram of the structure of the feature reconstruction block in the end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 4 is a schematic diagram of the structure of the information reconstruction block in the end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 5 is a schematic diagram of a structure of a feature aggregation block in an end-to-end convolutional neural network constructed by the method of the present invention;
FIG. 6a is an RGB image of the 1 st pair of 3D images to be saliency detected;
FIG. 6b is a depth image of the 1 st pair of 3D images to be saliency detected;
FIG. 6c is a predicted salient image obtained by processing FIGS. 6a and 6b using the method of the present invention;
FIG. 6D is a label image corresponding to the 1 st pair of 3D images to be detected for saliency;
FIG. 7a is an RGB image of the 2 nd pair of 3D images to be saliency detected;
FIG. 7b is a depth image of the 2 nd pair of 3D images to be saliency detected;
FIG. 7c is a predicted salient image obtained by processing FIGS. 7a and 7b using the method of the present invention;
FIG. 7D is a label image corresponding to the 2 nd pair of 3D images to be saliency detected;
FIG. 8a is an RGB image of a3 rd pair of 3D images to be saliency detected;
FIG. 8b is a depth image of the 3 rd pair of 3D images to be saliency detected;
FIG. 8c is a predicted salient image obtained by processing FIGS. 8a and 8b using the method of the present invention;
FIG. 8D is a label image corresponding to the 3 rd pair of 3D images to be saliency detected;
FIG. 9a is an RGB image of the 4 th pair of 3D images to be saliency detected;
FIG. 9b is a depth image of the 4 th pair of 3D images to be saliency detected;
FIG. 9c is a predicted salient image obtained by processing FIGS. 9a and 9b using the method of the present invention;
FIG. 9D is a label image corresponding to the 4 th pair of 3D images to be saliency detected;
FIG. 10a is a PR (precision-recall) plot obtained by processing the 3D images for testing in the NJU2K dataset using the method of the present invention;
fig. 10b is a PR (precision-recall) plot obtained by processing a 3D image for detection in an NLPR dataset using the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying examples.
The invention provides a significance image detection method for interactive cycle characteristic remodeling.
The specific steps of the training phase process are as follows:
step 1_1: selecting N pairs of original 3D images and the label image corresponding to each pair of original 3D images, recording the RGB image of the k-th pair of original 3D images as {I_k(x,y)}, recording the depth image of the k-th pair of original 3D images as {D_k(x,y)}, and taking the real saliency detection image corresponding to the k-th pair of original 3D images as the label image, recorded as {G_k(x,y)}; then the RGB images, the depth images and the corresponding label images of all the original 3D images form a training set. Each pair of original 3D images comprises an RGB image and a depth image, N is a positive integer, N ≥ 200 (for example, N = 600), k is a positive integer, 1 ≤ k ≤ N, 1 ≤ x ≤ W, 1 ≤ y ≤ H, W represents the width of the original 3D image and of its RGB image, depth image and corresponding label image, H represents the height of the original 3D image and of its RGB image, depth image and corresponding label image, and W = H = 224 in this embodiment; I_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {I_k(x,y)}, D_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {D_k(x,y)}, and G_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {G_k(x,y)}.
Step 1_ 2: constructing an end-to-end convolutional neural network: as shown in fig. 1, the convolutional neural network comprises an input layer, an encoding part, a decoding part and an output layer, wherein the input layer comprises an RGB map input layer and a depth map input layer, the encoding part comprises 10 neural network blocks, and the decoding part comprises 2 information extraction blocks, 5 feature reconstruction blocks, 4 information reconstruction blocks, 5 expansion convolution blocks and 5 feature aggregation blocks; the output layer comprises an output convolutional layer, the size of a convolution kernel of the output convolutional layer is 3 multiplied by 3, the number of the convolution kernels is 1, the step length is 1, and the output convolutional layer is a commonly used convolutional layer.
For an RGB image input layer in the input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original RGB image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original RGB image to the encoding part; wherein, the width of the original RGB image is W, and the height is H.
For a depth map input layer in the input layer, the input end of the input layer receives a three-channel depth map processed by an original depth image by adopting a copying method, and the output end of the input layer outputs the three-channel depth map to a coding part; wherein, the width of the original depth image is W, and the height is H.
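As an illustration of the copying method mentioned above, the following minimal sketch (not taken from the patent itself; the tensor name `depth` is hypothetical) shows one way of replicating a single-channel depth map into a three-channel input in PyTorch:

```python
import torch

# A single-channel depth image of height H and width W (values here are arbitrary).
H, W = 224, 224
depth = torch.rand(1, 1, H, W)        # shape: (batch, 1, H, W)

# Replicate the single channel three times so the depth coding stream can reuse a
# three-channel backbone, as the copying method in the text suggests.
depth_3ch = depth.repeat(1, 3, 1, 1)  # shape: (batch, 3, H, W)
print(depth_3ch.shape)                # torch.Size([1, 3, 224, 224])
```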
For the coding part, the 1st neural network block, the 2nd neural network block, the 3rd neural network block, the 4th neural network block and the 5th neural network block are sequentially connected to form the color coding stream, and the 6th neural network block, the 7th neural network block, the 8th neural network block, the 9th neural network block and the 10th neural network block are sequentially connected to form the depth coding stream. The input end of the 1st neural network block receives the R channel component, the G channel component and the B channel component of the original RGB image output by the output end of the RGB image input layer; the output end of the 1st neural network block outputs 64 feature maps, the set formed by these 64 feature maps is recorded as S1, and each feature map in S1 has a width of W and a height of H. The input end of the 2nd neural network block receives all the feature maps in S1; its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as S2, and each feature map in S2 has a width of W/2 and a height of H/2. The input end of the 3rd neural network block receives all the feature maps in S2; its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as S3, and each feature map in S3 has a width of W/4 and a height of H/4. The input end of the 4th neural network block receives all the feature maps in S3; its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as S4, and each feature map in S4 has a width of W/8 and a height of H/8. The input end of the 5th neural network block receives all the feature maps in S4; its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as S5, and each feature map in S5 has a width of W/16 and a height of H/16. The input end of the 6th neural network block receives the three-channel depth map output by the output end of the depth map input layer; its output end outputs 64 feature maps, the set formed by these 64 feature maps is recorded as D1, and each feature map in D1 has a width of W and a height of H. The input end of the 7th neural network block receives all the feature maps in D1; its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as D2, and each feature map in D2 has a width of W/2 and a height of H/2. The input end of the 8th neural network block receives all the feature maps in D2; its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as D3, and each feature map in D3 has a width of W/4 and a height of H/4. The input end of the 9th neural network block receives all the feature maps in D3; its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as D4, and each feature map in D4 has a width of W/8 and a height of H/8. The input end of the 10th neural network block receives all the feature maps in D4; its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as D5, and each feature map in D5 has a width of W/16 and a height of H/16. The encoding part provides all the feature maps of S1, S2, S3, S4, S5, D1, D2, D3, D4 and D5 to the decoding part.
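Since the neural network blocks adopt the structure of the VGG-16 model (see below), the two coding streams described above can be sketched with torchvision's VGG-16 feature extractor split at its five convolutional stages. The split indices, class names and the decision whether to load pretrained weights are illustrative assumptions, not the patent's code:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

def make_vgg_stages():
    # Indices 0-3, 4-8, 9-15, 16-22, 23-29 of VGG-16 'features' correspond to the five
    # convolutional stages with 64, 128, 256, 512, 512 channels at resolutions
    # W x H, W/2 x H/2, W/4 x H/4, W/8 x H/8, W/16 x H/16.
    feats = vgg16().features           # ImageNet weights could be loaded here if desired
    cuts = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]
    return nn.ModuleList([nn.Sequential(*[feats[i] for i in range(a, b)]) for a, b in cuts])

class TwoStreamEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.color_stages = make_vgg_stages()   # blocks 1-5: color coding stream
        self.depth_stages = make_vgg_stages()   # blocks 6-10: depth coding stream

    def forward(self, rgb, depth3):
        s_feats, d_feats, x, y = [], [], rgb, depth3
        for cs, ds in zip(self.color_stages, self.depth_stages):
            x, y = cs(x), ds(y)
            s_feats.append(x)   # S1..S5
            d_feats.append(y)   # D1..D5
        return s_feats, d_feats
```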
For the decoding part, the input end of the 1st information extraction block receives all the feature maps in D1; its output end outputs 64 feature maps, the set formed by these 64 feature maps is recorded as F1, and each feature map in F1 has a width of W and a height of H. The first input end of the 1st feature reconstruction block receives all the feature maps in S1, its second input end receives all the feature maps in F1, its output end outputs 64 feature maps, the set formed by these 64 feature maps is recorded as F2, and each feature map in F2 has a width of W and a height of H. The first input end of the 1st information reconstruction block receives all the feature maps in F2, its second input end receives all the feature maps in D2, its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as F3, and each feature map in F3 has a width of W/2 and a height of H/2. The first input end of the 2nd feature reconstruction block receives all the feature maps in S2, its second input end receives all the feature maps in F3, its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as F4, and each feature map in F4 has a width of W/2 and a height of H/2. The first input end of the 2nd information reconstruction block receives all the feature maps in F4, its second input end receives all the feature maps in D3, its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as F5, and each feature map in F5 has a width of W/4 and a height of H/4. The first input end of the 3rd feature reconstruction block receives all the feature maps in S3, its second input end receives all the feature maps in F5, its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as F6, and each feature map in F6 has a width of W/4 and a height of H/4. The first input end of the 3rd information reconstruction block receives all the feature maps in F6, its second input end receives all the feature maps in D4, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as F7, and each feature map in F7 has a width of W/8 and a height of H/8. The first input end of the 4th feature reconstruction block receives all the feature maps in S4, its second input end receives all the feature maps in F7, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as F8, and each feature map in F8 has a width of W/8 and a height of H/8. The first input end of the 4th information reconstruction block receives all the feature maps in F8, its second input end receives all the feature maps in D5, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as F9, and each feature map in F9 has a width of W/16 and a height of H/16. The first input end of the 5th feature reconstruction block receives all the feature maps in S5, its second input end receives all the feature maps in F9, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as F10, and each feature map in F10 has a width of W/16 and a height of H/16. The input end of the 2nd information extraction block receives all the feature maps in S5, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as F11, and each feature map in F11 has a width of W/16 and a height of H/16. The input end of the 1st expansion convolution block receives all the feature maps in D1, its output end outputs 64 feature maps, the set formed by these 64 feature maps is recorded as P1, and each feature map in P1 has a width of W and a height of H. The input end of the 2nd expansion convolution block receives all the feature maps in D2, its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as P2, and each feature map in P2 has a width of W/2 and a height of H/2. The input end of the 3rd expansion convolution block receives all the feature maps in D3, its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as P3, and each feature map in P3 has a width of W/4 and a height of H/4. The input end of the 4th expansion convolution block receives all the feature maps in D4, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as P4, and each feature map in P4 has a width of W/8 and a height of H/8. The input end of the 5th expansion convolution block receives all the feature maps in D5, its output end outputs 512 feature maps, the set formed by these 512 feature maps is recorded as P5, and each feature map in P5 has a width of W/16 and a height of H/16. The first input end of the 1st feature aggregation block receives all the feature maps in F10, its second input end receives all the feature maps in P5, its third input end receives all the feature maps in F11, its output end outputs 256 feature maps, the set formed by these 256 feature maps is recorded as A1, and each feature map in A1 has a width of W/16 and a height of H/16. The first input end of the 2nd feature aggregation block receives all the feature maps in F8, its second input end receives all the feature maps in P4, its third input end receives all the feature maps in A1, its output end outputs 128 feature maps, the set formed by these 128 feature maps is recorded as A2, and each feature map in A2 has a width of W/8 and a height of H/8. The first input end of the 3rd feature aggregation block receives all the feature maps in F6, its second input end receives all the feature maps in P3, its third input end receives all the feature maps in A2, its output end outputs 64 feature maps, the set formed by these 64 feature maps is recorded as A3, and each feature map in A3 has a width of W/4 and a height of H/4. The first input end of the 4th feature aggregation block receives all the feature maps in F4, its second input end receives all the feature maps in P2, its third input end receives all the feature maps in A3, its output end outputs 32 feature maps, the set formed by these 32 feature maps is recorded as A4, and each feature map in A4 has a width of W/2 and a height of H/2. The first input end of the 5th feature aggregation block receives all the feature maps in F2, its second input end receives all the feature maps in P1, its third input end receives all the feature maps in A4, its output end outputs 16 feature maps, the set formed by these 16 feature maps is recorded as A5, and each feature map in A5 has a width of W and a height of H. The decoding part supplies all the feature maps in A5 to the output layer.
For the output layer, the input end of the output convolutional layer receives all the feature maps in A5, and the output end of the output convolutional layer outputs a feature map with the width W and the height H as a significance detection map.
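A minimal sketch of the output layer as described (3 × 3 kernel, a single convolution kernel, stride 1); the padding of 1 is assumed here so that the W × H spatial size is preserved, and the sigmoid is only one plausible way of mapping the response to a saliency probability map:

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    def __init__(self, in_channels=16):
        super().__init__()
        # 3x3 convolution, 1 output channel, stride 1; padding=1 keeps width W and height H.
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, stride=1, padding=1)

    def forward(self, a5):
        return torch.sigmoid(self.conv(a5))   # one W x H saliency detection map

# A5 holds 16 feature maps of size W x H, so in_channels = 16 here.
saliency = OutputLayer(16)(torch.rand(1, 16, 224, 224))
print(saliency.shape)   # torch.Size([1, 1, 224, 224])
```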
Step 1_3: input the R channel component, the G channel component and the B channel component of the RGB image of each pair of original 3D images in the training set, together with the three-channel depth map obtained by copying its depth image, into the convolutional neural network for training, obtain the saliency detection map corresponding to each pair of original 3D images, and record the saliency detection map corresponding to the k-th pair of original 3D images as {S_k(x,y)}; wherein S_k(x,y) represents the pixel value of the pixel point whose coordinate position is (x,y) in {S_k(x,y)}.
Step 1_4: calculate the loss function value between the saliency detection map corresponding to each pair of original 3D images and the corresponding label image; the loss function value between {S_k(x,y)} and {G_k(x,y)} is recorded as Loss_k. In this embodiment, the loss function value is obtained using the conventional binary cross-entropy.
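The binary cross-entropy mentioned above compares each predicted pixel with the corresponding label pixel; a minimal sketch (the function and variable names are illustrative, and the network output is assumed to already lie in [0, 1]):

```python
import torch
import torch.nn.functional as F

def saliency_loss(pred, label):
    # pred:  predicted saliency detection map in [0, 1], shape (B, 1, H, W)
    # label: ground-truth label image with values in {0, 1}, shape (B, 1, H, W)
    return F.binary_cross_entropy(pred, label)

pred = torch.rand(4, 1, 224, 224)
label = (torch.rand(4, 1, 224, 224) > 0.5).float()
print(saliency_loss(pred, label).item())
```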
Step 1_ 5: repeatedly executing the step 1_3 and the step 1_4 for M times to obtain a convolutional neural network training model, and obtaining N multiplied by M loss function values; dividing the sum of the N loss function values obtained by each execution by N to obtain final loss function values obtained by the execution, and obtaining M final loss function values in total; finding out the final loss function value with the minimum value from the M final loss function values, and correspondingly taking the weight vector and the bias item corresponding to the minimum final loss function value as the optimal weight vector and the optimal bias item of the convolutional neural network training model; where M > 1, M is 1025 in this example.
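Steps 1_3 to 1_5 amount to a standard training loop that keeps the weights of the epoch with the smallest mean loss; the sketch below assumes that `model`, `train_loader`, an optimizer and a loss function already exist (none of these names come from the patent):

```python
import copy
import torch

def train(model, train_loader, optimizer, loss_fn, M, device="cuda"):
    best_loss, best_state = float("inf"), None
    for epoch in range(M):                       # repeat step 1_3 and step 1_4 M times
        total, count = 0.0, 0
        for rgb, depth3, label in train_loader:  # the N training pairs of one epoch
            rgb, depth3, label = rgb.to(device), depth3.to(device), label.to(device)
            pred = model(rgb, depth3)
            loss = loss_fn(pred, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item() * rgb.size(0)
            count += rgb.size(0)
        mean_loss = total / count                # final loss function value of this epoch
        if mean_loss < best_loss:                # keep the weights with the minimum final loss
            best_loss = mean_loss
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)            # optimal weight vector and bias terms
    return model
```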
The test stage process comprises the following specific steps:
step 2_1: input the R channel component, the G channel component and the B channel component of the RGB image of the 3D image to be saliency detected, together with the three-channel depth map obtained by copying its depth image, into the convolutional neural network training model, predict using the optimal weight vector and the optimal bias terms, and obtain the corresponding saliency prediction image.
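The test stage thus reduces to a single forward pass with the stored optimal weights; a sketch under assumed file and helper names:

```python
import torch

def predict(model, rgb, depth, weights_path="best_model.pth", device="cuda"):
    # rgb:   (1, 3, H, W) RGB image of the 3D image pair to be saliency detected
    # depth: (1, 1, H, W) depth image; copied to three channels as in the training stage
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.eval().to(device)
    with torch.no_grad():
        depth3 = depth.repeat(1, 3, 1, 1).to(device)
        return model(rgb.to(device), depth3)     # predicted saliency image
```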
In this embodiment, in step 1_2, the 2 information extraction blocks have the same structure, and as shown in fig. 2, each information extraction block consists of a 1st convolution block, a first maximum pooling layer (Max pooling), a first average pooling layer (Average pooling), a 2nd convolution block, a 3rd convolution block and a first up-sampling layer. The 1st convolution block comprises a first convolution layer (Conv), a first activation layer (Act), a second convolution layer and a second activation layer which are connected in sequence, the 2nd convolution block comprises a third convolution layer and a third activation layer which are connected in sequence, and the 3rd convolution block comprises a fourth convolution layer and a fourth activation layer which are connected in sequence. The input end of the first convolution layer in the 1st information extraction block receives all the feature maps in D1, and the input end of the first convolution layer in the 2nd information extraction block receives all the feature maps in S5. The input end of the first maximum pooling layer, the input end of the first average pooling layer and the input end of the third convolution layer all receive all the feature maps output by the output end of the second activation layer; a channel-number superposition operation is performed on all the feature maps output by the output end of the first maximum pooling layer and all the feature maps output by the output end of the first average pooling layer; the input end of the fourth convolution layer receives all the feature maps obtained after the channel-number superposition operation; the input end of the first up-sampling layer receives all the feature maps output by the output end of the fourth activation layer; an element multiplication operation is performed on all the feature maps output by the output end of the first up-sampling layer and all the feature maps output by the output end of the third activation layer; and an element addition operation is performed on all the feature maps output by the output end of the first up-sampling layer and all the feature maps obtained after the element multiplication operation. For the 1st information extraction block, the set formed by all the feature maps obtained after the element addition operation is F1; for the 2nd information extraction block, the set formed by all the feature maps obtained after the element addition operation is F11. The number of input channels of the i-th information extraction block is denoted n_i; then the number of input channels of the 1st information extraction block is n_1 = 64 and the number of input channels of the 2nd information extraction block is n_2 = 512. The convolution kernel size (kernel_size) of the first convolution layer and of the fourth convolution layer in the i-th information extraction block is 1 × 1, the number of convolution kernels (filters) is n_i, the step size (stride) is 1 and the zero padding parameter (padding) value is 0; the convolution kernel size of the second convolution layer in the i-th information extraction block is 3 × 3, the number of convolution kernels is n_i, the step size is 1 and the zero padding parameter value is 0; the convolution kernel size of the third convolution layer in the i-th information extraction block is 3 × 3, the number of convolution kernels is n_i, the step size is 1 and the zero padding parameter value is 1; i = 1, 2. The activation mode of the first activation layer, the second activation layer, the third activation layer and the fourth activation layer is "Relu"; the kernel size of the first maximum pooling layer and of the first average pooling layer is 2 × 2, the step size is 2 and the zero padding parameter value is 0; the amplification factor (scale factor) of the first up-sampling layer is 2 and the interpolation method is bilinear interpolation (bilinear). Here, the channel-number superposition operation, the element multiplication operation and the element addition operation are all existing techniques. C in fig. 2 denotes the channel-number superposition operation, + denotes the element addition operation, and × denotes the element multiplication operation.
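One plausible reading of the information extraction block is sketched below: a pooled (max plus average) branch produces a gating map that is upsampled and used to reweight the full-resolution branch. Padding values are chosen here so that all spatial sizes match and are assumptions, not the exact settings listed in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoExtractionBlock(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.block1 = nn.Sequential(                       # 1st convolution block
            nn.Conv2d(n, n, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n, n, 3, padding=1), nn.ReLU(inplace=True))
        self.block2 = nn.Sequential(                       # 2nd convolution block
            nn.Conv2d(n, n, 3, padding=1), nn.ReLU(inplace=True))
        self.block3 = nn.Sequential(                       # 3rd convolution block
            nn.Conv2d(2 * n, n, 1), nn.ReLU(inplace=True))
        self.maxpool = nn.MaxPool2d(2, stride=2)
        self.avgpool = nn.AvgPool2d(2, stride=2)

    def forward(self, x):
        x0 = self.block1(x)
        pooled = torch.cat([self.maxpool(x0), self.avgpool(x0)], dim=1)  # channel superposition
        gate = F.interpolate(self.block3(pooled), scale_factor=2,
                             mode="bilinear", align_corners=False)       # first up-sampling layer
        branch = self.block2(x0)
        return gate * branch + gate      # element multiplication, then element addition

# 1st information extraction block: D1 has 64 channels at 224 x 224.
f1 = InfoExtractionBlock(64)(torch.rand(1, 64, 224, 224))
print(f1.shape)   # torch.Size([1, 64, 224, 224])
```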
In this embodiment, in step 1_2, the 5 feature reconstruction blocks have the same structure, and as shown in fig. 3, each feature reconstruction block consists of a context attention block and a channel attention block. For the 1st feature reconstruction block, a first element addition operation is performed on all the feature maps in S1 and all the feature maps in F1; the input end of the context attention block receives all the feature maps obtained after the first element addition operation; the input end of the channel attention block receives all the feature maps output by the output end of the context attention block; an element multiplication operation is performed on all the feature maps output by the output end of the channel attention block and all the feature maps obtained after the first element addition operation; a second element addition operation is performed on all the feature maps in S1 and all the feature maps obtained after the element multiplication operation; and the set formed by all the feature maps obtained after the second element addition operation is F2. For the 2nd feature reconstruction block, the same operations are performed on all the feature maps in S2 and all the feature maps in F3, and the set formed by all the feature maps obtained after the second element addition operation is F4. For the 3rd feature reconstruction block, the same operations are performed on all the feature maps in S3 and all the feature maps in F5, and the set formed by all the feature maps obtained after the second element addition operation is F6. For the 4th feature reconstruction block, the same operations are performed on all the feature maps in S4 and all the feature maps in F7, and the set formed by all the feature maps obtained after the second element addition operation is F8. For the 5th feature reconstruction block, the same operations are performed on all the feature maps in S5 and all the feature maps in F9, and the set formed by all the feature maps obtained after the second element addition operation is F10. Here, the context attention block and the channel attention block refer to the DAM module from M. Zhang, S.-X. Fei, J. Liu, S. Xu, Y. Piao, and H. Lu, "Asymmetric two-stream architecture for accurate RGB-D saliency detection", in Proceedings of the European Conference on Computer Vision, 2020. + in fig. 3 denotes the element addition operation and × denotes the element multiplication operation.
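A sketch of the feature reconstruction block follows: the color features S and the previous fused features F are added, the sum is passed through context and channel attention, and the attention output re-modulates the sum before a residual connection back to S. The attention sub-blocks here are simple placeholders, not the DAM module cited above:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style placeholder for the channel attention block."""
    def __init__(self, n, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(n, n // r), nn.ReLU(inplace=True),
                                nn.Linear(n // r, n), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))      # global average pooling over H, W
        return x * w[:, :, None, None]

class FeatureReconstructionBlock(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.context_att = nn.Sequential(    # placeholder context attention block
            nn.Conv2d(n, n, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.channel_att = ChannelAttention(n)

    def forward(self, s, f):
        t = s + f                            # first element addition
        a = self.channel_att(self.context_att(t))
        return s + a * t                     # multiplication, then second element addition

# 1st feature reconstruction block: S1 and F1 both have 64 channels at 224 x 224.
f2 = FeatureReconstructionBlock(64)(torch.rand(1, 64, 224, 224), torch.rand(1, 64, 224, 224))
print(f2.shape)   # torch.Size([1, 64, 224, 224])
```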
In this embodiment, in step 1_2, the 4 information reconstruction blocks have the same structure, and as shown in fig. 4, each information reconstruction block consists of a second maximum pooling layer, a second average pooling layer, a 4th convolution block and a 5th convolution block; the 4th convolution block comprises a fifth convolution layer and a fifth activation layer which are connected in sequence, and the 5th convolution block comprises a sixth convolution layer, a sixth activation layer, a seventh convolution layer and a seventh activation layer which are connected in sequence. The input end of the second maximum pooling layer and the input end of the second average pooling layer in the 1st information reconstruction block both receive all the feature maps in F2, and the input end of the sixth convolution layer receives all the feature maps in D2; in the 2nd information reconstruction block they receive all the feature maps in F4 and the input end of the sixth convolution layer receives all the feature maps in D3; in the 3rd information reconstruction block they receive all the feature maps in F6 and the input end of the sixth convolution layer receives all the feature maps in D4; in the 4th information reconstruction block they receive all the feature maps in F8 and the input end of the sixth convolution layer receives all the feature maps in D5. An element subtraction operation is performed on all the feature maps output by the output end of the second maximum pooling layer and all the feature maps output by the output end of the second average pooling layer; the input end of the fifth convolution layer receives all the feature maps obtained after the element subtraction operation; an element multiplication operation is performed on all the feature maps output by the output end of the fifth activation layer and all the feature maps output by the output end of the seventh activation layer; and an element addition operation is performed on all the feature maps output by the output end of the fifth activation layer and all the feature maps obtained after the element multiplication operation. For the 1st information reconstruction block, the set formed by all the feature maps obtained after the element addition operation is F3; for the 2nd information reconstruction block it is F5; for the 3rd information reconstruction block it is F7; and for the 4th information reconstruction block it is F9. The number of input channels of the first input end of the j-th information reconstruction block is denoted n1_j and the number of input channels of its second input end is denoted n2_j; then n1_1 = 64 and n2_1 = 128 for the 1st information reconstruction block, n1_2 = 128 and n2_2 = 256 for the 2nd, n1_3 = 256 and n2_3 = 512 for the 3rd, and n1_4 = 512 and n2_4 = 512 for the 4th, with j = 1, 2, 3, 4. The convolution kernel size of the fifth convolution layer in the j-th information reconstruction block is 1 × 1, the number of convolution kernels is n2_j, the step size is 1 and the zero padding parameter value is 0; the convolution kernel size of the sixth convolution layer is 1 × 1, the number of convolution kernels is n2_j, the step size is 1 and the zero padding parameter value is 0; the convolution kernel size of the seventh convolution layer is 3 × 3, the number of convolution kernels is n2_j, the step size is 1 and the zero padding parameter value is 1. The activation mode of the fifth activation layer, the sixth activation layer and the seventh activation layer is "Relu"; the kernel size of the second maximum pooling layer and of the second average pooling layer is 2 × 2, the step size is 2 and the zero padding parameter value is 0. When the element subtraction operation is performed on all the feature maps output by the output end of the second maximum pooling layer and all the feature maps output by the output end of the second average pooling layer, each element in a feature map output by the second maximum pooling layer is subtracted by the corresponding element in the corresponding feature map output by the second average pooling layer. Here, the element subtraction operation, the element multiplication operation and the element addition operation are all existing techniques. In fig. 4, "-" denotes the element subtraction operation, + denotes the element addition operation, and × denotes the element multiplication operation.
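A sketch of the information reconstruction block as described: the difference of max- and average-pooled fused features gates the adjacent depth features, followed by a residual addition. The class and variable names are assumptions, and channel/spatial bookkeeping follows the figures reconstructed above:

```python
import torch
import torch.nn as nn

class InfoReconstructionBlock(nn.Module):
    def __init__(self, n1, n2):
        super().__init__()
        self.maxpool = nn.MaxPool2d(2, stride=2)
        self.avgpool = nn.AvgPool2d(2, stride=2)
        self.block4 = nn.Sequential(                  # 4th convolution block
            nn.Conv2d(n1, n2, 1), nn.ReLU(inplace=True))
        self.block5 = nn.Sequential(                  # 5th convolution block
            nn.Conv2d(n2, n2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n2, n2, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f, d):
        diff = self.maxpool(f) - self.avgpool(f)      # element subtraction
        gate = self.block4(diff)                      # fifth convolution layer + activation
        depth = self.block5(d)                        # sixth/seventh convolution layers
        return gate + gate * depth                    # multiplication, then addition

# 1st information reconstruction block: F2 has 64 channels at 224 x 224, D2 has 128 at 112 x 112.
f3 = InfoReconstructionBlock(64, 128)(torch.rand(1, 64, 224, 224), torch.rand(1, 128, 112, 112))
print(f3.shape)   # torch.Size([1, 128, 112, 112])
```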
In this embodiment, in step 1_2, the 5 feature aggregation blocks have the same structure, and as shown in fig. 5, each feature aggregation block consists of a 6th convolution block, a 7th convolution block, an 8th convolution block, a 9th convolution block, a 10th convolution block, an 11th convolution block, a 12th convolution block, a 13th convolution block, a second up-sampling layer and a residual fusion block. The 6th to 13th convolution blocks respectively comprise the eighth to fifteenth convolution layers, each followed by the corresponding eighth to fifteenth activation layer and connected in sequence; the residual fusion block comprises a sixteenth activation layer, a third maximum pooling layer and a sixteenth convolution layer which are connected in sequence. The input end of the eighth convolution layer receives all the feature maps of the first input end of the feature aggregation block (F10, F8, F6, F4 and F2 for the 1st to 5th feature aggregation blocks respectively), the input end of the ninth convolution layer receives all the feature maps of the second input end (P5, P4, P3, P2 and P1 respectively), and the input end of the second up-sampling layer receives all the feature maps of the third input end (F11, A1, A2, A3 and A4 respectively). All the feature maps output by the output end of the eighth activation layer and all the feature maps output by the output end of the ninth activation layer are each cut into four equal groups along the channel dimension; the 1st group of the eighth activation layer output and the 1st group of the ninth activation layer output are subjected to a first channel-number superposition operation, the 2nd groups to a second channel-number superposition operation, the 3rd groups to a third channel-number superposition operation, and the 4th groups to a fourth channel-number superposition operation. The input end of the tenth convolution layer receives all the feature maps output by the output end of the second up-sampling layer; the input ends of the eleventh, twelfth, thirteenth and fourteenth convolution layers respectively receive all the feature maps obtained after the first, second, third and fourth channel-number superposition operations; a fifth channel-number superposition operation is performed on all the feature maps output by the output ends of the eleventh, twelfth, thirteenth and fourteenth activation layers, and the input end of the fifteenth convolution layer receives all the feature maps obtained after the fifth channel-number superposition operation. An element multiplication operation is performed on all the feature maps output by the output end of the tenth activation layer and all the feature maps output by the output end of the fifteenth activation layer, and a first element addition operation is performed on all the feature maps output by the output end of the tenth activation layer and all the feature maps obtained after the element multiplication operation. The input end of the sixteenth activation layer receives all the feature maps obtained after the first element addition operation, and a second element addition operation is performed on all the feature maps output by the output end of the sixteenth convolution layer and all the feature maps obtained after the first element addition operation. For the 1st to 5th feature aggregation blocks, the sets formed by all the feature maps obtained after the second element addition operation are A1, A2, A3, A4 and A5 respectively. The numbers of input channels of the first, second and third input ends of the m-th feature aggregation block are denoted n1_m, n2_m and n3_m; then n1_1 = n2_1 = n3_1 = 512 for the 1st feature aggregation block, n1_2 = n2_2 = 512 and n3_2 = 256 for the 2nd, n1_3 = n2_3 = 256 and n3_3 = 128 for the 3rd, n1_4 = n2_4 = 128 and n3_4 = 64 for the 4th, and n1_5 = n2_5 = 64 and n3_5 = 32 for the 5th, with m = 1, 2, 3, 4, 5. The convolution kernel size of the eighth to fourteenth convolution layers in the m-th feature aggregation block is 3 × 3, the number of convolution kernels is n3_m, the step size is 1 and the zero padding parameter value is 1; the convolution kernel size of the fifteenth convolution layer is 3 × 3, the number of convolution kernels is n3_m, the step size is 1 and the zero padding parameter value is 0; the convolution kernel size of the sixteenth convolution layer is 3 × 3, the number of convolution kernels is n3_m/2, the step size is 1 and the zero padding parameter value is 0. The activation mode of the eighth to sixteenth activation layers is "Relu"; the kernel size of the third maximum pooling layer is 5 × 5, the step size is 1 and the zero padding parameter value is 2; the amplification factor of the second up-sampling layer is 2 and the interpolation method is bilinear interpolation. Here, the channel-number superposition operation, the element multiplication operation and the element addition operation are all existing techniques. C in fig. 5 denotes the channel-number superposition operation, + denotes the element addition operation, and × denotes the element multiplication operation.
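A simplified sketch of the feature aggregation block follows: the fused features F and the dilated-convolution features P are each split into four channel groups, the groups are concatenated pairwise and re-convolved (local/global mixing), the coarser aggregation result is upsampled and used as a gate, and a residual fusion (activation, 5 × 5 max pooling, convolution) refines the sum. The channel counts are simplified so that all shapes are consistent and do not reproduce every kernel count listed in the text; all names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3x3(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class FeatureAggregationBlock(nn.Module):
    def __init__(self, n, n_prev, n_out):
        super().__init__()
        self.conv_f = conv3x3(n, n)            # 6th convolution block (on F)
        self.conv_p = conv3x3(n, n)            # 7th convolution block (on P)
        self.conv_a = conv3x3(n_prev, n)       # 8th convolution block (on upsampled A_prev)
        self.mix = nn.ModuleList([conv3x3(n // 2, n // 4) for _ in range(4)])
        self.conv_mix = conv3x3(n, n)          # after re-concatenating the four mixed groups
        self.residual = nn.Sequential(nn.ReLU(inplace=True),
                                      nn.MaxPool2d(5, stride=1, padding=2),
                                      nn.Conv2d(n, n, 3, padding=1))
        self.reduce = conv3x3(n, n_out)        # bring the result to the block's output width

    def forward(self, f, p, a_prev):
        a_up = self.conv_a(F.interpolate(a_prev, scale_factor=2,
                                         mode="bilinear", align_corners=False))
        fs = torch.chunk(self.conv_f(f), 4, dim=1)       # channel quartering of the F branch
        ps = torch.chunk(self.conv_p(p), 4, dim=1)       # channel quartering of the P branch
        mixed = torch.cat([m(torch.cat([a_, b_], dim=1))
                           for m, a_, b_ in zip(self.mix, fs, ps)], dim=1)
        fused = a_up * self.conv_mix(mixed) + a_up       # gating plus first addition
        return self.reduce(fused + self.residual(fused)) # residual fusion, second addition

# 2nd feature aggregation block: F8 and P4 have 512 channels at 28 x 28, A1 has 256 at 14 x 14.
a2 = FeatureAggregationBlock(512, 256, 128)(torch.rand(1, 512, 28, 28),
                                            torch.rand(1, 512, 28, 28),
                                            torch.rand(1, 256, 14, 14))
print(a2.shape)   # torch.Size([1, 128, 28, 28])
```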
In this embodiment, the structures of the 10 neural network blocks are the same and adopt the structure of the neural network blocks in the existing VGG-16 model; the 5 expansion (dilated) convolution blocks have the same structure and adopt the RFB module from S. Liu and D. Huang, "Receptive field block net for accurate and fast object detection", In Proceedings of the European Conference on Computer Vision, 2018, pp. 385-400.
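The following minimal stand-in only illustrates the core idea of the expansion convolution block, enlarging the receptive field with parallel dilated 3 × 3 convolutions; it is not the cited RFB module, and the dilation rates are assumptions:

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    def __init__(self, n, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(n, n, 3, padding=d, dilation=d), nn.ReLU(inplace=True))
            for d in dilations])
        self.fuse = nn.Conv2d(len(dilations) * n, n, 1)   # fuse the multi-dilation branches

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# 1st expansion convolution block: D1 has 64 channels at 224 x 224, P1 keeps the same size.
p1 = DilatedConvBlock(64)(torch.rand(1, 64, 224, 224))
print(p1.shape)   # torch.Size([1, 64, 224, 224])
```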
To further illustrate the feasibility and effectiveness of the method of the present invention, experiments were conducted on the method of the present invention.
The method is tested with code written in Python using the PyTorch library; the experimental equipment is an Intel i5-7500 processor, and CUDA acceleration is used with an NVIDIA TITAN XP 12GB graphics card. In order to ensure the rigor of the experiments, the data sets selected are NJU2K and NLPR, which are well-known public data sets. NJU2K contains 1485 pairs of 3D images, with 1400 pairs used for training and 85 pairs used for detection; NLPR contains 730 pairs of 3D images, with 650 pairs used for training and 80 pairs used for detection.
In this experiment, 4 objective parameters commonly used for evaluating saliency detection methods are used as evaluation indexes: S↑ (Structure-measure), used for evaluating the structural similarity between the salient regions of the saliency detection map and of the label image; adpE↑ (adaptive E-measure) and adpF↑ (adaptive F-measure), used for evaluating the detection performance of the saliency detection map, where adpF is an important index for judging the quality of a detection method and is computed from the precision rate and the recall rate; and MAE↓ (Mean Absolute Error).
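For reference, MAE and the F-measure underlying adpF↑ can be computed as below; the adaptive threshold (twice the mean of the saliency map) and β² = 0.3 follow common practice in the saliency detection literature and are assumptions here, not values stated in the patent:

```python
import numpy as np

def mae(pred, gt):
    # Mean absolute error between a saliency map and its label image, both in [0, 1].
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def adaptive_f_measure(pred, gt, beta2=0.3):
    # Binarise the prediction with an adaptive threshold (twice its mean value),
    # then combine precision and recall into the F-measure.
    binary = pred >= min(2.0 * pred.mean(), 1.0)
    gt = gt > 0.5
    tp = np.logical_and(binary, gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

pred = np.random.rand(224, 224)
gt = (np.random.rand(224, 224) > 0.5).astype(np.float64)
print(mae(pred, gt), adaptive_f_measure(pred, gt))
```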
The saliency detection maps generated by the method of the invention are compared with the label images, and the method is evaluated with S↑, adpE↑, adpF↑ and MAE↓ as evaluation indexes. The evaluation indexes on the two data sets are listed in Table 1; the data listed in Table 1 show that the method of the invention performs well on both data sets.
TABLE 1 evaluation results of the method of the invention on two data sets
Fig. 6a is an RGB image of the 1 st pair of 3D images to be subjected to saliency detection, fig. 6b is a depth image of the 1 st pair of 3D images to be subjected to saliency detection, fig. 6c is a saliency prediction image obtained by processing the fig. 6a and 6b by using the method of the present invention, and fig. 6D is a label image corresponding to the 1 st pair of 3D images to be subjected to saliency detection; fig. 7a is an RGB image of the 2 nd pair of 3D images to be subjected to saliency detection, fig. 7b is a depth image of the 2 nd pair of 3D images to be subjected to saliency detection, fig. 7c is a saliency prediction image obtained by processing fig. 7a and 7b by using the method of the present invention, and fig. 7D is a label image corresponding to the 2 nd pair of 3D images to be subjected to saliency detection; fig. 8a is an RGB image of a3 rd pair of 3D images to be subjected to saliency detection, fig. 8b is a depth image of the 3 rd pair of 3D images to be subjected to saliency detection, fig. 8c is a saliency prediction image obtained by processing fig. 8a and 8b by using the method of the present invention, and fig. 8D is a label image corresponding to the 3 rd pair of 3D images to be subjected to saliency detection; fig. 9a is an RGB image of a4 th pair of 3D images to be subjected to saliency detection, fig. 9b is a depth image of the 4 th pair of 3D images to be subjected to saliency detection, fig. 9c is a saliency prediction image obtained by processing fig. 9a and 9b by using the method of the present invention, and fig. 9D is a label image corresponding to the 4 th pair of 3D images to be subjected to saliency detection. Fig. 6a and 6b, fig. 7a and 7b, fig. 8a and 8b, and fig. 9a and 9b are representative 3D images containing a plurality of objects, small objects, and complex salient objects, and these representative 3D images are processed by the method of the present invention, and the salient predictive images are correspondingly shown in fig. 6c, fig. 7c, fig. 8c, and fig. 9c, and compared with fig. 6D, fig. 7D, fig. 8D, and fig. 9D, it can be found that the salient regions in these 3D images can be accurately captured by the method of the present invention.
Fig. 10a is a PR (precision-recall) plot obtained by processing the 3D images for detection in the NJU2K dataset using the method of the present invention, and fig. 10b is a PR (precision-recall) plot obtained by processing the 3D images for detection in the NLPR dataset using the method of the present invention. As can be seen from fig. 10a and fig. 10b, the area under the PR curve is large, which indicates that the method of the present invention has good detection performance. Precision in fig. 10a and fig. 10b represents the "precision rate" and Recall represents the "recall rate".
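A PR curve like those in fig. 10a and fig. 10b is obtained by sweeping a threshold over the predicted saliency maps and accumulating precision and recall; a sketch with 256 threshold levels (the level count and function name are assumptions):

```python
import numpy as np

def pr_curve(preds, gts, levels=256):
    # preds, gts: lists of saliency maps and label images with values in [0, 1].
    thresholds = np.linspace(0, 1, levels)
    precision = np.zeros(levels)
    recall = np.zeros(levels)
    for pred, gt in zip(preds, gts):
        gt = gt > 0.5
        for i, t in enumerate(thresholds):
            binary = pred >= t
            tp = np.logical_and(binary, gt).sum()
            precision[i] += tp / (binary.sum() + 1e-8)
            recall[i] += tp / (gt.sum() + 1e-8)
    return precision / len(preds), recall / len(preds)

preds = [np.random.rand(224, 224) for _ in range(4)]
gts = [(np.random.rand(224, 224) > 0.5).astype(float) for _ in range(4)]
p, r = pr_curve(preds, gts)
print(p.shape, r.shape)   # (256,) (256,)
```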
Claims (5)
1. A method for detecting a significance image of interactive cycle feature remodeling is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting N pairs of original 3D images and label images corresponding to each pair of original 3D images, and recording RGB images of the kth pair of original 3D images asDenote the depth image of the k-th pair of original 3D images asTaking the real salient detection image corresponding to the k-th original 3D image as a label image and recording the label imageThen, forming a training set by the RGB images, the depth images and the corresponding label images of all the original 3D images; wherein N is a positive integer, N is not less than 200, k is a positive integer, k is not less than 1 and not more than 200, x is not less than 1 and not more than W, y is not less than 1 and not more than H, W represents the width of the original 3D image and the RGB image thereof, the depth image and the corresponding label image, H represents the height of the original 3D image and the RGB image thereof, the depth image and the corresponding label image,to representThe middle coordinate position is the pixel value of the pixel point of (x, y),to representThe middle coordinate position is the pixel value of the pixel point of (x, y),to representThe middle coordinate position is the pixel value of the pixel point of (x, y);
step 1_ 2: constructing an end-to-end convolutional neural network: the convolutional neural network comprises an input layer, an encoding part, a decoding part and an output layer, wherein the input layer comprises an RGB (red, green and blue) image input layer and a depth image input layer, the encoding part comprises 10 neural network blocks, and the decoding part comprises 2 information extraction blocks, 5 characteristic reconstruction blocks, 4 information reconstruction blocks, 5 expansion convolution blocks and 5 characteristic aggregation blocks; the output layer comprises an output convolution layer, the size of convolution kernels of the output convolution layer is 3 multiplied by 3, the number of the convolution kernels is 1, and the step length is 1;
for an RGB image input layer in the input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original RGB image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original RGB image to the encoding part; wherein, the width of the original RGB image is W, and the height of the original RGB image is H;
for a depth map input layer in the input layer, the input end of the input layer receives a three-channel depth map processed by an original depth image by adopting a copying method, and the output end of the input layer outputs the three-channel depth map to a coding part; wherein the width of the original depth image is W, and the height of the original depth image is H;
for the coding part, the 1 st neural network block, the 2 nd neural network block, the 3 rd neural network block, the 4 th neural network block and the 5 th neural network block are connected in sequence to form a color coding stream, and the 6 th neural network block, the 7 th neural network block and the 8 th neural network block are connected in sequence to form a color coding streamThe network block, the 9 th neural network block and the 10 th neural network block are sequentially connected to form a depth coding stream; the input end of the 1 st neural network block receives an R channel component, a G channel component and a B channel component of an original RGB image output by the output end of the RGB image input layer, the output end of the 1 st neural network block outputs 64 feature maps, a set formed by the 64 feature maps is recorded as S1, and each feature map in S1 has a width W and a height H; the input end of the 2 nd neural network block receives all the feature maps in S1, the output end of the 2 nd neural network block outputs 128 feature maps, the set of the 128 feature maps is marked as S2, and the width of each feature map in S2 isHas a height ofThe input end of the 3 rd neural network block receives all the feature maps in S2, the output end of the 3 rd neural network block outputs 256 feature maps, the set of the 256 feature maps is marked as S3, and the width of each feature map in S3 isHas a height ofThe input end of the 4 th neural network block receives all the characteristic maps in S3, the output end of the 4 th neural network block outputs 512 characteristic maps, the set of the 512 characteristic maps is marked as S4, and the width of each characteristic map in S4 isHas a height ofThe input end of the 5 th neural network block receives all the feature maps in S4, the output end of the 5 th neural network block outputs 512 feature maps, and the 512 feature maps are outputThe set of constructs is denoted S5, and the width of each feature map in S5 isHas a height ofThe input end of the 6 th neural network block receives the three-channel depth map output by the output end of the depth map input layer, the output end of the 6 th neural network block outputs 64 feature maps, a set formed by the 64 feature maps is recorded as D1, and each feature map in D1 has the width of W and the height of H; the input end of the 7 th neural network block receives all the feature maps in D1, the output end of the 7 th neural network block outputs 128 feature maps, the set of the 128 feature maps is marked as D2, and the width of each feature map in D2 is D2Has a height ofThe input end of the 8 th neural network block receives all the feature maps in D2, the output end of the 8 th neural network block outputs 256 feature maps, the set of the 256 feature maps is marked as D3, and the width of each feature map in D3 is D3Has a height ofThe input end of the 9 th neural network block receives all the feature maps in D3, the output end of the 9 th neural network block outputs 512 feature maps, the set of the 512 feature maps is marked as D4, and the width of each feature map in D4 is D4Has a height ofThe input end of the 10 th neural network block receives all the characteristic maps in D4, the output end of the 10 th neural network block outputs 512 characteristic maps, the set of the 512 characteristic maps is marked as D5, and the width of each characteristic map in D5 is D5Has a height ofThe encoding part provides all the feature maps of S1, S2, S3, S4, S5, D1, D2, D3, D4 and 
D5 to the decoding part;
for the decoding part, the input end of the 1st information extraction block receives all the feature maps in D1, and its output end outputs 64 feature maps whose set is denoted F1; each feature map in F1 has a width W and a height H; the first input end of the 1st feature reconstruction block receives all the feature maps in S1, its second input end receives all the feature maps in F1, and its output end outputs 64 feature maps whose set is denoted F2; each feature map in F2 has a width W and a height H; the first input end of the 1st information reconstruction block receives all the feature maps in F2, its second input end receives all the feature maps in D2, and its output end outputs 128 feature maps whose set is denoted F3; each feature map in F3 has a width W/2 and a height H/2; the first input end of the 2nd feature reconstruction block receives all the feature maps in S2, its second input end receives all the feature maps in F3, and its output end outputs 128 feature maps whose set is denoted F4; each feature map in F4 has a width W/2 and a height H/2; the first input end of the 2nd information reconstruction block receives all the feature maps in F4, its second input end receives all the feature maps in D3, and its output end outputs 256 feature maps whose set is denoted F5; each feature map in F5 has a width W/4 and a height H/4; the first input end of the 3rd feature reconstruction block receives all the feature maps in S3, its second input end receives all the feature maps in F5, and its output end outputs 256 feature maps whose set is denoted F6; each feature map in F6 has a width W/4 and a height H/4; the first input end of the 3rd information reconstruction block receives all the feature maps in F6, its second input end receives all the feature maps in D4, and its output end outputs 512 feature maps whose set is denoted F7; each feature map in F7 has a width W/8 and a height H/8; the first input end of the 4th feature reconstruction block receives all the feature maps in S4, its second input end receives all the feature maps in F7, and its output end outputs 512 feature maps whose set is denoted F8; each feature map in F8 has a width W/8 and a height H/8; the first input end of the 4th information reconstruction block receives all the feature maps in F8, its second input end receives all the feature maps in D5, and its output end outputs 512 feature maps whose set is denoted F9; each feature map in F9 has a width W/16 and a height H/16; the first input end of the 5th feature reconstruction block receives all the feature maps in S5, its second input end receives all the feature maps in F9, and its output end outputs 512 feature maps whose set is denoted F10; each feature map in F10 has a width W/16 and a height H/16; the input end of the 2nd information extraction block receives all the feature maps in S5, and its output end outputs 512 feature maps whose set is denoted F11; each feature map in F11 has a width W/16 and a height H/16;
the input end of the 1st expansion convolution block (dilated convolution block) receives all the feature maps in D1, and its output end outputs 64 feature maps whose set is denoted P1; each feature map in P1 has a width W and a height H; the input end of the 2nd expansion convolution block receives all the feature maps in D2, and its output end outputs 128 feature maps whose set is denoted P2; each feature map in P2 has a width W/2 and a height H/2; the input end of the 3rd expansion convolution block receives all the feature maps in D3, and its output end outputs 256 feature maps whose set is denoted P3; each feature map in P3 has a width W/4 and a height H/4; the input end of the 4th expansion convolution block receives all the feature maps in D4, and its output end outputs 512 feature maps whose set is denoted P4; each feature map in P4 has a width W/8 and a height H/8; the input end of the 5th expansion convolution block receives all the feature maps in D5, and its output end outputs 512 feature maps whose set is denoted P5; each feature map in P5 has a width W/16 and a height H/16;
the first input end of the 1st feature aggregation block receives all the feature maps in F10, its second input end receives all the feature maps in P5, its third input end receives all the feature maps in F11, and its output end outputs 256 feature maps whose set is denoted A1; each feature map in A1 has a width W/16 and a height H/16; the first input end of the 2nd feature aggregation block receives all the feature maps in F8, its second input end receives all the feature maps in P4, its third input end receives all the feature maps in A1, and its output end outputs 128 feature maps whose set is denoted A2; each feature map in A2 has a width W/8 and a height H/8; the first input end of the 3rd feature aggregation block receives all the feature maps in F6, its second input end receives all the feature maps in P3, its third input end receives all the feature maps in A2, and its output end outputs 64 feature maps whose set is denoted A3; each feature map in A3 has a width W/4 and a height H/4; the first input end of the 4th feature aggregation block receives all the feature maps in F4, its second input end receives all the feature maps in P2, its third input end receives all the feature maps in A3, and its output end outputs 32 feature maps whose set is denoted A4; each feature map in A4 has a width W/2 and a height H/2; the first input end of the 5th feature aggregation block receives all the feature maps in F2, its second input end receives all the feature maps in P1, its third input end receives all the feature maps in A4, and its output end outputs 16 feature maps whose set is denoted A5; each feature map in A5 has a width W and a height H; the decoding part supplies all the feature maps in A5 to the output layer;
for the output layer, the input end of the output convolutional layer receives all the feature maps in A5, and the output end of the output convolutional layer outputs one feature map with a width W and a height H as the saliency detection map;
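For illustration, a minimal PyTorch-style sketch of such an output layer is given below; the 3x3 kernel, the padding and the sigmoid used here are assumptions rather than claim text, chosen only so that the 16 maps of A5 are reduced to a single W x H saliency map.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Hypothetical output layer: maps the 16 feature maps in A5 (W x H)
    to a single-channel saliency detection map of the same size."""
    def __init__(self, in_channels: int = 16):
        super().__init__()
        # 3x3 convolution with padding 1 keeps the W x H resolution (assumed).
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, a5: torch.Tensor) -> torch.Tensor:
        # a5: (batch, 16, H, W) -> (batch, 1, H, W), squashed to [0, 1].
        return torch.sigmoid(self.conv(a5))
```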
step 1_3: inputting, for each original 3D image in the training set, the R channel component, the G channel component and the B channel component of its RGB image, together with the three-channel depth map obtained by copying its depth image, into the convolutional neural network for training, so as to obtain the saliency detection map corresponding to each pair of original 3D images; the saliency detection map corresponding to the kth pair of original 3D images is denoted S_k, where S_k(x, y) represents the pixel value of the pixel point whose coordinate position in S_k is (x, y);
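A small sketch of how the two network inputs could be assembled for one pair, assuming PyTorch tensors; the function name and tensor layout are assumptions, not taken from the patent.

```python
import torch

def build_inputs(rgb: torch.Tensor, depth: torch.Tensor):
    """rgb: (3, H, W) tensor holding the R, G and B channel components.
    depth: (H, W) single-channel depth image.
    Returns the RGB input and the three-channel depth map obtained by
    copying the depth image, each with a batch dimension added."""
    depth3 = depth.unsqueeze(0).repeat(3, 1, 1)   # copy depth into 3 channels
    return rgb.unsqueeze(0), depth3.unsqueeze(0)  # (1, 3, H, W) each
```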
step 1_4: calculating the loss function value between the saliency detection map corresponding to each pair of original 3D images and the corresponding real label image, and recording the loss function value obtained for the kth pair;
step 1_5: repeatedly executing step 1_3 and step 1_4 M times to obtain a convolutional neural network training model, obtaining N x M loss function values in total; the sum of the N loss function values obtained in each execution is divided by N to give the final loss function value of that execution, yielding M final loss function values in total; the final loss function value with the smallest value is then found among the M final loss function values, and the weight vector and the bias term corresponding to this smallest final loss function value are taken as the optimal weight vector and the optimal bias term of the convolutional neural network training model; wherein M is greater than 1;
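The following sketch illustrates steps 1_3 to 1_5 as a training loop. It assumes a PyTorch model `net` taking the RGB and three-channel depth inputs, an `optimizer`, a list `pairs` of (rgb, depth3, label) tensors, and binary cross-entropy as the loss; the loss choice and all names are assumptions, since the claim does not fix them here.

```python
import copy
import torch
import torch.nn.functional as F

def train_and_select(net, pairs, optimizer, M: int):
    """Runs steps 1_3 and 1_4 M times, averages the N per-pair losses of each
    pass, and keeps the weights of the pass with the smallest average loss."""
    best_loss, best_state = float("inf"), None
    for _ in range(M):
        total = 0.0
        for rgb, depth3, label in pairs:                 # N training pairs
            pred = net(rgb, depth3)                      # saliency detection map
            loss = F.binary_cross_entropy(pred, label)   # assumed loss function
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        avg = total / len(pairs)                         # final loss of this pass
        if avg < best_loss:                              # keep the best weights
            best_loss = avg
            best_state = copy.deepcopy(net.state_dict())
    net.load_state_dict(best_state)
    return net
```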
the test stage process comprises the following specific steps:
step 2_1: inputting the R channel component, the G channel component and the B channel component of the RGB image of the 3D image to be subjected to saliency detection, together with the three-channel depth map obtained by copying its depth image, into the convolutional neural network training model, and predicting with the optimal weight vector and the optimal bias term to obtain the corresponding saliency prediction map.
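Under the same assumptions as the training sketch above (model `net` with the optimal weights already loaded), prediction in the test stage reduces to a single forward pass:

```python
import torch

@torch.no_grad()
def predict_saliency(net, rgb, depth3):
    """rgb, depth3: (1, 3, H, W) tensors for the 3D image under test.
    Returns a (1, 1, H, W) saliency prediction map."""
    net.eval()
    return net(rgb, depth3)
```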
2. The saliency image detection method with interactive cyclic feature remodeling according to claim 1, wherein in step 1_2 the 2 information extraction blocks have the same structure, each consisting of a 1st convolution block, a first maximum pooling layer, a first average pooling layer, a 2nd convolution block, a 3rd convolution block and a first up-sampling layer; the 1st convolution block comprises a first convolution layer, a first activation layer, a second convolution layer and a second activation layer connected in sequence, the 2nd convolution block comprises a third convolution layer and a third activation layer connected in sequence, and the 3rd convolution block comprises a fourth convolution layer and a fourth activation layer connected in sequence; the input end of the first convolution layer in the 1st information extraction block receives all the feature maps in D1, and the input end of the first convolution layer in the 2nd information extraction block receives all the feature maps in S5; the input end of the first maximum pooling layer, the input end of the first average pooling layer and the input end of the third convolution layer all receive the feature maps output by the second activation layer; a channel-number superposition operation is performed on the feature maps output by the first maximum pooling layer and the feature maps output by the first average pooling layer, and the input end of the fourth convolution layer receives the feature maps obtained after this channel-number superposition operation; the input end of the first up-sampling layer receives the feature maps output by the fourth activation layer; an element-wise multiplication operation is performed on the feature maps output by the first up-sampling layer and the feature maps output by the third activation layer, and an element-wise addition operation is then performed on the feature maps output by the first up-sampling layer and the feature maps obtained after the element-wise multiplication; for the 1st information extraction block the set formed by all the feature maps obtained after the element-wise addition is F1, and for the 2nd information extraction block it is F11; wherein, denoting the number of input channels of the ith information extraction block by n_i, n_1 = 64 and n_2 = 512; in the ith information extraction block, the first convolution layer and the fourth convolution layer each have a convolution kernel size of 1 x 1, n_i convolution kernels, a step length of 1 and a zero-padding parameter of 0; the second convolution layer has a convolution kernel size of 3 x 3, n_i convolution kernels, a step length of 1 and a zero-padding parameter of 0; the third convolution layer has a convolution kernel size of 3 x 3, n_i convolution kernels, a step length of 1 and a zero-padding parameter of 1; i = 1, 2; the activation mode of the first, second, third and fourth activation layers is 'ReLU'; the first maximum pooling layer and the first average pooling layer have a pooling kernel size of 2 x 2, a step length of 2 and a zero-padding parameter of 0; and the amplification factor of the first up-sampling layer is 2 with bilinear interpolation.
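A minimal PyTorch-style sketch of this information extraction block follows. To keep the sketch runnable at any even input size, the 3 x 3 convolutions use a zero-padding of 1 (the claim lists 0 for the second convolution layer); the class and variable names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationExtractionBlock(nn.Module):
    """Sketch of one information extraction block with n input channels."""
    def __init__(self, n: int):
        super().__init__()
        # 1st convolution block: 1x1 conv -> ReLU -> 3x3 conv -> ReLU
        self.conv_block1 = nn.Sequential(
            nn.Conv2d(n, n, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n, n, 3, padding=1), nn.ReLU(inplace=True))
        # 2nd convolution block (local branch): 3x3 conv -> ReLU
        self.conv_block2 = nn.Sequential(
            nn.Conv2d(n, n, 3, padding=1), nn.ReLU(inplace=True))
        # pooling branch: 2x2 max pool and average pool, outputs concatenated
        self.max_pool = nn.MaxPool2d(2, stride=2)
        self.avg_pool = nn.AvgPool2d(2, stride=2)
        # 3rd convolution block on the concatenated pooled maps: 1x1 conv -> ReLU
        self.conv_block3 = nn.Sequential(
            nn.Conv2d(2 * n, n, 1), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # assumes even spatial dimensions so pooling then x2 upsampling restores the size
        feats = self.conv_block1(x)
        local = self.conv_block2(feats)
        pooled = torch.cat([self.max_pool(feats), self.avg_pool(feats)], dim=1)
        gathered = self.conv_block3(pooled)
        up = F.interpolate(gathered, scale_factor=2, mode="bilinear",
                           align_corners=False)
        return up + up * local  # element-wise multiplication, then addition
```

With these assumptions, InformationExtractionBlock(64) would act on D1 and InformationExtractionBlock(512) on S5.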
3. The saliency image detection method with interactive cyclic feature remodeling according to claim 1, wherein in step 1_2 the 5 feature reconstruction blocks have the same structure, each consisting of a context attention block and a channel attention block; the first input end of the mth feature reconstruction block receives all the feature maps in S1, S2, S3, S4 and S5 for m = 1 to 5 respectively, and its second input end receives all the feature maps in F1, F3, F5, F7 and F9 respectively; for each feature reconstruction block, a first element-wise addition operation is performed on all the feature maps received at its first input end and all the feature maps received at its second input end; the input end of the context attention block receives the feature maps obtained after the first element-wise addition, and the input end of the channel attention block receives the feature maps output by the context attention block; an element-wise multiplication operation is performed on the feature maps output by the channel attention block and the feature maps obtained after the first element-wise addition; a second element-wise addition operation is then performed on the feature maps received at the first input end and the feature maps obtained after the element-wise multiplication; the set formed by all the feature maps obtained after the second element-wise addition is F2, F4, F6, F8 and F10 for the 1st to 5th feature reconstruction blocks respectively.
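The claim does not spell out the internals of the context attention block or the channel attention block, so the sketch below treats them as interchangeable modules and supplies a squeeze-and-excitation-style channel attention purely as a placeholder; everything except the addition, attention, multiplication and second addition data flow is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Placeholder channel attention (squeeze-and-excitation style);
    the patent's actual channel attention block may differ."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)  # per-channel weights in [0, 1]

class FeatureReconstructionBlock(nn.Module):
    """Sketch of one feature reconstruction (remodeling) block."""
    def __init__(self, context_attention: nn.Module, channel_attention: nn.Module):
        super().__init__()
        self.context_attention = context_attention   # internals unspecified in the claim
        self.channel_attention = channel_attention   # internals unspecified in the claim

    def forward(self, s: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        fused = s + f                                   # first element-wise addition
        attn = self.channel_attention(self.context_attention(fused))
        return s + attn * fused                         # multiplication, then second addition
```

For example, FeatureReconstructionBlock(nn.Identity(), ChannelAttention(64)) would stand in for the 1st feature reconstruction block acting on S1 and F1.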
4. The saliency image detection method with interactive cyclic feature remodeling according to claim 1, wherein in step 1_2 the 4 information reconstruction blocks have the same structure, each consisting of a second maximum pooling layer, a second average pooling layer, a 4th convolution block and a 5th convolution block; the 4th convolution block comprises a fifth convolution layer and a fifth activation layer connected in sequence, and the 5th convolution block comprises a sixth convolution layer, a sixth activation layer, a seventh convolution layer and a seventh activation layer connected in sequence; in the 1st information reconstruction block the input ends of the second maximum pooling layer and of the second average pooling layer both receive all the feature maps in F2 and the input end of the sixth convolution layer receives all the feature maps in D2; in the 2nd information reconstruction block these input ends respectively receive all the feature maps in F4 and D3; in the 3rd, F6 and D4; and in the 4th, F8 and D5; an element-wise subtraction operation is performed on the feature maps output by the second maximum pooling layer and the feature maps output by the second average pooling layer, the corresponding elements output by the second average pooling layer being subtracted from the elements output by the second maximum pooling layer; the input end of the fifth convolution layer receives the feature maps obtained after the element-wise subtraction; an element-wise multiplication operation is performed on the feature maps output by the fifth activation layer and the feature maps output by the seventh activation layer, and an element-wise addition operation is then performed on the feature maps output by the fifth activation layer and the feature maps obtained after the element-wise multiplication; the set formed by all the feature maps obtained after the element-wise addition is F3, F5, F7 and F9 for the 1st to 4th information reconstruction blocks respectively; wherein, denoting the numbers of input channels of the first and second input ends of the jth information reconstruction block by n1_j and n2_j, n1_1 = 64, n2_1 = 128; n1_2 = 128, n2_2 = 256; n1_3 = 256, n2_3 = 512; n1_4 = 512, n2_4 = 512; in the jth information reconstruction block, the fifth convolution layer has a convolution kernel size of 1 x 1, n2_j convolution kernels, a step length of 1 and a zero-padding parameter of 0; the sixth convolution layer has a convolution kernel size of 1 x 1, n2_j convolution kernels, a step length of 1 and a zero-padding parameter of 0; the seventh convolution layer has a convolution kernel size of 3 x 3, n2_j convolution kernels, a step length of 1 and a zero-padding parameter of 1; j = 1, 2, 3, 4; the activation mode of the fifth, sixth and seventh activation layers is 'ReLU'; and the second maximum pooling layer and the second average pooling layer have a pooling kernel size of 2 x 2, a step length of 2 and a zero-padding parameter of 0.
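A PyTorch-style sketch of one information reconstruction block under these parameters; the class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class InformationReconstructionBlock(nn.Module):
    """Sketch of one information reconstruction block.
    n1: channels of the decoder-side input (e.g. F2); n2: channels of the
    encoder-side input (e.g. D2), which is assumed to be at half the
    decoder-side resolution. The output has n2 channels at that half resolution."""
    def __init__(self, n1: int, n2: int):
        super().__init__()
        self.max_pool = nn.MaxPool2d(2, stride=2)
        self.avg_pool = nn.AvgPool2d(2, stride=2)
        # 4th convolution block on the (max - avg) difference: 1x1 conv -> ReLU
        self.conv_block4 = nn.Sequential(
            nn.Conv2d(n1, n2, 1), nn.ReLU(inplace=True))
        # 5th convolution block on the encoder features: 1x1 conv -> ReLU -> 3x3 conv -> ReLU
        self.conv_block5 = nn.Sequential(
            nn.Conv2d(n2, n2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n2, n2, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, f: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        diff = self.max_pool(f) - self.avg_pool(f)   # element-wise subtraction
        contrast = self.conv_block4(diff)
        encoded = self.conv_block5(d)
        return contrast + contrast * encoded         # multiplication, then addition
```

With these assumptions, InformationReconstructionBlock(64, 128) would correspond to the 1st information reconstruction block acting on F2 and D2.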
5. The saliency image detection method with interactive cyclic feature remodeling according to claim 1, wherein in step 1_2 the 5 feature aggregation blocks have the same structure, each consisting of a 6th convolution block, a 7th convolution block, an 8th convolution block, a 9th convolution block, a 10th convolution block, an 11th convolution block, a 12th convolution block, a 13th convolution block, a second up-sampling layer and a residual fusion block; the 6th to 13th convolution blocks respectively comprise, connected in sequence, an eighth convolution layer and an eighth activation layer, a ninth convolution layer and a ninth activation layer, a tenth convolution layer and a tenth activation layer, an eleventh convolution layer and an eleventh activation layer, a twelfth convolution layer and a twelfth activation layer, a thirteenth convolution layer and a thirteenth activation layer, a fourteenth convolution layer and a fourteenth activation layer, and a fifteenth convolution layer and a fifteenth activation layer; the residual fusion block comprises a sixteenth activation layer, a third maximum pooling layer and a sixteenth convolution layer connected in sequence; in the 1st feature aggregation block the input end of the eighth convolution layer receives all the feature maps in F10, the input end of the ninth convolution layer receives all the feature maps in P5, and the input end of the second up-sampling layer receives all the feature maps in F11; in the 2nd feature aggregation block these three input ends respectively receive all the feature maps in F8, P4 and A1; in the 3rd, F6, P3 and A2; in the 4th, F4, P2 and A3; and in the 5th, F2, P1 and A4; the feature maps output by the eighth activation layer and the feature maps output by the ninth activation layer are each cut into four equal parts along the channel dimension, in order; a first channel-number superposition operation is performed on the 1st part of the eighth activation layer's output and the 1st part of the ninth activation layer's output, and a second, third and fourth channel-number superposition operation are likewise performed on the corresponding 2nd, 3rd and 4th parts; the input end of the tenth convolution layer receives the feature maps output by the second up-sampling layer; the input ends of the eleventh, twelfth, thirteenth and fourteenth convolution layers respectively receive the feature maps obtained after the first, second, third and fourth channel-number superposition operations; a fifth channel-number superposition operation is performed on the feature maps output by the eleventh, twelfth, thirteenth and fourteenth activation layers, and the input end of the fifteenth convolution layer receives the feature maps obtained after the fifth channel-number superposition operation; an element-wise multiplication operation is performed on the feature maps output by the tenth activation layer and the feature maps output by the fifteenth activation layer, and a first element-wise addition operation is performed on the feature maps output by the tenth activation layer and the feature maps obtained after the element-wise multiplication; the input end of the sixteenth activation layer receives the feature maps obtained after the first element-wise addition, and a second element-wise addition operation is performed on the feature maps output by the sixteenth convolution layer and the feature maps obtained after the first element-wise addition; the set formed by all the feature maps obtained after the second element-wise addition is A1, A2, A3, A4 and A5 for the 1st to 5th feature aggregation blocks respectively; wherein, denoting the numbers of input channels of the first, second and third input ends of the mth feature aggregation block by n1_m, n2_m and n3_m, n1_1 = 512, n2_1 = 512, n3_1 = 512; n1_2 = 512, n2_2 = 512, n3_2 = 256; n1_3 = 256, n2_3 = 256, n3_3 = 128; n1_4 = 128, n2_4 = 128, n3_4 = 64; n1_5 = 64, n2_5 = 64, n3_5 = 32; in the mth feature aggregation block, the eighth to fourteenth convolution layers each have a convolution kernel size of 3 x 3, n3_m convolution kernels, a step length of 1 and a zero-padding parameter of 1; the fifteenth convolution layer has a convolution kernel size of 3 x 3, n3_m convolution kernels, a step length of 1 and a zero-padding parameter of 0; the sixteenth convolution layer has a convolution kernel size of 3 x 3, n3_m/2 convolution kernels, a step length of 1 and a zero-padding parameter of 0; m = 1, 2, 3, 4, 5; the activation mode of the eighth to sixteenth activation layers is 'ReLU'; the third maximum pooling layer has a pooling kernel size of 5 x 5, a step length of 1 and a zero-padding parameter of 2; and the amplification factor of the second up-sampling layer is 2 with bilinear interpolation.
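A sketch of one feature aggregation block is given below with several reconciliations that are assumptions rather than claim text: the tenth and fifteenth convolution layers are given n3/2 output channels (the claim gives them n3, which would leave the final residual addition with the n3/2-channel sixteenth convolution layer ill-defined), and the fifteenth and sixteenth convolutions use a zero-padding of 1 instead of 0 so that all feature maps keep the same spatial size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3x3(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class FeatureAggregationBlock(nn.Module):
    """Sketch of one feature aggregation block.
    n: channels of the first and second inputs; n3: channels of the third input.
    The output has n3 // 2 channels at twice the third input's resolution."""
    def __init__(self, n: int, n3: int):
        super().__init__()
        out = n3 // 2
        self.conv8 = conv3x3(n, n3)      # on the first input (e.g. F10)
        self.conv9 = conv3x3(n, n3)      # on the second input (e.g. P5)
        self.conv10 = conv3x3(n3, out)   # on the up-sampled third input
        # one 3x3 convolution per cross-concatenated channel quarter
        self.group_convs = nn.ModuleList([conv3x3(n3 // 2, n3) for _ in range(4)])
        self.conv15 = conv3x3(4 * n3, out)
        # residual fusion block: ReLU -> 5x5 max pool (stride 1) -> 3x3 conv
        self.residual = nn.Sequential(
            nn.ReLU(inplace=True), nn.MaxPool2d(5, stride=1, padding=2),
            nn.Conv2d(out, out, 3, padding=1))

    def forward(self, f: torch.Tensor, p: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        x8, x9 = self.conv8(f), self.conv9(p)
        # cut both tensors into four channel groups and concatenate them group-wise
        g8, g9 = torch.chunk(x8, 4, dim=1), torch.chunk(x9, 4, dim=1)
        groups = [conv(torch.cat([u, v], dim=1))
                  for conv, u, v in zip(self.group_convs, g8, g9)]
        mixed = self.conv15(torch.cat(groups, dim=1))
        up = F.interpolate(a, scale_factor=2, mode="bilinear", align_corners=False)
        guide = self.conv10(up)
        fused = guide + guide * mixed        # multiplication, then first addition
        return fused + self.residual(fused)  # residual fusion, second addition
```

With these choices, FeatureAggregationBlock(512, 512) yields the 256 output maps of the 1st feature aggregation block and FeatureAggregationBlock(64, 32) the 16 maps of the 5th.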
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011413838.5A CN112529862A (en) | 2020-12-07 | 2020-12-07 | Significance image detection method for interactive cycle characteristic remodeling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112529862A true CN112529862A (en) | 2021-03-19 |
Family
ID=74997830
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011413838.5A Withdrawn CN112529862A (en) | 2020-12-07 | 2020-12-07 | Significance image detection method for interactive cycle characteristic remodeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112529862A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113192073A (en) * | 2021-04-06 | 2021-07-30 | 浙江科技学院 | Clothing semantic segmentation method based on cross fusion network |
CN113538442A (en) * | 2021-06-04 | 2021-10-22 | 杭州电子科技大学 | RGB-D significant target detection method using adaptive feature fusion |
CN113538442B (en) * | 2021-06-04 | 2024-04-09 | 杭州电子科技大学 | RGB-D significant target detection method using self-adaptive feature fusion |
CN113313077A (en) * | 2021-06-30 | 2021-08-27 | 浙江科技学院 | Salient object detection method based on multi-strategy and cross feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20210319 |