CN111429465B - Disparity-purification-based two-type residual binocular salient object image segmentation method - Google Patents


Info

Publication number: CN111429465B (grant); earlier published as CN111429465A (in Chinese)
Application number: CN202010191229.3A
Authority: CN (China)
Inventors: 周武杰, 陈昱臻, 雷景生, 郭翔, 王海江, 何成, 周扬
Applicant and current assignee: Zhejiang Lover Health Science and Technology Development Co Ltd
Legal status: Active (granted)


Classifications

    • G06T 7/11 — Image analysis; segmentation: region-based segmentation
    • G06N 3/045 — Neural network architectures: combinations of networks
    • G06N 3/084 — Neural network learning methods: backpropagation, e.g. using gradient descent
    • G06T 3/4007 — Scaling of whole images or parts thereof, based on interpolation, e.g. bilinear interpolation
    • G06T 7/90 — Image analysis: determination of colour characteristics
    • Y02T 10/40 — Climate change mitigation in road transport: engine management systems


Abstract

The invention discloses a disparity-purification-based two-type residual (Res2Net) binocular salient object image segmentation method. Training phase: object images are collected to establish a training set, and the disparity-purification-based two-type residual network is constructed; the color map and the disparity map are input into a residual dual-stream network for pre-training, which outputs a dual-stream prediction segmentation map; the color map alone is input into a residual single-stream network for pre-training, which outputs a single-stream prediction segmentation map; two loss values are computed and back-propagated to train the two networks and adjust their model parameters. Testing phase: the disparity map is input into a disparity purification selector, which classifies it as a positive or negative sample; according to this result a dual-stream or single-stream saliency prediction map is produced, and the two kinds of prediction maps are combined into the prediction result set. The residual network extracts features efficiently and completely; downsampling operations are reduced, which avoids information dispersion; the results come closer to how real humans observe objects; the problem of noise in 3D information is alleviated; and more accurate segmentation results are obtained.

Description

Disparity-purification-based two-type residual binocular salient object image segmentation method
Technical Field
The invention relates to a binocular-vision-based stereoscopic image processing method, and in particular to a disparity-purification-based two-type residual binocular salient object image segmentation method.
Background
Algorithms that mimic how humans view objects in real situations have long been sought. Salient object detection is a very active branch of this research and has achieved great success in simulating visual behavior. When judging received visual information, the human visual system ranks the incoming object information, so that objects it deems important are processed preferentially.
In recent years, with the continuous development of deep learning and of 3D equipment and technology, salient object detection has made major breakthroughs. Existing 3D techniques mainly use depth maps and disparity maps, which provide the distance information of objects. This is very important information, because humans habitually attend to objects that are closer. However, no matter how the information is acquired, noise is unavoidable; in some cases a non-salient object may even happen to be the nearest one, interfering with and contaminating the deep learning algorithm. Therefore, designing an algorithm that can judge whether the quality of a disparity map is sufficient, and whether the information it contains brings positive assistance to the whole neural network, is a promising direction of improvement.
Disclosure of Invention
The invention aims to provide a disparity-purification-based two-type residual binocular salient object image segmentation method that can detect objects both rapidly and accurately, down to fine objects.
The technical solution adopted by the invention to solve the above technical problem is as follows:
The training phase comprises the following specific steps:
Step 1_1: collect object images, or establish a training set based on a public data set, where each object image comprises a color map and a disparity map;
the color map and the disparity map of a sample are acquired at the same time, and the disparity map is obtained by converting the acquired depth map.
The objects may be animals, human bodies, static objects, and the like.
Step 1_2: construct the disparity-purification-based two-type residual network:
the disparity-purification-based two-type residual network mainly consists of three parts: a disparity purification (DDU) selector, a residual dual-stream network, and a residual single-stream network;
Step 1_3: the color maps in the training set and their corresponding disparity maps are input as training samples into the residual dual-stream network for pre-training, which outputs a dual-stream prediction segmentation map $P_k^{\mathrm{dual}}$; meanwhile, the color maps in the training set alone are input as training samples into the residual single-stream network for pre-training, which outputs a single-stream prediction segmentation map $P_k^{\mathrm{single}}$, where $k$ denotes the $k$-th training sample;
Step 1_4: using a loss function, compare the dual-stream prediction segmentation map $P_k^{\mathrm{dual}}$ and the single-stream prediction segmentation map $P_k^{\mathrm{single}}$ with the known annotated label maps to obtain, respectively, the loss value Loss1 of the residual dual-stream network and the loss value Loss2 of the residual single-stream network; the two loss values Loss1 and Loss2 are back-propagated to train the two independent networks, yielding a pre-trained residual dual-stream network and a pre-trained residual single-stream network, and the network model parameters are adjusted to the optimized parameters $W_{op1}$ of the dual-stream residual network and $W_{op2}$ of the single-stream residual network.
The loss function adopted in this implementation is the cross-entropy loss.
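As an illustration of steps 1_3 and 1_4, the following is a minimal PyTorch-style sketch of one pre-training step, assuming the two networks and optimizers are already constructed and that both networks end in a Sigmoid so that binary cross-entropy can serve as the cross-entropy loss; it is a sketch under those assumptions, not the authoritative implementation.

```python
import torch
import torch.nn as nn

def pretrain_step(dual_net, single_net, opt_dual, opt_single,
                  color, disparity, label):
    """One pre-training step for the two independent networks (steps 1_3 / 1_4)."""
    criterion = nn.BCELoss()  # per-pixel cross-entropy over saliency maps in [0, 1]

    # Step 1_3: forward passes of the two independent networks.
    p_dual = dual_net(color, disparity)  # dual-stream prediction map P_k^dual
    p_single = single_net(color)         # single-stream prediction map P_k^single

    # Step 1_4: Loss1 and Loss2, back-propagated separately.
    loss1 = criterion(p_dual, label)
    opt_dual.zero_grad()
    loss1.backward()
    opt_dual.step()

    loss2 = criterion(p_single, label)
    opt_single.zero_grad()
    loss2.backward()
    opt_single.step()
    return loss1.item(), loss2.item()
```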
The specific steps of the testing phase are as follows:
Step 2_1: as shown in FIG. 1, for an object image $\{X(i,j)\}$ to be saliently segmented, where $1 \le i \le W$, $1 \le j \le H$, $W$ denotes the image width of $\{X(i,j)\}$, $H$ denotes the image height of $\{X(i,j)\}$, and $X(i,j)$ denotes the pixel value of the pixel whose coordinate position is $(i,j)$ in $\{X(i,j)\}$: the disparity map of the object image is input into the trained disparity purification selector, which outputs a positive-sample or negative-sample decision.
If the output is judged a positive sample, the color map and the disparity map of the object image are input together into the residual dual-stream network for prediction, and a dual-stream saliency prediction map $\{X_{Pre}(i,j)\}$ is output, where $X_{Pre}(i,j)$ denotes the pixel value of the pixel whose coordinate position is $(i,j)$ in $\{X_{Pre}(i,j)\}$;
if the output is judged a negative sample, only the color map of the object image is input into the residual single-stream network for prediction, and a single-stream saliency prediction map $\{S_{Pre}(i,j)\}$ is output, where $S_{Pre}(i,j)$ denotes the pixel value of the pixel whose coordinate position is $(i,j)$ in $\{S_{Pre}(i,j)\}$;
Step 2_2: finally, the dual-stream saliency prediction maps $\{X_{Pre}(i,j)\}$ and the single-stream saliency prediction maps $\{S_{Pre}(i,j)\}$ are merged to obtain the prediction result set $\{Pre(i,j)\}$, which is the segmentation result of the original object images.
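A minimal sketch of this test-time routing, assuming a trained selector object with a hypothetical is_positive(disparity) method and the two trained networks from the sketch above:

```python
import torch

def predict(selector, dual_net, single_net, color, disparity):
    """Test-time routing (steps 2_1 / 2_2): the disparity purification selector
    decides whether the disparity map is trustworthy enough for the dual stream."""
    with torch.no_grad():
        if selector.is_positive(disparity):   # positive sample: disparity quality OK
            return dual_net(color, disparity) # dual-stream saliency prediction map
        return single_net(color)              # negative sample: color map only

# Merging into the prediction result set is then just collecting the outputs:
# results = [predict(selector, dual_net, single_net, c, d) for c, d in test_set]
```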
The disparity purification selector specifically comprises a binarization processing module, a similarity processing module, a decision tree, and a threshold judgment module; during training, the binarization processing module and the similarity processing module are both connected to the input of the decision tree, and the output of the decision tree is always connected to the threshold judgment module. Training proceeds as follows: on the one hand, the disparity map of an object image and the corresponding label map are input into the similarity processing module, and the S-measure between them is computed to obtain a similarity value $S_m$ between the disparity map and the label map, which serves as the label for decision-tree training (the label map already divides the image into a target region and a background region); on the other hand, the disparity map is binarized with the Otsu algorithm to obtain the target region and background region of the disparity map (1 denotes the target region, 0 the background region), and two variances are computed over the original, non-binarized disparity map: $H_e$, the variance of all original disparity pixels falling in the binarized target region, and $E_m$, the variance of all original disparity pixels falling in the binarized background region. Finally, a decision tree with tree depth 3 is constructed, taking the two variances $H_e$ and $E_m$ as its input and the similarity value $S_m$ as its supervision, and is trained to obtain a decision tree that can estimate the S-measure of a disparity map. The threshold judgment module then judges as follows: when the estimated S-measure is less than or equal to 0.45, the disparity map is a negative sample of unqualified quality; when it is greater than 0.45, it is a positive sample of qualified quality. A disparity purification selector that can judge whether the quality of a disparity map meets the standard is thereby obtained. In other words, a similarity threshold is preset, the estimated similarity value $S_m$ is compared with it, and the sample is positive if $S_m$ exceeds the threshold and negative otherwise.
If $\{X(i,j)\}$ is an object image to be saliently segmented, its disparity map is directly input into the disparity purification selector, and the positive/negative-sample result is obtained through the selector's processing.
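The selector can be sketched as follows under stated assumptions: Otsu binarization via OpenCV, the two region variances as features, a depth-3 regression tree from scikit-learn supervised by S-measure values (the s_measure helper and the training lists are assumed to exist), and the 0.45 threshold given above.

```python
import numpy as np
import cv2
from sklearn.tree import DecisionTreeRegressor

def disparity_features(disparity):
    """Otsu-binarize the disparity map; return the variances of the original
    disparity values inside the target region (H_e) and background region (E_m)."""
    d8 = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(d8, 0, 1, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    h_e = float(np.var(disparity[mask == 1]))  # target-region variance
    e_m = float(np.var(disparity[mask == 0]))  # background-region variance
    return [h_e, e_m]

# Training: features from each training disparity map, supervised by its
# S-measure against the label map (s_measure is an assumed helper function).
X = [disparity_features(d) for d in train_disparities]
y = [s_measure(d, lbl) for d, lbl in zip(train_disparities, train_labels)]
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

def is_positive(disparity, threshold=0.45):
    """Positive sample iff the estimated S-measure exceeds the threshold."""
    return tree.predict([disparity_features(disparity)])[0] > threshold
```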
As shown in fig. 2, the residual dual-stream network comprises eight Res2net modules, eight reduce convolution modules, four GFU modules, two convolution modules and one upsampling neural network module. The color map is input into the first convolution module; the output of the first convolution module is passed in sequence through the first, second and third Res2net modules and then input into the fourth Res2net module; the outputs of the first, second, third and fourth Res2net modules are respectively passed through the fourth, third, second and first reduce convolution modules and then connected to the fourth, third, second and first GFU modules. The disparity map is input into the second convolution module; the output of the second convolution module is passed in sequence through the eighth, seventh and sixth Res2net modules and then input into the fifth Res2net module; the outputs of the eighth, seventh, sixth and fifth Res2net modules are respectively connected, through the eighth, seventh, sixth and fifth reduce convolution modules, to the fourth, third, second and first GFU modules. The output of the first GFU module is connected to the second GFU module, the output of the second GFU module to the third GFU module, and the output of the third GFU module to the fourth GFU module; the output of the fourth GFU module is connected to the upsampling neural network module, which outputs the dual-stream saliency prediction map.
As shown in fig. 3, the residual single-stream network comprises one convolution unit, four Res2net units, four reduce convolution units, four GFU units and one upsampling neural network unit. The color map is input into the convolution unit; the output of the convolution unit is passed in sequence through the first, second and third Res2net units and then into the fourth Res2net unit; the outputs of the first, second, third and fourth Res2net units are respectively connected, through the first, second, third and fourth reduce convolution units, to the first, second, third and fourth GFU units. The output of the first GFU unit is connected to the second GFU unit, the output of the second GFU unit to the third GFU unit, and the output of the third GFU unit to the fourth GFU unit; the output of the fourth GFU unit is connected to the upsampling neural network unit, which outputs the single-stream saliency prediction map.
The convolution modules and the convolution units are each a single convolution layer.
As shown in fig. 4, every Res2net module of the residual dual-stream network and every Res2net unit of the residual single-stream network have the same structure, each comprising fifteen convolution blocks, two point-sum layers and one superimposed layer. The input of the first convolution block serves as the input of the Res2net module/unit; the output of the first convolution block is divided into four equal parts along the channel dimension, which are respectively input into the second, third, fourth and fifth convolution blocks. The output of the second convolution block is input into the ninth convolution block; the output of the third convolution block is input into the sixth convolution block; the outputs of the sixth and fourth convolution blocks are summed by the first point-sum layer and input into the seventh convolution block; the outputs of the seventh and fifth convolution blocks are summed by the second point-sum layer and input into the eighth convolution block; the outputs of the sixth, seventh and eighth convolution blocks are also respectively input into the tenth, eleventh and twelfth convolution blocks. The outputs of the ninth through twelfth convolution blocks are connected in sequence and input into the thirteenth convolution block, whose output is input into the fourteenth convolution block; the input of the first convolution block is also input into the fifteenth convolution block, and the outputs of the fifteenth and fourteenth convolution blocks are connected through the first superimposed layer, whose output serves as the output of the Res2net module/unit.
A point-sum layer adds the pixel values of corresponding pixel positions of its two input feature maps; a superimposed layer connects its two input feature maps in sequence (channel-wise concatenation).
Each convolution block is a convolution layer (together with its local normalization and activation layers, as detailed in the embodiment below).
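A minimal PyTorch sketch of a Res2net-style block with the wiring just described, using the parameter choices given in the detailed embodiment below (3×3 convolutions with ReLU, a four-way channel split, and stride-2 1×1 convolutions on the output and shortcut paths); module and variable names are illustrative, not the authoritative implementation.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k=3, stride=1, pad=1):
    # "Convolution block": convolution + normalization + ReLU activation.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride, pad),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class Res2NetBlock(nn.Module):
    """Res2net-style block as wired in FIG. 4: X channels in, 2X out, size halved."""
    def __init__(self, x):
        super().__init__()
        q = x // 4
        self.conv1 = conv_block(x, x)
        self.conv2_5 = nn.ModuleList([conv_block(q, q) for _ in range(4)])
        self.conv6 = conv_block(q, q)   # takes the third split (R3)
        self.conv7 = conv_block(q, q)   # takes R6 + R4 (first point-sum layer)
        self.conv8 = conv_block(q, q)   # takes R7 + R5 (second point-sum layer)
        self.conv9_12 = nn.ModuleList([conv_block(q, q) for _ in range(4)])
        self.conv13 = conv_block(x, x, k=1, stride=2, pad=0)
        self.conv14 = conv_block(x, x)
        self.conv15 = conv_block(x, x, k=1, stride=2, pad=0)  # shortcut path

    def forward(self, inp):
        r1 = self.conv1(inp)
        splits = torch.chunk(r1, 4, dim=1)                 # four equal parts
        r2, r3, r4, r5 = [m(p) for m, p in zip(self.conv2_5, splits)]
        r6 = self.conv6(r3)
        r7 = self.conv7(r6 + r4)                           # first point-sum layer
        r8 = self.conv8(r7 + r5)                           # second point-sum layer
        r9_12 = [m(t) for m, t in zip(self.conv9_12, (r2, r6, r7, r8))]
        r13 = self.conv13(torch.cat(r9_12, dim=1))         # sequence-connect then 1x1/2
        r14 = self.conv14(r13)
        r15 = self.conv15(inp)
        return torch.cat([r14, r15], dim=1)                # superimposed layer: 2X channels
```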
As shown in fig. 5, each GFU module in the residual dual-stream network comprises seven convolution layers, two superimposed layers, two dot-multiplication layers and two Sigmoid layers. The color-map features and disparity-map features that reach the GFU module (the outputs of the corresponding convolution module, Res2net modules and reduce convolution modules) are respectively input into the first convolution layer and the fourth convolution layer, and the outputs of the first and fourth convolution layers are connected through the first superimposed layer. On one hand, the output of the first superimposed layer passes sequentially through the second and third convolution layers; the output of the first convolution layer and the output of the third convolution layer are then connected and input into the first dot-multiplication layer M1, whose output passes through the first Sigmoid layer S1 and is input into the second superimposed layer. On the other hand, the output of the first superimposed layer passes sequentially through the fifth and sixth convolution layers; the output of the fourth convolution layer and the output of the sixth convolution layer are connected and input into the second dot-multiplication layer M2, whose output passes through the second Sigmoid layer S2 and is input into the second superimposed layer. The output of the second superimposed layer is processed by the seventh convolution layer, whose output serves as the output of the GFU module. For the second, third or fourth GFU module, the output of the preceding GFU module is additionally input into the second superimposed layer through the first upsampling block.
As shown in fig. 6, each GFU unit in the residual single-stream network comprises four convolution layers, one superimposed layer, one dot-multiplication layer and one Sigmoid layer. The color-map features that reach the GFU unit (the outputs of the convolution unit, Res2net units and reduce convolution units) are input into the first convolution layer, whose output passes through the second convolution layer into the third convolution layer; the output of the first convolution layer and the output of the third convolution layer are then connected and input into the dot-multiplication layer M1, whose output passes through the Sigmoid layer S1 and is input into the superimposed layer; the output of the superimposed layer is processed by the fourth convolution layer, whose output serves as the output of the GFU unit. For the second, third or fourth GFU unit, the output of the preceding GFU unit is additionally input into the superimposed layer through the second upsampling block.
A dot-multiplication layer multiplies the pixel values of corresponding pixel positions of its two input feature maps.
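A compact PyTorch sketch of the dual-stream GFU (gate fusion unit) wiring just described; the channel counts and the one-channel broadcastable gates follow the detailed embodiment below, and the whole block should be read as a sketch under those assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, 1, 1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class GFU(nn.Module):
    """Gate fusion unit of the dual stream (FIG. 5), for x-channel inputs."""
    def __init__(self, x, first=False):
        super().__init__()
        self.first = first
        self.conv1, self.conv4 = conv3(x, x), conv3(x, x)      # RGB / disparity inputs
        self.conv2, self.conv3_ = conv3(2 * x, x), conv3(x, 1)  # RGB gate path (G2, G3)
        self.conv5, self.conv6 = conv3(2 * x, x), conv3(x, 1)   # disparity gate path
        self.conv7 = conv3(2 * x if first else 3 * x, x)         # fusion output

    def forward(self, rgb_feat, disp_feat, prev=None):
        g1, g4 = self.conv1(rgb_feat), self.conv4(disp_feat)
        c1 = torch.cat([g1, g4], dim=1)                      # first superimposed layer
        s1 = torch.sigmoid(g1 * self.conv3_(self.conv2(c1))) # M1 then S1: gated RGB
        s2 = torch.sigmoid(g4 * self.conv6(self.conv5(c1)))  # M2 then S2: gated disparity
        parts = [s1, s2]
        if prev is not None:  # previous GFU output, bilinearly upsampled to this scale
            parts.append(F.interpolate(prev, size=s1.shape[2:],
                                       mode="bilinear", align_corners=False))
        return self.conv7(torch.cat(parts, dim=1))           # second superimposed layer
```

The one-channel outputs of the third and sixth convolution layers act as spatial gates that broadcast over the x-channel features, which is how the gate mechanism "forgets" erroneous features during fusion.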
The invention adopts a disparity purification (DDU) selector to screen the disparity maps which, in ordinary practice, would be fully trusted and used even when they carry erroneous and redundant noise information. An efficient feature extraction and encoding network structure is then constructed from the two-type residual network, effectively extracting the position and edge information of objects. A GFU (gate fusion unit) module is further designed: using the principle of a gate mechanism, it further screens the extracted features and forgets the erroneous part of them, while also effectively normalizing the feature fusion process, thereby effectively improving the correlation between objective evaluation results and subjective perception.
Compared with the prior art, the invention has the advantages that:
the method adopts a coding structure based on a two-type residual error network (res 2 net) as a characteristic extraction mode, so that the method can have the characteristic of extracting the network efficiently and completely. And meanwhile, the operation of partial downsampling is reduced, so that information dispersion is avoided.
2) The method adopts the parallax map to optimize the deep learning algorithm. The defect of insufficient information of the color map under the conditions of similar background and object colors, contrast and the like is overcome. Thereby making the result more closely approximate to the case of a real human viewing object.
3) The present invention incorporates a parallax purifying (DDU) selector. And (3) putting the disparity maps which meet the quality and can bring positive influence into a double-flow residual error network for prediction, and removing the disparity maps which do not meet the quality, wherein the single-flow residual error network is used for prediction. Thereby solving the problem of noise existing in the 3D information.
4) The method adopts a GFU (gate fusion unit) module, and better fuses the high-low level characteristics after information processing and the characteristics of each mode to obtain more accurate results.
Drawings
FIG. 1 is a general block diagram of the present invention;
FIG. 2 is a structure diagram of the residual dual-stream network;
FIG. 3 is a structure diagram of the residual single-stream network;
FIG. 4 is a network structure diagram of the Res2net module/unit;
FIG. 5 is a structure diagram of the GFU (gate fusion unit) module of the residual dual-stream network;
FIG. 6 is a structure diagram of the GFU (gate fusion unit) unit of the residual single-stream network;
FIG. 7a is the color map of implementation scene one;
FIG. 7b is the disparity map of implementation scene one;
FIG. 7c is the label map of implementation scene one;
FIG. 7d is the saliency prediction result obtained from FIG. 7a and its corresponding FIG. 7b using the method of the present invention;
FIG. 8a is the color map of implementation scene two;
FIG. 8b is the disparity map of implementation scene two;
FIG. 8c is the label map of implementation scene two;
FIG. 8d is the saliency prediction result obtained from FIG. 8a and its corresponding FIG. 8b using the method of the present invention;
FIG. 9a is the color map of implementation scene three;
FIG. 9b is the disparity map of implementation scene three;
FIG. 9c is the label map of implementation scene three;
FIG. 9d is the saliency prediction result obtained from FIG. 9a and its corresponding FIG. 9b using the method of the present invention;
FIG. 10 shows two graphs evaluating the performance of the algorithm on the NJU2000 dataset, wherein:
FIG. 10(a) is the PR curve of the prediction results;
FIG. 10(b) is the ROC curve of the prediction results.
Detailed Description
The invention is described in further detail below with reference to the drawings and embodiments.
The specific embodiment and the implementation process of the invention comprise the following steps:
Step 1_1: establish a database containing color maps, disparity maps and annotated label maps; then scale all images in the database to 256×256 by bilinear interpolation; randomly extract eighty percent of the color maps, together with their corresponding disparity maps and label maps, as the training set. Denote the $k$-th color map in the training set as $\{I_k(x,y)\}$, its corresponding disparity map as $\{D_k(x,y)\}$, and its corresponding label map as $\{L_k(x,y)\}$, where $k$ is a positive integer, $1 \le k \le K$, and $K$ denotes the total number of color maps contained in the database (which is also the number of disparity maps and label maps it contains), with $K \ge 1588$; $I_k(x,y)$, $D_k(x,y)$ and $L_k(x,y)$ denote the pixel values at coordinates $(x,y)$ in the $k$-th color map, disparity map and label map, respectively;
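A minimal sketch of this preprocessing, assuming OpenCV for the bilinear resize and an assumed list of (color, disparity, label) file-path triples:

```python
import random
import cv2

def load_and_resize(path, size=(256, 256)):
    """Read an image and scale it to 256x256 by bilinear interpolation."""
    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    return cv2.resize(img, size, interpolation=cv2.INTER_LINEAR)

def split_dataset(triples, train_ratio=0.8, seed=0):
    """triples: list of (color_path, disparity_path, label_path); 80% train."""
    samples = [(load_and_resize(c), load_and_resize(d), load_and_resize(l))
               for c, d, l in triples]
    random.Random(seed).shuffle(samples)
    cut = int(train_ratio * len(samples))
    return samples[:cut], samples[cut:]   # training set, remaining set
```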
Step 1_2: construct the disparity-purification-based two-type residual network. As shown in fig. 1, the disparity-purification-based two-type residual network mainly consists of three parts, namely: the disparity purification (DDU) selector, the residual dual-stream network, and the residual single-stream network. As shown in fig. 2, the residual dual-stream network comprises a first convolution block with first, second, third and fourth Res2net blocks; a second convolution block with eighth, seventh, sixth and fifth Res2net blocks; first through eighth reduce convolution blocks; first, second, third and fourth GFU (gate fusion unit) modules; and a first upsampling module. As shown in fig. 3, the residual single-stream network comprises a first convolution block; first, second, third and fourth Res2net blocks; first through fourth reduce convolution blocks; first, second, third and fourth GFU modules; and a first upsampling module. As shown in fig. 4, a Res2net block comprises first through fifteenth convolution blocks, first and second add (point-sum) layers, and a first superimposed layer. As shown in fig. 5, the GFU module in the residual dual stream comprises a first upsampling block; first through seventh convolution blocks; first and second superimposed layers C1 and C2; first and second dot-multiplication layers M1 and M2; and first and second Sigmoid layers S1 and S2. As shown in fig. 6, the GFU module in the residual single stream comprises a first upsampling block; first through fourth convolution blocks; a first superimposed layer C1; a first dot-multiplication layer M1; and a first Sigmoid layer S1.
The working principle and overall structure of the invention are now described with reference to FIG. 1. Before training or testing begins, the disparity map corresponding to each image is screened by the disparity purification selector. If the disparity purification selector judges that it belongs to the positive samples, the disparity map and the corresponding color map are input into the residual dual-stream network to obtain the result; if the disparity purification selector judges that it belongs to the negative samples, the disparity map is discarded and only the color map is input into the residual single-stream network to obtain the result. To obtain the disparity purification selector, the disparity maps and corresponding labels of the training set divided in step 1_1 are taken from the database, and the S-measure value $S_m$ between each disparity map and its label is computed. The Otsu algorithm is then used to binarize the disparity map; the variance $H_e$ of the original disparity pixels in the binarized target region and the variance $E_m$ of the original disparity pixels in the binarized background region are computed. Finally, a decision tree with tree depth 3 is constructed with the pairs $(H_e, E_m)$ as training samples and $S_m$ as labels. A disparity purification selector that can judge whether the quality of a disparity map meets the standard is thereby obtained.
The residual dual-stream network, the residual single-stream network, and the Res2net blocks and GFU blocks they contain are now described one by one. The Res2net block is explained first; the structure of the Res2net blocks in the residual single-stream and residual dual-stream networks is identical. Let the feature-map input of a Res2net block be a feature map of width W, height H and X channels. The first convolution block consists of a convolution layer, a local normalization layer and an activation layer; its input is the X-channel feature map and its output is an X-channel feature map, whose set of X feature maps is denoted $R_1$. The convolution kernel size of the convolution layer is 3×3, the number of convolution kernels is X, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each feature map in $R_1$ has width W and height H.
$R_1$ is divided into four equal parts along the channel dimension, which serve respectively as the inputs of the second, third, fourth and fifth convolution blocks. These four convolution blocks have exactly the same structure: each takes an $X/4$-channel feature map as input, outputs an $X/4$-channel feature map, and consists of a convolution layer, a local normalization layer and an activation layer; the output sets of $X/4$ feature maps are denoted $R_2$, $R_3$, $R_4$ and $R_5$, respectively. The convolution kernel size is 3×3, the number of convolution kernels is $X/4$, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each feature map in $R_2$ through $R_5$ has width W and height H.
For the sixth convolution block, its input is $R_3$ and its output is an $X/4$-channel feature map; it consists of a convolution layer, a local normalization layer and an activation layer, and the output set of $X/4$ feature maps is denoted $R_6$. The convolution kernel size is 3×3, the number of convolution kernels is $X/4$, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each output feature map has width W and height H.
For the first add layer, its input receives $R_4$ and $R_6$, performs a point-wise sum of the feature matrices, and outputs an $X/4$-channel set of feature maps denoted $A_1$; each feature map in $A_1$ has width W and height H.
For the seventh convolution block, its input is $A_1$ and its output is an $X/4$-channel feature map; it consists of a convolution layer, a local normalization layer and an activation layer, and the output set of $X/4$ feature maps is denoted $R_7$. The convolution kernel size is 3×3, the number of convolution kernels is $X/4$, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each output feature map has width W and height H.
For the second add layer, its input receives $R_7$ and $R_5$, performs a point-wise sum of the feature matrices, and outputs an $X/4$-channel set of feature maps denoted $A_2$; each feature map in $A_2$ has width W and height H.
For the eighth convolution block, its input is $A_2$ and its output is an $X/4$-channel feature map; it consists of a convolution layer, a local normalization layer and an activation layer, and the output set of $X/4$ feature maps is denoted $R_8$; its parameters are the same as those of the seventh convolution block.
For the ninth, tenth, eleventh and twelfth convolution blocks, their inputs are $R_2$, $R_6$, $R_7$ and $R_8$ respectively, and their outputs are all $X/4$-channel feature maps; each consists of a convolution layer, a local normalization layer and an activation layer, and the four output sets of $X/4$ feature maps are denoted $R_9$, $R_{10}$, $R_{11}$ and $R_{12}$. The convolution kernels are all 3×3, the number of kernels is $X/4$, the padding parameters are 1, the strides are 1, the activation functions are "Relu", and each output feature map has width W and height H.
For the thirteenth convolution block, its input is $R_9$, $R_{10}$, $R_{11}$ and $R_{12}$ arranged in sequence (identical to the superimposed-layer operation) as an X-channel feature map, and its output is an X-channel feature map produced by a convolution layer, a local normalization layer and an activation layer; the output set of X feature maps is denoted $R_{13}$. The convolution kernel size is 1×1, the number of kernels is X, the padding parameter is 0, the stride is 2, the activation function is "Relu", and each output feature map has width $W/2$ and height $H/2$.
For the fourteenth convolution block, its input is $R_{13}$ and its output is an X-channel feature map produced by a convolution layer, a local normalization layer and an activation layer; the output set of X feature maps is denoted $R_{14}$. The kernel size is 3×3, the number of kernels is X, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each output feature map has width $W/2$ and height $H/2$.
For the fifteenth convolution block, its input is the original X-channel input of the block, and its output is an X-channel feature map produced by a convolution layer, a local normalization layer and an activation layer; the output set of X feature maps is denoted $R_{15}$. The kernel size is 1×1, the number of kernels is X, the padding parameter is 0, the stride is 2, the activation function is "Relu", and each output feature map has width $W/2$ and height $H/2$.
For the first superimposed layer, its input receives $R_{14}$ and $R_{15}$ and connects the two sets of feature maps in sequence; its output is the 2X-channel $Res_n$, where $n$ denotes the index of the Res2net block.
The network structure of the dual-stream GFU module is described next. The inputs of a GFU module are generally the modality features extracted from the color map, the modality features extracted from the disparity map, and the output of the previous GFU module. Note, however, that in both the dual stream and the single stream, the first GFU block has no $(n-1)$-th GFU output; it simply lacks this input and has no other structural difference. Only the general-case GFU module is discussed below. Assume the RGB modality input is RGB_input and the disparity-map modality input is dir_input, both of them feature maps of width w, height h and x channels in the current module.
For the first convolution block, its input is the x-channel RGB_input feature map and its output is an x-channel feature map; it consists of a convolution layer, a local normalization layer and an activation layer, and the output set of x feature maps is denoted $G_1$. The kernel size is 3×3, the number of kernels is x, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each feature map in $G_1$ has width w and height h.
For the fourth convolution block, its input is the x-channel dir_input feature map and its output is an x-channel feature map denoted $G_4$; its structure and parameters are the same as those of the first convolution block.
For the first superimposed layer, its input receives $G_1$ and $G_4$ and connects the two sets of feature maps in sequence; its output is the 2x-channel $C_1$.
For the second convolution block, its input is $C_1$ and its output is an x-channel feature map; it consists of a convolution layer, a local normalization layer and an activation layer, and the output set of x feature maps is denoted $G_2$. The kernel size is 3×3, the number of kernels is x, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each feature map in $G_2$ has width w and height h.
For the fifth convolution block, its input is $C_1$ and its output is an x-channel feature map denoted $G_5$; its structure and parameters are the same as those of the second convolution block.
For the third convolution block, its input is $G_2$ and its output is a 1-channel feature map; it consists of a convolution layer, a local normalization layer and an activation layer, and the single output feature map is denoted $G_3$. The kernel size is 3×3, the number of kernels is 1, the padding parameter is 1, the stride is 1, the activation function is "Relu", and the feature map in $G_3$ has width w and height h.
For the sixth convolution block, its input is $G_5$ and its output is a 1-channel feature map denoted $G_6$; its structure and parameters are the same as those of the third convolution block.
For the first dot-multiplication layer M1, its input receives $G_1$ and $G_3$; after dot multiplication it outputs an x-channel feature map denoted $M_1$, each feature map of width w and height h.
For the second dot-multiplication layer M2, its input receives $G_4$ and $G_6$; after dot multiplication it outputs an x-channel feature map denoted $M_2$, each feature map of width w and height h.
For the first Sigmoid activation layer S1, its input receives $M_1$; after applying the Sigmoid activation function it outputs an x-channel feature map denoted $S_1$, each feature map of width w and height h.
For the second Sigmoid activation layer S2, its input receives $M_2$; after applying the Sigmoid activation function it outputs an x-channel feature map denoted $S_2$, each feature map of width w and height h.
For the first upsampling block, its input receives the output of the $(n-1)$-th GFU block. Since the output of the $(n-1)$-th GFU block has x channels but only half the width and half the height of the feature maps of the current $n$-th GFU block, this layer upsamples it by bilinear interpolation and finally outputs an x-channel feature map of width w and height h, denoted $Up_1$.
For the second superimposed layer, its input receives $S_1$, $S_2$ and $Up_1$ and connects the sets of feature maps in sequence; its output is the 3x-channel $C_2$.
For the seventh convolution block, its input is $C_2$ and its output is an x-channel feature map; it consists of a convolution layer, a local normalization layer and an activation layer, and the output set of x feature maps is denoted $G_7$. The kernel size is 3×3, the number of kernels is x, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each feature map in $G_7$ has width w and height h.
Similarly, for the single-stream GFU module, assume the RGB modality input is RGB_input, a feature map of width w, height h and x channels in the current module. For the first convolution block, its input is the x-channel RGB_input feature map and its output is an x-channel feature map; it consists of a convolution layer, a local normalization layer and an activation layer, and the output set of x feature maps is denoted $G_1$. The kernel size is 3×3, the number of kernels is x, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each feature map in $G_1$ has width w and height h.
For the second convolution block, its input is $G_1$ and its output is an x-channel feature map denoted $G_2$; its structure and parameters are the same as those of the first convolution block.
For the third convolution block, its input is $G_2$ and its output is a 1-channel feature map; it consists of a convolution layer, a local normalization layer and an activation layer, and the single output feature map is denoted $G_3$. The kernel size is 3×3, the number of kernels is 1, the padding parameter is 1, the stride is 1, the activation function is "Relu", and the feature map in $G_3$ has width w and height h.
For the first dot-multiplication layer M1, its input receives $G_1$ and $G_3$; after dot multiplication it outputs an x-channel feature map denoted $M_1$, each feature map of width w and height h.
For the first Sigmoid activation layer S1, its input receives $M_1$; after applying the Sigmoid activation function it outputs an x-channel feature map denoted $S_1$, each feature map of width w and height h.
For the first upsampling block, its input receives the output of the $(n-1)$-th GFU block. Since that output has x channels but only half the width and half the height of the current $n$-th GFU block, this layer upsamples it by bilinear interpolation and finally outputs an x-channel feature map of width w and height h, denoted $Up_1$.
For the first superimposed layer, its input receives $S_1$ and $Up_1$ and connects the two sets of feature maps in sequence; its output is the 2x-channel $C_1$.
For the fourth convolution block, its input is the 2x-channel $C_1$ and its output is an x-channel feature map; it consists of a convolution layer, a local normalization layer and an activation layer, and the output set of x feature maps is denoted $G_4$. The kernel size is 3×3, the number of kernels is x, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each feature map in $G_4$ has width w and height h.
Finally, the remaining blocks in the dual-stream residual and single-stream residual networks are introduced; their connections to the Res2net blocks and GFU blocks are shown in fig. 2 and fig. 3. For the residual dual-stream network: the first convolution block consists of a convolution layer, a local normalization layer and an activation layer; its input is a 3-channel feature map and its output is a 64-channel feature map, whose set of 64 feature maps is denoted $D_1$. The kernel size is 3×3, the number of kernels is 64, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each feature map in $D_1$ has width W and height H.
For the second convolution block, it consists of a convolution layer, a local normalization layer and an activation layer; its input is a 1-channel feature map and its output is a 64-channel feature map, whose set of 64 feature maps is denoted $D_2$. The kernel size is 3×3, the number of kernels is 64, the padding parameter is 1, the stride is 1, the activation function is "Relu", and each feature map in $D_2$ has width W and height H.
For the first through eighth reduce blocks, they remain ordinary convolution blocks, named only for their function of reducing the number of channels. Each consists of a convolution layer, a local normalization layer and an activation layer; the input channel numbers of the first through eighth reduce blocks are, in order, 1024, 512, 256, 128, 1024, 512, 256, 128, and the output channel numbers are, in order, 256, 128, 64, 32, 256, 128, 64, 32; the output sets of feature maps are denoted $Reduce_{1\sim8}$. The kernel size is 3×3, the numbers of kernels are, in order, 256, 128, 64, 32, 256, 128, 64, 32, the padding parameter is 1, the stride is 1, and the activation function is "Relu"; the width and height of each feature map in $Reduce_{1\sim8}$ equal those of the corresponding Res2net block's output, from $W/16 \times H/16$ for the deepest blocks up to $W/2 \times H/2$ for the shallowest.
For the first upsampling block, it comprises a convolution layer and an upsampling layer; the convolution layer has kernel size 3×3, 1 convolution kernel, padding parameter 1 and stride 1, the activation function is "Relu", and its input receives the output of the fourth GFU block. This layer performs bilinear-interpolation upsampling and finally outputs a 1-channel feature map of width w and height h, denoted $Up_1$.
For residual single stream networks: the first volume block is composed of a convolution layer, a local normalization layer and an activation layer, wherein the input of the first volume block is a 1-channel feature map, and the output of the first volume block is a 64-channel feature map; the set of the output 64 feature maps is denoted as F 1 The method comprises the steps of carrying out a first treatment on the surface of the Wherein the convolution kernel size of the convolution layer is 3×3, the number of the convolution kernels is 64, the padding parameter of the convolution layer is 1, the step length is 1, the activation function is 'Relu', and a feature map F is output 1 The width and height of each feature map in (a) is W, H, respectively.
For the first to fourth reduce blocks of the single stream: each remains a convolution block, named for its function of reducing the number of channels. Each consists of a convolution layer, a local normalization layer and an activation layer; the numbers of input channels of the first to fourth reduce blocks are 1028, 512, 256 and 128 in sequence, and the numbers of output channels are 256, 128, 64 and 32 in sequence. The sets of output feature maps are denoted Reduce_1 to Reduce_4. The convolution kernel size is 3×3, the numbers of convolution kernels are 256, 128, 64 and 32 in sequence, the padding parameter of each convolution layer is 1, the stride is 1, and the activation function is ReLU. The widths and heights of the feature maps in Reduce_1 to Reduce_4 match those of the corresponding Res2net unit outputs; the source gives them as formula images (scaled fractions of W and H), ending at W, H.
For the first up-sampling block of the single stream: it contains one convolution layer and one up-sampling layer. The convolution kernel size is 3×3, the number of convolution kernels is 1, the padding parameter is 1 and the stride is 1, with activation function ReLU; its input receives the output of the fourth GFU block. The up-sampling layer performs bilinear interpolation and finally outputs a 1-channel feature map with width W and height H, denoted Up_1.
Step 1_3: training stage. All color images in the training set and their corresponding disparity maps are taken as training images to pre-train the residual dual-stream network. At the same time, all color images alone are taken as training images to pre-train the residual single-stream network. The two outputs are recorded as the dual-stream and single-stream prediction segmentation maps for the kth training sample (the source denotes them by formula images indexed by k).
Step 1_4: compute the loss values between these prediction maps and the real manually annotated label maps, obtaining the loss function Loss1 of the residual dual-stream network and Loss2 of the residual single-stream network. Finally, back-propagation is performed through Loss1 and Loss2 to train the two independent networks; the loss functions used here are all cross-entropy loss functions.
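A hedged sketch of this training step, assuming PyTorch and a binary cross-entropy instantiation of the cross-entropy loss (the saliency maps are two-class); dual_net, single_net and the optimizer names are illustrative, not from the source:

```python
import torch.nn.functional as F

def train_step(dual_net, single_net, opt_dual, opt_single, color, disparity, label):
    """One pre-training step: the two networks are independent, so Loss1 and
    Loss2 are back-propagated separately."""
    pred_dual = dual_net(color, disparity)   # dual-stream prediction segmentation map
    pred_single = single_net(color)          # single-stream prediction segmentation map

    # label is the manually annotated label map as a float tensor in {0, 1}.
    loss1 = F.binary_cross_entropy_with_logits(pred_dual, label)
    loss2 = F.binary_cross_entropy_with_logits(pred_single, label)

    opt_dual.zero_grad()
    loss1.backward()
    opt_dual.step()

    opt_single.zero_grad()
    loss2.backward()
    opt_single.step()
    return loss1.item(), loss2.item()
```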
Step 1_5: steps 1_3 and 1_4 are repeated a total of O times, thereby obtaining two pre-trained neural network models and the tuned optimal model parameters W_op.
The specific steps of the test stage process are as follows:
step 2_1: let {X(i, j)} denote an object image to be saliency-segmented, where 1 ≤ i ≤ W and 1 ≤ j ≤ H, W denotes the width of {X(i, j)}, H denotes the height of {X(i, j)}, and X(i, j) denotes the pixel value of the pixel at coordinate position (i, j) in {X(i, j)};
step 2_2: the disparity map is input to the trained parallax-purification selector. If the selector judges it a positive sample, the corresponding color image is input, together with the disparity map, to the residual dual-stream network for prediction, and the saliency prediction map corresponding to {X(i, j)} is denoted {X_Pre(i, j)}, where X_Pre(i, j) denotes the pixel value of the pixel at coordinate position (i, j) in {X_Pre(i, j)}. If it is judged a negative sample, only the color image is input to the residual single-stream network for prediction, and the resulting prediction map is denoted {S_Pre(i, j)}, where S_Pre(i, j) denotes the pixel value of the pixel at coordinate position (i, j) in {S_Pre(i, j)}. Merging {X_Pre(i, j)} and {S_Pre(i, j)} yields the set of all predictions {Pre(i, j)}. The RGB images, depth maps and labels of part of the experiments, together with the corresponding result maps from the prediction result set, are shown in turn in figs. 7a-7d, 8a-8d and 9a-9d.
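The routing logic of the test stage can be sketched as follows; selector.is_positive, dual_net and single_net are assumed interfaces for illustration, not names from the source:

```python
def predict(selector, dual_net, single_net, color, disparity):
    """Route a test image by disparity quality: positive samples use both modalities,
    negative samples (unreliable disparity) fall back to the color-only network."""
    if selector.is_positive(disparity):
        return dual_net(color, disparity)   # yields X_Pre
    return single_net(color)                # yields S_Pre
```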
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The neural network model was built with the pytorch1.1.0 library in Python. The segmentation performance of the saliency maps predicted by the method of the invention was analyzed on the NJUD database (1588 training images, 397 test images). Six objective metrics commonly used to evaluate semantic segmentation methods were adopted: the PR curve, the ROC curve, AUC, MeanF, MaxF and MAE. The results are shown in figs. 10(a) and 10(b) and in the values below. From the PR curve in fig. 10(a), the ROC curve in fig. 10(b) and the listed data, the segmentation results obtained by the method of the invention are good, indicating that predicting saliency and segmenting objects with the method of the invention is feasible and effective.
AUC = 0.976, MeanF = 0.837, MaxF = 0.876, MAE = 0.070
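Of these metrics, MAE is the simplest to state; a sketch, assuming the prediction and label are arrays normalized to [0, 1]:

```python
import numpy as np

def mae(pred, label):
    """Mean absolute error between a saliency prediction and the ground-truth label map."""
    pred = np.asarray(pred, dtype=np.float64)
    label = np.asarray(label, dtype=np.float64)
    return np.abs(pred - label).mean()
```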
Comparing the label maps with the prediction results, as shown in figs. 7c and 7d, figs. 8c and 8d, and figs. 9c and 9d, shows that the segmentation accuracy of the salient images obtained by the method of the invention is high.

Claims (6)

1. A parallax-purification-based two-type residual binocular salient object image segmentation method, divided into a training stage and a testing stage, characterized in that:
the training phase process comprises the following specific steps:
step 1_1: collecting object images, or establishing a training set based on a public data set, each object image comprising a color image and a disparity map;
step 1_2: constructing a parallax-purification-based two-type residual network:
the parallax-purification-based two-type residual network mainly comprises three parts: a parallax purification (DDU) selector, a residual dual-stream network and a residual single-stream network;
step 1_3: the color maps and disparity maps in the training set are input as training samples to the residual dual-stream network for pre-training, which outputs a dual-stream prediction segmentation map; meanwhile, the color maps in the training set are input as training samples to the residual single-stream network for pre-training, which outputs a single-stream prediction segmentation map (both maps are denoted in the source by formula images);
step 1_4: using a loss function, compute the loss values between the dual-stream prediction segmentation map and the annotated label map and between the single-stream prediction segmentation map and the annotated label map, obtaining the loss value Loss1 of the residual dual-stream network and the loss value Loss2 of the residual single-stream network; train the residual dual-stream network and the residual single-stream network separately by back-propagating the two loss values Loss1 and Loss2, obtaining the two pre-trained networks and, respectively, the optimal network model parameters W_op1 of the dual-stream residual network and W_op2 of the single-stream residual network;
The specific steps of the test stage process are as follows:
step 2_1: for an object image {X(i, j)} to be saliency-segmented, input the disparity map of the object image to the trained parallax-purification selector, which outputs a positive- or negative-sample decision:
if the output is judged to be a positive sample, the color image and the parallax image in the object image are input into a residual double-current network together for prediction processing, and a double-current significance prediction image { X } is obtained by output Pre (i,j)};
if the output is judged a negative sample, only the color image of the object image is input to the residual single-stream network for prediction, outputting a single-stream saliency prediction map {S_Pre(i, j)};
step 2_2: finally, the dual-stream saliency prediction maps {X_Pre(i, j)} and single-stream saliency prediction maps {S_Pre(i, j)} are merged to obtain the prediction result set {Pre(i, j)}, which is the segmentation result of the original object images;
the parallax purifying selector specifically comprises a binarization processing module, a similarity processing module, a decision tree and a threshold judging module; the binarization processing module and the similarity processing module are connected to the input end of the decision tree together in the training stage, the output end of the decision tree is always connected with the threshold judgment module, and the training is processed by adopting the following modes: on the one hand, the disparity map in the object image and the corresponding label map are input into a similarity processing module for processing, and a similarity value S between the disparity map and the label map is obtained by calculating an S-measure value m As a label for decision tree training, on the other hand, binarizing the disparity map by using an Otsu algorithm to obtain a target area and a background area of the disparity map, and respectively calculating variances H of all pixels of an original disparity map, which is not binarized and corresponds to the target area, of the target area, which are obtained by binarizing the disparity map e And the pixels after the binarization of the parallax map are the variances E of all the pixels of the original non-binarized parallax map corresponding to the background region in the region m The method comprises the steps of carrying out a first treatment on the surface of the Finally, constructing a decision tree with the tree depth of 3 by two variances H e Value sum E m As input to the decision tree, similarity value S m Monitoring as decision treeSupervision is conducted, and training is conducted, so that a decision tree capable of estimating the parallax map S-measure value is obtained; finally, judging as follows according to the set threshold judgment module: when the S-measure value is smaller than or equal to 0.45, the disparity map is a negative sample with unqualified quality, and when the S-measure value is larger than 0.45, the sample is a positive sample with qualified quality;
the residual double-current network comprises eight Res2net modules, eight reduce convolution modules, four GFU modules, two convolution modules and an up-sampling neural network module; the color image is input to a first convolution module, the output of the first convolution module is sequentially transmitted and processed by a first Res2net module, a second Res2net module and a third Res2net module and then is input to a fourth Res2net module, and the output of the first Res2net module, the second Res2net module, the third Res2net module and the fourth Res2net module is respectively transmitted by a fourth reduction convolution module, a third reduction convolution module, a second reduction convolution module and a first reduction convolution module; the parallax images are input into a second convolution module, the output of the second convolution module is sequentially transmitted and processed by an eighth Res2net module, a seventh Res2net module and a sixth Res2net module and then is input into a fifth Res2net module, and the output of the eighth Res2net module, the seventh Res2net module, the sixth Res2net module and the fifth Res2net module are respectively connected with a fourth GFU module, a third GFU module, a second GFU module and a first GFU module through the respective eighth reduction convolution module, the seventh reduction convolution module, the sixth reduction convolution module and the fifth reduction convolution module; the output of the first GFU module is connected with and input to the second GFU module, the output of the second GFU module is connected with and input to the third GFU module, the output of the third GFU module is connected with and input to the fourth GFU module, the output of the fourth GFU module is connected with and input to the up-sampling neural network module, the up-sampling neural network module outputs the double-flow significance prediction graph;
the residual single-stream network comprises one convolution unit, four Res2net units, four reduce convolution units, four GFU units and an up-sampling neural network unit; the color map is input to the convolution unit, whose output is connected in sequence through the first, second and third Res2net units to the fourth Res2net unit; the outputs of the first, second, third and fourth Res2net units are connected to the first, second, third and fourth GFU units through the first, second, third and fourth reduce convolution units, respectively; the output of the first GFU unit is connected to the second GFU unit, the output of the second GFU unit is connected to the third GFU unit, the output of the third GFU unit is connected to the fourth GFU unit, and the output of the fourth GFU unit is connected to the up-sampling neural network unit, which outputs the single-stream saliency prediction map.
2. The parallax-purification-based two-type residual binocular salient object image segmentation method according to claim 1, characterized in that: each Res2net module of the residual dual-stream network and each Res2net unit of the residual single-stream network has the same structure, comprising 14 convolution blocks, two point-sum layers and a superposition layer; the input of the first convolution block serves as the input of the Res2net module/unit, and the output of the first convolution block is divided into four parts in output order, input respectively to the second, third, fourth and fifth convolution blocks; the output of the second convolution block is input to the ninth convolution block; the output of the third convolution block is input to the sixth convolution block, whose output is input to the tenth convolution block; the output of the fourth convolution block and the output of the sixth convolution block are combined through the first point-sum layer and input to the seventh convolution block, whose output is input to the eleventh convolution block; the output of the seventh convolution block and the output of the fifth convolution block are combined through the second point-sum layer and input to the eighth convolution block, whose output is input to the thirteenth convolution block; the outputs of these branches are gathered and input to the fourteenth convolution block; the input of the first convolution block is also input to the fifteenth convolution block, and the output of the fifteenth convolution block and the output of the fourteenth convolution block are combined through the first superposition layer, the result serving as the output of the Res2net module/unit.
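For orientation, the following is a generic Res2Net-style block sketch in the same PyTorch setting: the first convolution's output is split into four groups, later groups are summed with the preceding branch output before their own convolution (the point-sum layers), and the regrouped result is combined with a shortcut convolution on the block input. The claim's 14-block variant adds further per-branch convolutions and concatenates its residual branch; this sketch keeps the canonical additive residual for brevity:

```python
import torch
import torch.nn as nn

class Res2NetBlock(nn.Module):
    """Generic Res2Net-style block sketch with scale 4 (an assumption, not the
    claim's exact wiring)."""
    def __init__(self, channels):
        super().__init__()
        assert channels % 4 == 0
        w = channels // 4
        self.conv_in = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch = nn.ModuleList(nn.Conv2d(w, w, 3, padding=1) for _ in range(3))
        self.conv_out = nn.Conv2d(channels, channels, 3, padding=1)
        self.shortcut = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        s = torch.chunk(self.conv_in(x), 4, dim=1)   # split into four groups
        y = [s[0], self.branch[0](s[1])]             # first group passes straight through
        y.append(self.branch[1](s[2] + y[-1]))       # point-sum, then convolution
        y.append(self.branch[2](s[3] + y[-1]))       # point-sum, then convolution
        out = self.conv_out(torch.cat(y, dim=1))     # regroup the four branches
        return out + self.shortcut(x)                # residual connection
```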
3. The parallax-purification-based two-type residual binocular salient object image segmentation method according to claim 2, characterized in that: the point-sum layer adds the pixel values at the same pixel positions of its two input maps, and the superposition layer concatenates its two input maps front to back.
4. The parallax-purification-based two-type residual binocular salient object image segmentation method according to claim 1, characterized in that: each GFU module in the residual dual-stream network comprises seven convolution layers, two superposition layers, two dot-product layers and two Sigmoid layers; the color-stream and disparity-stream features correspondingly input to the GFU module are connected to the first convolution layer and the fourth convolution layer, respectively, and the outputs of the first and fourth convolution layers are connected through the second superposition layer; on the one hand, the output of the second superposition layer passes in sequence through the second and third convolution layers, the output of the third convolution layer and the output of the first convolution layer are input together to the first dot-product layer M1, and the output of M1 passes through the first Sigmoid layer S1 and is input to the third superposition layer; on the other hand, the output of the second superposition layer passes in sequence through the fifth and sixth convolution layers, the output of the sixth convolution layer and the output of the fourth convolution layer are input together to the second dot-product layer M2, and the output of M2 passes through the second Sigmoid layer S2 and is input to the third superposition layer; the output of the third superposition layer is processed by the seventh convolution layer and serves as the output of the GFU module; for the second, third and fourth GFU modules, the output of the preceding GFU module is additionally input to the third superposition layer via the first up-sampling block.
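A hedged sketch of such a GFU module, assuming equal channel widths on both inputs, interpreting the superposition layers as channel concatenation (claim 3) and the dot-product layers as element-wise multiplication (claim 6); the up-sampling factor of 2 for the previous GFU output is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GFU(nn.Module):
    """Gated fusion of a color-stream feature and a disparity-stream feature."""
    def __init__(self, channels, has_prev=False):
        super().__init__()
        c3 = lambda c_in, c_out: nn.Conv2d(c_in, c_out, 3, padding=1)
        self.conv1, self.conv4 = c3(channels, channels), c3(channels, channels)
        self.conv2, self.conv5 = c3(2 * channels, channels), c3(2 * channels, channels)
        self.conv3, self.conv6 = c3(channels, channels), c3(channels, channels)
        self.conv7 = c3(3 * channels if has_prev else 2 * channels, channels)
        self.has_prev = has_prev

    def forward(self, color_feat, disp_feat, prev=None):
        a = self.conv1(color_feat)
        b = self.conv4(disp_feat)
        fused = torch.cat([a, b], dim=1)                           # second superposition layer
        gate_a = torch.sigmoid(a * self.conv3(self.conv2(fused)))  # M1 -> S1
        gate_b = torch.sigmoid(b * self.conv6(self.conv5(fused)))  # M2 -> S2
        parts = [gate_a, gate_b]
        if self.has_prev:
            # Previous GFU output arrives through an up-sampling block (assumed channel
            # width equal to `channels` and spatial factor 2).
            parts.append(F.interpolate(prev, scale_factor=2, mode="bilinear",
                                       align_corners=False))
        return self.conv7(torch.cat(parts, dim=1))                 # third superposition layer -> conv7
```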
5. The parallax-purification-based two-type residual binocular salient object image segmentation method according to claim 1, characterized in that: each GFU unit in the residual single-stream network comprises four convolution layers, one superposition layer, one dot-product layer and one Sigmoid layer; the color-stream feature correspondingly input to the GFU unit is connected to the eighth convolution layer; the output of the eighth convolution layer, together with that output passed through the ninth and tenth convolution layers, is input to the third dot-product layer M3, and the output of M3 passes through the third Sigmoid layer S3 and is input to the fourth superposition layer; for the second, third and fourth GFU units, the output of the preceding GFU unit is additionally input to the fourth superposition layer via the second up-sampling block.
6. The parallax-purification-based two-type residual binocular salient object image segmentation method according to claim 5, characterized in that: the dot-product layer multiplies the pixel values at the same pixel positions of its two input maps.
CN202010191229.3A 2020-03-18 2020-03-18 Parallax-cleaning-based binary residual binocular significant object image segmentation method Active CN111429465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010191229.3A CN111429465B (en) 2020-03-18 2020-03-18 Parallax-cleaning-based binary residual binocular significant object image segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010191229.3A CN111429465B (en) 2020-03-18 2020-03-18 Parallax-cleaning-based binary residual binocular significant object image segmentation method

Publications (2)

Publication Number Publication Date
CN111429465A (en) 2020-07-17
CN111429465B (en) 2023-05-23

Family

ID=71553591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010191229.3A Active CN111429465B (en) 2020-03-18 2020-03-18 Parallax-cleaning-based binary residual binocular significant object image segmentation method

Country Status (1)

Country Link
CN (1) CN111429465B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839535B2 (en) * 2016-07-19 2020-11-17 Fotonation Limited Systems and methods for providing depth map information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019169816A1 (en) * 2018-03-09 2019-09-12 中山大学 Deep neural network for fine recognition of vehicle attributes, and training method thereof
CN110782462A (en) * 2019-10-30 2020-02-11 浙江科技学院 Semantic segmentation method based on double-flow feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhou Wujie; Pan Ting; Gu Pengli; Zhai Zhinian. Road scene depth estimation method based on pyramid pooling network. Journal of Electronics &amp; Information Technology. 2019, (010), full text. *
Wang Yining; Qin Pinle; Li Chuanpeng; Cui Yuhao. Improved image super-resolution algorithm based on residual neural network. Journal of Computer Applications. 2018, (01), full text. *

Also Published As

Publication number Publication date
CN111429465A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
CN112329800B (en) Salient object detection method based on global information guiding residual attention
CN113673307B (en) Lightweight video action recognition method
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN110929736B (en) Multi-feature cascading RGB-D significance target detection method
CN112258526B (en) CT kidney region cascade segmentation method based on dual attention mechanism
CN110580704A (en) ET cell image automatic segmentation method and system based on convolutional neural network
CN113269787A (en) Remote sensing image semantic segmentation method based on gating fusion
CN113989301A (en) Colorectal polyp segmentation method fusing neural networks of multiple attention mechanisms
CN115953582B (en) Image semantic segmentation method and system
CN114638768B (en) Image rain removing method, system and equipment based on dynamic association learning network
CN117078930A (en) Medical image segmentation method based on boundary sensing and attention mechanism
CN113537110A (en) False video detection method fusing intra-frame and inter-frame differences
CN110570402B (en) Binocular salient object detection method based on boundary perception neural network
Wang A survey on IQA
Quan et al. Image desnowing via deep invertible separation
CN113379606B (en) Face super-resolution method based on pre-training generation model
Frants et al. QSAM-Net: rain streak removal by quaternion neural network with self-attention module
CN114359293A (en) Three-dimensional MRI brain tumor segmentation method based on deep learning
CN116935044B (en) Endoscopic polyp segmentation method with multi-scale guidance and multi-level supervision
CN113936235A (en) Video saliency target detection method based on quality evaluation
CN117391920A (en) High-capacity steganography method and system based on RGB channel differential plane
CN111429465B (en) Parallax-cleaning-based binary residual binocular significant object image segmentation method
CN113450313B (en) Image significance visualization method based on regional contrast learning
CN115205650B (en) Unsupervised abnormal positioning and detecting method and unsupervised abnormal positioning and detecting device based on multi-scale standardized flow
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant