CN113538485B - Contour detection method for learning biological visual pathway - Google Patents
- Publication number: CN113538485B
- Application number: CN202110784619.6A
- Authority: CN (China)
- Legal status: Active
Classifications
- G06T7/13 — Edge detection
- G06F18/22 — Matching criteria, e.g. proximity measures
- G06F18/253 — Fusion techniques of extracted features
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06T2207/10004 — Still image; Photographic image
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20221 — Image fusion; Image merging
Abstract
The invention provides a contour detection method that learns from the biological visual pathway, comprising the following steps: constructing a deep neural network structure consisting of an encoding network, a decoding network and a feed-forward fusion module, where the encoding network is a structure combining VGG16 and FENet; the original image is processed by the encoding network, the decoding network and the feed-forward fusion module in sequence to obtain the final output contour. The invention enables the encoder to capture richer contour feature information and thereby improves contour detection performance.
Description
Technical Field
The invention relates to the field of image processing, in particular to a contour detection method that learns from the biological visual pathway.
Background
Contour detection aims to extract the boundaries between objects and background in an image. It is usually a key front-end processing step for various mid- and high-level computer vision tasks, and is one of the basic tasks in computer vision research. With the rapid development of deep learning in recent years, researchers have designed contour detection models based on convolutional neural networks (CNNs). Such a model consists of an encoder and a decoder; the encoder commonly adopts VGG16 or ResNet, and the design of the decoder architecture has been the research focus. CNN-based models enable end-to-end contour extraction, and experiments show that these models achieve remarkable results on the Berkeley segmentation data set (BSDS500).
Although CNN-based end-to-end contour detection methods achieve remarkable results, the main innovation of current models lies in the decoder design, and these models lack guidance from visual mechanisms. The role of the decoder is to restore a full-resolution output by fusing the feature maps produced by the encoder.
Disclosure of Invention
The invention aims to provide a contour detection method that learns from the biological visual pathway. Starting from enhancing the feature expression capability of the encoder, and inspired by the biological visual pathway and its related visual mechanisms, a biomimetic contour enhancer is designed. Combining this enhancer with the encoder allows the encoder to obtain richer contour feature information, thereby improving contour detection performance.
The technical scheme of the invention is as follows:
the contour detection method for learning the biological visual pathway comprises the following steps:
A. constructing a deep neural network structure, wherein the deep neural network structure is as follows:
the system comprises an encoding network, a decoding network and a feed-forward fusion module; the encoding network is a network structure combining VGG16 and FENet; FENet, short for feature enhancement network, is a structure created by the inventors;
the VGG16 network takes the pooling layer as a boundary and is divided into stages S1, S2, S3, S4 and S5;
the FENet includes four sub-networks: a single antagonistic feature subnetwork, a dual antagonistic feature subnetwork, a V1 output subnetwork and a V2 output subnetwork; the single antagonistic feature subnetwork simulates the single-opponent receptive field mechanism of the retina/LGN, and the dual antagonistic feature subnetwork simulates the double-opponent receptive field mechanism of V1;
B. inputting an original image into a VGG16 network, and performing convolution processing in S1, S2, S3, S4 and S5 stages in sequence to respectively obtain output results S1, S2, S3, S4 and S5, wherein the output result S1 is sent to a decoding network;
processing an original image by a formula 1 to obtain four inputs of R-G, G-R, B-Y and Y-B;
SO_i = C_m − ω·C_n (1)
wherein i denotes one of R-G, G-R, B-Y, Y-B; m and n denote components among R, G, B, Y; ω is a coefficient with the value 0.7;
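For concreteness, the opponent inputs of formula (1) can be sketched in plain Python as below. The definition of the yellow component Y as the mean of R and G is an assumption (a common choice in opponent-colour models); the patent does not state it.

```python
# Hedged sketch of formula (1): the four single-opponent inputs
# SO_i = C_m - w * C_n for one RGB pixel, with w = 0.7 as stated above.
# ASSUMPTION: the yellow component Y is taken as the mean of R and G,
# a common choice in opponent-colour models; the patent does not define it.

def opponent_channels(r, g, b, w=0.7):
    """Return the four opponent responses R-G, G-R, B-Y, Y-B."""
    y = (r + g) / 2.0  # assumed definition of the yellow component
    return {
        "R-G": r - w * g,
        "G-R": g - w * r,
        "B-Y": b - w * y,
        "Y-B": y - w * b,
    }

# One reddish pixel: the R-G channel responds strongly, G-R goes negative.
so = opponent_channels(0.8, 0.2, 0.4)
```

In a full implementation this map is applied per pixel, yielding the four input planes fed to the single and dual antagonistic subnetworks below.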
inputting R-G, G-R, B-Y and Y-B into a single antagonistic characteristic subnetwork for processing to obtain an output result a, adding the output result a and the output result S2 for fusion to obtain a fusion result a, and inputting the fusion result a into a decoding network;
inputting R-G, G-R, B-Y and Y-B into a dual-antagonistic characteristic subnetwork for processing to obtain an output result B, adding the output result B and the output result S3 for fusion to obtain a fusion result B, and inputting the fusion result B into a decoding network;
the edge response of a V1 area is obtained by an original image through an SCO algorithm, the edge response is input into a V1 output sub-network for processing, an output result c is obtained, the output result c and an output result S4 are added and fused, a fusion result c is obtained, and the fusion result c is input into a decoding network;
the edge response of a V2 area is obtained by an original image through an SED algorithm, the edge response is input into a V2 output sub-network for processing, an output result d is obtained, the output result d and an output result S5 are added and fused, a fusion result d is obtained, and the fusion result d is input into a decoding network;
C. respectively inputting the output result a and the output result b into a feedforward fusion module;
the output result S1, the fusion result a, the fusion result b, the fusion result c and the fusion result d are processed by a decoding network to obtain a decoding output result, the decoding output result is input into a feedforward fusion module, and the loss of the decoding output result is calculated;
D. in the feed-forward fusion module, the output result a and the output result b each pass through a 1x1-1 convolution layer and are then restored to the original resolution by upsampling, and the loss of each restored result is calculated; finally each restored result is multiplied by a weight, the obtained results are added to and fused with the decoding output result to obtain the final output contour, and the loss of the final output contour is calculated.
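The fusion of step D can be sketched with plain Python lists in place of feature tensors. The 1x1-1 convolution is modelled as a weighted sum over channels, upsampling as nearest-neighbour, and the fusion weight is a hypothetical trainable scalar; none of these values come from the patent.

```python
# Minimal sketch of the feed-forward fusion: 1x1-1 conv -> upsample ->
# weight -> add onto the decoder output. Plain lists stand in for tensors.

def conv1x1_to_1(channels, kernel):
    """Collapse a list of HxW channel maps to one map (a 1x1-1 convolution)."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[sum(k * c[i][j] for k, c in zip(kernel, channels))
             for j in range(w)] for i in range(h)]

def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling back toward the original resolution."""
    return [[img[i // factor][j // factor]
             for j in range(len(img[0]) * factor)]
            for i in range(len(img) * factor)]

def fuse(decoded, side, weight):
    """Add a weighted, upsampled side output onto the decoder output."""
    factor = len(decoded) // len(side)
    up = upsample_nearest(side, factor)
    return [[d + weight * u for d, u in zip(dr, ur)]
            for dr, ur in zip(decoded, up)]

# Toy example: a one-channel 2x2 side output fused onto a 4x4 decoder output.
side_a = conv1x1_to_1([[[1.0, 2.0], [3.0, 4.0]]], kernel=[0.5])
decoded = [[0.0] * 4 for _ in range(4)]
out = fuse(decoded, side_a, weight=0.1)
```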
The SCO algorithm is described in the following document: K. Yang, S.-B. Gao, C.-F. Guo, C.-Y. Li, Y.-J. Li, Boundary detection using double-opponency and spatial sparseness constraint, IEEE Transactions on Image Processing, 24 (2015) 2565-2578.
The SED algorithm is described in the following document: A. Akbarinia, C.A. Parraga, Feedback and surround modulated boundary detection, International Journal of Computer Vision, 126 (2018) 1367-1380.
The convolution expression used in each step is m×n-k conv + ReLU, where m×n is the convolution kernel size, k is the number of output channels, conv denotes the convolution operation, and ReLU the activation function; m, n and k are preset values. The convolution expression of the final fusion layer is m×n-k conv, without activation.
The VGG16 network is obtained by the original VGG16 network through the following structural adjustment:
the pooling layer between S4 and S5 is removed, and the three convolutional layers of S5 are changed to dilated ("void") convolutional layers with dilation rates of 2, 4 and 8 in sequence.
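The effect of the dilation rates can be illustrated with a one-dimensional dilated convolution: with a 3-tap kernel and dilation d, the taps span 2d + 1 input positions, which is how rates 2, 4 and 8 enlarge the receptive field of S5 without further pooling. This is an illustrative sketch, not the patent's code.

```python
# 'Valid' 1-D convolution with a dilation ("void") rate: taps are placed
# dilation positions apart, widening the receptive field without pooling.

def dilated_conv1d(x, kernel, dilation):
    """Apply kernel to x with the given dilation rate (no padding)."""
    span = (len(kernel) - 1) * dilation  # extra inputs covered by dilation
    return [sum(k * x[i + t * dilation] for t, k in enumerate(kernel))
            for i in range(len(x) - span)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
out = dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=2)  # taps at i, i+2, i+4
```

With dilation 2 each output sums inputs two apart, so a 3-tap kernel already covers five input positions; stacking rates 2, 4 and 8 compounds this growth.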
The sub-network of single antagonistic features comprises: R-G, G-R, B-Y, Y-B four groups of single antagonistic convolution treatment stages, SEM multiscale enhancement module, 3X 3-128 convolutional layer;
the R-G, G-R, B-Y and Y-B single antagonistic convolution treatment stages are the same and respectively pass through a 3x 3-3 convolution layer, a 3x 3-64 convolution layer, a maximum pooling layer and a 3x 3-128 convolution layer in sequence;
the sub-network processing procedure for the single antagonistic feature is as follows:
adding and fusing the features processed in the R-G and G-R single-antagonistic convolution processing stage, and processing by a multi-scale enhancement module to obtain a single-antagonistic enhancement result a; adding and fusing the characteristics processed in the B-Y, Y-B single antagonistic convolution processing stage, and processing by a multi-scale enhancement module to obtain a single antagonistic enhancement result B;
and splicing the single antagonism enhancement result a and the single antagonism enhancement result b, and then matching the number of channels through a 3x 3-128 convolution layer to obtain a fusion result a.
The dual antagonistic feature subnetwork comprises: R-G, G-R, B-Y, Y-B dual-antagonistic convolution processing stages, SEM multi-scale enhancement module, 1 × 1-256 convolution layer;
the R-G, G-R, B-Y, Y-B dual-antagonistic convolution processing stages are the same, the input of each stage is divided into two paths, the two paths in each stage respectively pass through a 9 x 9-3 convolution layer, a 9 x 9-64 convolution layer, a 2 x 2 maximum pooling layer, a 9 x 9-128 convolution layer, a 2 x 2 maximum pooling layer and a 9 x 9-256 convolution layer in sequence, and are subtracted after being multiplied by the trainable weight normalized by a sigmoid function to respectively obtain R-G, G-R and B-Y, Y-B dual-antagonistic convolution processing results;
adding and fusing R-G and G-R dual-antagonism convolution processing results, and processing the results through an SEM multi-scale enhancement module to obtain a dual-antagonism enhancement result a; adding and fusing the results of the B-Y, Y-B dual-antagonistic convolution processing, and processing by an SEM multi-scale enhancement module to obtain a dual-antagonistic enhancement result B;
and splicing the double-antagonism enhancement result a and the double-antagonism enhancement result b, and then matching the number of channels through a 1x 1-256 convolution layer to obtain a fusion result b.
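The two-path combination at the end of each dual-antagonistic stage (multiply one path by a trainable weight normalized by a sigmoid function, then subtract) can be sketched as follows. The weight w is a hypothetical learned scalar, and the convolution/pooling stacks preceding this step are omitted.

```python
# Sketch of the dual-antagonistic combination: the second path is scaled
# by sigmoid(w), where w is a trainable scalar (hypothetical value here),
# and subtracted from the first path.
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def double_opponent_combine(path1, path2, w):
    """Subtract the sigmoid-weighted second path from the first path."""
    s = sigmoid(w)
    return [p1 - s * p2 for p1, p2 in zip(path1, path2)]

# With w = 0, sigmoid(w) = 0.5, so half of path2 is subtracted.
resp = double_opponent_combine([1.0, 0.5], [0.4, 0.2], w=0.0)
```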
The V1 output sub-network comprises three groups of 2X 2 maximum pool layers, an SEM multi-scale enhancement module and a 3X 3-512 convolution layer which are connected in sequence;
the edge response of the V1 area is subjected to 2 × 2 maximum pooling for three times in the V1 output sub-network, then the multi-scale features are extracted through an SEM multi-scale enhancement module, and finally the fusion result c is obtained through the number of matching channels of the 3 × 3-512 convolutional layers.
The V2 output sub-network comprises three groups of 2X 2 maximum pool layers, an SEM multi-scale enhancement module and a 3X 3-512 convolution layer which are connected in sequence;
the edge response of the V2 area is subjected to 2 × 2 maximum pooling for three times in the V2 output sub-network, then the multi-scale features are extracted through an SEM multi-scale enhancement module, and finally the fusion result d is obtained through the number of matching channels of 3 × 3-512 convolutional layers.
The decoding network is RDNet, a residual decoder network created by the inventors;
the decoding network has a 4-layer structure composed of unit modules R: the first layer comprises 4 unit modules R, the second layer 3 unit modules R, the third layer 2 unit modules R, and the fourth layer 1 unit module R;
respectively inputting the fusion result d and the fusion result c into a first unit module R of the first layer, and processing by the unit module R to obtain a processing result R1;
the processing result R1 and the fusion result b are respectively input into the second unit module R of the first layer, and are processed by the unit module R to obtain a processing result R2;
inputting the processing result R2 and the fusion result a into the third unit module R of the first layer, and processing by the unit module R to obtain a processing result R3;
the processing result R3 and the output result S1 are respectively inputted into the fourth unit module R of the first layer, and are processed by the unit module R to obtain a processing result R4;
the processing result R1 and the processing result R2 are respectively input into the first unit module R of the second layer, and the processing result R5 is obtained after the processing of the unit module R;
the processing result R5 and the processing result R3 are respectively inputted into the second unit module R of the second layer, and are processed by the unit module R to obtain a processing result R6;
the processing result R6 and the processing result R4 are respectively inputted into the third unit module R of the second layer, and are processed by the unit module R to obtain a processing result R7;
the processing result R5 and the processing result R6 are respectively input into the first unit module R of the third layer, and are processed by the unit module R to obtain a processing result R8;
the processing result R8 and the processing result R7 are respectively inputted into the second unit module R of the third layer, and the processing result R9 is obtained after the processing of the unit module R;
the processing result R8 and the processing result R9 are respectively input into the unit module R of the fourth layer and processed to obtain a processing result R10, and the processing result R10 passes through a 1×1-1 convolution to obtain the decoding output result.
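The 4-3-2-1 wiring described above can be checked with a small dataflow sketch in which each unit module R merely merges the provenance of its two inputs; it confirms that the final module R10 receives information from S1 and all four fusion results. The labels are ours, not the patent's.

```python
# Dataflow sketch of RDNet's 4-3-2-1 layout: each unit module R is
# modelled as a merge of the source labels of its two inputs, so set
# union stands in for the real add/upsample fusion inside a module.

def R(a, b):
    """Stand-in for a unit module R: merge the provenance of two inputs."""
    return a | b

S1, fa, fb, fc, fd = ({"S1"}, {"fuse_a"}, {"fuse_b"}, {"fuse_c"}, {"fuse_d"})

r1 = R(fd, fc)   # first layer (4 modules)
r2 = R(r1, fb)
r3 = R(r2, fa)
r4 = R(r3, S1)
r5 = R(r1, r2)   # second layer (3 modules)
r6 = R(r5, r3)
r7 = R(r6, r4)
r8 = R(r5, r6)   # third layer (2 modules)
r9 = R(r8, r7)
r10 = R(r8, r9)  # fourth layer -> 1x1-1 conv -> decoding output
```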
The unit module R comprises two input channels, wherein a channel 1 inputs an image with a larger size, and a channel 2 inputs an image with a smaller size;
the image in channel 1 is sequentially subjected to 3×3 convolution, ReLU activation, batch normalization and multiplication by a trainable weight normalized by a sigmoid function, giving the channel 1 output result;
the image in channel 2 is sequentially subjected to 3×3 convolution, ReLU activation, batch normalization and multiplication by a trainable weight normalized by a sigmoid function, and the result is then upsampled so that its size matches the channel 1 output, giving the channel 2 output result;
the number of output channels of the 3×3 convolution layers in channel 1 and channel 2 is set to the smaller of the channel counts of the two inputs;
and adding and fusing the output result of the channel 1 and the output result of the channel 2 to obtain the output result of the current unit module R.
In the steps C and D, the formula for calculating the loss is as follows:
the total loss is as follows:
L = Σ_i θ_i · l(P_i, Y) + θ_fuse · l(P_fuse, Y) (2)
wherein θ_i and θ_fuse represent the weights of the losses of the three sub-network outputs and of the final fused prediction respectively, P_i represents the three different outputs, P_fuse represents the final edge prediction, and Y represents the true edge map;
l(P_fuse, Y) is obtained as follows:
for a true edge map Y = (y_j, j = 1, ..., |Y|) with y_j ∈ [0, 1], define Y+ = {y_j | y_j > η} and Y− = {y_j | y_j = 0}, where Y+ and Y− represent the positive and negative sample sets respectively; all other pixels are ignored entirely.
Thus l(P, Y) is calculated as follows:
l(P, Y) = −β Σ_{j∈Y+} log p_j − α Σ_{j∈Y−} log(1 − p_j) (3)
where α = λ·|Y+|/(|Y+| + |Y−|) and β = |Y−|/(|Y+| + |Y−|);
in formula (3), P represents the prediction and p_j the value at pixel j after the sigmoid function; α and β are used to balance the positive and negative samples, and λ is a weight that controls the magnitude of the coefficients.
Inspired by the biological visual pathway and its related visual mechanisms, the method forms a biomimetic contour-enhancing encoder that effectively enhances the encoder's contour features, so that the decoding network obtains richer feature information and contour detection performance is improved.
Drawings
Fig. 1 is an overall structural view of a deep neural network according to embodiment 1 of the present invention;
fig. 2 is an overall structural diagram of a coding network according to embodiment 1 of the present invention;
FIG. 3 is a block diagram of a single antagonistic feature subnetwork of example 1 of the present invention;
FIG. 4 is a block diagram of a dual antagonistic feature subnetwork of example 1 of the present invention;
FIG. 5 is a structural diagram of a V1 export sub-network according to embodiment 1 of the present invention;
FIG. 6 is a structural diagram of a V2 export sub-network according to embodiment 1 of the present invention;
FIG. 7 is a block diagram of a feedforward fusion module according to embodiment 1 of the present invention;
fig. 8 is a structural diagram of a decoding network of embodiment 1 of the present invention;
fig. 9 is a structural diagram of a unit module R in the decoding network according to embodiment 1 of the present invention;
FIG. 10 is a comparison of the contour detection results of embodiment 1 of the present invention with those of reference document 1;
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples.
Example 1
The embodiment provides a contour detection method for learning a biological visual pathway, which comprises the following steps:
the method for detecting the contour of the biological visual pathway comprises the following steps:
A. a deep neural network structure is constructed, and is shown in figures 1-9, and the deep neural network structure is specifically as follows:
the system comprises an encoding network, a decoding network and a feedforward fusion module; the coding network is a network structure combining VGG16 and FENet; the decoding network is an RDNet network;
the VGG16 network takes the pooling layer as a boundary and is divided into stages S1, S2, S3, S4 and S5;
the FENet includes four sub-networks: a single antagonistic feature subnetwork, a dual antagonistic feature subnetwork, a V1 exporter subnetwork, a V2 exporter subnetwork;
B. inputting an original image into a VGG16 network, and performing convolution processing in S1, S2, S3, S4 and S5 stages in sequence to respectively obtain output results S1, S2, S3, S4 and S5, wherein the output result S1 is sent to a decoding network;
processing an original image by a formula 1 to obtain four inputs of R-G, G-R, B-Y and Y-B;
SOi=Cm-ωCn (1)
wherein i represents R-G, G-R, B-Y, Y-B; m and n both represent R, G, B, Y components; omega is a coefficient and takes the value of 0.7;
inputting R-G, G-R, B-Y and Y-B into a single antagonistic characteristic subnetwork for processing to obtain an output result a, adding the output result a and the output result S2 for fusion to obtain a fusion result a, and inputting the fusion result a into a decoding network;
inputting R-G, G-R, B-Y and Y-B into a dual-antagonistic characteristic subnetwork for processing to obtain an output result B, adding the output result B and the output result S3 for fusion to obtain a fusion result B, and inputting the fusion result B into a decoding network;
the edge response of a V1 area is obtained by an original image through an SCO algorithm, the edge response is input into a V1 output sub-network for processing, an output result c is obtained, the output result c and an output result S4 are added and fused, a fusion result c is obtained, and the fusion result c is input into a decoding network;
the edge response of a V2 area is obtained by an original image through an SED algorithm, the edge response is input into a V2 output sub-network for processing, an output result d is obtained, the output result d and an output result S5 are added and fused, a fusion result d is obtained, and the fusion result d is input into a decoding network;
C. respectively inputting the output result a and the output result b into a feedforward fusion module;
the output result S1, the fusion result a, the fusion result b, the fusion result c and the fusion result d are processed by a decoding network to obtain a decoding output result, the decoding output result is input into a feedforward fusion module, and the loss of the decoding output result is calculated;
D. in the feed-forward fusion module, after an output result a and an output result b respectively pass through a 1x1-1 convolution layer, the original resolution is restored through upsampling, the loss of the original resolution is calculated, finally, the original resolution is multiplied by weight, the obtained result and a decoding output result are added and fused to obtain a final output contour, and the loss of the final output contour is calculated.
The SCO algorithm is described in the following documents: K. yang, S. -B.Gao, C. -F.Guo, C. -Y.Li, Y. -J.Li, Boundary detection using double-open and spatial sparse constraint, IEEE Transactions on Image Processing,24(2015) 2565-.
The SED algorithm is described in the following documents: akbania, C.A. Parraga, Feedback and Surround Modulated Boundary Detection, International Journal of Computer Vision,126(2018) 1367-.
The convolution expression related to each step is m × n-k conv + ReLU, wherein m × n represents the size of a convolution kernel, k represents the number of output channels, conv represents a convolution formula, and ReLU represents an activation function; m, n and k are preset values; the convolution expression of the final fusion layer is m x n-k conv.
The VGG16 network is obtained by the original VGG16 network through the following structural adjustment:
the pooling layers between S4 and S5 were removed, and the three convolutional layers of S5 were changed to the void convolutional layers with void rates of 2, 4, and 8 in sequence.
The sub-network of single antagonistic features comprises: R-G, G-R, B-Y, Y-B four groups of single antagonistic convolution treatment stages, SEM multiscale enhancement module, 3X 3-128 convolutional layer;
the R-G, G-R, B-Y and Y-B single antagonistic convolution treatment stages are the same and respectively pass through a 3x 3-3 convolution layer, a 3x 3-64 convolution layer, a maximum pooling layer and a 3x 3-128 convolution layer in sequence;
the sub-network processing procedure for the single antagonistic feature is as follows:
adding and fusing the features processed in the R-G and G-R single-antagonistic convolution processing stage, and processing by a multi-scale enhancement module to obtain a single-antagonistic enhancement result a; adding and fusing the characteristics processed in the B-Y, Y-B single antagonistic convolution processing stage, and processing by a multi-scale enhancement module to obtain a single antagonistic enhancement result B;
and splicing the single antagonism enhancement result a and the single antagonism enhancement result b, and then matching the number of channels through a 3x 3-128 convolution layer to obtain a fusion result a.
The dual antagonistic feature subnetwork comprises: R-G, G-R, B-Y, Y-B dual-antagonistic convolution processing stages, SEM multi-scale enhancement module, 1 × 1-256 convolution layer;
the R-G, G-R, B-Y, Y-B dual-antagonistic convolution processing stages are the same, the input of each stage is divided into two paths, the two paths in each stage respectively pass through a 9 x 9-3 convolution layer, a 9 x 9-64 convolution layer, a 2 x 2 maximum pooling layer, a 9 x 9-128 convolution layer, a 2 x 2 maximum pooling layer and a 9 x 9-256 convolution layer in sequence, and are subtracted after being multiplied by the trainable weight normalized by a sigmoid function to respectively obtain R-G, G-R and B-Y, Y-B dual-antagonistic convolution processing results;
adding and fusing R-G and G-R dual-antagonism convolution processing results, and processing the results through an SEM multi-scale enhancement module to obtain a dual-antagonism enhancement result a; adding and fusing the results of the B-Y, Y-B dual-antagonistic convolution processing, and processing by an SEM multi-scale enhancement module to obtain a dual-antagonistic enhancement result B;
and splicing the double-antagonism enhancement result a and the double-antagonism enhancement result b, and then matching the number of channels through a 1x 1-256 convolution layer to obtain a fusion result b.
The V1 output sub-network comprises three groups of 2X 2 maximum pool layers, an SEM multi-scale enhancement module and a 3X 3-512 convolution layer which are connected in sequence;
the edge response of the V1 area is subjected to 2 × 2 maximum pooling for three times in the V1 output sub-network, then the multi-scale features are extracted through an SEM multi-scale enhancement module, and finally the fusion result c is obtained through the number of matching channels of the 3 × 3-512 convolutional layers.
The V2 output sub-network comprises three groups of 2X 2 maximum pool layers, an SEM multi-scale enhancement module and a 3X 3-512 convolution layer which are connected in sequence;
the edge response of the V2 area is subjected to 2 × 2 maximum pooling for three times in the V2 output sub-network, then the multi-scale features are extracted through an SEM multi-scale enhancement module, and finally the fusion result d is obtained through the number of matching channels of 3 × 3-512 convolutional layers.
The decoding network is a 4-layer structure composed of unit modules R: the first layer comprises 4 unit modules R, the second layer 3 unit modules R, the third layer 2 unit modules R, and the fourth layer 1 unit module R;
the fusion result d and the fusion result c are input into the first unit module R of the first layer and processed to obtain a processing result R1;
the processing result R1 and the fusion result b are input into the second unit module R of the first layer and processed to obtain a processing result R2;
the processing result R2 and the fusion result a are input into the third unit module R of the first layer and processed to obtain a processing result R3;
the processing result R3 and the output result S1 are input into the fourth unit module R of the first layer and processed to obtain a processing result R4;
the processing result R1 and the processing result R2 are input into the first unit module R of the second layer and processed to obtain a processing result R5;
the processing result R5 and the processing result R3 are input into the second unit module R of the second layer and processed to obtain a processing result R6;
the processing result R6 and the processing result R4 are input into the third unit module R of the second layer and processed to obtain a processing result R7;
the processing result R5 and the processing result R6 are input into the first unit module R of the third layer and processed to obtain a processing result R8;
the processing result R8 and the processing result R7 are input into the second unit module R of the third layer and processed to obtain a processing result R9;
the processing result R8 and the processing result R9 are input into the unit module R of the fourth layer and processed to obtain a processing result R10, and the processing result R10 is passed through a 1×1-1 convolution to obtain the decoding output result.
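The wiring above can be summarized as the following dataflow sketch, where R stands for one unit module R, abstracted as a two-input function (the real module performs convolution, weighting and fusion):

```python
def decode(R, d, c, b, a, s1):
    """Dataflow of the 4-layer decoding network.
    R is one unit module R, abstracted as a two-input function;
    d, c, b, a are the fusion results and s1 is the output result S1."""
    r1 = R(d, c)       # layer 1
    r2 = R(r1, b)
    r3 = R(r2, a)
    r4 = R(r3, s1)
    r5 = R(r1, r2)     # layer 2
    r6 = R(r5, r3)
    r7 = R(r6, r4)
    r8 = R(r5, r6)     # layer 3
    r9 = R(r8, r7)
    r10 = R(r8, r9)    # layer 4
    return r10         # followed by a 1x1-1 convolution in the full network
```

With `R` replaced by simple addition, `decode(lambda x, y: x + y, 1, 1, 1, 1, 1)` traces the same connection pattern on scalars, which is a convenient way to verify the wiring.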
The unit module R comprises two input channels: channel 1 receives the larger-sized image and channel 2 the smaller-sized one;
in channel 1, the image is sequentially processed by a 3×3 convolution, ReLU activation and a batch normalization layer, then multiplied by a trainable weight normalized by a sigmoid function, to obtain the channel 1 output result;
in channel 2, the image is sequentially processed by a 3×3 convolution, ReLU activation and a batch normalization layer, then multiplied by a trainable weight normalized by a sigmoid function, and up-sampled so that its size matches the channel 1 output, to obtain the channel 2 output result;
the number of output channels of the 3×3 convolution layers in channel 1 and channel 2 equals the smaller channel count of the two inputs;
the channel 1 output result and the channel 2 output result are added and fused to obtain the output result of the current unit module R.
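A minimal numpy sketch of the fusion performed by unit module R. The 3×3 convolution + ReLU + batch-norm stage is abstracted into a `transform` argument, and nearest-neighbor interpolation stands in for the up-sampling; both are simplifying assumptions, not the trained layers themselves:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unit_module_r(feat_large, feat_small, w1, w2, transform=lambda x: x):
    """Fuse a larger feature map (channel 1) with a smaller one (channel 2).
    transform abstracts the 3x3 conv + ReLU + batch-norm stage;
    w1, w2 are the trainable scalar weights, normalized by a sigmoid."""
    out1 = transform(feat_large) * sigmoid(w1)   # channel 1
    out2 = transform(feat_small) * sigmoid(w2)   # channel 2
    ry = feat_large.shape[0] // out2.shape[0]    # nearest-neighbor up-sampling
    rx = feat_large.shape[1] // out2.shape[1]
    out2 = np.repeat(np.repeat(out2, ry, axis=0), rx, axis=1)
    return out1 + out2                           # additive fusion
```

With both weights at 0 (so each sigmoid gate is 0.5) and an identity transform, two all-ones inputs fuse back to an all-ones map, which makes the gating easy to check.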
In the steps C and D, the loss is calculated as follows.
The total loss is:

L = Σ_{i=1..3} θ_i · l(P_i, Y) + θ_fuse · l(P_fuse, Y)    (2)

In the above formula, θ_i and θ_fuse represent the weights of the losses of the three sub-network outputs and of the final predicted loss, respectively; P_i represents the three different outputs, P_fuse represents the final edge prediction, and Y represents the true edge map.
l(P_fuse, Y) is defined as follows: for a real edge map Y = (y_j, j = 1, ..., |Y|), y_j ∈ [0, 1], define Y⁺ = {y_j | y_j > η} and Y⁻ = {y_j | y_j = 0}, where Y⁺ and Y⁻ represent the positive and negative sample sets, respectively, and all other pixels are ignored.
Thus l(P_fuse, Y) is calculated as follows:

l(P_fuse, Y) = −α Σ_{j∈Y⁻} log(1 − p_j) − β Σ_{j∈Y⁺} log(p_j)    (3)

α = λ · |Y⁺| / (|Y⁺| + |Y⁻|),  β = |Y⁻| / (|Y⁺| + |Y⁻|)    (4)

In formulas (3) and (4), P represents the prediction and p_j represents the value at pixel j after sigmoid activation; α and β are used to balance the positive and negative samples, and λ is a weight that controls the magnitude of the coefficient α.
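A numpy sketch of the class-balanced cross-entropy l(P, Y). The RCF-style balancing used here (α scaling the negative term and β the positive term, with λ scaling α) is an assumption consistent with the stated roles of α, β and λ; the threshold η and λ values are illustrative:

```python
import numpy as np

def balanced_bce(p, y, eta=0.5, lam=1.1):
    """Class-balanced cross-entropy l(P, Y).
    p: sigmoid-activated prediction; y: ground-truth edge map in [0, 1].
    Pixels with 0 < y_j <= eta are ignored (neither positive nor negative)."""
    pos = y > eta                    # positive sample set Y+
    neg = y == 0                     # negative sample set Y-
    n_pos, n_neg = pos.sum(), neg.sum()
    alpha = lam * n_pos / (n_pos + n_neg)   # weights the negative term
    beta = n_neg / (n_pos + n_neg)          # weights the positive term
    eps = 1e-12                      # numerical safety for log(0)
    return (-alpha * np.log(1.0 - p[neg] + eps).sum()
            - beta * np.log(p[pos] + eps).sum())
```

A perfect prediction gives a loss near zero, while any uncertainty on labeled pixels increases it, which is the behavior the balancing terms preserve regardless of the positive/negative ratio.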
Example 2
The edge detection results of the method of this embodiment are compared with those of the method of the following document 1.
Document 1: S. Xie and Z. Tu, "Holistically-nested edge detection," in IEEE International Conference on Computer Vision, 2015, pp. 1395-1403.
The parameters used for document 1 are those given in the original text, which are reported to be optimal for that model.
For quantitative performance evaluation of the final contour, we used the same performance measure as document 1, the F-measure:

F = 2PR / (P + R)

where P represents precision and R represents recall; the larger the value of F, the better the performance.
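The F-measure is the harmonic mean of precision and recall; a one-line sketch:

```python
def f_measure(precision, recall):
    """F-measure: harmonic mean of precision P and recall R."""
    return 2.0 * precision * recall / (precision + recall)

print(f_measure(0.8, 0.6))  # ~0.6857
```

Because it is a harmonic mean, F is dragged toward the weaker of the two quantities, so a detector cannot score well by maximizing only one of them.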
Fig. 10 shows two natural images selected from the Berkeley Segmentation Data Set (BSDS500), the corresponding ground-truth contours, the contours detected by the method of document 1, and the contours detected by the method of embodiment 1.
The experimental results show that the detection method of embodiment 1 outperforms that of document 1.
Claims (7)
1. A contour detection method for learning biological visual pathways is characterized by comprising the following steps:
A. constructing a deep neural network structure, wherein the deep neural network structure is as follows:
the system comprises an encoding network, a decoding network and a feedforward fusion module; the coding network is a network structure combining VGG16 and FENet;
the VGG16 network takes the pooling layer as a boundary and is divided into stages S1, S2, S3, S4 and S5;
the FENet includes four sub-networks: a single-antagonistic feature sub-network, a dual-antagonistic feature sub-network, a V1 output sub-network, and a V2 output sub-network;
B. inputting an original image into a VGG16 network, and performing convolution processing in S1, S2, S3, S4 and S5 stages in sequence to respectively obtain output results S1, S2, S3, S4 and S5, wherein the output result S1 is sent to a decoding network;
processing an original image by a formula 1 to obtain four inputs of R-G, G-R, B-Y and Y-B;
SO_i = C_m − ω·C_n    (1)
wherein i represents R-G, G-R, B-Y, or Y-B; m and n represent R, G, B, Y components; ω is a coefficient with a value of 0.7;
inputting R-G, G-R, B-Y and Y-B into a single antagonistic characteristic subnetwork for processing to obtain an output result a, adding the output result a and the output result S2 for fusion to obtain a fusion result a, and inputting the fusion result a into a decoding network;
inputting R-G, G-R, B-Y and Y-B into a dual-antagonistic characteristic subnetwork for processing to obtain an output result B, adding the output result B and the output result S3 for fusion to obtain a fusion result B, and inputting the fusion result B into a decoding network;
the edge response of a V1 area is obtained by an original image through an SCO algorithm, the edge response is input into a V1 output sub-network for processing, an output result c is obtained, the output result c and an output result S4 are added and fused, a fusion result c is obtained, and the fusion result c is input into a decoding network;
the edge response of a V2 area is obtained by an original image through an SED algorithm, the edge response is input into a V2 output sub-network for processing, an output result d is obtained, the output result d and an output result S5 are added and fused, a fusion result d is obtained, and the fusion result d is input into a decoding network;
C. respectively inputting the output result a and the output result b into a feedforward fusion module;
the output result S1, the fusion result a, the fusion result b, the fusion result c and the fusion result d are processed by a decoding network to obtain a decoding output result, the decoding output result is input into a feedforward fusion module, and the loss of the decoding output result is calculated;
D. in the feed-forward fusion module, the output result a and the output result b each pass through a 1×1-1 convolution layer and are restored to the original resolution by up-sampling, and their losses are calculated; each is then multiplied by a weight, the weighted results are added to the decoding output result to obtain the final output contour, and the loss of the final output contour is calculated;
the single-antagonistic feature sub-network comprises: four single-antagonistic convolution processing stages (R-G, G-R, B-Y, Y-B), an SEM multi-scale enhancement module, and a 3×3-128 convolution layer;
the R-G, G-R, B-Y and Y-B single-antagonistic convolution processing stages are identical, each passing sequentially through a 3×3-3 convolution layer, a 3×3-64 convolution layer, a max pooling layer and a 3×3-128 convolution layer;
the sub-network processing procedure for the single antagonistic feature is as follows:
adding and fusing the features processed in the R-G and G-R single-antagonistic convolution processing stage, and processing by a multi-scale enhancement module to obtain a single-antagonistic enhancement result a; adding and fusing the characteristics processed in the B-Y, Y-B single antagonistic convolution processing stage, and processing by a multi-scale enhancement module to obtain a single antagonistic enhancement result B;
splicing the single antagonism enhancement result a and the single antagonism enhancement result b, and then matching the number of channels through a 3x 3-128 convolution layer to obtain a fusion result a;
the dual-antagonistic feature sub-network comprises: R-G, G-R, B-Y, Y-B dual-antagonistic convolution processing stages, an SEM multi-scale enhancement module, and a 1×1-256 convolution layer;
the R-G, G-R, B-Y, Y-B dual-antagonistic convolution processing stages are identical; the input of each stage is split into two paths, each path passing sequentially through a 9×9-3 convolution layer, a 9×9-64 convolution layer, a 2×2 max pooling layer, a 9×9-128 convolution layer, a 2×2 max pooling layer and a 9×9-256 convolution layer; the two paths are each multiplied by a trainable weight normalized by a sigmoid function and then subtracted to obtain the R-G, G-R and B-Y, Y-B dual-antagonistic convolution processing results, respectively;
adding and fusing R-G and G-R dual-antagonism convolution processing results, and processing the results through an SEM multi-scale enhancement module to obtain a dual-antagonism enhancement result a; adding and fusing the results of the B-Y, Y-B dual-antagonistic convolution processing, and processing by an SEM multi-scale enhancement module to obtain a dual-antagonistic enhancement result B;
and splicing the double-antagonism enhancement result a and the double-antagonism enhancement result b, and then matching the number of channels through a 1x 1-256 convolution layer to obtain a fusion result b.
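The single-opponent inputs of equation (1) in claim 1 can be sketched as follows. The definition of the yellow component as Y = (R + G) / 2 is an assumption (common in color-opponency models), since the claim does not spell it out:

```python
import numpy as np

def single_opponent_inputs(img, omega=0.7):
    """Equation (1): SO_i = C_m - omega * C_n for i in {R-G, G-R, B-Y, Y-B}.
    img: (H, W, 3) float array with R, G, B channels.
    The yellow channel Y = (R + G) / 2 is an assumed definition."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    Y = (R + G) / 2.0
    return {
        "R-G": R - omega * G,
        "G-R": G - omega * R,
        "B-Y": B - omega * Y,
        "Y-B": Y - omega * B,
    }
```

For a pure-red image (R = 1, G = B = 0) the R-G channel responds fully while B-Y goes negative, illustrating how each opponent pair emphasizes one color against its antagonist.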
2. The method for detecting the contour of a learned biological visual pathway as set forth in claim 1, wherein:
the VGG16 network is obtained by the original VGG16 network through the following structural adjustment:
the pooling layers between S4 and S5 were removed, and the three convolutional layers of S5 were changed to the void convolutional layers with void rates of 2, 4, and 8 in sequence.
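The void (atrous/dilated) convolutions of claim 2 enlarge the receptive field without adding parameters by spacing the kernel taps apart. An illustrative sketch of kernel dilation (the function name is ours, not part of the patent):

```python
import numpy as np

def dilate_kernel(k, rate):
    """Spread an n x n kernel's taps apart with zeros in between:
    a 3x3 kernel at rate 2 covers a 5x5 window, at rate 4 a 9x9 window,
    while keeping the same 9 trainable weights."""
    n = k.shape[0]
    size = (n - 1) * rate + 1
    out = np.zeros((size, size))
    out[::rate, ::rate] = k
    return out

print(dilate_kernel(np.ones((3, 3)), 2).shape)  # (5, 5)
```

At the rates 2, 4 and 8 used in S5, a 3×3 kernel covers 5×5, 9×9 and 17×17 windows respectively, which compensates for the removed pooling layer between S4 and S5.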
3. The method for detecting the contour of a learned biological visual pathway as set forth in claim 1, wherein:
the V1 output sub-network comprises three groups of 2×2 max pooling layers, an SEM multi-scale enhancement module, and a 3×3-512 convolutional layer connected in sequence;
the edge response of the V1 area undergoes three 2×2 max poolings in the V1 output sub-network, multi-scale features are then extracted by an SEM multi-scale enhancement module, and finally a 3×3-512 convolutional layer matches the number of channels to obtain the fusion result c.
4. The method for detecting the contour of a learned biological visual pathway as set forth in claim 1, wherein:
the V2 output sub-network comprises three groups of 2×2 max pooling layers, an SEM multi-scale enhancement module, and a 3×3-512 convolutional layer connected in sequence;
the edge response of the V2 area undergoes three 2×2 max poolings in the V2 output sub-network, multi-scale features are then extracted by an SEM multi-scale enhancement module, and finally a 3×3-512 convolutional layer matches the number of channels to obtain the fusion result d.
5. The method for detecting the contour of a learned biological visual pathway as set forth in claim 1, wherein:
the decoding network is of a 4-layer structure consisting of a plurality of unit modules R, the first layer comprises 4 unit modules R, the second layer comprises 3 unit modules, the third layer comprises 2 unit modules R, and the fourth layer comprises 1 unit module R;
respectively inputting the fusion result d and the fusion result c into a first unit module R of the first layer, and processing by the unit module R to obtain a processing result R1;
inputting the processing result R1 and the fusion result b into the second unit module R of the first layer, and processing by the unit module R to obtain a processing result R2;
inputting the processing result R2 and the fusion result a into the third unit module R of the first layer, and processing by the unit module R to obtain a processing result R3;
the processing result R3 and the output result S1 are respectively inputted into the fourth unit module R of the first layer, and are processed by the unit module R to obtain a processing result R4;
the processing result R1 and the processing result R2 are respectively input into the first unit module R of the second layer, and the processing result R5 is obtained after the processing of the unit module R;
the processing result R5 and the processing result R3 are respectively inputted into the second unit module R of the second layer, and are processed by the unit module R to obtain a processing result R6;
the processing result R6 and the processing result R4 are respectively inputted into the third unit module R of the second layer, and are processed by the unit module R to obtain a processing result R7;
the processing result R5 and the processing result R6 are respectively input into the first unit module R of the third layer, and are processed by the unit module R to obtain a processing result R8;
the processing result R8 and the processing result R7 are respectively inputted into the second unit module R of the third layer, and the processing result R9 is obtained after the processing of the unit module R;
the processing result R8 and the processing result R9 are input into the unit module R of the fourth layer and processed to obtain a processing result R10, and the processing result R10 is passed through a 1×1-1 convolution to obtain the decoding output result.
6. The method for detecting the profile of a learned biological visual pathway as set forth in claim 5, wherein:
the unit module R comprises two input channels;
in channel 1, the image is sequentially processed by a 3×3 convolution, ReLU activation and a batch normalization layer, then multiplied by a trainable weight normalized by a sigmoid function, to obtain the channel 1 output result;
in channel 2, the image is sequentially processed by a 3×3 convolution, ReLU activation and a batch normalization layer, then multiplied by a trainable weight normalized by a sigmoid function, and up-sampled so that its size matches the channel 1 output, to obtain the channel 2 output result;
the number of output channels of the 3×3 convolution layers in channel 1 and channel 2 equals the smaller channel count of the two inputs;
and adding and fusing the output result of the channel 1 and the output result of the channel 2 to obtain the output result of the current unit module R.
7. The method for detecting the profile of a learned biological visual pathway as set forth in claim 5, wherein:
in the steps C and D, the loss is calculated as follows:
the total loss is:

L = Σ_{i=1..3} θ_i · l(P_i, Y) + θ_fuse · l(P_fuse, Y)    (2)

in the above formula, θ_i and θ_fuse represent the weights of the losses of the three sub-network outputs and of the final predicted loss, respectively; P_i represents the three different outputs, P_fuse represents the final edge prediction, and Y represents the true edge map;
l(P_fuse, Y) is defined as follows:
for a real edge map Y = (y_j, j = 1, ..., |Y|), y_j ∈ [0, 1], define Y⁺ = {y_j | y_j > η} and Y⁻ = {y_j | y_j = 0}; Y⁺ and Y⁻ represent the positive and negative sample sets, respectively, and all other pixels are ignored;
thus l(P_fuse, Y) is calculated as follows:

l(P_fuse, Y) = −α Σ_{j∈Y⁻} log(1 − p_j) − β Σ_{j∈Y⁺} log(p_j)    (3)

α = λ · |Y⁺| / (|Y⁺| + |Y⁻|),  β = |Y⁻| / (|Y⁺| + |Y⁻|)    (4)

in the formulas (3) and (4), P represents the prediction and p_j represents the value at pixel j after sigmoid activation; α and β are used to balance the positive and negative samples, and λ is a weight that controls the magnitude of the coefficient α.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110784619.6A CN113538485B (en) | 2021-08-25 | 2021-08-25 | Contour detection method for learning biological visual pathway |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113538485A CN113538485A (en) | 2021-10-22 |
CN113538485B true CN113538485B (en) | 2022-04-22 |
Family
ID=78098542
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110784619.6A Active CN113538485B (en) | 2021-08-25 | 2021-08-25 | Contour detection method for learning biological visual pathway |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113538485B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114463360B (en) * | 2021-10-27 | 2024-03-15 | 广西科技大学 | Contour detection method based on bionic characteristic enhancement network |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2776988A1 (en) * | 2003-02-06 | 2004-08-26 | Dolby Laboratories Licensing Corporation | Conversion of synthesized spectral components for encoding and low-complexity transcoding |
CN109509192A (en) * | 2018-10-18 | 2019-03-22 | 天津大学 | Merge the semantic segmentation network in Analysis On Multi-scale Features space and semantic space |
CN109949334A (en) * | 2019-01-25 | 2019-06-28 | 广西科技大学 | Profile testing method based on the connection of deeply network residual error |
CN110222628A (en) * | 2019-06-03 | 2019-09-10 | 电子科技大学 | A kind of face restorative procedure based on production confrontation network |
CN111325762A (en) * | 2020-01-21 | 2020-06-23 | 广西科技大学 | Contour detection method based on dense connection decoding network |
CN113222328A (en) * | 2021-03-25 | 2021-08-06 | 中国科学技术大学先进技术研究院 | Air quality monitoring equipment point arrangement and site selection method based on road section pollution similarity |
Non-Patent Citations (3)
Title |
---|
Design and Implementation of Viterbi Encoding and Decoding Algorithm on FPGA; M. Irfan; 2005 International Conference on Microelectronics; 2006-02-13; 234-239 *
Fingerprint image binarization method based on multi-channel Gabor filtering; Lin Chuan; Science Technology and Engineering; 2013-08-08 (No. 22); 6487-6491 *
Research on high-precision segmentation of typical targets in high-resolution remote sensing images; Wang Yu; China Doctoral Dissertations Full-text Database, Engineering Science and Technology II; 2020-07-15; C028-9 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109299274B (en) | Natural scene text detection method based on full convolution neural network | |
CN107169421B (en) | Automobile driving scene target detection method based on deep convolutional neural network | |
CN108960261B (en) | Salient object detection method based on attention mechanism | |
CN111340814B (en) | RGB-D image semantic segmentation method based on multi-mode self-adaptive convolution | |
CN110363290B (en) | Image recognition method, device and equipment based on hybrid neural network model | |
CN107180430A (en) | A kind of deep learning network establishing method and system suitable for semantic segmentation | |
CN112070044B (en) | Video object classification method and device | |
CN114937204B (en) | Neural network remote sensing change detection method for lightweight multi-feature aggregation | |
CN111161244B (en) | Industrial product surface defect detection method based on FCN + FC-WXGboost | |
CN113570508A (en) | Image restoration method and device, storage medium and terminal | |
CN110119805B (en) | Convolutional neural network algorithm based on echo state network classification | |
CN112101262B (en) | Multi-feature fusion sign language recognition method and network model | |
Hou et al. | Handwritten digit recognition based on depth neural network | |
CN112036260A (en) | Expression recognition method and system for multi-scale sub-block aggregation in natural environment | |
CN114048822A (en) | Attention mechanism feature fusion segmentation method for image | |
CN113538485B (en) | Contour detection method for learning biological visual pathway | |
CN116052218B (en) | Pedestrian re-identification method | |
US20230316699A1 (en) | Image semantic segmentation algorithm and system based on multi-channel deep weighted aggregation | |
CN109766918A (en) | Conspicuousness object detecting method based on the fusion of multi-level contextual information | |
Xue et al. | Research on edge detection operator of a convolutional neural network | |
CN112927236B (en) | Clothing analysis method and system based on channel attention and self-supervision constraint | |
CN112418070B (en) | Attitude estimation method based on decoupling ladder network | |
CN113807356A (en) | End-to-end low visibility image semantic segmentation method | |
CN107729885A (en) | A kind of face Enhancement Method based on the study of multiple residual error | |
CN113538484B (en) | Deep-refinement multiple-information nested edge detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract |
Application publication date: 20211022 Assignee: GUANGXI YINGTENG EDUCATION TECHNOLOGY Co.,Ltd. Assignor: GUANGXI University OF SCIENCE AND TECHNOLOGY Contract record no.: X2023980053979 Denomination of invention: Contour detection methods for learning biological visual pathways Granted publication date: 20220422 License type: Common License Record date: 20231226 |