CN113538484B - Deep-refinement multiple-information nested edge detection method - Google Patents

Deep-refinement multiple-information nested edge detection method

Info

Publication number
CN113538484B
Authority
CN
China
Prior art keywords: image, convolution, images, information extraction, combined
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110746455.8A
Other languages
Chinese (zh)
Other versions
CN113538484A (en)
Inventor
林川
王蕤兴
张贞光
陈永亮
谢智星
吴海晨
李福章
潘勇才
韦艳霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Science and Technology
Original Assignee
Guangxi University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Science and Technology filed Critical Guangxi University of Science and Technology
Priority to CN202110746455.8A priority Critical patent/CN113538484B/en
Publication of CN113538484A publication Critical patent/CN113538484A/en
Application granted granted Critical
Publication of CN113538484B publication Critical patent/CN113538484B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Abstract

The invention aims to provide a deep-refinement multiple-information nested edge detection method, which comprises the following steps: constructing a deep neural network structure consisting of an encoding network and a decoding network. The encoding network is a VGG16 network from which all fully connected layers and the pool5 pooling layer are removed, retaining only the main body of the VGG16 network. The decoding network is divided into three layers: the first layer comprises a compression module, a reshaping module and an adjustment module; the second layer comprises an information extraction and fusion module a, an information extraction and fusion module b, an information extraction and fusion module c and an information extraction and fusion module d; the third layer is a lateral contour subdivision network module.

Description

Deep-refinement multiple-information nested edge detection method
Technical Field
The invention relates to the field of image processing, and in particular to a deep-refinement multiple-information nested edge detection method.
Background
Contour detection is an important component of image processing and computer vision. Correctly detecting object contours against complex backgrounds is an important and difficult task. Among conventional image processing methods, the Canny operator, active contour models, machine-learning-based contour models and the like are used for contour detection. These methods mainly rely on brightness, color and contrast information in the image and have difficulty distinguishing object contours from other cluttered boundaries. Therefore, when the contrast in an image varies greatly and there is considerable background interference, such methods struggle to obtain satisfactory results. These algorithms also require considerable domain expertise and carefully designed processing pipelines to convert raw image data into suitable representations or feature vectors for constructing a contour classifier or contour model. In recent years, deep learning has become an efficient way to learn feature representations automatically from raw data. With deep learning tools, in particular convolutional neural networks, the contour detection task has seen remarkable performance improvements.
In recent years, research related to deep learning has formed a relatively complete system. HED visualizes the detection results of the five side outputs of the VGG16 network and finds that the contours produced by the shallow layers are poor: they contain a large amount of texture and noise, the error rate increases during propagation, and the experimental results are strongly affected. Conventional deep learning algorithms simply add or fuse the convolution layers directly and lack the theoretical support of a biological visual mechanism, while biomimetic algorithms describe cell responses with mathematical models and are not sufficient to simulate the complex transmission between layers in the visual mechanism.
Disclosure of Invention
The invention aims to provide a deep-refinement multiple-information nested edge detection method which overcomes the above shortcomings of the prior art and makes the detected contours clearer and more accurate.
The technical scheme of the invention is as follows:
the deep-refinement multiple-information nested edge detection method comprises the following steps:
A. constructing a deep neural network structure comprising an encoding network and a decoding network, with the following specific structure:
the encoding network is a VGG16 network from which all fully connected layers and the pool5 pooling layer are removed, retaining only the main body of the VGG16 network; the decoding network is divided into three layers, wherein the first layer comprises a compression module, a reshaping module and an adjustment module; the second layer comprises an information extraction and fusion module a, an information extraction and fusion module b, an information extraction and fusion module c and an information extraction and fusion module d; the third layer is a lateral contour subdivision network module;
B. the original image is processed by the convolution layers of the VGG16 network to obtain the 5 side output images of VGG16, and the 5 side output images of VGG16 are then input into the compression module and the information extraction and fusion module a respectively;
in the information extraction and fusion module a, the 1st to 5th side output images are each convolved again so that the numbers of output channels are consistent, yielding re-convolved images of the 1st to 5th side output images; then, taking the re-convolved image of the 1st side output image as the reference, the resolutions of the re-convolved images of the 2nd to 5th side output images are unified to obtain resolution-adjusted images of the 2nd to 5th side output images; the re-convolved image of the 1st side output image is fused with the resolution-adjusted images of the 2nd to 5th side output images to obtain the information extraction fused image a, which is input into the lateral contour subdivision network module;
C. in the compression module: a secondary convolution is applied to the 1st to 5th side output images, using 3 × 3 convolution for the layer-1 and layer-2 convolution images and 1 × 1 convolution for the layer-3, layer-4 and layer-5 convolution images, so that the number of feature channels is unified; the layer-1 to layer-5 convolution images after the secondary convolution are combined pairwise in sequence to form 4 groups; in each group, the higher-resolution image is max-pooled down to the same resolution as the lower-resolution image and the two are then added, yielding four primary combined images, namely the 1-2, 2-3, 3-4 and 4-5 combined images, which are input into the reshaping module and the information extraction and fusion module b respectively;
in the information extraction and fusion module b, the 1-2, 2-3, 3-4 and 4-5 combined images are each convolved again so that the numbers of output channels are consistent, yielding the 1-2, 2-3, 3-4 and 4-5 re-convolved images; then, taking the 1-2 re-convolved image as the reference, the resolutions of the 2-3, 3-4 and 4-5 re-convolved images are unified to obtain the 2-3, 3-4 and 4-5 resolution-adjusted images; the 1-2 re-convolved image is fused with the 2-3, 3-4 and 4-5 resolution-adjusted images to obtain the information extraction fused image b, which is input into the lateral contour subdivision network module;
D. the reshaping module has two layers, and the processing in the first layer is as follows: three parallel convolutions with kernel sizes 1 × 1, 3 × 3 and 5 × 5 are applied to the 1-2 and 2-3 combined images; the three parallel convolution results of the 1-2 combined image are fused to obtain the fused 1-2 combined image; the three parallel convolution results of the 2-3 combined image are fused to obtain the fused 2-3 combined image; a 1 × 1 convolution is applied to the 3-4 and 4-5 combined images; the fused 1-2 combined image, the fused 2-3 combined image, the convolved 3-4 combined image and the convolved 4-5 combined image are combined pairwise in sequence to form 3 groups; in each group, the higher-resolution image is max-pooled down to the same resolution as the lower-resolution image and the two are added, yielding the 1-3, 2-4 and 3-5 combined images, which are input into the second layer and the information extraction and fusion module c respectively;
the processing in the second layer is as follows: three parallel convolutions with kernel sizes 1 × 1, 3 × 3 and 5 × 5 are applied to the 1-3 and 2-4 combined images; the three parallel convolution results of the 1-3 combined image are fused to obtain the fused 1-3 combined image; the three parallel convolution results of the 2-4 combined image are fused to obtain the fused 2-4 combined image; a 1 × 1 convolution is applied to the 3-5 combined image; the resolutions of the fused 1-3 combined image, the fused 2-4 combined image and the convolved 3-5 combined image are unified by max-pooling the higher-resolution image down to the same resolution as the lower-resolution image; the images are then combined and added to obtain the 1-4 and 2-5 combined images, which are input into the adjustment module;
in the information extraction and fusion module c, the 1-3, 2-4 and 3-5 combined images are each convolved again so that the numbers of output channels are consistent, yielding the 1-3, 2-4 and 3-5 re-convolved images; then, taking the 1-3 re-convolved image as the reference, the resolutions of the 2-4 and 3-5 re-convolved images are unified to obtain the 2-4 and 3-5 resolution-adjusted images; the 1-3 re-convolved image is fused with the 2-4 and 3-5 resolution-adjusted images to obtain the information extraction fused image c, which is input into the lateral contour subdivision network module;
E. in the adjustment module, the resolutions of the 1-4 and 2-5 combined images are unified by converting the lower-resolution image to the same resolution as the higher-resolution image with bilinear interpolation; the images are then combined and added to obtain the 1-5 combined image, which is input into the information extraction and fusion module d;
in the information extraction and fusion module d, the 1-5 combined image is convolved again to obtain the 1-5 re-convolved image, which is input into the lateral contour subdivision network module;
F. in the lateral contour subdivision network module, the following operations are carried out:
F1, the information extraction fused images a, b, c and d are each convolved and activated and then multiplied by adaptive random weights to obtain the primary weight images a, b, c and d; the four images are combined pairwise in sequence to form 3 groups; in each group, the lower-resolution image is upsampled to the same resolution as the higher-resolution image with bilinear interpolation and the two are then added, yielding the primary added weight images a, b and c;
F2, the primary added weight images a, b and c are each convolved and activated and then multiplied by adaptive random weights to obtain the secondary weight images a, b and c; the three images are combined pairwise in sequence to form 2 groups; in each group, the lower-resolution image is upsampled to the same resolution as the higher-resolution image with bilinear interpolation and the two are then added, yielding the secondary added weight images a and b;
F3, the secondary added weight images a and b are each convolved and activated and then multiplied by adaptive random weights to obtain the tertiary weight images a and b; the resolutions of the two images are unified by upsampling the lower-resolution image to the same resolution as the higher-resolution image with bilinear interpolation, and the two are added; finally, a 1 × 1 convolution reduces the number of feature channels to 1, and the output is the final edge image.
The convolution expression involved in each step is m × n-k conv + relu, where m × n denotes the size of the convolution kernel, k denotes the number of output channels, conv denotes the convolution operation, and relu denotes the activation function; m, n and k are preset values; the convolution expression of the final fusion layer is m × n-k conv.
The VGG16 network comprises 5 stages, stage I to stage V, each of which contains more than one convolution layer;
the input of the first convolution layer of stage I is the original image, and the input of every other convolution layer of stage I is the output of the preceding convolution layer of that stage; in stages II-V, except for the first convolution layer of each stage, the input of every convolution layer is the output of the preceding convolution layer; the output of the last convolution layer of stages I to IV is, on the one hand, max-pooled and used as the input of the first convolution layer of the next stage, and, on the other hand, input into the compression module and the information extraction and fusion module a; the output of the last convolution layer of stage V is max-pooled and then input into the compression module and the information extraction and fusion module a;
the convolutions in the VGG16 network are all 3 × 3 convolutions.
The re-convolution in steps B to E is a 1 × 1 convolution.
In the step C, the number of the unified feature channels is 200.
In the steps B-E, the number of the feature channels of the information extraction fusion image a is 64, the number of the feature channels of the information extraction fusion image B is 100, the number of the feature channels of the information extraction fusion image c is 200, and the number of the feature channels of the information extraction fusion image d is 300.
In steps B-E, the resolutions in the information extraction and fusion module a, the information extraction and fusion module b, the information extraction and fusion module c and the information extraction and fusion module d are unified as follows: the lower-resolution output map is converted to the same resolution as the higher-resolution output map using bilinear interpolation.
In steps F1 to F3, the convolution is a 3 × 3 convolution, the activation uses the following ReLU function, and the weight parameters of the adaptive random weights range from 0 to 1;
ReLU(x) = max(0, x), i.e. the output is x for x > 0 and 0 otherwise.
the maximum pooling is 2 x 2 maximum pooling.
The invention provides an edge detection method based on a novel decoding network, which is applicable to most backbone networks and yields good results. On the NYUD-V2 dataset, with VGG16 as the encoding network, an ODS F-score of 0.773 was obtained, which is 1.6% higher than LRCNet. The method provides a new idea for subsequent contour detection research and is further beneficial to other vision tasks.
Drawings
Fig. 1 is a network diagram of VGG16 provided in embodiment 1 of the present invention;
FIG. 2 is a graph comparing the contour detection results of embodiment 1 of the present invention with those of document 1;
in FIG. 1, "3 × 3-64", "3 × 3-128" and the like indicate the parameters of the convolution kernels, where "3 × 3" indicates the kernel size and "-64", "-128" and the like indicate the number of convolution kernels, that is, the number of output feature channels is 64, 128, and so on.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples.
Example 1
The deep-refinement multiple-information nested edge detection method provided by this embodiment comprises the following steps:
A. constructing a deep neural network structure comprising an encoding network and a decoding network, with the following specific structure:
the encoding network is a VGG16 network from which all fully connected layers and the pool5 pooling layer are removed, retaining only the main body of the VGG16 network; the decoding network is divided into three layers, wherein the first layer comprises a compression module, a reshaping module and an adjustment module; the second layer comprises an information extraction and fusion module a, an information extraction and fusion module b, an information extraction and fusion module c and an information extraction and fusion module d; the third layer is a lateral contour subdivision network module;
B. the original image is processed by the convolution layers of the VGG16 network to obtain the 5 side output images of VGG16, and the 5 side output images of VGG16 are then input into the compression module and the information extraction and fusion module a respectively;
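For illustration only, the encoding network described in steps A-B can be sketched in PyTorch (the framework used in example 2 below). This is a minimal sketch rather than the patented implementation: the class and variable names are invented, the channel widths follow the standard VGG16 configuration, and the pre-pooling feature map of each stage is returned as its side output.

```python
import torch.nn as nn

class VGG16Encoder(nn.Module):
    """VGG16 convolutional body without the fully connected layers and pool5."""
    def __init__(self):
        super().__init__()
        # (in_channels, out_channels, number of 3 x 3 conv layers) per stage I-V
        cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 3), (256, 512, 3), (512, 512, 3)]
        self.stages = nn.ModuleList()
        for in_ch, out_ch, n_convs in cfg:
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                           nn.ReLU(inplace=True)]
            self.stages.append(nn.Sequential(*layers))
        self.pool = nn.MaxPool2d(2, 2)  # 2 x 2 max pooling between stages

    def forward(self, x):
        side_outputs = []
        for idx, stage in enumerate(self.stages):
            x = stage(x)
            side_outputs.append(x)      # side output of stage I..V
            if idx < len(self.stages) - 1:
                x = self.pool(x)        # stages I-IV are pooled before the next stage
        return side_outputs             # 5 feature maps at decreasing resolution
```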
in the information extraction and fusion module a, the 1st to 5th side output images are each convolved again so that the numbers of output channels are consistent, yielding re-convolved images of the 1st to 5th side output images; then, taking the re-convolved image of the 1st side output image as the reference, the resolutions of the re-convolved images of the 2nd to 5th side output images are unified to obtain resolution-adjusted images of the 2nd to 5th side output images; the re-convolved image of the 1st side output image and the resolution-adjusted images of the 2nd to 5th side output images are fused through a concat function to obtain the information extraction fused image a, which is input into the lateral contour subdivision network module;
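A hedged PyTorch sketch of the processing in information extraction and fusion module a follows; the intermediate channel count and the class name are illustrative assumptions, while the re-convolution to a common channel count, the bilinear resolution adjustment towards the 1st side output and the concat fusion follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoExtractFusionA(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512, 512), mid_channels=16, out_channels=64):
        super().__init__()
        # re-convolution: bring every side output to the same number of channels
        self.reconvs = nn.ModuleList(nn.Conv2d(c, mid_channels, 1) for c in in_channels)
        self.fuse = nn.Conv2d(mid_channels * len(in_channels), out_channels, 1)

    def forward(self, side_outputs):
        ref = self.reconvs[0](side_outputs[0])      # re-convolved 1st side output (reference)
        target_size = ref.shape[2:]
        adjusted = [ref]
        for conv, feat in zip(self.reconvs[1:], side_outputs[1:]):
            y = conv(feat)
            # resolution adjustment: bilinear interpolation to the reference size
            adjusted.append(F.interpolate(y, size=target_size, mode="bilinear", align_corners=False))
        # concat fusion -> information extraction fused image a
        return self.fuse(torch.cat(adjusted, dim=1))
```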
C. in the compression module: a secondary convolution is applied to the 1st to 5th side output images, using 3 × 3 convolution for the layer-1 and layer-2 convolution images and 1 × 1 convolution for the layer-3, layer-4 and layer-5 convolution images, so that the number of feature channels is unified; the layer-1 to layer-5 convolution images after the secondary convolution are combined pairwise in sequence to form 4 groups; in each group, the higher-resolution image is 2 × 2 max-pooled down to the same resolution as the lower-resolution image and the two are then added, yielding four primary combined images, namely the 1-2, 2-3, 3-4 and 4-5 combined images, which are input into the reshaping module and the information extraction and fusion module b respectively;
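The compression module can be sketched as below. The sketch assumes that adjacent side outputs differ in resolution by exactly a factor of 2, so that one 2 x 2 max pooling of the higher-resolution map matches the lower resolution; the class name is an assumption, while the 3 × 3 / 1 × 1 convolution split and the 200-channel unification follow the text.

```python
import torch.nn as nn

class CompressionModule(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512, 512), channels=200):
        super().__init__()
        # secondary convolution: 3 x 3 for layers 1-2, 1 x 1 for layers 3-5
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels[0], channels, 3, padding=1),
            nn.Conv2d(in_channels[1], channels, 3, padding=1),
            nn.Conv2d(in_channels[2], channels, 1),
            nn.Conv2d(in_channels[3], channels, 1),
            nn.Conv2d(in_channels[4], channels, 1),
        ])
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, side_outputs):
        feats = [conv(x) for conv, x in zip(self.convs, side_outputs)]
        combined = []
        for hi, lo in zip(feats[:-1], feats[1:]):
            # pool the higher-resolution map down to the lower resolution, then add
            combined.append(self.pool(hi) + lo)
        return combined  # 1-2, 2-3, 3-4 and 4-5 combined images
```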
in the information extraction and fusion module b, the 1-2, 2-3, 3-4 and 4-5 combined images are each convolved again so that the numbers of output channels are consistent, yielding the 1-2, 2-3, 3-4 and 4-5 re-convolved images; then, taking the 1-2 re-convolved image as the reference, the resolutions of the 2-3, 3-4 and 4-5 re-convolved images are unified to obtain the 2-3, 3-4 and 4-5 resolution-adjusted images; the 1-2 re-convolved image and the 2-3, 3-4 and 4-5 resolution-adjusted images are fused through a concat function to obtain the information extraction fused image b, which is input into the lateral contour subdivision network module;
D. the reshaping module has two layers, and the processing in the first layer is as follows: three parallel convolutions with kernel sizes 1 × 1, 3 × 3 and 5 × 5 are applied to the 1-2 and 2-3 combined images; the three parallel convolution results of the 1-2 combined image are fused through a concat function to obtain the fused 1-2 combined image; the three parallel convolution results of the 2-3 combined image are fused through a concat function to obtain the fused 2-3 combined image; a 1 × 1 convolution is applied to the 3-4 and 4-5 combined images; the fused 1-2 combined image, the fused 2-3 combined image, the convolved 3-4 combined image and the convolved 4-5 combined image are combined pairwise in sequence to form 3 groups; in each group, the higher-resolution image is 2 × 2 max-pooled down to the same resolution as the lower-resolution image and the two are added, yielding the 1-3, 2-4 and 3-5 combined images, which are input into the second layer and the information extraction and fusion module c respectively;
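The three parallel convolutions with concat fusion used in the reshaping module can be written as a small reusable block, sketched below; the channel counts and the trailing 1 × 1 fusion convolution are assumptions.

```python
import torch
import torch.nn as nn

class ParallelConvFusion(nn.Module):
    """Parallel 1 x 1, 3 x 3 and 5 x 5 convolutions whose outputs are fused by concat."""
    def __init__(self, channels=200):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, 1)
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.fuse = nn.Conv2d(channels * 3, channels, 1)  # reduce the concat back to `channels`

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        return self.fuse(y)
```

In the first layer this block would be applied to the 1-2 and 2-3 combined images, while the 3-4 and 4-5 combined images receive only a 1 × 1 convolution; the pairwise 2 × 2 max-pool-and-add combination then follows the same pattern as in the compression module sketch.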
the processing in the second layer is as follows: three parallel convolutions with kernel sizes 1 × 1, 3 × 3 and 5 × 5 are applied to the 1-3 and 2-4 combined images; the three parallel convolution results of the 1-3 combined image are fused through a concat function to obtain the fused 1-3 combined image; the three parallel convolution results of the 2-4 combined image are fused through a concat function to obtain the fused 2-4 combined image; a 1 × 1 convolution is applied to the 3-5 combined image; the resolutions of the fused 1-3 combined image, the fused 2-4 combined image and the convolved 3-5 combined image are unified by 2 × 2 max-pooling the higher-resolution image down to the same resolution as the lower-resolution image; the images are then combined and added to obtain the 1-4 and 2-5 combined images, which are input into the adjustment module;
in the information extraction and fusion module c, the 1-3, 2-4 and 3-5 combined images are each convolved again so that the numbers of output channels are consistent, yielding the 1-3, 2-4 and 3-5 re-convolved images; then, taking the 1-3 re-convolved image as the reference, the resolutions of the 2-4 and 3-5 re-convolved images are unified to obtain the 2-4 and 3-5 resolution-adjusted images; the 1-3 re-convolved image and the 2-4 and 3-5 resolution-adjusted images are fused through a concat function to obtain the information extraction fused image c, which is input into the lateral contour subdivision network module;
E. in the adjustment module, the resolutions of the 1-4 and 2-5 combined images are unified by converting the lower-resolution image to the same resolution as the higher-resolution image with bilinear interpolation; the images are then combined and added to obtain the 1-5 combined image, which is input into the information extraction and fusion module d;
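A minimal sketch of the adjustment module, assuming the 1-4 combined image is the higher-resolution one of the pair:

```python
import torch.nn.functional as F

def adjustment_module(img_14, img_25):
    # upsample the lower-resolution 2-5 combined image to the size of the 1-4 combined image
    up = F.interpolate(img_25, size=img_14.shape[2:], mode="bilinear", align_corners=False)
    return img_14 + up  # 1-5 combined image
```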
in the information extraction and fusion module d, the 1-5 combined image is convolved again to obtain the 1-5 re-convolved image, which is input into the lateral contour subdivision network module;
F. in the lateral contour subdivision network module, the following operations are carried out:
F1, the information extraction fused images a, b, c and d are each convolved and activated and then multiplied by adaptive random weights to obtain the primary weight images a, b, c and d; the four images are combined pairwise in sequence to form 3 groups; in each group, the lower-resolution image is upsampled to the same resolution as the higher-resolution image with bilinear interpolation and the two are then added, yielding the primary added weight images a, b and c;
F2, the primary added weight images a, b and c are each convolved and activated and then multiplied by adaptive random weights to obtain the secondary weight images a, b and c; the three images are combined pairwise in sequence to form 2 groups; in each group, the lower-resolution image is upsampled to the same resolution as the higher-resolution image with bilinear interpolation and the two are then added, yielding the secondary added weight images a and b;
F3, the secondary added weight images a and b are each convolved and activated and then multiplied by adaptive random weights to obtain the tertiary weight images a and b; the resolutions of the two images are unified by upsampling the lower-resolution image to the same resolution as the higher-resolution image with bilinear interpolation, and the two are added; finally, a 1 × 1 convolution reduces the number of feature channels to 1, and the output is the final edge image.
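A hedged sketch of the lateral contour subdivision network module is given below. The 3 × 3 convolution with ReLU activation, the multiplication by an adaptive random weight initialised in [0, 1], the pairwise bilinear-upsample-and-add merging and the final 1 × 1 convolution to a single channel follow steps F1-F3; the channel-alignment convolutions and the final sigmoid are assumptions added only to make the sketch self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedBranch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.weight = nn.Parameter(torch.rand(1))  # adaptive random weight in [0, 1]

    def forward(self, x):
        return F.relu(self.conv(x)) * self.weight

def merge(high_res, low_res):
    # bilinearly upsample the lower-resolution map and add it to the higher-resolution one
    up = F.interpolate(low_res, size=high_res.shape[2:], mode="bilinear", align_corners=False)
    return high_res + up

class LateralContourSubdivision(nn.Module):
    def __init__(self, in_channels=(64, 100, 200, 300), channels=64):
        super().__init__()
        # align the four fused images to a common channel count (an assumption)
        self.align = nn.ModuleList(nn.Conv2d(c, channels, 1) for c in in_channels)
        self.level1 = nn.ModuleList(WeightedBranch(channels) for _ in range(4))
        self.level2 = nn.ModuleList(WeightedBranch(channels) for _ in range(3))
        self.level3 = nn.ModuleList(WeightedBranch(channels) for _ in range(2))
        self.head = nn.Conv2d(channels, 1, 1)  # reduce the feature channels to 1

    def forward(self, fused):  # fused = [a, b, c, d], with a at the highest resolution
        x = [branch(align(f)) for branch, align, f in zip(self.level1, self.align, fused)]
        x = [merge(x[i], x[i + 1]) for i in range(3)]   # F1: primary added weight images a, b, c
        x = [branch(xi) for branch, xi in zip(self.level2, x)]
        x = [merge(x[i], x[i + 1]) for i in range(2)]   # F2: secondary added weight images a, b
        x = [branch(xi) for branch, xi in zip(self.level3, x)]
        x = merge(x[0], x[1])                           # F3: final merged map
        return torch.sigmoid(self.head(x))              # final edge image (sigmoid is an assumption)
```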
The convolution expression involved in each step is m × n-k conv + relu, where m × n denotes the size of the convolution kernel, k denotes the number of output channels, conv denotes the convolution operation, and relu denotes the activation function; m, n and k are preset values; the convolution expression of the final fusion layer is m × n-k conv.
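The m × n-k conv + relu notation used above can be read as the following small helper (a sketch only; "same" padding is assumed):

```python
import torch.nn as nn

def conv_relu(in_channels, k, m=3, n=3):
    """m x n-k conv + relu: an m x n convolution with k output channels followed by ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_channels, k, kernel_size=(m, n), padding=(m // 2, n // 2)),
        nn.ReLU(inplace=True),
    )

# e.g. the "3 x 3-64 conv + relu" block of stage I:
block = conv_relu(in_channels=3, k=64, m=3, n=3)
```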
The VGG16 network comprises 5 stages, stage I to stage V, each of which contains more than one convolution layer;
the input of the first convolution layer of stage I is the original image, and the input of every other convolution layer of stage I is the output of the preceding convolution layer of that stage; in stages II-V, except for the first convolution layer of each stage, the input of every convolution layer is the output of the preceding convolution layer; the output of the last convolution layer of stages I to IV is, on the one hand, 2 × 2 max-pooled and used as the input of the first convolution layer of the next stage, and, on the other hand, input into the compression module and the information extraction and fusion module a; the output of the last convolution layer of stage V is 2 × 2 max-pooled and then input into the compression module and the information extraction and fusion module a;
the convolutions in the VGG16 network are all 3 × 3 convolutions.
The re-convolution in steps B-E is 1 x 1 convolution.
In step C, the number of unified feature channels is 200.
In steps B-E, the number of feature channels of the information extraction fused image a is 64, the number of feature channels of the information extraction fused image b is 100, the number of feature channels of the information extraction fused image c is 200, and the number of feature channels of the information extraction fused image d is 300.
In steps B-E, the resolutions in the information extraction and fusion module a, the information extraction and fusion module b, the information extraction and fusion module c and the information extraction and fusion module d are unified as follows: the lower-resolution output map is converted to the same resolution as the higher-resolution output map using bilinear interpolation.
In steps F1 to F3, the convolution is a 3 × 3 convolution, the activation uses the following ReLU function, and the weight parameters of the adaptive random weights range from 0 to 1;
ReLU(x) = max(0, x), i.e. the output is x for x > 0 and 0 otherwise.
Example 2
The edge detection results of the method of embodiment 1 are compared with the methods of the following documents 1 and 2;
document 1: HED: S. Xie and Z. Tu, "Holistically-nested edge detection," in IEEE International Conference on Computer Vision, 2015, pp. 1395-1403;
document 2: LRCNet: C. Lin, L. Cui, F. Li, and Y. Cao, "Lateral Refinement Network for Contour Detection," Neurocomputing, vol. 409, pp. 361-371, 2020;
Training and edge detection were performed with the neural network model of example 1. Training and testing were done using the published PyTorch framework. The network of the invention is initialized with the VGG16 model pre-trained on ImageNet. During training, the convolution kernels are initialized with a zero-mean Gaussian distribution with a standard deviation of 0.01, and the bias terms are set to 0. For the stochastic gradient descent (SGD) hyper-parameters, the global learning rate is set to 1e-6, and the momentum and weight decay are set to 0.9 and 0.0002, respectively. When the NYUD dataset is employed, the tolerance maxDist is adjusted to 0.011.
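The training setup above corresponds roughly to the following PyTorch sketch; the model is a placeholder, and only the stated hyper-parameters (zero-mean Gaussian initialisation with standard deviation 0.01, zero bias, SGD with learning rate 1e-6, momentum 0.9 and weight decay 0.0002) are taken from the text.

```python
import torch
import torch.nn as nn

def init_weights(m):
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)  # zero-mean Gaussian, std 0.01
        if m.bias is not None:
            nn.init.zeros_(m.bias)                     # bias term set to 0

model = nn.Conv2d(3, 1, 3, padding=1)  # placeholder for the network of example 1
model.apply(init_weights)              # in practice, applied to the newly added (non-pretrained) layers

optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9, weight_decay=2e-4)
```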
We used Precision-Recall (PR) curves and the harmonic mean F value to evaluate the performance of the contour detection model. The F value is defined as follows:
F=2PR/(P+R)
where P and R denote precision and recall, respectively:
P = TP / (TP + FP), R = TP / (TP + FN)
Here TP, FP and FN denote the number of correctly detected contour pixels, the number of false detections and the number of missed detections, respectively.
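For clarity, the evaluation quantities above can be computed as in the short sketch below (the pixel counts are illustrative only):

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    p = tp / (tp + fp)          # precision
    r = tp / (tp + fn)          # recall
    return 2 * p * r / (p + r)  # harmonic mean F value

print(f_measure(tp=900, fp=100, fn=150))
```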
Experimental data:
NYUD-V2 dataset. As shown in Table 1, the network of the invention achieves better detection results than the other learning-based networks. In embodiment 1 of the present invention, with VGG16 as the encoding network and combining the HHA and RGB images, the obtained ODS is 0.773, an improvement of 1.6% over LRCNet. From the experimental results in Table 1, the detection method of the invention (DDM) outperforms the detection methods of documents 1 (HED) and 2 (LRCNet).
Table 1 Comparison of F-scores with other networks

Claims (10)

1. A deep-refinement multiple information nested edge detection method is characterized by comprising the following steps:
A. constructing a deep neural network structure, wherein the deep neural network comprises an encoding network and a decoding network, and the specific structure is as follows:
the encoding network is a VGG16 network from which all fully connected layers and the pool5 pooling layer are removed, retaining only the main body of the VGG16 network; the decoding network is divided into three layers, wherein the first layer comprises a compression module, a reshaping module and an adjustment module; the second layer comprises an information extraction and fusion module a, an information extraction and fusion module b, an information extraction and fusion module c and an information extraction and fusion module d; the third layer is a lateral contour subdivision network module;
B. the original image is processed by the convolution layers of the VGG16 network to obtain the 5 side output images of VGG16, and the 5 side output images of VGG16 are then input into the compression module and the information extraction and fusion module a respectively;
in the information extraction and fusion module a, the 1st to 5th side output images are each convolved again so that the numbers of output channels are consistent, yielding re-convolved images of the 1st to 5th side output images; then, taking the re-convolved image of the 1st side output image as the reference, the resolutions of the re-convolved images of the 2nd to 5th side output images are unified to obtain resolution-adjusted images of the 2nd to 5th side output images; the re-convolved image of the 1st side output image is fused with the resolution-adjusted images of the 2nd to 5th side output images to obtain the information extraction fused image a, which is input into the lateral contour subdivision network module;
C. in the compression module: a secondary convolution is applied to the 1st to 5th side output images, using 3 × 3 convolution for the layer-1 and layer-2 convolution images and 1 × 1 convolution for the layer-3, layer-4 and layer-5 convolution images, so that the number of feature channels is unified; the layer-1 to layer-5 convolution images after the secondary convolution are combined pairwise in sequence to form 4 groups; in each group, the higher-resolution image is max-pooled down to the same resolution as the lower-resolution image and the two are then added, yielding four primary combined images, namely the 1-2, 2-3, 3-4 and 4-5 combined images, which are input into the reshaping module and the information extraction and fusion module b respectively;
in the information extraction and fusion module b, the 1-2, 2-3, 3-4 and 4-5 combined images are each convolved again so that the numbers of output channels are consistent, yielding the 1-2, 2-3, 3-4 and 4-5 re-convolved images; then, taking the 1-2 re-convolved image as the reference, the resolutions of the 2-3, 3-4 and 4-5 re-convolved images are unified to obtain the 2-3, 3-4 and 4-5 resolution-adjusted images; the 1-2 re-convolved image is fused with the 2-3, 3-4 and 4-5 resolution-adjusted images to obtain the information extraction fused image b, which is input into the lateral contour subdivision network module;
D. the reshaping module has two layers, and the processing in the first layer is as follows: three parallel convolutions with kernel sizes 1 × 1, 3 × 3 and 5 × 5 are applied to the 1-2 and 2-3 combined images; the three parallel convolution results of the 1-2 combined image are fused to obtain the fused 1-2 combined image; the three parallel convolution results of the 2-3 combined image are fused to obtain the fused 2-3 combined image; a 1 × 1 convolution is applied to the 3-4 and 4-5 combined images; the fused 1-2 combined image, the fused 2-3 combined image, the convolved 3-4 combined image and the convolved 4-5 combined image are combined pairwise in sequence to form 3 groups; in each group, the higher-resolution image is max-pooled down to the same resolution as the lower-resolution image and the two are added, yielding the 1-3, 2-4 and 3-5 combined images, which are input into the second layer and the information extraction and fusion module c respectively;
the processing in the second layer is as follows: three parallel convolutions with kernel sizes 1 × 1, 3 × 3 and 5 × 5 are applied to the 1-3 and 2-4 combined images; the three parallel convolution results of the 1-3 combined image are fused to obtain the fused 1-3 combined image; the three parallel convolution results of the 2-4 combined image are fused to obtain the fused 2-4 combined image; a 1 × 1 convolution is applied to the 3-5 combined image; the resolutions of the fused 1-3 combined image, the fused 2-4 combined image and the convolved 3-5 combined image are unified by max-pooling the higher-resolution image down to the same resolution as the lower-resolution image; the images are then combined and added to obtain the 1-4 and 2-5 combined images, which are input into the adjustment module;
in the information extraction and fusion module c, the 1-3, 2-4 and 3-5 combined images are each convolved again so that the numbers of output channels are consistent, yielding the 1-3, 2-4 and 3-5 re-convolved images; then, taking the 1-3 re-convolved image as the reference, the resolutions of the 2-4 and 3-5 re-convolved images are unified to obtain the 2-4 and 3-5 resolution-adjusted images; the 1-3 re-convolved image is fused with the 2-4 and 3-5 resolution-adjusted images to obtain the information extraction fused image c, which is input into the lateral contour subdivision network module;
E. in the adjustment module, the resolutions of the 1-4 and 2-5 combined images are unified by converting the lower-resolution image to the same resolution as the higher-resolution image with bilinear interpolation; the images are then combined and added to obtain the 1-5 combined image, which is input into the information extraction and fusion module d;
in the information extraction and fusion module d, the 1-5 combined image is convolved again to obtain the 1-5 re-convolved image, which is input into the lateral contour subdivision network module;
F. in the lateral contour subdivision network module, the following operations are carried out:
F1, the information extraction fused images a, b, c and d are each convolved and activated and then multiplied by adaptive random weights to obtain the primary weight images a, b, c and d; the four images are combined pairwise in sequence to form 3 groups; in each group, the lower-resolution image is upsampled to the same resolution as the higher-resolution image with bilinear interpolation and the two are then added, yielding the primary added weight images a, b and c;
F2, the primary added weight images a, b and c are each convolved and activated and then multiplied by adaptive random weights to obtain the secondary weight images a, b and c; the three images are combined pairwise in sequence to form 2 groups; in each group, the lower-resolution image is upsampled to the same resolution as the higher-resolution image with bilinear interpolation and the two are then added, yielding the secondary added weight images a and b;
F3, the secondary added weight images a and b are each convolved and activated and then multiplied by adaptive random weights to obtain the tertiary weight images a and b; the resolutions of the two images are unified by upsampling the lower-resolution image to the same resolution as the higher-resolution image with bilinear interpolation, and the two are added; finally, a 1 × 1 convolution reduces the number of feature channels to 1, and the output is the final edge image.
2. The method of deep-refinement multiple-information nested edge detection as claimed in claim 1, characterized in that: the convolution expression involved in each step is m × n-k conv + relu, where m × n denotes the size of the convolution kernel, k denotes the number of output channels, conv denotes the convolution operation, and relu denotes the activation function; and m, n and k are preset values.
3. The method of deep-refinement multiple-information nested edge detection as claimed in claim 2, characterized in that: the VGG16 network comprises 5 stages, stage I to stage V, each of which contains more than one convolution layer;
the input of the first convolution layer of stage I is the original image, and the input of every other convolution layer of stage I is the output of the preceding convolution layer of that stage; in stages II-V, except for the first convolution layer of each stage, the input of every convolution layer is the output of the preceding convolution layer; the output of the last convolution layer of stages I to IV is, on the one hand, max-pooled and used as the input of the first convolution layer of the next stage, and, on the other hand, input into the compression module and the information extraction and fusion module a; and the output of the last convolution layer of stage V is max-pooled and then input into the compression module and the information extraction and fusion module a.
4. The method of deep-refinement multiple-information nested edge detection as claimed in claim 3, characterized in that:
the convolutions in the VGG16 network are all 3 × 3 convolutions.
5. The method of deep-refinement multiple-information nested edge detection as claimed in claim 1, characterized in that: the re-convolution in steps B to E is a 1 × 1 convolution.
6. The method of deep-refinement multiple-information nested edge detection as claimed in claim 1, characterized in that: in the step C, the number of the unified feature channels is 200.
7. The method of deep-refinement multiple-information nested edge detection as claimed in claim 1, characterized in that: in the steps B-E, the number of the feature channels of the information extraction fusion image a is 64, the number of the feature channels of the information extraction fusion image B is 100, the number of the feature channels of the information extraction fusion image c is 200, and the number of the feature channels of the information extraction fusion image d is 300.
8. The method of deep-refinement multiple-information nested edge detection as claimed in claim 1, characterized in that: in steps B-E, the resolutions in the information extraction and fusion module a, the information extraction and fusion module b, the information extraction and fusion module c and the information extraction and fusion module d are unified as follows: the lower-resolution output map is converted to the same resolution as the higher-resolution output map using bilinear interpolation.
9. The method of deep-refinement multiple-information nested edge detection as claimed in claim 1, characterized in that:
in steps F1 to F3, the convolution is a 3 × 3 convolution, the activation uses the following ReLU function, and the weight parameters of the adaptive random weights range from 0 to 1;
ReLU(x) = max(0, x), i.e. the output is x for x > 0 and 0 otherwise.
10. the method of deep-refinement multiple-information nested edge detection as claimed in claim 8, characterized in that: the maximum pooling is 2 x 2 maximum pooling.
CN202110746455.8A 2021-07-01 2021-07-01 Deep-refinement multiple-information nested edge detection method Active CN113538484B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110746455.8A CN113538484B (en) 2021-07-01 2021-07-01 Deep-refinement multiple-information nested edge detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110746455.8A CN113538484B (en) 2021-07-01 2021-07-01 Deep-refinement multiple-information nested edge detection method

Publications (2)

Publication Number Publication Date
CN113538484A CN113538484A (en) 2021-10-22
CN113538484B true CN113538484B (en) 2022-06-10

Family

ID=78097547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110746455.8A Active CN113538484B (en) 2021-07-01 2021-07-01 Deep-refinement multiple-information nested edge detection method

Country Status (1)

Country Link
CN (1) CN113538484B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463360B (en) * 2021-10-27 2024-03-15 广西科技大学 Contour detection method based on bionic characteristic enhancement network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740869A (en) * 2016-01-28 2016-07-06 北京工商大学 Square operator edge extraction method and system based on multiple scales and multiple resolutions
CN107610140A (en) * 2017-08-07 2018-01-19 中国科学院自动化研究所 Near edge detection method, device based on depth integration corrective networks
CN110706242A (en) * 2019-08-26 2020-01-17 浙江工业大学 Object-level edge detection method based on depth residual error network
CN111242138A (en) * 2020-01-11 2020-06-05 杭州电子科技大学 RGBD significance detection method based on multi-scale feature fusion
CN111325762A (en) * 2020-01-21 2020-06-23 广西科技大学 Contour detection method based on dense connection decoding network
CN112347859A (en) * 2020-10-15 2021-02-09 北京交通大学 Optical remote sensing image saliency target detection method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8457437B2 (en) * 2010-03-23 2013-06-04 Raytheon Company System and method for enhancing registered images using edge overlays
US10410353B2 (en) * 2017-05-18 2019-09-10 Mitsubishi Electric Research Laboratories, Inc. Multi-label semantic boundary detection system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740869A (en) * 2016-01-28 2016-07-06 北京工商大学 Square operator edge extraction method and system based on multiple scales and multiple resolutions
CN107610140A (en) * 2017-08-07 2018-01-19 中国科学院自动化研究所 Near edge detection method, device based on depth integration corrective networks
CN110706242A (en) * 2019-08-26 2020-01-17 浙江工业大学 Object-level edge detection method based on depth residual error network
CN111242138A (en) * 2020-01-11 2020-06-05 杭州电子科技大学 RGBD significance detection method based on multi-scale feature fusion
CN111325762A (en) * 2020-01-21 2020-06-23 广西科技大学 Contour detection method based on dense connection decoding network
CN112347859A (en) * 2020-10-15 2021-02-09 北京交通大学 Optical remote sensing image saliency target detection method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Fast accurate contours for 3D shape recognition; M. U. Butt et al.; 2015 IEEE Intelligent Vehicles Symposium (IV); 2015-08-27; pp. 832-838 *
Lateral refinement network for contour detection; Chuan Lin et al.; Neurocomputing; 2020-06-24; vol. 409; pp. 361-371 *
A SAR image water-area segmentation algorithm based on dense depthwise separable convolution; Zhang Jinsong; Journal of Radars; 2019-03-07; vol. 8, no. 03; pp. 400-412 *
Building extraction from GF-2 remote sensing images based on a multi-level perception network; Lu Qi et al.; Remote Sensing for Land & Resources; 2021-06-15; vol. 33, no. 02; pp. 75-84 *
Research on multi-scale fusion methods in vision-biomimetic contour detection; Lin Chuan et al.; Computer Simulation; 2019-04-15; vol. 36, no. 04; pp. 362-368 *

Also Published As

Publication number Publication date
CN113538484A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN110599409B (en) Convolutional neural network image denoising method based on multi-scale convolutional groups and parallel
CN106875373B (en) Mobile phone screen MURA defect detection method based on convolutional neural network pruning algorithm
CN112116605B (en) Pancreas CT image segmentation method based on integrated depth convolution neural network
CN109754017B (en) Hyperspectral image classification method based on separable three-dimensional residual error network and transfer learning
CN107464217B (en) Image processing method and device
CN112232229B (en) Fine water body extraction method based on U-net neural network
CN112435191B (en) Low-illumination image enhancement method based on fusion of multiple neural network structures
CN111325762B (en) Contour detection method based on dense connection decoding network
CN110827297A (en) Insulator segmentation method for generating countermeasure network based on improved conditions
CN109325513B (en) Image classification network training method based on massive single-class images
CN111105375B (en) Image generation method, model training method and device thereof, and electronic equipment
CN113066025B (en) Image defogging method based on incremental learning and feature and attention transfer
CN113538484B (en) Deep-refinement multiple-information nested edge detection method
CN111882516B (en) Image quality evaluation method based on visual saliency and deep neural network
CN111062432B (en) Semantically multi-modal image generation method
CN111160378A (en) Depth estimation system based on single image multitask enhancement
CN113642445A (en) Hyperspectral image classification method based on full convolution neural network
CN109949334B (en) Contour detection method based on deep reinforced network residual error connection
CN109934835B (en) Contour detection method based on deep strengthening network adjacent connection
CN114742985A (en) Hyperspectral feature extraction method and device and storage medium
CN109102457B (en) Intelligent color changing system and method based on convolutional neural network
CN110599495A (en) Image segmentation method based on semantic information mining
CN111724306B (en) Image reduction method and system based on convolutional neural network
CN110111252A (en) Single image super-resolution method based on projection matrix
CN111767842B (en) Micro-expression type discrimination method based on transfer learning and self-encoder data enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20211022

Assignee: Liuzhou Wanyou Printing Co.,Ltd.

Assignor: GUANGXI University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2023980054135

Denomination of invention: A Deep Refined Multi information Nested Edge Detection Method

Granted publication date: 20220610

License type: Common License

Record date: 20231225