CN111028235B - Image segmentation method for enhancing edge and detail information by utilizing feature fusion - Google Patents
- Publication number: CN111028235B (application CN201911094462.3A)
- Authority: CN (China)
- Prior art keywords: feature map, feature, pooling, convolution, decoding
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/10 — Segmentation; Edge detection
- G06N3/045 — Combinations of networks
- G06T5/00 — Image enhancement or restoration
- G06T2207/10004 — Still image; Photographic image
- G06T2207/20081 — Training; Learning
- G06T2207/20192 — Edge enhancement; Edge preservation
- Y02T10/40 — Engine management systems
Abstract
The invention provides an image segmentation method that uses feature fusion to enhance edge and detail information, and relates to the technical field of computer vision. The method extracts features from an input image with a convolutional neural network; the extracted features are fed into a decoding structure augmented with additional feature fusion, which enriches edge and detail information while restoring the image resolution, yielding a dense feature map; the maximum class probabilities are produced by a normalization (Softmax) step; a cross-entropy loss is computed, and the network weights are updated by stochastic gradient descent. While restoring the resolution of the feature map, the method recovers the position and boundary detail information lost in the encoding stage, enriches the image information, and obtains a dense feature map that compensates for the sparse feature maps produced by direct upsampling, so that segmented boundaries and details are sharper and the segmentation of small, detailed objects improves.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to an image segmentation method that uses feature fusion to enhance edge and detail information.
Background
With continuing scientific and technological progress and the rapid development of the national economy, artificial intelligence has gradually entered public view, plays an ever greater role in human production and daily life, and is widely applied across many fields. Image semantic segmentation is an important research direction of artificial intelligence and a key means of automatic scene understanding; it can be applied in fields such as autonomous driving systems and unmanned vehicles.
Image semantic segmentation is an important branch of computer vision within machine learning: an input image is processed so that its content is automatically segmented and recognized. Before deep learning was applied to computer vision, classifiers for image semantic segmentation were typically built with texton forests or random forests. With the emergence and vigorous development of deep convolutional neural networks, a very effective approach to semantic segmentation became available; CNNs applied to semantic segmentation have developed well, have driven the field forward, and have achieved remarkable results in many application areas.
Many classical segmentation methods have appeared since deep learning was applied to semantic segmentation, such as the fully convolutional network FCN, the SegNet network with its encoder-decoder structure, and DeepLab with dilated (atrous) convolution. However, as the CNN hierarchy deepens, repeated pooling and downsampling discard the position information and boundary detail of the picture. This process is irreversible and the removed information cannot be fully recovered, so the feature maps produced by upsampling in the decoding stage are sparse, and these methods therefore have certain limitations.
The fully convolutional network FCN and the conventional SegNet network lose position and edge detail through downsampling, and the lost information does not reappear when upsampling in the decoding stage, so the resulting feature maps are sparse. Although SegNet recovers position information through pooling indices and enriches boundary and detail information with convolution operations, a great deal of information is still lost.
Dilated (atrous) convolution is a convolutional layer that can produce dense feature maps, but its computational cost is relatively high, and processing a large number of high-resolution feature maps occupies a large amount of memory.
The problem with existing image semantic segmentation methods is that the preservation of edge detail features and position information still needs further improvement, and segmentation accuracy remains to be improved.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, an image segmentation method that uses feature fusion to enhance edge and detail information, thereby realizing image segmentation.
In order to solve the technical problems, the invention adopts the following technical scheme: an image segmentation method for enhancing edge and detail information by utilizing feature fusion, comprising the following steps:
step 1: processing the images in the training data set to obtain images with uniform resolution;
step 1.1: scaling and cutting the images in the training data set to enable the input images to have uniform sizes;
step 1.2: fixing the resolution of the input image to 360×480;
Step 2: input the image into a coding structure for feature extraction; the coding structure is the same as in the SegNet network, adopting the first 13 layers of VGG-16, while a max pooling index is added during pooling to record the maximum pixel values in the image and their positions;
The convolution kernel size of each convolution layer in the coding structure is 3×3, and the feature map after each convolution layer is named conv_i_j, where i = 1,…,5; j = 1,2 when i = 1,2 and j = 1,2,3 when i = 3,4,5. Each convolution layer is followed by batch normalization and a ReLU activation function. A max pooling index is added to each pooling layer: downsampling is realized with 2×2 non-overlapping max pooling, and the position of the maximum pixel value is recorded through the index. The feature map produced by each pooling layer is denoted pool_r, where r = 1,…,5;
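The encoder layout above (first 13 layers of VGG-16, five convolution blocks each followed by indexed 2×2 max pooling) can be sketched as follows. The channel counts are an assumption taken from the standard VGG-16 configuration, which the text implies but does not list explicitly:

```python
# Sketch of the encoder layout described in step 2. Assumption: the standard
# VGG-16 channel plan (64, 128, 256, 512, 512); the text only says
# "the first 13 layers of VGG-16" without listing channels.
ENCODER_BLOCKS = [
    # (block index i, output channels, number of conv layers)
    (1, 64, 2),   # conv_1_1, conv_1_2 -> pool_1
    (2, 128, 2),  # conv_2_1, conv_2_2 -> pool_2
    (3, 256, 3),  # conv_3_1 .. conv_3_3 -> pool_3
    (4, 512, 3),  # conv_4_1 .. conv_4_3 -> pool_4
    (5, 512, 3),  # conv_5_1 .. conv_5_3 -> pool_5
]

def feature_map_names():
    """Enumerate the conv_i_j and pool_r feature-map names used in the text."""
    names = []
    for i, _channels, n_convs in ENCODER_BLOCKS:
        names += [f"conv_{i}_{j}" for j in range(1, n_convs + 1)]
        names.append(f"pool_{i}")
    return names
```

This yields 13 convolution feature maps and 5 pooling feature maps, matching the naming scheme used throughout the description.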
The specific method of recording the maximum pixel values in the image and their positions by adding a max pooling index during pooling is as follows:
For an input feature map $X \in \mathbb{R}^{h \times w \times c}$, where h and w are the height and width of the feature map and c is the number of channels, 2×2 non-overlapping max pooling yields $Y \in \mathbb{R}^{(h/2) \times (w/2) \times c}$, in which the value of pixel (i, j) is

$$Y_{i,j} = \max_{p,q \in \{0,1\}} X_{2i+p,\;2j+q}$$

The position corresponding to the maximum pixel value is recorded as $(m_i, n_j)$:

$$(m_i, n_j) = \underset{(2i+p,\;2j+q),\;p,q \in \{0,1\}}{\arg\max}\; X_{2i+p,\;2j+q}$$
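As a minimal illustration of the pooling step above, this pure-Python sketch performs 2×2 non-overlapping max pooling on a single-channel feature map while recording the max pooling index (0-based indexing is assumed; the helper name is our own, not the patent's):

```python
def max_pool_2x2_with_index(x):
    """2x2 non-overlapping max pooling over a 2D list `x` (h x w, h, w even).
    Returns the pooled map Y and, for each output pixel (i, j), the input
    position (m_i, n_j) of the maximum -- the max pooling index."""
    h, w = len(x), len(x[0])
    pooled, index = [], []
    for i in range(h // 2):
        row_vals, row_idx = [], []
        for j in range(w // 2):
            # Candidates X[2i+p][2j+q] for p, q in {0, 1}
            window = [(x[2 * i + p][2 * j + q], (2 * i + p, 2 * j + q))
                      for p in (0, 1) for q in (0, 1)]
            val, pos = max(window)  # maximum value and its position
            row_vals.append(val)
            row_idx.append(pos)
        pooled.append(row_vals)
        index.append(row_idx)
    return pooled, index
```

For a 4×4 input this halves the resolution to 2×2 while remembering exactly where each maximum came from, which is what the decoding stage later needs.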
Step 3: input the pooling feature map pool_5 obtained from the coding structure into a decoding structure augmented with additional feature fusion; release the maximum pixel values in situ using the max pooling index, fill the remaining positions with 0, and thereby realize 2× upsampling, obtaining the sparse feature map upsampling5;
The decoding structure comprises three three-layer convolution blocks and two two-layer convolution blocks; each convolution layer in the decoding structure is followed by batch normalization and a ReLU activation function;
The value of each pixel in the resulting sparse feature map upsampling5 is

$$Z_{u,v} = \begin{cases} Y_{i,j}, & (u,v) = (m_i, n_j) \\ 0, & \text{otherwise} \end{cases}$$

where $Z_{u,v}$ is the pixel value of pixel (u, v) in the sparse feature map upsampling5;
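The in-situ release of step 3 can be sketched as follows: each pooled maximum is written back at its recorded position and every other position is filled with 0 (single channel, pure Python; the function name is illustrative):

```python
def max_unpool_2x2(pooled, index, out_h, out_w):
    """Reverse of indexed 2x2 max pooling: place each pooled value Y[i][j]
    back at its recorded position (m_i, n_j) and fill every other position
    with 0, yielding the sparse, 2x-upsampled feature map Z."""
    z = [[0] * out_w for _ in range(out_h)]
    for i, row in enumerate(pooled):
        for j, val in enumerate(row):
            m, n = index[i][j]
            z[m][n] = val  # Z[u][v] = Y[i][j] when (u, v) == (m_i, n_j)
    return z
```

Note that all non-maximum positions of the output really are 0, which is exactly why the upsampled map is sparse and why the feature fusion of the following steps is needed.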
Step 4: perform one feature fusion operation through the decoding structure: fuse the sparse feature map upsampling5 with the convolution feature maps conv_5_1 and conv_5_2, then fuse the result with the pooling feature map pool_4 of the corresponding size to obtain the fused feature map F1;
The fusion process adds the pixel values at corresponding positions in the feature maps;
Input the fused feature map F1 into the first three-layer convolution block to perform the convolution operation, obtaining the dense feature map conv_decode5 and compensating for the information loss caused by pooling and downsampling;
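The fusion operation described in step 4 is plain element-wise addition of equally sized feature maps; a minimal sketch (single channel, pure Python, illustrative helper name):

```python
def fuse(*feature_maps):
    """Feature fusion as described in step 4: element-wise addition of
    equally sized 2D feature maps (e.g. upsampling5 + conv_5_1 + conv_5_2,
    then the result + pool_4)."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    assert all(len(f) == h and len(f[0]) == w for f in feature_maps), \
        "all fused maps must share the same resolution"
    return [[sum(f[i][j] for f in feature_maps) for j in range(w)]
            for i in range(h)]
```

Because addition is associative, fusing upsampling5 with conv_5_1 and conv_5_2 first and then with pool_4 gives the same result as one combined sum.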
Step 5: perform four further feature fusion operations through the decoding structure, repeating upsampling, feature fusion, and convolution until the resolution of the feature map is restored to the original size;
Step 5.1: perform the second feature fusion through the decoding structure to recover image information;
Step 5.1.1: upsample conv_decode5 by a factor of 2 using the max pooling index stored when generating the pooling feature map pool_4, obtaining the sparse feature map upsampling4;
Step 5.1.2: fuse the sparse feature map upsampling4 with the convolution feature maps conv_4_1 and conv_4_2 of the same resolution extracted from the coding structure, and with the pooling feature map pool_3, to obtain the fused feature map F2;
Step 5.1.3: input the fused feature map F2 into the second three-layer convolution block to perform the convolution operation, obtaining the dense feature map conv_decode4;
Step 5.2: perform the third feature fusion through the decoding structure to recover image information;
Step 5.2.1: upsample the feature map conv_decode4 by a factor of 2 using the max pooling index stored when generating the pooling feature map pool_3, obtaining the sparse feature map upsampling3;
Step 5.2.2: fuse the sparse feature map upsampling3 with the convolution feature maps conv_3_1 and conv_3_2 of the same resolution extracted from the coding structure, and with the pooling feature map pool_2, to obtain the fused feature map F3;
Step 5.2.3: input the fused feature map F3 into the third three-layer convolution block to perform the convolution operation, obtaining the dense feature map conv_decode3;
Step 5.3: perform the fourth feature fusion through the decoding structure to recover the detail information of the image;
Step 5.3.1: upsample the feature map conv_decode3 by a factor of 2 using the max pooling index stored when generating the pooling feature map pool_2, obtaining the sparse feature map upsampling2;
Step 5.3.2: fuse the sparse feature map upsampling2 with the convolution feature map conv_2_1 and the pooling feature map pool_1 to obtain the fused feature map F4;
Step 5.3.3: in accordance with the symmetry of the SegNet network, input the fused feature map F4 into the first two-layer convolution block to perform the convolution operation, obtaining the dense feature map conv_decode2;
Step 5.4: perform the fifth feature fusion through the decoding structure to recover the edge information of the image;
Step 5.4.1: upsample the feature map conv_decode2 by a factor of 2 using the max pooling index stored when generating the pooling feature map pool_1, obtaining the sparse feature map upsampling1;
Step 5.4.2: fuse the sparse feature map upsampling1 with the convolution feature map conv_1_1 to obtain the fused feature map F5;
Step 5.4.3: input the fused feature map F5 into the second two-layer convolution block to perform the convolution operation, obtaining the dense feature map conv_decode1;
Step 6: input the dense feature map conv_decode1 into the Softmax layer to obtain the maximum classification probability for each pixel in the image;
Step 7: compute the cross-entropy loss from the maximum per-pixel classification probabilities, and update the convolution kernel parameters of each convolution layer in the coding and decoding structures by stochastic gradient descent, thereby realizing the image segmentation.
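Steps 6 and 7 rest on a per-pixel Softmax followed by a cross-entropy loss. A minimal sketch for a single pixel's class scores (not the patent's full training loop, and the function names are our own):

```python
import math

def softmax(logits):
    """Normalize one pixel's class scores to probabilities (the Softmax layer)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pixel_cross_entropy(logits, true_class):
    """Cross-entropy loss for one pixel given its ground-truth class index:
    -log p(true_class). Summing this over all pixels gives the image loss
    minimized by stochastic gradient descent in step 7."""
    return -math.log(softmax(logits)[true_class])
```

The loss is small when the score of the true class dominates and grows as probability mass shifts to other classes, which is the gradient signal that drives the weight updates.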
The technical principle of the method is as follows: the decoding stage is improved on the basis of the original SegNet network so that, while the feature map resolution is restored, image position and boundary detail information is also recovered, yielding a dense feature map. Image features are extracted by the convolution and pooling layers of the coding structure, and layers at different depths extract information at different scales: shallow layers extract global low-level semantic information such as edges, orientation, texture, and chromaticity, while deep layers extract local high-level semantic information such as object shape. The deeper the network layer, the more abstract the extracted features; to capture these more abstract high-level features, the model uses max pooling rather than average pooling in the coding structure.
Because the maximum pixel values extracted from the feature map and their positions are critical, pooling loses edge detail information, and the reduced feature map resolution loses position information. A pooling index is therefore added to the encoding structure to record the position of each maximum pixel value; the decoding structure releases the maximum values at their original positions through the pooling index and fills the remaining positions with 0, realizing 2× upsampling, recovering important position information, and reducing error.
However, as the decoding structure deepens, the extracted features become increasingly abstract, much edge detail information is lost, and each layer loses information at a different scale. In the decoding structure, every position of an upsampled feature map other than the maxima is 0, i.e. the obtained feature map is sparse, and the lost information does not reappear in it. Feature fusion is therefore added to the decoding structure to restore information: the sparse feature map obtained after each upsampling is superimposed with the convolved and pooled feature maps of corresponding size from the encoding stage. In this way, each upsampled feature map is fed into the fusion structure, the information lost in the encoding stage is gradually recovered, and the fusion result is passed to convolution layers that further enrich the information, producing denser feature maps, a better segmentation effect, and higher accuracy.
The beneficial effect of the above technical scheme is that the proposed image segmentation method, which uses feature fusion to enhance edge and detail information, recovers the position and boundary detail information lost in the encoding stage while restoring the resolution of the feature map, enriches the image information, and obtains a dense feature map that compensates for the sparse feature maps produced by direct upsampling, so that segmented boundaries and details are clearer, the segmentation of fine, detailed objects improves, and both the average segmentation accuracy and the mIoU increase.
Drawings
Fig. 1 is a flowchart of an image segmentation method using feature fusion to enhance edge and detail information according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the invention with reference to the drawings and examples. The following examples illustrate the invention and are not intended to limit its scope.
In this embodiment, an image segmentation method using feature fusion to enhance edge and detail information, as shown in fig. 1, includes the following steps:
step 1: processing the images in the training data set to obtain images with uniform resolution;
step 1.1: scaling and cutting the images in the training data set to enable the input images to have uniform sizes;
step 1.2: fixing the resolution of the input image to 360×480;
Step 2: input the image into a coding structure for feature extraction; the coding structure is the same as in the SegNet network, adopting the first 13 layers of VGG-16, while a max pooling index is added during pooling to record the maximum pixel values in the image and their positions;
The convolution kernel size of each convolution layer in the coding structure is 3×3, keeping the image size unchanged; the feature map after each convolution layer is named conv_i_j, where i = 1,…,5; j = 1,2 when i = 1,2 and j = 1,2,3 when i = 3,4,5. Each convolution layer is followed by batch normalization and a ReLU activation function: batch normalization accelerates model convergence and, to a certain extent, alleviates the gradient-dispersion problem in deep networks, making deep models easier and more stable to train, while the ReLU activation function mitigates vanishing gradients and alleviates network overfitting. A max pooling index is added to each pooling layer: downsampling is realized with 2×2 non-overlapping max pooling, and the position of the maximum pixel value is recorded through the index; the feature map produced by each pooling layer is denoted pool_r, where r = 1,…,5;
The coding structure uses the first 13 layers of VGG-16 to extract picture features, employing convolution and pooling layers to extract image features at different scales: the first four layers can be regarded as a shallow structure yielding low-level semantic information, and the later layers as a deep structure yielding high-level abstract information, so features at different scales are obtained through the coding structure;
For an input feature map $X \in \mathbb{R}^{h \times w \times c}$, where h and w are the height and width of the feature map and c is the number of channels, 2×2 non-overlapping max pooling yields $Y \in \mathbb{R}^{(h/2) \times (w/2) \times c}$, in which the value of pixel (i, j) is

$$Y_{i,j} = \max_{p,q \in \{0,1\}} X_{2i+p,\;2j+q}$$

The position corresponding to the maximum pixel value is recorded as $(m_i, n_j)$:

$$(m_i, n_j) = \underset{(2i+p,\;2j+q),\;p,q \in \{0,1\}}{\arg\max}\; X_{2i+p,\;2j+q}$$
Step 3: input the pooling feature map pool_5 obtained from the coding structure into a decoding structure augmented with additional feature fusion; release the maximum pixel values in situ using the max pooling index, fill the remaining positions with 0, and thereby realize 2× upsampling, obtaining the sparse feature map upsampling5;
The decoding structure comprises three three-layer convolution blocks and two two-layer convolution blocks; each convolution layer in the decoding structure is followed by batch normalization and a ReLU activation function;
The value of each pixel in the resulting sparse feature map upsampling5 is

$$Z_{u,v} = \begin{cases} Y_{i,j}, & (u,v) = (m_i, n_j) \\ 0, & \text{otherwise} \end{cases}$$

where $Z_{u,v}$ is the pixel value of pixel (u, v) in the sparse feature map upsampling5.
Step 4: because the feature map obtained by upsampling is sparse, a feature fusion operation is performed through the decoding structure. The convolution feature maps extracted from the coding structure with the same resolution as the sparse feature map upsampling5 are conv_5_1, conv_5_2, and conv_5_3. Since pool_5 is obtained by directly pooling conv_5_3, part of that information is already recovered during the 2× upsampling; therefore, and also to reduce the model's training parameters, only conv_5_1 and conv_5_2 are fused with the sparse feature map upsampling5, and the result is then fused with the pooling feature map pool_4 of the corresponding size to obtain the fused feature map F1;
The fusion process adds the pixel values at corresponding positions in the feature maps;
To maintain the symmetry of the original SegNet network, the fused feature map F1 is input into the first three-layer convolution block for the convolution operation, obtaining the dense feature map conv_decode5, further enriching the picture information and compensating for the information loss caused by pooling and downsampling;
Step 4 constitutes the first feature fusion operation. The method performs five feature fusions in total during decoding, which take three different forms according to the upsampling depth; the first three fusions share the same form. Four further feature fusions follow.
Step 5: perform four further feature fusion operations through the decoding structure, repeating upsampling, feature fusion, and convolution until the resolution of the feature map is restored to the original size, obtaining the dense feature map conv_decode1;
Step 5.1: perform the second feature fusion through the decoding structure to recover image information;
Step 5.1.1: after step 4, the resolution of the feature map conv_decode5 is the same as that of the pooling feature map pool_4; conv_decode5 is upsampled by a factor of 2 using the max pooling index stored when generating pool_4, obtaining the sparse feature map upsampling4;
Step 5.1.2: fuse the sparse feature map upsampling4 with the convolution feature maps conv_4_1 and conv_4_2 of the same resolution extracted from the coding structure, and with the pooling feature map pool_3, to obtain the fused feature map F2;
Step 5.1.3: input the fused feature map F2 into the second three-layer convolution block to perform the convolution operation, obtaining the dense feature map conv_decode4;
Step 5.2: perform the third feature fusion through the decoding structure to recover image information;
Step 5.2.1: upsample the feature map conv_decode4 by a factor of 2 using the max pooling index stored when generating the pooling feature map pool_3, obtaining the sparse feature map upsampling3;
Step 5.2.2: fuse the sparse feature map upsampling3 with the convolution feature maps conv_3_1 and conv_3_2 of the same resolution extracted from the coding structure, and with the pooling feature map pool_2, to obtain the fused feature map F3;
Step 5.2.3: input the fused feature map F3 into the third three-layer convolution block to perform the convolution operation, obtaining the dense feature map conv_decode3;
The first three feature fusions correspond to encoder feature maps from three stages and share the same fusion structure; the feature maps involved have lower resolution and carry local abstract features, so the same fusion form is used to recover those local abstract features.
Step 5.3: performing fourth feature fusion through the decoding structure to recover the detail information of the image;
step 5.3.1: performing 2-time up-sampling on the feature map conv_decoding 3 by using a maximum pooling index stored when generating pooling feature map pool_2 to obtain sparse feature map upsampling2;
step 5.3.2: since the resolution of the feature map has been restored to the original map after step 5.3.1At this time, the corresponding feature graphs comprise conv_2_1, conv_2_2 and pool_1, so that in order to reduce the parameters of model training, only sparse feature graph upsampling2 is subjected to feature fusion with convolution feature graph conv_2_1 and pooling feature graph pool_1 to obtain a fusion feature graph F 4 ;
Step 5.3.3: according to the symmetry of SegNet network, the feature map F is fused 4 Inputting the two-layer characteristic images into a first two-layer convolution structure to carry out convolution operation to obtain a dense characteristic image conv_decoding 2;
different from the previous three feature fusion, the feature fusion corresponds to two stages of coding feature graphs for recovering detail information, so that the fusion forms are different;
Step 5.4: perform the fifth feature fusion through the decoding structure to recover the edge information of the image;
Step 5.4.1: upsample the feature map conv_decode2 by a factor of 2 using the max pooling index stored when generating the pooling feature map pool_1, obtaining the sparse feature map upsampling1;
Step 5.4.2: since the resolution of the feature map after step 5.4.1 has been restored to the original size, the feature maps of the same resolution obtained by the coding structure are the convolution feature maps conv_1_1 and conv_1_2; to reduce the parameters of model training, only the sparse feature map upsampling1 is fused with the convolution feature map conv_1_1, obtaining the fused feature map F5;
Step 5.4.3: input the fused feature map F5 into the second two-layer convolution block to perform the convolution operation, obtaining the dense feature map conv_decode1;
In this feature fusion, only one stage's encoder feature map participates in the fusion, and it is used to recover edge information.
Step 6: the dense feature map conv_decode1 is input into the Softmax layer to obtain the maximum classification probability for each pixel in the image.
Step 7: compute the cross-entropy loss from the maximum per-pixel classification probabilities, and update the convolution kernel parameters of each convolution layer in the coding and decoding structures by stochastic gradient descent, thereby realizing the image segmentation.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.
Claims (5)
1. An image segmentation method for enhancing edge and detail information by utilizing feature fusion is characterized in that: the method comprises the following steps:
step 1: processing the images in the training data set to obtain images with uniform resolution;
step 2: inputting the image into a coding structure for feature extraction; the coding structure is the same as that of the SegNet network, adopting the first 13 layers of VGG-16, and a max-pooling index is added during pooling to memorize the maximum pixel values in the image and their positions;
the convolution kernel size of each convolution layer of the coding structure is 3×3, and the feature map after each convolution layer is named conv_i_j, where i=1,2,3,4,5, j=1,2 when i=1,2, and j=1,2,3 when i=3,4,5; each convolution layer is followed by Batch Normalisation and a ReLU activation function; a max-pooling index is added to each pooling layer, down-sampling is realized with 2×2 non-overlapping max pooling, and the position of the maximum pixel value is memorized through the max-pooling index; the feature map obtained by each pooling layer is denoted pool_r, where r=1,2,3,4,5;
step 3: inputting the pooling feature map pool_5 obtained through the coding structure into a decoding structure with additional feature fusion, releasing the maximum pixel values in situ using the max-pooling indices and filling the remaining positions with 0, thereby realizing 2× up-sampling and obtaining the sparse feature map upsampling5;
the decoding structure comprises three-layer convolution structures and two-layer convolution structures; each convolutional layer in the decoding structure is followed by a concatenation Batch Normalisation and a ReLU activation function;
step 4: performing the first feature fusion operation through the decoding structure: fuse the sparse feature map upsampling5 with the convolution feature maps conv_5_1 and conv_5_2, then fuse the result with the pooling feature map pool_4 of the corresponding size to obtain the fusion feature map F_1;
input the fusion feature map F_1 into the first three-layer convolution structure of the decoding structure for convolution, obtaining the dense feature map conv_decode5 and compensating the information loss caused by pooling down-sampling;
step 5: performing four further feature fusion operations through the decoding structure until the resolution of the feature map is restored to the original size, obtaining the dense feature map conv_decode1;
step 5.1: performing the second feature fusion through the decoding structure to recover image information;
step 5.1.1: perform 2× up-sampling on conv_decode5 using the max-pooling indices stored when generating the pooling feature map pool_4, obtaining the sparse feature map upsampling4;
step 5.1.2: fuse the sparse feature map upsampling4 with the same-resolution convolution feature maps conv_4_1 and conv_4_2 extracted from the coding structure and with the pooling feature map pool_3, obtaining the fusion feature map F_2;
step 5.1.3: input the fusion feature map F_2 into the second three-layer convolution structure for convolution, obtaining the dense feature map conv_decode4;
step 5.2: performing the third feature fusion through the decoding structure to recover image information;
step 5.2.1: perform 2× up-sampling on the feature map conv_decode4 using the max-pooling indices stored when generating the pooling feature map pool_3, obtaining the sparse feature map upsampling3;
step 5.2.2: fuse the sparse feature map upsampling3 with the same-resolution convolution feature maps conv_3_1 and conv_3_2 extracted from the coding structure and with the pooling feature map pool_2, obtaining the fusion feature map F_3;
step 5.2.3: input the fusion feature map F_3 into the third three-layer convolution structure for convolution, obtaining the dense feature map conv_decode3;
step 5.3: performing the fourth feature fusion through the decoding structure to recover the detail information of the image;
step 5.3.1: perform 2× up-sampling on the feature map conv_decode3 using the max-pooling indices stored when generating the pooling feature map pool_2, obtaining the sparse feature map upsampling2;
step 5.3.2: fuse the sparse feature map upsampling2 with the convolution feature map conv_2_1 and the pooling feature map pool_1, obtaining the fusion feature map F_4;
step 5.3.3: according to the symmetry of the SegNet network, input the fusion feature map F_4 into the first two-layer convolution structure for convolution, obtaining the dense feature map conv_decode2;
step 5.4: performing the fifth feature fusion through the decoding structure to recover the edge information of the image;
step 5.4.1: perform 2× up-sampling on the feature map conv_decode2 using the max-pooling indices stored when generating the pooling feature map pool_1, obtaining the sparse feature map upsampling1;
step 5.4.2: fuse the sparse feature map upsampling1 with the convolution feature map conv_1_1, obtaining the fusion feature map F_5;
step 5.4.3: input the fusion feature map F_5 into the second two-layer convolution structure for convolution, obtaining the dense feature map conv_decode1;
step 6: inputting the dense feature map conv_decode1 into a Softmax layer to obtain the maximum probability of pixel classification in the image;
step 7: the cross-entropy loss function is calculated from the maximum probability of pixel classification in the image, and the convolution kernel parameters of each convolution layer and pooling layer in the coding and decoding structures are updated by stochastic gradient descent, thereby realizing the image segmentation.
2. An image segmentation method using feature fusion to enhance edge and detail information as defined in claim 1, wherein: the specific method of the step 1 is as follows:
step 1.1: scaling and cropping the images in the training data set so that the input images have a uniform size;
step 1.2: the resolution of the input image is fixed to 360×480.
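One simple way to realize the fixed 360×480 input resolution of step 1.2 is nearest-neighbour index mapping. The sketch below assumes a single-channel array; the patent does not specify the interpolation method, so this choice is illustrative:

```python
import numpy as np

def resize_nearest(img, out_h=360, out_w=480):
    """Rescale an image to the fixed 360x480 network input resolution
    by nearest-neighbour index mapping."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return img[rows][:, cols]
```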
3. An image segmentation method using feature fusion to enhance edge and detail information as defined in claim 1, wherein: the specific method for memorizing the maximum value of the pixels in the image and the positions thereof by adding the maximum pooling index during pooling in the step 2 is as follows:
For an input feature map X ∈ R^(h×w×c), where h and w are the height and width of the feature map and c is the number of channels, the pooled feature map Y ∈ R^((h/2)×(w/2)×c) is obtained through 2×2 non-overlapping max pooling, where the value of pixel (i, j) is given by:
Y_(i,j) = max{ X_(2i-1,2j-1), X_(2i-1,2j), X_(2i,2j-1), X_(2i,2j) }
The position corresponding to the maximum pixel value is recorded as (m_i, n_j), given by:
(m_i, n_j) = argmax_{u ∈ {2i-1,2i}, v ∈ {2j-1,2j}} X_(u,v)
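The 2×2 windowed maximum and its recorded position (m_i, n_j) described in this claim can be sketched in vectorized NumPy. This is an illustration only (single channel, 0-based indices rather than the claim's 1-based indices; function name assumed):

```python
import numpy as np

def max_pool_2x2_with_index(x):
    """2x2 non-overlapping max pooling. Returns the pooled map Y and the
    absolute (row, column) position of each window's maximum, i.e. the
    max-pooling index (m_i, n_j) in 0-based form."""
    h, w = x.shape
    # group pixels into (h/2, w/2) windows of shape 2x2
    windows = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    y = windows.max(axis=(2, 3))
    k = windows.reshape(h // 2, w // 2, 4).argmax(axis=-1)  # 0..3 in window
    m = 2 * np.arange(h // 2)[:, None] + k // 2   # row of the maximum
    n = 2 * np.arange(w // 2)[None, :] + k % 2    # column of the maximum
    return y, m, n
```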
4. A method of image segmentation using feature fusion to enhance edge and detail information as claimed in claim 3, wherein: the value of each pixel in the sparse feature map upsampling5 obtained in step 3 is given by:
Z_(u,v) = Y_(i,j) if (u, v) = (m_i, n_j), and Z_(u,v) = 0 otherwise,
where Z_(u,v) is the pixel value of pixel (u, v) in the sparse feature map upsampling5.
5. An image segmentation method using feature fusion to enhance edge and detail information as defined in claim 1, wherein: in step 4, the fusion process performs element-wise addition of the pixel values at corresponding positions in the feature maps.
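The fusion of claim 5 is element-wise addition of same-resolution feature maps. A short NumPy sketch; the variadic helper name is an assumption:

```python
import numpy as np

def fuse(*feature_maps):
    """Feature fusion as claimed: add pixel values at corresponding
    positions of same-resolution feature maps (e.g. fusing upsampling5
    with conv_5_1, conv_5_2, and pool_4 to form F_1)."""
    out = np.zeros_like(feature_maps[0])
    for f in feature_maps:
        out = out + f
    return out
```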
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911094462.3A CN111028235B (en) | 2019-11-11 | 2019-11-11 | Image segmentation method for enhancing edge and detail information by utilizing feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111028235A CN111028235A (en) | 2020-04-17 |
CN111028235B true CN111028235B (en) | 2023-08-22 |
Family
ID=70205321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911094462.3A Active CN111028235B (en) | 2019-11-11 | 2019-11-11 | Image segmentation method for enhancing edge and detail information by utilizing feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111028235B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10304193B1 (en) * | 2018-08-17 | 2019-05-28 | 12 Sigma Technologies | Image segmentation and object detection using fully convolutional neural network |
CN109903292A (en) * | 2019-01-24 | 2019-06-18 | 西安交通大学 | A kind of three-dimensional image segmentation method and system based on full convolutional neural networks |
CN110264483A (en) * | 2019-06-19 | 2019-09-20 | 东北大学 | A kind of semantic image dividing method based on deep learning |
Non-Patent Citations (1)
Title |
---|
A Survey of Research on Image Semantic Segmentation; Xiao Zhaoxia et al.; Software Guide; Vol. 17, No. 8; pp. 6-12 *
Also Published As
Publication number | Publication date |
---|---|
CN111028235A (en) | 2020-04-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||