CN110059698B - Semantic segmentation method and system based on edge dense reconstruction for street view understanding - Google Patents

Semantic segmentation method and system based on edge dense reconstruction for street view understanding

Info

Publication number
CN110059698B
CN110059698B (application number CN201910359119.0A)
Authority
CN
China
Prior art keywords
edge
features
feature
semantic segmentation
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910359119.0A
Other languages
Chinese (zh)
Other versions
CN110059698A (en
Inventor
陈羽中
林洋洋
柯逍
黄腾达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201910359119.0A priority Critical patent/CN110059698B/en
Publication of CN110059698A publication Critical patent/CN110059698A/en
Application granted granted Critical
Publication of CN110059698B publication Critical patent/CN110059698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a semantic segmentation method and a semantic segmentation system based on edge dense reconstruction for street view understanding. The method comprises the following steps: preprocessing the input images of the training set to standardize them and obtain preprocessed images of the same size; extracting general features with a convolutional network, then obtaining three-level context spatial pyramid fusion features, and using the two parts cascaded as the encoding network to extract encoding features; enlarging the encoding features to half the input size, obtaining edge features based on the convolutional network, and, combining the half-input-size encoding features, using a dense network that fuses the edge features as the decoding network to reconstruct the image resolution and obtain the decoding features; computing the semantic segmentation loss and the auxiliary edge loss, and training the deep neural network with the goal of minimizing their weighted sum; and performing semantic segmentation on the image to be segmented with the deep neural network model and outputting the segmentation result. The method and the system help improve the accuracy and robustness of image semantic segmentation.

Description

Semantic segmentation method and system based on edge dense reconstruction for street view understanding
Technical Field
The invention relates to the technical field of computer vision, in particular to a semantic segmentation method and a semantic segmentation system based on edge dense reconstruction for street view understanding.
Background
Image semantic segmentation is an important branch of computer vision in the field of artificial intelligence and an important step in image understanding for machine vision. Image semantic segmentation accurately classifies each pixel in an image into the category to which it belongs, so that the result is consistent with the visual content of the image; the image semantic segmentation task is therefore also called a pixel-level image classification task.
Because image semantic segmentation and image classification share certain similarities, various image classification networks, with their final fully connected layers removed, are often used interchangeably as backbone networks of image semantic segmentation networks. Larger-sized features are sometimes obtained by removing pooling layers in the backbone network or by replacing them with atrous (dilated) convolutions, and the semantic segmentation result is finally obtained with a convolution layer whose kernel size is 1. Compared with image classification, image semantic segmentation is more difficult, because determining the category of each pixel requires combining fine local information; the backbone network is therefore often used to extract more global features, and the shallow features in the backbone network are then combined to reconstruct the feature resolution and restore the original image size. Because the feature size first becomes smaller and then larger, the former part is often called the encoding network and the latter the decoding network. Meanwhile, in the encoding process, different receptive fields and scale information are often combined to better capture objects of different sizes, for example by atrous spatial pyramid pooling; however, this technique enlarges the spacing of the convolution kernel and ignores interior pixels, and it cannot combine more global context information to make up for its limited expressive power. In addition, existing semantic segmentation methods often restore the resolution in the decoding process simply from the features of the previous level and then combine shallow features of the corresponding size to compensate for the information lost during encoding, so the effective features produced during resolution reconstruction cannot be reused effectively, and the problem of blurred object boundaries after image resolution reconstruction is not addressed in a targeted manner.
Disclosure of Invention
The invention aims to provide a semantic segmentation method and a semantic segmentation system based on edge dense reconstruction for street view understanding, and the method and the system are favorable for improving the accuracy and the robustness of image semantic segmentation.
In order to achieve the above purpose, the technical solution of the invention is as follows: a semantic segmentation method based on edge dense reconstruction for street view understanding comprises the following steps:
Step A: preprocessing the input images of the training set, first subtracting the image mean from each input image to standardize it, and then randomly cropping to a uniform size to obtain preprocessed images of the same size;
Step B: extracting the general features F_backbone with a convolutional network, obtaining the three-level context spatial pyramid fusion feature F_tspp from F_backbone to capture multi-scale context information, and using the two cascaded parts as the encoding network to extract the encoding features F_encoder;
Step C: enlarging the encoding feature F_encoder to half the input image size to obtain the half-input-size encoding feature F_us, selecting intermediate-layer features F_mid^os from the convolutional network and computing edge features F_edge^os from them, and, combining the half-input-size encoding feature F_us, using a dense network that fuses the edge features F_edge^os as the decoding network to reconstruct the image resolution and compute the decoding features F_decoder;
Step D: using the decoding features F_decoder and the edge features F_edge^os to obtain a semantic segmentation probability map and edge probability maps respectively, computing edge image labels from the semantic image labels of the training set, computing the semantic segmentation loss and the auxiliary edge losses from the two kinds of probability maps and their corresponding labels, and training the whole deep neural network with the goal of minimizing their weighted sum;
Step E: performing semantic segmentation on the image to be segmented with the trained deep neural network model and outputting the segmentation result.
Further, in the step B, extracting the general features F_backbone with a convolutional network, obtaining the three-level context spatial pyramid fusion feature F_tspp from F_backbone, and using the two cascaded parts as the encoding network to extract the encoding features F_encoder comprises the following steps:
Step B1: extracting the general features F_backbone from the preprocessed image with a convolutional network;
Step B2: applying a 1 × 1 convolution to F_backbone for feature dimension reduction to obtain the feature F_1×1;
Step B3: applying average pooling to F_backbone over the whole image, restoring the original size with nearest-neighbor interpolation, and obtaining the image-level feature F_image with a 1 × 1 convolution;
Step B4: performing atrous convolution on F_backbone with a convolution kernel of dilation rate r_as to obtain the feature F_atrous^(r_as); then concatenating the three-level context features F_1×1, F_image and F_atrous^(r_as) and fusing them with a 1 × 1 convolution to obtain the three-level context fusion feature F_fuse^(r_as) for dilation rate r_as; during convolution, batch normalization is used to keep the input distribution consistent, and the linear rectification function (ReLU) is used as the activation function; the atrous convolution is computed as:
y_as[m_as] = Σ_(k_as) x_as[m_as + r_as · k_as] · w_as[k_as]
where y_as[m_as] denotes the output of the atrous convolution with dilation rate r_as at output coordinate m_as, x_as[m_as + r_as · k_as] denotes the input reference pixel of the input x_as corresponding to output coordinate m_as, dilation rate r_as and kernel coordinate k_as, and w_as[k_as] denotes the weight of the atrous convolution kernel at position k_as;
Step B5: repeating the above steps with different dilation rates until n_tspp fusion features are obtained, then splicing the n_tspp features F_fuse^(r) with F_image to obtain the three-level context spatial pyramid fusion feature F_tspp;
Step B6: reducing the dimension of the feature F_tspp with a 1 × 1 convolution, then applying dropout regularization to obtain the final encoding feature F_encoder.
Further, in the step C, enlarging the encoding feature F_encoder to half the input image size to obtain the half-input-size encoding feature F_us, selecting intermediate-layer features F_mid^os from the convolutional network, computing edge features F_edge^os, and, combining the half-input-size encoding feature F_us, using a dense network that fuses the edge features F_edge^os as the decoding network to reconstruct the image resolution and compute the decoding features F_decoder comprises the following steps:
Step C1: defining the ratio of the original input image size to a feature's size as that feature's output stride, and enlarging the encoding feature F_encoder with nearest-neighbor interpolation to obtain the feature map F_us with output stride 2;
Step C2: selecting the intermediate-layer feature F_mid^os with output stride os from the convolutional network used for extracting the general features, reducing its dimension with a 1 × 1 convolution, then enlarging with bilinear interpolation and multiplying to compute the edge feature F_edge^os;
Step C3: splicing the features F_us and F_edge^os, reducing the dimension with a 1 × 1 convolution, and then extracting features with a 3 × 3 convolution to obtain the decoding feature F_decoder;
Step C4: selecting an output stride os smaller than that used in step C2; if all output strides have been processed, the extraction of decoding features is finished; otherwise, splicing F_us and F_decoder as the new F_us and repeating steps C2 to C3.
Further, in the step D, using the decoding features F_decoder and the edge features F_edge^os to obtain a semantic segmentation probability map and edge probability maps respectively, computing edge image labels from the semantic image labels of the training set, computing the semantic segmentation loss and the auxiliary edge losses from the probability maps and their corresponding labels, and training the whole deep neural network with the goal of minimizing their weighted sum comprises the following steps:
Step D1: scaling the feature F_decoder and all edge features F_edge^os to the same size as the input image with bilinear interpolation, and obtaining the semantic segmentation probabilities and edge probabilities through a 1 × 1 convolution with softmax as the activation function; softmax is computed as:
σ_c = e^(γ_c) / Σ_(k=1..C) e^(γ_k)
where σ_c is the probability of class c, e is the natural constant, γ_c and γ_k denote the unactivated feature values of classes c and k respectively, and C is the total number of classes;
Step D2: one-hot encoding the semantic segmentation labels of the training set, and then computing the edge labels; the edge label is computed as:
y_edge(i, j, c) = sgn( Σ_((i_u, j_u) ∈ U_8(i, j)) | y_seg(i, j, c) - y_seg(i_u, j_u, c) | )
where y_edge(i, j, c) and y_seg(i, j, c) are the edge label and the one-hot semantic label of class c at coordinate (i, j), (i_u, j_u) ranges over the 8-neighborhood U_8 of coordinate (i, j), and sgn() is the sign function;
Step D3: computing the pixel-level cross entropy from the semantic segmentation and edge probability maps and their corresponding labels to obtain the semantic segmentation loss L_s and the auxiliary edge losses L_edge^os, and then computing the weighted sum loss L:
L = L_s + Σ_os α_os · L_edge^os
where L_edge^os is the loss value corresponding to the edge feature F_edge^os, and α_os is the weight of L_edge^os in the final loss;
finally, with the goal of minimizing the weighted sum loss L, the model parameters are updated iteratively by back propagation with a stochastic gradient descent optimizer to train the whole deep neural network and obtain the final deep neural network model.
The invention also provides a semantic segmentation system based on edge dense reconstruction for street view understanding, which comprises:
a preprocessing module, used for preprocessing the input images of the training set, including subtracting the image mean from each image to standardize it and randomly cropping to a uniform size to obtain preprocessed images of the same size;
an encoding feature extraction module, used for extracting the general features F_backbone with a convolutional network, obtaining the three-level context spatial pyramid fusion feature F_tspp from F_backbone to capture multi-scale context information, and using the two cascaded parts as the encoding network to extract the encoding features F_encoder;
a decoding feature extraction module, used for enlarging the encoding feature F_encoder to half the input image size to obtain the half-input-size encoding feature F_us, selecting intermediate-layer features F_mid^os from the convolutional network and computing edge features F_edge^os, and, combining the half-input-size encoding feature F_us, using a dense network that fuses the edge features F_edge^os as the decoding network to reconstruct the image resolution and extract the decoding features F_decoder;
a neural network training module, used for obtaining a semantic segmentation probability map and edge probability maps from the decoding features F_decoder and the edge features F_edge^os respectively, computing edge image labels from the semantic image labels of the training set, computing the semantic segmentation loss and the auxiliary edge losses from the probability maps and their corresponding labels, and training the whole deep neural network with the goal of minimizing their weighted sum to obtain the deep neural network model; and
a semantic segmentation module, used for performing semantic segmentation on the image to be segmented with the trained deep neural network model and outputting the segmentation result.
Compared with the prior art, the invention has the following beneficial effects. First, after the backbone network in the encoding network, three-level context spatial pyramid fusion features are used for multi-scale feature capture, and the internal and global features are exploited in a targeted way to optimize the original features of different receptive fields, enriching the expressive power of the encoding features. Then, in the decoding network, edge features derived from the intermediate-layer features and trained with auxiliary supervision are combined to specifically adjust the edge regions that tend to deviate during feature resolution reconstruction, improving the semantic segmentation between different objects, and the feature resolution is reconstructed in a dense-network manner so that the reconstructed features are better reused. Compared with the prior art, the method obtains stronger context expression capability after encoding, corrects the boundary ambiguity between objects more effectively by combining edge supervision in the decoding process, and exploits the feature reuse of the dense network structure to make the network easier to train, finally obtaining more accurate semantic segmentation results.
Drawings
FIG. 1 is a flow chart of a method implementation of an embodiment of the present invention.
Fig. 2 is a schematic system structure according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the embodiments.
The invention provides a semantic segmentation method based on edge dense reconstruction for street view understanding, which, as shown in figure 1, comprises the following steps:
Step A: preprocessing the input images of the training set, first subtracting the image mean from each image to standardize it, and then randomly cropping to a uniform size to obtain preprocessed images of the same size.
Step B: extracting the general features F_backbone with a general convolutional network, obtaining the three-level context spatial pyramid fusion feature F_tspp from F_backbone to capture multi-scale context information, and using the two parts cascaded in step B as the encoding network to extract the encoding features F_encoder; this specifically comprises the following steps:
Step B1: extracting the general features F_backbone from the preprocessed image with a general convolutional network (this embodiment adopts the Xception network provided in the DeepLabv3+ network);
Step B2: applying a 1 × 1 convolution to F_backbone for feature dimension reduction to obtain the feature F_1×1;
Step B3: applying average pooling to F_backbone over the whole image, restoring the original size with nearest-neighbor interpolation, and obtaining the image-level feature F_image with a 1 × 1 convolution;
Step B4: performing atrous convolution on F_backbone with a convolution kernel of dilation rate r_as to obtain the feature F_atrous^(r_as); then concatenating the three-level context features F_1×1, F_image and F_atrous^(r_as) and fusing them with a 1 × 1 convolution to obtain the three-level context fusion feature F_fuse^(r_as) for dilation rate r_as; during convolution, batch normalization is used to keep the input distribution consistent, and the linear rectification function (ReLU) is used as the activation function; the atrous convolution is computed as:
y_as[m_as] = Σ_(k_as) x_as[m_as + r_as · k_as] · w_as[k_as]
where y_as[m_as] denotes the output of the atrous convolution with dilation rate r_as at output coordinate m_as, x_as[m_as + r_as · k_as] denotes the input reference pixel of the input x_as corresponding to output coordinate m_as, dilation rate r_as and kernel coordinate k_as, and w_as[k_as] denotes the weight of the atrous convolution kernel at position k_as;
Step B5: repeating the above steps with different dilation rates until n_tspp fusion features are obtained (3 features in this embodiment, with dilation rates 6, 12 and 18 respectively), then splicing the n_tspp features F_fuse^(r) with F_image to obtain the three-level context spatial pyramid fusion feature F_tspp;
Step B6: reducing the dimension of the feature F_tspp with a 1 × 1 convolution, then applying dropout regularization to obtain the final encoding feature F_encoder.
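A minimal PyTorch-style sketch of steps B2 to B6 is given below; it is an assumption of how the three-level context spatial pyramid fusion could be realized, with module and variable names (ThreeLevelContextPyramid, f_1x1, the 0.5 dropout rate, etc.) chosen for illustration rather than taken from the patent:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeLevelContextPyramid(nn.Module):
    # Sketch: per dilation rate, fuse the 1x1-reduced feature, the image-level
    # feature and the atrous feature; then splice all fused features with the
    # image-level feature and project with dropout (steps B2-B6).
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        def conv_bn_relu(i, o, k, dilation=1):
            pad = dilation * (k - 1) // 2
            return nn.Sequential(
                nn.Conv2d(i, o, k, padding=pad, dilation=dilation, bias=False),
                nn.BatchNorm2d(o), nn.ReLU(inplace=True))
        self.reduce = conv_bn_relu(in_ch, out_ch, 1)        # step B2
        self.image_level = conv_bn_relu(in_ch, out_ch, 1)   # step B3 (1x1 after pooling + upsampling)
        self.atrous = nn.ModuleList([conv_bn_relu(in_ch, out_ch, 3, r) for r in rates])   # step B4
        self.fuse = nn.ModuleList([conv_bn_relu(3 * out_ch, out_ch, 1) for _ in rates])
        self.project = nn.Sequential(conv_bn_relu((len(rates) + 1) * out_ch, out_ch, 1),
                                     nn.Dropout2d(0.5))     # step B6

    def forward(self, f_backbone):
        f_1x1 = self.reduce(f_backbone)
        pooled = F.adaptive_avg_pool2d(f_backbone, 1)                               # whole-image pooling
        restored = F.interpolate(pooled, size=f_backbone.shape[2:], mode='nearest')  # restore size
        f_image = self.image_level(restored)                                         # image-level feature
        fused = [fuse(torch.cat([f_1x1, f_image, atr(f_backbone)], dim=1))
                 for atr, fuse in zip(self.atrous, self.fuse)]                        # step B4 per rate
        f_tspp = torch.cat(fused + [f_image], dim=1)                                  # step B5
        return self.project(f_tspp)                                                   # F_encoder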
Step C: enlarging the encoding feature F_encoder to half the input image size to obtain the half-input-size encoding feature F_us, selecting intermediate-layer features F_mid^os from the convolutional network and computing edge features F_edge^os, and, combining the half-input-size encoding feature F_us, using a dense network that fuses the edge features F_edge^os as the decoding network to reconstruct the image resolution and compute the decoding features F_decoder; this specifically comprises the following steps:
Step C1: defining the ratio of the original input image size to a feature's size as that feature's output stride, and enlarging the encoding feature F_encoder with nearest-neighbor interpolation to obtain the feature map F_us with output stride 2;
Step C2: selecting the intermediate-layer feature F_mid^os with output stride os from the convolutional network used for extracting the general features, reducing its dimension with a 1 × 1 convolution, then enlarging with bilinear interpolation and multiplying to compute the edge feature F_edge^os;
Step C3: splicing the features F_us and F_edge^os, reducing the dimension with a 1 × 1 convolution, and then extracting features with a 3 × 3 convolution to obtain the decoding feature F_decoder;
Step C4: selecting an output stride os smaller than that used in step C2; if all output strides have been processed, the extraction of decoding features is finished; otherwise, splicing F_us and F_decoder as the new F_us and repeating steps C2 to C3.
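A rough PyTorch-style sketch of one decoding stage (steps C2 and C3) follows. It is an assumption of one possible realization: in particular, the exact way the edge branch is built from the intermediate-layer feature is only partly specified above, so the 1 × 1 reduction followed by bilinear enlargement shown here, and all names, are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeDenseDecodeStage(nn.Module):
    # One edge-fused decoding stage: edge feature from an intermediate-layer
    # feature (step C2), then splice with the running dense feature stack and
    # refine with 1x1 + 3x3 convolutions (step C3).
    def __init__(self, mid_ch, edge_ch, dense_ch, out_ch):
        super().__init__()
        self.edge_reduce = nn.Conv2d(mid_ch, edge_ch, 1)          # step C2: 1x1 dimension reduction
        self.reduce = nn.Conv2d(dense_ch + edge_ch, out_ch, 1)    # step C3: 1x1 after splicing
        self.refine = nn.Conv2d(out_ch, out_ch, 3, padding=1)     # step C3: 3x3 feature extraction

    def forward(self, f_us, f_mid, target_size):
        f_edge = F.interpolate(self.edge_reduce(f_mid), size=target_size,
                               mode='bilinear', align_corners=False)            # step C2
        f_decoder = self.refine(self.reduce(torch.cat([f_us, f_edge], dim=1)))   # step C3
        return f_decoder, f_edge

# Step C4 (caller side, names assumed): for each remaining output stride os,
#   f_decoder, f_edge = stage(f_us, f_mid[os], f_us.shape[2:])
#   f_us = torch.cat([f_us, f_decoder], dim=1)   # dense reuse of reconstructed features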
Step D: using the decoding features F_decoder and the edge features F_edge^os to obtain a semantic segmentation probability map and edge probability maps respectively, computing edge image labels from the semantic image labels of the training set, computing the semantic segmentation loss and the auxiliary edge losses from the probability maps and their corresponding labels, and training the whole deep neural network with the goal of minimizing their weighted sum; this specifically comprises the following steps:
Step D1: scaling the feature F_decoder and all edge features F_edge^os to the same size as the input image with bilinear interpolation, and obtaining the semantic segmentation probabilities and edge probabilities through a 1 × 1 convolution with softmax as the activation function; softmax is computed as:
σ_c = e^(γ_c) / Σ_(k=1..C) e^(γ_k)
where σ_c is the probability of class c, e is the natural constant, γ_c and γ_k denote the unactivated feature values of classes c and k respectively, and C is the total number of classes;
Step D2: one-hot encoding the semantic segmentation labels of the training set, and then computing the edge labels; the edge label is computed as:
y_edge(i, j, c) = sgn( Σ_((i_u, j_u) ∈ U_8(i, j)) | y_seg(i, j, c) - y_seg(i_u, j_u, c) | )
where y_edge(i, j, c) and y_seg(i, j, c) are the edge label and the one-hot semantic label of class c at coordinate (i, j), (i_u, j_u) ranges over the 8-neighborhood U_8 of coordinate (i, j), and sgn() is the sign function;
and D3: respectively calculating the cross entropy of the pixel level by using probability graphs and corresponding labels of semantic segmentation and edges to obtain corresponding semantic segmentation loss L s And edge loss with assistance to supervision
Figure BDA0002046339370000083
The weight sum loss L is then calculated:
Figure BDA0002046339370000084
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0002046339370000085
as edge features
Figure BDA0002046339370000086
Corresponding loss value, α os Is composed of
Figure BDA0002046339370000087
The weight occupied in the final loss, α os Satisfy the requirement of
Figure BDA0002046339370000088
And each alpha os Equal;
and finally, updating the model parameters by utilizing back propagation iteration through a random gradient descent optimization method to train the whole deep neural network by minimizing weighting and loss L, so as to obtain a final deep neural network model.
Step E: performing semantic segmentation on the image to be segmented with the trained deep neural network model and outputting the segmentation result.
The invention also provides a semantic segmentation system for street view understanding, which is used for implementing the method, and as shown in fig. 2, the semantic segmentation system comprises:
the preprocessing module is used for preprocessing the input images of the training set, and comprises subtracting the image mean value of the images to standardize the images, and randomly shearing the images in uniform size to obtain preprocessed images in the same size;
a coding feature extraction module for extracting general features F by using a convolution network backbone Based on the general feature F backbone Obtaining three-level context space pyramid fusion characteristic F tspp Used for capturing multi-scale context information and then extracting coding features F by using the two parts which are cascaded as a coding network encoder
A decoding feature extraction module for enlarging the encoding feature F encoder The size is half of the size of the input image, and a half-input-size coding feature F is obtained us Selecting intermediate layer features from the convolutional network
Figure BDA0002046339370000089
Computing edge features
Figure BDA00020463393700000810
Combining half-input size coding features F us To fuse edge features
Figure BDA00020463393700000811
The dense network is a decoding network, image resolution reconstruction is carried out, and decoding characteristics F are extracted decoder
Neural network training module for using the decoded features F decoder And edge features
Figure BDA00020463393700000812
Respectively acquiring a semantic segmentation probability map and an edge probability map, calculating edge image labels by using semantic image labels in a training set, respectively calculating semantic segmentation loss and edge loss for auxiliary supervision by using the semantic segmentation probability map and the edge probability map and respective corresponding labels, and training the whole deep neural network by using minimum weighting and loss of the semantic segmentation probability map and the edge probability map as targets to obtain a deep neural network model; and
and the semantic segmentation module is used for performing semantic segmentation on the image to be segmented by using the trained deep neural network model and outputting a segmentation result.
The above are preferred embodiments of the present invention; all changes that are made according to the technical solution of the present invention and produce equivalent functional effects without exceeding the scope of the technical solution belong to the protection scope of the present invention.

Claims (3)

1. A semantic segmentation method based on edge dense reconstruction for street view understanding, characterized by comprising the following steps:
Step A: preprocessing the input images of the training set, first subtracting the image mean from each input image to standardize it, and then randomly cropping to a uniform size to obtain preprocessed images of the same size;
Step B: extracting the general features F_backbone with a convolutional network, obtaining the three-level context spatial pyramid fusion feature F_tspp from F_backbone to capture multi-scale context information, and then extracting the encoding features F_encoder;
Step C: enlarging the encoding feature F_encoder to half the input image size to obtain the half-input-size encoding feature F_us, selecting intermediate-layer features F_mid^os from the convolutional network and computing edge features F_edge^os, and, combining the half-input-size encoding feature F_us, using a dense network that fuses the edge features F_edge^os as the decoding network to reconstruct the image resolution and compute the decoding features F_decoder;
Step D: using the decoding features F_decoder and the edge features F_edge^os to obtain a semantic segmentation probability map and edge probability maps respectively, computing edge image labels from the semantic image labels of the training set, computing the semantic segmentation loss and the auxiliary edge losses from the probability maps and their corresponding labels, and training the whole deep neural network with the goal of minimizing their weighted sum;
Step E: performing semantic segmentation on the image to be segmented with the trained deep neural network model and outputting the segmentation result;
in the step B, extracting the general features F_backbone with a convolutional network, obtaining the three-level context spatial pyramid fusion feature F_tspp from F_backbone, and then extracting the encoding features F_encoder comprises the following steps:
Step B1: extracting the general features F_backbone from the preprocessed image with a convolutional network;
Step B2: applying a 1 × 1 convolution to F_backbone for feature dimension reduction to obtain the feature F_1×1;
Step B3: applying average pooling to F_backbone over the whole image, restoring the original size with nearest-neighbor interpolation, and obtaining the image-level feature F_image with a 1 × 1 convolution;
Step B4: performing atrous convolution on F_backbone with a convolution kernel of dilation rate r_as to obtain the feature F_atrous^(r_as); then concatenating the three-level context features F_1×1, F_image and F_atrous^(r_as) and fusing them with a 1 × 1 convolution to obtain the three-level context fusion feature F_fuse^(r_as) for dilation rate r_as; during convolution, batch normalization is used to keep the input distribution consistent, and the linear rectification function (ReLU) is used as the activation function; the atrous convolution is computed as:
y_as[m_as] = Σ_(k_as) x_as[m_as + r_as · k_as] · w_as[k_as]
where y_as[m_as] denotes the output of the atrous convolution with dilation rate r_as at output coordinate m_as, x_as[m_as + r_as · k_as] denotes the input reference pixel of the input x_as corresponding to output coordinate m_as, dilation rate r_as and kernel coordinate k_as, and w_as[k_as] denotes the weight of the atrous convolution kernel at position k_as;
Step B5: repeating the above steps with different dilation rates until n_tspp fusion features are obtained, then splicing the n_tspp features F_fuse^(r) with F_image to obtain the three-level context spatial pyramid fusion feature F_tspp;
Step B6: reducing the dimension of the feature F_tspp with a 1 × 1 convolution, then applying dropout regularization to obtain the final encoding feature F_encoder;
in the step C, enlarging the encoding feature F_encoder to half the input image size to obtain the half-input-size encoding feature F_us, selecting intermediate-layer features F_mid^os from the convolutional network, computing edge features F_edge^os, and, combining the half-input-size encoding feature F_us, using a dense network that fuses the edge features F_edge^os as the decoding network to reconstruct the image resolution and compute the decoding features F_decoder comprises the following steps:
Step C1: defining the ratio of the original input image size to a feature's size as that feature's output stride, and enlarging the encoding feature F_encoder with nearest-neighbor interpolation to obtain the feature map F_us with output stride 2;
Step C2: selecting the intermediate-layer feature F_mid^os with output stride os from the convolutional network used for extracting the general features, reducing its dimension with a 1 × 1 convolution, then enlarging with bilinear interpolation and multiplying to compute the edge feature F_edge^os;
Step C3: splicing the features F_us and F_edge^os, reducing the dimension with a 1 × 1 convolution, and then extracting features with a 3 × 3 convolution to obtain the decoding feature F_decoder;
Step C4: selecting an output stride os smaller than that used in step C2; if all output strides have been processed, the extraction of decoding features is finished; otherwise, splicing F_us and F_decoder as the new F_us and repeating steps C2 to C3.
2. The method of claim 1, wherein in the step D, using the decoding features F_decoder and the edge features F_edge^os to obtain a semantic segmentation probability map and edge probability maps respectively, computing edge image labels from the semantic image labels of the training set, computing the semantic segmentation loss and the auxiliary edge losses from the probability maps and their corresponding labels, and training the whole deep neural network with the goal of minimizing their weighted sum comprises the following steps:
Step D1: scaling the feature F_decoder and all edge features F_edge^os to the same size as the input image with bilinear interpolation, and obtaining the semantic segmentation probabilities and edge probabilities through a 1 × 1 convolution with softmax as the activation function; softmax is computed as:
σ_c = e^(γ_c) / Σ_(k=1..C) e^(γ_k)
where σ_c is the probability of class c, e is the natural constant, γ_c and γ_k denote the unactivated feature values of classes c and k respectively, and C is the total number of classes;
Step D2: one-hot encoding the semantic segmentation labels of the training set, and then computing the edge labels; the edge label is computed as:
y_edge(i, j, c) = sgn( Σ_((i_u, j_u) ∈ U_8(i, j)) | y_seg(i, j, c) - y_seg(i_u, j_u, c) | )
where y_edge(i, j, c) and y_seg(i, j, c) are the edge label and the one-hot semantic label of class c at coordinate (i, j), (i_u, j_u) ranges over the 8-neighborhood U_8 of coordinate (i, j), and sgn() is the sign function;
Step D3: computing the pixel-level cross entropy from the semantic segmentation and edge probability maps and their corresponding labels to obtain the semantic segmentation loss L_s and the auxiliary edge losses L_edge^os, and then computing the weighted sum loss L:
L = L_s + Σ_os α_os · L_edge^os
where α_os is the weight of L_edge^os in the final loss;
finally, with the goal of minimizing the weighted sum loss L, the model parameters are updated iteratively by back propagation with a stochastic gradient descent optimizer to train the whole deep neural network and obtain the final deep neural network model.
3. A semantic segmentation system based on edge dense reconstruction for street view understanding for implementing the method of claim 1, comprising:
the preprocessing module is used for preprocessing the input images of the training set, and comprises subtracting the image mean value of the images to standardize the images, and randomly shearing the images in uniform size to obtain preprocessed images in the same size;
a coding feature extraction module for extracting general features F by using a convolution network backbone Based on the general feature F backbone Obtaining three-level context space pyramid fusion characteristic F tspp For capturing multi-scale context information and then extracting coding features F encoder
A decoding feature extraction module for enlarging the encoding feature F encoder The size is half of the size of the input image, and a half-input-size coding feature F is obtained us Selecting intermediate layer features from the convolutional network
Figure FDA0003807705320000041
Computing edge features
Figure FDA0003807705320000042
Combining half-input size coding features F us To fuse edge features
Figure FDA0003807705320000043
The dense network is a decoding network, image resolution reconstruction is carried out, and decoding characteristics F are extracted decoder
Neural network training module for using the decoding feature F decoder And edge features
Figure FDA0003807705320000044
Respectively obtaining a semantic segmentation probability map and an edge probability map, calculating edge image labels by using the semantic image labels in a training set, and utilizing the semantic segmentation probability map, the edge probability map and the respective semantic segmentation probability mapCorresponding labels are respectively calculated to obtain semantic segmentation loss and edge loss for auxiliary supervision, and the whole deep neural network is trained by taking minimum weighting and loss of the semantic segmentation loss and the edge loss for auxiliary supervision as targets to obtain a deep neural network model; and
and the semantic segmentation module is used for performing semantic segmentation on the image to be segmented by utilizing the trained deep neural network model and outputting a segmentation result.
CN201910359119.0A 2019-04-30 2019-04-30 Semantic segmentation method and system based on edge dense reconstruction for street view understanding Active CN110059698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910359119.0A CN110059698B (en) 2019-04-30 2019-04-30 Semantic segmentation method and system based on edge dense reconstruction for street view understanding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910359119.0A CN110059698B (en) 2019-04-30 2019-04-30 Semantic segmentation method and system based on edge dense reconstruction for street view understanding

Publications (2)

Publication Number Publication Date
CN110059698A CN110059698A (en) 2019-07-26
CN110059698B true CN110059698B (en) 2022-12-23

Family

ID=67321810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910359119.0A Active CN110059698B (en) 2019-04-30 2019-04-30 Semantic segmentation method and system based on edge dense reconstruction for street view understanding

Country Status (1)

Country Link
CN (1) CN110059698B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110517278B (en) * 2019-08-07 2022-04-29 北京旷视科技有限公司 Image segmentation and training method and device of image segmentation network and computer equipment
CN110598846B (en) * 2019-08-15 2022-05-03 北京航空航天大学 Hierarchical recurrent neural network decoder and decoding method
CN110599514B (en) * 2019-09-23 2022-10-04 北京达佳互联信息技术有限公司 Image segmentation method and device, electronic equipment and storage medium
CN110895814B (en) * 2019-11-30 2023-04-18 南京工业大学 Aero-engine hole-finding image damage segmentation method based on context coding network
CN113051983B (en) * 2019-12-28 2022-08-23 中移(成都)信息通信科技有限公司 Method for training field crop disease recognition model and field crop disease recognition
CN111341438B (en) * 2020-02-25 2023-04-28 中国科学技术大学 Image processing method, device, electronic equipment and medium
CN111429473B (en) * 2020-02-27 2023-04-07 西北大学 Chest film lung field segmentation model establishment and segmentation method based on multi-scale feature fusion
CN111340047B (en) * 2020-02-28 2021-05-11 江苏实达迪美数据处理有限公司 Image semantic segmentation method and system based on multi-scale feature and foreground and background contrast
CN112150478B (en) * 2020-08-31 2021-06-22 温州医科大学 Method and system for constructing semi-supervised image segmentation framework
CN112700462A (en) * 2020-12-31 2021-04-23 北京迈格威科技有限公司 Image segmentation method and device, electronic equipment and storage medium
CN113128353B (en) * 2021-03-26 2023-10-24 安徽大学 Emotion perception method and system oriented to natural man-machine interaction
CN113706545B (en) * 2021-08-23 2024-03-26 浙江工业大学 Semi-supervised image segmentation method based on dual-branch nerve discrimination dimension reduction
CN114627086B (en) * 2022-03-18 2023-04-28 江苏省特种设备安全监督检验研究院 Crane surface damage detection method based on characteristic pyramid network
CN115953394B (en) * 2023-03-10 2023-06-23 中国石油大学(华东) Ocean mesoscale vortex detection method and system based on target segmentation
CN116978011B (en) * 2023-08-23 2024-03-15 广州新华学院 Image semantic communication method and system for intelligent target recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10095977B1 (en) * 2017-10-04 2018-10-09 StradVision, Inc. Learning method and learning device for improving image segmentation and testing method and testing device using the same
CN109241972A (en) * 2018-08-20 2019-01-18 电子科技大学 Image, semantic dividing method based on deep learning
CN109509192A (en) * 2018-10-18 2019-03-22 天津大学 Merge the semantic segmentation network in Analysis On Multi-scale Features space and semantic space

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Pyramid Context Contrast for Semantic Segmentation; YuZhong Chen; IEEE Access; 2019-11-27; full text *
Research on semantic segmentation algorithms for small objects based on deep neural networks; Hu Tai; China Master's Theses Full-text Database; 2019-01-15; full text *

Also Published As

Publication number Publication date
CN110059698A (en) 2019-07-26

Similar Documents

Publication Publication Date Title
CN110059698B (en) Semantic segmentation method and system based on edge dense reconstruction for street view understanding
CN110059768B (en) Semantic segmentation method and system for fusion point and region feature for street view understanding
CN110059769B (en) Semantic segmentation method and system based on pixel rearrangement reconstruction and used for street view understanding
CN111047551B (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN109446992B (en) Remote sensing image building extraction method and system based on deep learning, storage medium and electronic equipment
CN110322495B (en) Scene text segmentation method based on weak supervised deep learning
CN110889449A (en) Edge-enhanced multi-scale remote sensing image building semantic feature extraction method
CN109919830B (en) Method for restoring image with reference eye based on aesthetic evaluation
CN115797931A (en) Remote sensing image semantic segmentation method based on double-branch feature fusion
CN113850825A (en) Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN110992270A (en) Multi-scale residual attention network image super-resolution reconstruction method based on attention
CN113888550B (en) Remote sensing image road segmentation method combining super-resolution and attention mechanism
CN110070091A (en) The semantic segmentation method and system rebuild based on dynamic interpolation understood for streetscape
CN111340047B (en) Image semantic segmentation method and system based on multi-scale feature and foreground and background contrast
CN111414923B (en) Indoor scene three-dimensional reconstruction method and system based on single RGB image
CN109886159B (en) Face detection method under non-limited condition
CN112700418B (en) Crack detection method based on improved coding and decoding network model
CN112232351A (en) License plate recognition system based on deep neural network
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN114283285A (en) Cross consistency self-training remote sensing image semantic segmentation network training method and device
CN114463340B (en) Agile remote sensing image semantic segmentation method guided by edge information
CN111985372A (en) Remote sensing image water body extraction system for deep learning
CN114998671A (en) Visual feature learning device based on convolution mask, acquisition device and storage medium
CN112800851B (en) Water body contour automatic extraction method and system based on full convolution neuron network
Lu et al. Edge-reinforced convolutional neural network for road detection in very-high-resolution remote sensing imagery

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant