CN112396607A - Streetscape image semantic segmentation method for deformable convolution fusion enhancement - Google Patents

Streetscape image semantic segmentation method for deformable convolution fusion enhancement

Info

Publication number
CN112396607A
CN112396607A (application CN202011291950.6A); granted as CN112396607B
Authority
CN
China
Prior art keywords
feature
image
semantic segmentation
fusion
module
Prior art date
Legal status
Granted
Application number
CN202011291950.6A
Other languages
Chinese (zh)
Other versions
CN112396607B (en)
Inventor
张珣
秦晓海
刘宪圣
张浩轩
江东
张迎春
付晶莹
Current Assignee
Institute of Geographic Sciences and Natural Resources of CAS
Beijing Technology and Business University
Original Assignee
Institute of Geographic Sciences and Natural Resources of CAS
Beijing Technology and Business University
Priority date
Filing date
Publication date
Application filed by Institute of Geographic Sciences and Natural Resources of CAS and Beijing Technology and Business University
Priority to CN202011291950.6A
Publication of CN112396607A
Application granted
Publication of CN112396607B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/10 Image analysis: segmentation; edge detection
    • G06F 18/241 Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2431 Pattern recognition: classification techniques relating to the number of classes; multiple classes
    • G06F 18/253 Pattern recognition: fusion techniques of extracted features
    • G06N 3/045 Neural networks: combinations of networks
    • G06N 3/08 Neural networks: learning methods
    • G06T 5/30 Image enhancement or restoration using local operators: erosion or dilatation, e.g. thinning
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 2207/20081 Indexing scheme for image analysis or image enhancement: training; learning
    • G06T 2207/20084 Indexing scheme for image analysis or image enhancement: artificial neural networks [ANN]
    • G06T 2207/20221 Indexing scheme for image analysis or image enhancement: image fusion; image merging
    • Y02T 10/40 Climate change mitigation technologies related to transportation: engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a streetscape image semantic segmentation method with deformable convolution fusion enhancement, comprising a training stage and a testing stage. A streetscape image semantic segmentation deep neural network model is constructed so that the network obtains more small-target feature information while segmenting large target objects in the streetscape image. This alleviates the problems of small-scale target loss and discontinuous segmentation in streetscape image semantic segmentation, improves the image segmentation effect, gives the model better overall robustness, and yields higher streetscape image processing accuracy.

Description

Streetscape image semantic segmentation method for deformable convolution fusion enhancement
Technical Field
The invention belongs to the technical field of computer vision, relates to an image processing technology, and particularly relates to a streetscape image semantic segmentation method based on deformable convolution fusion enhancement.
Background
Image semantic segmentation is an important branch of computer vision in the field of artificial intelligence and an important link in understanding and analyzing images in machine vision. Image semantic segmentation accurately classifies each pixel in an image into the category to which it belongs, so that the result is consistent with the visual content of the image; the image semantic segmentation task is therefore also called a pixel-level image classification task. At present, semantic segmentation is widely applied to scenarios such as automatic driving and unmanned aerial vehicle landing point judgment.
Convolutional neural networks have been successful in image classification, localization, and scene understanding. With the proliferation of tasks such as augmented reality and autonomous driving, many researchers have turned their attention to scene understanding, one of whose main steps is semantic segmentation, i.e., the classification of each pixel in a given image. Semantic segmentation is of great significance in mobile and robotics-related applications.
Unlike image classification, image semantic segmentation is more difficult, because the class of each pixel must be determined by combining detailed local information with global context information. A backbone network is therefore commonly used to extract global features, after which the feature resolution is reconstructed by combining shallow features from the backbone network to restore the original image size. The resolution of the feature maps thus first decreases and then increases; the former part is generally called the encoding network and the latter the decoding network. Classical semantic segmentation methods include the Fully Convolutional Network (FCN) and the DeepLab series of networks, which achieve good pixel accuracy, mean pixel accuracy and mean intersection over union on road scene segmentation databases. These conventional networks downsample, and the upsampling recovers the small-sized high-dimensional feature maps so that pixel-wise predictions can be made and the classification of each point obtained. Although the FCN performs upsampling, the lost information cannot be recovered completely without loss; on this basis, the DeepLab series adds dilated (atrous) convolution to enlarge the receptive field, which alleviates but does not fully control the information loss. These methods therefore affect the accuracy of image semantic segmentation due to information loss, and the segmentation effect is especially poor for small target objects.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a streetscape image semantic segmentation method with deformable convolution fusion enhancement. A streetscape image semantic segmentation deep neural network model is constructed so that the network obtains more small-target feature information while segmenting large targets in the streetscape image, thereby alleviating the problems of small-scale target loss and discontinuous segmentation in streetscape image semantic segmentation, giving the model better overall robustness, higher streetscape image processing accuracy and an improved image segmentation effect.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a deformable convolution fusion enhanced streetscape image semantic segmentation method is characterized by comprising a training stage and a testing stage, and comprises the following steps:
1) Constructing an image training set: the training set comprises original images and the corresponding semantic label images, and this group of image data is input into the constructed network to take part in training.
1_1) Select N original street view images and the corresponding semantic segmentation label gray-scale images to form the training set, and denote the n-th original street view image in the training set as {Jn(i, j)} and the semantic segmentation label image corresponding to {Jn(i, j)} as {J̄n(i, j)}. The original street view images are RGB color images and the corresponding label images are gray-scale images; N is a positive integer with N ≥ 500; n is a positive integer with 1 ≤ n ≤ N; (i, j) is the coordinate position of a pixel point in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {Jn(i, j)} and H denotes its height; Jn(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {Jn(i, j)}, and J̄n(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {J̄n(i, j)}. Meanwhile, in order to evaluate the designed model properly, the true segmentation label image corresponding to each original street view image and its semantic label image in the training set is used as the training target. Then the semantic segmentation label gray-scale image corresponding to each original street view image in the training set is processed into a one-hot encoded image by the one-hot encoding technique. In a specific implementation, the street view image objects are divided into 19 classes, and the set formed by the true semantic segmentation label images corresponding to the original street view images is denoted as the training target set.
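For illustration, the one-hot encoding step described above can be sketched as follows with NumPy; the 19-class setting follows the text, while the function and variable names are assumptions rather than the patent's own implementation:

```python
import numpy as np

NUM_CLASSES = 19  # street view object classes, as stated above

def one_hot_encode(label_gray: np.ndarray, num_classes: int = NUM_CLASSES) -> np.ndarray:
    """Convert an (H, W) gray-scale label image whose pixel values are class
    indices 0..num_classes-1 into an (H, W, num_classes) one-hot encoded image."""
    h, w = label_gray.shape
    one_hot = np.zeros((h, w, num_classes), dtype=np.float32)
    valid = label_gray < num_classes                  # ignore pixels outside the 19 classes
    rows, cols = np.nonzero(valid)
    one_hot[rows, cols, label_gray[rows, cols]] = 1.0
    return one_hot

# example: a 512x1024 label image filled with random class indices
dummy_label = np.random.randint(0, NUM_CLASSES, size=(512, 1024))
encoded = one_hot_encode(dummy_label)                 # shape (512, 1024, 19)
```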
2) Constructing and training the streetscape image semantic segmentation deep neural network model: the streetscape image semantic segmentation deep neural network comprises an input layer, a hidden layer and an output layer. The hidden layer comprises two parts of a multi-resolution fusion enhancement network that introduces deformable convolution: the first stage is the first part, and the second and third stages are the second part. In the first stage, a deformable convolution is connected in series with an Xception Module to form a fusion sub-network A, and sub-network A is connected in series and repeated three times, so that more deep semantic feature information can be obtained. The Xception Module is the basic residual module of the semantic segmentation network DeepLabV3+. The second stage is a two-branch parallel network; each branch is a subnet composed of three Xception Modules connected in series and is used for feature extraction and feature fusion. The third stage is a three-branch parallel network; each branch is a subnet composed of three Xception Modules connected in series, with the same function as in the second stage. Concretely, the first stage consists of three repetitions of a deformable convolution module connected in series with an Xception Module, where all convolution kernels are 3×3, and the Xception Module in this invention consists of 3 convolution layers of 3×3 depthwise separable convolutions with stride 1 and padding 1. The method comprises the following steps:
2_1) the input layer of the street view image semantic segmentation deep convolutional neural network model is used for receiving R, G, B three-channel components of an original input image and outputting the components to a hidden layer;
for the input layer, the input end of the input layer receives R, G, B three-channel components of an original input image with width W and height H, and the output end of the input layer outputs R, G, B three-channel components of the original input image to the hidden layer;
2_2) the first stage of the hidden layer comprises a fusion enhancement Module constructed by connecting a deformable convolution Module with an Xception Module in series, three fusion enhancement modules are repeatedly stacked, and a plurality of feature maps are generated in sequence through the three fusion enhancement modules;
the hidden layer first stage is composed of three fusion enhancement modules, and each fusion enhancement Module is mainly composed of a deformable convolution Module and a lightweight Xception Module connected in series. The position information of the spatial sampling is further adjusted in the module by adopting a deformable convolution unit, and the displacement can be obtained by learning in a target task without an additional supervision signal. The offset added in the deformable convolution unit is a part of a deep convolution neural network model structure segmented by street view image semantics, in an input feature map, an original part obtains a part of feature regions through a sliding window, and after the deformable convolution is introduced, the original convolution network is divided into two paths to share the feature map. One of the two paths uses a parallel standard convolution to learn offset, and meanwhile, the gradient back propagation can also normally learn end to end, thereby ensuring the successful integration of the deformable convolution. After learning of the offset is added, the size and the position of the deformable convolution kernel can be dynamically adjusted according to the image content which needs to be identified currently, and the visual effect is that the positions of sampling points of the convolution kernels at different positions can be changed in a self-adaptive mode according to the image content, so that the method is suitable for geometric deformation such as the shape, the size and the like of different objects in the image content. The concrete expression is as follows: the displacement required by the deformable convolution is obtained through the output of a parallel standard convolution, and then the displacement is acted on a convolution kernel to achieve the effect of the deformable convolution. When the picture pixels are integrated, the pixels need to be subjected to offset operation, the generation of the offset can generate a floating point type, the offset needs to be converted into an integer type, the inverse propagation cannot be carried out if the offset is directly rounded, and at the moment, a bilinear difference value mode is adopted to obtain the corresponding pixels.
The convolution involves two steps: 1) sampling on the input feature map x using a regular grid R; 2) a weighted sum of the sampled values with weights w. The grid R defines the size and dilation of the receptive field, as in equation (1):
R={(-1,-1),(-1,0),...,(0,1),(1,1)} (1)
which defines a 3×3 convolution kernel with dilation 1.
The definition of convolution is: for each pixel position p0 in the output, the ordinary convolution is computed as in equation (2):

y(p0) = Σ_{pk ∈ R} w(pk) · x(p0 + pk)    (2)

where y(p0) is the feature map value corresponding to position p0; p0 is each position on the output feature map y; x is the input feature map; R represents the receptive-field grid, here 3×3 as an example; and pk enumerates the positions in R. In the deformable convolution, the regular grid R is augmented with offsets {Δpk | k = 1, ..., K}, where K = |R|. Equation (2) then becomes equation (3):

y(p1) = Σ_{pk ∈ R} w(pk) · x(p1 + pk + Δpk)    (3)

where p1 is each position on the output feature map y after the deformable convolution is incorporated, y(p1) is the deformed feature map value corresponding to p1, and Δpk are the offsets in the x and y directions (2K offset channels in total).
Denote the original image pixel values by V. The original convolution process is split into two paths: the first path learns the offsets in the x and y directions and outputs H × W × 2K values, where K = |R| is the number of pixels in the grid and 2K covers the offsets in both the x and y directions; the image pixel values at this point are denoted U. Once the offsets exist, the pixel-value indices of the image in U are added to V, and for each convolution window the window is no longer a regular sliding window but a translated one, the computation otherwise being consistent with ordinary convolution. The sampling positions thereby become irregular; since the offset Δpk is usually fractional, the corresponding values are computed by bilinear interpolation, as shown in equation (4):

x(p) = Σ_q G(q, p) · x(q)    (4)

where p denotes an arbitrary (fractional) position on the feature map (for equation (3), p = p1 + pk + Δpk), q enumerates all integral spatial positions in the feature map x, and G(·, ·) is the bilinear interpolation kernel.
After all pixel positions are obtained, a new feature map M is obtained, and M is fed into the Xception Module as input data.
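As an illustration of the deformable convolution unit just described (a parallel standard convolution learning the 2K offset channels, with fractional sampling positions resolved by bilinear interpolation), a minimal PyTorch sketch could look as follows; it relies on torchvision.ops.DeformConv2d, and the class and variable names are assumptions rather than the patent's own implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvUnit(nn.Module):
    """3x3 deformable convolution: a parallel standard convolution predicts the
    offsets (2K channels for K = 9 sampling points), which shift the sampling
    grid of the main convolution; fractional positions are handled internally
    by bilinear interpolation."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 2 * 3 * 3 = 18 offset channels: an x- and a y-shift for each of the K = 9 taps
        self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset_conv.weight)        # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)                  # (B, 18, H, W), learned end to end
        return self.deform_conv(x, offsets)            # same spatial size as the input

# example: an RGB street view image of size 512x1024
feat = DeformableConvUnit(3, 64)(torch.randn(1, 3, 512, 1024))   # -> (1, 64, 512, 1024)
```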
The Xception Module serves as the basic residual structure of DeepLabV3+. Its residual learning unit extracts features through 3×3 depthwise separable convolutions, which perform the convolution channel by channel and then point by point; compared with ordinary convolution, the number of parameters and the computation cost are lower, which is the main reason for introducing the Xception Module. When the input and output features have different dimensions (numbers of feature-map channels), the dimension of the input feature is first adjusted by a 1×1 convolution and then added to (fused with) the output of the residual learning unit to obtain the final feature map. When the input and output features have the same dimension, the input feature is added to (fused with) the output feature map of the residual learning unit to obtain the finally extracted features. The Xception Module combines the idea of depthwise separable convolution with the Bottleneck residual structure of the basic residual module (a 1×1 convolution, a 3×3 convolution and another 1×1 convolution, where the 1×1 convolutions adjust the feature dimension and the 3×3 convolution extracts features): the depthwise separable convolution splits the standard convolution into a channel-wise convolution and a spatial convolution to reduce the parameters of model training, and the residual structure eliminates the gradient explosion problem caused by deepening the network hierarchy.
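A lightweight Xception-style residual block along the lines described here can be sketched as follows; this is an illustrative reconstruction assuming three 3×3 depthwise separable convolutions with stride 1 and padding 1 and a 1×1 convolution on the shortcut when the channel counts differ, since the exact layer configuration and normalization are not spelled out by the patent:

```python
import torch
import torch.nn as nn

class SeparableConv3x3(nn.Module):
    """Depthwise 3x3 followed by pointwise 1x1: channel-by-channel then
    point-by-point convolution, cheaper than a standard 3x3 convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=1, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class XceptionModule(nn.Module):
    """Residual learning unit built from three separable convolutions; the
    shortcut is adjusted with a 1x1 convolution when dimensions differ."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            SeparableConv3x3(in_ch, out_ch),
            SeparableConv3x3(out_ch, out_ch),
            SeparableConv3x3(out_ch, out_ch),
        )
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))

    def forward(self, x):
        return self.body(x) + self.shortcut(x)   # add (fuse) input and residual output
```

Splitting each 3×3 convolution into a depthwise and a pointwise part is what keeps the parameter count and computation cost low, as argued above.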
For the first stage, the input end of the 1st fusion enhancement Module (deformable convolution in series with an Xception Module) receives the R, G and B channel components of the original input image output by the input layer, and its output end outputs the generated feature maps, whose set is denoted R1. The input end of the 2nd fusion enhancement Module receives R1, the output of the 1st fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R2. The input end of the 3rd fusion enhancement Module receives R2, the output of the 2nd fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R3. Each feature map in R3 has width W, height H and C channels, so R3 can be written as (H, W, C). The first stage contains a feature extraction branch: after the deformable convolution in series with the Xception Module fusion enhancement Module is repeated 3 times, down-sampling operations with strides 1 and 2 are performed respectively, yielding two new feature map sets denoted R4 and R5, where each feature map in R4 is (H, W, C) and each feature map in R5 is (H/2, W/2, 2C).
2_3) The second and third stages constitute the second part of the hidden layer. High-resolution features are kept throughout this part of the network, and information is continuously exchanged among the multi-resolution features, so that the network has good semantic expression capability while maintaining high spatial resolution. The specifics are as follows:
after the first stage, the second stage generates two parallel networks S1And S2,S1Is composed of 3 lightweight Xception modules connected in series. The input feature layer and the output feature layer of each Xception Module have the same width and height, S1Input terminal receiving R4All characteristic maps of1The output end of (2) outputs the generated feature map, and the set of feature map is denoted as R6Wherein R is6Each feature map in (a) is (H, W, C); s2Is composed of 3 lightweight Xscene modules in series, the input feature layer and the output feature layer of each Xscene Module have the same width and height, S2Input terminal receiving R5All characteristic maps of2The output end outputs the generated feature map, and the feature map set is recorded as R7Wherein R is7Each characteristic map in (H/2, W/2, 2C); two parallel networks S passing through the second stage1And S2Respectively carrying out down-sampling operations with the step length of 1 and 2 to obtain five new feature diagram sets which are respectively marked as R8、R9、R10R11And R12. Wherein R is8Each feature map in (H, W, C), R9Each characteristic diagram in (H/2, W/2,2C), R10Each characteristic diagram in (H/2, W/2,2C), R11Each characteristic diagram in (H/4, W/4,4C), R12Each feature map in (H/4, W/4, 4C).
After the second stage, the third stage generates three parallel networks S3, S4 and S5. S3 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height. R7 is partially up-sampled to obtain a new feature map set denoted R13, where each feature map in R13 is (H, W, C). Meanwhile the information fusion layer fuses the feature information of R8 and R13, and the set of feature maps generated by the fusion is denoted R14, where each feature map in R14 is (H, W, C). The input end of S3 receives all feature maps in R14, and its output end outputs the generated feature maps, whose set is denoted R15, where each feature map in R15 is (H, W, C). S4 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height; the information fusion layer fuses the feature information of R9 and R10, and the fused set is denoted R16, where each feature map in R16 is (H/2, W/2, 2C). The input end of S4 receives all feature maps in R16, and its output end outputs the generated feature maps, whose set is denoted R17, where each feature map in R17 is (H/2, W/2, 2C). S5 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height; the information fusion layer fuses the feature information of R11 and R12, and the fused set is denoted R18, where each feature map in R18 is (H/4, W/4, 4C). The input end of S5 receives all feature maps in R18, and its output end outputs the generated feature maps, whose set is denoted R19, where each feature map in R19 is (H/4, W/4, 4C). At the end of the third stage, the feature maps R17 and R19 generated by subnets S4 and S5 are up-sampled to the same size and scale as R15 generated by subnet S3; the results are denoted R20 and R21 respectively. R15, R20 and R21 are then input into the feature fusion layer for feature information fusion, generating a new feature map set R22, where each feature map in R22 is (H, W, C).
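The fusion at the end of the third stage (up-sampling R17 and R19 to the scale of R15 and fusing the three sets into R22) can be illustrated with the following sketch; the 1×1 channel-alignment convolutions and the element-wise addition used as the fusion operation are assumptions, since the patent does not spell out the exact fusion operator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThirdStageFusion(nn.Module):
    """Up-sample R17 (2C, H/2, W/2) and R19 (4C, H/4, W/4) to the scale of
    R15 (C, H, W), align their channels with 1x1 convolutions, and fuse the
    three maps by element-wise addition into R22 of shape (C, H, W)."""
    def __init__(self, c: int):
        super().__init__()
        self.reduce17 = nn.Conv2d(2 * c, c, kernel_size=1)
        self.reduce19 = nn.Conv2d(4 * c, c, kernel_size=1)

    def forward(self, r15, r17, r19):
        h, w = r15.shape[-2:]
        r20 = F.interpolate(self.reduce17(r17), size=(h, w), mode="bilinear", align_corners=False)
        r21 = F.interpolate(self.reduce19(r19), size=(h, w), mode="bilinear", align_corners=False)
        return r15 + r20 + r21                 # feature-information fusion -> R22

fusion = ThirdStageFusion(c=32)
r22 = fusion(torch.randn(1, 32, 128, 256),     # R15: (C, H, W)
             torch.randn(1, 64, 64, 128),      # R17: (2C, H/2, W/2)
             torch.randn(1, 128, 32, 64))      # R19: (4C, H/4, W/4) -> r22: (1, 32, 128, 256)
```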
The output layer is composed of 1 convolutional layer. Its input end receives the feature map set R22, and its output end outputs the semantic segmentation prediction map corresponding to the original input image; each semantic segmentation prediction map has width W, height H and C channels.
2_4) The original street view images {Jn(i, j)} in the training set and the corresponding semantic label images (semantic segmentation label gray-scale images) are used as original input images and fed into the constructed street view image semantic segmentation deep neural network model for training, obtaining a semantic segmentation prediction map corresponding to each original street view image in the training set; the set of semantic segmentation prediction maps corresponding to each original street view image {Jn(i, j)} is denoted {Pn(i, j)}.
2_5) The value of the loss function between the set {Pn(i, j)} of semantic segmentation prediction maps corresponding to the original street view images in the training set and the set of one-hot encoded images obtained from the corresponding true semantic segmentation images is computed and recorded as Lossn. In a specific implementation, Lossn is obtained with the categorical cross-entropy.
2_6) Steps 2_4) and 2_5) are repeated M times to obtain the deep neural network classification training model, giving M × N loss function values. The smallest of these M × N loss function values is then found, and the weight vector and bias term corresponding to that smallest loss value are taken as the optimal weight vector and optimal bias term of the deep neural network classification training model, denoted Wbest and bbest respectively. This completes the training of the streetscape image semantic segmentation deep neural network classification model.
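Steps 2_4) to 2_6) can be summarized in a compressed training-loop sketch; PyTorch's CrossEntropyLoss takes class-index labels directly, which is equivalent to the categorical cross-entropy on the one-hot encoding, and the model, data loader and optimizer settings below are placeholders rather than values from the patent:

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs_m: int):
    """Repeat the forward/backward pass M times over the training set and
    retain the parameters (Wbest, bbest) of the lowest observed loss value."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                   # categorical cross-entropy
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_loss, best_state = float("inf"), None

    for epoch in range(num_epochs_m):
        for image, label in train_loader:               # label: (B, H, W) class indices
            image, label = image.to(device), label.to(device)
            logits = model(image)                        # (B, 19, H, W) prediction maps
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:                  # track the minimum loss value
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)                    # Wbest and bbest
    return model
```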
3) Testing the model: the test set is input into the trained model for testing.
3_1) Let {J′(i′, j′)} represent a road scene image to be semantically segmented, i.e., the test set, where 1 ≤ i′ ≤ W′ and 1 ≤ j′ ≤ H′, W′ denotes the width of {J′(i′, j′)}, H′ denotes its height, and J′(i′, j′) denotes the pixel value of the pixel point at coordinate (i′, j′) in {J′(i′, j′)};
3_2) The R channel, G channel and B channel components of {J′(i′, j′)} are input into the trained streetscape image semantic segmentation deep neural network classification model, and a prediction is made using Wbest and bbest to obtain the predicted semantic segmentation image corresponding to {J′(i′, j′)}, denoted {P′(i′, j′)}, where P′(i′, j′) denotes the pixel value of the pixel point at coordinate (i′, j′) in {P′(i′, j′)}.
Through the steps, the image semantic segmentation enhanced by the deformable convolution fusion is realized.
Compared with the prior art, the invention has the beneficial effects that:
1) A lightweight Xception Module is introduced to replace the Bottleneck module in conventional network models. The Xception Module borrows the design idea of the Bottleneck, continuously deepening the network model through the residual learning unit to extract rich semantic features, and replaces the standard convolutions in the Bottleneck with depthwise separable convolutions, which reduces the model parameters and the computation cost while maintaining accuracy. At the same time, the multi-scale fusion of the network works better: after feature extraction and fusion by these modules, the interaction between high and low resolutions yields better output results.
2) The deep neural network constructed by the method adopts a high-resolution fused parallel network to reduce the loss of feature information throughout the network; by keeping the high resolution unchanged and fusing low-resolution feature map information throughout the process, effective depth information is retained to the greatest extent.
3) In the deep neural network constructed by the method, deformable convolution is integrated into the first stage of the hidden layer, so that the network model has better deformation modeling capability while maintaining high-resolution features during feature extraction; this alleviates the problems of small-scale target loss and discontinuous segmentation in semantic segmentation and gives the model better overall robustness.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
FIG. 2 is a block diagram of the structure of a street view image semantic segmentation neural network model constructed by the method of the present invention.
FIG. 3 is a schematic diagram of a frame of a street view image semantic segmentation neural network model according to the method of the present invention.
FIG. 4 shows a street view image to be semantically segmented, the corresponding real semantic segmentation image, and the predicted semantic segmentation image obtained according to an embodiment of the present invention;
wherein (a) is the selected street view image to be semantically segmented; (b) is the real semantic segmentation image corresponding to the street view image shown in (a); and (c) is the predicted semantic segmentation image obtained by applying the method to the street view image shown in (a).
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a streetscape image semantic segmentation method with deformable convolution fusion enhancement, which constructs a streetscape image semantic segmentation deep neural network model so that the network obtains more small-target feature information while segmenting large target objects in the streetscape image; this alleviates the problems of small-scale target loss and discontinuous segmentation in streetscape image semantic segmentation, gives the model better overall robustness, yields higher streetscape image processing accuracy and improves the image segmentation effect.
The general implementation block diagram of the streetscape image semantic segmentation method based on deformable convolution fusion enhancement provided by the invention is shown in fig. 1, and the method comprises a training stage and a testing stage.
FIG. 2 is a block diagram of the structure of a street view image semantic segmentation neural network model constructed by the method of the present invention. FIG. 3 is a schematic diagram of a frame of a street view image semantic segmentation neural network model according to the method of the present invention. The method for realizing the street view image semantic segmentation enhanced by the deformable convolution fusion mainly comprises the following steps:
1) First, the original image is input into the first deformable convolution layer in the first stage of the network for feature extraction (a high-resolution feature map);
2) the output initial features are input into the first Xception Module to obtain a deeper feature map;
3) steps 1) and 2) are repeated 3 times, i.e., each deformable convolution module is immediately followed by an Xception Module, extracting deep-level features repeatedly while enlarging the receptive field;
4) down-sampling operations with strides 1 and 2 are performed respectively; one part keeps the high resolution, and the other part, a lower-resolution map in parallel, is input into the 3 Xception Modules repeated in each branch of the second stage;
5) after feature information fusion in the feature fusion layer, the features are input again into the 3 Xception Modules repeated in each branch of the third stage, down-sampling operations with strides 1 and 2 are performed, then up-sampling and feature fusion are carried out, and a high-resolution feature map is output;
6) finally, after one convolution, the number of channels of the output features is adjusted to the number of categories to be segmented, and activation through the classifier function yields the predicted segmented image.
In specific implementation, the streetscape image semantic segmentation neural network model training stage process of the method comprises the following specific steps:
1. Constructing an image training set: select N original street view images and the corresponding semantic segmentation label gray-scale images to form the training set, and denote the n-th original street view image in the training set as {Jn(i, j)} and the semantic segmentation label image corresponding to {Jn(i, j)} as {J̄n(i, j)}. The original street view images are RGB color images and the corresponding label images are gray-scale images; N is a positive integer with N ≥ 500, e.g. 1000; n is a positive integer with 1 ≤ n ≤ N; (i, j) is the coordinate position of a pixel point in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {Jn(i, j)} and H denotes its height, e.g. W = 1024 and H = 512; Jn(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {Jn(i, j)}, and J̄n(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {J̄n(i, j)}. Meanwhile, in order to evaluate the designed model properly, the true segmentation label images corresponding to the original street view images and the corresponding semantic segmentation label gray-scale images in the training set are used as the training targets.
Here the original street view images are taken directly from the Cityscapes dataset, i.e. the 2975 training images of the public Cityscapes dataset.
2. Constructing the deep neural network: the deep neural network comprises an input layer, a hidden layer and an output layer. The hidden layer consists of two parts: the high-resolution input part of the network, formed by three repeated deformable convolution modules each connected in series with an Xception Module, and a two-stage multi-branch parallel fused Xception Module network.
2_1 for the input layer, the input end of the input layer receives R, G, B three-channel components of an original input image, and the output end of the input layer outputs the R-channel component, the G-channel component and the B-channel component of the original input image to the hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be W, and the height is required to be H;
2_2 In the first part of the hidden layer, a fusion module is constructed by connecting the deformable convolution module in series with the Xception Module; this fusion module is repeatedly stacked three times, and a number of feature maps are generated in sequence by the three fusion enhancement modules;
the hidden layer first stage is composed of three fusion enhancement modules, and each fusion enhancement Module is mainly composed of a deformable convolution Module and a lightweight Xception Module connected in series. After all pixel positions are obtained through the first deformable convolution, a new picture M is obtained, and the new picture M is used as input data and is input into the Xception Module.
The first stage is the first part. For the first stage, the input end of the 1st fusion enhancement Module (deformable convolution in series with an Xception Module) receives the R, G and B channel components of the original input image output by the input layer, and its output end outputs the generated feature maps, whose set is denoted R1. The input end of the 2nd fusion enhancement Module receives R1, the output of the 1st fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R2. The input end of the 3rd fusion enhancement Module receives R2, the output of the 2nd fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R3. Each feature map in R3 has width W, height H and C channels, so R3 can be written as (H, W, C). The first stage contains a feature extraction branch: after the deformable convolution in series with the Xception Module fusion enhancement Module is repeated 3 times, down-sampling operations with strides 1 and 2 are performed respectively, yielding two new feature map sets denoted R4 and R5, where each feature map in R4 is (H, W, C) and each feature map in R5 is (H/2, W/2, 2C).
2_3 The second and third stages form the second part of the hidden layer. High-resolution features are kept throughout this part of the network, and information is continuously exchanged among the multi-resolution features, so that the network has good semantic expression capability while maintaining high spatial resolution. The specifics are as follows:
After the first stage, the second stage generates two parallel networks S1 and S2. S1 consists of 3 lightweight Xception Modules connected in series, where each Xception Module consists of 3 convolution layers of 3×3 depthwise separable convolutions with stride 1 and padding 1. The input and output feature layers of each Xception Module have the same width and height. The input end of S1 receives all feature maps in R4, and its output end outputs the generated feature maps, whose set is denoted R6, where each feature map in R6 is (H, W, C). S2 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height. The input end of S2 receives all feature maps in R5, and its output end outputs the generated feature maps, whose set is denoted R7, where each feature map in R7 is (H/2, W/2, 2C). The two parallel networks S1 and S2 of the second stage then perform down-sampling operations with strides 1 and 2 respectively, yielding five new feature map sets denoted R8, R9, R10, R11 and R12, where each feature map in R8 is (H, W, C), each in R9 is (H/2, W/2, 2C), each in R10 is (H/2, W/2, 2C), each in R11 is (H/4, W/4, 4C), and each in R12 is (H/4, W/4, 4C).
After the second stage, the third stage generates three parallel networks S3, S4 and S5. S3 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height. R7 is partially up-sampled to obtain a new feature map set denoted R13, where each feature map in R13 is (H, W, C). Meanwhile the information fusion layer fuses the feature information of R8 and R13, and the set of feature maps generated by the fusion is denoted R14, where each feature map in R14 is (H, W, C). The input end of S3 receives all feature maps in R14, and its output end outputs the generated feature maps, whose set is denoted R15, where each feature map in R15 is (H, W, C). S4 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height; the information fusion layer fuses the feature information of R9 and R10, and the fused set is denoted R16, where each feature map in R16 is (H/2, W/2, 2C). The input end of S4 receives all feature maps in R16, and its output end outputs the generated feature maps, whose set is denoted R17, where each feature map in R17 is (H/2, W/2, 2C). S5 consists of 3 lightweight Xception Modules connected in series, and the input and output feature layers of each Xception Module have the same width and height; the information fusion layer fuses the feature information of R11 and R12, and the fused set is denoted R18, where each feature map in R18 is (H/4, W/4, 4C). The input end of S5 receives all feature maps in R18, and its output end outputs the generated feature maps, whose set is denoted R19, where each feature map in R19 is (H/4, W/4, 4C). At the end of the third stage, the feature maps R17 and R19 generated by subnet S4 and subnet S5 need to be up-sampled to the same size and scale as R15 generated by subnet S3; the results are denoted R20 and R21 respectively. R15, R20 and R21 are then input into the feature fusion layer for feature information fusion, generating a new feature map set R22, where each feature map in R22 is (H, W, C).
For the output layer, which is composed of 1 convolutional layer, the input end of the output layer receives the feature map set R22The output end of the output layer outputs a semantic segmentation prediction graph corresponding to the original input image; wherein, the width of each semantic segmentation prediction graph is W, and the height of each semantic segmentation prediction graph is H.
2_4 The original street view images and the corresponding semantic segmentation label gray-scale images in the training set are used as original input images and fed into the deep neural network for training, obtaining a semantic segmentation prediction map corresponding to each original street view image in the training set; the set of semantic segmentation prediction maps corresponding to each original street view image {Jn(i, j)} is denoted {Pn(i, j)}.
2_5 The value of the loss function between the set {Pn(i, j)} of semantic segmentation prediction maps corresponding to the original street view images in the training set and the set of one-hot encoded images obtained from the corresponding true semantic segmentation images is computed and recorded as Lossn; Lossn is obtained using the categorical cross-entropy.
2_6 Steps 2_4 and 2_5 are repeated M times to obtain the deep neural network classification training model, giving M × N loss function values. The smallest of these M × N loss function values is then found, and the weight vector and bias term corresponding to that smallest loss value are taken as the optimal weight vector and optimal bias term of the deep neural network classification training model, denoted Wbest and bbest respectively. In this example, M is 484.
3. Testing the model: the test set is input into the trained model for testing. The specific steps of the test stage are as follows:
3_1 Let {J′(i′, j′)} represent a road scene image to be semantically segmented, where 1 ≤ i′ ≤ W′ and 1 ≤ j′ ≤ H′, W′ denotes the width of {J′(i′, j′)}, H′ denotes its height, and J′(i′, j′) denotes the pixel value of the pixel point at coordinate (i′, j′) in {J′(i′, j′)};
3_2 The R channel, G channel and B channel components of {J′(i′, j′)} are input into the trained deep neural network classification model, and a prediction is made using Wbest and bbest to obtain the predicted semantic segmentation image corresponding to {J′(i′, j′)}, denoted {P′(i′, j′)}, where P′(i′, j′) denotes the pixel value of the pixel point at coordinate (i′, j′) in {P′(i′, j′)}.
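Step 3_2 can be illustrated with a short inference sketch: the RGB test image is passed through the trained model carrying the best weights, and the per-pixel argmax over the 19 class scores gives the predicted semantic segmentation image; the file path and preprocessing here are placeholders rather than details from the patent:

```python
import numpy as np
import torch
from PIL import Image

@torch.no_grad()
def predict(model, image_path: str) -> np.ndarray:
    """Return an (H', W') array of predicted class indices for one test image."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    rgb = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32) / 255.0
    x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0).to(device)   # (1, 3, H', W')
    logits = model(x)                                  # (1, 19, H', W') class scores per pixel
    return logits.argmax(dim=1).squeeze(0).cpu().numpy()                 # predicted label map

# pred = predict(trained_model, "cityscapes/test/sample.png")
```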
The feasibility and effectiveness of the method of the invention is further verified below.
A deep neural network architecture is built using the Python-based deep learning library PyTorch 1.2. The Cityscapes test set is adopted to analyse the segmentation effect of the street view images predicted by the method. The segmentation performance of the predicted semantic segmentation images is evaluated with 3 objective parameters commonly used to evaluate semantic segmentation methods, namely Mean Intersection over Union (MIoU), Pixel Accuracy (PA) and Mean Pixel Accuracy (MPA), whose definitions are given below.
Definition 1: MIoU (Mean Intersection over Union) is the standard metric for semantic segmentation. It computes the ratio of the intersection and the union of the predicted and ground-truth sets. With k + 1 classes and pij denoting the number of pixels of class i predicted as class j, the formula is as follows:

MIoU = (1 / (k + 1)) · Σ_{i=0}^{k} pii / (Σ_{j=0}^{k} pij + Σ_{j=0}^{k} pji − pii)

Definition 2: Pixel Accuracy (PA) represents the proportion of correctly labelled pixels among all pixels, as shown in the following equation:

PA = Σ_{i=0}^{k} pii / Σ_{i=0}^{k} Σ_{j=0}^{k} pij

Definition 3: Mean Pixel Accuracy (MPA) is an improvement of PA; it computes the proportion of correctly classified pixels within each class and then averages over all classes, as follows:

MPA = (1 / (k + 1)) · Σ_{i=0}^{k} pii / Σ_{j=0}^{k} pij
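The three evaluation indices can be computed from a single confusion matrix accumulated over the test set, as in the following sketch; the 19-class setting follows the text, while the ignore value of 255 is a convention of the Cityscapes tooling rather than something stated in the patent:

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, k: int = 19, ignore: int = 255) -> np.ndarray:
    """Accumulate a (k, k) matrix: rows are ground-truth classes, columns are predictions."""
    mask = gt != ignore
    idx = k * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=k * k).reshape(k, k)

def metrics(cm: np.ndarray) -> dict:
    tp = np.diag(cm).astype(float)
    pa = tp.sum() / cm.sum()                                   # pixel accuracy
    per_class_acc = tp / np.maximum(cm.sum(axis=1), 1)
    mpa = per_class_acc.mean()                                 # mean pixel accuracy
    iou = tp / np.maximum(cm.sum(axis=1) + cm.sum(axis=0) - tp, 1)
    return {"PA": pa, "MPA": mpa, "MIoU": iou.mean()}          # mean intersection over union

# illustrative use (predict and load_label are assumed helpers):
# cm = sum(confusion_matrix(predict(model, p), load_label(p)) for p in test_images)
# print(metrics(cm))
```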
the method is utilized to predict each street view image in the Cityscapes test set to obtain a predicted semantic segmentation image corresponding to each street view image, and the higher the average cross union reflecting the semantic segmentation effect of the method is than the value of MIoU, the higher the pixel accuracy PA and the higher the average pixel accuracy MPA, the higher the effectiveness and the higher the prediction accuracy are, wherein the average cross union is shown in Table 1.
TABLE 1 mIoU values of the method of the invention on the Cityscapes dataset
As can be seen from the data listed in Table 1, the street view images segmented by the method of the present invention show a good segmentation effect, which indicates that it is feasible and effective to obtain the predicted semantic segmentation image corresponding to a street view image with the method of the present invention. The mean intersection over union MIoU, pixel accuracy PA and mean pixel accuracy MPA of the method are shown in Table 2, and the results show that the segmentation effect of the method ranks at the front of existing segmentation models.
TABLE 2 computational Performance on the Cityscapes dataset
In FIG. 4, (a) shows the selected street view image to be semantically segmented; (b) shows the real semantic segmentation image corresponding to the street view image shown in (a); and (c) shows the predicted semantic segmentation image obtained by applying the method of the present invention to the street view image shown in (a). Comparing (b) and (c) in FIG. 4 shows that the predicted semantic segmentation image obtained by the method of the present invention has higher segmentation precision and is close to the real semantic segmentation image.
It is noted that the disclosed examples are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (7)

1. A deformable convolution fusion enhanced streetscape image semantic segmentation method comprises a training stage and a testing stage, and specifically comprises the following steps:
1) constructing an image training set, wherein the image training set comprises original street view images and the corresponding semantic label images;
selecting N original street view images and the corresponding semantic segmentation label gray-scale images, i.e. semantic label images, to form the image training set, N being a positive integer; recording the n-th original street view image in the training set as {Jn(i, j)} and the semantic segmentation label image corresponding to {Jn(i, j)} as {J̄n(i, j)}, where n is a positive integer with 1 ≤ n ≤ N; (i, j) is the coordinate position of a pixel point in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, W denoting the width of {Jn(i, j)} and H denoting its height; Jn(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {Jn(i, j)}, and J̄n(i, j) denotes the pixel value of the pixel point at coordinate (i, j) in {J̄n(i, j)}; recording the true segmentation label image corresponding to the semantic label image as the training target, then processing it into a one-hot encoded image, and denoting the set formed by these one-hot encoded images as the training target set;
2) Constructing and training a streetscape image semantic segmentation deep neural network model:
the streetscape image semantic segmentation deep neural network comprises an input layer, a hidden layer and an output layer; the hidden layer is a multi-resolution fusion enhancement network introducing deformable convolution and comprises a first stage, a second stage and a third stage;
the first stage connects a deformable convolution in series with an Xception Module to form a fusion sub-network A, and sub-network A is repeated three times in series to obtain deeper semantic feature information; the Xception Module is the basic residual module of the semantic segmentation network DeepLabV3+;
the second stage is a two-branch parallel network; each branch is a subnet composed of three Xception Modules connected in series and is used for feature extraction and feature fusion; the third stage is a three-branch parallel network; each branch is a subnet composed of three Xception Modules connected in series;
2_1) the input layer of the street view image semantic segmentation deep convolutional neural network model is used for receiving R, G, B three-channel components of an original input image and outputting the components to a hidden layer;
2_2) the first stage of the hidden layer comprises fusion enhancement Modules, each constructed by connecting a deformable convolution Module in series with an Xception Module; the three fusion enhancement Modules generate a series of feature maps in sequence; the offsets required by the deformable convolution are obtained from the output of a parallel standard convolution and are then applied to the sampling positions of the convolution kernel to realize the deformable convolution;
in the deformable convolution, the regular sampling grid R is augmented with offsets {Δp_k | k = 1, ..., K}, where K = |R|, as expressed by formula (3):

y(p_1) = Σ_{p_k ∈ R} w(p_k) · x(p_1 + p_k + Δp_k)    (3)

wherein p_1 is each position on the output feature map y after the deformable convolution, y(p_1) is the value of the deformed feature map at position p_1, w(p_k) is the convolution weight at the k-th sampling position p_k of the grid R, x is the input feature map, and Δp_k is the learned offset at the k-th position, with components in both the x and y directions (2K offset values in total);
since the offsets Δp_k are generally fractional, bilinear interpolation is then used to compute the feature values at the offset sampling positions from the surrounding integer positions; once the values at all sampling positions have been obtained, a new feature map M is produced, and M is input into the Xception Module as its input data;
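A minimal PyTorch sketch of this offset-then-sample step is given below; it is an illustrative reconstruction rather than the patented implementation, and it assumes torchvision's deform_conv2d operator, which performs the bilinear sampling of formula (3) internally:

    import torch
    import torch.nn as nn
    from torchvision.ops import deform_conv2d

    class DeformableConvBlock(nn.Module):
        """Deformable 3x3 convolution: a parallel standard conv predicts the
        2*K offsets (K = 9 sampling points), which shift the kernel's sampling
        grid before the main convolution is applied."""
        def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
            super().__init__()
            k = kernel_size * kernel_size
            # parallel standard convolution that outputs the offsets (delta p_k in x and y)
            self.offset_conv = nn.Conv2d(in_ch, 2 * k, kernel_size, padding=padding)
            nn.init.zeros_(self.offset_conv.weight)   # start from the regular grid
            nn.init.zeros_(self.offset_conv.bias)
            # weights of the deformable convolution itself
            self.weight = nn.Parameter(torch.empty(out_ch, in_ch, kernel_size, kernel_size))
            nn.init.kaiming_uniform_(self.weight, a=1)
            self.padding = padding

        def forward(self, x):
            offset = self.offset_conv(x)              # (N, 2K, H, W)
            # deform_conv2d samples x at p_1 + p_k + delta p_k with bilinear interpolation
            return deform_conv2d(x, offset, self.weight, padding=self.padding)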
the residual learning unit of the Xception Module performs channel-by-channel (depthwise) and point-by-point (pointwise) convolution and feature extraction through depthwise-separable convolutions, yielding a feature map;
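A compact sketch of such a depthwise-separable residual unit follows; the exact block layout is an assumption, and the 1×1 projection for mismatched channel dimensions follows claim 4:

    import torch.nn as nn

    class SeparableConv(nn.Module):
        """Depthwise (channel-by-channel) 3x3 conv followed by pointwise 1x1 conv."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
            self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
            self.bn = nn.BatchNorm2d(out_ch)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.bn(self.pointwise(self.depthwise(x))))

    class XceptionResidualUnit(nn.Module):
        """Three stacked separable convolutions with a residual connection;
        a 1x1 convolution adjusts the input dimension when it differs from the output."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                SeparableConv(in_ch, out_ch),
                SeparableConv(out_ch, out_ch),
                SeparableConv(out_ch, out_ch),
            )
            self.skip = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()

        def forward(self, x):
            return self.body(x) + self.skip(x)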
for the first stage, the input end of the 1st fusion enhancement Module (a deformable convolution connected in series with an Xception Module) receives the channel components of the original input image output by the input layer, and its output end outputs the generated feature map set R1; the input end of the 2nd fusion enhancement Module receives R1 and its output end outputs the generated feature map set R2; the input end of the 3rd fusion enhancement Module receives R2 and its output end outputs the generated feature map set R3, whose feature maps have size (H, W, C), where C is the number of channels;
after the first stage has extracted features by repeating the deformable-convolution-in-series-with-Xception-Module fusion enhancement Module three times, down-sampling operations are performed to obtain two new feature map sets, denoted R4 and R5 respectively, wherein each feature map in R4 has size (H, W, C) and each feature map in R5 has size (H/2, W/2, 2C);
2_3) the second stage and the third stage of the hidden layer exchange information among the multi-resolution features, so that the network maintains a high spatial resolution while retaining good semantic expression capability;
the second stage generates two parallel networks S1 and S2; S1 is composed of 3 lightweight Xception Modules connected in series; the input end of S1 receives R4 and its output end outputs the generated feature map set R6, wherein each feature map in R6 has size (H, W, C); S2 is composed of 3 lightweight Xception Modules connected in series; the input end of S2 receives R5 and its output end outputs the generated feature map set R7, wherein each feature map in R7 has size (H/2, W/2, 2C); the two parallel networks S1 and S2 then perform down-sampling operations respectively to obtain five new feature map sets, denoted R8, R9, R10, R11 and R12, wherein each feature map in R8 has size (H, W, C), each feature map in R9 has size (H/2, W/2, 2C), each feature map in R10 has size (H/2, W/2, 2C), each feature map in R11 has size (H/4, W/4, 4C), and each feature map in R12 has size (H/4, W/4, 4C);
the third stage generates three parallel networks S3, S4 and S5; S3 is composed of 3 lightweight Xception Modules connected in series; R7 is partially up-sampled to obtain a new feature map set denoted R13, wherein each feature map in R13 has size (H, W, C); R8 and R13 are fused in the feature information layer, and the feature map set generated by the fusion is denoted R14, wherein each feature map in R14 has size (H, W, C); the input end of S3 receives R14 and its output end outputs the generated feature map set R15, wherein each feature map in R15 has size (H, W, C); S4 is composed of 3 lightweight Xception Modules connected in series; R9 and R10 are fused in the feature information layer, and the fusion generates the feature map set R16, wherein each feature map in R16 has size (H/2, W/2, 2C); the input end of S4 receives R16 and its output end outputs the generated feature map set R17, wherein each feature map in R17 has size (H/2, W/2, 2C); S5 is composed of 3 lightweight Xception Modules connected in series; R11 and R12 are fused in the feature information layer, and the generated feature map set is denoted R18, wherein each feature map in R18 has size (H/4, W/4, 4C); the input end of S5 receives R18 and its output end outputs the generated feature map set R19, wherein each feature map in R19 has size (H/4, W/4, 4C); at the end of the third stage, the feature maps R17 and R19 generated by S4 and S5 are up-sampled to the same size as R15, and the up-sampled sets are denoted R20 and R21 respectively; R15, R20 and R21 are then input into a feature fusion layer for feature information fusion, generating a new feature map set R22, wherein each feature map in R22 has size (H, W, C);
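A simplified sketch of this cross-resolution exchange is given below; it is illustrative only, and the fusion operator (element-wise addition), the 1×1 channel-reduction convolutions, and the example channel width C = 32 are assumptions made to match the sizes stated above:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def fuse(*feature_maps):
        """Feature-information-layer fusion: resize every map to the first map's
        spatial size and sum them (element-wise addition is assumed here)."""
        target = feature_maps[0].shape[-2:]
        out = feature_maps[0]
        for f in feature_maps[1:]:
            if f.shape[-2:] != target:
                f = F.interpolate(f, size=target, mode='bilinear', align_corners=False)
            out = out + f
        return out

    # Third-stage fusion, following the sizes in the claim:
    # R15: (H, W, C), R17: (H/2, W/2, 2C), R19: (H/4, W/4, 4C)
    N, C, H, W = 1, 32, 128, 256
    r15 = torch.randn(N, C, H, W)
    r17 = torch.randn(N, 2 * C, H // 2, W // 2)
    r19 = torch.randn(N, 4 * C, H // 4, W // 4)

    # 1x1 convolutions bring the up-sampled branches back to C channels before fusion
    to_c_from_2c = nn.Conv2d(2 * C, C, 1)
    to_c_from_4c = nn.Conv2d(4 * C, C, 1)
    r20 = F.interpolate(to_c_from_2c(r17), size=(H, W), mode='bilinear', align_corners=False)
    r21 = F.interpolate(to_c_from_4c(r19), size=(H, W), mode='bilinear', align_corners=False)
    r22 = fuse(r15, r20, r21)          # (N, C, H, W)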
the output layer is composed of 1 convolution layer; the input end of the output layer receives the feature map set R22, and the output end of the output layer outputs the semantic segmentation prediction map corresponding to the original input image, wherein each semantic segmentation prediction map has width W, height H and C channels;
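For illustration only, the output layer can be a single convolution that maps the fused feature maps to per-class score maps; the 1×1 kernel size, the example channel width C = 32 carried over from the sketch above, and the use of the 19 Cityscapes classes of claim 2 are assumptions:

    import torch.nn as nn

    NUM_CLASSES = 19                      # street-scene classes, per claim 2 (assumed here)
    C = 32                                # example channel width, matching the sketch above
    output_layer = nn.Conv2d(C, NUM_CLASSES, kernel_size=1)
    # logits = output_layer(r22)          # (N, NUM_CLASSES, H, W): one score map per class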
2_4) inputting the image training set into the constructed street view image semantic segmentation deep neural network model for training to obtain the semantic segmentation prediction map corresponding to each original street view image, and denoting the set formed by the semantic segmentation prediction maps as {P_n(i,j)};
2_5) calculating the loss function value Loss between {P_n(i,j)} and the corresponding set of one-hot coded images {Y_n^one(i,j)};
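A minimal sketch of this loss computation, assuming the per-pixel categorical cross entropy of claim 5 and PyTorch's built-in criterion:

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()      # categorical cross entropy over the class dimension

    # logits: (N, num_classes, H, W) network predictions; labels: (N, H, W) class indices
    def segmentation_loss(logits, labels):
        """Per-pixel cross entropy between prediction maps and label maps.
        (CrossEntropyLoss takes integer class indices, which is equivalent to
        the one-hot formulation described in the claim.)"""
        return criterion(logits, labels)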
2_6) repeatedly executing step 2_4) and step 2_5) M times to obtain the deep neural network classification training model, yielding M × N loss function values in total; finding the loss function value with the minimum value among the M × N loss function values; taking the weight vector and the bias term corresponding to the minimum loss function value as the optimal weight vector and optimal bias term of the deep neural network classification training model, denoted W_best and b_best respectively; the training of the street view image semantic segmentation deep neural network classification model is then finished;
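An illustrative training loop for this step is sketched below; the optimizer, learning rate, and checkpointing scheme are assumptions, and only the idea of keeping the weights with the smallest loss value comes from the claim:

    import copy
    import torch

    def train(model, loader, criterion, epochs_M):
        """Repeat the forward/loss steps M times over the N training images and
        keep the parameters (W_best, b_best) that produced the smallest loss."""
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        best_loss, best_state = float('inf'), None
        for epoch in range(epochs_M):
            for images, labels in loader:            # N images per pass -> M*N loss values
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()
                if loss.item() < best_loss:          # track the minimum loss value
                    best_loss = loss.item()
                    best_state = copy.deepcopy(model.state_dict())
        model.load_state_dict(best_state)            # restore W_best / b_best
        return model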
3) testing the model: inputting the test set into the trained model for testing;
3_1) let {I(i',j')} denote a road scene image to be semantically segmented, i.e. the test set, wherein 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {I(i',j')}, H' denotes its height, and I(i',j') denotes the pixel value of the pixel point at coordinate position (i',j');
3_2) inputting the channel components of {I(i',j')} into the trained street view image semantic segmentation deep neural network classification model and making a prediction using W_best and b_best, thereby obtaining the predicted semantic segmentation image corresponding to {I(i',j')}, denoted {P(i',j')}, wherein P(i',j') denotes the pixel value of the pixel point at coordinate position (i',j'); the deformable-convolution-enhanced image semantic segmentation is thus realized.
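A short inference sketch for this test step, assuming a PyTorch model such as the sketches above and a per-pixel argmax over the class scores:

    import torch

    @torch.no_grad()
    def predict(model, image):
        """image: (3, H', W') RGB tensor -> per-pixel class map of shape (H', W')."""
        model.eval()                                  # use the trained W_best / b_best
        logits = model(image.unsqueeze(0))            # (1, num_classes, H', W')
        return logits.argmax(dim=1).squeeze(0)        # predicted semantic segmentation image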
2. The method for semantic segmentation of street view image with deformable convolution fusion enhancement as claimed in claim 1, wherein in step 1), the object classes in the street view image are classified into 19 classes.
3. The method for semantic segmentation of street view images enhanced by deformable convolution according to claim 1, wherein the bilinear interpolation in step 2_2) is calculated as expressed by formula (4):

x(p) = Σ_q G(q, p) · x(q)    (4)

wherein p represents a (possibly fractional) position on the feature map, q enumerates all integral spatial positions in the feature map x, and G(·,·) is the bilinear interpolation kernel.
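A small sketch of this interpolation kernel follows; it is a direct transcription of formula (4), and the separable form of G used here is the standard bilinear kernel, stated as an assumption:

    import math
    import torch

    def bilinear_sample(x, p_y, p_x):
        """x: (H, W) feature map; (p_y, p_x): fractional position p.
        Computes x(p) = sum_q G(q, p) * x(q) with the separable kernel
        G(q, p) = g(q_y, p_y) * g(q_x, p_x), g(a, b) = max(0, 1 - |a - b|);
        only the four integer neighbours of p contribute."""
        H, W = x.shape
        y0, x0 = math.floor(p_y), math.floor(p_x)
        value = 0.0
        for qy in (y0, y0 + 1):
            for qx in (x0, x0 + 1):
                if 0 <= qy < H and 0 <= qx < W:
                    g = max(0.0, 1 - abs(qy - p_y)) * max(0.0, 1 - abs(qx - p_x))
                    value = value + g * x[qy, qx].item()
        return value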
4. The streetscape image semantic segmentation method based on deformable convolution fusion enhancement as claimed in claim 1, wherein the residual learning unit of the Xception Module extracts features through 3 depthwise-separable convolutions with 3×3 kernels; when the input feature and the output feature have different dimensions, the dimension of the input feature is adjusted through a 1×1 convolution and then added to the output feature of the residual learning unit to obtain the feature map; when the input feature and the output feature have the same dimension, the input feature is added directly to the output feature map of the residual learning unit to obtain the feature map.
5. The method as claimed in claim 1, wherein in step 2_5) the loss function value Loss between {P_n(i,j)} and {Y_n^one(i,j)} is obtained by using categorical cross entropy.
6. The deformable convolution fusion enhanced streetscape image semantic segmentation method as claimed in claim 1, wherein the deep neural network model is constructed by using the Python-based deep learning library PyTorch 1.2.
7. The method for street view image semantic segmentation enhanced by deformable convolution fusion as claimed in claim 1, wherein the Cityscapes test set is specifically adopted, and the mean intersection-over-union, pixel accuracy and mean pixel accuracy are adopted as indexes to verify the street view image segmentation effect of the deformable convolution fusion enhanced street view image semantic segmentation.
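A brief sketch of these three evaluation indexes computed from a class confusion matrix is given below; it is an illustrative implementation, and the ignore-label handling normally used on Cityscapes is omitted for simplicity:

    import numpy as np

    def segmentation_metrics(pred, label, num_classes=19):
        """pred, label: integer class maps of equal shape.
        Returns (MIoU, PA, MPA) from the class confusion matrix."""
        mask = (label >= 0) & (label < num_classes)
        cm = np.bincount(num_classes * label[mask] + pred[mask],
                         minlength=num_classes ** 2).reshape(num_classes, num_classes)
        tp = np.diag(cm)
        pa = tp.sum() / cm.sum()                                   # pixel accuracy
        with np.errstate(divide='ignore', invalid='ignore'):
            per_class_acc = tp / cm.sum(axis=1)                    # per-class pixel accuracy
            iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)      # IoU per class
        mpa = np.nanmean(per_class_acc)                            # mean pixel accuracy
        miou = np.nanmean(iou)                                     # mean intersection-over-union
        return miou, pa, mpa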
CN202011291950.6A 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method Active CN112396607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011291950.6A CN112396607B (en) 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011291950.6A CN112396607B (en) 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method

Publications (2)

Publication Number Publication Date
CN112396607A true CN112396607A (en) 2021-02-23
CN112396607B CN112396607B (en) 2023-06-16

Family

ID=74606378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011291950.6A Active CN112396607B (en) 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method

Country Status (1)

Country Link
CN (1) CN112396607B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190164290A1 (en) * 2016-08-25 2019-05-30 Intel Corporation Coupled multi-task fully convolutional networks using multi-scale contextual information and hierarchical hyper-features for semantic image segmentation
CN110795976A (en) * 2018-08-03 2020-02-14 华为技术有限公司 Method, device and equipment for training object detection model
CN110059768A (en) * 2019-04-30 2019-07-26 福州大学 The semantic segmentation method and system of the merging point and provincial characteristics that understand for streetscape
CN110826596A (en) * 2019-10-09 2020-02-21 天津大学 Semantic segmentation method based on multi-scale deformable convolution
CN111401436A (en) * 2020-03-13 2020-07-10 北京工商大学 Streetscape image segmentation method fusing network and two-channel attention mechanism
CN111563909A (en) * 2020-05-10 2020-08-21 中国人民解放军91550部队 Semantic segmentation method for complex street view image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SONG Jianhui; CHENG Siyu; LIU Yanju; YU Yang: "Semantic segmentation of UAV ground-object scenes based on deep convolutional networks", Journal of Shenyang Ligong University, no. 06 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313105A (en) * 2021-04-12 2021-08-27 厦门大学 Method for identifying areas of office swivel chair wood board sprayed with glue and pasted with cotton
CN113313105B (en) * 2021-04-12 2022-07-01 厦门大学 Method for identifying areas of office swivel chair wood board sprayed with glue and pasted with cotton
CN113205520A (en) * 2021-04-22 2021-08-03 华中科技大学 Method and system for semantic segmentation of image
CN113205520B (en) * 2021-04-22 2022-08-05 华中科技大学 Method and system for semantic segmentation of image
CN113420770A (en) * 2021-06-21 2021-09-21 梅卡曼德(北京)机器人科技有限公司 Image data processing method, image data processing device, electronic equipment and storage medium
CN113326799A (en) * 2021-06-22 2021-08-31 长光卫星技术有限公司 Remote sensing image road extraction method based on EfficientNet network and direction learning
CN113657388A (en) * 2021-07-09 2021-11-16 北京科技大学 Image semantic segmentation method fusing image super-resolution reconstruction
CN113657388B (en) * 2021-07-09 2023-10-31 北京科技大学 Image semantic segmentation method for super-resolution reconstruction of fused image
CN113554733A (en) * 2021-07-28 2021-10-26 北京大学 Language-based decoupling condition injection gray level image colorization method
CN113807356B (en) * 2021-07-29 2023-07-25 北京工商大学 End-to-end low-visibility image semantic segmentation method
CN113807356A (en) * 2021-07-29 2021-12-17 北京工商大学 End-to-end low visibility image semantic segmentation method
CN113608223A (en) * 2021-08-13 2021-11-05 国家气象信息中心(中国气象局气象数据中心) Single-station Doppler weather radar strong precipitation estimation method based on double-branch double-stage depth model
CN113608223B (en) * 2021-08-13 2024-01-05 国家气象信息中心(中国气象局气象数据中心) Single-station Doppler weather radar strong precipitation estimation method based on double-branch double-stage depth model
CN113762263A (en) * 2021-08-17 2021-12-07 慧影医疗科技(北京)有限公司 Semantic segmentation method and system for small-scale similar structure
CN115294488B (en) * 2022-10-10 2023-01-24 江西财经大学 AR rapid object matching display method
CN115294488A (en) * 2022-10-10 2022-11-04 江西财经大学 AR rapid object matching display method
CN115393725A (en) * 2022-10-26 2022-11-25 西南科技大学 Bridge crack identification method based on feature enhancement and semantic segmentation
CN115393725B (en) * 2022-10-26 2023-03-07 西南科技大学 Bridge crack identification method based on feature enhancement and semantic segmentation
CN115620001A (en) * 2022-12-15 2023-01-17 长春理工大学 Visual auxiliary system based on 3D point cloud bilateral amplification algorithm
CN115620001B (en) * 2022-12-15 2023-04-07 长春理工大学 Visual auxiliary system based on 3D point cloud bilateral amplification algorithm

Also Published As

Publication number Publication date
CN112396607B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN111339903B (en) Multi-person human body posture estimation method
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN110728192B (en) High-resolution remote sensing image classification method based on novel characteristic pyramid depth network
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN113469094A (en) Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN111401436B (en) Streetscape image segmentation method fusing network and two-channel attention mechanism
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN112258526A (en) CT (computed tomography) kidney region cascade segmentation method based on dual attention mechanism
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN112131959A (en) 2D human body posture estimation method based on multi-scale feature reinforcement
CN111768415A (en) Image instance segmentation method without quantization pooling
CN112651423A (en) Intelligent vision system
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN115359372A (en) Unmanned aerial vehicle video moving object detection method based on optical flow network
CN113076947A (en) RGB-T image significance detection system with cross-guide fusion
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
CN114092824A (en) Remote sensing image road segmentation method combining intensive attention and parallel up-sampling
CN116563682A (en) Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant