CN112396607B - Deformable convolution fusion enhanced street view image semantic segmentation method - Google Patents

Deformable convolution fusion enhanced street view image semantic segmentation method

Info

Publication number
CN112396607B
CN112396607B (application CN202011291950.6A)
Authority
CN
China
Prior art keywords
feature
image
semantic segmentation
street view
view image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011291950.6A
Other languages
Chinese (zh)
Other versions
CN112396607A (en)
Inventor
张珣
秦晓海
刘宪圣
张浩轩
江东
张迎春
付晶莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Geographic Sciences and Natural Resources of CAS
Beijing Technology and Business University
Original Assignee
Institute of Geographic Sciences and Natural Resources of CAS
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Geographic Sciences and Natural Resources of CAS, Beijing Technology and Business University filed Critical Institute of Geographic Sciences and Natural Resources of CAS
Priority to CN202011291950.6A priority Critical patent/CN112396607B/en
Publication of CN112396607A publication Critical patent/CN112396607A/en
Application granted granted Critical
Publication of CN112396607B publication Critical patent/CN112396607B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration by the use of local operators
    • G06T5/30Erosion or dilatation, e.g. thinning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a deformable convolution fusion enhanced street view image semantic segmentation method comprising a training stage and a testing stage. A street view image semantic segmentation deep neural network model is constructed so that the network obtains more feature information about small targets while segmenting the large target objects of a street view image. This alleviates the problems of small-scale target loss and discontinuous segmentation during street view image semantic segmentation, improves the image segmentation effect, gives the model better overall robustness, and yields higher street view image processing accuracy.

Description

Deformable convolution fusion enhanced street view image semantic segmentation method
Technical Field
The invention belongs to the technical field of computer vision, relates to an image processing technology, and in particular relates to a deformable convolution fusion enhanced street view image semantic segmentation method.
Background
Image semantic segmentation is an important branch of computer vision in the field of artificial intelligence, and an important link in image understanding and analysis in machine vision. Image semantic segmentation is the task of accurately classifying each pixel in an image into its category so that the result is consistent with the visual content of the image itself; for this reason the task of image semantic segmentation is also known as pixel-level image classification. At present, semantic segmentation has been widely applied in scenes such as automatic driving and unmanned aerial vehicle landing point judgment.
Convolutional neural networks have been successful in image classification, localization and scene understanding. With the proliferation of tasks such as augmented reality and automatic driving of vehicles, many researchers have turned their attention to scene understanding, where one of the main steps is semantic segmentation, i.e. classifying each pixel in a given image. Semantic segmentation is of significance in mobile and robot-related applications.
Unlike image classification, image semantic segmentation is more difficult because it requires not only global context information but also fine local information to determine the class of each pixel. A backbone is therefore often used to extract the more global features, and feature resolution reconstruction is then performed in combination with shallow features from the backbone to restore the original image size. The resolution of the feature maps thus first decreases and then increases; the former part is generally referred to as the encoding network and the latter as the decoding network. Classical semantic segmentation methods include fully convolutional networks (Fully Convolutional Network, FCN) and the DeepLab series of networks, which perform well in terms of pixel accuracy, mean pixel accuracy and intersection-over-union ratio on road scene segmentation databases. Conventional networks downsample, and upsampling here means recovering a small-sized high-dimensional feature map for pixel prediction so as to obtain classification information for each point. Although FCN performs upsampling, it cannot recover all the lost information without loss; the DeepLab series adds atrous (dilated) convolution on this basis to expand the receptive field, which mitigates the information loss, but the problem is still not well controlled. Therefore, the loss of information in these methods affects the accuracy of image semantic segmentation, and the segmentation effect is especially poor for the recognition of small target objects.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a deformable convolution fusion enhanced street view image semantic segmentation method, which constructs a street view image semantic segmentation deep neural network model so that the network model can obtain more small-target feature information when segmenting large target objects in a street view image. This alleviates the problems of small-scale target loss and discontinuous segmentation in street view image semantic segmentation, gives the model better overall robustness and higher street view image processing accuracy, and improves the image segmentation effect.
The technical scheme adopted for solving the technical problems is as follows:
a deformable convolution fusion enhanced street view image semantic segmentation method is characterized by comprising two processes of a training stage and a testing stage, and comprises the following steps:
1) Constructing an image training set: image data comprising original images and the corresponding semantic label images are input into the constructed network to participate in training.
1_1) Select N pieces of original street view image data and the corresponding semantic segmentation label gray-scale maps to form the training set. The n-th original street view image in the training set is denoted {J_n(i,j)}, and the semantic segmentation label image corresponding to {J_n(i,j)} in the training set is denoted {Ĵ_n(i,j)}. The original street view image is an RGB color image and the corresponding label image is a gray-scale image; N is a positive integer with N ≥ 500; n is a positive integer with 1 ≤ n ≤ N; (i,j) is the coordinate position of a pixel in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {J_n(i,j)} and H denotes its height; J_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {J_n(i,j)}, and Ĵ_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {Ĵ_n(i,j)}. Meanwhile, in order to evaluate the designed model well, the real segmentation label image corresponding to each original street view image and its semantic label image in the training set is taken as the training target and denoted {G_n(i,j)}. Then, using the one-hot encoding technique, the semantic segmentation label gray-scale map corresponding to each original street view image in the training set is processed into a one-hot encoded image. In the specific implementation, the street view image objects are divided into 19 categories, and the set formed by the one-hot processed real semantic segmentation label images {G_n(i,j)} corresponding to the original street view images is denoted T.
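As an illustration of the one-hot encoding step above, the following sketch (written in PyTorch, the library named later in the embodiment; the helper name and the assumption that the label gray-scale map stores a class index from 0 to 18 per pixel are ours, not the patent's) converts a label map into a 19-channel one-hot tensor.

```python
import torch
import torch.nn.functional as F

NUM_CLASSES = 19  # the 19 street view object categories used in this method

def one_hot_encode(label_map: torch.Tensor) -> torch.Tensor:
    """Convert an (H, W) map of class indices into a (NUM_CLASSES, H, W) one-hot tensor."""
    one_hot = F.one_hot(label_map.long(), num_classes=NUM_CLASSES)  # (H, W, NUM_CLASSES)
    return one_hot.permute(2, 0, 1).float()                         # channel-first layout

# Minimal usage example with a dummy 4x4 label map
dummy_label = torch.randint(0, NUM_CLASSES, (4, 4))
print(one_hot_encode(dummy_label).shape)  # torch.Size([19, 4, 4])
```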
2) Constructing and training the street view image semantic segmentation deep neural network model: the street view image semantic segmentation deep neural network comprises an input layer, a hidden layer and an output layer. The hidden layer comprises two parts and is essentially a multi-resolution fusion enhancement network that introduces deformable convolution; the first stage is the first part, and the second and third stages form the second part. In the first stage, a deformable convolution is connected in series with an Xception Module to form a fusion sub-network A, and sub-network A is repeated three times in series so that more deep semantic feature information can be obtained. The Xception Module is the basic residual module of the semantic segmentation network DeepLabV3+. The second stage is a dual-branch parallel network in which each branch is a sub-network composed of three Xception Modules in series, used for feature extraction and feature fusion. The third stage is a three-branch parallel network in which each branch is a sub-network composed of three Xception Modules in series; its function is the same as that of the second stage. Specifically, the first stage consists of three repeated deformable convolution modules each connected in series with an Xception Module, where the convolution kernel size is 3×3; the Xception Module in the invention is composed of 3 depthwise separable 3×3 convolution layers with stride 1 and padding 1. The method comprises the following steps:
2_1) an input layer of the street view image semantic segmentation deep convolutional neural network model is used for receiving R, G, B three-channel components of an original input image and outputting the R, G, B three-channel components to a hidden layer;
for the input layer, the input end of the input layer receives R, G, B three-channel components of an original input image with the width W and the height H, and the output end of the input layer outputs R, G, B three-channel components of the original input image to the hidden layer;
2_2) The first stage of the hidden layer comprises a fusion enhancement Module constructed from a deformable convolution Module connected in series with an Xception Module; three such fusion enhancement modules are repeatedly stacked, and a plurality of feature maps are generated in turn through the three fusion enhancement modules;
The first stage of the hidden layer is composed of three fusion enhancement modules, and each fusion enhancement Module mainly consists of a deformable convolution Module connected in series with a lightweight Xception Module. Using a deformable convolution unit, the spatially sampled position information in the module undergoes a further displacement adjustment; this displacement can be learned in the target task and requires no additional supervisory signal. The offsets added in the deformable convolution unit are part of the structure of the street view image semantic segmentation deep convolutional neural network model. In the input feature map, the original operation obtains a partial feature region through a sliding window; after the deformable convolution is introduced, the original convolution network is divided into two paths that share the feature map. The upper path learns the offsets using a parallel standard convolution, while gradient back propagation still proceeds normally so that end-to-end learning is possible, which guarantees the successful integration of the deformable convolution. After the offsets are learned, the size and position of the deformable convolution kernel can be dynamically adjusted according to the image content currently to be identified; the visual effect is that the sampling point positions of the convolution kernels at different locations change adaptively with the image content, thereby adapting to geometric deformations such as the shape and size of different objects in the image. Concretely: the displacement required by the deformable convolution is obtained through the output of a parallel standard convolution and is then applied to the convolution kernel to achieve the deformable convolution effect. Since image pixels are integral, an offset operation must be applied to the pixels; the generated offsets are floating-point values that have to be converted to an integer type, but directly rounding the offsets would prevent back propagation, so the corresponding pixels are obtained by bilinear interpolation.
Convolution comprises two steps: 1) sampling on the input feature map x using a regular grid R; 2) summing the sampled values weighted by w. The grid R defines the size and dilation of the receptive field, as in formula (1):

$$R=\{(-1,-1),(-1,0),\ldots,(0,1),(1,1)\} \qquad (1)$$

which defines a 3 × 3 convolution kernel with dilation 1.
The definition of convolution is: for each pixel position p_0 in the output, the general convolution is calculated as in equation (2):

$$y(p_0)=\sum_{p_k\in R} w(p_k)\cdot x(p_0+p_k) \qquad (2)$$

where y(p_0) is the feature map value corresponding to position p_0; p_0 is each position on the output feature map y; x is the input feature map; and R represents the grid of the receptive field, exemplified here by 3 × 3. For each position p_0 on the output feature map y, p_k enumerates the positions in R. In the deformable convolution, the regular grid R is augmented with offsets {Δp_k | k = 1, ..., K}, where K = |R|. Equation (2) then becomes equation (3):

$$y(p_1)=\sum_{p_k\in R} w(p_k)\cdot x(p_1+p_k+\Delta p_k) \qquad (3)$$

where p_1 is each position on the output feature map y after the deformable convolution is incorporated, and y(p_1) is the deformed feature map value corresponding to position p_1; Δp_k are the offsets, giving 2K offset values in the x and y directions.

The original image pixel values are denoted V. The original convolution process is divided into two paths: the upper path learns the offsets in the x and y directions, with an output of size H × W × 2K, where K = |R| is the number of sampling points in the grid and 2K accounts for offsets in both the x and y directions; the image pixel values at this point are denoted U. With these offsets, the pixel-value indices of the image in U are added to those in V; for each convolution window, the window after translation is no longer the original regular sliding window, while the calculation process remains consistent with ordinary convolution. The sampled positions thus become irregular; since the offset Δp_k is usually fractional, the corresponding value is calculated by bilinear interpolation as in equation (4):

$$x(p)=\sum_{q} G(q,p)\cdot x(q) \qquad (4)$$

where p denotes an arbitrary position on the feature map (p = p_1 + p_k + Δp_k for equation (3)), q enumerates all integral spatial positions in the feature map x, and G(·,·) is the bilinear interpolation kernel.
After all pixel positions are obtained, a new image M is obtained, and M is input as input data to the Xception Module.
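A minimal sketch of the two-path deformable convolution described above, assuming torchvision's deform_conv2d operator is available; the class and parameter names are illustrative, not the patent's. A parallel standard convolution (the upper path) predicts the 2K offsets, which are then applied to the sampling grid of the main 3×3 convolution, with bilinear interpolation at fractional positions handled inside the operator.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableConv2d(nn.Module):
    """Deformable 3x3 convolution: a parallel standard convolution learns the offsets."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # Upper path: standard conv predicting 2K offsets (K = kernel_size**2 sampling points)
        self.offset_conv = nn.Conv2d(in_ch, 2 * kernel_size * kernel_size,
                                     kernel_size=kernel_size, padding=padding)
        nn.init.zeros_(self.offset_conv.weight)  # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        # Main path: weights of the deformable convolution itself
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        nn.init.kaiming_uniform_(self.weight, a=1)
        self.padding = padding

    def forward(self, x):
        offset = self.offset_conv(x)  # (N, 2K, H, W): offsets in the x and y directions
        # Samples x at p0 + pk + delta_pk, with bilinear interpolation for fractional offsets
        return deform_conv2d(x, offset, self.weight, self.bias, padding=self.padding)

# Usage: DeformableConv2d(3, 32)(torch.randn(1, 3, 64, 64)).shape -> (1, 32, 64, 64)
```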
The Xception Module is taken as the basic residual structure of DeepLabV3+; its residual learning unit extracts features with three 3 × 3 depthwise separable convolutions, performing the convolution calculation channel by channel and then point by point. Compared with conventional convolution, the parameter count and computational cost are lower, which is the main reason for introducing the Xception Module. When the input feature and the output feature differ in dimension (number of channels of the feature map), the dimension of the input feature is first adjusted by a 1 × 1 convolution and then added to and fused with the output feature of the residual learning unit, giving the final feature map. When the input feature and the output feature have the same dimension, the final extracted feature is obtained by directly adding (fusing) the input feature and the output feature map of the residual learning unit. This combines the idea of depthwise separable convolution with the basic residual Bottleneck structure (a 1×1 convolution, a 3×3 convolution and another 1×1 convolution, where the 1×1 convolutions adjust the feature dimension and the 3×3 convolution extracts features): the depthwise separable convolution splits a standard convolution into a channel-wise convolution and a spatial convolution to reduce the parameters of model training, and the residual structure is used to eliminate the gradient explosion problem caused by deepening the network hierarchy.
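The following is a minimal sketch of such a lightweight Xception-style residual unit under the description above: three 3×3 depthwise separable convolutions (channel-by-channel, then point-by-point), with a 1×1 convolution on the shortcut only when the input and output channel counts differ. The class names and the use of batch normalization are our assumptions.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise (channel-by-channel) 3x3 conv followed by a pointwise 1x1 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class XceptionModule(nn.Module):
    """Residual unit built from 3 depthwise separable 3x3 convolutions (stride 1, padding 1)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            SeparableConv2d(in_ch, out_ch),
            SeparableConv2d(out_ch, out_ch),
            SeparableConv2d(out_ch, out_ch),
        )
        # A 1x1 convolution adjusts the shortcut only when the feature dimensions differ
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.shortcut(x)  # residual fusion of input and output features

# Usage: XceptionModule(32, 64)(torch.randn(1, 32, 128, 256)).shape -> (1, 64, 128, 256)
```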
For the first stage, the input end of the 1st fusion enhancement Module (deformable convolution in series with an Xception Module) receives the R channel component, G channel component and B channel component of the original input image output by the output end of the input layer, and its output end outputs the generated feature maps; the set formed by the output feature maps is denoted R_1. The input end of the 2nd fusion enhancement Module receives R_1 output by the 1st fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R_2. The input end of the 3rd fusion enhancement Module receives R_2 output by the 2nd fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R_3. Each feature map in R_3 has width W, height H and channel number C, so R_3 can be written as (H, W, C). The first stage contains a feature extraction branch: after the 3 repeated fusion enhancement modules of deformable convolution in series with an Xception Module, downsampling operations with stride 1 and stride 2 are performed respectively, obtaining two new sets of feature maps denoted R_4 and R_5, where each map in R_4 is (H, W, C) and each map in R_5 is (H/2, W/2, 2C).
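Reusing the two illustrative modules sketched above, the first stage can be read as follows: three repeated fusion enhancement modules (deformable convolution followed by an Xception Module), after which a stride-1 branch and a stride-2 branch produce the R_4 (H, W, C) and R_5 (H/2, W/2, 2C) feature sets. Sketching the downsampling as strided convolutions is an assumption; the patent only specifies the strides.

```python
import torch.nn as nn

class FusionEnhanceModule(nn.Module):
    """Fusion enhancement module: a deformable convolution in series with an Xception Module."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deform = DeformableConv2d(in_ch, out_ch)   # from the sketch above
        self.xception = XceptionModule(out_ch, out_ch)  # from the sketch above

    def forward(self, x):
        return self.xception(self.deform(x))

class StageOne(nn.Module):
    """Three stacked fusion enhancement modules, then dual-resolution branching."""
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        self.blocks = nn.Sequential(
            FusionEnhanceModule(in_ch, width),
            FusionEnhanceModule(width, width),
            FusionEnhanceModule(width, width),
        )
        self.keep_res = nn.Conv2d(width, width, 3, stride=1, padding=1)      # -> R4: (H, W, C)
        self.down_res = nn.Conv2d(width, 2 * width, 3, stride=2, padding=1)  # -> R5: (H/2, W/2, 2C)

    def forward(self, x):
        r3 = self.blocks(x)                        # the R3 feature set
        return self.keep_res(r3), self.down_res(r3)
```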
2_3) The second and third stages constitute the second part of the hidden layer. High-resolution features are kept throughout the network of this second part, and information is continuously exchanged among the multi-resolution features, so that high spatial resolution is maintained while good semantic expression capability is obtained. The specific steps are as follows:
After the first stage, the second stage generates two parallel networks S_1 and S_2. S_1 consists of 3 lightweight Xception Modules in series; the width and height of the input and output feature layers of each Xception Module are consistent. The input end of S_1 receives all feature maps of R_4, and the output end of S_1 outputs the generated feature maps, whose set is denoted R_6, where each map in R_6 is (H, W, C). S_2 is likewise formed by 3 lightweight Xception Modules in series, with the width and height of the input and output feature layers of each Xception Module kept consistent; the input end of S_2 receives all feature maps of R_5, and its output end outputs the generated feature maps, whose set is denoted R_7, where each map in R_7 is (H/2, W/2, 2C). The two parallel networks S_1 and S_2 of the second stage then perform downsampling operations with stride 1 and stride 2 respectively, yielding five new sets of feature maps denoted R_8, R_9, R_10, R_11 and R_12, where each map in R_8 is (H, W, C), each map in R_9 is (H/2, W/2, 2C), each map in R_10 is (H/2, W/2, 2C), each map in R_11 is (H/4, W/4, 4C), and each map in R_12 is (H/4, W/4, 4C).
After the second stage, the third stage generates three parallel networks S_3, S_4 and S_5. S_3 consists of 3 lightweight Xception Modules in series, and the width and height of the input and output feature layers of each Xception Module are consistent. R_7 is partially upsampled to obtain a new set of feature maps denoted R_13, where each map in R_13 is (H, W, C). At the same time, the information fusion layer fuses R_8 and R_13 at the feature information level, and the set of feature maps generated after fusion is denoted R_14, where each map in R_14 is (H, W, C). The input end of S_3 receives all feature maps of R_14, and the output end of S_3 outputs the generated feature maps, whose set is denoted R_15, where each map in R_15 is (H, W, C). S_4 consists of 3 lightweight Xception Modules in series, with consistent width and height between the input and output feature layers of each Xception Module; the information fusion layer fuses the feature information of R_9 and R_10, and the set of fused feature maps is denoted R_16, where each map in R_16 is (H/2, W/2, 2C). The input end of S_4 receives all feature maps of R_16, and the output end of S_4 outputs the generated feature maps, whose set is denoted R_17, where each map in R_17 is (H/2, W/2, 2C). S_5 consists of 3 lightweight Xception Modules in series, with consistent width and height between the input and output feature layers of each Xception Module; the information fusion layer fuses R_11 and R_12 at the feature information level, and the set of fused feature maps is denoted R_18, where each map in R_18 is (H/4, W/4, 4C). The input end of S_5 receives all feature maps of R_18, and the output end of S_5 outputs the generated feature maps, whose set is denoted R_19, where each map in R_19 is (H/4, W/4, 4C). At the end of the third stage, the feature maps R_17 and R_19 generated by sub-networks S_4 and S_5 are upsampled to produce feature maps of the same size scale as R_15 generated by the S_3 sub-network; the resulting sets of feature maps are denoted R_20 and R_21 respectively. R_15, R_20 and R_21 are then input into the feature fusion layer for feature information fusion, and the set formed by the newly generated feature maps is denoted R_22, where each map in R_22 is (H, W, C).
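The recurring pattern in the second and third stages — upsample the lower-resolution branch and fuse it with the higher-resolution branch at the feature information level — can be sketched as below. Showing the fusion as element-wise addition after a 1×1 channel adjustment is an assumption; the patent states only that the feature information layers are fused.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleFuse(nn.Module):
    """Fuse a low-resolution feature set (e.g. R7) into a high-resolution one (e.g. R8)."""
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.align = nn.Conv2d(low_ch, high_ch, kernel_size=1)  # match the channel count

    def forward(self, high, low):
        low_up = F.interpolate(low, size=high.shape[2:], mode='bilinear', align_corners=False)
        return high + self.align(low_up)  # feature-information-level fusion

# e.g. R14 = fuse(R8, R7), which is then fed to the S3 sub-network
high = torch.randn(1, 32, 128, 256)  # stands in for R8: (H, W, C)
low = torch.randn(1, 64, 64, 128)    # stands in for R7: (H/2, W/2, 2C)
print(UpsampleFuse(64, 32)(high, low).shape)  # torch.Size([1, 32, 128, 256])
```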
For the output layer, which consists of 1 convolution layer, the input end of the output layer receives all the feature maps in the set R_22 output by the hidden layer, and the output end of the output layer outputs the semantic segmentation prediction map corresponding to the original input image; the width of each semantic segmentation prediction map is W, the height is H, and the number of channels is C.
2_4) The original street view images {J_n(i,j)} in the training set and the corresponding semantic label images (semantic segmentation label gray-scale maps) are used as original input images and are input into the constructed street view image semantic segmentation deep neural network model for training, obtaining the semantic segmentation prediction map corresponding to each original street view image in the training set; the set of semantic segmentation prediction maps corresponding to each original street view image {J_n(i,j)} is denoted P_n.
2_5) The loss function value between the set P_n of semantic segmentation prediction maps corresponding to each original street view image in the training set and the one-hot encoded image set T_n obtained from the corresponding real semantic segmentation image is calculated; the loss function value between P_n and T_n is denoted Loss_n. In the specific implementation, the categorical cross entropy is adopted to obtain the loss function value Loss_n between P_n and T_n.
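As a sketch of step 2_5), assuming the network outputs per-pixel scores over the 19 classes and the ground truth is available as class indices (equivalently, the argmax of the one-hot labels), the categorical cross entropy can be computed with PyTorch's built-in loss; the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # categorical cross entropy over the 19 classes

# prediction: (batch, 19, H, W) raw class scores from the output layer
prediction = torch.randn(2, 19, 128, 256, requires_grad=True)
# target: (batch, H, W) class indices 0..18 per pixel
target = torch.randint(0, 19, (2, 128, 256))

loss = criterion(prediction, target)
loss.backward()     # gradients used to update the weight vectors and bias terms
print(loss.item())  # one of the M x N loss function values tracked during training
```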
2_6) Steps 2_4) and 2_5) are repeatedly executed M times to obtain the deep neural network classification training model, producing M × N loss function values; the smallest loss function value is then found among the M × N values; the weight vector and bias term corresponding to this smallest loss function value are taken as the optimal weight vector and optimal bias term of the deep neural network classification training model, correspondingly denoted W_best and b_best; the training of the street view image semantic segmentation deep neural network classification model is thus completed.
3) Test the model: the test set is input into the trained model for testing.
3_1) Let {I(i′,j′)} represent a road scene image to be semantically segmented, i.e. an image of the test set; here 1 ≤ i′ ≤ W′ and 1 ≤ j′ ≤ H′, where W′ represents the width of {I(i′,j′)}, H′ represents the height of {I(i′,j′)}, and I(i′,j′) represents the pixel value of the pixel at coordinate position (i′,j′) in {I(i′,j′)};
3_2) The R channel component, G channel component and B channel component of {I(i′,j′)} are input into the trained street view image semantic segmentation deep neural network classification model, and prediction is performed using W_best and b_best to obtain the predicted semantic segmentation image corresponding to {I(i′,j′)}, denoted {I_pred(i′,j′)}, where I_pred(i′,j′) represents the pixel value of the pixel at coordinate position (i′,j′) in {I_pred(i′,j′)}.
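A sketch of the prediction in step 3_2): the trained model (holding W_best and b_best) is run on the test image, and the class with the highest score at each pixel becomes the predicted semantic segmentation image. The function and variable names are illustrative.

```python
import torch

@torch.no_grad()
def predict(model, image_rgb):
    """image_rgb: (3, H', W') float tensor holding the R, G and B channel components."""
    model.eval()
    scores = model(image_rgb.unsqueeze(0))   # (1, 19, H', W') per-pixel class scores
    return scores.argmax(dim=1).squeeze(0)   # (H', W') predicted class index per pixel

# Usage (illustrative): seg = predict(trained_model, test_image); seg[i, j] is the class at (i', j')
```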
Through the steps, the deformable convolution fusion enhanced image semantic segmentation is realized.
Compared with the prior art, the invention has the beneficial effects that:
1) A lightweight Xception Module is introduced to replace the Bottleneck module in conventional network models. The Xception Module borrows the design idea of the Bottleneck: the network model is continuously deepened through the residual learning unit to extract rich semantic features, while the standard convolutions in the Bottleneck are replaced by depthwise separable convolutions, so the parameters of the model can be reduced and the computational cost lowered while accuracy is maintained. Meanwhile, the multi-scale fusion of the network works better: after feature extraction and fusion by these modules, the interaction of high and low resolutions produces better outputs.
2) The deep neural network constructed by the method adopts a high-resolution fusion parallel network to reduce the feature information lost by the feature maps across the whole network; by keeping high-resolution feature map information unchanged throughout the process and fusing it with low-resolution feature map information, effective deep information is retained to a large extent.
3) The deep neural network constructed by the method disclosed by the invention has the advantages that the deformable convolution is integrated in the first stage of the hidden layer, so that the network model has better deformation modeling capability while maintaining high-resolution characteristics in the characteristic extraction process, the problems of small-scale target loss and discontinuous segmentation during semantic segmentation are solved, and the overall robustness of the model is better.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Fig. 2 is a block diagram of a composition structure of a street view image semantic segmentation neural network model constructed by the method of the invention.
Fig. 3 is a schematic diagram of a framework of a street view image semantic segmentation neural network model according to the method of the present invention.
FIG. 4 shows a street view image to be semantically segmented, a corresponding real semantic segmentation image, and a predicted semantic segmentation image obtained by prediction, which are adopted in the embodiment of the present invention;
wherein, (a) is a street view image to be semantically segmented; (b) The real semantic segmentation image corresponding to the street view image to be semantically segmented is shown in the step (a); (c) The method is used for predicting the street view image to be semantically segmented shown in the step (a) to obtain the predicted semantic segmentation image.
Detailed Description
The invention is further described below by way of examples with reference to the accompanying drawings, which in no way limit the scope of the invention.
The invention provides a deformable convolution fusion enhanced street view image semantic segmentation method, which constructs a street view image semantic segmentation deep neural network model so that the network model can obtain more small-target feature information when segmenting large target objects in a street view image, thereby alleviating the problems of small-scale target loss and discontinuous segmentation during street view image semantic segmentation, giving the model better overall robustness and higher street view image processing accuracy, and improving the image segmentation effect.
The general implementation block diagram of the deformable convolution fusion enhanced street view image semantic segmentation method provided by the invention is shown in fig. 1, and comprises two processes of a training stage and a testing stage.
Fig. 2 is a block diagram of a composition structure of a street view image semantic segmentation neural network model constructed by the method of the invention. Fig. 3 is a schematic diagram of a framework of a street view image semantic segmentation neural network model according to the method of the present invention. The method for realizing the deformable convolution fusion enhanced street view image semantic segmentation mainly comprises the following steps:
1) Firstly, inputting an original image into a first deformable convolution layer in a first stage of a network, and extracting features (high-resolution feature images);
2) Inputting the output initial features into the first Xception Module to obtain a deeper feature map;
3) Repeating 1) and 2) 3 times, namely immediately following each deformable convolution Module with an Xception Module, extracting deep features multiple times while enlarging the receptive field;
4) Respectively performing downsampling operations with strides of 1 and 2, where one branch keeps the high resolution and the other runs in parallel at a lower resolution, and inputting the results into the 3 repeated Xception Modules of the different branches in the second stage;
5) After feature information fusion through the feature fusion layer, respectively inputting the results into the 3 repeated Xception Modules of the different branches in the third stage, performing downsampling operations with strides of 1 and 2, then performing upsampling and feature fusion, and outputting a high-resolution feature map;
6) Finally, after one convolution, adjusting the number of channels of the output features to the number of classes to be segmented; the predicted segmentation image is obtained after applying the classifier activation function.
In specific implementation, the specific steps of the training phase process of the street view image semantic segmentation neural network model of the method are as follows:
1, Constructing an image training set: select N pieces of original street view image data and the corresponding semantic segmentation label gray-scale maps to form the training set. The n-th original street view image in the training set is denoted {J_n(i,j)}, and the semantic segmentation label image corresponding to {J_n(i,j)} in the training set is denoted {Ĵ_n(i,j)}. The original street view image is an RGB color image and the corresponding label image is a gray-scale image; N is a positive integer with N ≥ 500, for example 1000; n is a positive integer with 1 ≤ n ≤ N; (i,j) is the coordinate position of a pixel in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {J_n(i,j)} and H denotes its height, for example W = 1024 and H = 512; J_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {J_n(i,j)}, and Ĵ_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {Ĵ_n(i,j)}. Meanwhile, in order to evaluate the designed model well, the real segmentation label image corresponding to each original street view image and its semantic segmentation label gray-scale map in the training set is taken as the training target and denoted {G_n(i,j)}.
Here, the original street view images are 2975 images selected directly from the training split of the Cityscapes public urban street scene dataset.
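A sketch of loading these training images, assuming torchvision's Cityscapes dataset wrapper and a local copy of the dataset under ./cityscapes; the path and transform are illustrative.

```python
from torchvision import datasets, transforms

# 'fine' annotations, 'train' split: the 2975 finely annotated training images
train_set = datasets.Cityscapes(
    root='./cityscapes',
    split='train',
    mode='fine',
    target_type='semantic',           # gray-scale semantic label maps
    transform=transforms.ToTensor(),  # RGB image -> (3, H, W) float tensor
)

image, label = train_set[0]
print(len(train_set), image.shape)    # 2975 images in the training split
```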
2, Constructing the deep neural network: the deep neural network comprises an input layer, a hidden layer and an output layer; the hidden layer is composed of two parts: three repeated deformable convolution modules connected in series with Xception Modules form the high-resolution entry of the network, followed by a two-stage multi-branch parallel fused Xception Module network.
2_1 For the input layer, the input end of the input layer receives the R, G, B three-channel components of an original input image, and the output end of the input layer outputs the R channel component, G channel component and B channel component of the original input image to the hidden layer; the width of the original input image received at the input end of the input layer is W and the height is H;
2_2, In the first part of the hidden layer, a fusion Module is constructed from a deformable convolution Module connected in series with an Xception Module; three such fusion modules are repeatedly stacked, and a plurality of feature maps are generated in turn through the three fusion enhancement modules;
The first stage of the hidden layer is composed of three fusion enhancement modules, and each fusion enhancement Module mainly consists of a deformable convolution Module connected in series with a lightweight Xception Module. After all pixel positions are obtained through the first deformable convolution, a new picture M is obtained and input into the Xception Module as input data.
The first stage is the first part. For the first stage, the input end of the 1st fusion enhancement Module (deformable convolution in series with an Xception Module) receives the R channel component, G channel component and B channel component of the original input image output by the output end of the input layer, and its output end outputs the generated feature maps; the set formed by the output feature maps is denoted R_1. The input end of the 2nd fusion enhancement Module receives R_1 output by the 1st fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R_2. The input end of the 3rd fusion enhancement Module receives R_2 output by the 2nd fusion enhancement Module, and its output end outputs the generated feature maps, whose set is denoted R_3. Each feature map in R_3 has width W, height H and channel number C, so R_3 can be written as (H, W, C). The first stage contains a feature extraction branch: after the 3 repeated fusion enhancement modules of deformable convolution in series with an Xception Module, downsampling operations with stride 1 and stride 2 are performed respectively, obtaining two new sets of feature maps denoted R_4 and R_5, where each map in R_4 is (H, W, C) and each map in R_5 is (H/2, W/2, 2C).
2_3 The second and third stages constitute the second part of the hidden layer. High-resolution features are kept throughout the network of this second part, and information is continuously exchanged among the multi-resolution features, so that high spatial resolution is maintained while good semantic expression capability is obtained. The specific steps are as follows:
After the first stage, the second stage generates two parallel networks S_1 and S_2. S_1 consists of 3 lightweight Xception Modules in series; in the invention, each Xception Module consists of 3 depthwise separable 3 × 3 convolution layers with stride 1 and padding 1. The width and height of the input and output feature layers of each Xception Module are consistent. The input end of S_1 receives all feature maps of R_4, and the output end of S_1 outputs the generated feature maps, whose set is denoted R_6, where each map in R_6 is (H, W, C). S_2 is likewise formed by 3 lightweight Xception Modules in series, with the width and height of the input and output feature layers of each Xception Module kept consistent; the input end of S_2 receives all feature maps of R_5, and its output end outputs the generated feature maps, whose set is denoted R_7, where each map in R_7 is (H/2, W/2, 2C). The two parallel networks S_1 and S_2 of the second stage then perform downsampling operations with stride 1 and stride 2 respectively, yielding five new sets of feature maps denoted R_8, R_9, R_10, R_11 and R_12, where each map in R_8 is (H, W, C), each map in R_9 is (H/2, W/2, 2C), each map in R_10 is (H/2, W/2, 2C), each map in R_11 is (H/4, W/4, 4C), and each map in R_12 is (H/4, W/4, 4C).
After the second stage, the third stage generates three parallel networks S_3, S_4 and S_5. S_3 consists of 3 lightweight Xception Modules in series, and the width and height of the input and output feature layers of each Xception Module are consistent. At this point R_7 is partially upsampled to obtain a new set of feature maps denoted R_13, where each map in R_13 is (H, W, C). At the same time, the information fusion layer fuses R_8 and R_13 at the feature information level, and the set of feature maps generated after fusion is denoted R_14, where each map in R_14 is (H, W, C). The input end of S_3 receives all feature maps of R_14, and the output end of S_3 outputs the generated feature maps, whose set is denoted R_15, where each map in R_15 is (H, W, C). S_4 consists of 3 lightweight Xception Modules in series, with consistent width and height between the input and output feature layers of each Xception Module; the information fusion layer fuses the feature information of R_9 and R_10, and the set of fused feature maps is denoted R_16, where each map in R_16 is (H/2, W/2, 2C). The input end of S_4 receives all feature maps of R_16, and the output end of S_4 outputs the generated feature maps, whose set is denoted R_17, where each map in R_17 is (H/2, W/2, 2C). S_5 consists of 3 lightweight Xception Modules in series, with consistent width and height between the input and output feature layers of each Xception Module; the information fusion layer fuses R_11 and R_12 at the feature information level, and the set of fused feature maps is denoted R_18, where each map in R_18 is (H/4, W/4, 4C). The input end of S_5 receives all feature maps of R_18, and the output end of S_5 outputs the generated feature maps, whose set is denoted R_19, where each map in R_19 is (H/4, W/4, 4C). At the end of the third stage, the feature maps R_17 and R_19 generated by sub-networks S_4 and S_5 need to be upsampled to produce feature maps of the same size scale as R_15 generated by the S_3 sub-network; the resulting sets of feature maps are denoted R_20 and R_21 respectively. R_15, R_20 and R_21 are then input into the feature fusion layer for feature information fusion, and the set formed by the newly generated feature maps is denoted R_22, where each map in R_22 is (H, W, C).
For the output layer, which consists of 1 convolution layer, the input end of the output layer receives all the feature maps in the set R_22 output by the hidden layer, and the output end of the output layer outputs the semantic segmentation prediction map corresponding to the original input image; the width of each semantic segmentation prediction map is W and the height is H.
2_4, The original street view images in the training set and the corresponding semantic segmentation label gray-scale maps are used as original input images and input into the deep neural network for training, obtaining the semantic segmentation prediction map corresponding to each original street view image in the training set; the set of semantic segmentation prediction maps corresponding to each original street view image {J_n(i,j)} is denoted P_n.
2_5 The loss function value between the set P_n of semantic segmentation prediction maps corresponding to each original street view image in the training set and the one-hot encoded image set T_n obtained from the corresponding real semantic segmentation image is calculated; the loss function value between P_n and T_n is denoted Loss_n, and is obtained using the categorical cross entropy (categorical crossentropy).
2_6 Steps 2_4 and 2_5 are repeatedly executed M times to obtain the deep neural network classification training model, producing M × N loss function values; the smallest loss function value is then found among the M × N values; the weight vector and bias term corresponding to this smallest loss function value are taken as the optimal weight vector and optimal bias term of the deep neural network classification training model, correspondingly denoted W_best and b_best. In this example, M = 484.
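A sketch of the training loop in steps 2_4 to 2_6, assuming model is the constructed network, criterion the cross-entropy loss shown earlier, and train_loader a DataLoader built over the Cityscapes training set above: the loss is tracked across the M passes over the N training images, and the parameters giving the smallest loss are kept as W_best and b_best. The optimizer choice and learning rate are assumptions.

```python
import copy
import torch

def train(model, train_loader, criterion, epochs_M, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float('inf'), None
    for epoch in range(epochs_M):                 # repeat steps 2_4 and 2_5 M times
        for image, target in train_loader:        # the N training images per pass
            optimizer.zero_grad()
            loss = criterion(model(image), target)
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:           # smallest of the M x N loss values so far
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())  # holds W_best and b_best
    model.load_state_dict(best_state)
    return model
```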
3 Test the model: the test set is input into the trained model for testing. The specific steps of the test stage process are as follows:
3_1 Let {I(i′,j′)} represent a road scene image to be semantically segmented; here 1 ≤ i′ ≤ W′ and 1 ≤ j′ ≤ H′, where W′ represents the width of {I(i′,j′)}, H′ represents the height of {I(i′,j′)}, and I(i′,j′) represents the pixel value of the pixel at coordinate position (i′,j′) in {I(i′,j′)};
3_2 The R channel component, G channel component and B channel component of {I(i′,j′)} are input into the trained deep neural network classification model, and prediction is performed using W_best and b_best to obtain the predicted semantic segmentation image corresponding to {I(i′,j′)}, denoted {I_pred(i′,j′)}, where I_pred(i′,j′) represents the pixel value of the pixel at coordinate position (i′,j′) in {I_pred(i′,j′)}.
The feasibility and effectiveness of the method of the invention are further verified as follows.
The architecture of the deep neural network was built using the Python-based deep learning library PyTorch 1.2. The Cityscapes test set is used to analyse the segmentation effect of the street view images predicted by the method of the invention. The segmentation performance of the predicted semantic segmentation images is evaluated using 3 objective metrics commonly used to assess semantic segmentation methods, namely the mean intersection-over-union (Mean Intersection over Union, MIoU), the pixel accuracy (Pixel Accuracy, PA) and the mean pixel accuracy (Mean Pixel Accuracy, MPA), whose definitions are given below.
Definition 1: mlou (homozygote ratio, mean Intersection over Union) is a standard measure of semantic segmentation. Which calculates the ratio of the intersection and union of the two sets. The formula is as follows:
Figure BDA0002784048100000161
definition 2: the Pixel Accuracy (Pixel Accuracy) represents the proportion of the marked correct pixels to the total pixels as shown in the following formula:
Figure BDA0002784048100000162
definition 3: the average pixel accuracy (Mean Pixel Accuracy) is a boost of the PA, calculates the ratio of the number of correctly classified pixels in each class, and then averages the PA for all classes as follows:
Figure BDA0002784048100000163
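A sketch of computing the three metrics from a confusion matrix whose entry [i, j] counts pixels of true class i predicted as class j; this follows the standard definitions above and is not code from the patent.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf: (k+1, k+1) confusion matrix; conf[i, j] = pixels of class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    miou = np.nanmean(iou)                   # mean intersection-over-union (MIoU)
    pa = tp.sum() / conf.sum()               # pixel accuracy (PA)
    mpa = np.nanmean(tp / conf.sum(axis=1))  # mean pixel accuracy (MPA)
    return miou, pa, mpa

# conf can be accumulated per image with:
# np.bincount(19 * true.flatten() + pred.flatten(), minlength=19 * 19).reshape(19, 19)
```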
The method of the present invention is used to predict each street view image in the Cityscapes test set, obtaining a predicted semantic segmentation image corresponding to each street view image. The mean intersection-over-union MIoU, the pixel accuracy PA and the mean pixel accuracy MPA reflect the validity and prediction accuracy of the semantic segmentation effect of the method; the mean intersection-over-union MIoU values are shown in Table 1.
TABLE 1 mIoU values on the Cityscapes dataset for the method of the present invention
[Table 1 is provided as an image in the original publication.]
From the data listed in Table 1, the segmentation effect on street view images obtained by the method of the present invention is good, which indicates that the method is feasible and effective for obtaining the predicted semantic segmentation image corresponding to a street view image. The specific performance in terms of mean intersection-over-union MIoU, pixel accuracy PA and mean pixel accuracy MPA is shown in Table 2; the results show that the segmentation effect of the method ranks among the best of the existing segmentation models.
TABLE 2 Algorithm performance on the Cityscapes dataset
[Table 2 is provided as an image in the original publication.]
In fig. 4, (a) a selected street view image to be semantically segmented is given; (b) Giving out a real semantic segmentation image corresponding to the street view image to be semantically segmented shown in the step (a); (c) The invention provides a predicted semantic segmentation image obtained by predicting the street view image to be semantically segmented shown in the step (a) by using the method. Comparing (b) and (c) in fig. 4, it can be seen that the segmentation accuracy of the predicted semantic segmentation image obtained by the method of the present invention is higher, and is close to the real semantic segmentation image.
It should be noted that the examples are disclosed for the purpose of aiding in the further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.

Claims (7)

1. A deformable convolution fusion enhanced street view image semantic segmentation method comprises a training stage and a testing stage, and specifically comprises the following steps:
1) Constructing an image training set which comprises an original street view image and a corresponding semantic tag image;
selecting N pieces of original street view image data and the corresponding semantic segmentation label gray-scale maps, namely semantic label images, to form the image training set; N is a positive integer; the n-th original street view image in the training set is denoted {J_n(i,j)}, and the semantic segmentation label image corresponding to {J_n(i,j)} in the training set is denoted {Ĵ_n(i,j)}; n is a positive integer, with 1 ≤ n ≤ N; (i,j) is the coordinate position of a pixel in the image, with 1 ≤ i ≤ W and 1 ≤ j ≤ H, where W denotes the width of {J_n(i,j)} and H denotes the height of {J_n(i,j)}; J_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {J_n(i,j)}, and Ĵ_n(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {Ĵ_n(i,j)}; the real segmentation label image corresponding to the semantic label image is denoted {G_n(i,j)}; {G_n(i,j)} is then processed into a one-hot encoded image, and the one-hot encoded images form the set T;
2) Constructing and training a street view image semantic segmentation deep neural network model:
the street view image semantic segmentation deep neural network comprises an input layer, a hidden layer and an output layer; the hidden layer is a multi-resolution fusion enhancement network introducing deformable convolution and comprises a first stage, a second stage and a third stage;
the first stage forms a fusion sub-network A by connecting a deformable convolution in series with an Xception Module, and the sub-network A is repeated three times in series so as to obtain more deep semantic feature information; the Xception Module is a basic residual module of the semantic segmentation network DeepLabV3+;
the second stage is a double-branch parallel network, each branch is a subnet, and each branch consists of three Xreception modules which are connected in series and is used for feature extraction and feature fusion; the third stage is three branch parallel networks, each branch is a subnet, and each branch consists of three Xreception modules connected in series;
2_1) the input layer of the street view image semantic segmentation deep neural network model receives the R, G and B channel components of an original input image and outputs them to the hidden layer;
2_2) the first stage of the hidden layer comprises fusion enhancement modules, each constructed by connecting a deformable convolution module in series with an Xception Module, and a plurality of feature maps are generated sequentially through the three fusion enhancement modules; the displacement required by the deformable convolution is obtained from the output of a parallel standard convolution and is then applied to the convolution kernel, thereby realizing the deformable convolution;
in the deformable convolution, the regular grid R is augmented by offsets {Δp_k | k = 1, ..., K}, as represented by formula (3):
y(p_1) = Σ_{k=1}^{K} w(p_k) · x(p_1 + p_k + Δp_k)    (3)
where p_1 traverses each position on the output feature map y of the deformable convolution; y(p_1) is the value of the deformed feature map at position p_1; p_k is the k-th sampling position of the regular grid R and w(p_k) is the corresponding convolution kernel weight; x is the input feature map; and Δp_k is the learned offset, giving 2K offset values covering both the x and y directions;
since the offset Δp_k is generally fractional, bilinear interpolation is then used to compute the values at the offset positions from the neighbouring integer positions; after the values at all pixel positions are obtained, a new image M is obtained, and M is fed as input data to the Xception Module;
the residual learning unit of the Xception Module performs the convolution calculation and feature extraction with depthwise separable convolutions, operating channel by channel and then point by point, thereby obtaining a feature map;
for the first stage, the input end of the 1st deformable-convolution-in-series-with-Xception fusion enhancement Module is connected with the channel components of the original input image output by the output end of the input layer, and its output end outputs the generated feature map set R_1; the input end of the 2nd fusion enhancement Module receives R_1, and its output end outputs the generated feature map set R_2; the input end of the 3rd fusion enhancement Module receives R_2, and its output end outputs the generated feature map set R_3, whose feature maps have size (H, W, C), where C is the number of channels;
after the fusion enhancement module formed by the deformable convolution in series with the Xception Module has been repeated 3 times to extract features in the first stage, downsampling operations are performed respectively to obtain two new feature map sets, denoted R_4 and R_5; each feature map in R_4 has size (H, W, C), and each feature map in R_5 has size (H/2, W/2, 2C);
2_3) the second stage and the third stage of the hidden layer exchange information among the multi-resolution features, so that good semantic expression capability is obtained while the high spatial resolution is maintained;
the second stage generates two parallel subnets S_1 and S_2; S_1 consists of 3 lightweight Xception Modules connected in series; the input end of S_1 receives R_4, and the output end of S_1 outputs the generated feature map set R_6, where each feature map in R_6 has size (H, W, C); S_2 consists of 3 lightweight Xception Modules connected in series; the input end of S_2 receives R_5, and the output end of S_2 outputs the generated feature map set R_7, where each feature map in R_7 has size (H/2, W/2, 2C); downsampling operations are then performed on the two parallel subnets S_1 and S_2 respectively to obtain five new feature map sets, denoted R_8, R_9, R_10, R_11 and R_12; each feature map in R_8 has size (H, W, C), each in R_9 has size (H/2, W/2, 2C), each in R_10 has size (H/2, W/2, 2C), each in R_11 has size (H/4, W/4, 4C), and each in R_12 has size (H/4, W/4, 4C);
the third stage generates three parallel subnets S_3, S_4 and S_5, where S_3 consists of 3 lightweight Xception Modules connected in series; R_7 is upsampled to obtain a new feature map set denoted R_13, where each feature map in R_13 has size (H, W, C); the feature information of R_8 and R_13 is fused layer by layer, and the feature map set generated after fusion is denoted R_14, where each feature map in R_14 has size (H, W, C); the input end of S_3 receives R_14, and the output end of S_3 outputs the generated feature map set R_15, where each feature map in R_15 has size (H, W, C); S_4 consists of 3 lightweight Xception Modules connected in series; meanwhile, the feature information of R_9 and R_10 is fused, and the fused feature map set is denoted R_16, where each feature map in R_16 has size (H/2, W/2, 2C); the input end of S_4 receives R_16, and the output end of S_4 outputs the generated feature map set R_17, where each feature map in R_17 has size (H/2, W/2, 2C); S_5 consists of 3 lightweight Xception Modules connected in series; the feature information of R_11 and R_12 is fused layer by layer, and the generated feature map set is denoted R_18, where each feature map in R_18 has size (H/4, W/4, 4C); the input end of S_5 receives R_18, and the output end of S_5 outputs the generated feature map set R_19, where each feature map in R_19 has size (H/4, W/4, 4C); at the end of the third stage, upsampling operations are performed on the feature map sets R_17 and R_19 generated by S_4 and S_5 to produce feature maps of the same size as those in R_15, and the resulting feature map sets are denoted R_20 and R_21 respectively; R_15, R_20 and R_21 are then input into a feature fusion layer for feature information fusion to generate a new feature map set R_22, where each feature map in R_22 has size (H, W, C);
the output layer consists of 1 convolution layer; the input end of the output layer receives the feature map set R_22, and the output end of the output layer outputs the semantic segmentation prediction map corresponding to the input original image; each semantic segmentation prediction map has width W, height H and C channels;
2_4) inputting the image training set into the constructed street view image semantic segmentation deep neural network model for training to obtain the semantic segmentation prediction map corresponding to each original street view image, the set formed by these semantic segmentation prediction maps being denoted {J_n^pred(i,j)};
2_5) calculating the loss function value Loss_n between {J_n^pred(i,j)} and the corresponding set of one-hot encoded images {J_n^onehot(i,j)};
2_6) repeatedly executing step 2_4) and step 2_5) M times to obtain the deep neural network classification training model, obtaining M × N loss function values in total; finding the loss function value with the smallest value among the M × N loss function values; the weight vector and bias term corresponding to the loss function value with the smallest value are taken as the optimal weight vector and optimal bias term of the deep neural network classification training model, denoted W_best and b_best respectively; the training of the street view image semantic segmentation deep neural network classification model is thereby completed;
3) Testing the model: inputting the test set into the trained model for testing;
3_1) let {S(i',j')} denote a road scene image to be semantically segmented, i.e. the test set, where 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H', W' denotes the width of {S(i',j')}, H' denotes the height of {S(i',j')}, and S(i',j') denotes the pixel value of the pixel point at coordinate position (i',j') in {S(i',j')};
3_2) the channel components of {S(i',j')} are input into the trained street view image semantic segmentation deep neural network classification model, and prediction is performed using W_best and b_best to obtain the predicted semantic segmentation image corresponding to {S(i',j')}, denoted {S^pred(i',j')}, where S^pred(i',j') denotes the pixel value of the pixel point at coordinate position (i',j') in {S^pred(i',j')}; the deformable convolution fusion enhanced image semantic segmentation is thereby realized.
2. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein in step 1), the object categories in the street view image are divided into 19 categories.
3. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein the bilinear interpolation calculation of step 2_2) is represented by formula (4):
x(p) = Σ_q G(q, p) · x(q)    (4)
where p denotes a (generally fractional) position on the feature map, q enumerates all integral spatial positions in the feature map x, and G(·,·) is the bilinear interpolation kernel.
4. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein the residual learning unit of the Xception Module extracts features by 3 depthwise separable convolutions of size 3×3; when the input feature and the output feature differ in dimension, the dimension of the input feature is adjusted by a 1×1 convolution and then added to the output feature of the residual learning unit, thereby obtaining a feature map; when the input feature and the output feature have the same dimension, the input feature is added directly to the output feature map of the residual learning unit, thereby obtaining a feature map.
5. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein step 2_5) specifically adopts the categorical cross entropy to obtain the loss function value Loss_n between {J_n^pred(i,j)} and {J_n^onehot(i,j)}.
6. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein the deep neural network model is built specifically using the Python-based deep learning library PyTorch 1.2.
7. The deformable convolution fusion enhanced street view image semantic segmentation method according to claim 1, wherein the Cityscapes test set is specifically adopted, and the mean intersection over union (MIoU), pixel accuracy (PA) and mean pixel accuracy (MPA) are adopted as indexes to verify the street view image segmentation effect of the deformable convolution fusion enhanced street view image semantic segmentation method.
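To relate the structure recited in claim 1 (step 2_2) and claims 3 and 4 to working code, the sketch below outlines one possible PyTorch implementation of a fusion enhancement module: a deformable convolution whose offsets come from a parallel standard convolution (torchvision's DeformConv2d is used here as a stand-in for formulas (3) and (4)), followed by an Xception-style residual unit built from 3×3 depthwise separable convolutions. The class names, channel counts and layer widths are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SeparableConv2d(nn.Module):
    """3x3 depthwise separable convolution: channel-by-channel, then point-by-point."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class XceptionResidualUnit(nn.Module):
    """Residual learning unit with 3 depthwise separable 3x3 convolutions (cf. claim 4).
    A 1x1 convolution adjusts the input dimension when it differs from the output."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            SeparableConv2d(in_ch, out_ch),
            SeparableConv2d(out_ch, out_ch),
            SeparableConv2d(out_ch, out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1, bias=False) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.skip(x)

class FusionEnhancementModule(nn.Module):
    """Deformable convolution in series with an Xception residual unit (cf. claim 1, step 2_2).
    A parallel standard convolution predicts the 2*K offsets required by the deformable
    convolution, where K = 3*3 sampling positions of the regular grid."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        offset_channels = 2 * kernel_size * kernel_size      # x and y offset per sampling position
        self.offset_conv = nn.Conv2d(in_ch, offset_channels, kernel_size, padding=kernel_size // 2)
        self.deform_conv = DeformConv2d(in_ch, in_ch, kernel_size, padding=kernel_size // 2)
        self.xception = XceptionResidualUnit(in_ch, out_ch)

    def forward(self, x):
        offset = self.offset_conv(x)        # displacement obtained from the parallel standard conv
        m = self.deform_conv(x, offset)     # deformed feature map ("new image M") fed to Xception
        return self.xception(m)

# Illustrative first stage: three cascaded fusion enhancement modules, the first fed with
# the R, G, B channel components (channel width 64 is an assumed value).
stage1 = nn.Sequential(
    FusionEnhancementModule(3, 64),
    FusionEnhancementModule(64, 64),
    FusionEnhancementModule(64, 64),
)
```

DeformConv2d internally performs the bilinear sampling of formula (4), so the fractional offsets predicted by the parallel convolution are resolved against the integral grid positions without extra code.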
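The multi-resolution exchange of step 2_3) can likewise be sketched as parallel branches whose lower-resolution outputs are upsampled, projected back to the full-resolution width and fused before the single output convolution. Fusion by element-wise addition and the 1×1 channel projections below are assumptions made for illustration; the patent only states that R_15, R_20 and R_21 are fused into R_22 and passed through one convolution layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Fuse three parallel branches (full, 1/2 and 1/4 resolution) and predict per-class maps,
    mirroring the fusion of R_15, R_20 and R_21 into R_22 followed by the output convolution."""
    def __init__(self, c=64, num_classes=19):
        super().__init__()
        # project the lower-resolution branches back to the width C of the full-resolution branch
        self.reduce_half = nn.Conv2d(2 * c, c, 1, bias=False)
        self.reduce_quarter = nn.Conv2d(4 * c, c, 1, bias=False)
        self.classifier = nn.Conv2d(c, num_classes, 1)   # the single output convolution layer

    def forward(self, r15, r17, r19):
        h, w = r15.shape[2:]
        r20 = F.interpolate(self.reduce_half(r17), size=(h, w), mode='bilinear', align_corners=False)
        r21 = F.interpolate(self.reduce_quarter(r19), size=(h, w), mode='bilinear', align_corners=False)
        r22 = r15 + r20 + r21                             # feature information fusion (assumed additive)
        return self.classifier(r22)
```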
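Finally, the training objective of step 2_5) and claim 5, the categorical cross entropy between the prediction maps and the one-hot encoded label images, corresponds to the standard per-pixel cross-entropy loss. A minimal sketch assuming a model that outputs one channel per class follows; function names and the optimiser interface are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 19   # street view object categories, cf. claim 2

def one_hot_labels(label, num_classes=NUM_CLASSES):
    """Turn an integer label map (N, H, W) into one-hot images (N, C, H, W), as in step 1)."""
    return F.one_hot(label.long(), num_classes).permute(0, 3, 1, 2).float()

criterion = nn.CrossEntropyLoss()   # categorical cross entropy over the class dimension

def training_step(model, optimizer, image, label):
    """One optimisation step; image is (N, 3, H, W), label is (N, H, W) class indices.
    CrossEntropyLoss on class indices equals the cross entropy against the one-hot labels."""
    optimizer.zero_grad()
    logits = model(image)                    # (N, NUM_CLASSES, H, W) prediction maps
    loss = criterion(logits, label.long())
    loss.backward()
    optimizer.step()
    return loss.item()
```

Keeping the weights that give the smallest of the M × N loss values, as in step 2_6), would then yield W_best and b_best.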
CN202011291950.6A 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method Active CN112396607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011291950.6A CN112396607B (en) 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method


Publications (2)

Publication Number Publication Date
CN112396607A (en) 2021-02-23
CN112396607B (en) 2023-06-16

Family

ID=74606378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011291950.6A Active CN112396607B (en) 2020-11-18 2020-11-18 Deformable convolution fusion enhanced street view image semantic segmentation method

Country Status (1)

Country Link
CN (1) CN112396607B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313105B (en) * 2021-04-12 2022-07-01 厦门大学 Method for identifying areas of office swivel chair wood board sprayed with glue and pasted with cotton
CN113205520B (en) * 2021-04-22 2022-08-05 华中科技大学 Method and system for semantic segmentation of image
CN113420770A (en) * 2021-06-21 2021-09-21 梅卡曼德(北京)机器人科技有限公司 Image data processing method, image data processing device, electronic equipment and storage medium
CN113326799A (en) * 2021-06-22 2021-08-31 长光卫星技术有限公司 Remote sensing image road extraction method based on EfficientNet network and direction learning
CN113657388B (en) * 2021-07-09 2023-10-31 北京科技大学 Image semantic segmentation method for super-resolution reconstruction of fused image
CN113554733B (en) * 2021-07-28 2022-02-01 北京大学 Language-based decoupling condition injection gray level image colorization method
CN113807356B (en) * 2021-07-29 2023-07-25 北京工商大学 End-to-end low-visibility image semantic segmentation method
CN113608223B (en) * 2021-08-13 2024-01-05 国家气象信息中心(中国气象局气象数据中心) Single-station Doppler weather radar strong precipitation estimation method based on double-branch double-stage depth model
CN113762263A (en) * 2021-08-17 2021-12-07 慧影医疗科技(北京)有限公司 Semantic segmentation method and system for small-scale similar structure
CN115294488B (en) * 2022-10-10 2023-01-24 江西财经大学 AR rapid object matching display method
CN115393725B (en) * 2022-10-26 2023-03-07 西南科技大学 Bridge crack identification method based on feature enhancement and semantic segmentation
CN115620001B (en) * 2022-12-15 2023-04-07 长春理工大学 Visual auxiliary system based on 3D point cloud bilateral amplification algorithm


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018035805A1 (en) * 2016-08-25 2018-03-01 Intel Corporation Coupled multi-task fully convolutional networks using multi-scale contextual information and hierarchical hyper-features for semantic image segmentation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795976A (en) * 2018-08-03 2020-02-14 华为技术有限公司 Method, device and equipment for training object detection model
CN110059768A (en) * 2019-04-30 2019-07-26 福州大学 The semantic segmentation method and system of the merging point and provincial characteristics that understand for streetscape
CN110826596A (en) * 2019-10-09 2020-02-21 天津大学 Semantic segmentation method based on multi-scale deformable convolution
CN111401436A (en) * 2020-03-13 2020-07-10 北京工商大学 Streetscape image segmentation method fusing network and two-channel attention mechanism
CN111563909A (en) * 2020-05-10 2020-08-21 中国人民解放军91550部队 Semantic segmentation method for complex street view image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Semantic segmentation of UAV ground-object scenes based on deep convolutional networks; 宋建辉; 程思宇; 刘砚菊; 于洋; 沈阳理工大学学报 (No. 06); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant