CN113255574B

CN113255574B - Urban street semantic segmentation method and automatic driving method

Info

Publication number: CN113255574B
Application number: CN202110670967.0A
Authority: CN
Inventors: 瞿绍军; 欧阳柳; 刘义亮
Original assignee: Hunan Normal University
Current assignee: Hunan Normal University
Priority date: 2021-06-17
Filing date: 2021-06-17
Publication date: 2021-09-14
Anticipated expiration: 2041-06-17
Also published as: CN113255574A

Abstract

The invention discloses a semantic segmentation method for urban streets, which comprises the steps of obtaining an original training data set; constructing a basic semantic segmentation network, adding an attention module based on pixels and an attention module based on different image levels to obtain the basic semantic segmentation network based on the attention of the different image levels and the attention of the pixels, and training to obtain the semantic segmentation network based on the attention of the different image levels and the attention of the pixels; and performing semantic segmentation on the urban streets in real time by adopting a semantic segmentation network based on different levels of attention and pixel attention of the image. The invention also discloses an automatic driving method comprising the urban street semantic segmentation method. The method of the invention fully utilizes the information of the high-level characteristic diagram and the information of the low-level characteristic diagram, and has high precision, good reliability and better real-time property.

Description

Urban street semantic segmentation method and automatic driving method

Technical Field

The invention belongs to the field of computer vision and image processing, and particularly relates to a semantic segmentation method and an automatic driving method for urban streets.

Background

With the development of economic technology and the improvement of living standard of people, computer vision technology is gradually applied to the production and life of people, and brings endless convenience to the production and life of people.

Semantic segmentation is one of core research hotspots of computer vision, and is aimed at dividing an image into regions with semantic information, allocating a semantic label to each region block, and finally obtaining a segmented image with each pixel being semantically labeled.

The prior art mainly has two semantic segmentation methods: the image semantic segmentation method based on the traditional image semantic segmentation method and the image semantic segmentation method based on the deep learning. The image semantic segmentation method based on deep learning has richer learning characteristics, stronger expression capability and greatly improved segmentation precision, so the method becomes the key point of research. The full convolution network applies the classification network to the network, replaces the full connection layer of the traditional convolution neural network with the convolution layer, combines the characteristic diagram generated by the middle convolution layer by using a jump layer method, and then performs transposition convolution; however, this approach presents two problems: 1. with the convolution pooling, the resolution is continuously reduced, and partial pixels are lost; 2. the original context information of the feature map is not considered. Subsequently, a number of researchers have proposed improved methods based on full convolutional networks. In the pyramid network, the pyramid pooling module can fuse multi-scale context information, effectively utilize the context information, and can make the segmentation result more detailed, but the disadvantage is that the boundary information part of the segmentation target is lost. The U-shaped neural network is a network model of a coder-decoder and comprises a contraction path and an expansion path, wherein the contraction path extracts context information, and the expansion path gradually restores object details and image resolution, but the U-shaped neural network has the defects that network training parameters are too many, the calculation amount is large, and real-time processing cannot be met. OCNet forms a target context feature map by calculating the similarity of each pixel with all pixels, and then represents the pixel by aggregating the features of all pixels, but has the disadvantage that part of the pixels are lost in the process. In deep lab-v3, a void convolution kernel pyramid pooling method is combined to construct a void space pyramid pooling module, and multi-scale context information is captured by using convolutions with different void rates, so that the receptive field is effectively enhanced, the spatial accuracy of a segmentation result is improved, but the defect is that the dependency among pixels is lost.

Disclosure of Invention

The invention aims to provide a semantic segmentation method for urban streets, which has high accuracy, good reliability and good real-time performance.

The invention also aims to provide an automatic driving method comprising the urban street semantic segmentation method.

The invention provides a semantic segmentation method for urban streets, which comprises the following steps:

s1, acquiring an original training data set;

s2, constructing a basic semantic segmentation network;

s3, adding a pixel-based attention module in the basic semantic segmentation network constructed in the step S2 to obtain a pixel-attention-based basic semantic segmentation network;

s4, adding attention modules based on different image levels to the semantic segmentation network based on pixel attention obtained in the step S3, so as to obtain a basic semantic segmentation network based on the attention of different image levels and the attention of pixels;

s5, training the basic semantic segmentation network based on the attention of different levels of the image and the attention of the pixels obtained in the step S4 by adopting the original training data set obtained in the step S1, so as to obtain the semantic segmentation network based on the attention of different levels of the image and the attention of the pixels;

and S6, performing semantic segmentation on the city streets in real time by adopting the semantic segmentation network based on the attention of different levels and the attention of pixels of the image obtained in the step S5.

Step S2, constructing a basic semantic segmentation network, specifically, using a Resnet101 network as the basic semantic segmentation network.

In the basic semantic segmentation network constructed in step S2 and described in step S3, a pixel-based attention module is added, specifically, the pixel-based attention module is connected in series at the output end of the fourth block of the Resnet101 network.

The pixel-based attention module specifically comprises the following steps:

A. features to the fourth block of the Resnet101 networkSign graphX ₄The outer side of (a) is filled with all 1 s of dimension 1;

B. and D, operating the filled characteristic diagram obtained in the step A by adopting the following formula, thereby obtaining a preprocessed characteristic diagramX _pre：

In the formula

Is the feature map of the fourth block of the Resnet101 network populated in step A;

for standard convolution operation, the convolution kernel is 3 x 3, the sample step size is 1,dvoid fraction and value 1;BN() The method is a batch standard operation;

C. for the pretreatment characteristic graph obtained in the step BX _preThe pixel relation matrix is obtained by processing according to the following formulaX _wmap：

In the formulaR() Is a reshape operation; x is a matrix multiplication operation;

activating a function for sigmoid;

D. for the pixel relation matrix obtained in the step CX _wmapProcessing is performed by the following formula to obtain a depth feature mapX _procIs composed ofX _proc=R(X _wmap×R(X _pre))；

E. Feature map for the fourth block of the Resnet101 networkX ₄Is proceeding to the outside ofAll 0 fills of 1;

F. and E, operating the filled characteristic diagram obtained in the step E by adopting the following formula, thereby obtaining the characteristic diagram after convolution with different voidage ratesF ₁~F ₄：

In the formula

A feature map of the fourth block of the Resnet101 network populated in step E;

is a standard convolution operation with a convolution kernel size of 1 x 1, a sample step size of 1,dvoid fraction and value 1;

is a standard convolution operation with a convolution kernel size of 3 x 3, a sample step size of 1,dvoid fraction and value 12;

is a standard convolution operation with a convolution kernel size of 3 x 3, a sample step size of 1,dvoid fraction and value 24;

is a standard convolution operation with a convolution kernel size of 3 x 3, a sample step size of 1,dvoid fraction and value 36;

G. the obtained characteristic diagramX _proc、F ₁、F ₂、F ₃AndF ₄on the channel, processing is performed using the following equation, thereby completing the processing of the pixel-based attention module:

in the formulaF _mThe feature map is output after the attention module based on the pixel processes;concat() A splicing operation in the channel dimension.

In the semantic segmentation based on pixel attention obtained in step S3 in step S4, attention modules based on different image levels are added, specifically at the output end of the second block of the Resnet101 network, and are connected in parallel to the attention modules based on different image levels.

The attention module based on different image levels specifically comprises the following steps:

a. signature graph X for the output of the second block of the Resnet101 network₂The outer side of (a) is filled with all 0 s of dimension 3;

b. c, the filled characteristic diagram obtained in the step a

Performing global average pooling to obtain results

；

c. C, the filled characteristic diagram obtained in the step a

Performing maximum pooling to obtain results

；

d. The results obtained in the step b and the step c are carried outconcatOperate to obtain a first characteristic diagram

；

e. D, operating the first characteristic diagram obtained in the step d by adopting the following formula to obtain the attention characteristic diagramF _N：

In the formula

Is a standard convolution operation with a convolution kernel size of 7 x 7, a sample step size of 1,dvoid fraction and value 1;

is a Hadamard product;

f. the attention feature map obtained in the step eF _NProcessing the image by adopting the following formula so as to obtain a final feature map based on the attention module output of different levels of the imageF _pam：

In the formulaF _mIs a feature map output after processing by the pixel-based attention module.

The invention also provides an automatic driving method comprising the urban street semantic segmentation method.

The urban street semantic segmentation method and the automatic driving method provided by the invention utilize the relation among the high-level feature image pixels to obtain the global information, enhance the correlation among the pixels, and further extract the feature information of the image by combining with the attention module based on the pixels; aiming at the problem of pixel loss of the high-level feature image of the image, an attention module based on different levels of the image is provided, information in the high-level feature image is used as guidance to mine information hidden in the low-level feature image, and then the information is fused with the high-level feature image; therefore, the method of the invention fully utilizes the information of the high-level characteristic diagram and the information of the low-level characteristic diagram, and has high precision, good reliability and good real-time performance.

Drawings

FIG. 1 is a schematic process flow diagram of the process of the present invention.

FIG. 2 is a schematic diagram of a model structure of the method of the present invention.

FIG. 3 is a schematic diagram of a pixel-based attention module in a model structure of the method of the present invention.

FIG. 4 is a schematic diagram of a PSAM module in a pixel-based attention module in a model structure of the method of the present invention.

FIG. 5 is a schematic diagram of the structure of the attention module based on different image levels in the model structure of the method of the present invention.

FIG. 6 is a diagram illustrating the comparison between the effect of the present invention and the effect of the prior art on the same atlas.

Detailed Description

FIG. 1 is a schematic flow chart of the method of the present invention: the method for segmenting the urban street semantics provided by the invention comprises the following steps (the structure schematic diagram of the network provided by the invention is shown in figure 2):

s1, acquiring an original training data set;

s2, constructing a basic semantic segmentation network; particularly, a Resnet101 network is used as a basic semantic segmentation network;

s3, adding a pixel-based attention module in the basic semantic segmentation network constructed in the step S2 to obtain a pixel-attention-based basic semantic segmentation network; in particular, at the output of the fourth block of the Resnet101 network, a pixel-based attention module (ASPP AM module in fig. 2) is concatenated;

in particular, the structure of the pixel-based attention module is shown in FIG. 3; the method specifically comprises the following steps:

A. feature map for the fourth block of the Resnet101 networkX ₄The outer side of (a) is filled with all 1 s of dimension 1;

In the formula

activating a function for sigmoid;

The depth feature map can enable the pixel category information to get more attention, and meanwhile, the detail information is more highlighted;

E. padding the outer side of the feature map X4 of the fourth block of the Resnet101 network with all 0's of scale 1;

wherein, the processing and calculation performed in steps B to E are the PSAM module in fig. 3, and a schematic network structure diagram thereof is shown in fig. 4;

In the formula

A feature map of the fourth block of the Resnet101 network populated in step E;

in the formulaF _mThe feature map is output after the attention module based on the pixel processes;concat() Splicing operation in channel dimension;

after the processing, the detail information of the feature map is weighted and refined, the context information is richer, and the size of the feature map is unchanged;

s4, adding attention modules based on different image levels to the semantic segmentation network based on pixel attention obtained in the step S3, so as to obtain a basic semantic segmentation network based on the attention of different image levels and the attention of pixels; specifically, at the output end of the second block of the Resnet101 network, attention modules (PAM module in fig. 2) based on different levels of the image are connected in parallel;

in specific implementation, a schematic diagram of a network structure of attention modules based on different image levels is shown in fig. 5; the method specifically comprises the following steps:

b. c, the filled characteristic diagram obtained in the step a

Performing global average pooling to obtain results

；

c. C, the filled characteristic diagram obtained in the step a

Performing maximum pooling to obtain results

；

；

In the formula

is a Hadamard product;

f. directing the attention obtained in step eCharacteristic diagramF _NProcessing the image by adopting the following formula so as to obtain a final feature map based on the attention module output of different levels of the imageF _pam：

In the formulaF _mFeature maps output after processing for a pixel-based attention module

After processing, when the feature map is subjected to global average pooling and maximum pooling through direct connection, since the proportion of weight sharing of convolution to the two pooling is the same, in the highlighted information area, the two pooling do not necessarily contribute the same to the task of compensating for edge details; this convolution can be seen as giving different weights to the pooling operation so that the network can learn edge details better;

The process of the invention is further illustrated below with reference to specific examples:

the method disclosed by the invention is used for carrying out experiments on a Cityscapes data set, a frame is used as a pytorch1.4, an evaluation index is a common semantic segmentation evaluation index mIou (mean intersection over), an attention module based on pixels and attention modules based on different levels of an image are embedded into an FCN, and the dependency relationship among the pixels is calculated, wherein the results of the attention modules based on the pixels and the attention modules based on different levels of the image on the Cityscapes verification set are shown in a table 1.

TABLE 1 schematic representation of the impact of two modules on network performance

In order to verify the performance of the attention module, the invention carries out ablation experiments on the two modules, the mIou of Resnet-baseline is 68.1%, and compared with the basic Resnet-baseline, the mIou of the attention module added on the basis of the Resnet-baseline based on the pixels is 73.8%, which is improved by 5.6%. The attention module based on different image levels aims to refine edges and details, the segmentation performance is not obviously improved, and the mIou added to the attention module based on different image levels on the basis of Resnet-baseline is 69.3%, which is improved by 1.2%. The invention also performs experiments on the pyramid pooling module of the void space without any improvement, and the mIou of the pyramid pooling module is 70.7 percent. Experimental results show that the attention module based on pixels is very helpful for scene segmentation. In view of the computational cost, the backbone network is finally used as Resnet-101 with a down-sampling rate of 8, and the results in Table 1 are calculated from the official toolkit.

The invention also compares the method of the invention with the current advanced network, the data set is the test set of Cityscapes, the segmentation chart is predicted by the test set picture provided by the official party through the network, and the result is shown in the table 2 after the official test:

TABLE 2 schematic presentation of comparative data of the method of the invention and various advanced networks

In table 2, the proposed attention mechanism has an mlou of 69.3%, which significantly improves the performance of the previous FCN network. The ASPP AM module is improved by 5.6 percent compared with a reference network on a verification set of the data set, meanwhile, compared with the current popular network, compared with the original scaled FCN-16, the ASPP AM module is improved by nearly 22 percent, compared with the deep Labv3 containing ASPP, the performance of the ASPP AM module is improved by 5 percent, and compared with the latest bilateral attention network BiANet, the ASPP AM module is improved by 3 percent. The accuracy of the model mIou reaches 69.3%. The two modules of the network emphasize the dependency between pixels and the details of a low-level space, so that the method obtains better performance in a Cityscapes test set.

Finally, a visualization of the effect of the two modules proposed by the present invention on the network performance is shown in fig. 6. In fig. 6, from left to right, the original picture (a), the true value label (b), the result (c) of the baseline method, the result (d) of the ASPP AM method, the result (e) of the PAM method, the result (f) of the ASPP method, and the result (g) of the method of the present invention are shown in order. As can be seen from fig. 6, there are modules in the reset-baseline that are misclassified, and some edge detail partitions are not very coherent, such as: the green belt is mixed with sidewalks, plants in the sky, the vehicle is mixed with backgrounds and the like, and after the ASPP AM module is added, the phenomenon of mistaken separation is reduced because the dependence information among pixels is enhanced. After adding the PAM module, selective details are noted, improving the segmentation of the fine objects, such as: traffic signs and the like can be visually compared in the segmentation chart.

Finally, the invention also provides an automatic driving method comprising the urban street semantic segmentation method; in specific implementation, the semantic segmentation method for the urban streets is adopted to carry out semantic segmentation on the urban streets, and then automatic driving control is carried out according to semantic segmentation results.

Claims

1. A method for segmenting urban street semantics is characterized by comprising the following steps:

s1, acquiring an original training data set;

s3, adding a pixel-based attention module in the basic semantic segmentation network constructed in the step S2 to obtain a pixel-attention-based basic semantic segmentation network; specifically, at the output end of the fourth block of the Resnet101 network, a pixel-based attention module is connected in series; the pixel-based attention module specifically comprises the following steps:

In the formula

activating a function for sigmoid;

E. Feature map for the fourth block of the Resnet101 networkX ₄The outer side of (a) is filled with all 0's of dimension 1;

In the formula

A feature map of the fourth block of the Resnet101 network populated in step E;

2. The method for semantic segmentation of city streets according to claim 1, wherein step S4 is implemented by adding an attention module based on different image levels to the pixel attention-based semantic segmentation network obtained in step S3, specifically at the output end of the second block of the Resnet101 network, and then connecting the attention module based on different image levels.

3. The method according to claim 2, wherein the attention module based on different image levels comprises the following steps:

b. c, the filled characteristic diagram obtained in the step a