CN113011429A - Real-time street view image semantic segmentation method based on staged feature semantic alignment - Google Patents

Real-time street view image semantic segmentation method based on staged feature semantic alignment

Info

Publication number
CN113011429A
CN113011429A (application CN202110295657.5A)
Authority
CN
China
Prior art keywords
network
semantic
feature
module
semantic segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110295657.5A
Other languages
Chinese (zh)
Other versions
CN113011429B (en)
Inventor
严严
翁熙
王菡子
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202110295657.5A priority Critical patent/CN113011429B/en
Publication of CN113011429A publication Critical patent/CN113011429A/en
Application granted granted Critical
Publication of CN113011429B publication Critical patent/CN113011429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/35Categorising the entire scene, e.g. birthday party or wedding scene
    • G06V20/38Outdoor scenes
    • G06V20/39Urban scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A real-time street view image semantic segmentation method based on staged feature semantic alignment, relating to computer vision technology. An encoder is first constructed from the lightweight image classification network ResNet-18 and an efficient spatial-channel attention module, and a decoder is constructed from a plurality of differently designed feature semantic alignment modules and a global average pooling layer. The encoder and decoder are then assembled into a semantic segmentation network model based on an encoder-decoder network structure. Finally, features from the encoder are aggregated with the output features of the decoder and fed into a semantic segmentation result generation module to obtain the final semantic segmentation result. Corresponding segmentation results can be generated efficiently at a real-time rate while keeping the input image at high resolution, without reducing the image resolution. Compared with existing real-time semantic segmentation methods, the method achieves superior segmentation accuracy and a better balance between speed and accuracy.

Description

Real-time street view image semantic segmentation method based on staged feature semantic alignment
Technical Field
The invention relates to a computer vision technology, in particular to a real-time street view image semantic segmentation method based on staged feature semantic alignment.
Background
Semantic segmentation is one of the key technologies for scene understanding: it predicts a semantic category for every pixel in an image, i.e., pixel-level classification. In recent years, applications such as automatic driving and intelligent transportation have attracted much attention. In these applications, a problem that must be solved is how to provide a comprehensive understanding of traffic conditions at the semantic level. Studying street view image semantic segmentation methods that provide pixel-level street view scene understanding is therefore of exceptional importance for these applications.
In recent years, benefiting from the development of convolutional neural networks, a large number of semantic segmentation methods based on deep learning have been proposed. These methods achieve excellent segmentation results by capturing rich semantic information and spatial detail information. However, the backbone network of these methods employs a complex deep neural network to capture semantic information in the input image. For example, the commonly used network ResNet-101 (K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.) provides powerful semantic information extraction capability, but its huge depth and width also make it inefficient. Generally speaking, applications such as automatic driving and intelligent transportation require not only high-resolution input images to cover a wide field of view, but also an efficient interaction or response speed. Therefore, researchers have paid much attention to semantic segmentation methods that can maintain high segmentation accuracy in real time.
To date, many efforts have been made to achieve efficient or real-time semantic segmentation. These methods typically reduce the resolution of the input image or employ lightweight backbone networks to increase the efficiency of the network. Although such methods greatly reduce the computational complexity of semantic segmentation, contextual information or spatial details are lost to some extent, resulting in a significant drop in accuracy. Therefore, how to achieve a good balance between network prediction speed and segmentation accuracy has become a key challenge for real-time semantic segmentation.
Against this technical background, the invention provides a real-time street view image semantic segmentation method based on staged feature semantic alignment. The representation capability of the features used is enhanced while only a lightweight backbone network is adopted. Thus, the semantic segmentation network model can maintain excellent segmentation accuracy while maintaining real-time network prediction speed.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art, and an object of the present invention is to provide a real-time street view image semantic segmentation method based on staged feature semantic alignment, which can efficiently generate corresponding segmentation results at a real-time rate with high segmentation accuracy.
The invention comprises the following steps:
A. dividing a street view image semantic segmentation data set into a training set, a verification set and a test set;
B. constructing a basic network of a semantic segmentation network model by combining a specially designed efficient spatial-channel attention module with a lightweight image classification network structure;
C. designing feature semantic alignment modules with different network structures according to the characteristics of the features of different stages in the basic network obtained in step B;
D. taking the basic network obtained in step B as an encoder, combining a global average pooling layer and the plurality of feature semantic alignment modules designed in step C into a decoder, and building a semantic segmentation network model based on a symmetric encoder-decoder network structure;
E. aggregating the output features of the last stage of the network structure obtained in step D with the features of the first stage of the encoder, and feeding the aggregated features into a semantic segmentation result generation module to form a prediction result;
F. training the parameters of the semantic segmentation network obtained in step E using the semantic segmentation training set;
G. during training, selectively feeding the output features of some feature alignment modules into mutually independent semantic segmentation result generation modules to generate different prediction results, and using these prediction results to jointly update the network parameters, so as to explicitly address the multi-scale problem of objects in street view images;
H. inputting the test set into the trained network to obtain the semantic segmentation result of the corresponding street view image.
In step a: the street view image semantic segmentation data set can adopt a public data set cityscaps, the data set contains 25000 street view images, and the street view images are divided into a fine labeling subset (5000 pieces) and a rough labeling subset (20000 pieces) according to the fine degree of semantic labeling; the fine labeling subset is further divided into a training set (containing 2975 images), a verification set (containing 500 images) and a test set (containing 1525 images); each image has a size of 1024 × 2048 resolution and each pixel point is labeled as 19 predefined categories including one of road (road), sidewalk (sidewalk), building (building), wall (wall), fence (funce), pillar (pole), traffic light (traffic light), traffic sign (traffic sign), vegetation (vegetation), terrain (terra), sky (sky), person (person), rider (rider), car (car), truck (truck), bus (bus), train (train), motorcycle (motorcycler), and bicycle (bicycle).
In step B, the basic network of the semantic segmentation network model is constructed in a manner including the following two substeps:
B1. the lightweight image classification network ResNet-18 is adopted as the basis; ResNet-18 is the most lightweight version of the ResNet family. Since semantic segmentation is a pixel-level classification task, the ResNet-18 network cannot be used directly, so all network layers after the last basic residual block of ResNet-18 are removed to obtain the basic network of a preliminary semantic segmentation network model; the basic network contains 8 basic residual blocks in total, and taking 2 consecutive basic residual blocks as a group divides the network into four stages: Res-1, Res-2, Res-3 and Res-4;
B2. an efficient spatial-channel attention module is embedded between the two residual blocks of each of Res-2, Res-3 and Res-4, so as to improve the feature representation capability of the basic network of the semantic segmentation network model and reduce the information loss caused by the downsampling operation; the basic network part of the semantic segmentation network model is thereby obtained; the efficient spatial-channel attention module contains two branch paths: the spatial branch contains a 1 × 1 standard convolution and a Sigmoid activation function, and the channel branch contains a global average pooling operation, a 1D convolution and a Sigmoid activation function.
In step C, the feature semantic alignment modules have different network structures; each semantic alignment module takes two input features and produces one output feature. The two input features have different sizes: the small-size input feature comes from the previous module connected to this module, and the large-size input feature comes from the corresponding stage of the basic network obtained in step B; to increase the speed of the network, features from the basic network are first passed through an additional CBR module to reduce their number of channels; the CBR module contains a 3 × 3 standard convolution operation, a batch normalization operation and a ReLU activation function;
then, the large-size input feature passes through a differently designed feature enhancement module and an efficient spatial-channel attention module, so that the feature representation capability is enhanced according to the characteristics of the feature itself; the feature enhancement module (FEB) passes the input feature through a series of convolution layers and normalization operations to enhance its semantic information or spatial detail information, after which the enhanced feature is aggregated with the input feature and passed through a ReLU activation function; for features from Res-4, the convolution layers in the feature enhancement module (FEB-4) are several depth-wise separable convolutions with different dilation rates, to enhance semantic information; for features from Res-2, the feature enhancement module (FEB-2) adopts standard convolutions, to improve the capture of spatial detail information in the feature; for features from Res-3, the feature enhancement module (FEB-3) uses depth-wise separable convolutions without dilation, to balance enhancing the feature representation against controlling the module's computational complexity; for features from Res-1, no feature enhancement module is used in the corresponding alignment module because the feature size is too large;
meanwhile, the small-size input feature passes through a CBR module and an upsampling operation to obtain the same size and number of channels as the processed large-size input feature; the two processed input features are then spliced together and fed into a 3 × 3 standard convolution operation to learn a semantic offset field between the two features; a semantic alignment operation is performed on the processed small-size input feature using the learned semantic offset field, and the processed large-size input feature and the semantically aligned small-size input feature are aggregated and fed into another efficient spatial-channel attention module to generate the output feature of the feature semantic alignment module.
In step D, the specific construction of the semantic segmentation network model is as follows: the basic network obtained in step B is used as the encoder, so that four features are obtained from the Res-1, Res-2, Res-3 and Res-4 stages of the encoder; feature semantic alignment module-1 to feature semantic alignment module-4, designed in step C according to the characteristics of the features from Res-1 to Res-4, are then obtained; finally, a global average pooling layer, feature semantic alignment module-4, feature semantic alignment module-3, feature semantic alignment module-2 and feature semantic alignment module-1 are added in sequence at the end of the basic network; the newly added modules form the decoder of the semantic segmentation network model, yielding a symmetric encoder-decoder network structure; corresponding branch paths are established between Res-1 to Res-4 and feature semantic alignment module-1 to feature semantic alignment module-4, transmitting the output features of the corresponding stages of the basic network for subsequent use by the corresponding feature semantic alignment modules.
In step E, the aggregation is performed as follows: the final output of the semantic segmentation network model obtained in step D is channel-spliced with the output features obtained from Res-1, and the spliced features are fed into a semantic segmentation result generation module comprising a CBR operation, a 3 × 3 standard convolution and an upsampling operation; the CBR operation reduces the number of channels to 64, the 3 × 3 standard convolution reduces the 64 channels to the number of categories of the semantic segmentation data set (19), and the upsampling operation restores the feature, whose number of channels now equals the number of categories, to the same size as the original input image, thereby obtaining the final semantic segmentation result.
In step F, the training applies data enhancement to the original data set by three methods: random flipping, random scaling (scale factor 0.5-2.0) and random cropping (to 768 × 1536); the initial learning rate of the network is set to 0.005, the weight decay parameter to 0.0005 and the momentum factor to 0.9, and stochastic gradient descent (SGD) is adopted as the network optimizer; the learning rate schedule adopts the poly strategy, updating the learning rate of the network with a polynomial power of 0.9; the entire network is trained for 120000 iterations, with 12 samples per iteration.
In step G, the specific method of selectively feeding the output features of some feature alignment modules into mutually independent semantic segmentation result generation modules to generate different prediction results, and using these prediction results to jointly update the network parameters, may be: from the outputs of feature semantic alignment module-1 to feature semantic alignment module-4 of the semantic segmentation network model obtained in step D, selectively feed outputs into semantic segmentation result generation modules identical to that in step E; the outputs of feature semantic alignment module-3 and feature semantic alignment module-4 are selected, and a semantic segmentation result generation module is used for each to obtain an auxiliary semantic segmentation result; the whole network thus produces three final outputs, each of which is compared with the annotated images provided by the data set to obtain a corresponding cross-entropy loss; finally, the three cross-entropy losses are added, and the network parameters are updated with the back-propagation algorithm in cooperation with step F.
In step H, inputting the test set into the trained network means that the test images are fed directly into the network at their original size, without any tricks, to obtain semantic segmentation results of the corresponding size.
The invention first constructs an encoder from the lightweight image classification network ResNet-18 and an efficient spatial-channel attention module, and constructs a decoder from a plurality of differently designed feature semantic alignment modules and a global average pooling layer. The encoder and decoder are then assembled into a semantic segmentation network model based on an encoder-decoder network structure. Finally, features from the encoder are aggregated with the output features of the decoder and fed into a semantic segmentation result generation module to obtain the final semantic segmentation result.
Compared with the prior art, the invention has the following outstanding advantages:
The present invention can efficiently generate corresponding segmentation results at a real-time rate while keeping the input image at high resolution (1024 × 2048), without reducing the image resolution. Meanwhile, compared with existing real-time semantic segmentation methods, the invention achieves superior segmentation accuracy and a better balance between speed and accuracy.
Drawings
Fig. 1 is a flowchart of the entire implementation of the embodiment of the present invention.
Fig. 2 is a diagram of the entire network structure according to the embodiment of the present invention. In the figure, 'C' denotes the channel splicing operation and 'UP' denotes the upsampling operation.
Fig. 3(a) is a network structure diagram of the feature semantic alignment module according to the embodiment of the present invention; Fig. 3(b) shows the feature enhancement modules with different designs. The '+' in the figure denotes element-wise addition.
Fig. 4 is a network structure diagram of the efficient spatial-channel attention module according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the following embodiments; the present application is not limited to these embodiments.
Referring to fig. 1, an implementation of an embodiment of the invention includes the steps of:
A. Preparing the street view image semantic segmentation training set, verification set and test set.
The data set used in the invention is Cityscapes, a large-scale street view image data set collected from fifty different cities in Germany. The data set contains 25000 street view images and is divided, according to the fineness of the semantic annotation, into a finely annotated subset (5000 images with fine semantic labels) and a coarsely annotated subset (20000 images with coarse semantic labels). Each image has a resolution of 1024 × 2048, and each pixel is labeled as one of 19 predefined categories (road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bicycle). In the present invention, only the finely annotated subset is used. Following the data set provider's split, the finely annotated subset is further divided into three parts: a training set (2975 images), a validation set (500 images) and a test set (1525 images).
B. Constructing a basic network of the semantic segmentation network model by combining a specially designed efficient spatial-channel attention module with a lightweight image classification network structure.
The construction of the basic network of the semantic segmentation network model mainly comprises the following two sub-steps:
Step B1. The lightweight image classification network ResNet-18 is adopted as the basis; ResNet-18 is the most lightweight version of the ResNet family and, compared with the other ResNet networks, is faster and has fewer model parameters. In addition, because semantic segmentation is a pixel-level classification task, the ResNet-18 network cannot be used directly; the invention therefore removes all network layers after the last basic residual block of ResNet-18 to obtain the basic network of a preliminary semantic segmentation network model. The basic network contains 8 basic residual blocks in total, and taking 2 consecutive basic residual blocks as a group divides the network into four stages: Res-1, Res-2, Res-3 and Res-4.
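As an illustration, a minimal PyTorch sketch of this truncation is given below; it is not the patent's own code. It assumes torchvision's ResNet-18 implementation, in which layer1 to layer4 correspond to the four stages Res-1 to Res-4, and the class name ResNet18Backbone is ours.

```python
import torch
import torchvision

class ResNet18Backbone(torch.nn.Module):
    """Truncated ResNet-18: all layers after the last basic residual
    block (global average pooling, fully connected layer) are removed."""
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet18()
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.res1 = net.layer1  # 2 basic residual blocks, 1/4 resolution, 64 ch
        self.res2 = net.layer2  # 2 basic residual blocks, 1/8 resolution, 128 ch
        self.res3 = net.layer3  # 2 basic residual blocks, 1/16 resolution, 256 ch
        self.res4 = net.layer4  # 2 basic residual blocks, 1/32 resolution, 512 ch
        # net.avgpool and net.fc are deliberately not used.

    def forward(self, x):
        x = self.stem(x)
        f1 = self.res1(x)
        f2 = self.res2(f1)
        f3 = self.res3(f2)
        f4 = self.res4(f3)
        return f1, f2, f3, f4  # the Res-1..Res-4 stage features
```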
Step B2. The basic network obtained in step B1 still contains operations detrimental to the semantic segmentation task, chiefly the downsampling operation of the first residual block in each of Res-2, Res-3 and Res-4. Although this operation helps extract high-level semantic information, it also causes the loss of spatial detail information that is equally important for semantic segmentation. Therefore, in order to reduce the information loss caused by the downsampling operation, a specially designed efficient spatial-channel attention module is embedded between the two residual blocks of each of Res-2, Res-3 and Res-4, so as to improve the feature representation capability of the basic network of the semantic segmentation network model. The basic network part of the semantic segmentation network model used in the invention is thereby obtained. Referring to fig. 4, the efficient spatial-channel attention module contains two branch paths: the spatial branch contains a 1 × 1 standard convolution and a Sigmoid activation function, and the channel branch contains a global average pooling operation, a 1D convolution and a Sigmoid activation function.
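A hedged PyTorch sketch of such a module follows. The branch structure matches the description above, but the 1D kernel size k and the fusion of the two attention maps with the input by element-wise multiplication are assumptions, since Fig. 4 itself is not reproduced here.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Efficient spatial-channel attention (sketch). The kernel size k of
    the channel branch's 1D convolution and the multiplicative fusion of
    the two attention maps are assumptions."""
    def __init__(self, channels, k=3):
        super().__init__()
        # Spatial branch: 1x1 standard convolution + Sigmoid -> (b, 1, h, w) map.
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1),
                                     nn.Sigmoid())
        # Channel branch: global average pooling + 1D convolution + Sigmoid.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.spatial(x)                                 # spatial attention map
        y = self.pool(x).view(b, 1, c)                      # (b, 1, c)
        ch = self.sigmoid(self.conv1d(y)).view(b, c, 1, 1)  # channel attention map
        return x * s * ch
```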
C. Designing feature semantic alignment modules with different network structures according to the characteristics of the features of different stages in the basic network obtained in step B.
Referring to fig. 3, feature semantic alignment modules with different network structures are provided. Designing alignment modules of different structures according to the characteristics of the input features of each module's specific operations effectively solves the problem of misalignment among features of different levels and enhances the feature representation capability. Each semantic alignment module takes two input features and produces one output feature. The two input features are of different sizes: the small-size one comes from the previous module connected to this module, while the large-size one comes from the corresponding stage of the basic network obtained in step B. To increase the speed of the network, features from the basic network are first passed through an additional CBR module (containing a 3 × 3 standard convolution operation, a batch normalization operation and a ReLU activation function) to reduce their number of channels.
Then, the large-size input feature is passed through a differently designed feature enhancement module and an efficient spatial-channel attention module, so that the feature representation capability is enhanced according to the characteristics of the feature itself. The feature enhancement module (FEB) passes the input feature through a series of convolution layers and normalization operations to enhance its semantic information or spatial detail information; the enhanced feature is then aggregated with the input feature and passed through a ReLU activation function. For features from Res-4, the convolution layers in the feature enhancement module (FEB-4) are several depth-wise separable convolutions with different dilation rates, to enhance semantic information. For features from Res-2, the feature enhancement module (FEB-2) adopts standard convolutions to improve the capture of spatial detail information in the feature. For features from Res-3, the feature enhancement module (FEB-3) uses depth-wise separable convolutions without dilation, balancing enhancement of the feature representation against the module's computational complexity. For features from Res-1, no feature enhancement module is used in the corresponding alignment module because the feature size is too large.
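As a concrete illustration of one FEB variant, the sketch below implements an FEB-4-style block in PyTorch: depth-wise separable convolutions with different dilation rates, followed by the residual addition (the '+' in Fig. 3) and a ReLU. The layer count and the dilation rates (2 and 4) are assumptions, since the text does not give them numerically.

```python
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    """3x3 depth-wise convolution (optionally dilated) + 1x1 point-wise
    convolution + batch normalization."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=dilation,
                            dilation=dilation, groups=channels, bias=False)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return self.bn(self.pw(self.dw(x)))

class FEB4(nn.Module):
    """FEB-4-style feature enhancement block (sketch): dilated depth-wise
    separable convolutions, residual addition, then ReLU. With
    dilations=(1,) it doubles as a stand-in for the undilated FEB-3."""
    def __init__(self, channels, dilations=(2, 4)):
        super().__init__()
        self.branch = nn.Sequential(
            *[DepthwiseSeparable(channels, d) for d in dilations])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.branch(x))
```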
Meanwhile, the small-size input feature passes through a CBR module and an upsampling operation to obtain the same size and number of channels as the processed large-size input feature. The two processed input features are then spliced together and fed into a 3 × 3 standard convolution operation to learn the semantic offset field between the two features. A semantic alignment operation is performed on the processed small-size input feature using the learned semantic offset field. Finally, the processed large-size input feature and the semantically aligned small-size input feature are aggregated and fed into another efficient spatial-channel attention module to generate the output feature of the feature semantic alignment module.
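The following sketch ties the pieces above together into one feature semantic alignment module. The 2-channel offset field and the warping by grid sampling follow the description; treating "aggregation" as element-wise addition, and the exact placement of the CBR blocks, are assumptions. SpatialChannelAttention and the FEB classes are the sketches given earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(in_ch, out_ch, k=3):
    # CBR module: standard convolution + batch normalization + ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def warp(feat, offset):
    # Semantic alignment: sample `feat` at positions shifted by `offset`
    # (a 2-channel field of per-pixel x/y displacements).
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float()            # (h, w, 2)
    grid = base.unsqueeze(0) + offset.permute(0, 2, 3, 1)   # (b, h, w, 2)
    # Normalize absolute coordinates to grid_sample's [-1, 1] range.
    norm = torch.tensor([max(w - 1, 1), max(h - 1, 1)],
                        dtype=grid.dtype, device=grid.device)
    grid = 2.0 * grid / norm - 1.0
    return F.grid_sample(feat, grid, align_corners=True)

class FeatureAlignModule(nn.Module):
    """Feature semantic alignment module (sketch). `feb` is the
    stage-specific enhancement block (nn.Identity() for Res-1)."""
    def __init__(self, encoder_ch, ch, feb):
        super().__init__()
        self.reduce = cbr(encoder_ch, ch)     # CBR channel reduction
        self.feb = feb                        # feature enhancement block
        self.att_in = SpatialChannelAttention(ch)
        self.up = cbr(ch, ch)                 # CBR before upsampling
        self.offset = nn.Conv2d(2 * ch, 2, 3, padding=1)  # semantic offset field
        self.att_out = SpatialChannelAttention(ch)

    def forward(self, small, large):
        large = self.att_in(self.feb(self.reduce(large)))
        small = F.interpolate(self.up(small), size=large.shape[2:],
                              mode="bilinear", align_corners=True)
        off = self.offset(torch.cat([small, large], dim=1))
        return self.att_out(large + warp(small, off))
```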
D. Taking the basic network obtained in step B as the encoder, combining a global average pooling layer and the plurality of feature semantic alignment modules designed in step C into a decoder, and building a semantic segmentation network model based on a symmetric encoder-decoder network structure.
Referring to fig. 2, the semantic segmentation network model is constructed as follows: the basic network obtained in step B is used as the encoder, so that four features can be obtained from the four stages Res-1, Res-2, Res-3 and Res-4 of the encoder. Feature semantic alignment module-1 to feature semantic alignment module-4, designed in step C according to the characteristics of the features from Res-1 to Res-4, are then obtained. Finally, a global average pooling layer, feature semantic alignment module-4, feature semantic alignment module-3, feature semantic alignment module-2 and feature semantic alignment module-1 are added in sequence to the basic network; the newly added modules form the decoder of the semantic segmentation network model, yielding a symmetric encoder-decoder network structure. In addition, corresponding branch paths are established between Res-1 to Res-4 and feature semantic alignment module-1 to feature semantic alignment module-4, transmitting the output features of the corresponding stages of the basic network for subsequent use by the corresponding feature semantic alignment modules.
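Assembling the sketches above gives the following structural outline of the full model. The decoder channel width ch and the use of the global-average-pooled feature as the first "small-size" input of alignment module-4 are assumptions, and the standard-convolution FEB-2 is approximated here by an undilated depth-wise separable block for brevity.

```python
import torch
import torch.nn as nn

class StagedAlignmentNet(nn.Module):
    """Symmetric encoder-decoder assembly (sketch). Reuses the
    ResNet18Backbone, FEB4 and FeatureAlignModule sketches above."""
    def __init__(self, ch=128):
        super().__init__()
        self.encoder = ResNet18Backbone()
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling layer
        self.gap_proj = nn.Conv2d(512, ch, 1)
        self.fam4 = FeatureAlignModule(512, ch, FEB4(ch, (2, 4)))  # dilated
        self.fam3 = FeatureAlignModule(256, ch, FEB4(ch, (1,)))    # undilated
        self.fam2 = FeatureAlignModule(128, ch, FEB4(ch, (1,)))    # FEB-2 stand-in
        self.fam1 = FeatureAlignModule(64, ch, nn.Identity())      # no FEB

    def forward(self, x):
        f1, f2, f3, f4 = self.encoder(x)    # branch paths from Res-1..Res-4
        g = self.gap_proj(self.gap(f4))     # seeds the decoder
        d4 = self.fam4(g, f4)
        d3 = self.fam3(d4, f3)
        d2 = self.fam2(d3, f2)
        d1 = self.fam1(d2, f1)
        return d1, d3, d4   # main output; d3/d4 feed auxiliary heads (step G)
```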
E. Aggregating the output features of the last stage of the network structure obtained in step D with the features of the first stage of the encoder, and feeding the aggregated features into a semantic segmentation result generation module to form a prediction result.
The aggregation operation in the invention is carried out as follows: the final output of the semantic segmentation network model obtained in step D is channel-spliced with the output features obtained from Res-1. The spliced features are fed into a semantic segmentation result generation module comprising a CBR operation, a 3 × 3 standard convolution and an upsampling operation. The CBR operation reduces the number of channels to 64, the 3 × 3 standard convolution then reduces the 64 channels to the number of categories of the semantic segmentation data set (19), and finally the upsampling operation restores the result to the same size as the original input image, yielding the final semantic segmentation result.
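A sketch of this result generation module follows; the bilinear upsampling mode is an assumption (the text only specifies an upsampling operation), and the channel widths follow the description (reduce to 64, then to the 19 classes). For the sketches above, in_ch would be ch + 64 (decoder output plus the Res-1 feature).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Semantic segmentation result generation module (sketch):
    CBR -> 3x3 convolution to `num_classes` channels -> upsampling."""
    def __init__(self, in_ch, num_classes=19):
        super().__init__()
        self.cbr = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.cls = nn.Conv2d(64, num_classes, 3, padding=1)

    def forward(self, decoder_out, res1_feat, out_size):
        x = torch.cat([decoder_out, res1_feat], dim=1)  # channel splicing
        x = self.cls(self.cbr(x))
        # Restore to the original input image size.
        return F.interpolate(x, size=out_size, mode="bilinear",
                             align_corners=True)
```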
F. Training the parameters of the semantic segmentation network obtained in step E using the semantic segmentation training set.
During training, data enhancement is applied to the original data set by three methods: random flipping, random scaling (scale factor 0.5-2.0) and random cropping (to 768 × 1536). The initial learning rate of the network is set to 0.005, the weight decay parameter to 0.0005 and the momentum factor to 0.9, and stochastic gradient descent (SGD) is employed as the network optimizer. For the learning rate schedule, the popular "poly" strategy is adopted, updating the learning rate of the network with a polynomial power of 0.9. The entire network is trained for 120000 iterations, with 12 samples per iteration.
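The optimizer and "poly" schedule described above can be expressed compactly; the sketch below uses PyTorch's LambdaLR to implement lr = lr0 × (1 − iter/max_iter)^0.9, with model standing in for the network assembled earlier.

```python
import torch

model = StagedAlignmentNet()  # the sketch assembled above
max_iter, power = 120000, 0.9

# SGD with the hyper-parameters stated in the text: initial learning rate
# 0.005, momentum 0.9, weight decay 0.0005.
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

# "poly" schedule: scale the base learning rate by (1 - it/max_iter)**0.9.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iter) ** power)

# Training loop outline (batches of 12 samples per iteration):
# for it, (images, labels) in enumerate(loader):
#     loss = ...; loss.backward()
#     optimizer.step(); optimizer.zero_grad(); scheduler.step()
```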
G. During training, selectively feeding the output features of some feature alignment modules into mutually independent semantic segmentation result generation modules to generate different prediction results, and using these prediction results to jointly update the network parameters, so as to explicitly address the multi-scale problem of objects in street view images.
From the outputs of feature semantic alignment module-1 to feature semantic alignment module-4 of the semantic segmentation network model obtained in step D, outputs are selectively fed into semantic segmentation result generation modules identical to that in step E. In the invention, the outputs of feature semantic alignment module-3 and feature semantic alignment module-4 are selected, and a semantic segmentation result generation module is used for each to obtain an auxiliary semantic segmentation result. The whole network thus produces three final outputs, each of which is compared with the annotated images provided by the data set to obtain a corresponding cross-entropy loss. Finally, the three cross-entropy losses are added, and the network parameters are updated with the back-propagation algorithm in cooperation with step F.
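A sketch of the joint loss: three independent heads (the main head of step E plus auxiliary heads on alignment modules 3 and 4) each yield a cross-entropy term, and the summed result is back-propagated. Equal weighting of the three terms and the ignore index 255 (the usual Cityscapes convention for unlabeled pixels) are assumptions.

```python
import torch.nn as nn

# ignore_index=255 follows the common Cityscapes convention for unlabeled
# pixels; the patent text does not state it explicitly.
criterion = nn.CrossEntropyLoss(ignore_index=255)

def total_loss(main_logits, aux3_logits, aux4_logits, labels):
    # Sum of the three cross-entropy results, as described in step G.
    return (criterion(main_logits, labels)
            + criterion(aux3_logits, labels)
            + criterion(aux4_logits, labels))
```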
H. Inputting the test set into the trained network to obtain the semantic segmentation result of the corresponding street view image.
The images used for testing are fed directly into the network at their original size, without any tricks, to obtain semantic segmentation results of the corresponding size.
Table 1 compares the performance and speed of the present invention with some other semantic segmentation methods on the Cityscapes test set. As can be seen from Table 1, the present invention achieves a segmentation accuracy of 78.0% mIoU and a prediction speed of 37 fps while using high-resolution (1024 × 2048) input images. Compared with most methods, the invention offers better accuracy and prediction speed. In particular, among methods that meet the real-time requirement (greater than 30 fps), the present invention achieves the best segmentation accuracy. Even compared with PSPNet, which pursues accuracy, the invention maintains similar segmentation accuracy at an inference speed 47 times faster. Therefore, the invention maintains excellent segmentation accuracy while maintaining real-time network prediction speed.
TABLE 1
[Table 1 is provided as an image in the original patent publication; its contents are not available as text.]
DeepLab corresponds to the method proposed by L. C. Chen et al. (L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," in Proc. Int. Conf. Learn. Represent. (ICLR), May 2015.);
PSPNet corresponds to the method proposed by H. Zhao et al. (H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2881-2890.);
SegNet corresponds to the method proposed by V. Badrinarayanan et al. (V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481-2495, Dec. 2017.);
ENet corresponds to the method proposed by A. Paszke et al. (A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A deep neural network architecture for real-time semantic segmentation," Jun. 2016, arXiv:1606.02147. [Online]. Available: https://arxiv.org/abs/1606.02147);
ESPNet corresponds to the method proposed by S. Mehta et al. (S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, "ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 552-568.);
ERFNet corresponds to the method proposed by E. Romera et al. (E. Romera, J. M. Álvarez, L. M. Bergasa, and R. Arroyo, "ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 263-272, Jan. 2018.);
ICNet corresponds to the method proposed by H. Zhao et al. (H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "ICNet for real-time semantic segmentation on high-resolution images," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 405-420.);
DABNet corresponds to the method proposed by G. Li et al. (G. Li, I. Yun, J. Kim, and J. Kim, "DABNet: Depth-wise asymmetric bottleneck for real-time semantic segmentation," in Proc. Brit. Mach. Vis. Conf. (BMVC), Sep. 2019, pp. 1-12.);
GUN corresponds to the method proposed by D. Mazzini (D. Mazzini, "Guided upsampling network for real-time semantic segmentation," in Proc. Brit. Mach. Vis. Conf. (BMVC), Sep. 2018, p. 117.);
EDANet corresponds to the method proposed by S. Y. Lo et al. (S. Y. Lo, H. M. Hang, S. W. Chan, and J. J. Lin, "Efficient dense modules of asymmetric convolution for real-time semantic segmentation," in Proc. ACM Multimedia Asia (MMAsia), Dec. 2019, pp. 1-6.);
LEDNet corresponds to the method proposed by Y. Wang et al. (Y. Wang et al., "LEDNet: A lightweight encoder-decoder network for real-time semantic segmentation," in Proc. IEEE Int. Conf. Image Process. (ICIP), Aug. 2019, pp. 1860-1864.);
DFANet corresponds to the method proposed by H. Li et al. (H. Li, P. Xiong, H. Fan, and J. Sun, "DFANet: Deep feature aggregation for real-time semantic segmentation," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9522-9531.);
DF1-Seg and DF2-Seg correspond to the methods proposed by X. Li et al. (X. Li, Y. Zhou, Z. Pan, and J. Feng, "Partial order pruning: For best speed/accuracy trade-off in neural architecture search," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9145-9153.);
LRNNet corresponds to the method proposed by W. Jiang et al. (W. Jiang, Z. Xie, Y. Li, C. Liu, and H. Lu, "LRNNet: A light-weighted network with efficient reduced non-local operation for real-time semantic segmentation," in Proc. IEEE Int. Conf. Multimedia and Expo Workshops (ICMEW), Jul. 2020, pp. 1-6.);
RTHP corresponds to the method proposed by G. Dong et al. (G. Dong, Y. Yan, C. Shen, and H. Wang, "Real-time high-performance semantic image segmentation of urban street scenes," IEEE Trans. Intell. Transp. Syst., pp. 1-17, Jan. 2020.);
SwiftNet and SwiftNet-ens correspond to the methods proposed by M. Oršić et al. (M. Oršić, I. Krešo, P. Bevandić, and S. Šegvić, "In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12607-12616.);
SFNet (DF2) and SFNet (ResNet-18) correspond to the methods proposed by X. Li et al. (X. Li et al., "Semantic flow for fast and accurate scene parsing," in Proc. Eur. Conf. Comput. Vis. (ECCV), Nov. 2020, pp. 775-793.);
BiSeNet corresponds to the method proposed by C. Yu et al. (C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral segmentation network for real-time semantic segmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 325-341.);
BiSeNetV2 corresponds to the method proposed by C. Yu et al. (C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, "BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation," Apr. 2020, arXiv:2004.02147. [Online]. Available: https://arxiv.org/abs/2004.02147).

Claims (9)

1. A real-time street view image semantic segmentation method based on staged feature semantic alignment, characterized by comprising the following steps:
A. dividing a street view image semantic segmentation data set into a training set, a verification set and a test set;
B. constructing a basic network of a semantic segmentation network model by combining a specially designed efficient spatial-channel attention module with a lightweight image classification network structure;
C. designing feature semantic alignment modules with different network structures according to the characteristics of the features of different stages in the basic network obtained in step B;
D. taking the basic network obtained in step B as an encoder, combining a global average pooling layer and the plurality of feature semantic alignment modules designed in step C into a decoder, and building a semantic segmentation network model based on a symmetric encoder-decoder network structure;
E. aggregating the output features of the last stage of the network structure obtained in step D with the features of the first stage of the encoder, and feeding the aggregated features into a semantic segmentation result generation module to form a prediction result;
F. training the parameters of the semantic segmentation network obtained in step E using the semantic segmentation training set;
G. during training, selectively feeding the output features of some feature alignment modules into mutually independent semantic segmentation result generation modules to generate different prediction results, and using these prediction results to jointly update the network parameters, so as to explicitly address the multi-scale problem of objects in street view images;
H. inputting the test set into the trained network to obtain the semantic segmentation result of the corresponding street view image.
2. The method for real-time street view image semantic segmentation based on staged feature semantic alignment as claimed in claim 1, wherein in step A: the street view image semantic segmentation data set adopts the public data set Cityscapes, which contains 25000 street view images divided, according to the fineness of the semantic annotation, into a finely annotated subset of 5000 images and a coarsely annotated subset of 20000 images; the 5000 finely annotated images are further divided into a training set of 2975 images, a verification set of 500 images and a test set of 1525 images; each image has a resolution of 1024 × 2048, and each pixel is labeled as one of 19 predefined categories: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bicycle.
3. The method as claimed in claim 1, wherein in step B, the basic network of the semantic segmentation network model is constructed in a manner including the following two sub-steps:
B1. the lightweight image classification network ResNet-18, the most lightweight version of the ResNet family, is adopted as the basis, and all network layers after the last basic residual block of ResNet-18 are removed to obtain the basic network of a preliminary semantic segmentation network model; the basic network contains 8 basic residual blocks in total, and taking 2 consecutive basic residual blocks as a group divides the network into four stages: Res-1, Res-2, Res-3 and Res-4;
B2. an efficient spatial-channel attention module is embedded between the two residual blocks of each of Res-2, Res-3 and Res-4, so as to improve the feature representation capability of the basic network of the semantic segmentation network model and reduce the information loss caused by the downsampling operation; the basic network part of the semantic segmentation network model is thereby obtained; the efficient spatial-channel attention module contains two branch paths: the spatial branch contains a 1 × 1 standard convolution and a Sigmoid activation function, and the channel branch contains a global average pooling operation, a 1D convolution and a Sigmoid activation function.
4. The method as claimed in claim 1, wherein in step C, the feature semantic alignment modules have different network structures, and each semantic alignment module takes two input features and produces one output feature; the two input features have different sizes: the small-size input feature comes from the previous module connected to this module, and the large-size input feature comes from the corresponding stage of the basic network obtained in step B; to increase the speed of the network, features from the basic network are first passed through an additional CBR module to reduce their number of channels; the CBR module contains a 3 × 3 standard convolution operation, a batch normalization operation and a ReLU activation function;
then, the large-size input feature passes through a differently designed feature enhancement module and an efficient spatial-channel attention module, so that the feature representation capability is enhanced according to the characteristics of the feature itself; the feature enhancement module FEB passes the input feature through a series of convolution layers and normalization operations to enhance its semantic information or spatial detail information, after which the enhanced feature is aggregated with the input feature and passed through a ReLU activation function; for features from Res-4, the convolution layers in feature enhancement module FEB-4 are several depth-wise separable convolutions with different dilation rates, to enhance semantic information; for features from Res-2, feature enhancement module FEB-2 adopts standard convolutions to improve the capture of spatial detail information in the feature; for features from Res-3, feature enhancement module FEB-3 uses depth-wise separable convolutions without dilation, balancing enhancement of the feature representation against the module's computational complexity; for features from Res-1, no feature enhancement module is used in the corresponding alignment module because the feature size is too large;
meanwhile, the small-size input feature passes through a CBR module and an upsampling operation to obtain the same size and number of channels as the processed large-size input feature; the two processed input features are then spliced together and fed into a 3 × 3 standard convolution operation to learn a semantic offset field between the two features; a semantic alignment operation is performed on the processed small-size input feature using the learned semantic offset field, and the processed large-size input feature and the semantically aligned small-size input feature are aggregated and fed into another efficient spatial-channel attention module to generate the output feature of the feature semantic alignment module.
5. The real-time street view image semantic segmentation method based on staged feature semantic alignment as claimed in claim 1, wherein in step D, the semantic segmentation network model is specifically constructed as follows: the basic network obtained in step B is used as the encoder, so that four features are obtained from the Res-1, Res-2, Res-3 and Res-4 stages of the encoder; feature semantic alignment module-1 to feature semantic alignment module-4, designed in step C according to the characteristics of the features from Res-1 to Res-4, are then obtained; finally, a global average pooling layer, feature semantic alignment module-4, feature semantic alignment module-3, feature semantic alignment module-2 and feature semantic alignment module-1 are added in sequence at the end of the basic network; the newly added modules form the decoder of the semantic segmentation network model, yielding a symmetric encoder-decoder network structure; corresponding branch paths are established between Res-1 to Res-4 and feature semantic alignment module-1 to feature semantic alignment module-4, transmitting the output features of the corresponding stages of the basic network for subsequent use by the corresponding feature semantic alignment modules.
6. The method as claimed in claim 1, wherein in step E, the aggregation is performed as follows: the final output of the semantic segmentation network model obtained in step D is channel-spliced with the output features obtained from Res-1, and the spliced features are fed into a semantic segmentation result generation module comprising a CBR operation, a 3 × 3 standard convolution and an upsampling operation; the CBR operation reduces the number of channels to 64, the 3 × 3 standard convolution reduces the 64 channels to the number of categories of the semantic segmentation data set, and the upsampling operation restores the feature, whose number of channels now equals the number of categories, to the same size as the original input image, thereby obtaining the final semantic segmentation result.
7. The method for real-time street view image semantic segmentation based on staged feature semantic alignment as claimed in claim 1, wherein in step F, the training applies data enhancement to the original data set by three methods: random flipping, random scaling and random cropping; the scale factor of the random scaling is 0.5-2.0, and the random cropping size is 768 × 1536; the initial learning rate of the network is set to 0.005, the weight decay parameter to 0.0005 and the momentum factor to 0.9, and stochastic gradient descent is adopted as the network optimizer; the learning rate schedule adopts the poly strategy, updating the learning rate of the network with a polynomial power of 0.9; the entire network is trained for 120000 iterations, with 12 samples per iteration.
8. The method as claimed in claim 1, wherein in step G, the output features of some feature alignment modules are selectively fed into mutually independent semantic segmentation result generation modules to generate different prediction results, and the specific method of using these prediction results to jointly update the network parameters may be: from the outputs of feature semantic alignment module-1 to feature semantic alignment module-4 of the semantic segmentation network model obtained in step D, selectively feed outputs into semantic segmentation result generation modules identical to that in step E; the outputs of feature semantic alignment module-3 and feature semantic alignment module-4 are selected, and a semantic segmentation result generation module is used for each to obtain an auxiliary semantic segmentation result; the whole network thus produces three final outputs, each of which is compared with the annotated images provided by the data set to obtain a corresponding cross-entropy loss; finally, the three cross-entropy losses are added, and the network parameters are updated with the back-propagation algorithm in cooperation with step F.
9. The method according to claim 1, wherein in step H, inputting the test set into the trained network means that the test images are fed directly into the network at their original size, without any tricks, to obtain semantic segmentation results of the corresponding size.
CN202110295657.5A 2021-03-19 2021-03-19 Real-time street view image semantic segmentation method based on staged feature semantic alignment Active CN113011429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110295657.5A CN113011429B (en) 2021-03-19 2021-03-19 Real-time street view image semantic segmentation method based on staged feature semantic alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110295657.5A CN113011429B (en) 2021-03-19 2021-03-19 Real-time street view image semantic segmentation method based on staged feature semantic alignment

Publications (2)

Publication Number Publication Date
CN113011429A true CN113011429A (en) 2021-06-22
CN113011429B CN113011429B (en) 2023-07-25

Family

ID=76403150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110295657.5A Active CN113011429B (en) 2021-03-19 2021-03-19 Real-time street view image semantic segmentation method based on staged feature semantic alignment

Country Status (1)

Country Link
CN (1) CN113011429B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989773A (en) * 2021-10-27 2022-01-28 智道网联科技(北京)有限公司 BiSeNet-based traffic sign identification method and device for automatic driving

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN111563909A (en) * 2020-05-10 2020-08-21 中国人民解放军91550部队 Semantic segmentation method for complex street view image
CN112101150A (en) * 2020-09-01 2020-12-18 北京航空航天大学 Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN112102303A (en) * 2020-09-22 2020-12-18 中国科学技术大学 Semantic image analogy method for generating countermeasure network based on single image
DE102019123756A1 (en) * 2019-09-05 2021-03-11 Connaught Electronics Ltd. Neural network for performing semantic segmentation of an input image

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
DE102019123756A1 (en) * 2019-09-05 2021-03-11 Connaught Electronics Ltd. Neural network for performing semantic segmentation of an input image
CN110781895A (en) * 2019-10-10 2020-02-11 湖北工业大学 Image semantic segmentation method based on convolutional neural network
CN111563909A (en) * 2020-05-10 2020-08-21 中国人民解放军91550部队 Semantic segmentation method for complex street view image
CN112101150A (en) * 2020-09-01 2020-12-18 北京航空航天大学 Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN112102303A (en) * 2020-09-22 2020-12-18 中国科学技术大学 Semantic image analogy method for generating countermeasure network based on single image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HANCHAO LI ET AL.: "DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation", 《ARXIV》, pages 1 - 10 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989773A (en) * 2021-10-27 2022-01-28 智道网联科技(北京)有限公司 BiSeNet-based traffic sign identification method and device for automatic driving
CN113989773B (en) * 2021-10-27 2024-05-31 智道网联科技(北京)有限公司 BiSeNet-based traffic sign recognition method and BiSeNet-based traffic sign recognition device for automatic driving

Also Published As

Publication number Publication date
CN113011429B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
Liu et al. Teinet: Towards an efficient architecture for video recognition
CN110188817B (en) Real-time high-performance street view image semantic segmentation method based on deep learning
Zhou et al. Contextual ensemble network for semantic segmentation
Hu et al. Joint pyramid attention network for real-time semantic segmentation of urban scenes
CN111563909B (en) Semantic segmentation method for complex street view image
Zhu et al. Towards high performance video object detection for mobiles
CN112016556B (en) Multi-type license plate recognition method
Zhang et al. Lightweight and efficient asymmetric network design for real-time semantic segmentation
CN113011336B (en) Real-time street view image semantic segmentation method based on deep multi-branch aggregation
CN111666948A (en) Real-time high-performance semantic segmentation method and device based on multi-path aggregation
Yan et al. Traffic scene semantic segmentation using self-attention mechanism and bi-directional GRU to correlate context
Deng et al. Lightweight aerial image object detection algorithm based on improved YOLOv5s
Zhou et al. Deep road scene understanding
CN115424059B (en) Remote sensing land utilization classification method based on pixel level contrast learning
Hu et al. Efficient fast semantic segmentation using continuous shuffle dilated convolutions
CN115631513B (en) Transformer-based multi-scale pedestrian re-identification method
Xiong et al. CSRNet: Cascaded Selective Resolution Network for real-time semantic segmentation
Nan et al. A joint object detection and semantic segmentation model with cross-attention and inner-attention mechanisms
Lu et al. MFNet: Multi-feature fusion network for real-time semantic segmentation in road scenes
CN113011429B (en) Real-time street view image semantic segmentation method based on staged feature semantic alignment
CN114333062A (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
Hu et al. LDPNet: A lightweight densely connected pyramid network for real-time semantic segmentation
CN116977712A (en) Knowledge distillation-based road scene segmentation method, system, equipment and medium
Hu et al. Lightweight attention‐guided redundancy‐reuse network for real‐time semantic segmentation
Wang et al. Fusion attention network for autonomous cars semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant