CN112819000A

CN112819000A - Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium

Info

Publication number: CN112819000A
Application number: CN202110208934.4A
Authority: CN
Inventors: 梁超; 王小瑀; 宋宇; 程超; 姜长泓
Original assignee: Changchun University of Technology
Current assignee: Changchun University of Technology
Priority date: 2021-02-24
Filing date: 2021-02-24
Publication date: 2021-05-18

Abstract

The invention discloses a streetscape image semantic segmentation system and segmentation method, electronic equipment and a computer readable medium. The segmentation method comprises the following steps: step 1, street view images are collected, preprocessed and subjected to data enhancement; step 2, an encoder encodes the street view image into output feature maps; step 3, a multi-level feature joint up-sampling module collects the features of three of the output feature maps and fuses them to obtain a second output feature map; step 4, the second output feature map is converted into a third output feature map; step 5, the third output feature map is input into a convolution classifier to obtain semantic segmentation feature values; step 6, end-to-end training is performed using the back propagation algorithm to obtain a streetscape image semantic segmentation model; step 7, semantic segmentation is performed on street view images using the streetscape image semantic segmentation model. The invention increases the segmentation speed of the network and enhances real-time response capability in application without reducing semantic segmentation accuracy.

Description

Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium

Technical Field

The invention belongs to the technical field of image semantic segmentation, and particularly relates to a streetscape image semantic segmentation system, a streetscape image semantic segmentation method, electronic equipment and a computer readable medium.

Background

Semantic segmentation is one of basic tasks of computer vision, and aims to allocate a semantic label to each pixel in an image so as to obtain a pixel-level segmentation result.

As the earliest fully convolutional network, the FCN was adapted from convolutional neural networks designed for image classification, and since the FCN, semantic segmentation has made great progress in recent years thanks to deep learning technology; semantic segmentation algorithms applied to unmanned driving are generally divided into two main categories: the first type is a network based on an encoder-decoder structure, such as U-Net and SegNet; when the encoder-decoder structure is used for segmentation tasks with few categories, the segmentation speed and accuracy are high, but when the number of categories increases, both the speed and the accuracy of semantic segmentation drop sharply; the second type is a network based on context information, such as PSPNet and DeepLab v3+, which improves the scene parsing capability of the network by introducing more context information, keeps the receptive field unchanged by introducing dilated (atrous) convolution, and applies atrous pyramid pooling on top of the final feature map, thereby avoiding down-sampling operations and obtaining a large amount of receptive field information; however, dilated convolution increases the computational complexity and memory occupancy of the network, so such networks have serious defects in segmentation speed.

Existing semantic segmentation networks usually have a large number of parameters and consume a large amount of running time, and they consider only segmentation accuracy rather than real-time performance; the unmanned driving field, however, not only has requirements on the accuracy of the semantic segmentation network but is also very sensitive to the real-time performance of the algorithm, so the semantic segmentation algorithm is required to have real-time processing speed and rapid interaction and response capability, and the existing networks are therefore not suitable for unmanned driving.

Disclosure of Invention

The invention aims to provide a streetscape image semantic segmentation system, which uses a multi-level feature combined up-sampling module and a pyramid pooling module to extract deep features and shallow features in streetscape images, the collected features can comprehensively represent each segmented object, so that the semantic segmentation precision is higher, and meanwhile, a low-resolution feature map is used for approximating a high-resolution feature map, so that the network operation speed is accelerated, and the response capability in application is improved.

The invention also aims to provide a street view image semantic segmentation method, which can be used for performing semantic segmentation on a street view image, greatly improving the real-time performance of the semantic segmentation under the condition of ensuring the segmentation precision, quickly performing semantic segmentation on the street view image when the street view image is used for unmanned driving, giving real-time response and improving the safety of the unmanned driving.

It is also an object of the present invention to provide an electronic device and a computer readable medium for storing and performing semantic segmentation of street view images.

The technical scheme adopted by the invention is that the streetscape image semantic segmentation system comprises:

the preprocessing module is used for carrying out scaling, random cropping, random flipping and normalization processing on the finely labeled street view image;

the encoder is used for encoding the preprocessed street view image into five output feature maps with gradually reduced size and resolution, and inputting the last three output feature maps into the multi-level feature joint up-sampling module;

the multi-level feature joint up-sampling module is used for extracting features and context information from the last three output feature maps and fusing them to obtain a second output feature map;

the pyramid pooling module is used for performing convolution processing on the second output feature map and converting it into a third output feature map with low resolution;

and the convolution classifier is used for dividing the third output feature map into different objects to realize image semantic segmentation.

The street view image semantic segmentation method comprises the following steps:

step 1, obtaining street view images with fine labels, dividing the street view images into a training set, a testing set and a verification set, and inputting the street view images into a preprocessing module for preprocessing and data enhancement;

step 2, the preprocessing module inputs the processed street view images of the training set into the encoder, the encoder performs convolution operations and max pooling operations on the input street view images to obtain five output feature maps from the Conv1 layer to the Conv5 layer, and the last three output feature maps are input into the multi-level feature joint up-sampling module;

step 3, the multi-level feature joint up-sampling module collects features and context information from each of the three output feature maps, and the collected results are fused to obtain a second output feature map;

step 4, the pyramid pooling module takes the second output feature map as input and performs convolution operations on it to convert it into a third output feature map with low resolution;

step 5, the third output feature map is input into the convolution classifier to obtain semantic segmentation feature values;

step 6, the semantic segmentation feature values are compared with the fine labels, and end-to-end training is performed using the back propagation algorithm to obtain a streetscape image semantic segmentation model;

and step 7, the street view image to be tested is preprocessed and input into the streetscape image semantic segmentation model to obtain semantic segmentation feature values, and the semantic segmentation feature values are up-sampled to obtain a semantic segmentation image.

Further, the preprocessing and data enhancement in step 1 includes: carrying out scaling, random cropping, random flipping and normalization processing on the training set images, and carrying out scaling and normalization processing on the test set images and the verification set images.

Further, the encoder in step 2 is the lightweight network FCN8s, which sequentially comprises 2 groups each consisting of two 3 × 3 convolution operations followed by a max pooling operation, and 3 groups each consisting of three 3 × 3 convolution operations followed by a max pooling operation;

the five output feature maps are as follows: the Conv1 layer output feature map is one half the size of the original image and has 64 channels; the Conv2 layer output feature map is one fourth the size of the original image and has 64 channels; the Conv3 layer output feature map is one eighth the size of the original image and has 128 channels; the Conv4 layer output feature map is one sixteenth the size of the original image and has 256 channels; the Conv5 layer output feature map is one thirty-second the size of the original image and has 512 channels.

Further, the step 3 specifically includes the following steps:

step 31, performing convolution processing on the three input feature maps to generate three first intermediate feature maps, and performing up-sampling and concatenation on the three first intermediate feature maps to obtain a first output feature map;

and step 32, processing the first output feature map with four depthwise separable convolutions with different dilation rates to obtain four second intermediate feature maps, and inputting the four second intermediate feature maps into a convolution layer, which stacks and compresses them to obtain a second output feature map, wherein the dilation rates of the depthwise separable convolutions are 1, 2, 4 and 8 respectively.

Further, the specific operation of step 4 is to perform strided convolution on the input second output feature map, delete the elements with odd indexes to obtain a third intermediate feature map, and perform ordinary convolution several times on the third intermediate feature map to obtain a third output feature map.

Further, in step 5, the convolution classifier uses a conv2d operation, the number of filters is the number of street view image segmentation objects, the convolution kernel size is 1, the convolution padding mode is 'same', and the activation function is softmax.

Further, the back propagation algorithm in step 6 uses the Adam optimizer, the loss function is sparse_categorical_crossentropy, the initial learning rate is 0.001, the learning rate strategy is an inverse time decay strategy, and the weight decay uses L2 regularization, wherein decay_steps is 74300 and decay_rate is 0.5.
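The loss named in step 6 appears to be sparse categorical cross-entropy, that is, the mean negative log-probability that the classifier assigns to the true class of each pixel, with labels given as integer class ids. The following is a minimal plain-Python sketch of that quantity, not the implementation used in the embodiment; the clipping constant `eps` and the toy values are assumptions.

```python
import math

def sparse_categorical_crossentropy(labels, probabilities):
    """Mean negative log-probability of the true class.

    labels:        list of integer class ids, one per pixel
    probabilities: list of per-pixel probability lists (each sums to 1)
    """
    eps = 1e-7  # guards against log(0); assumed value
    losses = [-math.log(max(p[y], eps)) for y, p in zip(labels, probabilities)]
    return sum(losses) / len(losses)

# Two pixels, three classes (toy values).
labels = [0, 2]
probabilities = [[0.5, 0.25, 0.25], [0.1, 0.1, 0.8]]
loss = sparse_categorical_crossentropy(labels, probabilities)
print(loss)
```

The loss is small when the softmax output concentrates on the labeled class and grows without bound as that probability approaches zero.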

An electronic device comprising a processor and a memory, the processor and memory in communication with each other;

a memory for storing a computer program;

and the processor is used for realizing the steps of the method when executing the program stored in the memory.

A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned method steps.

The invention has the beneficial effects that: on the basis of existing semantic segmentation networks, the embodiment of the invention provides a semantic segmentation method with higher efficiency and better real-time performance; the lightweight network FCN8s is used as the encoder to output multi-scale feature maps, multi-level feature joint up-sampling is then used to extract features and context information from the multi-scale feature maps, and strided convolution and ordinary convolution are then used to extract more comprehensive feature information, so that the trained semantic segmentation model has higher semantic segmentation accuracy; meanwhile, a low-resolution feature map is used to approximate a high-resolution feature map, which greatly reduces the computation of the semantic segmentation network, increases the segmentation speed of the network, and further enhances its real-time response capability in application.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart of an implementation of an embodiment of the present invention.

Fig. 2 is a network configuration diagram of an embodiment of the present invention.

Fig. 3 is a block diagram of a multi-level feature joint upsampling module.

FIG. 4 is the semantic segmentation effect of different algorithms on the Cityscapes dataset.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The streetscape image semantic segmentation system comprises a preprocessing module, an encoder, a multi-level feature joint up-sampling module, a pyramid pooling module and a convolution classifier which are sequentially connected; the preprocessing module is used for carrying out scaling, random cropping, random flipping and normalization operations on images in the data set; the encoder is the lightweight network FCN8s, which encodes image features to obtain the output feature maps of the Conv1, Conv2, Conv3, Conv4 and Conv5 layers; the multi-level feature joint up-sampling module is used for extracting features and context information from the output feature maps of the Conv3, Conv4 and Conv5 layers and fusing the extracted information to obtain a second output feature map; the pyramid pooling module is used for performing convolution processing on the second output feature map to convert the high-resolution second output feature map into a low-resolution third output feature map; and the convolution classifier is used for dividing the feature map into different objects to realize image semantic segmentation.

Examples

As shown in fig. 1, the streetscape image semantic segmentation method includes the following steps:

step 1, acquiring finely labeled street view images for unmanned driving, and dividing them into a training set, a verification set and a test set;

the Cityscapes database released by Mercedes-Benz is selected as the source of street view images; it contains street view images of 50 cities in different scenes, backgrounds and seasons, including 5000 finely labeled images with a resolution of 1024 × 2048, which are divided into 2975 training images, 500 verification images and 1525 test images;

as segmentation objects, the following 34 classes of objects are used: unlabeled, ego vehicle, rectification border, out of roi, static, dynamic, ground, road, sidewalk, parking, rail track, building, wall, fence, guard rail, bridge, tunnel, pole, polegroup, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, caravan, trailer, train, motorcycle, bicycle, license plate;

preprocessing and data enhancing are carried out on the finely labeled street view image;

the finely labeled street view images in the Cityscapes database have a high resolution, and performing semantic segmentation on them directly would seriously reduce the running speed of the semantic segmentation network; therefore the street view images in the training set are scaled to 512 × 1024, then randomly cropped to 512 × 512, randomly flipped and normalized, and the street view images in the test set and the verification set are scaled to 512 × 512 and normalized;
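The training-set preprocessing described above (random crop, random flip, normalization) can be sketched in plain Python on nested lists. This is an illustration only, not the embodiment's code: the function name `augment`, the flip probability of 0.5 and normalization by dividing by 255 are assumptions, since the text does not specify them, and the scaling step is omitted.

```python
import random

def augment(image, crop_h, crop_w, rng):
    """Randomly crop, randomly flip, and normalize one image.

    `image` is a list of rows, each row a list of pixel values in 0..255.
    """
    height, width = len(image), len(image[0])
    # Random crop: pick the top-left corner so the window stays inside the image.
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    patch = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    # Random horizontal flip with probability 0.5 (assumed value).
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]
    # Normalize to [0, 1] (assumed scheme; the source only says "normalization").
    return [[value / 255.0 for value in row] for row in patch]

rng = random.Random(0)
# Toy 4 x 8 single-channel image standing in for a 512 x 1024 street view image.
image = [[(r * 8 + c) % 256 for c in range(8)] for r in range(4)]
patch = augment(image, crop_h=4, crop_w=4, rng=rng)
```

During training, the label map must receive the same crop window and the same flip as the image so that pixels and labels stay aligned.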

step 2, as shown in fig. 2, a lightweight network FCN8s is used as an encoder of the semantic segmentation network to encode street view images of the training set;

the lightweight network FCN8s encodes semantic information accurately with a small amount of computation, so using it as the encoder reduces the time the algorithm spends in the feature encoding stage; the lightweight network FCN8s sequentially comprises 2 groups each consisting of two 3 × 3 convolution operations followed by a max pooling operation, and 3 groups each consisting of three 3 × 3 convolution operations followed by a max pooling operation;

the input street view image has the format H × W × 3, and its height and width are halved after each max pooling operation; the Conv1 layer output feature map generated by the encoder is one half the size of the original image and has 64 channels, the Conv2 layer output feature map is one fourth the size of the original image and has 64 channels, the Conv3 layer output feature map is one eighth the size of the original image and has 128 channels, the Conv4 layer output feature map is one sixteenth the size of the original image and has 256 channels, and the Conv5 layer output feature map is one thirty-second the size of the original image and has 512 channels;
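The shape bookkeeping of the encoder can be checked with a few lines of Python. The sketch below assumes that each of the five stages ends with one max pooling that halves height and width, and it reads the per-layer counts 64, 64, 128, 256 and 512 given above as channel counts, which is an interpretation; the function name is hypothetical.

```python
# Per-stage counts as listed in the text, read here as channel counts.
CHANNELS = [64, 64, 128, 256, 512]

def encoder_output_shapes(height, width):
    """Return (name, height, width, channels) for Conv1..Conv5.

    Each stage is assumed to end with a max pooling that halves height and width.
    """
    shapes = []
    for index, channels in enumerate(CHANNELS, start=1):
        height //= 2
        width //= 2
        shapes.append((f"Conv{index}", height, width, channels))
    return shapes

shapes = encoder_output_shapes(512, 512)
for name, h, w, c in shapes:
    print(name, h, w, c)
```

For a 512 × 512 input, the three maps passed on to the up-sampling module (Conv3 to Conv5) would then measure 64 × 64, 32 × 32 and 16 × 16.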

step 3, using the multi-level feature joint up-sampling module to collect the context information and features of the Conv3, Conv4 and Conv5 layer output feature maps respectively, and fusing the collected results to obtain a second output feature map with high resolution;

as shown in fig. 3, the multi-level feature joint up-sampling module takes the last three feature maps (Conv3-Conv5) of the encoder network FCN8s as its input, performs convolution processing on each of the three input feature maps to generate three first intermediate feature maps, which places them in a common space of lower dimension, and then up-samples and concatenates the three first intermediate feature maps to obtain a first output feature map, so that the context information of the multi-level feature maps is better fused and the computational cost of processing the first output feature map is reduced;

then, four depthwise separable convolutions are used to extract deep and shallow features from the first output feature map, yielding four second intermediate feature maps; a convolution layer stacks the channels of the four second intermediate feature maps and compresses them into a high-resolution second output feature map with a normal channel size; the dilation rates of the four depthwise separable convolutions are 1, 2, 4 and 8 respectively; the depthwise separable convolution with dilation rate 1 captures the relationship between the first output feature map and the separated feature maps, and the depthwise separable convolutions with dilation rates 2, 4 and 8 learn the mapping that converts the feature maps separated from the first output feature map into the second output feature map;
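Two properties make this branch design cheap yet wide-reaching: a dilated kernel covers a larger span without extra weights, and a depthwise separable convolution needs far fewer weights than an ordinary one. The sketch below illustrates both. The 3 × 3 kernel size and the channel count of 256 are assumptions chosen for illustration; the text gives only the dilation rates.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def separable_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise

def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights in an ordinary convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

# Dilation rates from the text; the 3 x 3 kernel is an assumption.
spans = {rate: effective_kernel_size(3, rate) for rate in (1, 2, 4, 8)}
print(spans)  # {1: 3, 2: 5, 4: 9, 8: 17}

# Hypothetical channel counts, chosen only to show the order of magnitude.
print(separable_conv_params(3, 256, 256))  # 67840
print(standard_conv_params(3, 256, 256))   # 589824
```

Under these assumptions the four branches see neighbourhoods of 3, 5, 9 and 17 pixels per axis, while each branch uses roughly one ninth of the weights of an ordinary 3 × 3 convolution.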

in this embodiment, the multi-level feature joint up-sampling module avoids both the atrous pyramid pooling network with its huge number of parameters and convolution over high-resolution output feature maps, either of which would greatly reduce the segmentation speed; it can also extract multi-scale context information from the multi-level feature maps, thereby obtaining better performance;

step 4, inputting the second output feature map into the pyramid pooling module, and converting the high-resolution second output feature map into a low-resolution third output feature map through convolution processing, so as to further extract multi-scale information from the second output feature map and improve the ability of the network to segment targets of different scales;

the pyramid pooling module comprises a strided convolution and several ordinary convolutions; the second output feature map is input into the strided convolution for convolution processing, the elements with odd indexes are then deleted to obtain a third intermediate feature map, and the third intermediate feature map is passed through the ordinary convolutions to obtain a third output feature map with lower spatial resolution;
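One way to read "strided convolution, then delete the elements with odd indexes" is the identity that a stride-2 convolution yields the same result as a stride-1 convolution whose odd-indexed outputs are dropped. This reading is an interpretation and is not confirmed by the text. The one-dimensional sketch below, with hypothetical kernel weights, demonstrates the identity.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1-D cross-correlation with the given stride."""
    span = len(kernel)
    return [
        sum(signal[start + k] * kernel[k] for k in range(span))
        for start in range(0, len(signal) - span + 1, stride)
    ]

signal = [1, 4, 2, 7, 3, 8, 5, 6, 9]
kernel = [1, 0, -1]  # hypothetical weights

dense = conv1d(signal, kernel, stride=1)
kept_even = dense[::2]  # delete the outputs at odd indexes
strided = conv1d(signal, kernel, stride=2)
print(kept_even == strided)  # True
```

The same relation holds per axis in two dimensions, which is why the operation halves the spatial resolution of the feature map.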

as the number of ordinary convolutions increases, the resulting feature map contains more abstract information, carries stronger semantic information and has a larger receptive field, but its resolution is lower and its perception of detail is poorer; with fewer convolutions, the feature map has higher resolution and contains more position and detail information, but its semantics are weaker and it is noisier; this embodiment performs the ordinary convolution 5 times;

step 5, the convolution classifier is configured as follows: a conv2d operation is used with filters = 34, kernel_size = 1, padding = 'same' and activation = 'softmax', where filters is the number of filters, kernel_size is the convolution kernel size, padding is the convolution padding mode, and activation is the activation function;
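A convolution with kernel size 1 followed by softmax amounts to applying the same small linear classifier independently at every pixel. The plain-Python sketch below shows this on a toy feature map; it is not the embodiment's conv2d layer, and the weights, the 3 input channels and the 4 classes are hypothetical (the embodiment uses 34 filters).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_pixels(feature_map, weights, biases):
    """Apply a 1 x 1 convolution plus softmax at every pixel.

    feature_map: H x W x C nested lists
    weights:     C x K matrix (K classes); biases: K values
    Returns an H x W x K map of class probabilities.
    """
    num_classes = len(biases)
    output = []
    for row in feature_map:
        out_row = []
        for pixel in row:
            scores = [
                sum(pixel[c] * weights[c][k] for c in range(len(pixel))) + biases[k]
                for k in range(num_classes)
            ]
            out_row.append(softmax(scores))
        output.append(out_row)
    return output

# Toy example: 2 x 2 map, 3 input channels, 4 classes.
feature_map = [[[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]],
               [[0.0, 0.0, 0.0], [2.0, 1.0, 1.0]]]
weights = [[0.2, -0.1, 0.0, 0.3],
           [0.0, 0.4, -0.2, 0.1],
           [0.1, 0.0, 0.5, -0.3]]
biases = [0.0, 0.0, 0.0, 0.0]
probs = classify_pixels(feature_map, weights, biases)
```

Because the kernel covers a single pixel, the output keeps the spatial size of the input, and each pixel receives a probability distribution over the classes.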

step 6, in the end-to-end training process, only random flipping and random cropping are used for data enhancement; the back propagation algorithm uses the Adam optimizer, the loss function is sparse_categorical_crossentropy, the initial learning rate is 0.001, the learning rate strategy is an inverse time decay strategy, and the weight decay uses L2 regularization, wherein decay_steps is 74300 and decay_rate is 0.5, which means that the learning rate decays to two thirds after every 100 epochs;
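The learning-rate figures above can be checked against the usual inverse time decay formula, lr = lr0 / (1 + decay_rate × step / decay_steps), which is the form used, for example, by the Keras `InverseTimeDecay` schedule. That the embodiment uses exactly this continuous form is an assumption; the function below is a plain-Python sketch with the stated hyperparameters as defaults.

```python
def inverse_time_decay(step, initial_lr=0.001, decay_steps=74300, decay_rate=0.5):
    """Learning rate after `step` optimizer updates (continuous form)."""
    return initial_lr / (1.0 + decay_rate * step / decay_steps)

for step in (0, 74300, 148600):
    print(step, inverse_time_decay(step))
```

Under this formula the rate equals two thirds of its initial value after the first 74300 steps and one half after 148600 steps, so the decay slows over time rather than multiplying by two thirds at every interval.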

and step 7, the street view image to be tested is down-sampled to 512 × 512 and input into the semantic segmentation model to obtain semantic segmentation feature values, which are up-sampled using bilinear interpolation and restored into a semantic segmentation image of the street view image.
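Bilinear up-sampling of a low-resolution score map can be sketched in plain Python as follows. The corner-aligned sampling convention used here is an assumption, since the text does not say which convention the embodiment uses, and the function name is hypothetical.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2-D grid of floats with bilinear interpolation.

    Uses the corner-aligned convention: the four corner values are preserved.
    """
    in_h, in_w = len(grid), len(grid[0])
    result = []
    for i in range(out_h):
        # Map the output row to a fractional input row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        result.append(row)
    return result

small = [[0.0, 1.0],
         [2.0, 3.0]]
large = bilinear_resize(small, 3, 3)
print(large)  # [[0.0, 0.5, 1.0], [1.0, 1.5, 2.0], [2.0, 2.5, 3.0]]
```

For segmentation, each class's score map would be resized in this way and the final label at each pixel taken as the class with the highest interpolated score.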

The method comprises the steps of performing semantic segmentation on a Cityscapes database by using various existing semantic segmentation algorithms and the embodiment respectively, wherein evaluation indexes of the Cityscapes database are shown in table 1, wherein an index Pix Acc and an index mIoU are used for evaluating semantic segmentation accuracy of the algorithms, and an index FPS is used for evaluating semantic segmentation speed of the algorithms, and the data in the table 1 show that the semantic segmentation method can greatly increase the operation speed of semantic segmentation without losing the semantic segmentation accuracy, and can greatly improve real-time response capability and driving safety of unmanned driving when being used for the unmanned driving.

TABLE 1 comparison of different evaluation indexes of algorithms on the Cityscapes database

Algorithm	Backbone network	Pix Acc (%)	mIoU (%)	FPS (frames/s)
U-Net	VGG16	87.07	37.06	16.6
SegNet	VGG16	85.48	33.75	24.7
ENet	From scratch	85.75	30.46	37.8
PSPNet	ResNet101	89.24	41.65	11.2
EncNet	ResNet101	92.68	45.65	13.6
DeepLab v3+	ResNet101	93.24	44.76	14.3
This embodiment	FCN8s	91.85	43.78	32.3
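The two accuracy columns of Table 1 are conventionally computed from a confusion matrix: Pix Acc is the fraction of correctly labeled pixels, and mIoU is the mean over classes of intersection over union. The sketch below shows these standard definitions on toy data; it is not the evaluation code behind the table, and skipping classes that never occur is an assumption.

```python
def confusion_matrix(truth, predicted, num_classes):
    """Count pixels by (true class, predicted class)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, predicted):
        matrix[t][p] += 1
    return matrix

def pixel_accuracy(matrix):
    """Fraction of pixels whose predicted class equals the true class."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def mean_iou(matrix):
    """Mean over classes of intersection / union; absent classes are skipped."""
    ious = []
    for k in range(len(matrix)):
        intersection = matrix[k][k]
        union = sum(matrix[k]) + sum(row[k] for row in matrix) - intersection
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious)

# Six pixels, three classes (toy labels).
truth = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(truth, predicted, num_classes=3)
print(pixel_accuracy(matrix))
print(mean_iou(matrix))
```

In this toy case four of six pixels are correct, and the per-class IoU values are 1/3, 2/3 and 1/2. mIoU penalizes errors on small classes more heavily than pixel accuracy does, which is consistent with the mIoU values in the table being much lower than the Pix Acc values.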

The semantic segmentation results of the existing algorithms and of this embodiment on the Cityscapes database are shown in fig. 4; as can be seen from fig. 4, the segmentation result obtained by the invention is closest to the ground-truth labels, contains no mislabeled regions, and has clearer contour lines for each classified object.

The invention also comprises an electronic device which comprises a memory and a processor, wherein the memory is used for storing the collected street view images and various computer program instructions for carrying out operations such as preprocessing, coding, feature extraction, up-sampling and the like on the street view images, and the processor is used for executing the computer program instructions to complete all or part of the steps so as to realize semantic segmentation of the street view images to be processed; the electronic device may communicate with one or more external devices, may also communicate with one or more devices that enable user interaction with the electronic device, and/or any device that enables the electronic device to communicate with one or more other computing devices, may also communicate with one or more networks (e.g., local area networks, wide area networks, and/or public networks) through a network adapter; the present invention also includes a computer-readable medium having stored thereon a computer program executable by a processor to implement street view image semantic segmentation, the computer-readable medium can include, but is not limited to, magnetic storage devices, optical disks, digital versatile disks, smart cards, and flash memory devices, the readable storage medium of the present invention can represent one or more devices for storing information and/or other machine-readable media, the term "machine-readable medium" including, but not limited to, wireless channels and various other media (and/or storage media) capable of storing, containing, and/or carrying code and/or instructions and/or data.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. Street view image semantic segmentation system, characterized by including:

2. The street view image semantic segmentation method using the street view image semantic segmentation system according to claim 1, characterized by comprising the steps of:

3. The streetscape image semantic segmentation method according to claim 2, wherein the preprocessing and data enhancement in step 1 comprises: carrying out scaling, random cropping, random flipping and normalization processing on the training set images, and carrying out scaling and normalization processing on the test set images and the verification set images.

4. The streetscape image semantic segmentation method according to claim 2, wherein the encoder in step 2 is the lightweight network FCN8s, which sequentially comprises 2 groups each consisting of two 3 × 3 convolution operations followed by a max pooling operation, and 3 groups each consisting of three 3 × 3 convolution operations followed by a max pooling operation;

5. The streetscape image semantic segmentation method according to claim 2, wherein the step 3 specifically comprises the following steps:

6. The streetscape image semantic segmentation method according to claim 2, wherein the specific operation of step 4 is to perform strided convolution on the input second output feature map, delete the elements with odd indexes to obtain a third intermediate feature map, and perform ordinary convolution several times on the third intermediate feature map to obtain a third output feature map.

7. The streetscape image semantic segmentation method according to claim 2, wherein the convolution classifier in step 5 uses a conv2d operation, the number of filters is the number of street view image segmentation objects, the convolution kernel size is 1, the convolution padding mode is 'same', and the activation function is softmax.

8. The streetscape image semantic segmentation method according to claim 2, wherein the back propagation algorithm in step 6 uses the Adam optimizer, the loss function is sparse_categorical_crossentropy, the initial learning rate is 0.001, the learning rate strategy is an inverse time decay strategy, and the weight decay uses L2 regularization, wherein decay_steps is 74300 and decay_rate is 0.5.

9. An electronic device comprising a processor and a memory, the processor and memory in communication with each other;

a memory for storing a computer program;

a processor for implementing the method steps of any of claims 2 to 8 when executing a program stored in the memory.

10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 2-8.