CN116311251A - Lightweight semantic segmentation method for high-precision stereoscopic perception of complex scene - Google Patents

Lightweight semantic segmentation method for high-precision stereoscopic perception of complex scene

Info

Publication number
CN116311251A
CN116311251A (application number CN202310303533.6A)
Authority
CN
China
Prior art keywords
edge
class
category
network
pixel
Prior art date
Legal status
Pending
Application number
CN202310303533.6A
Other languages
Chinese (zh)
Inventor
王智慧
李豪杰
张拨川
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN202310303533.6A
Publication of CN116311251A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer semantic segmentation and relates to a lightweight semantic segmentation method for high-precision stereoscopic perception of complex scenes. In the proposed network structure, an ultra-lightweight semantic segmentation network serves as the baseline and a CLIP text encoder serves as a class prototype generator, producing multiple class representations as class centers to guide network learning and thereby improving segmentation performance between similar classes. Because the network segments poorly at object edges, an edge optimization module is added on the decoder side and a triplet loss function for edge extraction is proposed, improving segmentation accuracy at object edges. The proposed network effectively improves generalization performance and the segmentation quality at class edges.

Description

Lightweight semantic segmentation method for high-precision stereoscopic perception of complex scene
Technical Field
The invention belongs to the technical field of deep-learning semantic segmentation and relates to a lightweight semantic segmentation method for high-precision stereoscopic perception of complex scenes.
Background
With the development of automation and computer technology, unmanned and intelligent systems have become a central theme of modern technology. An unmanned application platform perceives its surroundings with devices such as cameras and infrared detectors in order to make control decisions. The visual perception module is the only module through which an unmanned application platform interacts with its environment, so the accuracy and robustness of the visual perception algorithm are a key source of the platform's mobility and intelligence and determine the technical level of its core functions.
Semantic segmentation is an important visual task in the perception module of existing unmanned application platforms. It assigns a class label to each pixel in an image, facilitating scene understanding and object detection and providing reliable information for object localization. The key requirements an unmanned application platform places on its perception module are a small computational footprint and low prediction latency. Many high-performance semantic segmentation models are computationally expensive and slow to predict with, and are therefore unsuitable for deployment. Lightweight semantic segmentation models aim to be deployed on low-memory embedded systems through more compact and efficient architectures while still providing real-time, accurate inference.
Lightweight architectures have been studied extensively in semantic segmentation. The encoder-decoder architecture is the standard paradigm for segmentation networks; since the computation required to convolve an image or feature map is proportional to its resolution, downsampling the input significantly reduces a network's resource footprint, and lightweight segmentation networks typically upsample feature maps with interpolation and a minimum of convolution. Paszke et al. apply downsampling and convolutional encoding in the shallow layers of the network to obtain a more compact model (A. Paszke, A. Chaurasia, S. Kim and E. Culurciello, "ENet: A deep neural network architecture for real-time semantic segmentation," arXiv preprint arXiv:1606.02147, 2016). Convolution strategies that reduce parameters and computation are also common in lightweight semantic segmentation: Chollet factorizes convolutions into depthwise and pointwise convolutions (F. Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017), and Papandreou et al. apply kernel weights sparsely over a larger input window, obtaining atrous (dilated) convolutions with larger receptive fields without increasing kernel size (G. Papandreou, I. Kokkinos and P.-A. Savalle, "Modeling Local and Global Deformations in Deep Learning: Epitomic Convolution, Multiple Instance Learning, and Sliding Window Detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015). Zhang et al. use group convolution and channel shuffling to obtain satisfactory classification results from a lightweight network that runs on embedded systems (X. Zhang, X. Zhou, M. Lin and J. Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018). In unmanned-platform application scenes, however, target objects are often occluded or concealed, and the algorithm must judge categories more accurately and delineate object texture edges more clearly to obtain precise results; existing lightweight semantic segmentation algorithms adapt poorly to such complex environments.
Disclosure of Invention
The invention aims to provide a lightweight semantic segmentation method for high-precision stereoscopic visual perception of complex scenes, intended for use on unmanned application platforms.
The invention provides a lightweight semantic segmentation method that builds a semantic segmentation network on a class prototype generation module and an edge optimization module. The edge optimization module uses the edge-optimizing triplet loss proposed by the invention to alleviate poorly delineated boundaries in segmentation results.
The specific technical scheme of the invention is as follows:
a lightweight semantic segmentation method for high-precision stereoscopic perception of complex scenes comprises the following steps:
step 1) constructing a category prototype generating module: the category prototype generation module uses a pre-trained CLIP text encoder (A.Radford, J.W.Kim, C.Hallacy, A.Ramesh, G.Goh, S.Agarwal, G.Sastry, A.Askell, P.Mishkin, J.Clark, et al, "Learning transferable visual models from natural language supervision," Image, vol.2, p.t2, 2021.) as the robust category prototype feature to guide the semantic segmentation network to learn the category features; the class prototype generation module is structured as a text input transducer network that maps text input to feature space to generate class prototypes. Before the category name text is input into the category generating module, the category name is firstly constructed into a text prompt, such as a { tag } photo, so as to reduce the misdiscrimination of the network on the multi-meaning text, and then the text prompt is input into the category prototype generating module. N text prompts are constructed for the text names of the specific categories, so that N category prototypes of one category are constructed, different expression forms of the internal examples of each category are represented, and meanwhile differences among different categories are represented.
Step 2) constructing the edge optimization module: the input of the edge optimization module is the feature map of each decoder layer of the semantic segmentation backbone; the feature map passes through 2 convolution layers and an activation layer in sequence to produce a 1-channel object edge prediction map. The multi-level class edge prediction maps are supervised with class edge maps derived from the ground-truth annotations, using a supervision scheme that combines an edge triplet loss with a classification loss to optimize the edge maps. The combined edge triplet loss and classification loss work as follows: in the triplet loss that optimizes the edge map, the set of pixels whose confidence exceeds a threshold α and whose ground truth is edge is treated as positive examples; the set of pixels whose confidence is below β and whose ground truth is class interior is treated as false negatives; the remaining pixels whose ground truth is edge are the targets to be optimized. The triplet loss computes the mean feature at the positions of the positive examples, the false negatives, and the targets to be optimized, describes the distances from the targets to the positive examples and the false negatives with a cosine metric, and is constructed and computed with a cross-entropy function. Based on statistics of the dataset's edge pixels, an offset equal to γ% of the total pixel count is added to the feature-point counts of the positive examples, false negatives, and targets so that training converges more quickly. Here α, β, and γ are parameters. α and β are chosen so that α > β and so that the number of positive-example pixels with confidence above α is smaller than the number of false-negative pixels with confidence below β, guaranteeing that the positive and false-negative features are representative. γ is chosen slightly smaller than the number of class edge pixels in the ground-truth mask, which smooths the contribution of class edge pixels and keeps the computed gradients relatively stable.
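A minimal sketch of the edge head and of one plausible reading of the edge triplet loss follows; the reduction to mean features, the γ% smoothing of the denominators, and the two-way cross-entropy over cosine similarities are assumptions consistent with the description above rather than the patent's exact formulation.

```python
# Sketch of the edge optimization module and edge triplet loss of step 2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeHead(nn.Module):
    """2 convolution layers + activation -> 1-channel edge probability map."""
    def __init__(self, in_ch, mid_ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(mid_ch, 1, 3, padding=1)

    def forward(self, feat):
        return torch.sigmoid(self.conv2(F.relu(self.conv1(feat))))

def edge_triplet_loss(feat, edge_prob, edge_gt, alpha=0.9, beta=0.5, gamma=0.5):
    """feat: (B,C,H,W) decoder features; edge_prob, edge_gt: (B,1,H,W), gt in {0,1}."""
    pos = (edge_prob > alpha) & (edge_gt > 0)    # confident, truly edge
    neg = (edge_prob < beta) & (edge_gt == 0)    # class interior ("false negatives")
    tgt = (edge_gt > 0) & ~pos                   # remaining edge pixels to optimize

    offset = gamma / 100.0 * edge_gt.numel()     # smoothing: gamma% of pixel count
    def masked_mean(mask):
        m = mask.float()
        return (feat * m).sum(dim=(0, 2, 3)) / (m.sum() + offset)  # (C,)

    mu_pos, mu_neg, mu_tgt = masked_mean(pos), masked_mean(neg), masked_mean(tgt)
    sim_pos = F.cosine_similarity(mu_tgt, mu_pos, dim=0)  # pull toward positives
    sim_neg = F.cosine_similarity(mu_tgt, mu_neg, dim=0)  # push from negatives
    logits = torch.stack([sim_pos, sim_neg]).unsqueeze(0)             # (1, 2)
    return F.cross_entropy(logits, logits.new_zeros(1, dtype=torch.long))
```

In training this term would be added to the ordinary classification loss over the edge map, per the combined supervision described above.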
Step 3) constructing the semantic segmentation network: the class prototype generation module and the edge optimization module built in steps 1) and 2) are combined with the backbone network to form the overall structure of the semantic segmentation network. The backbone is divided into an encoding part and a decoding part. The encoder extracts the original image into a feature map of 1/8 the original size with C₁ dimensions; the decoder decodes the encoder's feature map back to the input's original size with C₂ dimensions. Each upsampled decoder feature is fed to the edge optimization module, yielding edge prediction maps at different resolutions that are supervised with the edge triplet loss.
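The wiring of step 3 might look roughly like the sketch below; the stride-2 convolutions and upsampling stages are simplified placeholders (the embodiment below uses residual blocks), and EdgeHead is the module from the previous sketch.

```python
# Simplified sketch of the step-3 backbone: encode to 1/8 resolution with C1
# channels, decode back to full resolution with C2 channels, and attach an
# EdgeHead (see the previous sketch) to every upsampled decoder feature.
import torch.nn as nn

def down(in_ch, out_ch):   # one stride-2 stage (placeholder for residual blocks)
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.ReLU(inplace=True))

def up(in_ch, out_ch):     # one x2 upsampling stage (placeholder)
    return nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear",
                                     align_corners=False),
                         nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.ReLU(inplace=True))

class SegBackbone(nn.Module):
    def __init__(self, c1=128, c2=64):
        super().__init__()
        self.encoder = nn.Sequential(down(3, c1 // 4), down(c1 // 4, c1 // 2),
                                     down(c1 // 2, c1))          # full -> 1/8
        self.dec_stages = nn.ModuleList(
            [up(c1, c1 // 2), up(c1 // 2, c2), up(c2, c2)])      # 1/8 -> full
        self.edge_heads = nn.ModuleList(
            [EdgeHead(c1 // 2), EdgeHead(c2), EdgeHead(c2)])

    def forward(self, x):
        feat = self.encoder(x)
        edge_maps = []
        for stage, head in zip(self.dec_stages, self.edge_heads):
            feat = stage(feat)
            edge_maps.append(head(feat))   # multi-resolution edge predictions
        return feat, edge_maps             # (B,C2,H,W) features + edge maps
```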
Step 4) semantic segmentation network training and inference: the final segmentation result is obtained by combining the C₂-dimensional vectors output by the decoder with the class prototypes generated by the class prototype generation module. During training, the class prototype generation module randomly selects M class prototypes, and their mean is compared with the decoder's final output features by per-pixel cosine similarity; the class with the highest cosine similarity is the pixel's predicted class. During inference the edge optimization module does not participate in the computation; the decoder features are compared with all class prototypes pixel by pixel, and for each class the mean of the M highest similarity scores is taken as the pixel's prediction score for that class. The class with the highest prediction score over all classes is each pixel's final predicted class, and the predictions of all pixels form the segmentation result map.
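The prediction rule of step 4 can be sketched as follows; the random prototype sampling, the renormalization of the prototype mean, and the tensor layout are illustrative assumptions.

```python
# Sketch of the step-4 prediction rule: cosine similarity between per-pixel
# decoder features and class prototypes, with M randomly sampled prototypes
# during training and the mean of the top-M similarities per class at inference.
import torch
import torch.nn.functional as F

def predict(feat, prototypes, M=3, training=False):
    """feat: (B,C2,H,W); prototypes: (K,N,C2) from the prototype generator."""
    B, C, H, W = feat.shape
    K, N, _ = prototypes.shape
    f = F.normalize(feat, dim=1).permute(0, 2, 3, 1).reshape(-1, C)  # (BHW,C2)
    p = F.normalize(prototypes, dim=-1)

    if training:
        idx = torch.randperm(N, device=feat.device)[:M]      # random M prototypes
        centers = F.normalize(p[:, idx].mean(dim=1), dim=-1)  # (K,C2) class centers
        scores = f @ centers.t()                             # (BHW,K) cosine scores
    else:
        sims = (f @ p.reshape(K * N, C).t()).view(-1, K, N)  # (BHW,K,N)
        scores = sims.topk(M, dim=-1).values.mean(dim=-1)    # top-M mean per class

    labels = scores.argmax(dim=-1).view(B, H, W)             # highest-scoring class
    return labels, scores.view(B, H, W, K)
```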
The invention has the beneficial effects that:
the invention discloses a lightweight semantic segmentation method for high-precision visual perception of a complex scene, which is used for constructing a semantic segmentation network based on a category prototype generation module and an edge optimization module. The former enhances the distinguishing capability of the network to the object of the interested category, and reduces the phenomenon of partial region erroneous segmentation; the latter strengthens the capability of the backbone network to detect the edges of objects, and constructs a semantic and edge joint feature space. The time complexity of the class prototype prediction mask is the same as that of the traditional one-hot vector generation, the edge optimization module does not participate in calculation during prediction, and the class prototype prediction mask and the edge optimization module are suitable for being directly added into a lightweight semantic segmentation frame, so that the class prototype prediction mask is suitable for being deployed on an unmanned application platform, the speed and the precision of a segmentation task are well balanced, and a new solution is provided for a high-precision stereoscopic vision perception scene of a complex scene.
In the network structure provided by the invention, an ultra-lightweight semantic segmentation network serves as the baseline and a CLIP text encoder serves as a class prototype generator, producing multiple class representations as class centers to guide network learning and thereby improving segmentation performance between similar classes. Because the network segments poorly at object edges, an edge optimization module is added on the decoder side and a triplet loss function for edge extraction is proposed, improving segmentation accuracy at object edges. The proposed network effectively improves generalization performance and the segmentation quality at class edges.
Drawings
Fig. 1 is the network structure diagram of the present invention.
Fig. 2 shows results of the present invention on a public dataset: the first column is the original image, the second column the ground-truth labels, the third column the baseline network's result, and the fourth column the segmentation result of the present invention.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings and technical schemes.
This embodiment was implemented on a Tesla V100 GPU and an Intel Xeon E5-2680 v4 CPU using the CUDA 11.4 backend. The proposed semantic segmentation framework is implemented in PyTorch. The image resolution in both training and inference is 1024×512. The Adam optimizer is used with an initial learning rate of 5e-4 and a batch size of 16. Zero-mean normalization, random flipping, random scaling (between 0.8 and 1.5), and cropping are used for data augmentation. The Cityscapes dataset (M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele, "The Cityscapes Dataset for Semantic Urban Scene Understanding," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 27-30, 2016) serves as the training data. Cityscapes contains 5,000 images with dense labels and 20,000 with coarse labels at a resolution of 2048×1024, covering semantic and instance segmentation.
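A minimal sketch of this training configuration follows, with a placeholder model standing in for the network of Fig. 1; the crop handling and flip probability are assumptions.

```python
# Minimal sketch of the training setup above: Adam, initial lr 5e-4, batch 16
# (set on the DataLoader, not shown), zero-mean normalization, random flip,
# random 0.8-1.5 scaling, and 1024x512 crops. `model` is a placeholder.
import random
import torch
import torch.nn.functional as F

def augment(img, mask):
    """img: (3,H,W) float tensor; mask: (H,W) long tensor of class labels."""
    img = img - img.mean(dim=(1, 2), keepdim=True)        # zero-mean normalization
    if random.random() < 0.5:                             # random horizontal flip
        img, mask = img.flip(-1), mask.flip(-1)
    s = random.uniform(0.8, 1.5)                          # random scaling
    img = F.interpolate(img[None], scale_factor=s, mode="bilinear",
                        align_corners=False)[0]
    mask = F.interpolate(mask[None, None].float(), scale_factor=s,
                         mode="nearest")[0, 0].long()
    h, w = mask.shape                                     # random 1024x512 crop
    top = random.randint(0, max(0, h - 512))
    left = random.randint(0, max(0, w - 1024))
    return (img[:, top:top + 512, left:left + 1024],
            mask[top:top + 512, left:left + 1024])

model = torch.nn.Conv2d(3, 19, 1)                         # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
```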
As shown in Fig. 1, the method of the present invention comprises the following specific steps:
1) Constructing the class prototype generation module: the structure is a text-input Transformer (C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," arXiv preprint arXiv:1910.10683, 2019), using a 63M-parameter model with 12 layers, width 512, and 8 attention heads. For one input text the module outputs a 768-dimensional text representation. After the Transformer, a three-layer fully connected network reduces the text representation to 64 dimensions to produce a class prototype. In the present invention, text prompts are generated for each predicted category; for the category "car", for example, the prompts "picture of a car", "blurred picture of a car", "clear picture of a car", etc. are generated. The generated text prompts are fed to the class prototype generation module, constructing 10 class prototypes per class that exhibit both inter-class differences and intra-class variance.
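The three-layer fully connected projection might be sketched as below; the text states only the 768-dimensional input and 64-dimensional output, so the hidden width of 256 is an assumption.

```python
# Sketch of the three-layer fully connected projection that maps a CLIP text
# feature (768-d) to a class prototype (64-d). Hidden width 256 is assumed.
import torch.nn as nn

class PrototypeProjector(nn.Module):
    def __init__(self, in_dim=768, hidden=256, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, text_feat):  # (N, 768) -> (N, 64)
        return self.net(text_feat)
```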
2) Constructing the edge optimization module: the input is the feature map of each decoder layer of the backbone; the feature map passes through 2 convolution layers and an activation layer in sequence to produce a 1-channel object edge prediction map. The multi-level class edge prediction maps are supervised with class edge maps derived from the ground-truth annotations. The invention proposes a supervision scheme combining an edge triplet loss with a classification loss to optimize the edge maps. In the triplet loss, pixels with confidence above the threshold 0.9 whose ground truth is edge are treated as positive examples, pixels with confidence below 0.5 whose ground truth is class interior are treated as false negatives, and the remaining ground-truth edge pixels are the targets to be optimized. The loss computes the mean feature at the positions of the positive examples, false negatives, and targets, describes the distances from the targets to the positive examples and false negatives with a cosine metric, and is constructed with a cross-entropy function. Based on statistics of the dataset's edge pixels, an offset of 0.5% of the total pixel count is added to the feature-point counts of the positive examples, false negatives, and targets so that training converges more quickly.
3) Constructing the semantic segmentation network: the class prototype generation module and edge optimization module built in 1) and 2) are combined with the backbone network into the overall structure. The backbone is divided into an encoding part and a decoding part. The encoder extracts the original image into a 128-dimensional feature map at 1/8 the original size using 10 residual blocks; the decoder decodes the encoder's feature map back into a 64-dimensional feature map at the input's full size using 6 residual blocks. Each upsampled decoder feature, together with the encoder's output feature, is fed to the edge optimization module, yielding edge prediction maps at different resolutions that are supervised with the edge triplet loss.
4) Network training and inference: the final segmentation result is obtained by combining the 64-dimensional vectors output by the decoder with the class prototypes. During training, the class prototype generator randomly selects 3 class prototypes, and their mean is compared with the decoder's final per-pixel output features by cosine similarity; the class with the highest cosine similarity is the predicted class. During inference the edge optimization module does not participate in the computation; the decoder's per-pixel features are compared with all class prototypes, and for each class the mean of the 3 highest similarity scores is taken as the final prediction score.
Ablation experiments were performed on the class prototype generation module and the edge optimization module separately to verify their contributions to the overall structure, as shown in Table 1. For the class prototype module, supervision with randomly generated class prototypes and the contrastive learning method used in semantic segmentation (T. Zhou, W. Wang, E. Konukoglu and L. Van Gool, "Rethinking semantic segmentation: A prototype view," in CVPR, 2022) already outperforms the baseline. Adding the edge detection head on top of these class prototypes, supervised with conventional cross-entropy, improves the result further; adding the proposed edge triplet loss on top of that improves it again; and finally replacing the class prototypes with those produced by the class prototype generator yields the best segmentation result.
Table 1. Ablation experiments (the table is reproduced as an image in the original publication)
As can be seen from Table 1, the proposed method reaches 72.26 mIoU on the Cityscapes validation set, an improvement of 2.24 mIoU over the baseline network. Compared with the baseline, the method improves the prediction accuracy of most categories of interest while leaving prediction time unchanged, which means the proposed scheme outperforms the existing baseline network and is better suited to being embedded in an unmanned application platform.
Table 3. Per-class IoU comparison of the proposed algorithm with the baseline network (the table is reproduced as an image in the original publication)
Table 3 lists the segmentation metrics of the proposed method and the baseline network for all categories of interest on the Cityscapes validation set. The proposed method improves substantially over the baseline on most categories of interest. It improves on common categories such as road, sidewalk, building, vegetation, sky, and car, while achieving significant accuracy gains on categories specific to intelligent-driving scenarios, such as wall, fence, utility pole, traffic light, traffic sign, pedestrian, rider, truck, bus, train, motorcycle, and bicycle. These categories have less training data, have more distinctive shapes, or are easily confused with other categories. For example, on categories that are easily confused with others, such as car, the proposed method reaches 72.26 mIoU on the Cityscapes validation set, an improvement of 2.24 mIoU over the baseline network. The invention thus provides clear gains both in refining the edges of complex structures (e.g. traffic lights, traffic signs, fences) and in separating similar categories (e.g. car versus bus or truck). Referring to Fig. 2: in the first two rows, the baseline network mis-segments sub-regions of the car and fence categories, identifying small parts of the interior of a single object as neighboring vehicle or building categories. In the third row, compared with the baseline, the proposed method produces more complete segmentation regions and clearer boundaries on small categories such as thin utility poles and traffic signs. The visual experiments therefore confirm that, on top of the baseline network, the proposed method reduces erroneous segmentation of sub-regions, refines the boundaries of small categories, and improves the network's segmentation performance.

Claims (1)

1. A lightweight semantic segmentation method for high-precision stereoscopic perception of complex scenes is characterized by comprising the following steps of:
step 1) constructing the class prototype generation module: the class prototype generation module uses a pre-trained CLIP text encoder to obtain robust class prototype features that guide the semantic segmentation network in learning class features; the module is structured as a text-input Transformer network that maps text input to the feature space to generate class prototypes; before a class name is input to the module, the class name is first constructed into a text prompt to reduce the network's misinterpretation of polysemous text, and the text prompt is then input to the class prototype generation module; N text prompts are constructed for each class name, so that N class prototypes of one class are constructed, representing the different appearances of instances within each class while also capturing the differences between classes;
step 2) constructing the edge optimization module: the input of the edge optimization module is the feature map of each decoder layer of the semantic segmentation backbone; the feature map passes through 2 convolution layers and an activation layer in sequence to produce a 1-channel object edge prediction map; the multi-level class edge prediction maps are supervised with class edge maps derived from the ground-truth annotations, using a supervision scheme that combines an edge triplet loss with a classification loss to optimize the edge maps; the combined edge triplet loss and classification loss work as follows: in the triplet loss that optimizes the edge map, the set of pixels whose confidence exceeds a threshold α and whose ground truth is edge is treated as positive examples; the set of pixels whose confidence is below β and whose ground truth is class interior is treated as false negatives; the remaining pixels whose ground truth is edge are the targets to be optimized; the triplet loss computes the mean feature at the positions of the positive examples, the false negatives, and the targets to be optimized, describes the distances from the targets to the positive examples and the false negatives with a cosine metric, and is constructed and computed with a cross-entropy function; based on statistics of the dataset's edge pixels, an offset equal to γ% of the total pixel count is added to the feature-point counts of the positive examples, false negatives, and targets so that training converges more quickly; α, β, and γ are parameters; α and β are chosen so that α > β and so that the number of positive-example pixels with confidence above α is smaller than the number of false-negative pixels with confidence below β, guaranteeing that the positive and false-negative features are representative; γ is chosen smaller than the number of class edge pixels in the ground-truth mask, which smooths the contribution of class edge pixels and keeps the computed gradients relatively stable;
step 3) constructing the semantic segmentation network: the class prototype generation module and the edge optimization module built in steps 1) and 2) are combined with the backbone network to form the overall structure of the semantic segmentation network; the backbone is divided into an encoding part and a decoding part; the encoder extracts the original image into a feature map of 1/8 the original size with C₁ dimensions; the decoder decodes the encoder's feature map back to the input's original size with C₂ dimensions; each upsampled decoder feature is fed to the edge optimization module, yielding edge prediction maps at different resolutions that are supervised with the edge triplet loss;
step 4) semantic segmentation network training and inference: the final segmentation result is obtained by combining the C₂-dimensional vectors output by the decoder with the class prototypes generated by the class prototype generation module; during training, the class prototype generation module randomly selects M class prototypes, and their mean is compared with the decoder's final output features by per-pixel cosine similarity, the class with the highest cosine similarity being the pixel's predicted class; during inference the edge optimization module does not participate in the computation, the decoder features are compared with all class prototypes pixel by pixel, and for each class the mean of the M highest similarity scores is taken as the pixel's prediction score for that class; the class with the highest prediction score over all classes is each pixel's final predicted class, and the predictions of all pixels form the segmentation result map.
CN202310303533.6A 2023-03-27 2023-03-27 Lightweight semantic segmentation method for high-precision stereoscopic perception of complex scene Pending CN116311251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310303533.6A CN116311251A (en) 2023-03-27 2023-03-27 Lightweight semantic segmentation method for high-precision stereoscopic perception of complex scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310303533.6A CN116311251A (en) 2023-03-27 2023-03-27 Lightweight semantic segmentation method for high-precision stereoscopic perception of complex scene

Publications (1)

Publication Number Publication Date
CN116311251A true CN116311251A (en) 2023-06-23

Family

ID=86834088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310303533.6A Pending CN116311251A (en) 2023-03-27 2023-03-27 Lightweight semantic segmentation method for high-precision stereoscopic perception of complex scene

Country Status (1)

Country Link
CN (1) CN116311251A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726808A (en) * 2023-09-21 2024-03-19 书行科技(北京)有限公司 Model generation method, image processing method and related equipment

Similar Documents

Publication Publication Date Title
CN106529419B (en) The object automatic testing method of saliency stacking-type polymerization
CN111861925A (en) Image rain removing method based on attention mechanism and gate control circulation unit
CN112287941B (en) License plate recognition method based on automatic character region perception
CN114005085B (en) Method for detecting and counting distribution of dense crowd in video
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN112446292B (en) 2D image salient object detection method and system
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
Hu et al. A video streaming vehicle detection algorithm based on YOLOv4
WO2023179593A1 (en) Data processing method and device
CN114782949B (en) Traffic scene semantic segmentation method for boundary guide context aggregation
CN116630702A (en) Pavement adhesion coefficient prediction method based on semantic segmentation network
CN116311251A (en) Lightweight semantic segmentation method for high-precision stereoscopic perception of complex scene
CN113793341A (en) Automatic driving scene semantic segmentation method, electronic device and readable medium
CN113505640A (en) Small-scale pedestrian detection method based on multi-scale feature fusion
CN116129400A (en) Light-weight real-time recognition method for automobile tail lamp language
CN116630932A (en) Road shielding target detection method based on improved YOLOV5
CN114596548A (en) Target detection method, target detection device, computer equipment and computer-readable storage medium
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN113902753A (en) Image semantic segmentation method and system based on dual-channel and self-attention mechanism
CN116229406B (en) Lane line detection method, system, electronic equipment and storage medium
Zhang et al. Semantic segmentation of traffic scene based on DeepLabv3+ and attention mechanism
CN115147806A (en) Method for detecting false 3d bounding box of vehicle based on key points
CN114627183A (en) Laser point cloud 3D target detection method
CN116311106B (en) Training method, device, equipment and medium for occlusion image recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination