CN117079277A - Traffic scene real-time semantic segmentation method based on deep learning - Google Patents
- Publication number: CN117079277A
- Application number: CN202310841112.9A
- Authority
- CN
- China
- Prior art keywords
- semantic segmentation
- network
- real
- module
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
Abstract
The invention discloses a deep-learning-based method for real-time semantic segmentation of traffic scenes, comprising the following steps. Step 1: obtain training images and preprocess them. Step 2: construct a traffic scene semantic segmentation network whose encoder uses the MobileNetV2 backbone feature-extraction network, yielding a DeeplabV3+ network model with MobileNetV2 as its backbone; train the network on the training images to obtain the semantic segmentation network model. Step 3: perform real-time semantic segmentation of traffic scenes with the semantic segmentation network model. By improving the DeepLabv3+ decoder and the loss function, the method raises the segmentation accuracy of the existing semantic segmentation model while reducing computation and parameter count, so that it can be widely deployed on vehicle-mounted embedded platforms with limited storage and computing power.
Description
Technical Field
The invention belongs to the field of computer vision and specifically relates to a deep-learning-based method for real-time semantic segmentation of traffic scenes.
Background Art
Semantic segmentation is a key, fundamental technology within the field of image segmentation. Its core idea is to assign each pixel of an image semantic information drawn from a set of predefined categories. Concretely, a classification network that performs well on image classification tasks, such as AlexNet or VGG, is used as a backbone to extract features, the backbone is adapted to the application scenario, and the features are then restored to the original image size by a suitable structure to produce the segmented image. Semantic segmentation has a wide range of applications, such as remote sensing image analysis and scene analysis.
In addition, in the field of environment perception for autonomous driving, image semantic segmentation is a key technology that classifies, pixel by pixel, target objects in traffic road scene images such as traffic participants, road edges and obstacles ahead. From the segmentation result, the target type can be identified, information about obstacles in front of the vehicle can be obtained, and the passable range within the drivable area can be planned, thereby providing environment perception results for the autonomous driving system.
However, traffic scenes on urban roads are complex and contain diverse targets: not only objects that occupy a large proportion of the image, such as vehicles, roads, buildings, vegetation and nearby pedestrians, but also objects that occupy a small proportion, such as traffic signs, street lights, small non-motorized vehicles, and distant pedestrians and vehicles. Autonomous driving and driver-assistance systems are the main applications of traffic scene semantic segmentation; their goal is to let a vehicle navigate automatically, avoid vehicles and pedestrians, and recognize obstacles and traffic signs with no or only assisted human operation. The accuracy and real-time performance of semantic segmentation are therefore crucial to the safety of drivers, other road vehicles and pedestrians. As a pixel-level image recognition technology, semantic segmentation is more accurate than other techniques but also more difficult: running time, segmentation accuracy and hardware storage must all be tightly controlled. The network must not only recognize and segment relatively large nearby objects but also segment distant or small objects precisely, so that road conditions can be judged in advance. Semantic segmentation is also time-consuming, and data collection and annotation as well as algorithm complexity need continuous improvement before truly autonomous driving that guarantees the safety of people and vehicles becomes possible. Moreover, because semantic segmentation network models are complex and cumbersome, their large parameter counts and computation loads lead to slow inference, which makes them unsuitable for deployment on vehicle-mounted chip platforms with limited computing resources and strict latency requirements; real-time performance and accuracy are hard to achieve simultaneously.
In summary, the prior art suffers from inaccurate segmentation of small objects in traffic scene images and is difficult to apply on vehicle-mounted chip platforms with limited computing resources and strict latency requirements.
Summary of the Invention
In order to solve the problems existing in the prior art, the invention provides a deep-learning-based method for real-time semantic segmentation of traffic scenes, aimed at the inaccurate segmentation of small objects in existing traffic scene images and the difficulty of applying such methods on vehicle-mounted chip platforms with limited computing resources and strict latency requirements. By improving the DeepLabv3+ decoder and the loss function, a deep-learning-based traffic scene semantic segmentation method is proposed that raises the segmentation accuracy of the existing semantic segmentation model while reducing computation and the number of parameters, so that it can be widely deployed on vehicle-mounted embedded platforms with limited storage and computing power.
To achieve the above purpose, the invention provides the following technical solution:
A deep-learning-based method for real-time semantic segmentation of traffic scenes comprises the following steps.
Step 1: obtain training images and preprocess them.
Step 2: construct a traffic scene semantic segmentation network whose encoder uses the MobileNetV2 backbone feature-extraction network, yielding a DeeplabV3+ network model with MobileNetV2 as its backbone; train the traffic scene semantic segmentation network on the training images to obtain the semantic segmentation network model.
Step 3: perform real-time semantic segmentation of traffic scenes with the semantic segmentation network model.
Preferably, in step 1 the preprocessing consists of converting the training images to tensor format, resizing and normalizing them, and performing data augmentation by random flipping and added noise.
Preferably, step 2 comprises the following steps.
Step 2.1: when constructing the traffic scene semantic segmentation network, the encoder extracts features with the lightweight MobileNetV2 architecture, turning the DeeplabV3+ network with an Xception backbone into a DeeplabV3+ network with a MobileNetV2 backbone; the MobileNetV2 DeeplabV3+ network obtains high-level and low-level semantic features through a deep convolutional neural network.
The high-level semantic features are fed into the ASPP module, into which depthwise separable convolution, a CBAM attention module and an SE attention module are introduced.
Step 2.2: input the training images into the DeeplabV3+ network model and train it with the Focal Loss loss function; after passing through the deep convolutional neural network, the corresponding low-level semantic features are obtained.
Step 2.3: pass the features through the ASPP module of step 2.1 and then a 1×1 convolution to obtain multi-scale context information.
Step 2.4: apply one 3×3 convolution and 4× bilinear upsampling to obtain the output segmented image.
Step 2.5: perform the decoding operation through the Decoder module to obtain the semantic segmentation result.
Further, in step 2.1 the input and output of the residual part of the lightweight MobileNetV2 network are connected directly.
Further, in step 2.1 the MobileNetV2 DeeplabV3+ network first uses a 1×1 convolution to raise the dimension, then a 3×3 depthwise separable convolution for feature extraction, and finally a 1×1 convolution to reduce the dimension.
After MobileNetV2 feature extraction, two effective feature layers are obtained: one is the result of compressing the height and width of the input image twice, and the other is the result of compressing them four times.
Further, in step 2.1 the parameter count of the depthwise separable convolution in the ASPP module is
h·w·C_in + C_in·C_out
where C_in is the depth of the input tensor, C_out is the depth of the output tensor, and h and w are the height and width of the convolution kernel.
Further, in step 2.1 the CBAM attention module in the ASPP module is given by
M_c(F) = σ(W_1(W_0(F_avg)) + W_1(W_0(F_max)))
M_s(F) = σ(f^(7×7)([F_avg; F_max]))
where M_c(F) is the channel attention module, M_s(F) is the spatial attention module, F is the input feature, σ is the Sigmoid activation function, W_0 and W_1 are the parameters of the fully connected layers, F_avg and F_max are the features after average pooling and max pooling respectively, and f^(7×7) denotes a convolution with a 7×7 kernel; the spatial attention module attends to the feature map at the spatial level.
Preferably, in step 2 the SE attention module in the ASPP module is given by
z_c = F_sq(u_c) = (1/(W×H)) Σ_{i=1..W} Σ_{j=1..H} u_c(i, j)
where u is the feature map after convolution, C is the number of channels of u, and W×H is the spatial dimension of u.
Preferably, in step 2.2 the Focal Loss loss function is
FL(p_t) = -α_t (1 - p_t)^γ log(p_t), with CE(p, y) = CE(p_t) = -log(p_t)
where p is the predicted probability that a sample is positive, y is the ground-truth label, and p_t denotes the probability that the sample belongs to its true class.
Compared with the prior art, the invention has the following beneficial technical effects:
The invention provides a deep-learning-based method for real-time semantic segmentation of traffic scenes. The lightweight MobileNetV2 architecture is used to extract features, reducing computation. In the ASPP module, depthwise separable convolution replaces ordinary convolution to reduce the number of parameters, and CBAM together with the SE-attention channel attention mechanism is added to address the weak dependencies between different features during multi-scale feature fusion and to improve prediction accuracy. The cross-entropy loss function is replaced with the Focal Loss loss function to address the imbalance between positive and negative samples and between easy and hard samples. The improved network model was trained and tested on the PASCAL VOC data set; the results show that the improved DeepLabV3+ network model reaches a mean intersection over union (mIoU) of 80.04% and an accuracy of 92.8%, with the parameter count reduced to 6.13M. Compared with the original DeepLabV3+, mIoU and accuracy improve by 1.08% and 2.1% respectively, the parameter count is reduced by 48.58M, and the computation is reduced by 10.82 GFlops. The proposed network effectively improves segmentation accuracy, accelerates semantic segmentation, and reduces the computational load on the device.
Description of the Drawings
Figure 1 shows the inverted residual block.
Figure 2 shows the MobileNetV2 convolution block structure.
Figure 3 shows the architecture of the lightweight traffic scene semantic segmentation network based on the improved DeepLabV3+.
Figure 4 shows the mIoU curves of the original and improved models.
Figure 5 shows the mPA curves of the original and improved models.
Figure 6 shows the loss curves over 30,000 total iterations.
Figure 7 shows partial visualization results.
Detailed Description of the Embodiments
The invention is described in further detail below with reference to specific embodiments, which explain rather than limit the invention.
The deep-learning-based traffic scene semantic segmentation method provided by the invention comprises the following steps:
Step 1: obtain the file paths of the training images and their corresponding labels, read the images, and convert them to tensor format in order to build higher-dimensional matrices; resize the images and normalize them so that the preprocessed data lie in a fixed range such as [0, 1] or [-1, 1]. Data augmentation is then performed by randomly flipping the images (horizontally and vertically) and adding noise. Augmentation increases the amount of training data and improves the generalization ability of the model, while the added noise improves its robustness.
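The preprocessing described in step 1 can be sketched with torchvision transforms as follows. This is a minimal illustration rather than the exact pipeline of the invention: the 512×512 target size, the ImageNet normalization statistics, the noise standard deviation and the helper class AddGaussianNoise are assumptions, and in a real segmentation pipeline the geometric transforms would be applied jointly to the image and its label map.

```python
import torch
import torchvision.transforms as T

class AddGaussianNoise:
    """Add zero-mean Gaussian noise to a tensor image (the std value is an assumption)."""
    def __init__(self, std=0.01):
        self.std = std

    def __call__(self, x):
        return (x + torch.randn_like(x) * self.std).clamp(0.0, 1.0)

# Image branch only; the flips would also have to be applied to the label map.
train_transform = T.Compose([
    T.Resize((512, 512)),                    # resize to a fixed input size
    T.RandomHorizontalFlip(p=0.5),           # random horizontal flip
    T.RandomVerticalFlip(p=0.5),             # random vertical flip
    T.ToTensor(),                            # to a float tensor in [0, 1]
    AddGaussianNoise(std=0.01),              # noise injection for robustness
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel statistics (assumption)
                std=[0.229, 0.224, 0.225]),
])
```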
Step 2: construct the traffic scene semantic segmentation network, improve it on the basis of the baseline model, and train it with the Cityscapes data set to obtain the semantic segmentation network model proposed by the invention. The specific steps are as follows:
Step 2.1: the DeepLabV3+ network first extracts features from the input image through a deep convolutional neural network (DCNN), obtaining high-level and low-level semantic features. The high-level semantic features are fed into the ASPP module, which contains four atrous convolution layers with different dilation rates and one pooling layer; after the atrous convolutions and the pooling operation, five feature maps are obtained, yielding multi-scale information.
Step 2.2: input the images of the Cityscapes data set into the improved DeepLabV3+ network; after the DCNN, the corresponding low-level semantic features are obtained. To reduce the number of parameters in the convolution operations, depthwise separable convolution is introduced into the ASPP module, further compressing the model and increasing inference speed.
Step 2.3: to improve the accuracy of the model and highlight target features against abundant background information, an attention mechanism module is added so that the network can learn the important features of an image with only a small increase in time consumption and training parameters, striking a balance between accuracy and computational cost. On top of the original four branches of the ASPP module, a fifth branch is added so that different regions can be weighted and important information aggregated to produce the output. The fifth branch introduces the SE attention mechanism on the pooling layer; it focuses on channel information and refines the feature maps extracted from the backbone along the channel dimension, allowing the network to learn from information-rich channels and capture the relationships between feature channels effectively. After the SE attention, the CBAM attention mechanism is introduced so that the network learns more effective features when the positional information of the input features is important in both the channel and the spatial dimensions.
Step 2.4: after the improved ASPP module, a 1×1 convolution is applied to obtain multi-scale context information. Using multiple scales extracts more comprehensive information, including both global context and local detail.
Step 2.5: finally, one 3×3 convolution and 4× bilinear upsampling produce the output segmented image.
In some hard sample images, edge details are often segmented poorly and irregularly shaped objects are hard to classify correctly, which yields a large number of negative samples. In other words, the easy samples dominate the direction of the gradient update, the network's ability to learn useful information declines, and target objects cannot be segmented precisely. The invention replaces the cross-entropy loss function of the baseline model with the Focal Loss loss function to solve this problem: Focal Loss weakens the dominance of easy samples over the gradient update direction, prevents the network from learning a large amount of useless information, keeps the model from drifting toward classes with many samples, and alleviates class imbalance.
In step 2 of the invention, the encoder uses the lightweight MobileNetV2 architecture to extract features and reduce computation. MobileNetV2 adopts the residual idea of ResNet and combines high-dimensional features with the rectified linear unit (ReLU); its core is the inverted residual block shown in Figure 1. First, the number of channels of the low-dimensional input features is expanded six-fold to obtain high-dimensional features, then a depthwise separable convolution is applied; the ReLU activation of the last layer is removed and replaced with a linear projection layer, and low-dimensional features are finally output.
The MobileNetV2 convolution block structure is shown in Figure 2. When stride = 1 the input and output features have the same dimensions and a residual connection is introduced, whereas blocks with stride = 2 have no residual connection. The inverted residual structure of MobileNetV2 improves memory efficiency and, compared with MobileNetV1, extracts more features.
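A PyTorch sketch of the inverted residual block of Figures 1 and 2 is given below, assuming the six-fold channel expansion and the linear bottleneck described above; the specific channel counts and the use of ReLU6 follow the usual MobileNetV2 implementation and are not fixed by the description.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2 inverted residual: 1x1 expansion, 3x3 depthwise convolution,
    1x1 linear projection; residual connection only when stride is 1."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)   # cf. Figure 2
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),            # expand to high dimension
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),               # depthwise 3x3
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),           # linear bottleneck, no ReLU
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```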
In step 2 of the invention, depthwise separable convolution is introduced into the ASPP module. For a convolution layer whose input tensor has depth C_in and output tensor has depth C_out, with kernel height h and width w, the parameter count of the depthwise separable convolution is given by formula (1):
h·w·C_in + C_in·C_out   (1)
The ratio of the parameters of a depthwise separable convolution to those of a standard convolution is given by formula (2):
(h·w·C_in + C_in·C_out) / (h·w·C_in·C_out) = 1/C_out + 1/(h·w)   (2)
It follows that, under the same conditions, depthwise separable convolution effectively reduces the number of model parameters; the invention therefore uses depthwise separable convolution to improve the semantic segmentation model architecture.
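The parameter saving of formulas (1) and (2) can be checked directly in PyTorch. The helper depthwise_separable_conv below (a depthwise k×k convolution followed by a 1×1 pointwise convolution, with batch normalization and ReLU added as an assumption) is reused in the later sketches; the channel numbers in the comparison are arbitrary examples.

```python
import torch.nn as nn

def depthwise_separable_conv(c_in, c_out, k=3, dilation=1):
    """Depthwise k x k convolution (one filter per input channel) followed by a
    1x1 pointwise convolution; BN/ReLU are added here as a common assumption."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=dilation * (k // 2),
                  dilation=dilation, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

def count_params(m):
    return sum(p.numel() for p in m.parameters())

# Example with arbitrary channel numbers: k*k*C_in*C_out vs. k*k*C_in + C_in*C_out.
c_in, c_out, k = 256, 256, 3
standard = count_params(nn.Conv2d(c_in, c_out, k, padding=1, bias=False))
separable = (count_params(nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False))
             + count_params(nn.Conv2d(c_in, c_out, 1, bias=False)))
print(standard, separable, separable / standard)   # ratio ~ 1/C_out + 1/(k*k)
```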
In step 2 of the invention, a CBAM attention module (Convolutional Block Attention Module) is added, as shown in formulas (3) and (4):
M_c(F) = σ(W_1(W_0(F_avg)) + W_1(W_0(F_max)))   (3)
M_s(F) = σ(f^(7×7)([F_avg; F_max]))   (4)
where M_c(F) is the channel attention module, M_s(F) is the spatial attention module, F is the input feature, σ is the Sigmoid activation function, W_0 and W_1 are the parameters of the fully connected layers, F_avg and F_max are the features after average pooling and max pooling respectively, and f^(7×7) denotes a convolution with a 7×7 kernel. The spatial attention module attends to the feature map at the spatial level.
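A sketch of the CBAM module corresponding to formulas (3) and (4): a shared two-layer MLP over average- and max-pooled channel descriptors for M_c(F), and a 7×7 convolution over concatenated channel-wise average and max maps for M_s(F). The reduction ratio of 16 is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c(F): shared MLP (W_0, W_1) over average- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),   # W_0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),   # W_1
        )

    def forward(self, x):
        return torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))

class SpatialAttention(nn.Module):
    """M_s(F): 7x7 convolution over concatenated channel-wise average and max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # channel refinement
        return x * self.sa(x)   # spatial refinement
```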
In step 2 of the invention, an SE attention module (Squeeze-and-Excitation Module, SE Module) is added. The squeeze operation is given by formula (5):
z_c = F_sq(u_c) = (1/(W×H)) Σ_{i=1..W} Σ_{j=1..H} u_c(i, j)   (5)
where u is the feature map after convolution, C is the number of channels of u, and W×H is the spatial dimension of u.
Next, two fully connected (FC) layers and an excitation operation are used to obtain the information of the compressed real-valued vector, which increases the nonlinearity of the module, as shown in formula (6):
s = σ[W_2 δ(W_1 z)]   (6)
where W_1 and W_2 are the parameters of the two fully connected layers, δ is the nonlinear ReLU activation function, and σ is the Sigmoid activation function.
Finally, the original feature map is multiplied channel by channel by the channel weight coefficients obtained from the excitation operation to produce the weighted features, as shown in formula (7):
x̃_c = F_scale(u_c, s_c) = s_c · u_c   (7)
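Formulas (5) to (7) correspond to the following SE module sketch: global average pooling for the squeeze, two fully connected layers with ReLU and Sigmoid for the excitation, and a channel-wise rescaling of the input feature map. The reduction ratio is an assumption.

```python
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation: formula (5) squeeze, formula (6) excitation,
    formula (7) channel-wise rescaling; the reduction ratio is an assumption."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)                        # z_c
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),   # W_1
            nn.ReLU(inplace=True),                                    # delta
            nn.Linear(channels // reduction, channels, bias=False),   # W_2
            nn.Sigmoid(),                                             # sigma
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)
        s = self.excite(z).view(b, c, 1, 1)   # channel weight coefficients s
        return u * s                           # x~_c = s_c * u_c
```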
In step 3 of the invention, the Focal Loss loss function is used to address the imbalance between positive and negative samples. Formula (8) is the binary cross-entropy loss, where p is the predicted probability that a sample is positive and y is the ground-truth label. For convenience, p_t denotes the probability that the sample belongs to its true class, as defined in formula (9). From formulas (8) and (9), the CE function takes the form of formula (10), from which the Focal Loss loss function is obtained, as shown in formulas (11) and (12).
CE(p, y) = -log(p) if y = 1, and -log(1 - p) otherwise   (8)
p_t = p if y = 1, and 1 - p otherwise   (9)
CE(p, y) = CE(p_t) = -log(p_t)   (10)
FL(p_t) = -(1 - p_t)^γ log(p_t)   (11)
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)   (12)
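A multi-class form of the Focal Loss suitable for segmentation logits can be sketched as below; per pixel it reduces to formula (12). The values alpha = 0.25, gamma = 2.0 and the ignore_index convention are common defaults and are assumptions here rather than values fixed by the description.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0, ignore_index=255):
    """Per-pixel FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t) for segmentation.

    logits: (N, C, H, W) raw scores, target: (N, H, W) class indices.
    """
    ce = F.cross_entropy(logits, target, reduction="none",
                         ignore_index=ignore_index)   # -log(p_t) per pixel
    pt = torch.exp(-ce)                               # p_t of the true class
    loss = alpha * (1.0 - pt) ** gamma * ce
    valid = (target != ignore_index).float()
    return (loss * valid).sum() / valid.sum().clamp(min=1.0)
```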
The invention uses MobileNetV2 as the backbone feature-extraction network, reducing the parameter count by 48.58M and the computation by 10.82 GFlops, and applies depthwise separable convolution to the ASPP module. The attention mechanisms are then used to improve segmentation accuracy, and on this basis the Focal Loss loss function addresses the imbalance between positive and negative samples. The proposed network effectively improves segmentation accuracy, accelerates semantic segmentation, and reduces the computational load on the device.
A specific embodiment is described below.
The data set used for training is the finely annotated semantic segmentation portion of Cityscapes, which contains a training set, a validation set and a test set. Every image in the data set has corresponding annotations, including instance segmentation and semantic segmentation annotations; the invention uses only the semantic segmentation annotations. The training set contains 2,975 images and the validation set contains 500 images.
The invention uses PyTorch as the deep learning framework and Python 3.7 as the programming language, running on Windows 10; the effectiveness of the method was verified experimentally. The hardware was an Intel Xeon(R) Gold 6226R at 2.90 GHz with 32 GB of memory and an Nvidia GRID RTX8000P-24Q GPU with 24 GB of video memory. During training, the improved network uses the same hyperparameters as the original network: the initial learning rate is 7e-3, the weight decay is 1e-4, the number of epochs is 100, the SGD optimization algorithm is used with momentum 0.9, and the loss function is the Focal Loss function.
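The training configuration above translates into roughly the following loop; model and train_loader are assumed to be defined elsewhere, focal_loss is the sketch given earlier, and the polynomial learning-rate decay over 30,000 iterations is an assumption rather than a stated part of the method.

```python
import torch

model = model.cuda()                                   # assumed to be defined elsewhere
optimizer = torch.optim.SGD(model.parameters(), lr=7e-3,
                            momentum=0.9, weight_decay=1e-4)
# Polynomial decay over 30,000 iterations (an assumption, see above).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: max(0.0, 1.0 - it / 30000) ** 0.9)

for epoch in range(100):
    for images, labels in train_loader:                # assumed DataLoader
        images, labels = images.cuda(), labels.cuda()
        logits = model(images)
        loss = focal_loss(logits, labels)              # Focal Loss sketched above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```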
The method is implemented in the following steps:
Step 1: replace the baseline model (the DeeplabV3+ network with an Xception backbone) with a DeeplabV3+ network whose backbone is MobileNetV2. The MobileNetV2 backbone first uses a 1×1 convolution to raise the dimension, then a 3×3 depthwise separable convolution for feature extraction, and finally a 1×1 convolution to reduce the dimension; the input and output of its residual part are connected directly. After MobileNetV2 feature extraction, two effective feature layers are obtained: one is the result of compressing the height and width of the input image twice, and the other is the result of compressing them four times.
Step 2: on the basis of step 1, to reduce the parameter count and computation of ordinary convolution, depthwise separable convolution is introduced into the first four branches of the ASPP module. First, 2D convolutions are applied to the feature maps of the different channels using as many kernels as there are channels, so that the extracted features are independent along the channel direction; a pointwise convolution with a 1×1 kernel then performs feature extraction along the channel direction on the feature maps produced by the depthwise convolution.
Step 3: on the basis of step 2, so that different regions can be weighted and important information aggregated to produce the output, a fifth branch is added to the original four. The fifth branch introduces the SE attention mechanism on the pooling layer; it focuses on channel information and refines the feature maps extracted from the backbone along the channel dimension, allowing the network to learn from information-rich channels and capture the relationships between feature channels effectively. After the SE attention, the CBAM attention mechanism is introduced so that the network learns more effective features when the positional information of the input features is important in both the channel and the spatial dimensions.
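Steps 2 and 3 together suggest an ASPP variant that can be sketched as follows, reusing the depthwise_separable_conv, SEModule and CBAM helpers from the earlier sketches. The dilation rates (6, 12, 18), the 256 output channels and the decision to keep the first branch as a plain 1×1 convolution follow the standard DeepLabV3+ configuration and are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImprovedASPP(nn.Module):
    """ASPP with depthwise-separable atrous branches, an SE-refined image-pooling
    branch, and CBAM applied after fusion (cf. steps 2 and 3 above)."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branch1 = nn.Sequential(                   # 1x1 branch
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.atrous = nn.ModuleList(
            [depthwise_separable_conv(in_ch, out_ch, k=3, dilation=r) for r in rates])
        self.pool_branch = nn.Sequential(               # 5th branch: image pooling
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.se = SEModule(out_ch)                      # SE on the pooling branch
        self.project = nn.Sequential(                   # fuse the five branches
            nn.Conv2d(5 * out_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.cbam = CBAM(out_ch)                        # CBAM after SE and fusion

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.branch1(x)] + [branch(x) for branch in self.atrous]
        pooled = self.pool_branch(x)
        pooled = F.interpolate(pooled, size=(h, w), mode="bilinear", align_corners=False)
        feats.append(self.se(pooled))
        return self.cbam(self.project(torch.cat(feats, dim=1)))
```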
Step 4: because the Cityscapes data set has a relatively small number of training samples and, in complex scenes, neither the number of instances per class nor the 21 target categories of the learning data set are uniformly distributed, the Focal Loss loss function replaces the cross-entropy loss function in DeepLabv3+ to balance the class imbalance of the Cityscapes data set.
Figures 1 and 2 show the MobileNetV2 inverted residual block and convolution block structure respectively.
Figure 3 shows the improved DeepLabv3+ network. The image is fed into the network and, after the DCNN, the corresponding low-level semantic features are obtained. To reduce the number of parameters in the convolution operations, depthwise separable convolution is introduced into the ASPP module. In addition, to improve the accuracy of the model and highlight target features against abundant background information, an attention mechanism module is added so that the network can learn the important features of an image with only a small increase in time consumption and training parameters, balancing accuracy against computational cost. After the improved ASPP module, a 1×1 convolution is applied to obtain multi-scale context information; the Decoder module then performs the decoding operation to produce the semantic segmentation result. The Focal Loss function is used to address the imbalance between positive and negative samples, which improves the segmentation accuracy of the network.
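For completeness, the overall encoder-decoder assembly suggested by Figure 3 might look like the sketch below. The backbone is assumed to return a low-level feature map (stride 4) and a high-level feature map (stride 16); the channel widths 24 and 320 are typical MobileNetV2 values and, like the 48-channel low-level projection and the class count, are assumptions rather than values fixed by the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightDeepLabV3Plus(nn.Module):
    """DeepLabV3+ with a MobileNetV2 encoder and the ImprovedASPP sketched above."""
    def __init__(self, backbone, low_ch=24, high_ch=320, num_classes=19):
        super().__init__()
        self.backbone = backbone                      # returns (low-level, high-level) features
        self.aspp = ImprovedASPP(high_ch, 256)
        self.reduce_low = nn.Sequential(              # 1x1 conv on the low-level features
            nn.Conv2d(low_ch, 48, 1, bias=False),
            nn.BatchNorm2d(48),
            nn.ReLU(inplace=True),
        )
        self.fuse = nn.Sequential(                    # 3x3 conv after concatenation
            nn.Conv2d(256 + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        low, high = self.backbone(x)                  # stride-4 and stride-16 feature maps
        a = self.aspp(high)
        a = F.interpolate(a, size=low.shape[2:], mode="bilinear", align_corners=False)
        y = self.fuse(torch.cat([self.reduce_low(low), a], dim=1))
        y = self.classifier(y)
        # 4x bilinear upsampling back to the input resolution
        return F.interpolate(y, size=x.shape[2:], mode="bilinear", align_corners=False)
```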
Figure 4 shows the mean intersection-over-union (mIoU) curves of the original and improved models; mIoU rises from 78.96% to 80.04%. The improved DeepLabV3+ network was trained and validated on the Cityscapes data set to obtain the mIoU curve; the final model improves mIoU by 1.08% over the baseline model.
Figure 5 shows the mean pixel accuracy (mPA) curves of the original and improved models. The improved DeepLabV3+ network was trained and validated on the Cityscapes data set to obtain the mPA curve; the final model improves mPA by 2.1% over the baseline model.
Figure 6 shows the loss curves over 30,000 total iterations from the ablation experiments on the loss function, the feature-extraction network and the attention mechanisms; the Focal Loss loss function converges better. After Focal Loss replaced the cross-entropy loss, mIoU rose to 80.04% while the accuracy of image segmentation was maintained. The average training loss dropped from 0.0925 to 0.0185 and the average val_loss from 0.1078 to 0.1074, showing that the Focal Loss loss function has a clear effect on the imbalance between positive and negative samples.
Figure 7 shows some visualization results on the Cityscapes data set; from left to right are the original image, the segmentation result before improvement, and the segmentation result after improvement. The model of the invention is noticeably more accurate in its details than the baseline model.
The deep-learning-based traffic scene semantic segmentation method of the invention improves the DeepLabV3+ network in three respects: (1) in the encoder, the lightweight MobileNetV2 architecture extracts features and reduces computation, and in the ASPP module depthwise separable convolution replaces ordinary convolution to reduce the parameter count; (2) CBAM and the SE-attention channel attention mechanism are added to address the weak dependencies between different features during multi-scale feature fusion and improve prediction accuracy; (3) the cross-entropy loss function is replaced with the Focal Loss loss function to address the imbalance between positive and negative samples and between easy and hard samples. The improved network model was trained and tested on the PASCAL VOC data set; the results show that the improved DeepLabV3+ network model reaches a mean intersection over union (mIoU) of 80.04% and an accuracy of 92.8%, with the parameter count reduced to 6.13M. Compared with the original DeepLabV3+, mIoU and accuracy improve by 1.08% and 2.1% respectively, the parameter count is reduced by 48.58M, and the computation is reduced by 10.82 GFlops.
Specific embodiments of the invention are given above. It should be noted that the invention is not limited to these specific embodiments; all equivalent transformations made on the basis of the solution of this application fall within the protection scope of the invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310841112.9A CN117079277A (en) | 2023-07-10 | 2023-07-10 | Traffic scene real-time semantic segmentation method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117079277A (en) | 2023-11-17 |
Family
ID=88716028
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310841112.9A Pending CN117079277A (en) | 2023-07-10 | 2023-07-10 | Traffic scene real-time semantic segmentation method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117079277A (en) |
- 2023-07-10: application CN202310841112.9A filed (CN); published as CN117079277A (en); status: active, Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118735951A (en) * | 2024-06-26 | 2024-10-01 | 武汉工程大学 | A method and system for multiphase material image segmentation |
CN118710928A (en) * | 2024-08-29 | 2024-09-27 | 山东科技大学 | A Transformer-based Deeplabv3+ underwater laser stripe center extraction method |
CN119339352A (en) * | 2024-10-14 | 2025-01-21 | 山东高速集团有限公司创新研究院 | A method for detecting missing road traffic signs based on vehicle-mounted forward-looking images |
CN119339352B (en) * | 2024-10-14 | 2025-04-15 | 山东高速集团有限公司创新研究院 | Road traffic sign board missing detection method based on vehicle-mounted front view image |
CN119810084A (en) * | 2025-01-06 | 2025-04-11 | 浙江大学 | A bridge disease detection and positioning method and device based on 2D-3D data fusion |
CN119477923A (en) * | 2025-01-16 | 2025-02-18 | 合肥瑞石测控工程技术有限公司 | A method for extracting the surface temperature of the furnace tube of an automatic charring ethylene cracking furnace |
CN119477923B (en) * | 2025-01-16 | 2025-04-15 | 合肥瑞石测控工程技术有限公司 | Method for extracting surface temperature of furnace tube of automatic burnt ethylene cracking furnace |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111563508B (en) | Semantic segmentation method based on spatial information fusion | |
CN110188705B (en) | Remote traffic sign detection and identification method suitable for vehicle-mounted system | |
WO2021218786A1 (en) | Data processing system, object detection method and apparatus thereof | |
CN112183203B (en) | Real-time traffic sign detection method based on multi-scale pixel feature fusion | |
CN117079277A (en) | Traffic scene real-time semantic segmentation method based on deep learning | |
Zhou et al. | Self-attention feature fusion network for semantic segmentation | |
CN111460919B (en) | Monocular vision road target detection and distance estimation method based on improved YOLOv3 | |
CN110147794A (en) | A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning | |
CN116503602A (en) | Unstructured environment three-dimensional point cloud semantic segmentation method based on multi-level edge enhancement | |
CN111401436B (en) | Streetscape image segmentation method fusing network and two-channel attention mechanism | |
CN117079276B (en) | Semantic segmentation method, system, equipment and medium based on knowledge distillation | |
CN113505640B (en) | A small-scale pedestrian detection method based on multi-scale feature fusion | |
CN112819000A (en) | Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium | |
CN110781744A (en) | A small-scale pedestrian detection method based on multi-level feature fusion | |
Shojaiee et al. | EFASPP U-Net for semantic segmentation of night traffic scenes using fusion of visible and thermal images | |
CN113436210B (en) | A Road Image Segmentation Method Based on Context Sampling | |
CN114973199A (en) | Rail transit train obstacle detection method based on convolutional neural network | |
CN116630702A (en) | Pavement adhesion coefficient prediction method based on semantic segmentation network | |
CN115375781A (en) | Data processing method and device | |
CN113255574B (en) | Urban street semantic segmentation method and automatic driving method | |
CN114764856A (en) | Image semantic segmentation method and image semantic segmentation device | |
CN115115831A (en) | A Semantic Segmentation Method Based on Attention-Guided Interaction of Multi-scale Contextual Information | |
CN117710972A (en) | Lightweight semantic segmentation method and system based on dual-branch multi-scale feature fusion | |
CN114648698A (en) | Improved 3D target detection system based on PointPillars | |
CN116051977A (en) | Multi-branch fusion-based lightweight foggy weather street view semantic segmentation algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |