CN109635642A - A kind of road scene dividing method based on residual error network and expansion convolution - Google Patents
A kind of road scene dividing method based on residual error network and expansion convolution
- Publication number
- CN109635642A (application CN201811293377.5A)
- Authority
- CN
- China
- Prior art keywords
- residual block
- road scene
- convolution layer
- semantic segmentation
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene
- G06V20/38—Outdoor scenes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a road scene segmentation method based on a residual network and dilated convolution. In the training stage, a convolutional neural network is constructed whose hidden layer is formed by 10 sequentially arranged Residual blocks; each original road scene image in the training set is input into the convolutional neural network for training, and the 12 semantic segmentation prediction maps corresponding to each original road scene image are obtained; by calculating the loss function value between the set formed by the 12 semantic segmentation prediction maps corresponding to each original road scene image and the set formed by the 12 one-hot coded images obtained from the corresponding real semantic segmentation image, the optimal weight vector of the convolutional neural network classification training model is obtained. In the testing stage, a prediction is made using the optimal weight vector of the convolutional neural network classification training model to obtain the predicted semantic segmentation image corresponding to the road scene image to be semantically segmented. Advantages are low computational complexity, high segmentation efficiency, high segmentation accuracy and good robustness.
Description
Technical Field
The invention relates to semantic segmentation technology based on deep learning, and in particular to a road scene segmentation method based on a residual network and dilated convolution.
Background
Deep learning is a branch of artificial neural network research, and artificial neural networks with deep network structures are the earliest network models for deep learning. Originally, deep learning was applied mainly in the image and speech domains. Since 2006, interest in deep learning has risen continuously in academia; deep learning and neural networks are now widely used in semantic segmentation, computer vision, speech recognition and tracking, and their high efficiency also gives them great potential in real-time applications and other areas.
Convolutional neural networks have been successful in image classification, localization, and scene understanding. With the proliferation of tasks such as augmented reality and autonomous driving of vehicles, many researchers have turned their attention to scene understanding, where one of the main steps is semantic segmentation, i.e., classification of each pixel in a given image. Semantic segmentation has important implications in mobile and robot related applications.
The semantic segmentation problem plays an important role in many application scenarios, such as image understanding and automatic driving, and has therefore recently attracted much attention in both academia and industry. Classical semantic segmentation methods include the Fully Convolutional Network (FCN) and the encoder-decoder network SegNet, which achieve good pixel accuracy, mean pixel accuracy and mean intersection-over-union on road scene segmentation databases. However, one disadvantage of the FCN is that the pooling layers make the response tensor (its length and width) smaller and smaller, while the FCN is required to produce an output of the same size as the input; the FCN therefore performs upsampling, but upsampling cannot recover all of the lost information. SegNet is a network model built on the basis of the FCN, yet it does not adequately address the information-loss problem either. Information loss therefore limits the segmentation accuracy of these methods and reduces their robustness.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a road scene segmentation method based on a residual network and dilated convolution that has low computational complexity, high segmentation efficiency, high segmentation accuracy and good robustness.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a road scene segmentation method based on a residual network and dilated convolution, characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting Q original road scene images and the real semantic segmentation image corresponding to each original road scene image to form a training set, recording the q-th original road scene image in the training set as {I_q(i,j)}, and recording the real semantic segmentation image in the training set corresponding to {I_q(i,j)} as {I_q^true(i,j)}; then processing the real semantic segmentation image corresponding to each original road scene image in the training set into 12 one-hot coded images by using a one-hot coding technique, and recording the set of the 12 one-hot coded images obtained from {I_q^true(i,j)} as J_q; wherein the road scene images are RGB color images, Q is a positive integer, Q ≥ 100, q is a positive integer, 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W denotes the width of {I_q(i,j)}, H denotes the height of {I_q(i,j)}, I_q(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {I_q(i,j)}, and I_q^true(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {I_q^true(i,j)};
step 1_2: constructing a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer is composed of 10 sequentially arranged Residual blocks, and every convolution layer in each Residual block is a dilated convolution layer formed by setting its dilation rate, the dilation rates of the 1st to 10th Residual blocks being 1, 1, 2, 2, 4, 4, 2, 2, 1 and 1 respectively;
for an input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original input image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original input image to a hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be W, and the height of the original input image is required to be H;
for the 1st Residual block, its input end receives the R channel component, the G channel component and the B channel component of the original input image output by the output end of the input layer, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R1; wherein each feature map in R1 has a width of W and a height of H;
for the 2nd Residual block, its input end receives all the feature maps in R1, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R2; wherein each feature map in R2 has a width of W and a height of H;
for the 3rd Residual block, its input end receives all the feature maps in R2, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R3; wherein each feature map in R3 has a width of W and a height of H;
for the 4th Residual block, its input end receives all the feature maps in R3, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R4; wherein each feature map in R4 has a width of W and a height of H;
for the 5th Residual block, its input end receives all the feature maps in R4, and its output end outputs 128 feature maps, the set formed by these 128 feature maps being recorded as R5; wherein each feature map in R5 has a width of W and a height of H;
for the 6th Residual block, its input end receives all the feature maps in R5, and its output end outputs 128 feature maps, the set formed by these 128 feature maps being recorded as R6; wherein each feature map in R6 has a width of W and a height of H;
for the 7th Residual block, its input end receives all the feature maps in R6, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R7; wherein each feature map in R7 has a width of W and a height of H;
for the 8th Residual block, its input end receives all the feature maps in R7, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R8; wherein each feature map in R8 has a width of W and a height of H;
for the 9th Residual block, its input end receives all the feature maps in R8, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R9; wherein each feature map in R9 has a width of W and a height of H;
for the 10th Residual block, its input end receives all the feature maps in R9, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R10; wherein each feature map in R10 has a width of W and a height of H;
for the output layer, which consists of 1 convolution layer, its input end receives all the feature maps in R10, and its output end outputs 12 semantic segmentation prediction maps corresponding to the original input image; wherein each semantic segmentation prediction map has a width of W and a height of H;
step 1_3: taking each original road scene image in the training set as an original input image and inputting it into the convolutional neural network for training, obtaining the 12 semantic segmentation prediction maps corresponding to each original road scene image in the training set; the set formed by the 12 semantic segmentation prediction maps corresponding to {I_q(i,j)} is recorded as P_q;
step 1_4: calculating the loss function value between the set formed by the 12 semantic segmentation prediction maps corresponding to each original road scene image in the training set and the set formed by the 12 one-hot coded images obtained from the corresponding real semantic segmentation image; the loss function value between P_q and J_q is recorded as Loss_q;
step 1_5: repeatedly executing step 1_3 and step 1_4 for V times to obtain a convolutional neural network classification training model and Q × V loss function values; then finding the loss function value with the minimum value among the Q × V loss function values; then taking the weight vector and the bias term corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model, recorded as Wbest and bbest respectively; wherein V > 1;
the test stage process comprises the following specific steps:
step 2_1: let {I^test(i',j')} denote the road scene image to be semantically segmented; wherein 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {I^test(i',j')}, H' denotes the height of {I^test(i',j')}, and I^test(i',j') denotes the pixel value of the pixel at coordinate position (i',j') in {I^test(i',j')};
step 2_2: inputting the R channel component, the G channel component and the B channel component of {I^test(i',j')} into the convolutional neural network classification training model, and making a prediction based on Wbest and bbest to obtain the predicted semantic segmentation image corresponding to {I^test(i',j')}, recorded as {I^pred(i',j')}; wherein I^pred(i',j') denotes the pixel value of the pixel at coordinate position (i',j') in {I^pred(i',j')}.
In step 1_4, Loss_q is obtained using categorical cross entropy.
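For reference, the categorical cross entropy named above can be written out explicitly; the following formulation, with c indexing the 12 classes and the sums running over all W × H pixel positions, is the standard definition rather than an equation quoted from the patent:

$$\mathrm{Loss}_q = -\frac{1}{W H}\sum_{i=1}^{W}\sum_{j=1}^{H}\sum_{c=1}^{12} J_q^{(c)}(i,j)\,\log P_q^{(c)}(i,j)$$

where $J_q^{(c)}(i,j)$ is the one-hot ground-truth value of class $c$ at pixel $(i,j)$ and $P_q^{(c)}(i,j)$ is the corresponding predicted probability from the c-th semantic segmentation prediction map.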
Compared with the prior art, the invention has the advantages that:
1) In the process of constructing the convolutional neural network, the method introduces the Residual blocks with identity shortcut connections used in ResNet (residual network), and the hidden layer of the convolutional neural network is formed by stacking 10 such Residual blocks; this arrangement strengthens the extraction of feature information and fully exploits the structural efficiency of the basic residual network module, thereby improving the prediction accuracy of the trained convolutional neural network classification training model.
2) The hidden layer of the convolutional neural network constructed by the method uses only 10 Residual blocks, which greatly reduces the cost caused by redundancy, large data volumes and similar problems, so the computational complexity is low; each Residual block uses dilated convolution layers formed by setting the dilation rate of its convolution layers, and dilated convolution avoids the information lost during size conversion, enlarges the receptive field while keeping the resolution of the feature maps unchanged, and retains effective depth information to the greatest extent, so the semantic segmentation prediction maps obtained in the training stage and the predicted semantic segmentation images obtained in the testing stage have high resolution, accurate boundaries and good spatial continuity.
3) The Residual blocks adopted by the method not only greatly strengthen feature extraction but also prevent overfitting of the model, giving the method strong robustness and greatly improving segmentation efficiency.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a schematic diagram of the structure of a convolutional neural network created by the method of the present invention;
FIG. 3a is a selected road scene image to be semantically segmented;
FIG. 3b is a real semantic segmentation image corresponding to the road scene image to be semantically segmented shown in FIG. 3 a;
FIG. 3c is a predicted semantic segmentation image obtained by predicting the road scene image to be semantically segmented shown in FIG. 3a by using the method of the present invention;
FIG. 4a is another selected road scene image to be semantically segmented;
FIG. 4b is a real semantic segmentation image corresponding to the road scene image to be semantically segmented shown in FIG. 4 a;
fig. 4c is a predicted semantic segmentation image obtained by predicting the road scene image to be semantically segmented shown in fig. 4a by using the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The general implementation block diagram of the road scene segmentation method based on the residual error network and the dilation convolution is shown in fig. 1, and the road scene segmentation method comprises a training stage and a testing stage.
The specific steps of the training phase process are as follows:
step 1_1: selecting Q original road scene images and the real semantic segmentation image corresponding to each original road scene image to form a training set, recording the q-th original road scene image in the training set as {I_q(i,j)}, and recording the real semantic segmentation image in the training set corresponding to {I_q(i,j)} as {I_q^true(i,j)}; then processing the real semantic segmentation image corresponding to each original road scene image in the training set into 12 one-hot coded images by using the existing one-hot encoding technique, and recording the set of the 12 one-hot coded images obtained from {I_q^true(i,j)} as J_q; wherein the road scene images are RGB color images, Q is a positive integer with Q ≥ 100, for example Q = 100, q is a positive integer with 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W denotes the width of {I_q(i,j)} and H denotes the height of {I_q(i,j)}, for example W = 352 and H = 480, I_q(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {I_q(i,j)}, and I_q^true(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {I_q^true(i,j)}; here, the original road scene images are 100 images taken directly from the training set of the CamVid road scene image database.
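As an illustration of this one-hot encoding step, the following Python sketch converts an integer label map into 12 binary images; the function name and the assumption that the CamVid ground truth is stored as per-pixel class indices in [0, 11] are mine and are not taken from the patent.

```python
import numpy as np

def to_one_hot(label_map, num_classes=12):
    """Convert an (H, W) integer label map into num_classes binary images.

    Assumes label values are integers in [0, num_classes - 1].
    Returns an array of shape (H, W, num_classes) with 0/1 entries.
    """
    h, w = label_map.shape
    one_hot = np.zeros((h, w, num_classes), dtype=np.float32)
    for c in range(num_classes):
        one_hot[:, :, c] = (label_map == c).astype(np.float32)
    return one_hot

# Example: a 480 x 352 CamVid-style label map with 12 classes
labels = np.random.randint(0, 12, size=(480, 352))
encoded = to_one_hot(labels)              # shape (480, 352, 12)
assert encoded.sum(axis=-1).max() == 1.0  # exactly one class per pixel
```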
Step 1_ 2: constructing a convolutional neural network: as shown in fig. 2, the convolutional neural network includes an input layer, a hidden layer, and an output layer; the hidden layer is composed of 10 sequentially arranged Residual blocks, wherein each convolution layer in the 1 st Residual block forms an expanded convolution layer by setting an expansion ratio (1), each convolution layer in the 2 nd Residual block forms an expanded convolution layer by setting an expansion ratio (1), each convolution layer in the 3 rd Residual block forms an expanded convolution layer by setting an expansion ratio (2), each convolution layer in the 4 th Residual block forms an expanded convolution layer by setting an expansion ratio (2), each convolution layer in the 5 th Residual block forms an expanded convolution layer by setting an expansion ratio (4), each convolution layer in the 6 th Residual block forms an expanded convolution layer by setting an expansion ratio (4), each convolution layer in the 7 th Residual block forms a convolution layer by setting an expansion ratio (2), each expanded convolution layer in the 8 th Residual block forms an expanded convolution layer by setting an expansion ratio (2), each convolution layer in the 9 th Residual block forms an expanded convolution layer by setting the expansion ratio to 1, each convolution layer in the 10 th Residual block forms an expanded convolution layer by setting the expansion ratio to 1, and the convolution kernel size of the expanded convolution layer in the 10 Residual blocks remains unchanged and is 3 x 3.
For an input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original input image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original input image to a hidden layer; wherein the input end of the input layer is required to receive the original input image with width W and height H.
For the 1st Residual block, its input end receives the R channel component, the G channel component and the B channel component of the original input image output by the output end of the input layer, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R1; each feature map in R1 has a width of W and a height of H.
For the 2nd Residual block, its input end receives all the feature maps in R1, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R2; each feature map in R2 has a width of W and a height of H.
For the 3rd Residual block, its input end receives all the feature maps in R2, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R3; each feature map in R3 has a width of W and a height of H.
For the 4th Residual block, its input end receives all the feature maps in R3, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R4; each feature map in R4 has a width of W and a height of H.
For the 5th Residual block, its input end receives all the feature maps in R4, and its output end outputs 128 feature maps, the set formed by these 128 feature maps being recorded as R5; each feature map in R5 has a width of W and a height of H.
For the 6th Residual block, its input end receives all the feature maps in R5, and its output end outputs 128 feature maps, the set formed by these 128 feature maps being recorded as R6; each feature map in R6 has a width of W and a height of H.
For the 7th Residual block, its input end receives all the feature maps in R6, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R7; each feature map in R7 has a width of W and a height of H.
For the 8th Residual block, its input end receives all the feature maps in R7, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R8; each feature map in R8 has a width of W and a height of H.
For the 9th Residual block, its input end receives all the feature maps in R8, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R9; each feature map in R9 has a width of W and a height of H.
For the 10th Residual block, its input end receives all the feature maps in R9, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R10; each feature map in R10 has a width of W and a height of H.
For the output layer, which consists of 1 convolution layer, its input end receives all the feature maps in R10, and its output end outputs 12 semantic segmentation prediction maps corresponding to the original input image; each semantic segmentation prediction map has a width of W and a height of H.
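Since the patent gives no code, the following Keras sketch (the embodiment mentions Keras 2.1.5) shows one way the dilated Residual blocks and the layer stack described above could be realised; the helper names, the use of batch normalization and ReLU, the 1×1 shortcut projection when the channel count changes, the 1×1 kernel of the output convolution and the softmax activation are all assumptions, not details stated in the patent.

```python
from keras import backend as K
from keras.layers import Input, Conv2D, BatchNormalization, Activation, Add
from keras.models import Model

def dilated_residual_block(x, filters, dilation_rate):
    """One Residual block whose 3x3 convolutions use the given dilation rate.

    'same' padding keeps the feature-map width W and height H unchanged,
    as required by the layer-by-layer description above.
    """
    shortcut = x
    y = Conv2D(filters, (3, 3), padding='same', dilation_rate=dilation_rate)(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same', dilation_rate=dilation_rate)(y)
    y = BatchNormalization()(y)
    # Project the shortcut with a 1x1 convolution when the channel count changes
    # (assumption; the patent only states the number of output feature maps per block).
    if K.int_shape(shortcut)[-1] != filters:
        shortcut = Conv2D(filters, (1, 1), padding='same')(shortcut)
    y = Add()([y, shortcut])
    return Activation('relu')(y)

def build_segmentation_network(height=480, width=352, num_classes=12):
    """Ten dilated Residual blocks (rates 1,1,2,2,4,4,2,2,1,1) plus one output convolution."""
    inputs = Input(shape=(height, width, 3))  # R, G and B channel components
    filters_per_block = [32, 32, 64, 64, 128, 128, 64, 64, 32, 32]
    dilation_rates = [1, 1, 2, 2, 4, 4, 2, 2, 1, 1]
    x = inputs
    for filters, rate in zip(filters_per_block, dilation_rates):
        x = dilated_residual_block(x, filters, rate)
    # Output layer: one convolution layer producing 12 semantic segmentation
    # prediction maps of width W and height H.
    outputs = Conv2D(num_classes, (1, 1), padding='same', activation='softmax')(x)
    return Model(inputs=inputs, outputs=outputs)

model = build_segmentation_network()
```

With 'same' padding and no pooling, every block keeps the feature maps at the full W × H resolution, which is the property the description attributes to the dilated convolutions.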
Step 1_3: taking each original road scene image in the training set as an original input image and inputting it into the convolutional neural network for training, obtaining the 12 semantic segmentation prediction maps corresponding to each original road scene image in the training set; the set formed by the 12 semantic segmentation prediction maps corresponding to {I_q(i,j)} is recorded as P_q.
Step 1_4: calculating the loss function value between the set formed by the 12 semantic segmentation prediction maps corresponding to each original road scene image in the training set and the set formed by the 12 one-hot coded images obtained from the corresponding real semantic segmentation image; the loss function value between P_q and J_q is recorded as Loss_q and is obtained using categorical cross entropy.
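A minimal sketch of attaching this loss to the model defined above; the choice of the Adam optimizer and its learning rate are illustrative assumptions and are not specified in the patent.

```python
from keras.optimizers import Adam

# Categorical cross entropy compares the 12 predicted probability maps with the
# 12 one-hot coded ground-truth images pixel by pixel.
model.compile(optimizer=Adam(lr=1e-4), loss='categorical_crossentropy')
```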
Step 1_5: repeatedly executing step 1_3 and step 1_4 for V times to obtain a convolutional neural network classification training model and Q × V loss function values; then finding the loss function value with the minimum value among the Q × V loss function values; then taking the weight vector and the bias term corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model, recorded as Wbest and bbest respectively; where V > 1, and in this example V = 300.
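One possible realisation of step 1_5, keeping the weights that produce the smallest loss over the V training passes, uses a ModelCheckpoint callback; the file name, batch size and placeholder arrays below are illustrative assumptions.

```python
import numpy as np
from keras.callbacks import ModelCheckpoint

# Placeholder arrays standing in for the CamVid training data prepared in step 1_1:
# Q RGB images of size H x W and their 12-channel one-hot label volumes.
Q, H, W = 100, 480, 352
X_train = np.zeros((Q, H, W, 3), dtype=np.float32)
Y_train = np.zeros((Q, H, W, 12), dtype=np.float32)

V = 300  # number of training passes used in this example
checkpoint = ModelCheckpoint('best_weights.h5', monitor='loss', mode='min',
                             save_best_only=True, save_weights_only=True)
model.fit(X_train, Y_train, batch_size=4, epochs=V, callbacks=[checkpoint])
```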
The test stage process comprises the following specific steps:
step 2_1: let {I^test(i',j')} denote the road scene image to be semantically segmented; wherein 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {I^test(i',j')}, H' denotes the height of {I^test(i',j')}, and I^test(i',j') denotes the pixel value of the pixel at coordinate position (i',j') in {I^test(i',j')}.
Step 2_2: inputting the R channel component, the G channel component and the B channel component of {I^test(i',j')} into the convolutional neural network classification training model, and making a prediction based on Wbest and bbest to obtain the predicted semantic segmentation image corresponding to {I^test(i',j')}, recorded as {I^pred(i',j')}; wherein I^pred(i',j') denotes the pixel value of the pixel at coordinate position (i',j') in {I^pred(i',j')}.
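A sketch of the test-stage prediction, continuing the earlier sketches; forming the final predicted semantic segmentation image by taking the arg-max over the 12 prediction maps is my reading of this step, since the patent does not spell out how the class map is obtained from the 12 outputs.

```python
import numpy as np

# Load the optimal weights Wbest / bbest selected in step 1_5.
model.load_weights('best_weights.h5')

# test_image stands in for an (H', W', 3) RGB road scene image to be segmented.
test_image = np.zeros((480, 352, 3), dtype=np.float32)
probs = model.predict(test_image[np.newaxis, ...])[0]  # shape (H', W', 12)
predicted_segmentation = np.argmax(probs, axis=-1)     # one class index per pixel
```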
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The architecture of the convolutional neural network is built using the Python-based deep learning library Keras 2.1.5. The test set of the CamVid road scene image database is used to analyse the segmentation performance of the road scene images predicted by the method. Here, the segmentation performance of the predicted semantic segmentation images is evaluated using 3 common objective parameters of semantic segmentation methods as evaluation indexes, namely Pixel Accuracy (PA), Mean Pixel Accuracy (MPA) and Mean Intersection over Union (MIoU).
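For completeness, the three evaluation indexes can be computed from a confusion matrix as in the following sketch; these are the standard definitions of PA, MPA and MIoU rather than code from the patent, and empty classes are guarded against with a simple maximum.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, num_classes=12):
    """Pixel Accuracy, Mean Pixel Accuracy and Mean IoU from integer label maps."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true.ravel(), y_pred.ravel()):
        cm[t, p] += 1
    pa = np.diag(cm).sum() / cm.sum()
    per_class_acc = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    mpa = per_class_acc.mean()
    union = cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm)
    miou = (np.diag(cm) / np.maximum(union, 1)).mean()
    return pa, mpa, miou
```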
Each road scene image in the CamVid test set is predicted with the method to obtain the corresponding predicted semantic segmentation image, and the pixel accuracy PA, mean pixel accuracy MPA and mean intersection-over-union MIoU reflecting the semantic segmentation performance of the method are listed in Table 1; the higher these values are, the higher the effectiveness and prediction accuracy. The data listed in Table 1 show that the segmentation results obtained by the method of the invention are good, indicating that it is feasible and effective to obtain the predicted semantic segmentation image corresponding to a road scene image with the method.
TABLE 1 evaluation results on test sets using the method of the invention
FIG. 3a shows a selected road scene image to be semantically segmented; FIG. 3b shows a real semantic segmentation image corresponding to the road scene image to be semantically segmented shown in FIG. 3 a; FIG. 3c shows a predicted semantic segmentation image obtained by predicting the road scene image to be semantically segmented shown in FIG. 3a by using the method of the present invention; FIG. 4a shows another selected road scene image to be semantically segmented; FIG. 4b shows a real semantic segmentation image corresponding to the road scene image to be semantically segmented shown in FIG. 4 a; fig. 4c shows a predicted semantic segmentation image obtained by predicting the road scene image to be semantically segmented shown in fig. 4a by using the method of the present invention. Comparing fig. 3b and fig. 3c, and comparing fig. 4b and fig. 4c, it can be seen that the predicted semantic segmentation image obtained by the method of the present invention has high segmentation accuracy, which is close to the real semantic segmentation image.
Claims (2)
1. A road scene segmentation method based on residual error network and expansion convolution is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting Q original road scene images and the real semantic segmentation image corresponding to each original road scene image to form a training set, recording the q-th original road scene image in the training set as {I_q(i,j)}, and recording the real semantic segmentation image in the training set corresponding to {I_q(i,j)} as {I_q^true(i,j)}; then processing the real semantic segmentation image corresponding to each original road scene image in the training set into 12 one-hot coded images by using a one-hot coding technique, and recording the set of the 12 one-hot coded images obtained from {I_q^true(i,j)} as J_q; wherein the road scene images are RGB color images, Q is a positive integer, Q ≥ 100, q is a positive integer, 1 ≤ q ≤ Q, 1 ≤ i ≤ W, 1 ≤ j ≤ H, W denotes the width of {I_q(i,j)}, H denotes the height of {I_q(i,j)}, I_q(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {I_q(i,j)}, and I_q^true(i,j) denotes the pixel value of the pixel at coordinate position (i,j) in {I_q^true(i,j)};
step 1_ 2: constructing a convolutional neural network: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layers are composed of 10 sequentially arranged Residual blocks, wherein each convolution layer in the 1 st Residual block forms an expanded convolution layer by setting the expansion ratio to 1, each convolution layer in the 2 nd Residual block forms an expanded convolution layer by setting the expansion ratio to 1, each convolution layer in the 3 rd Residual block forms an expanded convolution layer by setting the expansion ratio to 2, each convolution layer in the 4 th Residual block forms an expanded convolution layer by setting the expansion ratio to 2, each convolution layer in the 5 th Residual block forms an expanded convolution layer by setting the expansion ratio to 4, each convolution layer in the 6 th Residual block forms an expanded convolution layer by setting the expansion ratio to 4, each convolution layer in the 7 th Residual block forms an expanded convolution layer by setting the expansion ratio to 2, each convolution layer in the 8 th Residual block forms an expanded convolution layer by setting the expansion ratio to 2, each convolution layer in the 9 th Residual block forms an expanded convolution layer by setting the expansion rate to be 1, and each convolution layer in the 10 th Residual block forms an expanded convolution layer by setting the expansion rate to be 1;
for an input layer, the input end of the input layer receives an R channel component, a G channel component and a B channel component of an original input image, and the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original input image to a hidden layer; wherein, the width of the original input image received by the input end of the input layer is required to be W, and the height of the original input image is required to be H;
for the 1st Residual block, its input end receives the R channel component, the G channel component and the B channel component of the original input image output by the output end of the input layer, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R1; wherein each feature map in R1 has a width of W and a height of H;
for the 2nd Residual block, its input end receives all the feature maps in R1, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R2; wherein each feature map in R2 has a width of W and a height of H;
for the 3rd Residual block, its input end receives all the feature maps in R2, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R3; wherein each feature map in R3 has a width of W and a height of H;
for the 4th Residual block, its input end receives all the feature maps in R3, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R4; wherein each feature map in R4 has a width of W and a height of H;
for the 5th Residual block, its input end receives all the feature maps in R4, and its output end outputs 128 feature maps, the set formed by these 128 feature maps being recorded as R5; wherein each feature map in R5 has a width of W and a height of H;
for the 6th Residual block, its input end receives all the feature maps in R5, and its output end outputs 128 feature maps, the set formed by these 128 feature maps being recorded as R6; wherein each feature map in R6 has a width of W and a height of H;
for the 7th Residual block, its input end receives all the feature maps in R6, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R7; wherein each feature map in R7 has a width of W and a height of H;
for the 8th Residual block, its input end receives all the feature maps in R7, and its output end outputs 64 feature maps, the set formed by these 64 feature maps being recorded as R8; wherein each feature map in R8 has a width of W and a height of H;
for the 9th Residual block, its input end receives all the feature maps in R8, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R9; wherein each feature map in R9 has a width of W and a height of H;
for the 10th Residual block, its input end receives all the feature maps in R9, and its output end outputs 32 feature maps, the set formed by these 32 feature maps being recorded as R10; wherein each feature map in R10 has a width of W and a height of H;
for the output layer, which consists of 1 convolution layer, its input end receives all the feature maps in R10, and its output end outputs 12 semantic segmentation prediction maps corresponding to the original input image; wherein each semantic segmentation prediction map has a width of W and a height of H;
step 1_3: taking each original road scene image in the training set as an original input image and inputting it into the convolutional neural network for training, obtaining the 12 semantic segmentation prediction maps corresponding to each original road scene image in the training set; the set formed by the 12 semantic segmentation prediction maps corresponding to {I_q(i,j)} is recorded as P_q;
step 1_4: calculating the loss function value between the set formed by the 12 semantic segmentation prediction maps corresponding to each original road scene image in the training set and the set formed by the 12 one-hot coded images obtained from the corresponding real semantic segmentation image; the loss function value between P_q and J_q is recorded as Loss_q;
step 1_5: repeatedly executing step 1_3 and step 1_4 for V times to obtain a convolutional neural network classification training model and Q × V loss function values; then finding the loss function value with the minimum value among the Q × V loss function values; then taking the weight vector and the bias term corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model, recorded as Wbest and bbest respectively; wherein V > 1;
the test stage process comprises the following specific steps:
step 2_1: let {I^test(i',j')} denote the road scene image to be semantically segmented; wherein 1 ≤ i' ≤ W', 1 ≤ j' ≤ H', W' denotes the width of {I^test(i',j')}, H' denotes the height of {I^test(i',j')}, and I^test(i',j') denotes the pixel value of the pixel at coordinate position (i',j') in {I^test(i',j')};
step 2_2: inputting the R channel component, the G channel component and the B channel component of {I^test(i',j')} into the convolutional neural network classification training model, and making a prediction based on Wbest and bbest to obtain the predicted semantic segmentation image corresponding to {I^test(i',j')}, recorded as {I^pred(i',j')}; wherein I^pred(i',j') denotes the pixel value of the pixel at coordinate position (i',j') in {I^pred(i',j')}.
2. The road scene segmentation method based on residual error network and expansion convolution according to claim 1, characterized in that, in step 1_4, Loss_q is obtained using categorical cross entropy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811293377.5A CN109635642A (en) | 2018-11-01 | 2018-11-01 | A kind of road scene dividing method based on residual error network and expansion convolution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811293377.5A CN109635642A (en) | 2018-11-01 | 2018-11-01 | A kind of road scene dividing method based on residual error network and expansion convolution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109635642A true CN109635642A (en) | 2019-04-16 |
Family
ID=66067090
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811293377.5A Withdrawn CN109635642A (en) | 2018-11-01 | 2018-11-01 | A kind of road scene dividing method based on residual error network and expansion convolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109635642A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110232721A (en) * | 2019-05-16 | 2019-09-13 | 福建自贸试验区厦门片区Manteia数据科技有限公司 | A kind of crisis organ delineates the training method and device of model automatically |
CN110276316A (en) * | 2019-06-26 | 2019-09-24 | 电子科技大学 | A kind of human body critical point detection method based on deep learning |
CN110287798A (en) * | 2019-05-27 | 2019-09-27 | 魏运 | Vector network pedestrian detection method based on characteristic module and context fusion |
CN110287932A (en) * | 2019-07-02 | 2019-09-27 | 中国科学院遥感与数字地球研究所 | Route denial information extraction based on the segmentation of deep learning image, semantic |
CN110490082A (en) * | 2019-07-23 | 2019-11-22 | 浙江科技学院 | A kind of road scene semantic segmentation method of effective integration neural network characteristics |
CN110728682A (en) * | 2019-09-09 | 2020-01-24 | 浙江科技学院 | Semantic segmentation method based on residual pyramid pooling neural network |
CN110782462A (en) * | 2019-10-30 | 2020-02-11 | 浙江科技学院 | Semantic segmentation method based on double-flow feature fusion |
CN110782458A (en) * | 2019-10-23 | 2020-02-11 | 浙江科技学院 | Object image 3D semantic prediction segmentation method of asymmetric coding network |
CN110991415A (en) * | 2019-12-21 | 2020-04-10 | 武汉中海庭数据技术有限公司 | Structural target high-precision segmentation method, electronic equipment and storage medium |
CN111401436A (en) * | 2020-03-13 | 2020-07-10 | 北京工商大学 | Streetscape image segmentation method fusing network and two-channel attention mechanism |
CN111507990A (en) * | 2020-04-20 | 2020-08-07 | 南京航空航天大学 | Tunnel surface defect segmentation method based on deep learning |
CN112529064A (en) * | 2020-12-03 | 2021-03-19 | 燕山大学 | Efficient real-time semantic segmentation method |
- 2018-11-01 CN CN201811293377.5A patent/CN109635642A/en not_active Withdrawn
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110232721A (en) * | 2019-05-16 | 2019-09-13 | 福建自贸试验区厦门片区Manteia数据科技有限公司 | A kind of crisis organ delineates the training method and device of model automatically |
CN110287798A (en) * | 2019-05-27 | 2019-09-27 | 魏运 | Vector network pedestrian detection method based on characteristic module and context fusion |
CN110276316B (en) * | 2019-06-26 | 2022-05-24 | 电子科技大学 | Human body key point detection method based on deep learning |
CN110276316A (en) * | 2019-06-26 | 2019-09-24 | 电子科技大学 | A kind of human body critical point detection method based on deep learning |
CN110287932A (en) * | 2019-07-02 | 2019-09-27 | 中国科学院遥感与数字地球研究所 | Route denial information extraction based on the segmentation of deep learning image, semantic |
CN110490082A (en) * | 2019-07-23 | 2019-11-22 | 浙江科技学院 | A kind of road scene semantic segmentation method of effective integration neural network characteristics |
CN110490082B (en) * | 2019-07-23 | 2022-04-05 | 浙江科技学院 | Road scene semantic segmentation method capable of effectively fusing neural network features |
CN110728682A (en) * | 2019-09-09 | 2020-01-24 | 浙江科技学院 | Semantic segmentation method based on residual pyramid pooling neural network |
CN110728682B (en) * | 2019-09-09 | 2022-03-29 | 浙江科技学院 | Semantic segmentation method based on residual pyramid pooling neural network |
CN110782458A (en) * | 2019-10-23 | 2020-02-11 | 浙江科技学院 | Object image 3D semantic prediction segmentation method of asymmetric coding network |
CN110782458B (en) * | 2019-10-23 | 2022-05-31 | 浙江科技学院 | Object image 3D semantic prediction segmentation method of asymmetric coding network |
CN110782462A (en) * | 2019-10-30 | 2020-02-11 | 浙江科技学院 | Semantic segmentation method based on double-flow feature fusion |
CN110782462B (en) * | 2019-10-30 | 2022-08-09 | 浙江科技学院 | Semantic segmentation method based on double-flow feature fusion |
CN110991415A (en) * | 2019-12-21 | 2020-04-10 | 武汉中海庭数据技术有限公司 | Structural target high-precision segmentation method, electronic equipment and storage medium |
CN111401436A (en) * | 2020-03-13 | 2020-07-10 | 北京工商大学 | Streetscape image segmentation method fusing network and two-channel attention mechanism |
CN111401436B (en) * | 2020-03-13 | 2023-04-18 | 中国科学院地理科学与资源研究所 | Streetscape image segmentation method fusing network and two-channel attention mechanism |
CN111507990B (en) * | 2020-04-20 | 2022-02-11 | 南京航空航天大学 | Tunnel surface defect segmentation method based on deep learning |
CN111507990A (en) * | 2020-04-20 | 2020-08-07 | 南京航空航天大学 | Tunnel surface defect segmentation method based on deep learning |
CN112529064A (en) * | 2020-12-03 | 2021-03-19 | 燕山大学 | Efficient real-time semantic segmentation method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109635642A (en) | A kind of road scene dividing method based on residual error network and expansion convolution | |
CN110782462B (en) | Semantic segmentation method based on double-flow feature fusion | |
CN110728682B (en) | Semantic segmentation method based on residual pyramid pooling neural network | |
CN111401436B (en) | Streetscape image segmentation method fusing network and two-channel attention mechanism | |
CN113469094A (en) | Multi-mode remote sensing data depth fusion-based earth surface coverage classification method | |
CN110929736B (en) | Multi-feature cascading RGB-D significance target detection method | |
CN112116030A (en) | Image classification method based on vector standardization and knowledge distillation | |
CN110490205B (en) | Road scene semantic segmentation method based on full-residual-error hole convolutional neural network | |
CN109635662B (en) | Road scene semantic segmentation method based on convolutional neural network | |
CN112396607A (en) | Streetscape image semantic segmentation method for deformable convolution fusion enhancement | |
CN112489164B (en) | Image coloring method based on improved depth separable convolutional neural network | |
CN109635763B (en) | Crowd density estimation method | |
CN113269787A (en) | Remote sensing image semantic segmentation method based on gating fusion | |
CN116797787B (en) | Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network | |
CN109446933B (en) | Road scene semantic segmentation method based on convolutional neural network | |
CN112560966B (en) | Polarized SAR image classification method, medium and equipment based on scattering map convolution network | |
CN109508639B (en) | Road scene semantic segmentation method based on multi-scale porous convolutional neural network | |
CN110647990A (en) | Cutting method of deep convolutional neural network model based on grey correlation analysis | |
CN113516133A (en) | Multi-modal image classification method and system | |
CN117237559A (en) | Digital twin city-oriented three-dimensional model data intelligent analysis method and system | |
CN116844004A (en) | Point cloud automatic semantic modeling method for digital twin scene | |
CN115222754A (en) | Mirror image segmentation method based on knowledge distillation and antagonistic learning | |
CN114581789A (en) | Hyperspectral image classification method and system | |
CN114565625A (en) | Mineral image segmentation method and device based on global features | |
CN112488117B (en) | Point cloud analysis method based on direction-induced convolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
WW01 | Invention patent application withdrawn after publication ||
Application publication date: 20190416 |