CN115171074A - Vehicle target identification method based on multi-scale yolo algorithm - Google Patents
Vehicle target identification method based on multi-scale yolo algorithm
- Publication number
- CN115171074A (application number CN202210806937.2A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- layer
- feature
- network
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
Description
Technical Field
The present invention relates to a vehicle identification method, and in particular to a vehicle target identification method based on a multi-scale YOLO algorithm.
Background Art
Deep learning has strong self-learning ability as well as strong representation and processing capability, and object detection today is usually carried out with deep learning. The convolutional neural network (CNN) is one of the most widely used models in deep learning; from R-CNN [6] through Fast R-CNN, Faster R-CNN and Cascade R-CNN, these models and algorithms have been continuously developed and improved, with large gains in detection accuracy and detection efficiency. Deep-learning-based algorithm models are therefore set to become among the most widely used approaches in the field of object detection.
Intelligent driver-assistance technology involves a large amount of image recognition and processing. The information collected by the vehicle usually arrives as video or images, and the on-board computer must use this visual information to identify the valuable targets and content it contains, providing the basis for the next step of vehicle behavior decision-making. Correct and fast recognition of the targets in an image is therefore the foundation of intelligent driver assistance. Although object detection technology has already achieved some results, how to improve its real-time performance, accuracy and robustness in more complex environments remains a pressing and active research topic.
Summary of the Invention
The purpose of the present invention is to address the need to improve the real-time performance, accuracy and robustness of existing object detection methods in relatively complex environments, and to propose a vehicle target identification method based on a multi-scale YOLO algorithm.
A vehicle target identification method based on a multi-scale YOLO algorithm, the method being implemented through the following steps:
a data set preprocessing step;
a backbone network feature extraction step;
a PANet feature fusion step;
an NMS (non-maximum suppression) step;
a target labeling and decision output step;
In addition, considering the effect of class imbalance in the sample data set on classification accuracy, a multi-loss-function alternating training strategy is adopted: the cross-entropy loss function and the focal loss function are used alternately at different stages of network training, which alleviates the sample imbalance problem.
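As an illustration of how the five steps above fit together, the following is a minimal Python skeleton; the function names and stub bodies are placeholders introduced here for readability and are not defined in the patent.

```python
# Hypothetical skeleton of the detection pipeline; every function body is a stub.
def preprocess(image):                 return image           # data set preprocessing (stub)
def backbone_extract(x):               return x, x, x         # backbone features F1, F2, F3 (stub)
def panet_fuse(f1, f2, f3):            return [f1, f2, f3]    # PANet feature fusion (stub)
def nms(candidates, iou_threshold):    return candidates      # non-maximum suppression (stub)
def decide_targets(candidates):        return candidates      # target labeling / decision output (stub)

def detect_vehicles(image):
    x = preprocess(image)                       # step 1
    f1, f2, f3 = backbone_extract(x)            # step 2
    candidates = panet_fuse(f1, f2, f3)         # step 3
    kept = nms(candidates, iou_threshold=0.5)   # step 4
    return decide_targets(kept)                 # step 5

print(detect_vehicles("frame.jpg"))
```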
Preferably, the backbone network feature extraction step specifically comprises:
(1) a step of designing the convolution algorithm;
In the convolution operation, each pixel of the output image is obtained as a weighted sum over a small region at the corresponding position of the input image, the weights of which form the convolution kernel. By convolving the image with the convolution kernel, certain features of the image are extracted;
An 8×8 two-dimensional grayscale pixel array and a 3×3 convolution kernel are considered. If the kernel moves with a stride of 1, i.e. it moves by one cell at a time, then when the kernel reaches row i, column j, the values of the input image and the corresponding kernel entries are multiplied element by element and summed, which determines the output value at row i, column j of the output image. Taking the pixel values shown in Fig. 3 as an example, the output value at row 2, column 3 is [1×1 + 2×0 + 3×(−1) + 4×1 + 5×0 + 6×(−1) + 7×1 + 8×0 + 9×(−1)], i.e. −6. The number of channels of the convolution kernel equals the number of channels of the input data: if the input is a three-channel color image, the kernel must also have three channels. Three-dimensional convolution is essentially the same as two-dimensional convolution, the output value being the weighted sum of the input values and the corresponding kernel entries;
A convolution layer contains multiple convolution kernels, and the number of channels of the output pixel array after the convolution layer depends on the number of kernels: if the layer contains n kernels, the output array after the layer has n channels;
(2) a step of designing the activation function;
After the convolution layer, an activation function is used to make the data nonlinear. If the input-output relationship remained linear throughout, then after many layers the overall relationship between input and output would still be linear, so no matter how many layers were stacked in between they would be equivalent to a single layer, as shown in the formulas below;
Y = aX + b; Z = cY + d
Z = c(aX + b) + d = (ac)X + (bc + d)
The ReLU function is selected as the activation function;
(3) a step of designing the pooling algorithm;
After the input image has undergone the convolution operation, a pooling operation is performed;
If the output value is the maximum of the input values at the corresponding positions of the pooling window, the operation is max pooling; if the output value is the average of those input values, the operation is average pooling;
(4) a step of applying the spatial pyramid pooling structure;
The SPP structure is introduced to establish a mapping between candidate regions and the input feature map;
(5) a step of designing the MobileNetv3 network structure;
The input D×D×3 feature map is convolved with 3×3 convolution kernels to output a D×D×N feature map. In standard convolution, N 3×3 kernels are convolved with every channel of the input feature map, finally yielding a new feature map with N channels;
Depthwise separable convolution first convolves each channel of the input feature map with its own 3×3 kernel (three kernels in total), producing a feature map whose number of output channels equals the number of input channels, and then convolves this feature map with N 1×1 kernels to obtain a new N-channel feature map;
The numbers of parameters used by the two kinds of convolution are computed separately, with the results given by the following formulas:
P1 = D × D × 3 × N (1)
P2 = D × D × 3 + D × D × 1 × N (2)
where P1 is the number of parameters used by standard convolution, P2 is the number of parameters used by depthwise separable convolution, D is the height and width of the input feature map, and N is the number of convolution kernels;
When standard convolution is performed, the number of input channels is much smaller than the number of output channels; comparing formula (1) with formula (2) yields formula (3):
It can be seen that the ratio P2/P1 is far less than 1: using depthwise separable convolution greatly reduces the number of parameters used by the convolution while achieving an effect close to that of standard convolution;
(6) a step of designing the RFBs structure;
Dilated convolution is introduced into the structure module. The RFBs structure first applies a 1×1 convolution to the feature map for channel transformation and then performs multi-branch dilated-structure processing to obtain multi-scale information about the target. The multi-branch structure combines ordinary convolution layers with dilated convolution layers: in the ordinary convolution layers, the 3×3 kernel of the original RFB structure is replaced by parallel 1×3 and 3×1 kernels, and the 5×5 kernel is replaced by two cascaded 1×3 and 3×1 kernels. The dilated convolution layers each consist of three 3×3 kernels with dilation rates of 1, 3 and 5, which prevents the convolution layers from degrading due to an excessively large dilation rate. Finally, the feature layers of different sizes processed by the multi-branch dilated structure are concatenated (Concat), and a new fused feature layer is output;
(7) a step of the improved SPPNet;
Inspired by CSPDarknet, the backbone feature extraction network of YOLOv4, the CSP structure is introduced into SPP: before multi-scale features are fused, the network is split into two parts, and one part of the features passes through a shortcut connection and is merged directly with the features after SPP fusion.
Preferably, the PANet feature fusion step is specifically:
The CSPDarknet-53 network contains a large number of convolution operations. It stacks many residual modules built from 3×3 and 1×1 convolutions and uses 3×3 convolution layers with stride 2 to halve the size of the feature map. Residual denotes a residual module, and the n× on the right side of the rectangular box is the number of times that residual module is reused. A 3×3 convolution layer with stride 2 is applied after each group of residual modules, five times in total; each use of this convolution layer halves the height and width of the feature map, replacing the pooling layers that conventional convolutional networks use for downsampling. In the prediction stage, the feature layers F1, F2 and F3 obtained by the CSPDarknet-53 feature extraction network are fed into the multi-scale prediction network: F3 passes through convolution operations to give coarse-scale feature layer 3, used to detect large-scale targets; feature layer 3 is upsampled, fused with F2 and then convolved to give mid-scale feature layer 2, used to detect mid-scale targets; feature layer 2 is again upsampled, fused with F1 and convolved to give fine-scale feature layer 1, used to detect small-scale targets. This feature pyramid network (FPN) structure allows the algorithm to achieve good detection results on targets of different sizes and scales. Finally, the prediction information of the three feature layers of different scales is combined, and the final detection result is obtained through a non-maximum suppression post-processing algorithm;
The nine prediction boxes of different sizes obtained are (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198) and (373×326). The boxes (116×90), (156×198) and (373×326) are applied on the 13×13 coarse-scale feature layer 3 to detect larger targets; the medium-sized boxes (30×61), (62×45) and (59×119) are applied on the 26×26 mid-scale feature layer 2 to detect medium-sized targets; and (10×13), (16×30) and (33×23) are applied on the 52×52 fine-scale feature layer 1 to detect smaller targets.
The beneficial effects of the present invention are as follows:
Through the study of the YOLOv4 network, the present invention establishes the basic experimental method and underlying principles of YOLOv4 and clarifies the basic structure of YOLOv4 and how its sub-networks pass data to one another. Based on the feasibility of the experimental environment and the problems present in its network structure, such as missed detection of small targets, occlusion, detection in complex backgrounds and multi-scale target detection, the network is modified to improve its target detection accuracy, and the network is pruned to reduce the number of parameters and speed up model training.
1. Compared with the residual skip-connection scheme of CSPDarknet, the backbone feature extraction network of YOLOv4, this work uses MobileNetv3 in place of the original backbone, because the depthwise separable convolution of MobileNetv3 helps to streamline the network structure and reduce the number of training parameters. MobileNetv3 is therefore used as the backbone feature extraction network to strengthen the effective use and transfer of features in the network; the network can learn more feature information, its structure becomes more lightweight, and the speed of target detection is improved.
2. Inspired by CSPDarknet, the backbone feature extraction network of YOLOv4, this work introduces the CSP structure into SPP. Before multi-scale features are fused, the network is split into two parts, and one part of the features passes through a shortcut connection and is merged directly with the features after SPP fusion; this operation reduces the computational cost by 40%. The resulting structure is a fusion of several shallow networks, which do not suffer from vanishing gradients during training and thus accelerate the convergence of the network.
3. Low-level feature maps carry little semantic information but localize targets accurately, whereas high-level feature maps carry rich semantic information but localize targets only coarsely. By introducing RFBNet, the original receptive fields of the network are fused at multiple levels, and dilated convolution is introduced, achieving the goals of enlarging the receptive field and fusing features of different sizes.
Brief Description of the Drawings
Fig. 1 is the flow chart of the method of the present invention;
Fig. 2 shows the basic method of the convolution operation involved in the present invention;
Fig. 3 shows the number of output channels after a pixel array passes through a convolution layer in the present invention;
Fig. 4 shows the two activation functions involved in the present invention;
Fig. 5 shows the two pooling methods involved in the present invention;
Fig. 6 shows the number of output channels after a pixel array passes through a pooling layer in the present invention;
Fig. 7 shows the SPP structure with fixed-size (21-dimensional) output involved in the present invention;
Fig. 8 shows the network structure of YOLOv3 involved in the present invention.
Detailed Description of the Embodiments
Embodiment 1:
A vehicle target identification method based on a multi-scale YOLO algorithm according to this embodiment, as shown in Fig. 1, is implemented through the following steps:
a data set preprocessing step;
a backbone network feature extraction step;
a PANet feature fusion step;
an NMS (non-maximum suppression) step;
a target labeling and decision output step;
In addition, considering the effect of class imbalance in the sample data set on classification accuracy, a multi-loss-function alternating training strategy is adopted: the cross-entropy loss function and the focal loss function are used alternately at different stages of network training, which alleviates the sample imbalance problem.
Sample imbalance here has two main aspects. On one hand, most samples in the data set are relatively easy to learn, the so-called "simple samples". Such samples make up the majority of the whole data set; their features are distinct and concentrated, the interference from the surrounding background is weak, and they can easily be learned and recognized by the network. In contrast, "difficult samples" have weak features of their own because they are small or lie close to samples of other classes, or their main features are missing due to surrounding occlusion or changes in light and shadow; the network finds it hard to learn their discriminative features, and the detection results are poor. In addition, difficult samples appear infrequently in the data set and their proportion relative to simple samples is unbalanced, so they are hard to train sufficiently, which further aggravates the problems of difficult learning and inaccurate detection of difficult samples.
The present invention does not use the focal loss function to completely replace the cross-entropy loss function for network training, because in our actual training and testing on the XDUAV data set we found that using the focal loss function alone brings a large improvement on some scarce categories (such as buses and tanker trucks), but the effect is not obvious on categories that are both scarce and small in size and whose own features are weak (such as bicycles and motorcycles). Our analysis is that these samples have very limited features of their own; although the focal loss increases their training weight, the lack of data still prevents the network from learning good feature representations for these categories. Therefore, the cross-entropy loss function is used in the early stage of training, with the aim of learning the overall feature distribution of the samples. For categories with little data, such as bicycles and motorcycles, whose features are fairly similar, the cross-entropy loss enables the network to learn these similar categories as a whole. We then switch to the focal loss and train the network to learn the discriminative features between samples of these similar categories, so as to distinguish them better.
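For reference, the following is a minimal NumPy sketch of the two loss functions and the alternating strategy described above. The patent does not state the exact focal-loss form or its hyperparameters; the standard formulation with assumed values γ = 2 and α = 0.25 and an assumed switch point are used here purely for illustration.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy; p = predicted probability of the positive class, y in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Standard focal loss: down-weights easy samples so training focuses on hard ones."""
    p = np.clip(p, eps, 1.0 - eps)
    pt = np.where(y == 1, p, 1 - p)               # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -alpha_t * (1 - pt) ** gamma * np.log(pt)

def training_loss(p, y, epoch, switch_epoch=100):
    """Alternating strategy sketch: cross-entropy in the early stage, focal loss afterwards."""
    return cross_entropy(p, y) if epoch < switch_epoch else focal_loss(p, y)

p = np.array([0.9, 0.2]); y = np.array([1, 1])    # one easy and one hard positive sample
print(cross_entropy(p, y))   # the easy sample still contributes noticeably
print(focal_loss(p, y))      # the easy sample is strongly down-weighted
```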
Embodiment 2:
Different from Embodiment 1, in the vehicle target identification method based on the multi-scale YOLO algorithm of this embodiment, the backbone network feature extraction step specifically comprises:
(1) a step of designing the convolution algorithm;
In the convolution operation, each pixel of the output image is obtained as a weighted sum over a small region at the corresponding position of the input image, the weights of which form the convolution kernel. By convolving the image with the convolution kernel, certain features of the image are extracted. The basic method of the convolution operation is shown in Fig. 2.
The figure shows an 8×8 two-dimensional grayscale pixel array and a 3×3 convolution kernel. If the kernel moves with a stride of 1, i.e. it moves by one cell at a time, then when the kernel reaches row i, column j, the values of the input image and the corresponding kernel entries are multiplied element by element and summed, which determines the output value at row i, column j of the output image. Taking the pixel values shown in Fig. 3 as an example, the output value at row 2, column 3 is [1×1 + 2×0 + 3×(−1) + 4×1 + 5×0 + 6×(−1) + 7×1 + 8×0 + 9×(−1)], i.e. −6. The number of channels of the convolution kernel equals the number of channels of the input data: if the input is a three-channel color image, the kernel must also have three channels. Three-dimensional convolution is essentially the same as two-dimensional convolution, the output value being the weighted sum of the input values and the corresponding kernel entries. When designing the network, the stride and whether the input data is padded are also specified;
A convolution layer contains multiple convolution kernels, and the number of channels of the output pixel array after the convolution layer depends on the number of kernels: if the layer contains n kernels, the output array after the layer has n channels, as shown in Fig. 3.
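The worked example above can be reproduced with a few lines of NumPy. The kernel values below (each row [1, 0, −1]) are inferred from the arithmetic in the text and are an assumption for illustration, not a kernel prescribed by the patent.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2D convolution (cross-correlation): each output value is the weighted
    sum of the image patch currently under the kernel."""
    kh, kw = kernel.shape
    H, W = image.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

patch = np.arange(1, 10).reshape(3, 3)       # the 3x3 window with values 1..9
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])
print(conv2d(patch, kernel))                 # [[-6.]], matching the value in the text
```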
(2) a step of designing the activation function;
After the convolution layer, an activation function is used to make the data nonlinear, so that the relationship between input and output is no longer linear; this allows more complex variations in the input to be captured. If the input-output relationship remained linear throughout, then after many layers the overall relationship between input and output would still be linear, so no matter how many layers were stacked in between they would be equivalent to a single layer, as shown in the formulas below;
Y = aX + b; Z = cY + d
Z = c(aX + b) + d = (ac)X + (bc + d)
The ReLU function is selected as the activation function;
Commonly used activation functions include the ReLU function and the Mish function, whose graphs are shown in Fig. 4. Since the output of the Mish function changes little when the input is very large or very small, the ReLU function has a greater advantage in avoiding vanishing gradients and improving training speed;
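A minimal NumPy sketch of the two activation functions, using their standard definitions (ReLU(x) = max(0, x) and Mish(x) = x·tanh(softplus(x))), is given below for illustration; it is not specific to the patent.

```python
import numpy as np

def relu(x):
    """ReLU: keeps positive values and zeroes out negative ones."""
    return np.maximum(0.0, x)

def mish(x):
    """Mish: x * tanh(softplus(x)), a smooth alternative to ReLU."""
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu(x))   # [0. 0. 0. 1. 3.]
print(mish(x))   # smooth curve approaching x for large positive inputs
```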
(3) a step of designing the pooling algorithm;
After the input image has undergone the convolution operation, each pixel in the image may influence several output points, which can cause information redundancy and degrade the performance of the algorithm, so a pooling operation is performed. The most common pooling algorithms are average pooling and max pooling, as shown in Fig. 5, of which max pooling is the more widely used;
If the output value is the maximum of the input values at the corresponding positions of the pooling window, the operation is max pooling; if the output value is the average of those input values, the operation is average pooling;
The pooling algorithm is similar to the convolution algorithm, but whereas the number of channels of a convolution kernel must match the number of channels of the input data, the pooling window is a two-dimensional array that moves row by row, column by column and channel by channel, so the output of a pooling layer has the same number of channels as its input, as shown in Fig. 6.
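A minimal single-channel pooling sketch follows; the window size, stride and input values are assumptions chosen only to contrast max pooling with average pooling.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """2D pooling over a single-channel array; mode is 'max' or 'avg'."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1.,  2.,  5.,  6.],
              [3.,  4.,  7.,  8.],
              [9., 10., 13., 14.],
              [11., 12., 15., 16.]])
print(pool2d(x, mode="max"))   # [[ 4.  8.] [12. 16.]]
print(pool2d(x, mode="avg"))   # [[ 2.5  6.5] [10.5 14.5]]
```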
(4) a step of applying the spatial pyramid pooling structure (SPPNet);
The SPP structure generally sits between the convolution layers and the fully connected layers and is used to produce a fixed-size output from an input of arbitrary size. Its processing is as follows: pooling grids of 1×1, 2×2 and 4×4 cells are applied in parallel to an input feature map of arbitrary size, giving 1 + 4 + 16 = 21 different image blocks; the maximum (max pooling) or average (average pooling) of each block is then computed, yielding a feature vector of fixed size 21 that can be connected directly to the fully connected layer. By adjusting the size and number of grids, features are extracted from receptive fields of different sizes; the outputs of the three pooling grids and the input feature map are then concatenated to obtain the fused feature vector. Producing an output of the required size from an input of arbitrary size is the most important property of the SPP structure. Fig. 7 shows the SPP structure.
Taking the classic two-stage object detection algorithm R-CNN as an example: in the region proposal stage the algorithm generates a large number of candidate regions (around 2000), which are cropped or warped into fixed-size input images; a CNN model then processes each input image to obtain a fixed-size feature vector, which can finally be fed to the fully connected layers for classification and regression. Every generated candidate region must be sent through the CNN model to produce a fixed-size feature vector, and the huge number of candidate regions inevitably makes the algorithm inefficient. After the SPP structure is introduced to optimize the R-CNN algorithm, the input image is sent through the CNN model once to obtain the overall feature map; a mapping between candidate regions and this feature map is then established, so the feature vectors of all candidate regions are obtained without running each region through the CNN model, and similar classification and regression are finally performed on these feature vectors. With the SPP structure, only the mapping between candidate regions and the input feature map needs to be established, and the forward pass of the CNN model does not need to be repeated; the detection speed of the algorithm is 24 to 120 times higher than that of R-CNN, and higher detection accuracy is achieved;
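The fixed 21-dimensional SPP output described above can be sketched as follows for a single-channel feature map with max pooling; in a real network the same operation is applied per channel, and the 13×13 input size here is only an assumed example.

```python
import numpy as np

def spp_max(feature, levels=(1, 2, 4)):
    """Spatial pyramid pooling: split the map into 1x1, 2x2 and 4x4 grids and take
    the maximum of each cell, giving a 1 + 4 + 16 = 21-dimensional vector."""
    H, W = feature.shape
    out = []
    for n in levels:
        hs = np.linspace(0, H, n + 1, dtype=int)   # cell boundaries covering the whole map
        ws = np.linspace(0, W, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                out.append(feature[hs[i]:hs[i + 1], ws[j]:ws[j + 1]].max())
    return np.array(out)

feature = np.random.rand(13, 13)      # arbitrary input size
print(spp_max(feature).shape)         # (21,) regardless of the input size
```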
(5) a step of designing the MobileNetv3 network structure;
The core idea of MobileNet is to decompose a complete convolution operation into two steps: depthwise convolution and pointwise convolution. Efficient neural networks are obtained mainly by: 1. reducing the number of parameters; 2. quantizing the parameters so that each parameter occupies less memory. Current research falls broadly into two directions: one is to compress a trained complex model into a small model; the other is to design and train a small model directly (MobileNet belongs to the latter category).
The input D×D×3 feature map is convolved with 3×3 convolution kernels to output a D×D×N feature map. In standard convolution, N 3×3 kernels are convolved with every channel of the input feature map, finally yielding a new feature map with N channels;
Depthwise separable convolution first convolves each channel of the input feature map with its own 3×3 kernel (three kernels in total), producing a feature map whose number of output channels equals the number of input channels, and then convolves this feature map with N 1×1 kernels to obtain a new N-channel feature map;
The numbers of parameters used by the two kinds of convolution are computed separately, with the results given by the following formulas:
P1 = D × D × 3 × N (1)
P2 = D × D × 3 + D × D × 1 × N (2)
where P1 is the number of parameters used by standard convolution, P2 is the number of parameters used by depthwise separable convolution, D is the height and width of the input feature map, and N is the number of convolution kernels;
When standard convolution is performed, the number of input channels is much smaller than the number of output channels; comparing formula (1) with formula (2) yields formula (3):
It can be seen that the ratio P2/P1 is far less than 1: using depthwise separable convolution greatly reduces the number of parameters used by the convolution while achieving an effect close to that of standard convolution;
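The comparison can be reproduced directly from formulas (1) and (2) of the text; the values of D and N below are assumed examples rather than values given in the patent.

```python
# Parameter counts following the text's formulas (1) and (2),
# with D the feature-map side length and N the number of kernels.
def params_standard(D, N):
    return D * D * 3 * N                 # formula (1): P1

def params_depthwise_separable(D, N):
    return D * D * 3 + D * D * 1 * N     # formula (2): P2

D, N = 32, 256                           # assumed example values
p1 = params_standard(D, N)
p2 = params_depthwise_separable(D, N)
print(p1, p2, round(p2 / p1, 4))         # 786432 265216 0.3372; the ratio tends to 1/3 for large N under these formulas
```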
(6) a step of designing the RFBs structure;
In order to capture multi-scale feature information of pedestrians, the RFBs structure module is connected after the improved backbone network; dilated convolution is introduced into this structure module to enlarge the receptive field and fuse features of different sizes;
The RFBs structure first applies a 1×1 convolution to the feature map for channel transformation and then performs multi-branch dilated-structure processing to obtain multi-scale information about the target. The multi-branch structure combines ordinary convolution layers with dilated convolution layers: in the ordinary convolution layers, the 3×3 kernel of the original RFB structure is replaced by parallel 1×3 and 3×1 kernels, and the 5×5 kernel is replaced by two cascaded 1×3 and 3×1 kernels, which effectively reduces the computational cost of the network and keeps the whole network lightweight. The dilated convolution layers each consist of three 3×3 kernels with dilation rates of 1, 3 and 5, which prevents the convolution layers from degrading due to an excessively large dilation rate. Finally, the feature layers of different sizes processed by the multi-branch dilated structure are concatenated (Concat), and a new fused feature layer is output. To retain the original information of the input feature map, the new fused feature layer is passed through a 1×1 convolution layer to transform its channels and is then added to the large residual edge formed by the original feature map before being output.
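The following PyTorch-style sketch illustrates the multi-branch dilated structure with dilation rates 1, 3 and 5 and a residual edge. The channel widths and exact branch composition are assumptions for illustration and do not reproduce the patent's precise RFBs configuration.

```python
import torch
import torch.nn as nn

class RFBsBranch(nn.Module):
    """One branch: 1x3 + 3x1 ordinary convolutions followed by a 3x3 dilated convolution."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
        )
    def forward(self, x):
        return self.body(x)

class RFBs(nn.Module):
    """Multi-branch dilated structure with a residual (shortcut) edge."""
    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, 1)     # 1x1 channel transform
        self.branches = nn.ModuleList(
            [RFBsBranch(mid_channels, d) for d in (1, 3, 5)])     # dilation rates 1, 3, 5
        self.fuse = nn.Conv2d(3 * mid_channels, in_channels, 1)   # back to the input width
    def forward(self, x):
        y = self.reduce(x)
        y = torch.cat([b(y) for b in self.branches], dim=1)       # Concat of branch outputs
        return self.fuse(y) + x                                   # add the residual edge

x = torch.randn(1, 256, 52, 52)
print(RFBs(256)(x).shape)   # torch.Size([1, 256, 52, 52])
```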
(7) a step of the improved SPPNet;
Inspired by CSPDarknet, the backbone feature extraction network of YOLOv4, the CSP structure is introduced into SPP: before multi-scale features are fused, the network is split into two parts, and one part of the features passes through a shortcut connection and is merged directly with the features after SPP fusion; this operation reduces the computational cost by 40%. The resulting structure is a fusion of several shallow networks, which do not suffer from vanishing gradients during training and thus accelerate the convergence of the network.
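A minimal PyTorch-style sketch of the CSP-style split around SPP is shown below. The pooling kernel sizes (5, 9, 13) follow the common YOLOv4 SPP choice and, like the channel split, are assumptions rather than values stated in the patent.

```python
import torch
import torch.nn as nn

class CSPSPP(nn.Module):
    """CSP-style SPP sketch: one half of the channels goes through SPP max pooling,
    the other half takes a shortcut and is merged with the SPP output."""
    def __init__(self, channels, pool_sizes=(5, 9, 13)):
        super().__init__()
        half = channels // 2
        self.split_main = nn.Conv2d(channels, half, 1)       # branch that enters SPP
        self.split_shortcut = nn.Conv2d(channels, half, 1)   # shortcut branch
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in pool_sizes])
        self.fuse = nn.Conv2d(half * (len(pool_sizes) + 1) + half, channels, 1)
    def forward(self, x):
        main = self.split_main(x)
        spp = torch.cat([main] + [p(main) for p in self.pools], dim=1)
        out = torch.cat([spp, self.split_shortcut(x)], dim=1)  # merge with the shortcut
        return self.fuse(out)

x = torch.randn(1, 512, 13, 13)
print(CSPSPP(512)(x).shape)   # torch.Size([1, 512, 13, 13])
```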
Embodiment 3:
Different from Embodiment 2, in the vehicle target identification method based on the multi-scale YOLO algorithm of this embodiment, the PANet feature fusion step is specifically:
The YOLOv4 algorithm is an improvement built on YOLOv2 and YOLOv3. Unlike two-stage object detection algorithms such as Faster R-CNN, it divides the image into grid cells, each cell being responsible for the corresponding objects; it supports multi-class object detection and achieves a higher detection speed while maintaining accuracy. The network structure of YOLOv3, shown in Fig. 8, is an end-to-end real-time object detection framework that mainly consists of the Darknet-53 backbone feature extraction network and a multi-scale feature fusion prediction network.
The CSPDarknet-53 network contains a large number of convolution operations. It stacks many residual modules built from 3×3 and 1×1 convolutions and uses 3×3 convolution layers with stride 2 to halve the size of the feature map. In the figure, the Residual inside a rectangular box denotes a residual module, and the n× on the right side of the box is the number of times that residual module is reused. A 3×3 convolution layer with stride 2 is applied after each group of residual modules, five times in total; each use of this convolution layer halves the height and width of the feature map, thereby replacing the pooling layers that traditional convolutional networks use for downsampling. For example, a 256×256 input image undergoes five such halvings in this fully convolutional network, giving a feature map of size 8×8 (1/2 to the fifth power is 1/32). The main characteristic is the use of the residual block: the residual skip-connection structure adds learning paths between the input and output layers and reduces problems such as vanishing gradients that appear when the network is deepened. The convolution layers in the Residual Block alternate between 1×1 and 3×3 kernels; each convolution is followed by a BN (batch normalization) layer that normalizes the input data to zero mean and unit variance, and after the BN layer a Leaky ReLU activation function performs the nonlinear operation, so that the network can be applied to nonlinear models.
In the prediction stage, the feature layers F1, F2 and F3 obtained by the CSPDarknet-53 feature extraction network are fed into the multi-scale prediction network: F3 passes through convolution operations to give coarse-scale feature layer 3, used to detect large-scale targets; feature layer 3 is upsampled, fused with F2 and then convolved to give mid-scale feature layer 2, used to detect mid-scale targets; feature layer 2 is again upsampled, fused with F1 and convolved to give fine-scale feature layer 1, used to detect small-scale targets. This feature pyramid network (FPN) structure allows the algorithm to achieve good detection results on targets of different sizes and scales. Finally, the prediction information of the three feature layers of different scales is combined, and the final detection result is obtained through a non-maximum suppression post-processing algorithm;
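The non-maximum suppression post-processing can be sketched with the standard greedy algorithm; this is a generic implementation, not the patent's specific variant.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop overlapping ones."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the second box overlaps the first and is suppressed
```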
Since more convolution layers mean that more of the input image's feature information is lost, the YOLOv4 network uses K-means clustering to obtain the sizes of the prediction boxes and sets three different prediction boxes for each scale of the feature pyramid network. The nine prediction boxes of different sizes obtained are (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198) and (373×326). The boxes (116×90), (156×198) and (373×326) are applied on the 13×13 coarse-scale feature layer 3 to detect larger targets; the medium-sized boxes (30×61), (62×45) and (59×119) are applied on the 26×26 mid-scale feature layer 2 to detect medium-sized targets; and (10×13), (16×30) and (33×23) are applied on the 52×52 fine-scale feature layer 1 to detect smaller targets;
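The assignment of the nine prediction boxes to the three grids can be written down directly from the text; the small helper function below is hypothetical and only illustrates the lookup.

```python
# The nine anchor sizes from the text, grouped by the prediction grid they serve.
ANCHORS = {
    13: [(116, 90), (156, 198), (373, 326)],   # coarse-scale layer 3: large targets
    26: [(30, 61), (62, 45), (59, 119)],       # mid-scale layer 2: medium targets
    52: [(10, 13), (16, 30), (33, 23)],        # fine-scale layer 1: small targets
}

def anchors_for_grid(grid_size):
    """Return the three anchor boxes used on a given prediction grid."""
    return ANCHORS[grid_size]

print(anchors_for_grid(26))   # [(30, 61), (62, 45), (59, 119)]
```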
The YOLOv4 network is used to improve detection accuracy and speed.
Simulation Experiments:
Experimental environment: the hardware configuration used in the experiments is an Intel(R) Core i7-9700K CPU and an NVIDIA GeForce RTX 2080 Ti graphics card under the Windows operating system; the software environment is CUDA 11.0 and cuDNN 8.0, with the TensorFlow 1.14 deep learning framework. The Adam optimizer is used for training, with an initial learning rate of 0.001, momentum of 0.9, a batch size of 8 and 500 iterations, together with the Mosaic data augmentation technique and DropBlock regularization. The data are randomly split 6:2:2 into training, validation and test sets. Before training, the input images are resized to 608×608 and, following the PASCAL VOC data set format, the bounding-box width, height and center coordinates of the annotations are normalized, so as to reduce the influence of abnormal samples on the data.
During detection, the IoU between the prediction and the actual target is used to judge whether the target position has been successfully predicted; the IoU threshold is set to 0.5, i.e. an IoU greater than 0.5 is recorded as a correct prediction and otherwise as an incorrect one. The mean average precision (mAP) and the recall are used as the accuracy metrics of the network.
P denotes precision, the probability that a detected target is correct among all detected targets, and R denotes recall, the probability of correct detection among all positive samples. Precision and recall are calculated as follows: P = TP / (TP + FP), R = TP / (TP + FN).
TP is the number of positive samples predicted correctly, FP is the number of samples predicted as positive that are actually negative, and FN is the number of samples predicted as negative that are actually positive. In practical applications the network often has to be deployed on mobile devices, so the scale and detection speed of the network cannot be ignored. The scale of the network is determined by the number of parameters or the size of the weights, and the detection speed is given by the FPS (frames per second), defined as the number of images that can be detected per second.
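A short sketch of these evaluation quantities follows, using the definitions above; the counts in the example call are assumed values for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def fps(num_images, total_seconds):
    """Frames per second: images detected per second of inference time."""
    return num_images / total_seconds

print(precision_recall(tp=80, fp=20, fn=10))   # (0.8, 0.888...)
print(fps(500, 12.5))                          # 40.0
```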
Compared with the residual skip-connection scheme of CSPDarknet, the backbone feature extraction network of YOLOv4, the depthwise separable convolution of MobileNetv3 helps to streamline the network structure and reduce the number of training parameters; this work therefore uses the MobileNetv3 structure to replace the CSPDarknet module, strengthening the effective use and transfer of features in the network so that the network learns more feature information and the speed of target detection is improved. Because low-level feature maps carry little semantic information but localize targets accurately, while high-level feature maps carry rich semantic information but localize targets only coarsely, RFBNet is introduced to fuse the network's original receptive fields at multiple levels, and dilated convolution is introduced to enlarge the receptive field and fuse features of different sizes. Inspired by CSPDarknet, the backbone feature extraction network of YOLOv4, this work introduces the CSP structure into SPP: before multi-scale features are fused, the network is split into two parts, and one part of the features passes through a shortcut connection and is merged directly with the features after SPP fusion, reducing the computational cost by 40%. The resulting structure is a fusion of several shallow networks, which do not suffer from vanishing gradients during training and thus accelerate the convergence of the network.
The embodiment disclosed in the present invention is a preferred embodiment, but the invention is not limited thereto; a person of ordinary skill in the art can readily grasp the spirit of the present invention from the above embodiments and make various extensions and changes, all of which fall within the protection scope of the present invention as long as they do not depart from its spirit.
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210806937.2A CN115171074A (en) | 2022-07-08 | 2022-07-08 | Vehicle target identification method based on multi-scale yolo algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210806937.2A CN115171074A (en) | 2022-07-08 | 2022-07-08 | Vehicle target identification method based on multi-scale yolo algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115171074A true CN115171074A (en) | 2022-10-11 |
Family
ID=83493785
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210806937.2A Withdrawn CN115171074A (en) | 2022-07-08 | 2022-07-08 | Vehicle target identification method based on multi-scale yolo algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115171074A (en) |
2022-07-08: application CN202210806937.2A filed (CN); published as CN115171074A; status: withdrawn.
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115937703A (en) * | 2022-11-30 | 2023-04-07 | 南京林业大学 | An Enhanced Feature Extraction Method for Object Detection in Remote Sensing Images |
CN115937703B (en) * | 2022-11-30 | 2024-05-03 | 南京林业大学 | Enhanced feature extraction method for remote sensing image target detection |
CN117853891A (en) * | 2024-02-21 | 2024-04-09 | 广东海洋大学 | Underwater garbage target identification method capable of being integrated on underwater robot platform |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | WW01 | Invention patent application withdrawn after publication | Application publication date: 20221011 |