CN115690704A - LG-CenterNet model-based complex road scene target detection method and device - Google Patents

LG-CenterNet model-based complex road scene target detection method and device

Info

Publication number
CN115690704A
CN115690704A (application number CN202211179337.4A)
Authority
CN
China
Prior art keywords
module
model
feature
target
road scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211179337.4A
Other languages
Chinese (zh)
Other versions
CN115690704B (en)
Inventor
高尚兵
李�杰
胡序洋
李少凡
刘宇
余骥远
陈浩霖
于永涛
张海艳
陈晓兵
李翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202211179337.4A priority Critical patent/CN115690704B/en
Publication of CN115690704A publication Critical patent/CN115690704A/en
Application granted granted Critical
Publication of CN115690704B publication Critical patent/CN115690704B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a complex road scene target detection method and device based on an LG-CenterNet model. An original road image data set is collected and prepared as a data set; the LG-CenterNet network model is constructed with ResNet50 as its backbone for feature extraction, and a level-guided attention mechanism guides features of different levels while enlarging the receptive field of the backbone feature maps. The feature map processed by the level-guide mechanism is input into a Scales Encoder module; a deconvolution module restores the feature pixels; a new feature enhancement module is applied to the restored features to compensate for the feature information lost during pixel restoration; finally, the enhanced feature map is input into a Center points prediction module for road target category recognition and localization. The average recognition accuracy on a self-built complex road scene data set is 86.93%, and the road scene target image detection speed reaches 50 frames/s, meeting the requirements of accurate and real-time road scene detection.

Description

Object detection method and device for complex road scenes based on the LG-CenterNet model

Technical Field

The invention belongs to the fields of semantic segmentation, image processing, and intelligent driving, and in particular relates to a complex road scene object detection method and device based on an LG-CenterNet model.

Background

In recent years, the steady rise in the number of cars has led to frequent traffic accidents, seriously threatening people's lives. With the development of autonomous driving technology, researchers have shifted from passive vehicle safety to active vehicle safety. Automating a vehicle requires advanced technical means to complete part of the driving task, and deep-learning-based intelligent detection of road scene targets is key to active vehicle safety. Current target detection networks mainly rely on a backbone network for feature extraction but give little consideration to low-level multi-scale problems, which can lead to insufficient multi-scale target detection capability.

Summary of the Invention

Purpose of the invention: in view of the poor performance of current complex road scene target detection and the inability of conventional detection methods to meet the requirements of real road environments, the invention provides a complex road scene target detection method and device based on the LG-CenterNet model.

Technical solution: the present invention proposes a complex road scene target detection method based on the LG-CenterNet model, which specifically includes the following steps:

(1) Process images of complex road scenes to obtain road target images containing multiple categories, mark the category and position of each road target in the images, and construct and preprocess a complex road scene data set;

(2) Construct the target detection LG-CenterNet model and train it on the above road target data set to obtain a model S; the LG-CenterNet model comprises a Backbone module, a level-guided attention module, a Scales Encoder module, a deconvolution module, a feature enhancement module, and a Center points prediction module;

(3) Use the trained model S to perform target localization, bounding-box sizing, and category prediction on complex road targets in the form of heatmaps through the Center points prediction module, and display the results on the video or image.

Further, the preprocessing of the road scene data set in step (1) normalizes images of complex road scenes with differing resolutions to a size of 512×512 pixels, and then applies batch normalization, the ReLU activation function, and max pooling to obtain uniformly distributed feature target samples.

Further, step (2) is implemented as follows:

(21) The LG-CenterNet model proposes a new MresneIt50 as the Backbone module. MresneIt50 consists of multiple residual blocks: the feature map extracted by the first 4 residual blocks is denoted E1 with 512 channels; the feature map extracted by the next 6 residual blocks is denoted E2 with 1024 channels; the feature map extracted by the last 3 residual blocks is denoted E3 with 2048 channels;

(22) The feature maps E1, E2, and E3 extracted by the Backbone are input into the level-guided attention module, whose main structure comprises two branches: a global pooling branch and a level-guide branch. The 512-channel feature map E1 is input into the global pooling branch, where a global max pooling layer and an upsampling layer produce EC1; the feature maps E1, E2, and E3 with 512, 1024, and 2048 channels are input into the level-guide branch, where a series of average pooling and convolution operations combined with upsampling produces EC2; EC1 and EC2 are fused element-wise (add) to obtain EC3, which reduces the number of computed parameters;

(23) The extracted EC3 is input into the Scales Encoder module, and a series of convolution and residual block operations yields EC4;

(24) The extracted EC4 is input into the deconvolution module, which consists of 3 deconv groups. Each deconv group's convolution operation progressively enlarges the feature map while reducing the number of channels, yielding a 128×128×64 feature map denoted EC5;

(25) The feature map EC5 is input into the feature enhancement module (P-FEM) for convolution, yielding a 128×128×64 feature map EC6. P-FEM consists of a 3×3 Poly-Scale Convolution, batch normalization, a ReLU activation function, and a Sigmoid activation function; its purpose is to improve the correlation of local information in the feature map and strengthen its ability to express features.

Further, step (3) is implemented as follows:

The Center points prediction module uses the trained model S to classify and predict the input image, generating a heatmap whose scale matches EC6 from the original image. It then determines the position and size of each target and generates the final classified and localized heatmap by computing the heatmap loss, denoted L_k, the target size loss, denoted L_s, and the center-point offset loss, denoted L_f. The overall network loss is:

L_d = L_k + λ_s·L_s + λ_f·L_f

where λ_s = 0.1 and λ_f = 1. For an input image of size 512×512, the feature map generated by the network is H×W×C, and L_k, L_s, and L_f are computed as:

L_k = −(1/N) · Σ_{h,w,c} { (1 − A'_HWC)^α · log(A'_HWC), if A_HWC = 1; (1 − A_HWC)^β · (A'_HWC)^α · log(1 − A'_HWC), otherwise }

L_s = (1/N) · Σ_{k=1}^{N} |s'_pk − s_k|

L_f = (1/N) · Σ_p |O'_p̃ − (p/R − p̃)|, where p̃ = ⌊p/R⌋ is the center point mapped to the output resolution and R is the output stride

Here A_HWC is the ground-truth value of the target annotation in the image, A'_HWC is the predicted value, α and β are 2 and 4 respectively, N is the number of keypoints in the image, s'_pk is the predicted size, s_k is the ground-truth size, and p is the position of the target's center point in the image.

Based on the same inventive concept, the invention also provides a complex road scene object detection device based on the LG-CenterNet model, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the computer program is loaded into the processor, it implements the above complex road scene target detection method based on the LG-CenterNet model.

Beneficial effects: compared with the prior art, the invention has the following benefits: 1. by improving the backbone network of the LG-CenterNet model, MresneIt50 is proposed to strengthen feature extraction; 2. a level-guided attention module is proposed to fuse the feature maps extracted by the backbone network; 3. a new Scales Encoder module and a feature enhancement module are proposed to focus on extracting local features, avoiding the feature loss that occurs in the deconvolution module; 4. the improved LG-CenterNet target detection model raises the mean average precision (mAP) by 5 percentage points over the original CenterNet framework; 5. the invention also achieves high detection accuracy in complex road scenes.

Description of the Drawings

Figure 1 is a flowchart of the complex road scene target detection method based on the LG-CenterNet model;

Figure 2 is a schematic diagram of the LG-CenterNet target detection model proposed by the invention;

Figure 3 is a schematic diagram of the proposed residual block structure Mblock;

Figure 4 is a schematic diagram of the level-guided attention module;

Figure 5 is a schematic diagram of the Scales Encoder module;

Figure 6 is a schematic diagram of the feature enhancement module;

Figure 7 shows detection results obtained with the LG-CenterNet target detection model.

Detailed Description

The present invention is described in further detail below in conjunction with the accompanying drawings.

This embodiment involves a number of variables, which are described in Table 1.

Table 1. Variable descriptions

S: 3×3 convolution kernel with 1024 channels
E1: feature map extracted by the 4 residual blocks in the Backbone module
E2: feature map extracted by the 6 residual blocks in the Backbone module
E3: feature map extracted by the 3 residual blocks in the Backbone module
EC1: feature map obtained from E1 through the global pooling branch
EC2: feature map obtained from E1, E2, and E3 through the level-guide branch
EC3: feature map obtained by add-fusing EC1 and EC2
EC4: feature map obtained from EC3 through the Scales Encoder module
EC5: feature map obtained from EC4 through the deconvolution module
EC6: feature map obtained from EC5 through the feature enhancement module

The invention provides a complex road scene target detection method based on the LG-CenterNet model. Images of different road scene targets are collected and labeled to produce a complex road scene data set. The proposed MresneIt50 serves as the backbone network for feature extraction; the feature maps of different scales extracted by the backbone are input into the level-guided attention module, after which multiple receptive-field features are obtained through the Scales Encoder module; the deconvolution module then restores the feature pixels, and a feature enhancement module built on Poly-Scale Convolution (PSConv) improves the information correlation of local features. Finally, the Center points prediction module predicts the target's center position, the scale of the prediction box, and the center-point offset, while identifying the target category. As shown in Figure 1, the method specifically includes the following steps:

Step 1: Process images of complex road scenes, obtain road target images containing multiple categories, preprocess them, and mark the category and position of each road target to construct a complex road scene data set.

Preprocessing the road scene data set mainly consists of normalizing images of complex road scenes with differing resolutions to a size of 512×512 pixels, and then applying batch normalization, the ReLU activation function, and max pooling so that the target samples are distributed relatively uniformly across the image.
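The preprocessing step above can be sketched as follows. The function name, the nearest-neighbor resize, and the per-channel standardization (a stand-in for the batch normalization that the network applies internally) are illustrative assumptions, not the patent's exact pipeline:

```python
import numpy as np

def preprocess(img: np.ndarray, size: int = 512) -> np.ndarray:
    """Resize an H x W x C uint8 image to size x size (nearest neighbor)
    and standardize each channel to zero mean / unit variance."""
    h, w, _ = img.shape
    ys = (np.arange(size) * h) // size   # nearest-neighbor row indices
    xs = (np.arange(size) * w) // size   # nearest-neighbor column indices
    out = img[ys][:, xs].astype(np.float32) / 255.0
    mean = out.mean(axis=(0, 1), keepdims=True)
    std = out.std(axis=(0, 1), keepdims=True) + 1e-6
    return (out - mean) / std
```

Any input resolution is mapped to the fixed 512×512 size the network expects.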

Step 2: Construct the target detection LG-CenterNet model, whose structure is shown in Figure 2, and train it on the above road target data set to obtain model S. The LG-CenterNet network mainly comprises a Backbone module, a level-guided attention module (Levels guide attention, LGA), a Scales Encoder module, a deconvolution module, a feature enhancement module (P-Feature enhancement module, P-FEM), and a Center points prediction module.

(21) The LG-CenterNet model proposes a new MresneIt50 as the Backbone module. MresneIt50 consists of multiple residual blocks Mblock, whose structure is shown in Figure 3. The feature map extracted by the first 4 residual blocks is denoted E1 with 512 channels; the feature map extracted by the next 6 residual blocks is denoted E2 with 1024 channels; the feature map extracted by the last 3 residual blocks is denoted E3 with 2048 channels.
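The spatial shapes of E1, E2, and E3 for a 512×512 input can be tabulated with a small helper. The stride assignment (E1 at stride 8, E2 at 16, E3 at 32) is an assumption based on standard ResNet50 downsampling; the patent only specifies the channel counts:

```python
def stage_shapes(input_size: int = 512) -> dict:
    """Spatial size and channel count of E1/E2/E3 under assumed
    ResNet50-style strides (E1 at stride 8, E2 at 16, E3 at 32)."""
    stages = {"E1": (8, 512), "E2": (16, 1024), "E3": (32, 2048)}
    return {name: (input_size // s, input_size // s, c)
            for name, (s, c) in stages.items()}
```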

(22) The feature maps E1, E2, and E3 extracted by the Backbone are input into the level-guided attention module (LGA), whose structure is shown in Figure 4. It comprises two branches: a global pooling branch and a level-guide branch. The 512-channel feature map E1 is input into the global pooling branch, where a global max pooling layer and an upsampling layer produce EC1. The feature maps E1, E2, and E3 with 512, 1024, and 2048 channels are input into the level-guide branch, where a series of average pooling and convolution operations combined with upsampling produces EC2. EC1 and EC2 are fused element-wise (add) to obtain EC3, which reduces the number of computed parameters.
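A minimal numpy sketch of the two LGA operations that the text fully specifies: the global pooling branch (global max pool broadcast back to the original spatial size) and the element-wise add fusion. The level-guide branch's pooling/convolution stack is omitted here, and the function names are illustrative:

```python
import numpy as np

def global_pool_branch(e1: np.ndarray) -> np.ndarray:
    """Global max pooling over the spatial dims of an H x W x C map,
    then broadcast ('upsample') back to E1's spatial size."""
    pooled = e1.max(axis=(0, 1), keepdims=True)        # 1 x 1 x C
    return np.broadcast_to(pooled, e1.shape).copy()    # H x W x C

def lga_fuse(ec1: np.ndarray, ec2: np.ndarray) -> np.ndarray:
    """Element-wise add fusion of the two branch outputs, giving EC3."""
    return ec1 + ec2
```

Using add rather than channel concatenation keeps the fused map at the same channel count, which is the parameter saving the text refers to.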

(23) The extracted EC3 is input into the Scales Encoder module, whose structure is shown in Figure 5; a series of convolution and residual block operations yields EC4.

(24) The extracted EC4 is input into the deconvolution module, which consists of 3 deconv groups. Each deconv group's convolution operation progressively enlarges the feature map while reducing the number of channels, yielding a 128×128×64 feature map denoted EC5.
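The size arithmetic of the three deconv groups can be checked with the standard transposed-convolution output formula. The kernel/stride/padding values (4, 2, 1) are a common 2× upsampling configuration assumed here, not stated in the patent; with them, a 16×16 map grows to the 128×128 EC5:

```python
def deconv_out(size: int, kernel: int = 4, stride: int = 2, pad: int = 1) -> int:
    """Output spatial size of one transposed convolution:
    out = (in - 1) * stride - 2 * pad + kernel."""
    return (size - 1) * stride - 2 * pad + kernel

def after_three_groups(size: int) -> int:
    """Apply the three deconv groups of the deconvolution module."""
    for _ in range(3):
        size = deconv_out(size)
    return size
```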

(25) The feature map EC5 is input into P-FEM for convolution, yielding a 128×128×64 feature map EC6. P-FEM consists of a 3×3 Poly-Scale Convolution (PSConv), batch normalization, a ReLU activation function, and a Sigmoid activation function; its purpose is to improve the correlation of local information in the feature map and strengthen its ability to express features. The P-FEM structure is shown in Figure 6.
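One plausible arrangement of the BN/ReLU/Sigmoid chain is sketched below in numpy. The patent does not specify how the Sigmoid output is combined with the activation, so the gating (multiplying the Sigmoid response back onto the ReLU activation to re-weight local responses) and the identity stand-in for PSConv are assumptions:

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def feature_enhance(x: np.ndarray) -> np.ndarray:
    """Toy P-FEM chain: standardization as a stand-in for batch norm,
    ReLU, then a Sigmoid gate multiplied onto the activation."""
    z = (x - x.mean()) / (x.std() + 1e-6)  # stand-in for BN
    a = np.maximum(z, 0.0)                 # ReLU
    return a * sigmoid(z)                  # assumed Sigmoid gating
```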

Step 3: Use the trained model S to perform target localization, bounding-box sizing, and category prediction on road scene targets in the form of heatmaps through the Center points prediction module, and display the results on the video or image.

The Center points prediction module uses the trained model S to classify and predict the input image, generating a heatmap whose scale matches EC6 from the original image. It then determines the position and size of each target and generates the final classified and localized heatmap by computing the heatmap loss, denoted L_k, the target size loss, denoted L_s, and the center-point offset loss, denoted L_f. The overall network loss is L_d.

L_d = L_k + λ_s·L_s + λ_f·L_f

where λ_s = 0.1 and λ_f = 1. For an input image of size 512×512, the feature map generated by the network is H×W×C, and L_k, L_s, and L_f are computed as:

L_k = −(1/N) · Σ_{h,w,c} { (1 − A'_HWC)^α · log(A'_HWC), if A_HWC = 1; (1 − A_HWC)^β · (A'_HWC)^α · log(1 − A'_HWC), otherwise }

L_s = (1/N) · Σ_{k=1}^{N} |s'_pk − s_k|

L_f = (1/N) · Σ_p |O'_p̃ − (p/R − p̃)|, where p̃ = ⌊p/R⌋ is the center point mapped to the output resolution and R is the output stride

Here A_HWC is the ground-truth value of the target annotation in the image, A'_HWC is the predicted value, α and β are 2 and 4 respectively, N is the number of keypoints in the image, s'_pk is the predicted size, s_k is the ground-truth size, and p is the position of the target's center point in the image.
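The total loss and the two losses whose terms are fully defined above can be sketched in numpy. The implementation follows the CenterNet-style keypoint focal loss and L1 size loss; function names are illustrative, and the offset loss is omitted because its symbols are not fully specified in the text:

```python
import numpy as np

def focal_loss(A, A_pred, alpha=2.0, beta=4.0, eps=1e-6):
    """Keypoint focal loss L_k over an H x W x C heatmap. A holds the
    ground-truth heatmap, A_pred the predicted one."""
    n = max(int((A == 1).sum()), 1)      # number of keypoints N
    pos = (A == 1)
    p = np.clip(A_pred, eps, 1 - eps)
    pos_loss = ((1 - p) ** alpha * np.log(p))[pos].sum()
    neg_loss = ((1 - A) ** beta * p ** alpha * np.log(1 - p))[~pos].sum()
    return -(pos_loss + neg_loss) / n

def size_loss(s_pred, s_true):
    """L1 size regression loss L_s averaged over the N keypoints."""
    s_pred, s_true = np.asarray(s_pred, float), np.asarray(s_true, float)
    return np.abs(s_pred - s_true).sum() / max(len(s_true), 1)

def total_loss(l_k, l_s, l_f, lam_s=0.1, lam_f=1.0):
    """L_d = L_k + lambda_s * L_s + lambda_f * L_f."""
    return l_k + lam_s * l_s + lam_f * l_f
```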

Based on the same inventive concept, the invention also provides a complex road scene object detection device based on the LG-CenterNet model, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the computer program is loaded into the processor, it implements the above complex road scene target detection method based on the LG-CenterNet model.

The self-built complex scene data set is trained with the LG-CenterNet network to obtain a model that can recognize targets in complex scenes, and model performance is verified on the validation set of the data set, as shown in Figure 7. The invention achieves an average recognition accuracy of 86.93% on the self-built complex road scene data set, and the road scene target image detection speed reaches 50 frames/s, meeting the requirements of accurate and real-time road scene detection.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

mAP = (1/n) · Σ_{i=1}^{n} AP_i

FPS = 1 / t

Here Precision is the precision, Recall is the recall rate, AP is the average precision of a class, mAP is the mean average precision, FPS is the frame rate, and t is the time to detect a single image. The data set contains multiple sample categories (such as car, person, etc.), and n denotes the number of categories. TP (True Positives) is the number of positive samples identified as positive (i.e., car samples identified as car); TN (True Negatives) is the number of negative samples identified as negative; FP (False Positives) is the number of negative samples identified as positive (the sample is not a car but the model identifies it as a car); FN (False Negatives) is the number of positive samples identified as negative.
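The evaluation quantities above reduce to a few one-line functions (assuming the per-class AP values are already computed):

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0

def mean_ap(ap_per_class) -> float:
    """mAP: mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)

def fps(t_single: float) -> float:
    """Frames per second from the per-image detection time t (seconds)."""
    return 1.0 / t_single
```

Note that the reported 50 frames/s corresponds to a per-image detection time of 0.02 s.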

The embodiments of the present invention have been described in detail above in conjunction with the accompanying drawings, but the invention is not limited to these embodiments; various changes may be made without departing from the gist of the invention within the scope of knowledge possessed by those of ordinary skill in the art.

Claims (5)

1. A complex road scene target detection method based on an LG-CenterNet model is characterized by comprising the following steps:
(1) Processing images of complex road scenes to obtain road target images containing various categories, marking the categories and positions of road targets in the images, constructing a complex road scene data set and preprocessing the complex road scene data set;
(2) Constructing an LG-CenterNet model for target detection, and training the road target data set through the LG-CenterNet model to obtain a model S; the LG-CenterNet model comprises a Backbone module, a level-guided attention module, a Scales Encoder module, a deconvolution module, a feature enhancement module and a Center points prediction module;
(3) Performing target localization, bounding-box sizing and category prediction on the complex road target in the form of heatmaps by using the trained model S through the Center points prediction module, and displaying the obtained result on a video or an image to show the corresponding effect.
2. The LG-CenterNet model-based complex road scene target detection method according to claim 1, wherein the preprocessing of the road scene data set in step (1) normalizes the images of complex road scenes with differing resolutions to a size of 512 × 512 pixels, and obtains uniformly distributed feature target samples through batch normalization, the ReLU activation function and max pooling.
3. The LG-CenterNet model-based complex road scene target detection method according to claim 1, wherein the step (2) is realized by the following steps:
(21) A new MresneIt50 is proposed in the LG-CenterNet model to serve as the Backbone module; MresneIt50 is composed of a plurality of residual blocks, the feature map extracted by the first 4 residual blocks is denoted E1 with 512 channels, the feature map extracted by the next 6 residual blocks is denoted E2 with 1024 channels, and the feature map extracted by the last 3 residual blocks is denoted E3 with 2048 channels;
(22) The feature maps E1, E2 and E3 extracted by the Backbone are input into the level-guided attention module, whose main structure comprises two branches: a global pooling branch and a level-guide branch; the feature map E1 with 512 channels is input into the global pooling branch, and EC1 is obtained through a global max pooling layer and an upsampling layer; the feature maps E1, E2 and E3 with 512, 1024 and 2048 channels are input into the level-guide branch, and EC2 is obtained through a series of average pooling and convolution operations combined with upsampling; EC1 and EC2 are fused element-wise (add) to obtain EC3, thereby reducing calculation parameters;
(23) Inputting the extracted EC3 into a Scales Encoder module, and carrying out a series of convolution and residual module operations to obtain EC4;
(24) Inputting the extracted EC4 into a deconvolution module, wherein the deconvolution module consists of 3 deconv groups; the convolution operation of each deconv group progressively enlarges the feature map while reducing the number of channels, yielding a feature map with dimensions 128 × 128 × 64, denoted EC5;
(25) Inputting the feature map EC5 into a feature enhancement module for convolution to obtain a feature map EC6 with dimensions 128 × 128 × 64, wherein the P-FEM is composed of a 3 × 3 Poly-Scale Convolution, batch normalization, a ReLU activation function and a Sigmoid activation function, and is mainly used for improving the correlation of local information in the feature map and enhancing its ability to express features.
4. The LG-CenterNet model-based complex road scene target detection method according to claim 1, characterized in that the step (3) is realized by the following steps:
the Center points prediction module classifies and predicts the input image through the trained model S, generates a heatmap whose scale is consistent with the size of EC6 from the original image, and then determines the position and size of the target and generates the final classified and localized heatmap by computing the heatmap loss, denoted L_k, the target size loss, denoted L_s, and the center-point offset loss, denoted L_f; the overall network loss is:
L d =L ks L sf L f
wherein λ_s = 0.1 and λ_f = 1; for an input image of size 512 × 512, the feature map generated by the network is H × W × C, and L_k, L_s and L_f are computed respectively as:
L_k = −(1/N) · Σ_{h,w,c} { (1 − A'_HWC)^α · log(A'_HWC), if A_HWC = 1; (1 − A_HWC)^β · (A'_HWC)^α · log(1 − A'_HWC), otherwise }
L_s = (1/N) · Σ_{k=1}^{N} |s'_pk − s_k|
L_f = (1/N) · Σ_p |O'_p̃ − (p/R − p̃)|, where p̃ = ⌊p/R⌋ is the center point mapped to the output resolution and R is the output stride
wherein A_HWC is the ground-truth value of the target annotation in the image, A'_HWC is the predicted value of the image, α and β are 2 and 4 respectively, N is the number of keypoints in the image, s'_pk is the predicted size, s_k is the ground-truth size, and p is the position of the center point of the object in the image.
5. An LG-CenterNet model-based complex road scene target detection device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the LG-CenterNet model-based complex road scene target detection method according to any one of claims 1 to 4.
CN202211179337.4A 2022-09-27 2022-09-27 Object detection method and device for complex road scenes based on LG-CenterNet model Active CN115690704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211179337.4A CN115690704B (en) 2022-09-27 2022-09-27 Object detection method and device for complex road scenes based on LG-CenterNet model


Publications (2)

Publication Number Publication Date
CN115690704A true CN115690704A (en) 2023-02-03
CN115690704B CN115690704B (en) 2023-08-22

Family

ID=85063352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211179337.4A Active CN115690704B (en) 2022-09-27 2022-09-27 Object detection method and device for complex road scenes based on LG-CenterNet model

Country Status (1)

Country Link
CN (1) CN115690704B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690165A (en) * 2024-02-02 2024-03-12 四川泓宝润业工程技术有限公司 Method and device for detecting personnel passing between drill rod and hydraulic pliers

Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717537A (en) * 2018-05-30 2018-10-30 淮阴工学院 Complex-scene face recognition method and system based on pattern recognition
CN110543895A (en) * 2019-08-08 2019-12-06 淮阴工学院 An Image Classification Method Based on VGGNet and ResNet
CN111382714A (en) * 2020-03-13 2020-07-07 Oppo广东移动通信有限公司 Image detection method, device, terminal and storage medium
CN111709895A (en) * 2020-06-17 2020-09-25 中国科学院微小卫星创新研究院 Blind image deblurring method and system based on attention mechanism
CN111814889A (en) * 2020-07-14 2020-10-23 大连理工大学人工智能大连研究院 A One-Stage Object Detection Method Using Anchor-Free Module and Boosted Classifier
CN111932553A (en) * 2020-07-27 2020-11-13 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism
CN112329800A (en) * 2020-12-03 2021-02-05 河南大学 Salient object detection method based on global information guiding residual attention
CN112580443A (en) * 2020-12-02 2021-03-30 燕山大学 Pedestrian detection method based on embedded device improved CenterNet
CN112686207A (en) * 2021-01-22 2021-04-20 北京同方软件有限公司 Urban street scene target detection method based on regional information enhancement
CN112700444A (en) * 2021-02-19 2021-04-23 中国铁道科学研究院集团有限公司铁道建筑研究所 Bridge bolt detection method based on self-attention and central point regression model
CN113378815A (en) * 2021-06-16 2021-09-10 南京信息工程大学 Model for scene text positioning recognition and training and recognition method thereof
CN113408498A (en) * 2021-08-05 2021-09-17 广东众聚人工智能科技有限公司 Crowd counting system and method, equipment and storage medium
CN113657326A (en) * 2021-08-24 2021-11-16 陕西科技大学 A Weed Detection Method Based on Multiscale Fusion Module and Feature Enhancement
WO2021244621A1 (en) * 2020-06-04 2021-12-09 华为技术有限公司 Scenario semantic parsing method based on global guidance selective context network
CN114359153A (en) * 2021-12-07 2022-04-15 湖北工业大学 Insulator defect detection method based on improved CenterNet
CN114419589A (en) * 2022-01-17 2022-04-29 东南大学 A road target detection method based on attention feature enhancement module
CN114581866A (en) * 2022-01-24 2022-06-03 江苏大学 Multi-target visual detection algorithm for automatic driving scene based on improved CenterNet
CN114638836A (en) * 2022-02-18 2022-06-17 湖北工业大学 An urban streetscape segmentation method based on highly effective driving and multi-level feature fusion
US20220204035A1 (en) * 2020-12-28 2022-06-30 Hyundai Mobis Co., Ltd. Driver management system and method of operating same
US20220237403A1 (en) * 2021-01-28 2022-07-28 Salesforce.Com, Inc. Neural network based scene text recognition
CN114863368A (en) * 2022-07-05 2022-08-05 城云科技(中国)有限公司 Multi-scale target detection model and method for road damage detection
CN115035361A (en) * 2022-05-11 2022-09-09 中国科学院声学研究所南海研究站 Target detection method and system based on attention mechanism and feature cross fusion


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YU FANGCHENG et al.: "Small object detection for autonomous driving based on improved CenterNet", HTTP://KNS.CNKI.NET/KCMS/DETAIL/11.2175.TN.20220719.1838.026.HTML, pages 1 - 8 *
YU FANGCHENG et al.: "Small object detection for autonomous driving based on improved CenterNet", 《电子测量技术》 (Electronic Measurement Technology), vol. 45, no. 15, pages 115 - 122 *
CHENG YI et al.: "Improved CenterNet traffic sign detection algorithm", 《信号处理》 (Journal of Signal Processing), vol. 38, no. 3, pages 511 - 518 *


Also Published As

Publication number Publication date
CN115690704B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
Wang et al. FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN108492319B (en) Moving target detection method based on deep full convolution neural network
WO2021155792A1 (en) Processing apparatus, method and storage medium
CN108280397B (en) Human body image hair detection method based on deep convolutional neural network
CN110569738B (en) Natural scene text detection method, equipment and medium based on densely connected network
CN110175613A (en) Street view image semantic segmentation method based on Analysis On Multi-scale Features and codec models
CN111340123A (en) Image score label prediction method based on deep convolutional neural network
CN107944020A (en) Facial image lookup method and device, computer installation and storage medium
CN111210446B (en) Video target segmentation method, device and equipment
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN116012709B (en) High-resolution remote sensing image building extraction method and system
CN113936195B (en) Sensitive image recognition model training method and device and electronic equipment
CN113033454B (en) A detection method for building changes in urban video cameras
CN114266794A (en) Pathological section image cancer region segmentation system based on full convolution neural network
CN116665054A (en) Remote sensing image small target detection method based on improved YOLOv3
CN117726575A (en) Medical image segmentation method based on multi-scale feature fusion
CN114743257A (en) Method for detecting and identifying image target behaviors
CN116503726A (en) Multi-scale light smoke image segmentation method and device
CN115690704B (en) Object detection method and device for complex road scenes based on LG-CenterNet model
CN110287970B (en) A Weakly Supervised Object Localization Method Based on CAM and Masking
Li et al. Incremental learning of infrared vehicle detection method based on SSD
CN103927517B (en) Motion detection method based on human body global feature histogram entropies
CN114898290A (en) Real-time detection method and system for marine ship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20230203

Assignee: Jiangsu Kesheng Xuanyi Technology Co.,Ltd.

Assignor: HUAIYIN INSTITUTE OF TECHNOLOGY

Contract record no.: X2023980048436

Denomination of invention: Method and device for complex road scene object detection based on LG CenterNet model

Granted publication date: 20230822

License type: Common License

Record date: 20231129