CN115690704B

CN115690704B - LG-CenterNet model-based complex road scene target detection method and device

Info

Publication number: CN115690704B
Application number: CN202211179337.4A
Authority: CN
Inventors: 高尚兵; 李�杰; 胡序洋; 李少凡; 刘宇; 余骥远; 陈浩霖; 于永涛; 张海艳; 陈晓兵; 李翔
Original assignee: Huaiyin Institute of Technology
Current assignee: Huaiyin Institute of Technology
Priority date: 2022-09-27
Filing date: 2022-09-27
Publication date: 2023-08-22
Anticipated expiration: 2042-09-27
Also published as: CN115690704A

Abstract

The invention discloses a complex road scene target detection method and device based on an LG-CenterNet model. Original road images are collected to build a dataset, and an LG-CenterNet network model is constructed. Mresneit50 is used as the backbone to extract features, and a hierarchical guided attention mechanism guides the features of different levels, fusing the feature maps of different scales from the backbone network to enlarge the receptive field. The feature map processed by the hierarchical guidance mechanism is input into a Scales Encoder module; a deconvolution module then restores the feature pixels; a new feature enhancement module processes the restored features to address the loss of feature information during pixel restoration; finally, the enhanced feature map is input to a Center points prediction module for road target category identification and localization. On the self-built complex road scene dataset, the average recognition precision is 86.93% and the detection speed reaches 50 frames/s, which meets the requirements of accurate, real-time detection in road scenes.

Description

LG-CenterNet model-based complex road scene target detection method and device

Technical Field

The invention belongs to the fields of semantic segmentation, image processing and intelligent driving, and particularly relates to a complex road scene target detection method and device based on an LG-CenterNet model.

Background

The steady increase in the number of automobiles in recent years has led to frequent traffic accidents, which seriously threaten people's safety. With the development of automatic driving technology, researchers have shifted from passive safety technology to active safety technology for automobiles. Advanced technical means are needed to automate the automobile so that it can complete part of the driving task. Intelligent detection of road scene targets with deep learning is a key to active safety technology. Current target detection networks mainly extract features through a backbone network but give little consideration to the underlying multi-scale problem, which may result in insufficient multi-scale target detection capability.

Disclosure of Invention

Purpose of the invention: existing complex road scene target detection performs poorly in application, and conventional detection methods cannot meet the detection requirements of the actual road environment. To address this, a complex road scene target detection method and device based on the LG-CenterNet model are provided.

The technical scheme is as follows: the invention provides a complex road scene target detection method based on an LG-CenterNet model, which specifically comprises the following steps:

(1) Processing the image of the complex road scene to obtain a road target image containing various categories, marking the categories and positions of the road targets in the image, constructing a complex road scene data set and preprocessing the complex road scene data set;

(2) Constructing a target detection LG-CenterNet model, and training the LG-CenterNet model on the road target data set to obtain a model S; the LG-CenterNet model comprises a Backbone module, a hierarchical guided attention module, a Scales Encoder module, a deconvolution module, a feature enhancement module and a Center points prediction module;

(3) Using the trained model S, performing target localization, bounding-box size prediction and category prediction on complex road targets by means of heatmaps through the Center points prediction module, and displaying the obtained results on the video or image.
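As a rough illustration of heatmap-based localization in step (3), the sketch below picks local maxima from a single-class heatmap and maps one peak back to image coordinates. It is a minimal sketch, not the patented implementation: the 3 × 3 neighbourhood test, the score threshold, the stride of 4 (512-pixel input, 128-cell heatmap) and the convention that size and offset are expressed in heatmap cells are all assumptions, and the names `find_peaks` and `decode_box` are hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # one class channel, H x W scores in [0, 1]


def find_peaks(heatmap: Heatmap, threshold: float = 0.3) -> List[Tuple[int, int, float]]:
    """Return (row, col, score) for cells that are maxima of their 3x3 neighbourhood."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(height):
        for c in range(width):
            score = heatmap[r][c]
            if score < threshold:
                continue  # assumed confidence threshold, not from the source
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            ]
            if score >= max(neighbours):
                peaks.append((r, c, score))
    return peaks


def decode_box(row: int, col: int, size: Tuple[float, float],
               offset: Tuple[float, float], stride: int = 4) -> Tuple[float, float, float, float]:
    """Map a peak, its predicted (w, h) and (dx, dy), to (x1, y1, x2, y2) in input pixels."""
    w, h = size          # assumed to be in heatmap cells
    dx, dy = offset      # sub-cell correction of the center point
    cx, cy = (col + dx) * stride, (row + dy) * stride
    half_w, half_h = w * stride / 2, h * stride / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


heat = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.2],
        [0.1, 0.2, 0.1]]
peaks = find_peaks(heat)
box = decode_box(1, 1, size=(2.0, 1.0), offset=(0.5, 0.25))
```

Each class channel would be processed the same way, and the channel index of a peak gives the predicted category.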

Further, the preprocessing of the road scene data set in step (1) normalizes the complex road scene images of differing resolutions to a uniform size of 512 × 512 pixels, and obtains uniformly distributed feature target samples through batch normalization, a ReLU activation function and a max pooling operation.
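The size normalization can be pictured with the following minimal sketch. It only covers resizing to 512 × 512 and scaling pixel values; the nearest-neighbour interpolation, the division by 255 and the single-channel image layout are assumptions made for illustration, and the batch normalization, ReLU and max pooling mentioned above are left out because they would normally be layers of the network itself.

```python
from typing import List

Image = List[List[float]]  # single-channel image, rows of pixel values in 0..255


def resize_nearest(image: Image, out_h: int = 512, out_w: int = 512) -> Image:
    """Resize by nearest-neighbour sampling (an assumed interpolation choice)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def scale_to_unit(image: Image) -> Image:
    """Scale 0..255 pixel values to the range 0..1."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def preprocess(image: Image, size: int = 512) -> Image:
    """Bring an image of arbitrary resolution to size x size with values in 0..1."""
    return scale_to_unit(resize_nearest(image, size, size))


tiny = [[0, 255],
        [255, 0]]
out = preprocess(tiny)
```

In practice an image library would do the resizing with bilinear interpolation; the point here is only that every image reaches the network with the same 512 × 512 shape.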

Further, the implementation process of the step (2) is as follows:

(21) The LG-CenterNet model proposes a new Mresneit50 as the Backbone module. Mresneit50 consists of several residual blocks: the feature map extracted by the 4 residual blocks is denoted E1 and has 512 channels; the feature map extracted by the 6 residual blocks is denoted E2 and has 1024 channels; the feature map extracted by the 3 residual blocks is denoted E3 and has 2048 channels;

(22) The feature maps E1, E2 and E3 extracted by the Backbone are input into the hierarchical guided attention module, whose main structure comprises two branches, a global pooling branch and a hierarchical guided branch. The feature map E1 with 512 channels is input into the global pooling branch, and EC1 is obtained through a global max pooling layer and an up-sampling layer; the feature maps E1, E2 and E3 with 512, 1024 and 2048 channels are input into the hierarchical guided branch, and EC2 is obtained through a series of average pooling and convolution operations combined with up-sampling; EC1 and EC2 are fused by element-wise addition (add) to obtain EC3, which reduces the number of computed parameters;

(23) The extracted EC3 is input into the Scales Encoder module, where a series of convolution and residual module operations produces EC4;

(24) The extracted EC4 is input into the deconvolution module, which consists of 3 deconv groups. The convolution operations of each deconv group successively enlarge the feature map while reducing the number of channels, yielding a feature map of dimension 128 × 64, denoted EC5;

(25) The feature map EC5 is input into the feature enhancement module (P-FEM) for convolution, yielding a feature map EC6 of size 128 × 64. The P-FEM consists of a 3 × 3 Poly-Scale Convolution, batch normalization, a ReLU activation function and a Sigmoid activation function; its main purpose is to improve the correlation of local information in the feature map and to strengthen the feature map's ability to express features.
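To make the shape bookkeeping of the deconvolution module in step (24) concrete, the sketch below tracks a feature map through three deconv groups. Several values are assumptions rather than figures from this document: that each group doubles the height and width, that the backbone reduces a 512 × 512 input by a factor of 32 (giving 16 × 16), that EC4 keeps 2048 channels, and that the groups output 256, 128 and 64 channels as in the original CenterNet. Under those assumptions the result is a 128 × 128 map with 64 channels, which is one reading of the "128 × 64" figure above.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def deconv_shapes(input_shape: Shape, out_channels: List[int]) -> List[Shape]:
    """Track the feature-map shape through deconv groups that each double H and W."""
    h, w, _ = input_shape
    shapes = []
    for channels in out_channels:
        h, w = h * 2, w * 2  # assumed stride-2 transposed convolution per group
        shapes.append((h, w, channels))
    return shapes


# Assumed EC4 shape: 512 x 512 input reduced 32x by the backbone, 2048 channels.
ec4_shape = (512 // 32, 512 // 32, 2048)
stages = deconv_shapes(ec4_shape, [256, 128, 64])  # assumed channel schedule
ec5_shape = stages[-1]
```

The spatial size grows 16 → 32 → 64 → 128 while the channel count shrinks, which is the "enlarge the map, reduce the channels" behaviour described for the three deconv groups.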

Further, the implementation process of the step (3) is as follows:

Using the trained model S, the Center points prediction module classifies the input image and generates from the original image a heatmap whose scale matches the size of EC6. It then calculates the heatmap loss, denoted $L_k$, the loss of the target length and width, denoted $L_s$, and the loss of the center point offset, denoted $L_f$, to determine the position and size of the target and to generate the final heatmap carrying classification and localization; the overall network loss is:

$$L_d = L_k + \lambda_s L_s + \lambda_f L_f$$

where $\lambda_s = 0.1$ and $\lambda_f = 1$; for an input image of size 512 × 512, the feature map generated by the network is H × W × C, and $L_k$, $L_s$ and $L_f$ are calculated respectively as follows:

where $A_{HWC}$ is the true value of the target annotation in the image, $A'_{HWC}$ is the predicted value for the image, $\alpha$ and $\beta$ are 2 and 4 respectively, N is the number of key points in the image, $s'_{pk}$ is the predicted size, $s_k$ is the true size, and p is the position of the target center point in the image.
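The formula images themselves did not survive in this text. As a hedged reconstruction, assuming the three losses follow the original CenterNet framework, whose notation the symbol definitions above match, they would read as below. The predicted offset $\hat{O}_{\tilde{p}}$, the down-sampling factor $R$ and the low-resolution center $\tilde{p} = \lfloor p / R \rfloor$ are borrowed from that framework and do not appear in this document.

```latex
L_k = -\frac{1}{N} \sum_{HWC}
\begin{cases}
\left(1 - A'_{HWC}\right)^{\alpha} \log\left(A'_{HWC}\right), & A_{HWC} = 1 \\
\left(1 - A_{HWC}\right)^{\beta} \left(A'_{HWC}\right)^{\alpha} \log\left(1 - A'_{HWC}\right), & \text{otherwise}
\end{cases}

L_s = \frac{1}{N} \sum_{k=1}^{N} \left| s'_{pk} - s_k \right|

L_f = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{R} - \tilde{p} \right) \right|
```

With $\lambda_s = 0.1$ and $\lambda_f = 1$, the size term contributes one tenth of its raw value to $L_d$, so the heatmap and offset terms dominate the total.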

Based on the same inventive concept, the invention also provides a complex road scene target detection device based on the LG-CenterNet model, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the complex road scene target detection method based on the LG-CenterNet model when being loaded to the processor.

The beneficial effects are as follows. Compared with the prior art: 1. by improving the backbone network of the LG-CenterNet model, the proposed Mresneit50 strengthens feature extraction; 2. a hierarchical guided attention module is provided to fuse the feature maps extracted by the backbone network; 3. a new Scales Encoder module and a feature enhancement module are proposed that focus on extracting local features, which avoids the loss of features in the deconvolution module; 4. the mean average precision (mAP) of the improved LG-CenterNet target detection model is 5 percentage points higher than that of the original CenterNet framework; 5. the invention achieves higher detection precision in complex road scenes.

Drawings

FIG. 1 is a flow chart of a complex road scene object detection method based on the LG-CenterNet model;

FIG. 2 is a schematic diagram of an LG-CenterNet-based target detection model according to the present invention;

FIG. 3 is a schematic diagram of the residual block structure Mblock proposed by the present invention;

FIG. 4 is a schematic diagram of a hierarchical guided attention model structure;

FIG. 5 is a schematic diagram of the structure of the Scales Encoder module;

FIG. 6 is a schematic diagram of a feature enhancement module architecture;

FIG. 7 is a graph showing the detection effect obtained by using the LG-CenterNet target detection model.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

This embodiment involves a large number of variables, which are described in Table 1.

Table 1 Variable descriptions

Variable	Description
S	3 × 3 convolution kernel with 1024 channels
E1	Feature map extracted by the 4 residual blocks in the Backbone module
E2	Feature map extracted by the 6 residual blocks in the Backbone module
E3	Feature map extracted by the 3 residual blocks in the Backbone module
EC1	Feature map obtained from E1 through the global pooling branch
EC2	Feature map obtained from E1, E2 and E3 through the hierarchical guided branch
EC3	Feature map obtained by adding EC1 and EC2
EC4	Feature map obtained from EC3 through the Scales Encoder module
EC5	Feature map obtained from EC4 through the deconvolution module
EC6	Feature map obtained from EC5 through the feature enhancement module

The invention provides a complex road scene target detection method based on an LG-CenterNet model. Images of different targets in road scenes are collected and annotated to build a complex road scene data set. The proposed Mresneit50 serves as the backbone network for feature extraction; the feature maps of different scales extracted by the backbone network are input into the hierarchical guided attention module; features with multiple receptive fields are obtained through the Scales Encoder module; feature pixels are then restored by the deconvolution module; and a feature enhancement module built from Poly-Scale Convolution (PSConv for short) improves the information correlation of local features. Finally, the Center points prediction module predicts the position of the target center point, the size of the prediction box and the offset of the center point, and identifies the category of the target. As shown in FIG. 1, the method specifically comprises the following steps:

Step 1: Images of complex road scenes are processed, the resulting road target images containing various categories are preprocessed, and the categories and positions of the road targets in the images are annotated to construct a complex road scene data set.

The preprocessing of the road scene data set mainly normalizes the complex road scene images of differing resolutions to a uniform size of 512 × 512 pixels, and obtains target samples that are uniformly distributed in the images through batch normalization (Batch Normalization), a ReLU activation function and a max pooling operation.

Step 2: A target detection LG-CenterNet model is constructed, whose structure is shown in FIG. 2, and the model is trained on the road target data set to obtain a model S. The LG-CenterNet network mainly comprises a Backbone module, a hierarchical guided attention module (Levels guide attention, LGA for short), a Scales Encoder module, a deconvolution module, a feature enhancement module (P-Feature enhancement module, P-FEM for short) and a Center points prediction module.

(21) The LG-CenterNet model proposes a new Mresneit50 as the Backbone module. Mresneit50 is composed of several residual blocks Mblock, whose structure is shown in FIG. 3. The feature map extracted by the 4 residual blocks is denoted E1 and has 512 channels; the feature map extracted by the 6 residual blocks is denoted E2 and has 1024 channels; the feature map extracted by the 3 residual blocks is denoted E3 and has 2048 channels.

(22) The feature maps E1, E2 and E3 extracted by the Backbone are input into the hierarchical guided attention module (Levels guide attention, LGA for short), whose structure is shown in FIG. 4. Its main structure comprises two branches, a global pooling branch and a hierarchical guided branch. The feature map E1 with 512 channels is input into the global pooling branch, and EC1 is obtained through a global max pooling layer and an up-sampling layer; the feature maps E1, E2 and E3 with 512, 1024 and 2048 channels are input into the hierarchical guided branch, and EC2 is obtained through a series of average pooling and convolution operations combined with up-sampling. EC1 and EC2 are fused by element-wise addition (add) to obtain EC3, which reduces the number of computed parameters.

(23) The extracted EC3 is input into the Scales Encoder module, whose structure is shown in FIG. 5, where a series of convolution and residual module operations produces EC4.

(24) The extracted EC4 is input into the deconvolution module, which consists of 3 deconv groups. The convolution operations of each deconv group successively enlarge the feature map while reducing the number of channels, yielding a feature map of dimension 128 × 64, denoted EC5.

(25) The feature map EC5 is input into the P-FEM for convolution, yielding a feature map EC6 of size 128 × 64. The P-FEM consists of a 3 × 3 Poly-Scale Convolution (PSConv for short), batch normalization, a ReLU activation function and a Sigmoid activation function; its main purpose is to improve the correlation of local information in the feature map and to strengthen the feature map's ability to express features. The P-FEM structure is shown in FIG. 6.
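The fusion at the end of step (22) can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the LGA module itself: the up-sampling in the global pooling branch is modelled as broadcasting each channel's maximum back to the full height and width, EC2 is a stand-in array rather than the output of the hierarchical guided branch, and the function names are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # channels x height x width


def global_max_pool_branch(e1: FeatureMap) -> FeatureMap:
    """Global max pooling per channel, then up-sampling (broadcast) back to H x W."""
    out = []
    for channel in e1:
        peak = max(max(row) for row in channel)
        out.append([[peak for _ in row] for row in channel])
    return out


def fuse_add(ec1: FeatureMap, ec2: FeatureMap) -> FeatureMap:
    """Element-wise addition: the channel count is unchanged, unlike concatenation."""
    return [
        [[a + b for a, b in zip(row1, row2)] for row1, row2 in zip(ch1, ch2)]
        for ch1, ch2 in zip(ec1, ec2)
    ]


e1 = [[[1.0, 2.0], [3.0, 4.0]],
      [[0.5, 0.0], [0.0, 0.25]]]
ec2 = [[[0.0, 1.0], [1.0, 0.0]],      # stand-in for the hierarchical guided branch
       [[1.0, 1.0], [1.0, 1.0]]]
ec1 = global_max_pool_branch(e1)
ec3 = fuse_add(ec1, ec2)
```

Because addition keeps the number of channels fixed, the layers that follow see the same input width as either branch alone, whereas concatenation would double it; this is one way to read the remark that add reduces the computed parameters.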

Step 3: Using the trained model S, the Center points prediction module performs target localization, bounding-box size prediction and category prediction on road scene targets by means of heatmaps, and the obtained results are displayed on the video or image.

Using the trained model S, the Center points prediction module classifies the input image and generates from the original image a heatmap whose scale matches the size of EC6. It then calculates the heatmap loss, denoted $L_k$, the loss of the target length and width (size), denoted $L_s$, and the loss of the center point offset (offset), denoted $L_f$, to determine the position and size of the target and to generate the final heatmap carrying classification and localization. The overall network loss $L_d$ is:

$$L_d = L_k + \lambda_s L_s + \lambda_f L_f$$

Here $\lambda_s = 0.1$ and $\lambda_f = 1$. For an input image of size 512 × 512, the feature map generated by the network is H × W × C, and $L_k$, $L_s$ and $L_f$ are calculated respectively as follows:
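The individual formulas are not reproduced at this point in the text. The sketch below shows how the weighted sum could be evaluated numerically, assuming the heatmap term is the penalty-reduced focal loss of the original CenterNet framework (with exponents 2 and 4) and the size and offset terms are L1 losses averaged over the N key points. Those loss forms, the flattened list inputs and all function names are assumptions for illustration; only the weights 0.1 and 1 and the form of the total come from this document.

```python
import math
from typing import List


def focal_loss(truth: List[float], pred: List[float], num_keypoints: int,
               alpha: float = 2.0, beta: float = 4.0, eps: float = 1e-12) -> float:
    """Assumed CenterNet-style heatmap loss over flattened heatmap cells."""
    total = 0.0
    for a, a_pred in zip(truth, pred):
        a_pred = min(max(a_pred, eps), 1.0 - eps)  # keep log() finite
        if a == 1.0:
            total += (1.0 - a_pred) ** alpha * math.log(a_pred)
        else:
            total += (1.0 - a) ** beta * a_pred ** alpha * math.log(1.0 - a_pred)
    return -total / max(num_keypoints, 1)


def l1_loss(truth: List[float], pred: List[float], num_keypoints: int) -> float:
    """Assumed L1 loss for the size and offset terms, averaged over key points."""
    return sum(abs(p - t) for t, p in zip(truth, pred)) / max(num_keypoints, 1)


def total_loss(l_k: float, l_s: float, l_f: float,
               lambda_s: float = 0.1, lambda_f: float = 1.0) -> float:
    """Weighted sum L_d = L_k + lambda_s * L_s + lambda_f * L_f."""
    return l_k + lambda_s * l_s + lambda_f * l_f


l_k = focal_loss(truth=[1.0, 0.0], pred=[0.5, 0.5], num_keypoints=1)
l_s = l1_loss(truth=[2.0, 3.0], pred=[2.5, 2.0], num_keypoints=2)
l_f = l1_loss(truth=[0.25], pred=[0.5], num_keypoints=1)
l_d = total_loss(l_k, l_s, l_f)
```

The small weight on the size term keeps box dimensions, which can be tens of cells, from swamping the heatmap and offset terms, which are of order one.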

Based on the same inventive concept, the invention also provides a complex road scene target detection device based on the LG-CenterNet model, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the complex road scene target detection method based on the LG-CenterNet model. The detection effect is shown in FIG. 7.

The LG-CenterNet network is trained on the self-built complex scene data set to obtain a model capable of identifying targets in complex scenes, and model performance is verified on the validation set of the data set, as shown in FIG. 7. On the self-built complex road scene data set, the average recognition precision is 86.93% and the detection speed reaches 50 frames/s, which meets the requirements of accurate, real-time detection in road scenes.

Here Precision denotes precision, Recall denotes recall, AP denotes average precision, mAP denotes mean average precision, FPS denotes the number of frames processed per second, and t is the time taken to detect a single picture. The data set contains many sample categories (e.g., car, person), and n denotes the number of samples. TP (True Positives) is the number of positive samples that the model identifies as positive (i.e., samples that are car and that the model identifies as car); TN (True Negatives) is the number of negative samples that the model identifies as negative; FP (False Positives) is the number of negative samples that the model identifies as positive (i.e., samples that are not car but that the model identifies as car); FN (False Negatives) is the number of positive samples that the model identifies as negative.
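The metric formulas are not reproduced in this text. The sketch below uses the standard definitions, which are assumed to be the ones intended: precision is TP / (TP + FP), recall is TP / (TP + FN), mAP is the mean of the per-category AP values, and FPS is the reciprocal of the per-picture detection time t. The per-category AP values in the example are made up for illustration and are not results from this document.

```python
from typing import Dict


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true positives that the model finds."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_average_precision(ap_per_category: Dict[str, float]) -> float:
    """Mean of the per-category average precision values."""
    return sum(ap_per_category.values()) / len(ap_per_category)


def frames_per_second(t: float) -> float:
    """Detection speed given the time t (in seconds) to process one picture."""
    return 1.0 / t


p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
m = mean_average_precision({"car": 0.9, "person": 0.8})  # illustrative values only
fps = frames_per_second(0.02)
```

Read the other way round, the reported speed of 50 frames/s corresponds to a detection time of t = 0.02 s per picture.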

The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims

1. The complex road scene target detection method based on the LG-CenterNet model is characterized by comprising the following steps of:

(3) Using the trained model S, performing target localization, bounding-box size prediction and category prediction on complex road targets by means of heatmaps through the Center points prediction module, and displaying the obtained results on the video or image;

the implementation process of the step (2) is as follows:

2. The complex road scene target detection method based on the LG-CenterNet model as claimed in claim 1, wherein constructing and preprocessing the complex road scene data set in step (1) normalizes the complex road scene images of differing resolutions to a uniform size of 512 × 512 pixels, and obtains uniformly distributed feature target samples through batch normalization, a ReLU activation function and a max pooling operation.

3. The complex road scene target detection method based on the LG-CenterNet model as claimed in claim 1, wherein the implementation process of step (3) is as follows:

$$L_d = L_k + \lambda_s L_s + \lambda_f L_f$$

4. A complex road scene target detection device based on the LG-CenterNet model, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when loaded into the processor, implements the complex road scene target detection method based on the LG-CenterNet model according to any one of claims 1-3.