CN116912796A - Novel dynamic cascade YOLOv 8-based automatic driving target identification method and device - Google Patents
- Publication number
- CN116912796A (application CN202310899627.4A)
- Authority
- CN
- China
- Prior art keywords
- automatic driving
- image
- driving target
- network
- yolov8
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an automatic driving target recognition method and device based on a novel dynamic cascade YOLOv8. Pre-acquired original images of traffic vehicles are preprocessed and divided into a training set and a testing set; an automatic driving target recognition network based on the novel dynamic cascade YOLOv8 is constructed, in which the Backbone of the YOLOv8 network is replaced in its entirety by a novel dynamic cascade backbone network, and the detection head in the last part of the YOLOv8 network is replaced by a new ShareSepHead detection head with cross-scale shared convolution weights; an improved PolyLoss is adopted as the loss function of the automatic driving target recognition network; the network is trained with the training set; the test set is then input into the trained network to evaluate it. The invention can improve the accuracy and speed of target recognition in automatic driving and provide a guarantee for automatic driving safety.
Description
Technical Field
The invention belongs to the application of deep learning in the field of computer vision, and particularly relates to an automatic driving target recognition method and device based on a novel dynamic cascade YOLOv8.
Background
As one of the core problems of computer vision, target detection, which aims to find the category and position of a specific target in an image, is widely used in various fields such as automatic driving, remote sensing images, video monitoring, medical detection, and the like.
The YOLO family has evolved through successive version updates since 2016, reaching v8 to date. In 2016, the single-stage (One-Stage) target detection methods represented by YOLOv1 emerged. From the proposal of YOLOv1, the first single-stage target detection method, through 2023, the YOLO series developed alongside single-stage target detection as a whole and has remained a typical representative of One-Stage methods.
Although YOLOv8 can perform object detection quickly on simple images, it requires more time when facing complex real-world scenes, such as traffic jams with large numbers of vehicles and pedestrians. Real-time performance is critical for automatic driving decisions, so the processing speed still needs to be improved. Accuracy likewise needs improvement: automatic driving requires highly accurate detection results to respond correctly to various traffic conditions, and although YOLOv8 performs well in some situations, its detection accuracy still falls short in complex traffic scenes. The Backbone of the existing YOLOv8 is fast when processing simple images but requires more time on complex images with many targets, and the existing YOLOv8 detection head contains many parameters, making its computational complexity high. In autopilot systems, computational resources are limited, so a more efficient model design is needed to guarantee target detection in embedded or resource-constrained environments.
Disclosure of Invention
The invention aims to: provide an automatic driving target recognition method and device based on a novel dynamic cascade YOLOv8 that can accurately detect targets in automatic driving.
The technical scheme is as follows: the invention provides an automatic driving target identification method based on novel dynamic cascade YOLOv8, which specifically comprises the following steps:
(1) Preprocessing a pre-acquired original image of a traffic vehicle, and dividing the pre-acquired original image into a training set and a testing set;
(2) Constructing an automatic driving target recognition network based on the novel dynamic cascade YOLOv8; the network replaces the Backbone of the YOLOv8 network in its entirety with a novel dynamic cascade backbone network, and replaces the detection head in the last part of the YOLOv8 network with a new ShareSepHead detection head with cross-scale shared convolution weights;
(3) Adopting improved PolyLoss as a loss function of an automatic driving target recognition network;
(4) Training the automatic driving target recognition network by utilizing the training set;
(5) Inputting the test set into a trained automatic driving target recognition network, and evaluating the automatic driving target recognition network.
Further, the novel dynamic cascade backbone network in step (2) has two cascaded backbone networks, with a dynamic router inserted between them to automatically select an optimal route for each image to be detected; the image to be detected undergoes first-stage multi-scale feature extraction through the first backbone network, and the multi-scale features are sent to the dynamic router to judge the difficulty level of the image; the features are mapped to a difficulty score through two linear mapping layers; if the image is judged to be a "simple" image, the first-stage multi-scale features are sent to the head part of YOLOv8; if the image is judged to be a "difficult" image, the image to be detected and its first-stage multi-scale features are sent to a second backbone network, the second-stage multi-scale features are extracted, and these are sent to the head part of YOLOv8.
Further, the implementation process of the novel dynamic cascade backbone network in the step (2) is as follows:
for an input image x, the first backbone B1 first extracts its multi-scale features F1:
F1 = B1(x) = {F1^l}, l = 1, ..., L,
wherein L is the number of stages, namely the number of multi-scale features; the router R then uses these multi-scale features F1 to predict a difficulty score φ ∈ (0, 1) for the image:
φ = R(F1);
if the router classifies the input image as a "simple" image, the immediately following neck-head D1 outputs the detection result y:
y = D1(F1);
if the router classifies the input image as a "complex" image, the multi-scale features require further enhancement by the second backbone: the multi-scale features F1 are embedded into H through a composite connection module G:
H = G(F1),
wherein G is the DHLC connection of CBNet; the input image x is fed into the second backbone B2, whose features at each stage are enhanced in turn by element-wise summation with the corresponding embedding in H, recorded as the second-stage multi-scale features F2:
F2 = B2(x, H);
finally, the second neck-head D2 decodes the detection result:
y = D2(F2).
further, in the step (2), the shareseephead detection head shares convolution weights among different layers, and independently calculates statistics of BN; the ShareLepohead comprises a first convolution layer, a first depth separable convolution layer, a second convolution layer and a BN normalization layer which are connected in sequence.
Further, the first convolution layer is a 3×3 convolution layer that changes the number of channels of the input feature map from the input channel count to c2×2; the first depth-separable convolution layer first applies a convolution operation to each input channel separately and then combines the features among the channels; the second depth-separable convolution layer reduces the number of channels of the feature map from c2×2 to c2; the second convolution layer is a 1×1 convolution layer that changes the number of channels from c2 to 4×self.reg_max; each detection head improves gradient propagation and training speed through BN normalization, which normalizes each mini-batch of data.
Further, the improved PolyLoss of step (3) comprises a combined loss function and a weighted binary cross entropy loss; PolyLoss combines binary cross entropy loss and Focal Loss, improving the model's ability to balance difficult samples and positive and negative samples by adjusting the weight and shape of the loss function; the weighted binary cross entropy loss calculates the binary cross entropy between the prediction result and the real label, measuring how well the prediction matches the label; an alpha_factor is introduced to weight the loss, so that the losses of positive and negative samples are adjusted to different degrees in the calculation; and polynomial adjustment factors are incorporated to increase the loss on uncertain sample probability predictions.
Further, a "simple" image is an image containing a single target; a "difficult" image is an image containing two or more targets.
Based on the same inventive concept, the present invention proposes an apparatus device comprising a memory and a processor, wherein:
a memory for storing a computer program capable of running on the processor;
a processor for executing the steps of the novel dynamic cascade YOLOv 8-based automatic driving target recognition method as described above when running the computer program.
Based on the same inventive concept, the present invention proposes a storage medium having stored thereon a computer program which, when executed by at least one processor, implements the novel dynamic cascade YOLOv 8-based automatic driving target recognition method steps as described above.
The beneficial effects are that: compared with the prior art, the automatic driving target recognition network based on the novel dynamic cascade YOLOv8 constructed by the invention lets the YOLOv8 backbone adaptively select an inference route for input images of different difficulty, improving feature-extraction efficiency; to improve detection accuracy, a brand-new improved PolyLoss loss function is used, simplifying the hyper-parameter search space and adjusting the polynomial coefficients; to upgrade the YOLOv8 detection head, save parameters, raise efficiency and improve accuracy, a novel shared detection head is used to enhance model capacity and obtain higher performance; finally, target detection for automatic driving becomes more accurate.
Drawings
FIG. 1 is a schematic diagram of a dynamic cascade backbone network architecture;
FIG. 2 is a schematic diagram of a test head structure sharing convolution weights and separate batch normalization layers.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention provides an automatic driving target identification method based on novel dynamic cascade YOLOv8, which specifically comprises the following steps:
step S1: the invention selects a KITTI data set, wherein the divided data set comprises a test set and a training set. Performance assessment is performed on the autopilot dataset.
Step S2: based on the YOLOv8 network foundation, the Backbone of the backhaul is entirely replaced by a novel Dynamic Cascade (Dynamic Cascade) Backbone.
As shown in fig. 1, the novel Dynamic Cascade backbone network has two cascaded backbone networks, with a dynamic router inserted between them to automatically select an optimal route for each image to be detected.
An adaptive router: to better judge the difficulty level of an image, the router makes its judgment from the input multi-scale feature information. Given the multi-scale features output by the first backbone network, the dynamic router first compresses this information (a global pooling operation followed by channel-dimension concatenation) to reduce its computational complexity, obtaining compressed features. The compressed features are then mapped to a difficulty score by two linear mapping layers.
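A minimal sketch of this router follows. The patent specifies only global pooling, channel-dimension concatenation, and two linear mapping layers; the ReLU between the layers, the hidden width, and the toy feature shapes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def global_pool(feature):
    # feature: (C, H, W) -> (C,) channel descriptor via global average pooling
    return feature.mean(axis=(1, 2))

def router_score(features, w1, b1, w2, b2):
    """Map multi-scale features to a difficulty score in (0, 1).

    features : list of (C_l, H_l, W_l) arrays from the first backbone
    w1, b1   : first linear mapping layer
    w2, b2   : second linear mapping layer
    """
    z = np.concatenate([global_pool(f) for f in features])  # channel-dim splice
    h = np.maximum(w1 @ z + b1, 0.0)                        # linear + ReLU (assumed)
    logit = float(w2 @ h + b2)
    return 1.0 / (1.0 + np.exp(-logit))                     # sigmoid -> (0, 1)

# toy multi-scale features from a 3-stage backbone (L = 3)
feats = [rng.standard_normal((c, s, s)) for c, s in [(16, 32), (32, 16), (64, 8)]]
dim = 16 + 32 + 64
w1, b1 = rng.standard_normal((8, dim)) * 0.1, np.zeros(8)
w2, b2 = rng.standard_normal(8) * 0.1, 0.0

phi = router_score(feats, w1, b1, w2, b2)
print(0.0 < phi < 1.0)  # True
```

Pooling before the linear layers keeps the router's cost negligible relative to a backbone pass, which is what makes the early-exit scheme pay off on "simple" images.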
Two cascaded networks: for an input image x, the first backbone B1 first extracts its multi-scale features F1:
F1 = B1(x) = {F1^l}, l = 1, ..., L,
where L is the number of stages, i.e., the number of multi-scale features. The router R then uses these multi-scale features F1 to predict a difficulty score φ ∈ (0, 1) for the image:
φ = R(F1).
A "simple" image exits at the first backbone, while a "complex" image requires further processing. A "simple" image contains a single target pedestrian or a single vehicle; a "complex" image contains two or more targets of multiple classes. Specifically, if the router classifies an input image as "simple", the immediately following neck-head D1 outputs the detection result y:
y = D1(F1).
Conversely, if the router classifies the input image as "complex", the multi-scale features require further enhancement by the second backbone rather than being decoded immediately by D1. Specifically, the multi-scale features F1 are embedded into H by a composite connection module G:
H = G(F1),
where G is the DHLC connection of CBNet. The input image x is then fed into the second backbone B2, whose features are enhanced stage by stage through element-wise summation with the corresponding embeddings in H, recorded as the second-stage features F2:
F2 = B2(x, H).
Finally, the second neck-head D2 decodes the detection result:
y = D2(F2).
through the above procedure, a "simple" image will only handle one backbone, while a "complex" image will handle two backbones. Obviously, with such an architecture, a tradeoff can be made between computation (i.e., speed) and accuracy.
Step S3: based on the YOLOv8 network, the default CIoU loss function is modified into a new PolyLoss classification loss function, and the detection precision is improved.
This loss function combines the ideas of binary cross entropy loss (BCEWithLogitsLoss) and Focal Loss (FL) for object classification in object detection tasks. It comprises the following parts:
Combining the loss functions: PolyLoss combines binary cross entropy loss with Focal Loss to improve the model's ability to balance difficult samples and positive and negative samples by adjusting the weight and shape of the loss function.
Weighted binary cross entropy loss: PolyLoss first calculates the binary cross entropy loss between the predicted outcome and the true label using nn.BCEWithLogitsLoss. This part of the loss measures how well the predictions match the real labels.
Focal Loss adjustment: to handle difficult samples, PolyLoss adopts the idea of Focal Loss. By re-weighting according to the predicted probability, samples with a lower prediction probability play a larger role in the loss calculation, increasing the attention paid to difficult samples.
Loss weight adjustment: by introducing an alpha_factor, PolyLoss weights the losses. This factor is determined by the value of the real label, so that the losses of positive and negative samples are adjusted to different degrees in the calculation.
Polynomial adjustment: in the last step, PolyLoss introduces polynomial adjustment factors to account for the uncertainty of the sample probability predictions. By adjusting the shape and coefficients of the polynomial, the loss is increased when the predicted probability of the true class is low, further strengthening the attention to difficult samples.
The PolyLoss loss function thus combines the ideas of binary cross entropy loss and Focal Loss in the target detection task, and through polynomial adjustment and weight adjustment provides a loss calculation that handles difficult samples and balances positive and negative samples. This helps the model learn and handle challenging target classification tasks.
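The pieces listed above can be combined in a small scalar sketch, written here in the Poly-1 style; the hyper-parameter defaults (alpha=0.25, gamma=2.0, eps1=1.0) are illustrative assumptions, not values taken from the patent:

```python
import math

def poly1_focal_bce(logit, target, alpha=0.25, gamma=2.0, eps1=1.0):
    """Poly-1 style focal loss for a single binary prediction (a sketch).

    Combines BCE-with-logits, the focal factor (1 - pt)^gamma, the
    alpha_factor weighting, and a leading polynomial correction term
    eps1 * (1 - pt)^(gamma + 1)."""
    p = 1.0 / (1.0 + math.exp(-logit))        # sigmoid probability
    pt = p if target == 1 else 1.0 - p        # probability of the true class
    bce = -math.log(max(pt, 1e-12))           # binary cross entropy
    alpha_factor = alpha if target == 1 else 1.0 - alpha
    focal = alpha_factor * (1.0 - pt) ** gamma * bce
    poly = eps1 * alpha_factor * (1.0 - pt) ** (gamma + 1)
    return focal + poly

easy = poly1_focal_bce(logit=4.0, target=1)    # confident, correct
hard = poly1_focal_bce(logit=-4.0, target=1)   # confident, wrong
print(easy < hard)  # True: hard samples dominate the loss
```

The polynomial term grows exactly when pt is small, which is what lets the loss keep pressure on difficult samples after the focal factor has already down-weighted the easy ones.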
Step S4: based on the YOLOv8 network, the detection head in the last part of the YOLOv8 network is modified to be a novel ShareLepohead detection head which is of a cross-scale and shared convolution weight, and an automatic driving target identification network based on novel dynamic cascading YOLOv8 is formed.
The original YOLOv8 detection head is the last layer of the network and is responsible for generating the prediction results of target detection. It maps feature maps to grids of different scales depending on the size of the input image and the design of the network. Each grid cell is responsible for detecting and locating one or more targets. At each scale, the detection head outputs a set of prediction boxes, each consisting of several attributes, typically the bounding-box coordinates (center coordinates, width and height), the target class probabilities, and a confidence score for the presence of a target. These prediction boxes are post-processed by non-maximum suppression (NMS) to filter overlapping boxes and preserve the most accurate detections. The detection head typically employs a combination of convolution and fully connected layers, with different convolution kernel sizes and strides to accommodate targets of different scales. Its output typically passes through appropriate activation functions and normalization operations to ensure that the predictions fall within a proper range and to provide good interpretability and robustness.
As shown in fig. 2, the novel ShareSepHead detection head shares convolution weights across scales: the convolution weights are shared between the different layers, but the BN (BatchNorm) statistics are calculated independently. Real-time target detectors typically use separate detection heads for different feature scales to enhance model capacity for higher performance, rather than sharing one detection head across multiple scales. Here, the detection-head parameters are shared across scales while separate batch normalization (BN) layers are adopted, reducing the head's parameter count while maintaining accuracy. BN is also more efficient than other normalization layers because at inference it directly uses the statistics computed during training.
After passing through the YOLOv8 neck, the image features enter the ShareSepHead detection head to produce the prediction results. Each head comprises a first convolution layer, two depth-separable convolution layers, a second convolution layer and a BN normalization layer connected in sequence.
The first part, a Conv 3×3 convolution layer, changes the number of channels of the input feature map from the input channel count to c2×2. It helps extract features and increases the channel count to better capture target information.
The second part, a DWConv 3×3 depth-separable convolution layer, first applies a convolution operation to each input channel separately and then combines the features across channels. This helps reduce computation and improve model efficiency.
The third part, a DWConv 3×3 depth-separable convolution layer, reduces the number of channels of the feature map from c2×2 to c2. Similar to the previous step, this layer continues to reduce the channel count and extract higher-level features.
The fourth part, a Conv 1×1 convolution layer, changes the number of channels from c2 to 4×self.reg_max and is responsible for predicting the coordinate information of the bounding box.
The parameter information of the detection head is shared among the heads at each scale.
The gradient propagation and training speed of each detection head are improved through BN normalization: by normalizing each mini-batch of data, BN keeps the activation values in the network within a relatively small range, alleviating the problems of gradient vanishing and gradient explosion, promoting gradient propagation, and accelerating the training of the network.
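The key property of the head, one set of convolution weights serving every scale while each scale keeps its own BN statistics, can be sketched as follows. The class name, the 1×1 convolution stand-in for the full layer stack, and the 0.9/0.1 running-stat momentum are assumptions for illustration:

```python
import numpy as np

class SharedHead:
    """Sketch: one set of conv weights shared across scales,
    with separate BN statistics per scale (names are illustrative)."""

    def __init__(self, c_in, c_out, n_scales, rng):
        self.w = rng.standard_normal((c_out, c_in)) * 0.1   # shared 1x1 conv
        # independent BN running statistics per feature scale
        self.bn_mean = [np.zeros(c_out) for _ in range(n_scales)]
        self.bn_var = [np.ones(c_out) for _ in range(n_scales)]

    def forward(self, feature, scale_idx, eps=1e-5):
        # feature: (C_in, H, W); shared conv, scale-specific normalization
        c, h, w = feature.shape
        y = (self.w @ feature.reshape(c, -1)).reshape(-1, h, w)
        m = y.mean(axis=(1, 2))
        v = y.var(axis=(1, 2))
        # update only this scale's running statistics (momentum assumed 0.9)
        self.bn_mean[scale_idx] = 0.9 * self.bn_mean[scale_idx] + 0.1 * m
        self.bn_var[scale_idx] = 0.9 * self.bn_var[scale_idx] + 0.1 * v
        return (y - m[:, None, None]) / np.sqrt(v[:, None, None] + eps)

rng = np.random.default_rng(0)
head = SharedHead(c_in=4, c_out=8, n_scales=3, rng=rng)
y0 = head.forward(rng.standard_normal((4, 8, 8)), scale_idx=0)
y1 = head.forward(rng.standard_normal((4, 16, 16)), scale_idx=1)
# the same self.w served both scales; BN stats stayed per-scale
print(head.bn_mean[2])  # the untouched scale keeps its initial zeros
```

Sharing `self.w` is where the parameter savings come from; keeping `bn_mean`/`bn_var` per scale is what preserves accuracy, since feature statistics differ across scales.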
Step S5: and (3) training the automatic driving target recognition network based on the novel dynamic cascade YOLOv8 constructed in the step (S4) by using the divided data set. And evaluating the performance of the trained automatic driving target recognition network based on the novel dynamic cascade YOLOv8, and finally realizing target recognition in automatic driving.
Based on the same inventive concept, the present invention proposes an apparatus device comprising a memory and a processor, wherein: a memory for storing a computer program capable of running on the processor; a processor for executing the steps of the novel dynamic cascade YOLOv 8-based automatic driving target recognition method as described above when running the computer program.
Based on the same inventive concept, the invention also proposes a storage medium having stored thereon a computer program which, when executed by at least one processor, implements the novel dynamic cascade YOLOv 8-based automatic driving target recognition method steps as described above.
Thus far, the technical solution of the present invention has been described with reference to the specific embodiments shown in the drawings, but the scope of protection of the present invention is not limited to these specific embodiments. Those skilled in the art may make equivalent modifications and substitutions to the related technical features without departing from the principles of the present invention, and such modifications and substitutions fall within the scope of protection of the present invention.
Claims (9)
1. An automatic driving target recognition method based on novel dynamic cascade YOLOv8 is characterized by comprising the following steps:
(1) Preprocessing a pre-acquired original image of a traffic vehicle, and dividing the pre-acquired original image into a training set and a testing set;
(2) Constructing an automatic driving target recognition network based on the novel dynamic cascade YOLOv8; the network replaces the Backbone of the YOLOv8 network in its entirety with a novel dynamic cascade backbone network, and replaces the detection head in the last part of the YOLOv8 network with a new ShareSepHead detection head with cross-scale shared convolution weights;
(3) Adopting improved PolyLoss as a loss function of an automatic driving target recognition network;
(4) Training the automatic driving target recognition network by utilizing the training set;
(5) Inputting the test set into a trained automatic driving target recognition network, and evaluating the automatic driving target recognition network.
2. The novel dynamic cascade YOLOv8-based automatic driving target recognition method according to claim 1, wherein the novel dynamic cascade backbone network in step (2) has two cascaded backbone networks, with a dynamic router inserted between them to automatically select an optimal route for each image to be detected; the image to be detected undergoes first-stage multi-scale feature extraction through the first backbone network, and the multi-scale features are sent to the dynamic router to judge the difficulty level of the image; the features are mapped to a difficulty score through two linear mapping layers; if the image is judged to be a "simple" image, the first-stage multi-scale features are sent to the head part of YOLOv8; if the image is judged to be a "difficult" image, the image to be detected and its first-stage multi-scale features are sent to a second backbone network, the second-stage multi-scale features are extracted, and these are sent to the head part of YOLOv8.
3. The method for automatically identifying a driving target based on novel dynamic cascade YOLOv8 of claim 1, wherein the implementation process of the novel dynamic cascade backbone network in step (2) is as follows:
for the input image x, the first backbone B1 first extracts its multi-scale features F1:

F1 = B1(x) = {F1^l | l = 1, ..., L}

wherein L is the number of stages, i.e., the number of multi-scale features; the router R then uses these multi-scale features F1 to predict a difficulty score φ ∈ (0, 1) for the image:

φ = R(F1)

if the router classifies the input image as a "simple" image, the immediately following neck-head D1 outputs the detection result y as:

y = D1(F1)

if the router classifies the input image as a "difficult" image, the multi-scale features require further enhancement by the second backbone; the multi-scale features F1 are first embedded into H through a composite connection module G:

H = G(F1)

wherein G is the DHLC connection implemented by CBNet; the input image x is fed into the second backbone B2, whose features at each stage are enhanced in turn by element-wise summation with the corresponding embedding in H, and the resulting second-stage multi-scale features are denoted as:

F2 = B2(x, H)

the second neck-head D2 decodes the detection result as:

y = D2(F2).
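The router R of claim 3 maps pooled multi-scale features to a difficulty score φ ∈ (0, 1) through two linear mapping layers. A minimal sketch, assuming a ReLU between the two layers and a sigmoid at the output (the activation choices and all function names here are illustrative assumptions, not specified by the patent):

```python
import math

def linear(x, w, b):
    """Plain linear layer: y = W x + b, with W given row-wise."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def router_score(pooled, w1, b1, w2, b2):
    """Two linear mapping layers followed by a sigmoid, producing
    the difficulty score in (0, 1) used to pick the route."""
    hidden = [max(0.0, v) for v in linear(pooled, w1, b1)]  # ReLU
    logit = linear(hidden, w2, b2)[0]                       # scalar logit
    return 1.0 / (1.0 + math.exp(-logit))                   # sigmoid
```

Comparing this score against a threshold then decides between the D1 (early exit) and B2/D2 (refinement) branches.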
4. The automatic driving target recognition method based on the novel dynamic cascade YOLOv8 according to claim 1, wherein the ShareSepHead detection head in step (2) shares convolution weights among different layers while calculating BN statistics independently; the ShareSepHead comprises a first convolution layer, a first depth separable convolution layer, a second depth separable convolution layer, a second convolution layer and a BN normalization layer which are connected in sequence.
5. The automatic driving target identification method based on the novel dynamic cascade YOLOv8 according to claim 4, wherein the first convolution layer is a 3×3 convolution layer that changes the number of channels of the input feature map to c2 × 2; the first depth separable convolution layer first applies a convolution operation to each input channel separately, and then combines the features across channels; the second depth separable convolution layer reduces the number of channels of the input feature map from c2 × 2 to c2; the second convolution layer is a 1×1 convolution layer that changes the number of channels of the input feature map from c2 to 4 × self.reg_max; each detection head improves gradient propagation and training speed through BN normalization, which normalizes each mini-batch of data.
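The key idea of claims 4–5 — one set of convolution weights shared across pyramid levels, but BatchNorm statistics kept per level — can be illustrated with a toy one-dimensional stand-in. The class name, the scalar "convolution", and the stats layout below are all hypothetical simplifications, not the patent's code:

```python
class SharedHeadSketch:
    """Toy head: a single shared weight plays the role of the shared
    convolution kernels, while each pyramid level keeps its own
    BatchNorm statistics (mean/variance)."""

    def __init__(self, num_levels, eps=1e-5):
        self.shared_weight = 0.5                        # one weight, all levels
        self.bn_stats = [{"mean": 0.0, "var": 1.0}      # independent BN stats
                         for _ in range(num_levels)]
        self.eps = eps

    def forward(self, x, level):
        y = [self.shared_weight * v for v in x]         # shared "convolution"
        mean = sum(y) / len(y)
        var = sum((v - mean) ** 2 for v in y) / len(y)
        self.bn_stats[level] = {"mean": mean, "var": var}  # per-level stats
        return [(v - mean) / (var + self.eps) ** 0.5 for v in y]
```

Sharing the weights cuts the head's parameter count across levels, while per-level BN compensates for the different feature statistics at each scale.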
6. The automatic driving target recognition method based on the novel dynamic cascade YOLOv8 according to claim 1, wherein the modified PolyLoss in step (3) comprises a combined loss function and a weighted binary cross-entropy loss; PolyLoss combines the binary cross-entropy loss and the Focal Loss, and improves the handling of hard samples and the balance between positive and negative samples by adjusting the weight and shape of the loss function; the weighted binary cross-entropy loss calculates the binary cross-entropy between the prediction result and the ground-truth label, measuring the degree of match between them; an alpha_factor is introduced to weight the loss, so that the losses of positive and negative samples are adjusted to different degrees in the calculation; a polynomial adjustment factor is incorporated to modulate the loss according to the uncertainty of the sample probability predictions.
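A minimal per-sample sketch of a Poly-1-style focal BCE loss in the spirit of claim 6: alpha-weighted focal binary cross-entropy plus a leading polynomial correction term epsilon · (1 − pt)^(γ+1). The hyperparameter values and the function name are illustrative assumptions; the patent does not disclose its exact coefficients.

```python
import math

def poly1_focal_bce(p, target, alpha=0.25, gamma=2.0, epsilon=1.0):
    """Poly-1 style focal loss for one prediction p against a
    binary target (assumed form; values are illustrative)."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)                  # numerical safety
    bce = -(target * math.log(p) + (1 - target) * math.log(1 - p))
    pt = p if target == 1 else 1.0 - p                 # prob. of true class
    alpha_factor = alpha if target == 1 else 1.0 - alpha  # pos/neg weighting
    focal = alpha_factor * (1.0 - pt) ** gamma * bce
    poly1 = epsilon * alpha_factor * (1.0 - pt) ** (gamma + 1)
    return focal + poly1
```

The (1 − pt) powers down-weight easy, confident samples, while the added polynomial term sharpens the penalty on uncertain predictions relative to plain Focal Loss.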
7. The automatic driving target identification method based on the novel dynamic cascade YOLOv8 according to claim 2, wherein a "simple" image is an image containing a single target, and a "difficult" image is an image containing two or more targets.
8. An apparatus comprising a memory and a processor, wherein:
the memory is configured to store a computer program capable of running on the processor;
the processor is configured to execute, when running the computer program, the steps of the automatic driving target recognition method based on the novel dynamic cascade YOLOv8 according to any one of claims 1-7.
9. A storage medium having stored thereon a computer program which, when executed by at least one processor, implements the steps of the automatic driving target recognition method based on the novel dynamic cascade YOLOv8 according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310899627.4A CN116912796A (en) | 2023-07-21 | 2023-07-21 | Novel dynamic cascade YOLOv 8-based automatic driving target identification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310899627.4A CN116912796A (en) | 2023-07-21 | 2023-07-21 | Novel dynamic cascade YOLOv 8-based automatic driving target identification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116912796A true CN116912796A (en) | 2023-10-20 |
Family
ID=88350767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310899627.4A Pending CN116912796A (en) | 2023-07-21 | 2023-07-21 | Novel dynamic cascade YOLOv 8-based automatic driving target identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116912796A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117252904A (en) * | 2023-11-15 | 2023-12-19 | 南昌工程学院 | Target tracking method and system based on long-range space perception and channel enhancement |
CN117252904B (en) * | 2023-11-15 | 2024-02-09 | 南昌工程学院 | Target tracking method and system based on long-range space perception and channel enhancement |
CN117496362A (en) * | 2024-01-02 | 2024-02-02 | 环天智慧科技股份有限公司 | Land coverage change detection method based on self-adaptive convolution kernel and cascade detection head |
CN117496362B (en) * | 2024-01-02 | 2024-03-29 | 环天智慧科技股份有限公司 | Land coverage change detection method based on self-adaptive convolution kernel and cascade detection head |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111401201A (en) | Aerial image multi-scale target detection method based on spatial pyramid attention drive | |
CN111126472A (en) | Improved target detection method based on SSD | |
CN111259930A (en) | General target detection method of self-adaptive attention guidance mechanism | |
CN116912796A (en) | Novel dynamic cascade YOLOv 8-based automatic driving target identification method and device | |
CN111428733B (en) | Zero sample target detection method and system based on semantic feature space conversion | |
CN111507370A (en) | Method and device for obtaining sample image of inspection label in automatic labeling image | |
CN112150493A (en) | Semantic guidance-based screen area detection method in natural scene | |
CN112949572A (en) | Slim-YOLOv 3-based mask wearing condition detection method | |
CN110942471A (en) | Long-term target tracking method based on space-time constraint | |
CN110008899B (en) | Method for extracting and classifying candidate targets of visible light remote sensing image | |
CN111753682A (en) | Hoisting area dynamic monitoring method based on target detection algorithm | |
CN110084284A (en) | Target detection and secondary classification algorithm and device based on region convolutional neural networks | |
CN114565842A (en) | Unmanned aerial vehicle real-time target detection method and system based on Nvidia Jetson embedded hardware | |
CN110197213B (en) | Image matching method, device and equipment based on neural network | |
Fan et al. | A novel sonar target detection and classification algorithm | |
US20220147748A1 (en) | Efficient object detection using deep learning techniques | |
CN114139564A (en) | Two-dimensional code detection method and device, terminal equipment and training method for detection network | |
CN116342894B (en) | GIS infrared feature recognition system and method based on improved YOLOv5 | |
CN113095072A (en) | Text processing method and device | |
CN115861595A (en) | Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning | |
CN115984671A (en) | Model online updating method and device, electronic equipment and readable storage medium | |
CN114927236A (en) | Detection method and system for multiple target images | |
CN114998611A (en) | Target contour detection method based on structure fusion | |
CN115063831A (en) | High-performance pedestrian retrieval and re-identification method and device | |
CN111160219B (en) | Object integrity evaluation method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||