CN116012686A - Improved YOLOv6 target detection method introducing dynamic position loss


Info

Publication number
CN116012686A
Authority
CN
China
Prior art keywords
loss
target detection
angle
yolov6
features
Prior art date
Legal status
Pending
Application number
CN202310072259.6A
Other languages
Chinese (zh)
Inventor
孙俊
肇启明
Current Assignee
Uni Entropy Intelligent Technology Wuxi Co ltd
Original Assignee
Uni Entropy Intelligent Technology Wuxi Co ltd
Priority date
Filing date
Publication date
Application filed by Uni Entropy Intelligent Technology Wuxi Co ltd filed Critical Uni Entropy Intelligent Technology Wuxi Co ltd
Priority to CN202310072259.6A
Publication of CN116012686A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an improved YOLOv6 target detection method that introduces a dynamic position loss, and belongs to the field of target detection. Target detection is one of the most active directions in computer vision. To further improve the performance of target detection algorithms and overcome the limitations of the position loss function during training, a dynamic intersection-over-union (IoU) loss function is proposed; it fully considers the relations among the components of the position loss and dynamically assigns them different weights at different stages of training, so that the network is constrained in a more targeted way. Secondly, to address the shortcomings of the feature-fusion stage of the detection network, deformable convolution is applied to the PAN structure, and a plug-and-play DePAN Neck based on deformable convolution is designed to improve the model's ability to fuse multi-scale features.

Description

Improved YOLOv6 target detection method introducing dynamic position loss
Technical Field
The invention belongs to the field of target detection, and particularly relates to an improved YOLOv6 target detection method introducing dynamic position loss.
Background
The object detection task is to find the objects of interest contained in an image, including their positions in the image and the categories to which they belong. Other directions of computer vision, such as object tracking and pedestrian re-identification, take object detection as a pre-task, so object detection is one of the hottest directions in both academia and industry. With the rapid development of deep learning, target detection has shifted from traditional digital image processing algorithms to deep-learning-based detection algorithms, which can currently be divided into two types. The first is the Two-Stage detection method represented by the Fast R-CNN (Fast regions with convolutional neural network features) series, which divides the detection task into two stages: the first stage produces a set of candidate regions through a proposal network, and the second stage classifies the candidate regions with a convolutional neural network (CNN) and performs bounding-box regression to obtain the final result. The second type is the One-Stage detection method represented by the YOLO (You Only Look Once) series, which directly regresses the bounding boxes and classes after extracting image features, achieving nearly end-to-end detection; it is fast but initially had low accuracy. With continuous improvement of the algorithms, current One-Stage detectors are faster than Two-Stage ones while maintaining high accuracy.
Although a wide variety of target detection algorithms have been proposed, almost all One-Stage detection networks can be abstractly regarded as consisting of three parts: Backbone, Neck and Head. As shown in fig. 1, the Backbone extracts features of different scales from the image, the Neck performs operations such as feature fusion on the multi-scale features extracted by the Backbone, and the Head makes predictions using the features fused by the Neck, finally outputting the bounding-box positions of the objects in the image and the categories to which they belong.
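The following minimal PyTorch-style sketch (illustrative only; the module names and composition are assumptions, not the implementation of the invention) shows how the Backbone, Neck and Head of a One-Stage detector fit together:

```python
import torch
import torch.nn as nn

class OneStageDetector(nn.Module):
    """Abstract One-Stage detector: Backbone -> Neck -> Head."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # extracts features at several scales, e.g. strides 8/16/32
        self.neck = neck          # fuses the multi-scale features (e.g. FPN, PAN, DePAN)
        self.head = head          # predicts class scores and bounding-box regressions

    def forward(self, images: torch.Tensor):
        multi_scale_feats = self.backbone(images)   # list of feature maps
        fused_feats = self.neck(multi_scale_feats)  # fused feature maps, one per scale
        return self.head(fused_feats)               # per-scale detection outputs
```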
The large difference in object sizes within images has long been a challenge in target detection. Early One-Stage detectors such as YOLOv1 used only a single-scale feature map for detection (fig. 2(a)), resulting in poor performance on small targets; YOLOv3 later adopted the FPN (feature pyramid network) (fig. 2(b)) as its Neck to fuse multi-scale features, which greatly improved detection. Various Necks were subsequently proposed, for example SPP (spatial pyramid pooling) and PAN (path aggregation network), and it can be seen that the Neck plays a key role in a detection network.
The loss function for object detection is typically composed of two parts: a category loss and a position loss. The YOLO series additionally requires a confidence loss. The category loss measures the error between the predicted class and the actual class, and the position loss measures the error between the predicted bounding box and the actual bounding box.
The accuracy of One-Stage algorithms is unsatisfactory, on one hand because the designed Necks have limited feature-fusion capability, leading to poor detection of small targets. On the other hand, the existing position loss functions are deficient: they do not consider the relations among their components, so the network optimization process is not accurate enough.
Disclosure of Invention
Aiming at these two deficiencies of the current target detection task, the invention first proposes a dynamic intersection-over-union loss function based on the IoU, which dynamically assigns different weights to the components of the position loss according to the positional relationship between the real bounding box and the predicted bounding box, so that the network can optimize the bounding-box position more purposefully at different stages. Secondly, the invention applies deformable convolution to the Neck: a plug-and-play Neck combining deformable convolution with the PAN structure is designed to perform feature fusion and improve the network's ability to detect small targets.
In the experimental part, the invention uses YOLOv6 as the baseline network. To evaluate the performance of the method, the invention was trained and tested on the same COCO2017 dataset as the original paper. The main contributions of the invention are as follows:
(1) A position loss function and a Neck for target detection are designed, which can be applied to various target detection networks in a plug-and-play manner.
(2) The method is applied to YOLOv6 models of three sizes to improve their performance.
(3) The validity of the proposed method of the present invention was verified on the COCO2017 dataset.
The technical scheme of the invention is as follows:
an improved YOLOv6 target detection method introducing dynamic position loss comprises the following steps:
(1) Performing Mosaic data enhancement on the data, randomly cutting and scaling 4 pictures, and then randomly arranging and splicing the pictures to form one picture;
(2) Constructing a DePAN Neck based on deformable convolution, and carrying out multi-scale feature fusion;
(3) The DePAN Neck is used in the YOLOv6 to replace the original Neck part;
(4) Calculating the position loss between the predicted bounding-box coordinates and the label bounding-box coordinates, using the dynamic IoU loss as the position loss function;
(5) Calculating the class Loss of the model output and the label by using the VariFocal Loss as a class Loss function, and combining the position Loss as the final total Loss;
(6) Setting the learning rate with a cosine warmup strategy, i.e. warming up the first n epochs with a smaller learning rate and, after the n-th epoch, adjusting the learning rate from its preset value according to a cosine annealing strategy (see the scheduler sketch after this list);
(7) Back-propagating the loss with an SGD optimizer with momentum and updating the model parameters to train the model;
(8) Performing target detection with the model trained in step (7).
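As an illustration of step (6), the following sketch implements a warmup-plus-cosine-annealing schedule with a standard PyTorch LambdaLR. The warmup length, momentum and minimum-learning-rate ratio below are placeholder values, not settings taken from the invention; the base learning rate of 0.005 and 400 epochs follow the experimental section.

```python
import math
import torch

def build_warmup_cosine_scheduler(optimizer, warmup_epochs: int, total_epochs: int,
                                  min_lr_ratio: float = 0.05):
    """Linear warmup for the first `warmup_epochs`, cosine annealing afterwards."""
    def lr_lambda(epoch: int) -> float:
        if epoch < warmup_epochs:
            # ramp up from a small fraction of the base LR to the full base LR
            return (epoch + 1) / warmup_epochs
        # cosine decay from 1.0 down to `min_lr_ratio`
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return min_lr_ratio + (1 - min_lr_ratio) * 0.5 * (1 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage with the SGD-with-momentum optimizer of step (7); the momentum value and the
# warmup length are illustrative placeholders.
model = torch.nn.Linear(10, 10)  # stands in for the detection model
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
scheduler = build_warmup_cosine_scheduler(optimizer, warmup_epochs=3, total_epochs=400)
for epoch in range(400):
    # ... one training epoch: forward, loss, loss.backward(), optimizer.step() ...
    scheduler.step()
```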
The DePAN Neck in step (2) receives the features of different scales extracted by the Backbone. For a high-level feature, a 1×1 convolution is applied, the result is up-sampled by a transposed convolution and concatenated with the adjacent low-level feature in the channel dimension, and the features are then fused by a DeConvBlock; for a low-level feature, it is down-sampled by a 3×3 convolution, concatenated with the adjacent high-level feature in the channel dimension, and the concatenated features are fused by a DeConvBlock, finally yielding the fused features that are fed to the Head for classification and regression prediction; the DeConvBlock structure is formed by repeatedly stacking N DeConv Cells.
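The exact layout of a DeConv Cell is given only in Fig. 10. The following is a minimal sketch assuming that a DeConv Cell consists of an offset-predicting ordinary convolution, a deformable convolution (torchvision.ops.DeformConv2d), batch normalization and an activation, and that a DeConvBlock simply stacks N such cells; all layer choices and channel counts here are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeConvCell(nn.Module):
    """Assumed DeConv Cell: offset-predicting conv + deformable conv + BN + SiLU."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # 2 offsets (dx, dy) per kernel sampling location
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)  # learned sampling-point offsets
        return self.act(self.bn(self.deform_conv(x, offsets)))

class DeConvBlock(nn.Module):
    """DeConvBlock: N DeConv Cells stacked repeatedly, as described in the text."""
    def __init__(self, in_ch: int, out_ch: int, n: int = 2):
        super().__init__()
        cells = [DeConvCell(in_ch, out_ch)] + [DeConvCell(out_ch, out_ch) for _ in range(n - 1)]
        self.block = nn.Sequential(*cells)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)
```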
The loss function of the dynamic IoU loss in step (4) consists of 4 parts: IoU loss, distance loss, angle loss and shape loss, specifically as follows:
(1) Loss of angle: optimizing the condition that the angle loss is not subdivided, and optimizing the angle loss to pi/4 no matter whether the included angle alpha is smaller than pi/4 or not, namely restraining the prediction boundary frame to approach the real boundary frame along the direction of the included angle pi/4;
the angular loss Λ is defined as:
[The equations defining the angle loss Λ are provided as images in the original and are not reproduced here.]
where b_x and b_y denote the horizontal and vertical coordinates of a bounding-box center point, gt denotes the real bounding box, and c_w and c_h denote the horizontal and vertical distances between the center point of the real bounding box and the center point of the predicted bounding box;
(2) The distance loss Δ and shape loss Ω are defined as:
[The equations defining the distance loss Δ and the shape loss Ω are provided as images in the original and are not reproduced here.]
where ρ in the distance loss denotes the distance between the center points of the predicted and real bounding boxes, and c denotes the diagonal length of their smallest enclosing box; w, h and w_gt, h_gt in the shape loss denote the width and height of the predicted bounding box and of the real bounding box, respectively; θ is used to control the global weight of the shape loss;
(3) DYIoU Loss is defined as:
L_dyiou = 1 - IoU + λ(Δ + γΛ + υΩ)   (20)
γ = 1 - e^(-Δ)   (21)
[Equation (22), defining υ as a function of the distance loss Δ, is provided as an image in the original.]
where λ is a global parameter used to balance the contribution of the IoU term against the three losses of distance, angle and shape; it remains unchanged throughout the training task. The angle loss and the shape loss are given weights γ and υ, respectively, which respectively decrease and increase as the distance loss Δ decreases, i.e. during network training the contribution of the angle loss is dynamically reduced while the contribution of the shape loss is increased.
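The following sketch assembles the dynamic IoU loss of Equation (20) for boxes in (x1, y1, x2, y2) format. The exact angle-loss equations and the definition of υ are provided only as images in the original, so the forms used below are assumptions: a DIoU-style distance term, an SIoU-style shape term with θ = 4, an angle term that is minimal when the center-line angle equals π/4, and exponential dynamic weights γ = 1 - e^(-Δ) and υ = e^(-Δ); λ = 0.5 follows the text.

```python
import math
import torch

def dyiou_loss(pred, target, lam=0.5, theta=4.0, eps=1e-7):
    """Sketch of the dynamic IoU loss (Eq. 20) under assumed component forms.
    `pred` and `target` are (..., 4) tensors in (x1, y1, x2, y2) format."""
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)

    # IoU term
    inter_w = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    inter_h = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = inter_w * inter_h
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    iou = inter / (area_p + area_t - inter + eps)

    # center offsets c_w, c_h, center distance rho and enclosing-box diagonal c
    pcx, pcy = (px1 + px2) / 2, (py1 + py2) / 2
    tcx, tcy = (tx1 + tx2) / 2, (ty1 + ty2) / 2
    cw_off, ch_off = (tcx - pcx).abs(), (tcy - pcy).abs()
    rho2 = cw_off ** 2 + ch_off ** 2
    enc_w = torch.max(px2, tx2) - torch.min(px1, tx1)
    enc_h = torch.max(py2, ty2) - torch.min(py1, ty1)
    c2 = enc_w ** 2 + enc_h ** 2 + eps
    distance = rho2 / c2                               # assumed DIoU-style distance loss

    # angle loss: minimal when the center-line angle alpha equals pi/4 (assumed form)
    alpha = torch.atan2(ch_off, cw_off + eps)
    angle = 2 * torch.sin(alpha - math.pi / 4).pow(2)  # 0 at pi/4, 1 at 0 or pi/2

    # shape loss: SIoU-style, theta = 4 as stated in the text
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1
    omega_w = (pw - tw).abs() / torch.max(pw, tw).clamp(min=eps)
    omega_h = (ph - th).abs() / torch.max(ph, th).clamp(min=eps)
    shape = (1 - torch.exp(-omega_w)).pow(theta) + (1 - torch.exp(-omega_h)).pow(theta)

    # dynamic weights: gamma shrinks and upsilon grows as the distance loss shrinks
    gamma = 1 - torch.exp(-distance)   # assumed form of Eq. (21)
    upsilon = torch.exp(-distance)     # assumed form of Eq. (22)

    return 1 - iou + lam * (distance + gamma * angle + upsilon * shape)
```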
The invention has the following beneficial effects. To address the limited feature-fusion capability during multi-scale prediction, the invention designs a new Neck based on deformable convolution to perform feature fusion; it adaptively adjusts the positions of the sampling points, enlarges the receptive field and effectively improves feature-fusion capability. Secondly, to address the limitation of the position loss function in target detection algorithms, the invention proposes a dynamic IoU loss function that dynamically adjusts the contributions of the different losses at different stages of training, which better matches the training logic of target detection and constrains the network in a targeted way. The YOLOv6 model improved on this basis achieves notably higher accuracy, and the methods can be used in a plug-and-play manner and conveniently applied to the training of other target detection models. Regarding the dynamic IoU loss function, the invention first puts forward the idea of a dynamic loss by analysing the bounding-box characteristics over the whole flow of the target detection task, but the specific dynamic weights, namely γ and υ in Equation (20), still leave room for improvement in follow-up designs.
Drawings
FIG. 1 One-Stage detector abstract structure.
The different scale detection method of fig. 2, wherein (a) is single scale feature detection and (b) is multi-scale fusion feature detection.
Fig. 3 Angle loss in SIoU Loss.
Fig. 4 trains the initial bounding box state.
Fig. 5 trains middle and late bounding box states.
FIG. 6 Angle loss in DYIoU Loss.
Fig. 7 Distance loss and shape loss in DYIoU Loss.
Fig. 8 Sampling patterns of ordinary convolution and deformable convolution, where (a) is ordinary convolution and (b) is deformable convolution.
Fig. 9 DePAN Neck structure.
Fig. 10 DeConv Cell structure.
FIG. 11 is a comparison of the results of YOLOv6-N reasoning with the results of improved YOLOv6-N reasoning.
FIG. 12 (a) shows the position loss during training, FIG. 12 (b) shows the mAP_0.5:0.95 curve, and FIG. 12 (c) shows the mAP_0.5 curve.
Detailed Description
1.1 target detection
The object detection task is to detect the positions and classes of objects in an image. Traditional target detection mainly extracts hand-crafted features of the target region based on digital image processing, such as the histogram of oriented gradients (HOG) and the scale-invariant feature transform (SIFT), and then classifies the region with a classifier such as a support vector machine (SVM); this suffers from low accuracy and poor robustness. In recent years, with the rapid development of deep learning, detection methods based on deep learning have far surpassed traditional target detection in both speed and accuracy and have achieved great success. According to the detection procedure, algorithms can generally be divided into two categories: two-stage methods, which first extract object regions and then classify and identify them, and single-stage methods, which directly regress the target position coordinates and categories from the image. A representative of the two-stage detection algorithms is the Fast R-CNN series. The R-CNN network was the first to apply deep learning to the target detection task: it first generates a large number of proposed regions on the image using a selective search algorithm, which corresponds to the first of the two stages, then extracts the features of each proposed region with a CNN, and finally trains an SVM classifier for each class to predict the class of the proposed region, which corresponds to the second stage. Since R-CNN feeds each proposed region into the CNN separately to extract features, it is extremely slow. The later Fast-RCNN extracts features with the CNN only once to obtain a feature map of the image; the proposed regions of the original image are mapped directly onto this feature map without re-extracting features, saving a large amount of computation, and the final regression and classification are also placed inside the CNN, so the speed is significantly better than R-CNN. Faster-RCNN then successfully fused region proposal, feature extraction, regression and classification into a single model, greatly improving both speed and accuracy. The advantage of two-stage detection algorithms is high accuracy, but because of the inherent limitation of the two-stage structure, their speed is far lower than that of single-stage detection.
The single-stage detection algorithms discard the heavy region-proposal stage and perform end-to-end detection; the representative algorithms are the YOLO series. YOLOv1 uses a CNN to extract features from the original image and directly regresses and classifies the bounding boxes, so its speed is greatly improved over the two-stage methods, but its accuracy is not ideal. YOLOv2 introduced the anchor-box mechanism: instead of blindly regressing positions, it computes offsets from preset anchor-box sizes to the real bounding boxes, which greatly improves convergence speed and detection accuracy. YOLOv3 is one of the most widely used models in industry; it adds the FPN as its Neck and fuses multi-scale features for prediction, further improving the speed and accuracy of the model. Recently, the Meituan team designed the more efficient EfficientRep Backbone, Rep-PAN Neck and Efficient Decoupled Head based on the idea of RepVGG, used the new SIoU Loss as the position loss function, and proposed the YOLOv6 algorithm, which exceeds other algorithms of comparable size in both speed and accuracy.
1.2 loss function
The loss function of the object detection task is typically composed of two parts, namely a category loss and a position loss. The YOLO series additionally includes a confidence error, where the confidence is the probability that an object is contained in the bounding box predicted by the model.
Common category losses are Cross Entropy Loss, Focal Loss, etc. For the position loss, YOLOv1 used the mean squared error of the center-point coordinates and of the width and height between the predicted box and the real box; because the constraining ability of the mean-squared-error loss is limited and it is sensitive to the scale of the bounding box, model accuracy was not high. Current position losses are mostly developed based on the intersection over union (IoU), such as IoU Loss (intersection over union loss), GIoU Loss (generalized IoU loss), DIoU Loss (distance IoU loss) and CIoU Loss (complete IoU loss). IoU is the ratio of the intersection to the union of two bounding boxes. IoU Loss is defined as:
L_iou = 1 - IoU   (1)
however, the biggest disadvantage of IoULoss is that when two bounding boxes do not intersect, no matter how far apart the two bounding boxes are lost, the distance between the two bounding boxes cannot be measured. To solve this problem, GIoU Loss was proposed. GIoULoss is defined as:
L_giou = 1 - IoU + |C - (A ∪ B)| / |C|   (2)
where C denotes the area of the smallest enclosing rectangle of the predicted box A and the real box B, which solves the problem that arises when the bounding boxes do not overlap. However, when one bounding box contains the other, GIoU Loss degenerates to IoU Loss, and DIoU Loss was proposed to solve this problem. DIoU Loss is defined as:
L_diou = 1 - IoU + ρ²(b, b_gt) / c²   (3)
where ρ(b, b_gt) denotes the distance between the center points of the two bounding boxes, and c denotes the diagonal length of the smallest enclosing rectangular box.
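To make the relationship between these losses concrete, the following short sketch (illustrative, not taken from any particular library) computes the IoU, GIoU and DIoU losses of Equations (1)-(3) for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
import torch

def box_losses(pred, target, eps=1e-7):
    """Return (IoU loss, GIoU loss, DIoU loss) for boxes in (x1, y1, x2, y2) format."""
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)
    inter = ((torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0) *
             (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0))
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    # smallest enclosing box C
    cx1, cy1 = torch.min(px1, tx1), torch.min(py1, ty1)
    cx2, cy2 = torch.max(px2, tx2), torch.max(py2, ty2)
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / (c_area + eps)   # GIoU, Eq. (2): L_giou = 1 - GIoU

    # normalised center-point distance, Eq. (3)
    rho2 = (((px1 + px2) - (tx1 + tx2)) ** 2 + ((py1 + py2) - (ty1 + ty2)) ** 2) / 4
    c_diag2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + eps
    diou = iou - rho2 / c_diag2

    return 1 - iou, 1 - giou, 1 - diou
```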
The authors of CIoU considered that a good position loss function should take three factors into account, namely IoU, center-point distance and bounding-box aspect ratio, and therefore proposed CIoU Loss, defined as:
L_ciou = 1 - IoU + ρ²(b, b_gt) / c² + αv   (4)
v = (4 / π²) · (arctan(w_gt / h_gt) - arctan(w / h))²   (5)
α = v / ((1 - IoU) + v)   (6)
where w, h and w_gt, h_gt are the width and height of the predicted bounding box and of the real bounding box, respectively. It can be seen that CIoU Loss introduces the aspect ratio on the basis of DIoU Loss, making the loss function more complete. YOLOv6 uses the recent SIoU Loss (SCYLLA-IoU Loss) as its loss function; as shown in fig. 3, SIoU Loss further introduces the angle information between the two bounding boxes. The authors argue that the loss function should take the angle α between the bounding boxes into account: when α is smaller than π/4, α is constrained toward 0, i.e. the center point of the predicted bounding box is pulled toward the X axis through the center point of the real box; when α is larger than π/4, β is minimized, i.e. the center point is pulled toward the Y axis.
The angular loss Λ is defined as:
Λ = 1 - 2·sin²(arcsin(x) - π/4)   (7)
x = c_h / σ = sin(α)   (8)
the SIoU distance loss is defined as:
Δ = Σ_{t=x,y} (1 - e^(-γ·ρ_t)),  γ = 2 - Λ   (9)
ρ_x = ((b_x^gt - b_x) / c_w)²,  ρ_y = ((b_y^gt - b_y) / c_h)²   (10)
where c_w, c_h, W and H are shown in fig. 3, and σ is the distance between the center points of the two boxes. It can be seen that the angle loss in SIoU ultimately appears as a component of the distance loss. Together with the shape loss Ω, which measures the shape difference, the final SIoU Loss is defined as:
L_siou = 1 - IoU + (Δ + Ω) / 2   (11)
however, the latest SIoU Loss still has some problems. First, for angles less than pi/4, optimizing to the X-axis and vice versa to the Y-axis, this inconsistent optimization behavior is detrimental to the training of the network. Secondly, the authors merge the angle loss into the distance loss, and the constraint effect of the angle loss on the network cannot be clearly reflected. The last is also a common problem of the series IoU loss function, namely that the inherent relation among the components inside the loss function is not fully considered, and the network cannot be purposefully optimized in different stages of training, which is the problem to be solved by the invention.
2.1 dynamic IoU loss (DYIoU Loss)
Observing the whole training process of the target detection task, it is easy to see that at the initial stage of training, as shown in fig. 4, the real bounding box and the predicted bounding box are usually far apart; even if the widths and heights of the two boxes were exactly equal, the prediction would still be poor. At this time the network should pay more attention to the angle loss, because constraining the angle brings the center points of the two boxes closer more quickly. Reflected in the loss function, this means increasing the weight of the angle loss and appropriately decreasing the weight of the shape loss. By the middle and late stages of training, as shown in fig. 5, the bounding boxes are relatively close, and the network is able to optimize the center-point distance even without constraining the angle; paying too much attention to the angle loss at this point would instead disturb training. The shape now becomes the main factor affecting the detection result, so the network should pay more attention to the shape loss. Reflected in the loss function, this means reducing the weight of the angle loss and increasing the weight of the shape loss.
Based on this idea, the dynamic IoU loss function (dynamic intersection over union loss, DYIoU Loss), built on IoU Loss, is designed to dynamically assign different weights to the components of the position loss function at different training stages. The loss function consists of 4 parts: IoU loss, distance loss, angle loss and shape loss.
(1) Loss of angle: and optimizing the condition that the angle loss is not subdivided, and optimizing the angle loss to pi/4 no matter whether the included angle alpha is smaller than pi/4 or not, namely restraining the prediction boundary frame to approach the real boundary frame along the direction of the included angle pi/4.
The angular loss Λ is defined as:
[Equations (12)-(15), defining the angle loss Λ, are provided as images in the original and are not reproduced here.]
where b_x and b_y denote the horizontal and vertical coordinates of a bounding-box center point, gt denotes the real bounding box, and c_w and c_h denote the horizontal and vertical distances between the center point of the real bounding box and the center point of the predicted bounding box, as shown in fig. 6. Compared with the case-split angle loss in SIoU, this consistent behaviour of the invention is more conducive to network training.
(2) The distance loss Δ and shape loss Ω are defined as:
[Equations (16)-(19), defining the distance loss Δ and the shape loss Ω, are provided as images in the original and are not reproduced here.]
where ρ in the distance loss denotes the distance between the center points of the predicted and real bounding boxes, and c denotes the diagonal length of their smallest enclosing box (as in fig. 7). w, h and w_gt, h_gt in the shape loss denote the width and height of the predicted bounding box and of the real bounding box, respectively. θ is used to control the global weight of the shape loss; following the setting in SIoU, it is set to 4 on the COCO dataset.
(3) DYIoU Loss is defined as:
L_dyiou = 1 - IoU + λ(Δ + γΛ + υΩ)   (20)
γ = 1 - e^(-Δ)   (21)
[Equation (22), defining υ as a function of the distance loss Δ, is provided as an image in the original.]
where λ is a global parameter used to balance the contribution of the IoU term against the three losses of distance, angle and shape; it remains unchanged throughout the training task, and the invention finally chooses 0.5 on the COCO dataset. The angle loss and the shape loss are given weights γ and υ, respectively, which respectively decrease and increase as the distance loss Δ decreases, i.e. during network training the contribution of the angle loss is dynamically reduced while the contribution of the shape loss is increased.
DYIoU Loss effectively solves the problems of other position losses. First, consistency of optimization is ensured, avoiding optimizing toward different coordinate axes depending on the direction of the angle. Second, the angle loss is not merged into the distance loss but exists as an independent component, so the constraint the angle imposes on the network is more intuitive. Finally, the losses are given dynamic weights, so the contributions of the different losses are adjusted dynamically at different stages of network training.
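Under the assumed exponential forms used in the loss sketch above (γ = 1 - e^(-Δ) and υ = e^(-Δ); the exact definitions appear in Equations (21) and (22) of the original), the trade-off described in this section can be illustrated numerically:

```python
import math

# Assumed weight schedules; Delta is the distance loss, which shrinks as training
# converges and the predicted box approaches the real box.
for delta in (1.0, 0.5, 0.1, 0.01):
    gamma = 1 - math.exp(-delta)   # angle-loss weight: shrinks with Delta
    upsilon = math.exp(-delta)     # shape-loss weight: grows as Delta shrinks
    print(f"Delta={delta:5.2f}  gamma={gamma:.3f}  upsilon={upsilon:.3f}")
# Delta= 1.00  gamma=0.632  upsilon=0.368
# Delta= 0.50  gamma=0.393  upsilon=0.607
# Delta= 0.10  gamma=0.095  upsilon=0.905
# Delta= 0.01  gamma=0.010  upsilon=0.990
```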
2.2 Neck structure fused with deformable convolution
At present, the mainstream design idea for a Neck is to fuse feature maps of different scales. Fusion is done either by element-wise addition, as in FPN and SPP, or by concatenation in the channel dimension, as in the improved PAN in YOLOv4. High-level features usually contain the semantic information of the image and have small feature maps, so they are up-sampled first and then fused with low-level features. Low-level features usually contain the shape information of the image and have large feature maps, so they are down-sampled first and then fused with high-level features.
Since the receptive field of ordinary convolution is limited and its sampling pattern is fixed, the sampling positions cannot be adjusted flexibly (fig. 8(a)). Therefore, the convolutions used in the Neck cannot effectively retain the information in feature maps of different scales. Deformable convolution, by training an offset matrix, allows the convolution to adaptively adjust its sampling positions, enlarges the receptive field, captures more useful features in the image and improves feature-extraction capability (fig. 8(b)). Based on this property, the invention improves the Neck of YOLOv6 and, combining it with deformable convolution, proposes the DePAN Neck (deformable path aggregation network Neck) structure (as in fig. 9) to fuse features of different scales.
The DePAN Neck receives the features of different scales extracted by the Backbone. For a high-level feature, a 1×1 convolution is applied, the result is up-sampled by a transposed convolution and concatenated with the adjacent low-level feature in the channel dimension, and the features are then fused by a DeConvBlock; this is the top-down fusion process in fig. 9. For a low-level feature, it is down-sampled by a 3×3 convolution, concatenated with the adjacent high-level feature in the channel dimension, and the concatenated features are fused by a DeConvBlock, finally yielding the fused features that are fed to the Head for classification and regression prediction. The DeConvBlock structure is composed by repeatedly stacking N DeConv Cells (fig. 10). The DePAN Neck structure is not tied to a particular model; it is plug-and-play and can be extended to various target detection tasks.
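The following sketch shows one top-down and one bottom-up fusion step of the DePAN Neck as described above. The channel counts, the module names and any layer choice beyond those stated in the text (1×1 reduction, transposed-convolution upsampling, 3×3 stride-2 downsampling, channel-wise concatenation) are assumptions for illustration; fusion_block stands in for the DeConvBlock sketched earlier.

```python
import torch
import torch.nn as nn

class TopDownFuse(nn.Module):
    """One top-down DePAN step: 1x1 conv, transposed-conv upsample,
    channel-wise concat with the adjacent lower-level feature, then a fusion block."""
    def __init__(self, high_ch: int, low_ch: int, out_ch: int, fusion_block: nn.Module):
        super().__init__()
        self.reduce = nn.Conv2d(high_ch, out_ch, kernel_size=1)
        self.upsample = nn.ConvTranspose2d(out_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = fusion_block  # e.g. DeConvBlock(out_ch + low_ch, out_ch)

    def forward(self, high_feat: torch.Tensor, low_feat: torch.Tensor) -> torch.Tensor:
        x = self.upsample(self.reduce(high_feat))  # match the lower-level resolution
        x = torch.cat([x, low_feat], dim=1)        # concatenate in the channel dimension
        return self.fuse(x)

class BottomUpFuse(nn.Module):
    """One bottom-up DePAN step: 3x3 stride-2 downsample,
    channel-wise concat with the adjacent higher-level feature, then a fusion block."""
    def __init__(self, low_ch: int, high_ch: int, out_ch: int, fusion_block: nn.Module):
        super().__init__()
        self.downsample = nn.Conv2d(low_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.fuse = fusion_block  # e.g. DeConvBlock(out_ch + high_ch, out_ch)

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        x = self.downsample(low_feat)
        x = torch.cat([x, high_feat], dim=1)
        return self.fuse(x)
```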
3.1 data set
To evaluate the effectiveness of the proposed method, the invention selects the same COCO2017 dataset as in the original paper of the base network YOLOv6. COCO2017 is a large dataset containing 118278 training images and 5000 validation images, with an average of 3.5 targets per image. In addition, the invention performs extensive experiments on the public PASCAL VOC2012 dataset to further verify the effectiveness of the method; VOC2012 contains 5717 training images and 5823 validation images covering 20 categories.
3.2 evaluation index
The invention uses mAP, the metric most commonly used in target detection, as the evaluation index. Its calculation is as follows: first, precision and recall are defined respectively as:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
where TP, FP and FN denote true positives, false positives and false negatives, respectively. AP (average precision) is the area under the precision-recall curve, and mAP is the AP averaged over all classes. mAP_0.5 denotes the mAP at an IoU threshold of 0.5, and mAP_0.5:0.95 denotes the average of the 10 mAP values obtained by varying the IoU threshold from 0.5 to 0.95 in steps of 0.05.
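As a concrete illustration of these definitions, the following minimal sketch (a simplified computation, not the official COCO evaluation protocol) computes precision, recall and AP as the area under the precision-recall curve from a list of scored detections:

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """AP as the area under the precision-recall curve, given per-detection
    confidence scores, TP/FP flags and the number of ground-truth objects."""
    order = np.argsort(-np.asarray(scores))          # sort detections by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    precision = cum_tp / (cum_tp + cum_fp)           # TP / (TP + FP)
    recall = cum_tp / max(num_gt, 1)                 # TP / (TP + FN)
    # integrate precision over recall (simple step integration)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# toy example: 4 detections, 3 ground-truth boxes
print(average_precision(scores=[0.9, 0.8, 0.7, 0.6],
                        is_true_positive=[1, 0, 1, 1], num_gt=3))
```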
3.3 Experimental details
The method is applied to YOLOv6 models of three sizes, YOLOv6-N, YOLOv6-T and YOLOv6-S, and is implemented with the PyTorch deep learning framework. The experimental hardware is an RTX 3090 graphics card. The optimizer is SGD, and the image input size and data augmentation follow the settings of the original YOLOv6. Since the batch size in the original YOLOv6 is 256 while the invention uses 64, the initial learning rate is reduced to 1/4 of the original setting, i.e. the initial learning rate of YOLOv6-N is set to 0.005 and that of YOLOv6-T and YOLOv6-S to 0.0025. The cosine annealing strategy is used to adjust the learning rate, the number of epochs is set to 400, and no pre-training is used.
3.4 experimental results and analysis
The invention applies the proposed method to the three network sizes YOLOv6-N, YOLOv6-T and YOLOv6-S; the detection accuracy is shown in Table 1. Compared with the original models, mAP_0.5:0.95 of the three networks is improved by 2.9%, 2.1% and 0.8%, and mAP_0.5 by 3.0%, 2.3% and 0.9%, respectively. Meanwhile, to test the detection effect of the algorithm on small targets, the small-target detection accuracy is reported in Table 2, where mAP_0.5:0.95 is improved by 1.2%, 1.3% and 0.6%, respectively. To demonstrate the improvement in algorithm performance more intuitively, fig. 11 compares the inference results on images of the original algorithm and of the improved algorithm. The experimental results show that training with the proposed DePAN Neck and DYIoU Loss significantly improves model performance.
3.5 ablation experiments
To verify the effectiveness of each module, the invention designs a series of controlled-variable experiments. The two model sizes YOLOv6-N and YOLOv6-T are used to test the influence of each proposed method on model performance; the experimental settings are the same as in section 3.3, and the models are trained on the COCO training set. The results are shown in Table 3, where + denotes training with DYIoU Loss as the position loss on top of the original model, and ++ denotes training the model improved with both DYIoU Loss and DePAN Neck. Fig. 12(a), 12(b) and 12(c) show the position loss and the mAP curves of the YOLOv6-N training process. From the results in Table 3 it can be seen that gradually adding the proposed methods to the original model clearly improves accuracy, and the effect is especially evident after the DePAN Neck is used to fuse features.
To further verify the effectiveness of the dynamic position Loss presented herein, the present experiment devised a series of ablative experiments on the VOC2012 dataset for deep analysis of DYIoU Loss.
First, for the experiments on the angle-loss dynamic weight γ and the shape-loss dynamic weight υ in Equation (20), the influence on detection accuracy was observed with and without each dynamic weight; the results are shown in Table 4, where the two symbols indicate whether the corresponding dynamic weight is used or not. The experimental results show that the detection accuracy is highest when the dynamic weights γ and υ are used simultaneously; when only γ is used and υ is not, the accuracy drops but the loss still converges. However, when the dynamic weight γ is not used, the loss cannot converge regardless of whether υ is used, and the detection accuracy is extremely low. It is therefore necessary to assign a dynamic weight to the angle loss in DYIoU Loss. In addition, to verify the influence of different values of the global parameter λ in Equation (20) on model accuracy, four values, 0.25, 0.5, 0.75 and 1.0, were tested; the results are shown in Table 5, and the detection accuracy is highest when λ equals 0.5.
Table 1 Comparison of detection accuracy (%) of different models on the COCO2017 validation set
[Table 1 is provided as an image in the original.]
Table 2 Comparison of small-target detection accuracy (%)
[Table 2 is provided as an image in the original.]
Table 3 Comparison of ablation experiment results (%)
[Table 3 is provided as an image in the original.]
Table 4 Comparison of detection accuracy (%) with and without each dynamic weight
[Table 4 is provided as an image in the original.]
Table 5 Comparison of detection accuracy (%) with different λ values
[Table 5 is provided as an image in the original.]

Claims (3)

1. An improved YOLOv6 target detection method introducing dynamic position loss, characterized by the steps of:
(1) Performing Mosaic data enhancement on the data, randomly cutting and scaling 4 pictures, and then randomly arranging and splicing the pictures to form one picture;
(2) Constructing a DePAN Neck based on deformable convolution, and carrying out multi-scale feature fusion;
(3) The DePAN Neck is used in the YOLOv6 to replace the original Neck part;
(4) Calculating the position loss between the predicted bounding-box coordinates and the label bounding-box coordinates, using the dynamic IoU loss as the position loss function;
(5) Calculating the class Loss of the model output and the label by using the VariFocal Loss as a class Loss function, and combining the position Loss as the final total Loss;
(6) Setting the learning rate with a cosine warmup strategy, i.e. warming up the first n epochs with a small learning rate and, after the n-th epoch, adjusting the learning rate from its preset value according to a cosine annealing strategy;
(7) Back-propagating the loss with an SGD optimizer with momentum and updating the model parameters to train the model;
(8) Performing target detection with the model trained in step (7).
2. The improved YOLOv6 target detection method of claim 1, wherein the DePAN Neck in step (2) receives the features of different scales extracted by the Backbone; for a high-level feature, a 1×1 convolution is applied, the result is up-sampled by a transposed convolution and concatenated with the adjacent low-level feature in the channel dimension, and the features are fused by a DeConvBlock; for a low-level feature, it is down-sampled by a 3×3 convolution, concatenated with the adjacent high-level feature in the channel dimension, and the concatenated features are fused by a DeConvBlock, finally yielding the fused features that are fed to the Head for classification and regression prediction; the DeConvBlock structure is formed by repeatedly stacking N DeConv Cells.
3. The improved YOLOv6 target detection method of claim 1, wherein the dynamic IoU loss function in step (4) consists of 4 parts: IoU loss, distance loss, angle loss and shape loss, specifically as follows:
(1) Loss of angle: optimizing the condition that the angle loss is not subdivided, and optimizing the angle loss to pi/4 no matter whether the included angle alpha is smaller than pi/4 or not, namely restraining the prediction boundary frame to approach the real boundary frame along the direction of the included angle pi/4;
the angular loss Λ is defined as:
[The equations defining the angle loss Λ are provided as images in the original and are not reproduced here.]
where b_x and b_y denote the horizontal and vertical coordinates of a bounding-box center point, gt denotes the real bounding box, and c_w and c_h denote the horizontal and vertical distances between the center point of the real bounding box and the center point of the predicted bounding box;
(2) The distance loss Δ and shape loss Ω are defined as:
[The equations defining the distance loss Δ and the shape loss Ω are provided as images in the original and are not reproduced here.]
where ρ in the distance loss denotes the distance between the center points of the predicted and real bounding boxes, and c denotes the diagonal length of their smallest enclosing box; w, h and w_gt, h_gt in the shape loss denote the width and height of the predicted bounding box and of the real bounding box, respectively; θ is used to control the global weight of the shape loss;
(3) DYIoU Loss is defined as:
L_dyiou = 1 - IoU + λ(Δ + γΛ + υΩ)   (20)
γ = 1 - e^(-Δ)   (21)
[Equation (22), defining υ as a function of the distance loss Δ, is provided as an image in the original.]
where λ is a global parameter used to balance the contribution of the IoU term against the three losses of distance, angle and shape; it remains unchanged throughout the training task; the angle loss and the shape loss are given weights γ and υ, respectively, which respectively decrease and increase as the distance loss Δ decreases, i.e. during network training the contribution of the angle loss is dynamically reduced while the contribution of the shape loss is increased.
CN202310072259.6A 2023-02-07 2023-02-07 Improved YOLOv6 target detection method introducing dynamic position loss Pending CN116012686A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310072259.6A CN116012686A (en) 2023-02-07 2023-02-07 Improved YOLOv6 target detection method introducing dynamic position loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310072259.6A CN116012686A (en) 2023-02-07 2023-02-07 Improved YOLOv6 target detection method introducing dynamic position loss

Publications (1)

Publication Number Publication Date
CN116012686A true CN116012686A (en) 2023-04-25

Family

ID=86023121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310072259.6A Pending CN116012686A (en) 2023-02-07 2023-02-07 Improved YOLOv6 target detection method introducing dynamic position loss

Country Status (1)

Country Link
CN (1) CN116012686A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117541891A (en) * 2023-11-13 2024-02-09 东莞理工学院 Target detection method based on angle, diagonal and target foreground information


Similar Documents

Publication Publication Date Title
CN110163187B (en) F-RCNN-based remote traffic sign detection and identification method
CN112215128B (en) FCOS-fused R-CNN urban road environment recognition method and device
CN110348384B (en) Small target vehicle attribute identification method based on feature fusion
CN111695522A (en) In-plane rotation invariant face detection method and device and storage medium
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN105144239A (en) Image processing device, program, and image processing method
CN114841244B (en) Target detection method based on robust sampling and mixed attention pyramid
CN110263731B (en) Single step human face detection system
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN113128564B (en) Typical target detection method and system based on deep learning under complex background
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN117152484A (en) Small target cloth flaw detection method for improving YOLOv5s
CN116012686A (en) Improved YOLOv6 target detection method introducing dynamic position loss
CN114581744A (en) Image target detection method, system, equipment and storage medium
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
CN115147745A (en) Small target detection method based on urban unmanned aerial vehicle image
CN115937879A (en) Academic content target detection method and system based on multi-scale feature fusion network
CN111582057B (en) Face verification method based on local receptive field
CN111797795A (en) Pedestrian detection algorithm based on YOLOv3 and SSR
CN112733741B (en) Traffic sign board identification method and device and electronic equipment
CN110533098B (en) Method for identifying loading type of green traffic vehicle compartment based on convolutional neural network
CN111931767B (en) Multi-model target detection method, device and system based on picture informativeness and storage medium
CN114743045A (en) Small sample target detection method based on double-branch area suggestion network
Yu et al. An Improved Faster R-CNN Method for Car Front Detection
CN107563418A (en) A kind of picture attribute detection method based on area sensitive score collection of illustrative plates and more case-based learnings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination