CN111814827A - Keypoint target detection method based on YOLO - Google Patents

Keypoint target detection method based on YOLO

Info

Publication number: CN111814827A (application CN202010514432.XA)
Authority: CN (China)
Prior art keywords: yolo, frame, upper left corner, offset
Legal status: Granted; currently Active
Other languages: Chinese (zh)
Other versions: CN111814827B (en)
Inventors: 徐光柱, 屈金山, 万秋波, 雷帮军, 石勇涛, 夏平, 陈鹏, 吴正平
Current Assignee: Hunan Feifei Animation Co., Ltd.
Original Assignee: China Three Gorges University (CTGU)
Application filed by China Three Gorges University (CTGU); priority to CN202010514432.XA; granted and published as CN111814827B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/464 Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

A YOLO-based keypoint target detection method, comprising dataset production and processing: on an annotated dataset whose original annotation boxes are horizontal rectangles, the offset (Δx, Δy) of each keypoint from the top-left vertex of its annotation box is added. The top-left vertex coordinates (LUx, LUy) satisfy LUx smaller than the x-coordinate of every keypoint and LUy smaller than the y-coordinate of every keypoint, so that every keypoint lies in the fourth quadrant of a coordinate system whose origin is the annotation box's top-left vertex. It further comprises point target detection based on offsets from the prediction box's top-left vertex: the prediction box is obtained through YOLO together with each keypoint's offset from the box's top-left vertex, and adding the per-keypoint offsets (Δx, Δy) output by the network to the top-left vertex coordinates (LUx, LUy) yields the keypoint coordinates.

Description

Keypoint target detection method based on YOLO

Technical Field

The invention relates to the technical field of target detection, and in particular to a YOLO-based keypoint target detection method.

Background Art

Visual object detection based on deep learning has made great progress in recent years, but many challenging problems remain. First, current visual object detection models output only the target's bounding box and lack detection of the target's keypoints, such as facial feature points in face detection or limb joints in human body detection. Second, detecting rotated targets has long been a difficulty for current detection algorithms, whose prediction boxes are almost always horizontal rectangles. There are two main reasons: 1) for most targets a horizontal rectangular box suffices, which is closely tied to the viewing angle, since most targets observed from a standing person's viewpoint are horizontal rectangles; 2) training deep learning models depends heavily on dataset annotation, and the annotation boxes of most datasets are still horizontal rectangles.

With the continuous development of object detection technology, locating targets via keypoints has been recognized as a feasible approach. Reference [1] (Law H, Deng J. CornerNet: Detecting Objects as Paired Keypoints [J]. International Journal of Computer Vision, 2020, 128(3): 642-656) predicts a target's top-left and bottom-right corners and localizes the target by the rectangle these two key corner points form; this is simpler than center-point prediction, but in essence it still yields a horizontal rectangular box and does not output point targets. Reference [2] (Zhou X, Wang D, Krähenbühl P. Objects as Points. arXiv e-prints, 2019: arXiv:1904.07850) adds a central keypoint to [1] and detects targets with three keypoints, improving precision and recall, but in essence it still uses three keypoints to determine the prediction box and in the end does not output the keypoints themselves.

Chinese patent CN201810363952.8 proposes a deep-learning-based palm detection and keypoint localization method: a Faster R-CNN network is trained to produce palm contour candidate boxes and locate palm keypoints at detection time, and the candidate-box threshold is then adjusted to select the best palm image with localized keypoints from the candidates.

In addition, the detection of rotated and tilted targets has also drawn attention. Reference [3] (Ma J, Shao W, Ye H, et al. Arbitrary-Oriented Scene Text Detection via Rotation Proposals [J]. IEEE Transactions on Multimedia, 2018, 20(11): 3111-3122) proposes an arbitrary-orientation text detection scheme: rotated anchors with angles (Rotation Anchors) are set, the candidate boxes are mapped onto the feature map through an RRoI (rotated region of interest) pooling layer, and the result is obtained from the classifier. However, RRPN suffers from being too slow.

Reference [4] (Yang X, Liu Q, Yan J, et al. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. arXiv e-prints, 2019: arXiv:1908.05612) addresses RRPN's problems by building a single-stage detection framework on RetinaNet and refining the first-stage results following the RefineDet idea, thereby improving speed. Chinese patent 201910381699.3 proposes a ship multi-target detection method based on rotated-region extraction: rotated targets are annotated, and the final detection result is obtained by computing the rotated intersection-over-union between the highest-confidence preselected box and the other preselected boxes, but detection accuracy is hard to guarantee.

Reference [5] (Redmon J, Divvala S K, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection [C]. Computer Vision and Pattern Recognition, 2016: 779-788) describes YOLO, an object detection system based on a single neural network proposed by Joseph Redmon, Ali Farhadi, et al. in 2015. To guarantee detection efficiency, YOLO adopts a one-stage design. Unlike two-stage algorithms such as R-CNN, which must generate region proposals at considerable computational cost and are therefore slow, YOLO generates no region proposals: a single convolutional neural network divides the input image into n*n grid cells, makes predictions for each cell, and classifies and regresses targets directly, achieving end-to-end detection and a large speed gain. YOLO reaches 45 fps on a GPU, while its simplified version reaches 155 fps. To improve accuracy, YOLO9000 and YOLOv3 were proposed successively: Reference [6] (Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger [C]. IEEE Conference on Computer Vision & Pattern Recognition, 2017: 7263-7271) and Reference [7] (Redmon J, Farhadi A. YOLOv3: An Incremental Improvement. arXiv e-prints, 2018: arXiv:1804.02767).

As a general-purpose object detection system with excellent performance, YOLO owes the feasibility of its engineering applications to its speed advantage, so people have tried to use YOLO to solve related problems; the original YOLO, however, outputs only horizontal rectangles as target boxes. Reference [8] (Lei J, Gao C, Hu J, et al. Orientation Adaptive YOLOv3 for Object Detection in Remote Sensing Images [C], 2019: 586-597) therefore extends YOLO to localize rotated rectangular targets by adding a theta output, the rotation angle of the prediction box; but this method only handles in-plane rotation of a rectangle, and irregular quadrilaterals, such as a trapezoid-like shape produced by out-of-plane rotation, still cannot be localized accurately by rotation alone. Likewise, Chinese patent CN201910707178.2 proposes a YOLOv3-based rotated-rectangle detection method that represents the target as a 5-element vector (x, y, w, h, θ), adds an angle θ, and detects rotated targets with angled anchors; it too handles only simple in-plane rotation and still cannot localize out-of-plane-rotated scenes precisely. Chinese patent CN201910879419.1 proposes an underwater target detection algorithm based on an improved YOLO, designing a new loss function that incorporates the object's aspect ratio and thereby improving detection of rotated or rolled-over underwater objects, but the scenes involved are limited. Chinese patent CN201910856434.4 proposes a YOLO-based license plate localization and recognition method which, to improve localization accuracy, trains an improved YOLO convolutional neural network and a convolution-enhanced SRCNN (Super Resolution) network; during YOLO training the original model's activation function is replaced with maxout, enhancing the fitting ability.

Although the above improvements raise YOLO's ability to handle detection in complex scenes to some extent, YOLO still has the following problems: 1) for visual targets that have keypoints, detecting the keypoints is just as important, e.g., facial feature points in face detection and limb joints in human body detection, yet YOLO cannot detect them; 2) reality contains many non-axis-aligned rectangles, i.e., rectangular objects with large aspect ratios seen at rotation angles caused by different viewpoints, such as license plates at various angles, or vehicles and ships photographed from the air. YOLO's prediction boxes for such rotated, tilted rectangular targets contain a large amount of redundant information irrelevant to the target.

Summary of the Invention

In view of the above technical problems, the present invention provides a YOLO-based keypoint target detection method. By adding a point-target detection algorithm on top of the original YOLO, YOLO gains the ability to detect point targets, can output the target detection box and the keypoints simultaneously, and achieves precise localization of rotated rectangular objects in concrete applications.

The technical scheme adopted by the present invention is as follows:

The YOLO-based keypoint target detection method comprises the following steps:

Step 1. Dataset production and processing:

On an annotated dataset whose original annotation boxes are horizontal rectangles, add each keypoint's offset (Δx, Δy) from the top-left vertex of its annotation box. The top-left vertex coordinates (LUx, LUy) satisfy LUx smaller than the x-coordinate of every keypoint and LUy smaller than the y-coordinate of every keypoint; every keypoint therefore lies in the fourth quadrant of a coordinate system whose origin is the annotation box's top-left vertex.

Step 2. Point target detection based on offsets from the prediction box's top-left vertex:

First, obtain the prediction box through YOLO together with each keypoint's offset from the box's top-left vertex; adding the per-keypoint offsets (Δx, Δy) output by the network to the top-left vertex coordinates (LUx, LUy) yields the keypoint coordinates.
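A minimal sketch of this coordinate recovery in Python (function and variable names are illustrative assumptions, not from the patent):

```python
import numpy as np

def recover_keypoints(top_left, offsets):
    """Step 2 of the method: add each keypoint's predicted offset
    (dx, dy) to the prediction box's top-left vertex (LUx, LUy)."""
    lu = np.asarray(top_left, dtype=np.float32)        # (LUx, LUy)
    return np.asarray(offsets, dtype=np.float32) + lu  # shape (m, 2)

# Example: box top-left at (100, 40) and four keypoint offsets.
keypoints = recover_keypoints((100, 40),
                              [(5, 8), (60, 4), (62, 30), (3, 33)])
```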

The YOLO-based keypoint target detection method of the present invention has the following advantages:

1. While YOLO localizes with the prediction box, the distance of each keypoint from the prediction box's top-left vertex is predicted in parallel, and each keypoint's position is then obtained by combining that distance with the position of the top-left vertex. This provides a feasible scheme for point-target detection.

2. The point-target detection scheme of the present invention copes with the inaccurate localization and the large amount of redundant information that arise when detecting rotated, tilted rectangular targets, providing a precise-localization scheme for tilted license plates, traffic signs, remote-sensing detection of ships from high altitude, and the like.

3. The point-target detection based on offsets from the prediction box's top-left vertex proposed by the present invention is general and can be applied to all detection tasks that need to output feature points as well as target boxes.

4. The present invention provides a point-target detection scheme: the position of a point target is obtained from its offset relative to the top-left vertex of the initial prediction box, thereby yielding the keypoint's position. On top of the original detection, adding point outputs extends the detection capability of the one-stage algorithm so that it suits more scenarios, such as point-target detection and precise localization of rotated targets.

5. For the concrete implementation of the YOLO point-target detection scheme, the present invention designs within YOLO a loss calculation function for the offset of a point target relative to the prediction box's top-left vertex, thereby improving YOLO's loss function. The output of YOLO is also extended so that it emits the keypoints' position information along with the target box, making YOLO more powerful and able to meet more application requirements.

Description of the Drawings

Figure 1(1) shows that current face detection obtains only the face bounding box (keypoints are not detected).

Figure 1(2) is a schematic of the prediction box containing redundant information in tilted street-sign detection.

Figure 1(3) is a schematic of the present invention outputting the target box and keypoints simultaneously in face detection.

Figure 1(4) is a schematic of the present invention achieving precise detection of tilted street signs through keypoint detection.

Figure 2 is a schematic of the YOLO target detection pipeline.

Figure 3 shows the relationship between YOLOv3 position prediction and the anchor.

Figure 4 shows the unsatisfactory result of YOLO on rotated rectangular boxes.

Figure 5(1) is a flowchart of dataset production and algorithm design of the present invention.

Figure 5(2) is a flowchart of target detection with the model of the present invention.

Figure 6 shows how keypoint positions are obtained from offsets.

Figure 7 shows the annotation formats of YOLOv3 and of the keypoint scheme.

Figure 8 is a flowchart of the concrete application of YOLO to rotated-rectangle target detection.

Detailed Description

In the YOLO-based keypoint target detection method, a point-target detection algorithm is added on top of the original YOLO, giving YOLO the ability to detect point targets so that it can output the target detection box and the keypoints simultaneously, while achieving precise localization of rotated rectangular objects in concrete applications, as shown in Figures 1(1) to 1(4).

Figure 1(1) is a schematic of current face detection obtaining only the face bounding box, lacking keypoint detection.

Figure 1(2) is a schematic of the prediction box containing redundant information in tilted street-sign detection, an unsatisfactory result.

Figure 1(3) is a schematic of the present scheme outputting the target box and keypoints simultaneously in face detection.

Figure 1(4) is a schematic of the present scheme achieving precise detection of tilted street signs through keypoint detection.

(1) The core idea of YOLO. As shown in Figure 2, YOLO divides the input image into S x S grid cells; if the annotated center point of an object falls inside a cell, that cell detects the object. Taking YOLOv3's bounding-box prediction as an example of the detection principle: before training, YOLOv3 generates 9 anchors by k-means clustering, corresponding to output feature maps at 3 scales with 3 anchors per scale. At a network input size of 416, the feature maps output at YOLOv3's three scales are 13*13, 26*26 and 52*52, used to detect large, medium and small targets respectively. For each pixel (grid cell) of the feature map at each scale, the 3 anchors of that scale make predictions; the most suitable anchor is found and the corresponding offset is given, which yields the prediction box. For each prediction box YOLOv3 outputs four values, t_x, t_y, t_w and t_h, and the mapping between these four values and the final predicted bbox is shown in formulas (1.1) to (1.4).

b_x = δ(t_x) + c_x    (1.1)

b_y = δ(t_y) + c_y    (1.2)

b_w = p_w · e^{t_w}    (1.3)

b_h = p_h · e^{t_h}    (1.4)

Formulas (1.1)-(1.4) are the mapping formulas between YOLOv3's output values and the prediction box.

Here t_x and t_y are the coordinate-offset values, and t_w and t_h are the scale factors of the predicted size, with p_w and p_h the anchor's width and height. δ(t_x) and δ(t_y) express the offset of a target's center point relative to the grid cell responsible for detecting that target, as marked in Figure 3, where c_x and c_y are the coordinates of the top-left corner of the grid cell containing the center point. The finally obtained b_x, b_y, b_w, b_h are the center-point coordinates and the width and height relative to the feature map.
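A sketch of the mapping (1.1)-(1.4) in Python, assuming δ is the logistic sigmoid and all quantities are in feature-map units (names are illustrative):

```python
import numpy as np

def sigmoid(t):
    """The delta function in formulas (1.1)-(1.4)."""
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw YOLOv3 outputs (tx, ty, tw, th) to a box, with (cx, cy)
    the responsible grid cell's top-left corner and (pw, ph) the anchor
    width and height."""
    bx = sigmoid(tx) + cx   # (1.1) center x
    by = sigmoid(ty) + cy   # (1.2) center y
    bw = pw * np.exp(tw)    # (1.3) width
    bh = ph * np.exp(th)    # (1.4) height
    return bx, by, bw, bh
```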

The YOLOv3 loss function is shown in formula (1.5):

Loss_yolov3 = Loss_center + Loss_wh + Loss_score + Loss_class    (1.5)

Loss_center = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]    (1.5.1)

Loss_wh = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]    (1.5.2)

Loss_score = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²    (1.5.3)
           + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²    (1.5.4)

Loss_class = Σ_{i=0}^{S²} 1_i^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²    (1.5.5)

The YOLOv3 loss function comprises four parts: the center-coordinate loss Loss_center of formula (1.5.1), the width-height loss Loss_wh of formula (1.5.2), the confidence loss Loss_score of formulas (1.5.3)-(1.5.4), and the class loss Loss_class of formula (1.5.5). The variables in formula (1.5) have the following meanings: S x S is the number of grid cells into which the network divides the image, B is the number of bounding boxes predicted per grid cell, and 1_{ij}^{obj} indicates that the j-th bounding box of grid cell i is responsible for the prediction. Within the individual terms: in formula (1.5.1), λ_coord is a dynamic weighting parameter, (x̂_i, ŷ_i) is the ground-truth center coordinate, and (x_i, y_i) is the predicted value; in formula (1.5.2), ŵ_i and ĥ_i are the ground-truth width and height of the target, while w_i and h_i are the width and height the network predicts for it; the confidence loss is composed of two parts, the loss when an object is present (1.5.3) and the loss when no object is present (1.5.4), where λ_noobj is the coefficient of the network's error when no object is present, and Ĉ_i and C_i are the ground-truth confidence of the detected target and the network's predicted confidence, respectively; in formula (1.5.5), p̂_i(c) is the ground-truth probability of the detected target's class.

(2) Point target detection based on offsets from the prediction box's top-left vertex:

When detecting non-horizontal rectangular targets, the original YOLO obtains its final horizontal prediction box by predicting a center point plus width and height, which leads to the problem shown in Figure 4. Figure 4 shows two aircraft carriers berthed in a port: even in the ideal case, the original YOLO's prediction boxes are the yellow boxes, which contain a large amount of information irrelevant to the targets; at the same time the two carriers' prediction boxes overlap heavily, so during non-maximum suppression (NMS, the technique described in Reference [9]: Neubeck A, Gool L J V. Efficient Non-Maximum Suppression [C]. International Conference on Pattern Recognition, 2006: 850-855) one of the predictions is easily discarded, causing a missed detection. The ideal prediction boxes would be the red rotated rectangles, but this is clearly beyond YOLO's capability.
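For illustration, a minimal greedy NMS sketch in Python showing the failure mode described above; this is the baseline procedure, not the efficient variant of Reference [9], and all names are illustrative:

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, discard every box that
    overlaps it by more than `thresh`, and repeat. Two tilted ships with
    heavily overlapping horizontal boxes can thus lose one detection."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [j for j in order if iou(boxes[best], boxes[j]) < thresh]
    return keep
```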

(3) The YOLO-based keypoint target detection method of the present invention:

The method not only gives YOLO the ability to detect point targets but, in concrete applications, also solves the rotated-target detection problem shown in Figure 4: once the keypoints are detected, connecting them yields the more accurate red prediction box. The flow of the scheme is shown in Figures 5(1) and 5(2): first produce the dataset, then design the detection algorithm for point targets based on offsets from the prediction box's top-left vertex, design the loss function, and train the model; at detection time the point targets are obtained from the prediction box and the point offsets.

The principle of point-target detection in the present invention is shown in Figure 6, where the number of keypoints is 4. The dashed box in Figure 6 is the anchor, (p_w, p_h) are the anchor's width and height, the blue box is the target prediction box, the four red arrows are the offsets of the four keypoints of the target relative to the prediction box's top-left vertex, and the green box is the rotated target box finally obtained in rotated-rectangle detection. The formulas for the offsets of the keypoints relative to the prediction box's top-left vertex are given in formulas (1.8) to (2.1).

For each prediction box the model of the present invention outputs t_x, t_y, t_w, t_h and 4 offset pairs. t_x, t_y, t_w, t_h give the prediction box, i.e., the blue bounding box; formulas (1.6)-(1.7) therefore first compute the prediction box's width and height (b_w, b_h), and the 4 offset pairs are then obtained through (b_w, b_h), as shown in formulas (1.8) to (2.1). Here D1_x, D1_y are the offsets of point D1 relative to the prediction box's top-left vertex along the x and y axes; likewise D2_x, D2_y, D3_x, D3_y and D4_x, D4_y are the offsets of the target keypoints D2, D3 and D4 from the prediction box's top-left vertex.

b_w = p_w · e^{t_w}    (1.6)

b_h = p_h · e^{t_h}    (1.7)

D1_x = δ(t_x1) · b_w,  D1_y = δ(t_y1) · b_h    (1.8)

D2_x = δ(t_x2) · b_w,  D2_y = δ(t_y2) · b_h    (1.9)

D3_x = δ(t_x3) · b_w,  D3_y = δ(t_y3) · b_h    (2.0)

D4_x = δ(t_x4) · b_w,  D4_y = δ(t_y4) · b_h    (2.1)

Formulas (1.6)-(2.1) are the calculation formulas of the algorithm based on offsets from the prediction box's top-left vertex.
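A Python sketch of formulas (1.6)-(2.1), plus the final shift by the box's top-left vertex described in Step 2 (names are illustrative assumptions):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_keypoints(t_off, tw, th, pw, ph, top_left):
    """Formulas (1.6)-(2.1): squash the raw per-keypoint outputs with the
    sigmoid, scale them by the predicted box size, then shift by the
    box's top-left vertex to obtain absolute keypoint positions.

    t_off: raw (t_xk, t_yk) pairs, shape (4, 2); top_left = (LUx, LUy)."""
    bw = pw * np.exp(tw)                        # (1.6)
    bh = ph * np.exp(th)                        # (1.7)
    d = sigmoid(np.asarray(t_off, dtype=np.float32))
    d *= np.array([bw, bh], dtype=np.float32)   # (1.8)-(2.1): Dk_x, Dk_y
    return d + np.asarray(top_left, dtype=np.float32)
```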

The loss function of the keypoints in YOLO point-target detection is shown in formula (2.2), in which the number of keypoints is 4:

Loss_KeyPoint = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} Σ_{k=1}^{4} [(Dk_x − D̂k_x)² + (Dk_y − D̂k_y)²]    (2.2)

If the number of keypoints increases, the keypoint loss function becomes formula (2.3), where m is the number of keypoints:

Loss_KeyPoint = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} Σ_{k=1}^{m} [(Dk_x − D̂k_x)² + (Dk_y − D̂k_y)²]    (2.3)

Formulas (2.2)-(2.3) are the offset loss calculation formulas of the algorithm based on offsets from the prediction box's top-left vertex. The present invention adds this keypoint loss to the detection loss of the original YOLOv3, so the final loss function is:

Loss_KeyPoint_offset = Loss_yolov3 + Loss_KeyPoint    (2.4)

Formula (2.4) is the total loss calculation formula of the algorithm based on offsets from the prediction box's top-left vertex.
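A Python sketch of the keypoint loss (2.3) and the total loss (2.4), assuming squared-error offsets accumulated only over predictors responsible for an object (the exact weighting is not reproduced here; names are illustrative):

```python
import numpy as np

def keypoint_loss(pred_off, gt_off, obj):
    """Sketch of formula (2.3): squared error over the m (Dk_x, Dk_y)
    offsets of every predictor responsible for an object.

    pred_off/gt_off: shape (N, m, 2); obj: shape (N,), 1 or 0."""
    per_pred = np.sum((pred_off - gt_off) ** 2, axis=(1, 2))
    return np.sum(obj * per_pred)

def total_loss(loss_yolov3, loss_keypoint):
    """Formula (2.4): the keypoint term is added to the YOLOv3 loss."""
    return loss_yolov3 + loss_keypoint
```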

Dataset production and processing: the key to data handling in YOLO point-target detection is adding the keypoints' position information to the training set, i.e., adding it to an annotated dataset whose original annotation boxes are horizontal rectangles. Concretely, each keypoint's offset (Δx, Δy) from the annotation box's top-left vertex is added to the annotation. Note that the top-left vertex coordinates (LUx, LUy) must satisfy LUx smaller than the x-coordinate of every keypoint and LUy smaller than the y-coordinate of every keypoint; every keypoint then lies in the fourth quadrant of a coordinate system whose origin is the annotation box's top-left vertex. Figure 7 shows the annotation format of the training data in the original YOLO and the annotation format of the keypoint detection scheme, respectively. The format shown is for 4 keypoints; when the number of keypoints grows, the annotations of the corresponding training set must likewise add the offsets of the corresponding keypoints.
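Figure 7 itself is not reproduced here; a plausible extended annotation for one object with 4 keypoints might look as follows (the field order is an assumption, not taken from the figure):

```python
# One annotation line per object: the original YOLO fields (class, cx,
# cy, w, h) followed by each keypoint's offset (dxk, dyk) from the
# annotation box's top-left vertex.
label = [0,                         # class id
         0.48, 0.51, 0.30, 0.22,    # horizontal box: cx, cy, w, h
         0.02, 0.03, 0.27, 0.01,    # (dx1, dy1), (dx2, dy2)
         0.29, 0.20, 0.01, 0.21]    # (dx3, dy3), (dx4, dy4)
```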

Detection with the model: the point-target detection scheme based on offsets from the prediction box's top-left vertex solves YOLO's point-target detection as follows. First, the prediction box is obtained through YOLO together with each keypoint's offset from the box's top-left vertex. Adding the per-keypoint offsets (Δx, Δy) output by the network to the top-left vertex coordinates (LUx, LUy) yields the keypoint coordinates. In applications such as detecting the facial feature points of a face, obtaining the keypoint positions is enough and no post-processing is needed. When the scheme is applied to detecting rotated rectangular targets, however, a precise positioning frame must be drawn from the four given keypoints. The flow is therefore as shown in Figure 8: the image is input into the YOLO network, which yields the prediction box bbox together with the keypoints' offsets from its top-left corner; the keypoint positions are computed from the box's top-left vertex and the four offsets, and the keypoints are then connected to obtain the precise red positioning frame.
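A sketch of the final drawing step of Figure 8, assuming OpenCV and keypoints ordered around the target (names are illustrative):

```python
import cv2
import numpy as np

def draw_rotated_box(image, top_left, offsets, color=(0, 0, 255)):
    """Final step of Figure 8: recover the four keypoints from the box's
    top-left vertex and their offsets, then connect them into the
    precise (red) positioning frame. Assumes the keypoints are ordered
    around the target so consecutive points share an edge."""
    pts = np.asarray(offsets, np.float32) + np.asarray(top_left, np.float32)
    pts = pts.astype(np.int32).reshape(-1, 1, 2)
    cv2.polylines(image, [pts], isClosed=True, color=color, thickness=2)
    return image
```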

As a general-purpose detection system with excellent performance, YOLO performs well in both detection accuracy and detection speed, but its horizontal prediction box offers no good solution for point targets or tilted, rotated rectangular targets. The present invention therefore gives YOLO the ability to detect point targets through the point-target detection algorithm based on offsets from the YOLO prediction box's top-left vertex, and applies it to the detection of rotated, tilted targets, thereby achieving the goal of extending YOLO.

The present invention proposes a point-target detection scheme based on offsets from the top-left corner of the prediction box. The scheme outputs the target box and the feature points simultaneously, solving keypoint detection such as facial feature points and human limb joints, while also solving the precise localization of rotated, tilted rectangular targets such as traffic signs and billboards seen at oblique angles and targets in high-altitude remote sensing.

Claims (4)

1. A YOLO-based keypoint target detection method, characterized by comprising the following steps:

Step 1. Dataset production and processing: on an annotated dataset whose original annotation boxes are horizontal rectangles, add each keypoint's offset (Δx, Δy) from the top-left vertex of its annotation box, the coordinates of that top-left vertex being (LUx, LUy), with LUx smaller than the x-coordinate of every keypoint and LUy smaller than the y-coordinate of every keypoint, so that every keypoint lies in the fourth quadrant of a coordinate system whose origin is the annotation box's top-left vertex;

Step 2. Point target detection based on offsets from the prediction box's top-left vertex: first obtain the prediction box through YOLO together with each keypoint's offset from the box's top-left vertex; adding the per-keypoint offsets (Δx, Δy) output by the network to the top-left vertex coordinates (LUx, LUy) yields the keypoint coordinates.

2. The YOLO-based keypoint target detection method according to claim 1, characterized in that: in Step 1, the number of keypoints is 4, and the offsets of the keypoints relative to the prediction box's top-left vertex are given by formulas (1.8)-(2.1); for each prediction box the model outputs t_x, t_y, t_w, t_h and 4 offset pairs, where t_x, t_y, t_w, t_h predict the original target box, i.e., the bounding box bbox; formulas (1.6)-(1.7) therefore first compute the prediction box's width and height (b_w, b_h), and the 4 offset pairs are then obtained through (b_w, b_h), as shown in formulas (1.8)-(2.1):

b_w = p_w · e^{t_w}    (1.6)

b_h = p_h · e^{t_h}    (1.7)

D1_x = δ(t_x1) · b_w,  D1_y = δ(t_y1) · b_h    (1.8)

D2_x = δ(t_x2) · b_w,  D2_y = δ(t_y2) · b_h    (1.9)

D3_x = δ(t_x3) · b_w,  D3_y = δ(t_y3) · b_h    (2.0)

D4_x = δ(t_x4) · b_w,  D4_y = δ(t_y4) · b_h    (2.1)

where D1_x, D1_y are the offsets of point D1 from the prediction box's top-left vertex along the x and y axes, and likewise D2_x, D2_y, D3_x, D3_y and D4_x, D4_y are the offsets of the target keypoints D2, D3 and D4 from the top-left vertex.

3. The YOLO-based keypoint target detection method according to claim 2, characterized in that: the loss function of the keypoints in YOLO point-target detection is formula (2.2), in which the number of keypoints is 4:

Loss_KeyPoint = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} Σ_{k=1}^{4} [(Dk_x − D̂k_x)² + (Dk_y − D̂k_y)²]    (2.2)

if the number of keypoints increases, the keypoint loss function becomes formula (2.3), where m is the number of keypoints:

Loss_KeyPoint = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} Σ_{k=1}^{m} [(Dk_x − D̂k_x)² + (Dk_y − D̂k_y)²]    (2.3)

YOLO point-target detection adds the keypoints' computed loss to the detection loss of the original YOLOv3, so the final loss function is:

Loss_KeyPoint_offset = Loss_yolov3 + Loss_KeyPoint    (2.4).

4. The YOLO-based keypoint target detection method according to claim 2, characterized in that: in Step 2, the image is input into the YOLO network, which yields the prediction box together with the keypoints' offsets from its top-left corner; the keypoint positions are computed from the prediction box's top-left vertex and the four keypoint offsets, and the keypoints are then connected to obtain a precise positioning frame.
CN202010514432.XA 2020-06-08 2020-06-08 YOLO-based key point target detection method Active CN111814827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010514432.XA CN111814827B (en) 2020-06-08 2020-06-08 YOLO-based key point target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010514432.XA CN111814827B (en) 2020-06-08 2020-06-08 YOLO-based key point target detection method

Publications (2)

Publication Number Publication Date
CN111814827A (en) 2020-10-23
CN111814827B CN111814827B (en) 2024-06-11

Family

ID=72844777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010514432.XA Active CN111814827B (en) 2020-06-08 2020-06-08 YOLO-based key point target detection method

Country Status (1)

Country Link
CN (1) CN111814827B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229442A (en) * 2018-02-07 2018-06-29 西南科技大学 Face fast and stable detection method in image sequence based on MS-KCF
CN108960340A (en) * 2018-07-23 2018-12-07 电子科技大学 Convolutional neural networks compression method and method for detecting human face
CN109117831A (en) * 2018-09-30 2019-01-01 北京字节跳动网络技术有限公司 The training method and device of object detection network
CN110490256A (en) * 2019-08-20 2019-11-22 中国计量大学 A vehicle detection method based on key point heat map
CN110580445A (en) * 2019-07-12 2019-12-17 西北工业大学 An Improved Face Keypoint Detection Method Based on GIoU and Weighted NMS
CN110930454A (en) * 2019-11-01 2020-03-27 北京航空航天大学 Six-degree-of-freedom pose estimation algorithm based on boundary box outer key point positioning


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420774A (en) * 2021-03-24 2021-09-21 成都理工大学 Target detection technology for irregular graph
CN113255671A (en) * 2021-07-05 2021-08-13 浙江智慧视频安防创新中心有限公司 Target detection method, system, device and medium for object with large length-width ratio
CN113256724A (en) * 2021-07-07 2021-08-13 上海影创信息科技有限公司 Handle inside-out vision 6-degree-of-freedom positioning method and system
CN113537342A (en) * 2021-07-14 2021-10-22 浙江智慧视频安防创新中心有限公司 Object detection method, device, storage medium and terminal in an image
CN113537158A (en) * 2021-09-09 2021-10-22 科大讯飞(苏州)科技有限公司 Image target detection method, device, equipment and storage medium
CN113888741A (en) * 2021-12-06 2022-01-04 智洋创新科技股份有限公司 Method for correcting rotating image of instrument in power distribution room
CN114219991A (en) * 2021-12-06 2022-03-22 安徽省配天机器人集团有限公司 Target detection method, device and computer readable storage medium
WO2023184123A1 (en) * 2022-03-28 2023-10-05 京东方科技集团股份有限公司 Detection method and device for violation of rules and regulations

Also Published As

Publication number Publication date
CN111814827B (en) 2024-06-11

Similar Documents

Publication Publication Date Title
CN111814827A (en) Keypoint target detection method based on YOLO
CN109598241B (en) Recognition method of ships at sea based on satellite imagery based on Faster R-CNN
CN111553347B (en) Scene text detection method oriented to any angle
CN110728200A (en) Real-time pedestrian detection method and system based on deep learning
Wang et al. Led2-net: Monocular 360deg layout estimation via differentiable depth rendering
CN111968177A (en) Mobile robot positioning method based on fixed camera vision
CN110991444B (en) License plate recognition method and device for complex scene
CN112560675B (en) Bird Vision Object Detection Method Combining YOLO and Rotation-Fusion Strategy
CN103761747B (en) Target tracking method based on weighted distribution field
CN111476089A (en) Pedestrian detection method, system and terminal based on multi-mode information fusion in image
Zhu et al. Arbitrary-oriented ship detection based on retinanet for remote sensing images
CN113255555A (en) Method, system, processing equipment and storage medium for identifying Chinese traffic sign board
CN108022243A (en) Method for detecting paper in a kind of image based on deep learning
CN107948586A (en) Trans-regional moving target detecting method and device based on video-splicing
CN111767854B (en) SLAM loop detection method combined with scene text semantic information
Wang et al. Led 2-net: Monocular 360 layout estimation via differentiable depth rendering
CN115661255B (en) A laser SLAM loop detection and correction method
CN104200213A (en) Vehicle detection method based on multiple parts
CN104050674B (en) Salient region detection method and device
CN116363168A (en) A remote sensing video target tracking method and system based on super-resolution network
Li et al. Improved YOLOv5s algorithm for small target detection in UAV aerial photography
CN112598055B (en) Helmet wearing detection method, computer-readable storage medium and electronic device
CN112417958B (en) Remote sensing image rotating target detection method
Osuna-Coutiño et al. Structure extraction in urbanized aerial images from a single view using a CNN-based approach
CN116524026A (en) Dynamic vision SLAM method based on frequency domain and semantics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240131

Address after: 1003, Building A, Zhiyun Industrial Park, No. 13 Huaxing Road, Tongsheng Community, Dalang Street, Longhua District, Shenzhen City, Guangdong Province, 518000

Applicant after: Shenzhen Wanzhida Enterprise Management Co.,Ltd.

Country or region after: China

Address before: 443002 No. 8, University Road, Xiling District, Yichang, Hubei

Applicant before: CHINA THREE GORGES University

Country or region before: China

TA01 Transfer of patent application right

Effective date of registration: 20240501

Address after: No. 102, Commercial Building, Yuehu Park, No. 140 Hongshan Road, Hongshan Street, Kaifu District, Changsha City, Hunan Province, 410000

Applicant after: Hunan Feifei Animation Co.,Ltd.

Country or region after: China

Address before: 1003, Building A, Zhiyun Industrial Park, No. 13 Huaxing Road, Tongsheng Community, Dalang Street, Longhua District, Shenzhen City, Guangdong Province, 518000

Applicant before: Shenzhen Wanzhida Enterprise Management Co.,Ltd.

Country or region before: China

GR01 Patent grant