CN111353440A - Target detection method - Google Patents

Target detection method

Info

Publication number
CN111353440A
Authority
CN
China
Prior art keywords
training
target
model
data set
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010137294.8A
Other languages
Chinese (zh)
Inventor
刘建闽
向钰
彭小华
阎晶亮
黄嵩衍
胡波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Humanities Science and Technology
Guangxi University of Finance and Economics
Original Assignee
Hunan University of Humanities Science and Technology
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Humanities Science and Technology, Guangxi University of Finance and Economics
Publication of CN111353440A
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method comprising the following steps: constructing a video target detection and recognition model; inputting the labeled data of a training data set into a fully convolutional network; optimizing a composite loss value generated as a weighted linear combination of independent loss functions, and obtaining the overall loss function by minimization during training. At the last layer of the network model, the video target detection and recognition model uses a logistic activation function to compute the classification confidence value and the rectangular selection box. The invention improves both the accuracy and the real-time performance of target detection and recognition in high-frame-rate high-definition video.

Description

Target detection method
Technical Field
The invention belongs to the technical field of video identification, and particularly relates to a target detection method.
Background
With the rapid development of fields such as intelligent monitoring and intelligent transportation, the growth of high-frame-rate high-definition data sources, and the high requirements and complex, variable nature of practical application scenes, classical methods can no longer meet the latest demands: they cannot deliver the accuracy and real-time performance required for target detection and recognition in high-frame-rate high-definition video.
Therefore, how to provide a target detection method with high accuracy and real-time performance is a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a target detection method that solves the accuracy and real-time problems of target detection and recognition in high-frame-rate high-definition video under demanding, complex and variable practical application scenes.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method of target detection, comprising:
constructing a video target detection and recognition model, inputting the labeled data of a training data set into a fully convolutional network through a high-speed storage medium, optimizing a composite loss value generated as a weighted linear combination of independent loss functions, and obtaining the overall loss function by minimization during training;
all layers of the video target detection and recognition model except the last use a leaky rectified linear unit (Leaky ReLU) as the activation function, and the last layer of the network model uses a logistic (sigmoid) activation function to compute the classification confidence value and the rectangular selection box, thereby realizing real-time detection and recognition of vehicle and pedestrian targets in video.
Preferably, the method further comprises testing the video target detection and recognition model: the verification data of a verification data set are fed to the optimized fully convolutional network to obtain predicted rectangular selection boxes, which are compared against the true rectangular selection boxes of the verification data set to compute the mean of the area under the function curve.
Preferably, pre-training weights and thresholds are initialized and input to the fully convolutional network, with a learning rate decay strategy applied during learning.
Preferably, the pre-training weights and thresholds are initialized as follows: before training, the weights and thresholds are initialized with the parameters of a pre-trained model obtained by training on a standard data set.
Preferably, the learning rate decay strategy comprises the following steps: model parameters are initialized before training with the parameters obtained for the video target detection and recognition model, training uses mini-batch gradient descent, and the learning rate lr of each iteration is initially set to 0.001; the model can then output the confidence value and selection box coordinates of each relevant target class in each partition.
Preferably, the composite loss value generated as a weighted linear combination of independent loss functions is optimized, and the overall loss function is obtained by minimization during training, as follows:
setting Si(xg,ygg,hg) True rectangular selection box, x, for the targetg,ygIs the central point;
setting S (x, y, omega, h) as a prediction rectangle selection frame of a target, wherein x and y are central points;
wherein, x and y are coordinates of the central point of the prediction rectangle selection frame, and omega and h are the width and height of the prediction rectangle selection frame; x is the number ofg,ygSelecting the coordinates of the center point of the frame, omega, for the true rectanglegAnd hgSelecting the width and height of the frame for the true rectangle;
the error D, namely the Euclidean distance between the predicted rectangular selection box of the target detection algorithm and the ground-truth selection box of the target's real label, is weighted by the regularization hyperparameter λ to obtain the loss function L1 = λ·D, which generalizes the detection and recognition capability; at the same time the Intersection over Union loss value L2 = λ·IOU is computed, where IOU is the standard for measuring the accuracy of detecting the corresponding object in a specific data set;
after each input image passes through the network, the 81 partitions produce 486 probabilities: each spatial point carries a probability Pr that quantitatively measures whether a target is present, and the 81 partitions give the probabilities Pc of the 6 classes; these must be composited with the partition's quantitative target probability Po, so that once it is determined that a target exists in a partition, the conditional probability that it belongs to a given class is composited to yield the unconditional probability of each single independent class, giving the corresponding classification loss value Lc = Pc·Po; the overall loss function is L = L1 + L2 + Lc.
Preferably, the convolutional and pooling layers used by the video target detection and recognition model support multi-size image input; a single-size data set is augmented by preprocessing, and during training images of a different size are randomly selected after every M batches.
Preferably, in the detection of multiple target classes, a recall-precision curve can be drawn for each target class, and AveP(q) is the area under that curve, as shown in Equation 8; mAP is the mean of the AP values over the Q classes, where the class index q is an integer from 1 to Q;
mAP = (1/Q) · Σ_{q=1}^{Q} AveP(q)    (8)
Preferably, the training data set consists of various types of video at various resolutions, including the D5 format; the validation data set consists of 20% of the data reserved from the training data set.
The invention has the beneficial effects that:
the method is based on separation confidence calculation and regular term hyper-parameters, designs a composite loss function based on classification confidence values, can be used for detecting and identifying high-frame-rate high-definition video targets under high-requirement and complex and variable practical application scenes, and obviously improves the accuracy and the real-time performance of measurement based on mAP and camera frame rate fps.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram illustrating a video object detection and recognition model according to the present invention.
FIG. 2 is a diagram illustrating a test chart of a video object detection recognition model according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-2, the present invention provides a target detection method, including:
A video target detection and recognition model is constructed, the labeled data of the training data set are input into the fully convolutional network through a high-speed storage medium, and the pre-training weights and thresholds are initialized and input to the fully convolutional network, with a learning rate decay strategy applied during learning. A weighted linear combination of independent loss functions is used to optimize the generated composite loss value, and the overall loss function is obtained by minimization during training.
The video target detection and recognition model computes the classification confidence value and the rectangular selection box with a logistic (sigmoid) activation function at the last layer of the network model, thereby realizing real-time detection and recognition of vehicle and pedestrian targets in video.
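As an illustrative sketch only (the patent contains no code), the activation scheme described in this document, a Leaky ReLU on all layers except the last and a logistic function on the last layer, can be written as follows; the leak slope of 0.1 and the raw head values are assumptions, not values from the patent:

```python
import math

LEAKY_SLOPE = 0.1  # assumed slope for the leaky part; not specified in the text

def leaky_relu(x: float, slope: float = LEAKY_SLOPE) -> float:
    """Activation on all layers except the last: identity for positive
    inputs, a small linear leak for negative inputs."""
    return x if x > 0.0 else slope * x

def logistic(x: float) -> float:
    """Logistic (sigmoid) activation on the last layer, squashing raw
    outputs into (0, 1) confidence values and normalized box coordinates."""
    return 1.0 / (1.0 + math.exp(-x))

# Example: decode one partition's raw head outputs into a confidence
# value and a normalized rectangular selection box (x, y, w, h).
raw_head = [2.0, -1.0, 0.5, 0.0, 1.5]          # [confidence, x, y, w, h]
confidence = logistic(raw_head[0])
box = [logistic(v) for v in raw_head[1:]]
```

Because the logistic function is bounded in (0, 1), the decoded confidence and box coordinates need no extra clamping.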
The invention also includes testing of the video target detection and recognition model: the verification data of the verification data set are fed to the optimized fully convolutional network to obtain predicted rectangular selection boxes, which are compared against the true rectangular selection boxes of the verification data set to compute the mean of the area under the function curve, mAP.
The training data set consists of various types of video at various resolutions, including the D5 format; the validation data set consists of 20% of the data reserved from the training data set.
The pre-training weights and thresholds are initialized as follows: before training, the weights and thresholds are initialized with the parameters of a pre-trained model obtained by training on a standard data set.
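A minimal sketch of this initialization step, assuming a simple name-and-shape matching rule (the patent does not specify the matching logic, and the parameter names below are hypothetical):

```python
def init_from_pretrained(model_params: dict, pretrained_params: dict) -> dict:
    """Copy parameters from a model pre-trained on a standard data set
    into the new model where names and shapes match; keep the fresh
    initialization otherwise."""
    out = dict(model_params)
    for name, value in pretrained_params.items():
        if name in out and len(out[name]) == len(value):
            out[name] = list(value)
    return out

# Hypothetical parameter dictionaries for illustration only.
fresh = {"conv1.w": [0.0, 0.0], "head.w": [0.0]}
pretrained = {"conv1.w": [0.3, -0.2], "fc.w": [1.0]}
params = init_from_pretrained(fresh, pretrained)
```

Parameters present only in the pre-trained model (here `fc.w`) are ignored, so the detection head keeps its fresh initialization.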
The learning rate decay strategy is as follows: model parameters are initialized before training with the parameters obtained for the video target detection and recognition model, and training uses mini-batch gradient descent with the learning rate lr of each iteration initially set to 0.001. The purpose is to retain a certain learning ability and memory ability at the same time, so that new knowledge can be learned without completely forgetting old knowledge. Because a large batch occupies much memory and easily overflows, the batch size generally does not exceed 128. The model can then output the confidence value and selection box coordinates of each relevant target class in each partition.
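The text fixes only the initial learning rate (0.001) and the batch-size ceiling (128); the step-decay factor and interval in the sketch below are illustrative assumptions:

```python
INITIAL_LR = 0.001   # initial per-iteration learning rate from the text
MAX_BATCH = 128      # batch size ceiling from the text

def decayed_lr(iteration: int, initial_lr: float = INITIAL_LR,
               decay: float = 0.1, step: int = 10000) -> float:
    """Step decay: shrink the learning rate by `decay` every `step`
    iterations, retaining learning ability while avoiding complete
    forgetting of earlier training. `decay` and `step` are assumptions."""
    return initial_lr * (decay ** (iteration // step))

def clamp_batch(requested: int) -> int:
    """Keep the mini-batch size at or below the memory-safe ceiling."""
    return min(requested, MAX_BATCH)
```

With these assumed settings, the learning rate stays at 0.001 for the first 10,000 iterations and drops by a factor of 10 at each subsequent step boundary.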
The composite loss value generated as a weighted linear combination of independent loss functions is optimized, and the overall loss function is obtained by minimization during training, as follows:
let S_i(x_g, y_g, ω_g, h_g) be the true rectangular selection box of the target, with (x_g, y_g) as its center point;
let S(x, y, ω, h) be the predicted rectangular selection box of the target, with (x, y) as its center point;
where x and y are the center-point coordinates of the predicted rectangular selection box, and ω and h are its width and height; x_g and y_g are the center-point coordinates of the true rectangular selection box, and ω_g and h_g are its width and height;
the error D, namely the Euclidean distance between the predicted rectangular selection box of the target detection algorithm and the ground-truth selection box of the target's real label, is weighted by the regularization hyperparameter λ to obtain the loss function L1 = λ·D, which generalizes the detection and recognition capability; at the same time the Intersection over Union loss value L2 = λ·IOU is computed, where IOU is the standard for measuring the accuracy of detecting the corresponding object in a specific data set, and the regularization hyperparameter λ ∈ [0, ∞);
after each input image passes through the network, the 81 partitions produce 486 probabilities: each spatial point carries a probability Pr that quantitatively measures whether a target is present, and the 81 partitions give the probabilities Pc of the 6 classes; these must be composited with the partition's quantitative target probability Po, so that once it is determined that a target exists in a partition, the conditional probability that it belongs to a given class is composited to yield the unconditional probability of each single independent class, giving the corresponding classification loss value Lc = Pc·Po; the overall loss function is L = L1 + L2 + Lc.
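Under stated assumptions, the composite loss L = L1 + L2 + Lc can be sketched numerically. Two points are assumptions of this sketch, not of the patent: D is taken as the Euclidean distance between box centers, and L2 is implemented as λ·IOU exactly as written in the text (many detectors instead use λ·(1 − IoU)):

```python
import math

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def composite_loss(pred, truth, pc, po, lam=0.5):
    """L = L1 + L2 + Lc per the text: L1 = lam * D (here: Euclidean
    center distance, an assumption), L2 = lam * IOU, Lc = Pc * Po.
    lam is the regularization hyperparameter, lam in [0, inf)."""
    d = math.hypot(pred[0] - truth[0], pred[1] - truth[1])
    l1 = lam * d
    l2 = lam * iou(pred, truth)
    lc = pc * po
    return l1 + l2 + lc
```

The value of λ (here 0.5) is an illustrative choice within the stated range [0, ∞).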
In another embodiment, the convolutional and pooling layers used by the video target detection and recognition model support multi-size image input; a single-size data set is augmented by preprocessing, and during training images of a different size are randomly selected after every M batches (M = 20 batches). The input sizes of the network model are multiples of 64, i.e. {320, 384, …, 1216}, the smallest being 320×320 and the largest 1216×1216. The model only needs to be fine-tuned to the corresponding size before training of the next batch continues.
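The multi-scale schedule above (a new size drawn from the multiples of 64 between 320 and 1216, every M = 20 batches) can be sketched as follows; the random seed and window bookkeeping are implementation details assumed for illustration:

```python
import random

M = 20                                    # batches per size window, from the text
SIZES = list(range(320, 1216 + 1, 64))    # 320, 384, ..., 1216 (multiples of 64)

def size_for_batch(batch_index: int, rng: random.Random) -> int:
    """Return the square input size for this batch; a new random size is
    drawn only at the start of each M-batch window."""
    if batch_index % M == 0:
        size_for_batch.current = rng.choice(SIZES)
    return size_for_batch.current

rng = random.Random(0)
schedule = [size_for_batch(i, rng) for i in range(60)]   # three 20-batch windows
```

Within each 20-batch window the size is constant, so the model is fine-tuned to a new input resolution only at window boundaries.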
In the detection of multiple target classes, a recall-precision curve can be drawn for each target class, and AveP(q) is the area under that curve, as shown in Equation 8; mAP is the mean of the AP values over the Q classes, where the class index q is an integer from 1 to Q;
mAP = (1/Q) · Σ_{q=1}^{Q} AveP(q)    (8)
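Equation 8 can be sketched numerically as follows, approximating AveP(q) by the trapezoidal area under each recall-precision curve; the two curves are hypothetical illustration data, not results from the patent:

```python
def avep(recall_precision):
    """Area under a recall-precision curve given as sorted (recall,
    precision) points, via the trapezoidal rule."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(recall_precision, recall_precision[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

def mean_average_precision(curves):
    """mAP = (1/Q) * sum over classes q = 1..Q of AveP(q), per Equation 8."""
    return sum(avep(c) for c in curves) / len(curves)

# Hypothetical two-class example.
curves = [
    [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)],   # class 1: AveP = 0.875
    [(0.0, 1.0), (1.0, 0.0)],               # class 2: AveP = 0.5
]
map_value = mean_average_precision(curves)   # (0.875 + 0.5) / 2 = 0.6875
```

In practice the recall-precision points would come from sweeping the confidence threshold over the predicted selection boxes of the verification data set.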
With the model designed by the method, the classification confidence value and the rectangular selection box are computed at the last layer of the network model, realizing a real-time detection and recognition model for vehicle and pedestrian targets in video. The method can be used for target detection and recognition in high-frame-rate high-definition video under demanding, complex and variable practical application scenes. Compared with classical methods, it markedly improves accuracy as measured by mAP and real-time performance as measured by frames per second (up to 70 fps is supported), meeting the performance requirements of high-frame-rate high-definition video target detection and recognition.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A target detection method, comprising:
constructing a video target detection and recognition model, inputting the labeled data of a training data set into a fully convolutional network through a high-speed storage medium, optimizing a composite loss value generated as a weighted linear combination of independent loss functions, and obtaining the overall loss function by minimization during training;
the video target detection and recognition model computes a classification confidence value and a rectangular selection box using a logistic activation function at the last layer of the network model, thereby realizing real-time detection and recognition of vehicle and pedestrian targets in video.
2. The method of claim 1, further comprising testing the video target detection and recognition model: the verification data of a verification data set are fed to the optimized fully convolutional network to obtain predicted rectangular selection boxes, which are compared against the true rectangular selection boxes of the verification data set to compute the mean of the area under the function curve.
3. The method of claim 2, wherein pre-training weights and thresholds are initialized and input to the fully convolutional network, with a learning rate decay strategy applied during learning.
4. The method of claim 3, wherein the pre-training weights and thresholds are initialized as follows: before training, the weights and thresholds are initialized with the parameters of a pre-trained model obtained by training on a standard data set.
5. The method of claim 4, wherein the learning rate decay strategy comprises: model parameters are initialized before training with the parameters obtained for the video target detection and recognition model, training uses mini-batch gradient descent, and the learning rate lr of each iteration is initially set to 0.001; the model can then output the confidence value and selection box coordinates of each relevant target class in each partition.
6. The method of claim 1, wherein the composite loss value generated as a weighted linear combination of independent loss functions is optimized, and the overall loss function is obtained by minimization during training, as follows:
let S_i(x_g, y_g, ω_g, h_g) be the true rectangular selection box of the target, with (x_g, y_g) as its center point;
let S(x, y, ω, h) be the predicted rectangular selection box of the target, with (x, y) as its center point;
where x and y are the center-point coordinates of the predicted rectangular selection box, and ω and h are its width and height; x_g and y_g are the center-point coordinates of the true rectangular selection box, and ω_g and h_g are its width and height;
the error D, namely the Euclidean distance between the predicted rectangular selection box of the target detection algorithm and the ground-truth selection box of the target's real label, is weighted by the regularization hyperparameter λ to obtain the loss function L1 = λ·D, which generalizes the detection and recognition capability; at the same time the Intersection over Union loss value L2 = λ·IOU is computed, where IOU is the standard for measuring the accuracy of detecting the corresponding object in a specific data set;
after each input image passes through the network, the 81 partitions produce 486 probabilities: each spatial point carries a probability Pr that quantitatively measures whether a target is present, and the 81 partitions give the probabilities Pc of the 6 classes; these must be composited with the partition's quantitative target probability Po, so that once it is determined that a target exists in a partition, the conditional probability that it belongs to a given class is composited to yield the unconditional probability of each single independent class, giving the corresponding classification loss value Lc = Pc·Po; the overall loss function is L = L1 + L2 + Lc.
7. The method of claim 1, wherein the convolutional and pooling layers of the video target detection and recognition model support multi-size image input, a single-size data set is augmented by preprocessing, and images of a different size are randomly selected after every M batches of training.
8. The method of claim 2, wherein in the detection of multiple target classes, a recall-precision curve can be drawn for each target class, and AveP(q) is the area under that curve, as shown in Equation 8; mAP is the mean of the AP values over the Q classes, where the class index q is an integer from 1 to Q;
mAP = (1/Q) · Σ_{q=1}^{Q} AveP(q)    (8)
9. The method of claim 2, wherein the training data set consists of various types of video at various resolutions, including the D5 format; and the validation data set consists of 20% of the data reserved from the training data set.
CN202010137294.8A 2019-12-30 2020-03-03 Target detection method Pending CN111353440A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019113972729 2019-12-30
CN201911397272 2019-12-30

Publications (1)

Publication Number Publication Date
CN111353440A true CN111353440A (en) 2020-06-30

Family

ID=71197229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010137294.8A Pending CN111353440A (en) 2019-12-30 2020-03-03 Target detection method

Country Status (1)

Country Link
CN (1) CN111353440A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860510A (en) * 2020-07-29 2020-10-30 浙江大华技术股份有限公司 X-ray image target detection method and device
CN112149501A (en) * 2020-08-19 2020-12-29 北京豆牛网络科技有限公司 Method and device for identifying packaged fruits and vegetables, electronic equipment and computer readable medium
CN112613462A (en) * 2020-12-29 2021-04-06 安徽大学 Weighted intersection ratio method
CN112613462B (en) * 2020-12-29 2022-09-23 安徽大学 Weighted intersection ratio method
CN113537242A (en) * 2021-07-19 2021-10-22 安徽炬视科技有限公司 Small target detection algorithm based on dense deconvolution and specific loss function

Similar Documents

Publication Publication Date Title
CN110135267B (en) Large-scene SAR image fine target detection method
CN111652321B (en) Marine ship detection method based on improved YOLOV3 algorithm
CN112396002B (en) SE-YOLOv 3-based lightweight remote sensing target detection method
CN111353440A (en) Target detection method
CN114202672A (en) Small target detection method based on attention mechanism
CN112101430B (en) Anchor frame generation method for image target detection processing and lightweight target detection method
CN111126472A (en) Improved target detection method based on SSD
CN111079739B (en) Multi-scale attention feature detection method
Zhou et al. Octr: Octree-based transformer for 3d object detection
CN112927279A (en) Image depth information generation method, device and storage medium
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN112801183A (en) Multi-scale target detection method based on YOLO v3
CN111310821A (en) Multi-view feature fusion method, system, computer device and storage medium
CN116310850B (en) Remote sensing image target detection method based on improved RetinaNet
CN116266387A (en) YOLOV4 image recognition algorithm and system based on re-parameterized residual error structure and coordinate attention mechanism
CN114565842A (en) Unmanned aerial vehicle real-time target detection method and system based on Nvidia Jetson embedded hardware
KR20210093875A (en) Video analysis methods and associated model training methods, devices, and devices
CN115661767A (en) Image front vehicle target identification method based on convolutional neural network
CN117853955A (en) Unmanned aerial vehicle small target detection method based on improved YOLOv5
CN114494823A (en) Commodity identification, detection and counting method and system in retail scene
CN113435324B (en) Vehicle target detection method and device and computer readable storage medium
CN113139540B (en) Backboard detection method and equipment
CN118279320A (en) Target instance segmentation model building method based on automatic prompt learning and application thereof
US20230343082A1 (en) Encoding of training data for training of a neural network
CN114998672B (en) Small sample target detection method and device based on meta learning

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200630