CN114565867A - Unmanned aerial vehicle scene video target detection method based on convolutional neural network - Google Patents

Unmanned aerial vehicle scene video target detection method based on convolutional neural network

Info

Publication number
CN114565867A
Authority
CN
China
Prior art keywords
model
target
neural network
convolutional neural
minidet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210085038.8A
Other languages
Chinese (zh)
Inventor
卢湖川 (Lu Huchuan)
赵庆宇 (Zhao Qingyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202210085038.8A priority Critical patent/CN114565867A/en
Publication of CN114565867A publication Critical patent/CN114565867A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of video target detection within computer vision and provides an unmanned aerial vehicle (UAV) scene video target detection method based on a convolutional neural network. The method combines a global HeavyDet model with a local MiniDet model. The framework fully exploits the correlation of pedestrian information between consecutive video frames to narrow the region to be searched, reducing the computational load; to address the limited computing power of embedded platforms, TensorRT quantization acceleration is used to further speed up network inference, achieving a good balance between accuracy and speed. The invention deploys the algorithm framework on an NVIDIA Jetson TX2 embedded platform for detecting pedestrians in UAV scenes; the platform offers small size, low power consumption, and suitability for embedded applications.

Description

Unmanned aerial vehicle scene video target detection method based on convolutional neural network
Technical Field
The invention belongs to the field of video target detection within computer vision, and relates in particular to image classification, target detection, and neural network quantization acceleration technologies, specifically to an unmanned aerial vehicle scene video target detection method based on a convolutional neural network.
Background
With the rapid development of low-cost commercial unmanned aerial vehicles (UAVs), video surveillance in UAV scenes is attracting more and more attention. Several studies have addressed this problem from different aspects, but few attempts have been made on embedded platforms. This work is primarily directed at developing an effective and efficient drone-based pedestrian detection algorithm framework on the NVIDIA Jetson TX2 platform.
Existing detectors can be broadly divided into two-stage detectors (e.g., Faster R-CNN) and single-stage detectors (e.g., YOLO and SSD). A two-stage detector first generates region proposals and uses a sub-network to classify and refine them, while a single-stage detector directly produces the final result without generating region proposals. Generally, single-stage detectors are faster, while two-stage detectors are more accurate. Considering the computing power of embedded platforms, we choose the MobileNet-based SSD detector as our base model.
In UAV scenes, the image resolution is typically high, but the people in the field of view are relatively small, so balancing the speed and accuracy of the algorithm is difficult. If the detector is applied directly to the high-resolution image, the computational cost is huge for an embedded platform; but if the image is simply downscaled to a low resolution, some targets cannot be recognized because their appearance information becomes very limited. In our observations, most areas in the camera view are free of targets, so skipping the computation for these areas greatly speeds up detection while maintaining good performance. In particular, we use the temporal and spatial relationships between frames to determine where to detect.
In this work, two MobileNetV1-based SSD detectors, namely HeavyDet and MiniDet, are combined. HeavyDet is a powerful global detector that processes the entire image in a sliding-window fashion and finds targets over the whole field of view. To fully exploit the temporal-spatial information of the video sequence, it is assumed that targets move little within a short time, so the results of previous frames are used to determine local search areas in the current frame. These search areas are handled by the MiniDet model, which uses a very small input size, making it far more efficient than the HeavyDet model. Furthermore, the HeavyDet and MiniDet models interact dynamically to achieve a good balance between accuracy and speed.
Disclosure of Invention
The invention aims to provide a UAV scene video target detection framework that addresses the following problem in the prior art: given the limited computing power of the embedded platforms carried on UAVs, video target detection inference is slow and cannot meet the real-time requirements of practical application scenarios.
The technical scheme of the invention is as follows:
an unmanned aerial vehicle scene video target detection method based on a convolutional neural network comprises the following steps:
step 1, constructing a convolutional neural network model, wherein the convolutional neural network model comprises a global HeavyDet model and a local MiniDet model which are dynamically interactive;
the HeavyDet model is an SSD detector based on MobileNet; an original image is divided into a plurality of sub-regions, adjacent sub-regions partially overlap, and all the sub-regions of a picture are then input into the HeavyDet model as one batch; an improved NMS is next used to eliminate false positives caused by targets lying in overlapping regions;
the improved NMS adds a preprocessing operation to conventional NMS, as follows: before conventional NMS is executed, the position coordinates of the target bounding boxes from all sub-regions are mapped back to the original picture to obtain the bounding boxes in the original picture's coordinate system, and conventional NMS is then executed;
the MiniDet model is an SSD detector based on MobileNet, taking a search area as its input, and returning the position of the target in the search area;
step 2, acquiring a video sequence to be detected, wherein the video sequence contains a plurality of pedestrians in an unmanned aerial vehicle scene, and the input of the convolutional neural network model is any single frame image of the video sequence;
step 3, using the convolutional neural network model to detect pedestrian targets in the unmanned aerial vehicle scene video sequence to obtain detection results;
taking each frame of image in the video sequence obtained in step 2 as the input of the convolutional neural network model constructed in step 1, and then predicting the positions of all pedestrians in the input image by using the convolutional neural network model;
the HeavyDet model is responsible for finely detecting pedestrians over the whole picture and then expanding the detection results into search areas; the MiniDet model is responsible for refining the results within the search areas frame by frame; the HeavyDet model and the MiniDet model are executed alternately;
when predicting pedestrian positions, the HeavyDet model searches the whole image to preliminarily locate targets, and a region enlarged by a factor of 1.5 about the geometric center of each bounding box detected by the HeavyDet model is then used as a MiniDet search area; the MiniDet model is applied to the search area to obtain a bounding box for the target, which is in turn expanded by a factor of 1.5 as the search area for the next frame; this continues until the HeavyDet model detects over the whole image again, refining the MiniDet results and initializing new targets entering the scene;
when the MiniDet model fails to find any target in a search area, the target's bounding box from the last frame is retained and its score is reduced; the position of a search area is updated only when targets are detected in it or the HeavyDet model is launched; expanding a detection bounding box may cover adjacent targets and thereby cause repeated detections, and the following two methods are adopted to solve this problem: the first is to ignore targets that are only partially visible during the training phase; the second is to perform the improved NMS when collecting the MiniDet results;
step 4, outputting the detection result, and visually outputting the bounding boxes of all pedestrians based on the detection result;
step 5, obtaining a convolutional neural network model, comprising:
building a convolutional neural network model to be trained;
acquiring a training data set, a testing data set and a verification data set;
training the convolutional neural network model by using a training data set to obtain the weight of the trained convolutional neural network model;
the trained convolutional neural network model is quantitatively accelerated using tensorrt.
The acquiring of the training data set, the testing data set and the verification data set comprises:
acquiring a pedestrian detection data set of an unmanned plane scene;
manually labeling all data in the detection data set;
and splitting the labeled data set into a training data set, a testing data set and a verification data set.
The invention has the beneficial effects that:
(1) A good balance between accuracy and speed is achieved
The invention provides an effective and efficient pedestrian detection framework built from the HeavyDet and MiniDet models. The framework fully exploits the correlation of pedestrian information between consecutive video frames to narrow the region to be searched, reducing the computational load; to address the limited computing power of embedded platforms, TensorRT quantization acceleration is used to further speed up network inference, achieving a good balance between accuracy and speed. The algorithm framework achieves satisfactory performance on the NVIDIA Jetson TX2 platform, running at over 5 fps.
(2) Wider applicability and easy deployment to embedded platforms
The invention deploys the algorithm framework on an NVIDIA Jetson TX2 embedded platform for detecting pedestrians in UAV scenes; the platform offers small size, low power consumption, and suitability for embedded applications.
TensorRT is a tool for optimizing the deployment side of an algorithm and supports most mainstream deep learning applications well. The invention uses TensorRT to run fast inference on the model at FP16 precision, greatly reducing latency, ensuring optimal inference performance, and meeting the strict latency and throughput requirements of computationally constrained embedded platforms. In addition, the invention packages the program and its dependencies into a lightweight, portable Docker container, which is then deployed to the NVIDIA Jetson TX2. A sketch of building such an FP16 engine is given below.
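The patent does not give the build script itself; the following is a minimal sketch of constructing an FP16 engine with the TensorRT 8 Python API, assuming the trained detector has been exported to ONNX (the file names are illustrative assumptions):

```python
import tensorrt as trt

def build_fp16_engine(onnx_path="detector.onnx", plan_path="detector.plan"):
    """Build a TensorRT engine with FP16 optimization enabled.

    Sketch for the TensorRT 8 Python API; the ONNX/plan file names are
    illustrative assumptions, not part of the patent."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)    # request FP16 precision
    plan = builder.build_serialized_network(network, config)
    if plan is None:
        raise RuntimeError("engine build failed")
    with open(plan_path, "wb") as f:
        f.write(plan)                        # serialized engine for Jetson TX2
```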
Drawings
FIG. 1 is the overall framework of the detection algorithm. HeavyDet and MiniDet operate alternately to detect pedestrians in a video sequence.
FIG. 2 is an example of the sliding-window strategy of HeavyDet. The original image is segmented into overlapping sub-regions that are input to HeavyDet as one batch.
FIG. 3 is an example of expanding a detected target region. The target region detected in the previous frame is expanded in both the horizontal and vertical directions as a local search area for MiniDet.
FIG. 4 is an example of regressing the fine position of a target within the expanded search area. MiniDet processes a given search area to update the location of the target.
FIG. 5 shows example visualized detection results of the method herein.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
An unmanned aerial vehicle scene video target detection method based on a convolutional neural network comprises the following steps:
step 1, obtaining a convolutional neural network model, which comprises a strong global HeavyDet model and a fast local MiniDet model. The former carefully detects pedestrians over the whole image through a series of sliding windows; the latter performs a local search around the previous detection results. The HeavyDet and MiniDet models interact dynamically.
The HeavyDet model is a MobileNet-based SSD detector. To pursue high accuracy, the input image size should be as large as possible; but if the whole image (1920 x 1080) were taken as input, the memory of the NVIDIA Jetson TX2 would be insufficient. To solve this problem, the original image is divided into a number of sub-regions, which are input as one batch to the HeavyDet model. This approach not only greatly reduces the algorithm's run-time memory requirements, but also helps improve detection performance, because target bounding boxes are much easier to regress on small images, especially when the targets are relatively small. Naively segmenting the original image into multiple sub-regions would split targets lying on region edges, leaving target bounding boxes cut in half or producing false negatives. Thus, in the implementation of the present framework, adjacent sub-regions overlap, as sketched below. In addition, an improved version of non-maximum suppression (NMS) is used to eliminate false positives caused by targets lying in overlapping regions.
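A minimal sketch of this overlapping tiling follows; the tile size and overlap ratio are illustrative assumptions, as the patent does not fix them:

```python
import numpy as np

def _positions(full, tile, step):
    """Start offsets along one axis; the last tile is pinned to the edge."""
    xs = list(range(0, full - tile + 1, step))
    if xs[-1] != full - tile:
        xs.append(full - tile)
    return xs

def split_into_tiles(image, tile_w=640, tile_h=540, overlap=0.2):
    """Split a full frame (e.g. 1920x1080) into overlapping sub-regions
    and stack them as one batch. Tile size and overlap ratio are
    illustrative assumptions. Returns the batch plus each tile's
    top-left offset, needed later to map detections back to frame
    coordinates."""
    h, w = image.shape[:2]
    step_x = int(tile_w * (1.0 - overlap))
    step_y = int(tile_h * (1.0 - overlap))
    tiles, offsets = [], []
    for y in _positions(h, tile_h, step_y):
        for x in _positions(w, tile_w, step_x):
            tiles.append(image[y:y + tile_h, x:x + tile_w])
            offsets.append((x, y))
    return np.stack(tiles), offsets
```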
Conventional NMS is mainly used in target detection to keep high-confidence bounding boxes in a picture while suppressing low-confidence false detections. In general, the number of target bounding boxes output by the model is very large (the exact number is determined by the number of anchors), and many of the repeated boxes localize the same target; NMS removes these repeated boxes to obtain the true target bounding boxes. In this method, the original picture is first split into multiple overlapping sub-regions, and conventional NMS cannot eliminate duplicate detections across sub-regions, so conventional NMS is improved by adding a preprocessing operation: before conventional NMS is executed, the position coordinates of the target bounding boxes from all sub-regions are first mapped back to the original picture, yielding bounding boxes in the original picture's coordinate system, and conventional NMS is then executed (see the sketch below).
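A sketch of this improved NMS under the assumptions of the tiling sketch above (boxes in [x1, y1, x2, y2] form; the IoU threshold is an illustrative choice):

```python
import numpy as np

def nms(boxes, scores, iou_thr):
    """Plain greedy NMS over [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(xx2 - xx1, 0) * np.maximum(yy2 - yy1, 0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou <= iou_thr]
    return keep

def improved_nms(tile_boxes, tile_scores, offsets, iou_thr=0.5):
    """Improved NMS described above: first map every tile-local box back
    into the original image coordinate system, then run conventional NMS
    over the merged set so duplicates from overlapping tiles are removed."""
    boxes, scores = [], []
    for (x_off, y_off), b, s in zip(offsets, tile_boxes, tile_scores):
        b = np.asarray(b, dtype=np.float32)
        if len(b) == 0:
            continue
        b[:, [0, 2]] += x_off            # shift x1, x2 to frame coordinates
        b[:, [1, 3]] += y_off            # shift y1, y2 to frame coordinates
        boxes.append(b)
        scores.append(np.asarray(s, dtype=np.float32))
    if not boxes:
        return np.zeros((0, 4)), np.zeros((0,))
    boxes, scores = np.concatenate(boxes), np.concatenate(scores)
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```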
The MiniDet model is also a MobileNet-based SSD detector; it takes a small search area as input and returns the location of the target within that area. Each search area is obtained by expanding a previously detected target bounding box: the search area is 1.5 times the size of the detected bounding box, and their center positions coincide (a sketch of this expansion follows). Since people typically move slowly, the search areas tend to cover all previously detected targets. When processing images with MiniDet, the many regions without targets are ignored, so the computational load is reduced and the detector is faster.
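A minimal sketch of the 1.5x expansion about the box center, clipped to the frame:

```python
def expand_box(box, frame_w, frame_h, scale=1.5):
    """Expand a detected box [x1, y1, x2, y2] by `scale` about its
    geometric center to form the MiniDet search area, clipped to the
    frame. The 1.5x factor follows the description above."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (max(cx - half_w, 0), max(cy - half_h, 0),
            min(cx + half_w, frame_w), min(cy + half_h, frame_h))
```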
Step 2, acquiring a video sequence to be detected. The video sequence contains a plurality of pedestrians in an unmanned aerial vehicle scene; it may be a pre-recorded video or one acquired online in real time by an image recording device such as a camera, with no strict requirement imposed here. The input of the convolutional neural network is any single frame image of the video sequence.
Step 3, using the convolutional neural network model to detect pedestrian targets in the unmanned aerial vehicle scene video sequence to obtain detection results;
taking each frame of image in the video sequence as the input of the convolutional neural network, and then predicting the positions of all pedestrians in the input image by using the convolutional neural network;
the target detection algorithm of the method consists of HeavyDet and MiniDet, and is based on the SSD target detection algorithm. The HeavyDet model is responsible for finely detecting pedestrians in the whole picture and then expanding the detection result into a small search area, and the MiniDet model is responsible for correcting the pedestrian in the search area frame by frame. The HeavyDet model and the MiniDet model are alternately executed in this way, so that the balance between speed and accuracy is achieved.
Specifically, when predicting pedestrian positions, the HeavyDet model searches the entire image carefully to preliminarily locate targets, and the HeavyDet detection results are then expanded by a factor of 1.5 as MiniDet search areas. The MiniDet model is applied to each search area to obtain a bounding box for the target, which is in turn expanded by a factor of 1.5 as the search area for the next frame. After several frames have been processed, HeavyDet again examines the entire image to refine the MiniDet results and initialize new targets entering the scene. HeavyDet and MiniDet alternate dynamically in this manner, achieving a good balance between accuracy and speed.
It should be noted that the MiniDet model may fail during its lifetime. If MiniDet misses an object, the corresponding search area would disappear, so the object would not be identified again until the next HeavyDet pass detects it. To correct this, search areas are retained for a period of time, so that even if an object goes undetected for a while, its trajectory does not immediately stop. Specifically, when MiniDet fails to find any target in a search area, the target box from the last frame is kept and its score is reduced. The position of a search area is updated only when targets are detected in it or HeavyDet is launched. Notably, expanding a detection bounding box may cover nearby objects and thereby produce duplicate detections. The following two approaches address this problem: the first is to ignore targets that are only partially visible during the training phase, which makes the model more sensitive to whole people rather than cropped people; the second is to perform NMS when collecting the MiniDet results, a conventional way of handling duplicate detections.
Furthermore, when the camera moves rapidly, the positions of objects within the scene can change drastically, making it very difficult for MiniDet to detect targets continuously. In the algorithm framework herein, the number of search areas in which no target is detected is therefore monitored: if the proportion of search areas in which no target can be detected is high enough, HeavyDet is restarted to search the entire image again. It is not necessary to launch HeavyDet this often in all cases; the framework also examines the number of detected targets, and if that number is stable, the frequency of launching HeavyDet is reduced.
Step 4, outputting the detection result: based on the detection result, the bounding boxes of all pedestrians are visually output.
Step 5, the obtaining of the convolutional neural network model comprises the following steps:
building a convolutional neural network model to be trained;
acquiring a training data set, a testing data set and a verification data set;
and training the convolutional neural network model by using the training data set to obtain the weight of the trained convolutional neural network model.
The trained convolutional neural network model is then quantized and accelerated using TensorRT.
step 6, the acquiring of the training data set, the testing data set and the verification data set comprises:
acquiring a pedestrian detection data set of an unmanned plane scene;
manually labeling all data in the detection data set;
and splitting the labeled data set into a training data set, a testing data set and a verification data set.
The embodiment of the invention provides an unmanned aerial vehicle scene video target detection method based on a convolutional neural network. The HeavyDet model searches the entire image to preliminarily locate targets, and the HeavyDet results are then expanded into MiniDet search areas. The MiniDet model is applied to each search area to obtain a bounding box for the target, which is in turn expanded as the search area for the next frame. After a period of processing, HeavyDet again examines the entire image to refine the MiniDet results and initialize new targets entering the scene. FIG. 1 and Table 1 show the overall framework flow of the method.
TABLE 1. HeavyDet and MiniDet detection procedure (provided as images in the original publication)
The overall algorithm framework uses 5 frames as one period; the first frame of each period is detected with HeavyDet, and the following frames with MiniDet. When the scene changes substantially or targets move far between frames, the MiniDet search areas may fail to keep up with the targets, causing targets to be lost and missed. For this case a target-loss threshold is set: when the search areas in which no target is detected exceed 40% of all search areas, HeavyDet detection is started for the next 3 frames; once the target-loss percentage drops below the threshold, the detection flow returns to normal (a sketch of this schedule is given after this paragraph). The processing flow of the invention is illustrated by the following example.
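A sketch of this alternation schedule, reusing the expand_box helper above; heavy_det and mini_det stand in for the two SSD detectors, and the score decay factor is an illustrative assumption (the patent only states that a retained box's score is reduced):

```python
def run_detection(frames, heavy_det, mini_det, period=5,
                  loss_thr=0.4, heavy_burst=3, score_decay=0.8):
    """Sketch of the HeavyDet/MiniDet alternation schedule.

    heavy_det(frame) is assumed to return a list of (box, score) pairs
    over the whole frame (the tiled SSD plus improved NMS above);
    mini_det(frame, areas) is assumed to return one detection list per
    search area (empty list = miss)."""
    tracks = []            # [box, score] pairs carried from frame to frame
    force_heavy = 0        # remaining frames of forced HeavyDet detection
    for idx, frame in enumerate(frames):
        h, w = frame.shape[:2]
        if idx % period == 0 or force_heavy > 0:
            force_heavy = max(force_heavy - 1, 0)
            tracks = [[box, score] for box, score in heavy_det(frame)]
        else:
            areas = [expand_box(box, w, h) for box, _ in tracks]
            missed = 0
            for track, dets in zip(tracks, mini_det(frame, areas)):
                if dets:                     # area confirmed: update position
                    track[0], track[1] = dets[0]
                else:                        # miss: keep old box, decay score
                    track[1] *= score_decay
                    missed += 1
            # over 40% of search areas lost their target: restart HeavyDet
            if areas and missed / len(areas) > loss_thr:
                force_heavy = heavy_burst
        yield [(box, score) for box, score in tracks]
```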
First, as shown in FIG. 1 and Table 1, a sequence of video frames must be acquired; these may be pre-recorded videos or videos acquired online in real time by an image recording device such as a camera, with no strict requirement imposed here. The input picture in FIG. 1 is an example input.
Secondly, if the current stage (one stage every 5 frames) has ended or fast motion has occurred, the first frame of the stage is partitioned into sub-regions with overlapping areas by the method shown in FIG. 2, and all the resulting sub-images are input into the HeavyDet model as one batch, yielding the HeavyDet detection results.
Thirdly, the improved version of NMS is used to eliminate false positives in the HeavyDet detection results caused by targets lying in overlapping regions, yielding detection results with false-positive targets removed.
Fourthly, for the last four frames of the current stage, the method shown in FIG. 3 is used to expand the bounding boxes of all targets in the third step's detection results by a factor of 1.5, and the expanded boxes serve as the MiniDet search areas. The solid-line box in FIG. 3 is a target bounding box detected by the HeavyDet model; the dashed-line box is obtained by expanding the detected bounding box to 1.5 times its size, with the two boxes' center positions coinciding.
Fifthly, all search areas from the fourth step are input into the MiniDet model as one batch, and the positions of the targets within the search areas, as corrected by MiniDet, are output; a sketch of preparing this batch follows.
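A minimal sketch of preparing that batch by cropping and resizing each search area (the square input size is an illustrative assumption; MiniDet outputs must be mapped back to frame coordinates with the returned metadata):

```python
import cv2
import numpy as np

def minidet_batch(frame, search_areas, input_size=128):
    """Crop each search area from the frame and resize it to MiniDet's
    (assumed) square input size, stacking the crops into one batch.
    Each area's top-left corner and resize scale are returned so that
    area-local detections can be mapped back to frame coordinates."""
    crops, meta = [], []
    for (x1, y1, x2, y2) in search_areas:
        x1, y1, x2, y2 = map(int, (x1, y1, x2, y2))
        crop = frame[y1:y2, x1:x2]
        crops.append(cv2.resize(crop, (input_size, input_size)))
        meta.append((x1, y1,
                     (x2 - x1) / float(input_size),   # x scale back to frame
                     (y2 - y1) / float(input_size)))  # y scale back to frame
    return np.stack(crops), meta
```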
Sixthly, the process enters the next stage, and the above steps are repeated until the video frames are exhausted.
The final detection results are shown in FIG. 5, from which it can be seen that the video target detection framework of the invention maintains good detection performance even in complex scenes with many people and small targets.

Claims (2)

1. An unmanned aerial vehicle scene video target detection method based on a convolutional neural network is characterized by comprising the following steps:
step 1, constructing a convolutional neural network model, wherein the convolutional neural network model comprises a global HeavyDet model and a local MiniDet model which are dynamically interactive;
the HeavyDet model is an SSD detector based on MobileNet; an original image is divided into a plurality of sub-regions, adjacent sub-regions partially overlap, and all the sub-regions of a picture are then input into the HeavyDet model as one batch; an improved NMS is next used to eliminate false positives caused by targets lying in overlapping regions;
the improved NMS adds a preprocessing operation to conventional NMS, as follows: before conventional NMS is executed, the position coordinates of the target bounding boxes from all sub-regions are mapped back to the original picture to obtain the bounding boxes in the original picture's coordinate system, and conventional NMS is then executed;
the MiniDet model is an SSD detector based on MobileNet, taking a search area as its input, and returning the position of the target in the search area;
step 2, acquiring a video sequence to be detected, wherein the video sequence to be detected contains a plurality of pedestrians in an unmanned aerial vehicle scene, and the input of the convolutional neural network model is any single frame image of the video sequence;
step 3, using the convolutional neural network model to detect pedestrian targets in the unmanned aerial vehicle scene video sequence to obtain detection results;
taking each frame of image in the video sequence obtained in step 2 as the input of the convolutional neural network model constructed in step 1, and then predicting the positions of all pedestrians in the input image by using the convolutional neural network model;
the HeavyDet model is responsible for finely detecting pedestrians over the whole picture and then expanding the detected target bounding boxes into search areas; the MiniDet model is responsible for refining the results within the search areas frame by frame; the HeavyDet model and the MiniDet model are executed alternately;
when predicting pedestrian positions, the HeavyDet model searches the whole image to preliminarily locate targets, and a region enlarged by a factor of 1.5 about the geometric center of each target bounding box detected by the HeavyDet model is then used as a MiniDet search area; the MiniDet model is applied to the search area to obtain a bounding box for the target, which is in turn expanded by a factor of 1.5 as the search area for the next frame; this continues until the HeavyDet model detects over the whole image again, refining the MiniDet results and initializing new targets entering the scene;
when the MiniDet model fails to find any target in a search area, the target's bounding box from the last frame is retained and its score is reduced; the position of a search area is updated only when targets are detected in it or the HeavyDet model is launched; expanding a detection bounding box may cover adjacent targets and thereby cause repeated detections, and the following two methods are adopted to solve this problem: the first is to ignore targets that are only partially visible during the training phase; the second is to perform the improved NMS when collecting the MiniDet results;
step 4, outputting the detection result, and visually outputting the bounding boxes of all pedestrians based on the detection result;
step 5, obtaining a convolutional neural network model, comprising:
building a convolutional neural network model to be trained;
acquiring a training data set, a testing data set and a verification data set;
training the convolutional neural network model by using a training data set to obtain the weight of the trained convolutional neural network model;
the trained convolutional neural network model is quantitatively accelerated using tensorrt.
2. The method of claim 1, wherein the obtaining the training dataset, the testing dataset, and the verification dataset comprises:
acquiring a pedestrian detection data set of an unmanned plane scene;
manually labeling all data in the detection data set;
and splitting the labeled data set into a training data set, a testing data set and a verification data set.
CN202210085038.8A 2022-01-25 2022-01-25 Unmanned aerial vehicle scene video target detection method based on convolutional neural network Pending CN114565867A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210085038.8A CN114565867A (en) 2022-01-25 2022-01-25 Unmanned aerial vehicle scene video target detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210085038.8A CN114565867A (en) 2022-01-25 2022-01-25 Unmanned aerial vehicle scene video target detection method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN114565867A (en) 2022-05-31

Family

ID=81713283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210085038.8A Pending CN114565867A (en) 2022-01-25 2022-01-25 Unmanned aerial vehicle scene video target detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN114565867A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114937045A (en) * 2022-06-20 2022-08-23 四川大学华西医院 Hepatocellular carcinoma pathological image segmentation system
CN116311084A (en) * 2023-05-22 2023-06-23 青岛海信网络科技股份有限公司 Crowd gathering detection method and video monitoring equipment


Similar Documents

Publication Publication Date Title
US11302315B2 (en) Digital video fingerprinting using motion segmentation
CN111126152B (en) Multi-target pedestrian detection and tracking method based on video
US9454819B1 (en) System and method for static and moving object detection
US9443320B1 (en) Multi-object tracking with generic object proposals
US9349066B2 (en) Object tracking and processing
US7620266B2 (en) Robust and efficient foreground analysis for real-time video surveillance
US11527000B2 (en) System and method for re-identifying target object based on location information of CCTV and movement information of object
CN109035295B (en) Multi-target tracking method, device, computer equipment and storage medium
US20080187173A1 (en) Method and apparatus for tracking video image
CN114565867A (en) Unmanned aerial vehicle scene video target detection method based on convolutional neural network
Cao et al. Vehicle detection and tracking in airborne videos by multi-motion layer analysis
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
Santoro et al. Crowd analysis by using optical flow and density based clustering
Tang et al. Multiple-kernel adaptive segmentation and tracking (MAST) for robust object tracking
WO2010113417A1 (en) Moving object tracking device, moving object tracking method, and moving object tracking program
Wang et al. Fire detection in video surveillance using superpixel-based region proposal and ESE-ShuffleNet
Angelo A novel approach on object detection and tracking using adaptive background subtraction method
CN112347967B (en) Pedestrian detection method fusing motion information in complex scene
WO2020019353A1 (en) Tracking control method, apparatus, and computer-readable storage medium
Zhang et al. An optical flow based moving objects detection algorithm for the UAV
Le et al. Human detection and tracking for autonomous human-following quadcopter
Pan et al. Fourier domain pruning of mobilenet-v2 with application to video based wildfire detection
JP2002027480A (en) Dynamic image processing method and apparatus thereof
CN114821441A (en) Deep learning-based airport scene moving target identification method combined with ADS-B information
Che et al. Traffic light recognition for real scenes based on image processing and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination