CN107862705B - Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics - Google Patents

Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics

Info

Publication number
CN107862705B
Authority
CN
China
Prior art keywords
candidate
target
unmanned aerial vehicle
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711166232.4A
Other languages
Chinese (zh)
Other versions
CN107862705A (en)
Inventor
高陈强
杜莲
王灿
冯琦
汤林
汪澜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN201711166232.4A
Publication of CN107862705A
Application granted
Publication of CN107862705B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for detecting small unmanned aerial vehicle targets based on motion features and deep learning features, and belongs to the technical field of image processing and computer vision. First, the input video data set is processed by a video stabilization algorithm to compensate for camera motion, and motion candidate target regions are detected in the images. The video data set is divided into two parts, and the training set is used to train an improved candidate region generation network model based on depth features; the trained network generates candidate targets for the test-set video images. The candidate target regions are then fused. A dual-channel deep neural network model is trained on the training set and applied to obtain recognition results, and a target tracking method based on multi-layer depth features is applied to these recognition results to obtain the final position of the unmanned aerial vehicle. The invention can accurately detect unmanned aerial vehicles in video images and provides support for subsequent research in the field of intelligent unmanned aerial vehicle monitoring.

Description

Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics
Technical Field
The invention belongs to the technical field of image processing and computer vision, and relates to an unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics.
Background
At present, with the rapid increase in the availability and maturity of commercial drones, drone sales have multiplied, and drones flying in public areas are common. Drones appear not only in the cameras of variety shows and at romantic weddings; they also spray pesticides over farmland, replace workers in high-altitude cleaning operations, and are used for surveying and mapping photography, forest fire prevention, military reconnaissance, and the like. However, with the rapid development of drones, dangerous accidents caused by them are also increasing, bringing threats to public safety, privacy, military security, and so on.
In recent years, techniques for detecting unmanned aerial vehicles can be roughly classified into acoustic detection, radio frequency detection, radar detection, visual detection, and the like. Acoustic detection uses a microphone array to pick up the rotor noise of a flying drone and matches the detected noise against a database of recorded drone sounds, thereby identifying whether the noise belongs to a drone and judging whether a drone is approaching. This method is easily disturbed by environmental noise, and building a database of drone acoustic signatures is time-consuming. Radio frequency detection monitors radio frequencies within a certain range through a wireless receiver; this approach easily misidentifies an unknown radio frequency transmitter as a drone. Radar detection judges whether a drone is present by detecting and verifying electromagnetic waves scattered and reflected by the target; radar equipment is costly and energy-consuming, and is susceptible to environmental influences that create blind spots. Visual detection typically observes drones with one or more imaging devices and analyzes the image sequence to determine whether a drone is present. Vision-based drone detection is not easily disturbed by ambient noise, can locate the drone's position, can distinguish whether the drone carries dangerous goods, and can even obtain information such as the drone's flight trajectory and speed. Therefore, the visual detection method has great advantages over other means and can make up for their deficiencies.
At present, there is relatively little research on vision-based drone detection. Clearly, detecting a drone at a longer distance is more advantageous for avoiding the danger it poses in advance. Drones are smaller than targets such as pedestrians, aircraft, and vehicles; in long-range imaging in particular, a drone occupies only a very small part of the image, which makes vision-based drone detection even more difficult. Therefore, a detection algorithm that can effectively detect small drone targets in video is currently needed.
Disclosure of Invention
In view of the above, the present invention aims to provide a method for detecting small unmanned aerial vehicle targets based on motion features and deep learning features. The method uses a target tracking algorithm to track the drone and filter out false targets, and improves the convolutional neural network structure in view of the small size of the drone, so that the deep learning algorithm is suitable for small targets and can effectively detect drones in complex scenes, thereby improving the accuracy of drone detection.
In order to achieve the purpose, the invention provides the following technical scheme:
an unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics comprises the following steps:
s1: processing the input video data set through a video image stabilization algorithm to compensate the motion of a camera;
s2: detecting a motion candidate target area I from the video image after motion compensation by using a low-rank matrix analysis method, and removing tiny noise points in the motion candidate target area I by using an image post-processing module;
s3: dividing a data set of a video into a training set and a testing set, and training by using the training set to obtain an improved candidate region generation network model; processing the video image of the test set through the improved area generation network model to generate a candidate target area II;
s4: fusing the candidate target area I and the candidate target area II to obtain a candidate target area III;
s5: according to the candidate target area III, training by using a training set to obtain a dual-channel-based deep neural network, and then applying the dual-channel-based deep neural network to the candidate target of the test set to obtain a recognition result;
s6: and predicting the position of the target by using a correlation filtering algorithm, tracking and matching the stable target, filtering out false targets and obtaining the position of the unmanned aerial vehicle.
Further, in step S1, the video image stabilization algorithm includes:
s11: extracting characteristic points of each frame image by using an SURF algorithm;
s12: calculating an affine transformation model between two frames through the obtained feature matching points between the two frames of images;
s13: and compensating the current frame by using the obtained affine transformation model.
Further, in step S2, the process of detecting the motion candidate target region I by the low-rank matrix analysis method comprises the following steps:
S21: vectorizing the input video sequence images f_1, f_2, ..., f_n to form an image matrix

$$C = [\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_n]$$

where n is the number of video frames, f_n is the video image matrix of the n-th frame, and $\tilde{f}_n$ is the vectorized form of f_n;
S22: decomposing the matrix C into a low-rank matrix L and a sparse matrix S by the RPCA algorithm, where the low-rank matrix L represents the background and the sparse matrix S represents the candidate moving targets;
S23: performing noise filtering on the candidate moving targets by morphological opening and closing operations to remove fine noise points in the motion candidate regions.
Further, in step S3, the improved candidate area generation network model includes five convolutional layers and two fully-connected layers connected in sequence, wherein pooling layers are disposed between the first convolutional layer and the second convolutional layer, between the second convolutional layer and the third convolutional layer, and between the fifth layer and the first fully-connected layer;
step S3 specifically includes:
s31: dividing a data set of a video into a training set and a testing set;
s32: for the data of the training set, extracting manually marked positive samples in the image, and then randomly sampling a plurality of areas as negative samples;
s33: training by using positive and negative samples of a training set to obtain an improved candidate area generation network model;
s34: and processing the video image of the test set through the improved area generation network model to generate a candidate target area II.
Further, the width and height of the randomly sampled regions in step S32 are determined by the width and height of the positive samples, and the overlap between a negative sample and a positive sample satisfies

$$\mathrm{IoU}(r_g, r_n) = \frac{\left| r_g \cap r_n \right|}{\left| r_g \cup r_n \right|} < \tau$$

where IoU is the overlap ratio, $r_g$ is a positive sample region, $r_n$ is a randomly sampled negative sample region, and τ is the preset overlap threshold.
Further, the step S4 of obtaining the candidate target region III by fusion specifically includes:
S41: densely sampling the candidate target region I to obtain dense seed candidate regions;
S42: calculating the similarity between the dense seed candidate regions and the candidate target region II, and when the similarity satisfies Sim > μ with μ ∈ [0.6, 1), combining the two candidate regions, where Sim is the similarity between the dense seed candidate region and the candidate target region II;
S43: traversing all the candidate target regions I to obtain the final candidate target region III.
Further, in step S5, the dual-channel-based deep neural network includes a front-end module and a back-end module;
the front-end module consists of two parallel deep neural network models: one takes the candidate target region directly as input and passes it through a 6-layer convolutional neural network and 1 fully-connected layer; the other takes as input an expanded region of the original image centered on the candidate target region, and passes it through a 6-layer convolutional neural network and 1 fully-connected layer;
the back-end module takes the outputs of the two fully-connected layers obtained by the front-end module as input and, through 2 fully-connected layers and 1 softmax layer, obtains the classification information of each candidate region as the final classification result;
step S5 specifically includes:
s51: for the training data set, dividing the training data set of the candidate target area III obtained in the step S4 into positive and negative samples, and inputting the positive and negative samples into a two-channel-based deep neural network for training to obtain optimal weight;
s52: and applying the optimal weight to the candidate target areas of the test set obtained in the step S4 for classification, so as to obtain a final recognition result.
Further, step S6 specifically includes:
S61: given the known center position $(x_{t-1}, y_{t-1})$ of the target in the frame preceding the current frame t, for the improved candidate region generation network model obtained by training in step S5, sparsifying the convolution feature map arrays obtained from its last three convolutional layers, and then extracting the depth features of the target with the sparsified feature maps;
S62: constructing a correlation filter for the output features of each of the last three convolutional layers of the improved candidate region generation network model, convolving the features of each layer with the corresponding correlation filter from back to front, and calculating the corresponding confidence score f to obtain the new center position $(x_t, y_t)$ of the candidate target in the current frame;
S63: extracting depth features around the new center position, and updating the parameters of the correlation filter;
s64: and in consideration of the stability and continuity of the target motion of the unmanned aerial vehicle, filtering the candidate target area track with the tracking frame number less than the threshold value, and finally obtaining the tracking target as the detection result of the unmanned aerial vehicle.
Further, the step of constructing the correlation filter is:
S621: letting the size of the output feature be M × N × D and the depth feature be x, the objective function of the correlation filter is constructed as

$$w^{*} = \arg\min_{w} \sum_{m,n} \left\| w \cdot x_{m,n} - y(m,n) \right\|^{2} + \lambda \left\| w \right\|_{2}^{2}$$

where $w^{*}$ is the optimal solution of the objective function, w is the correlation filter, $x_{m,n}$ is the feature at pixel (m, n), λ is the regularization parameter (λ ≥ 0), and y(m, n) denotes the label of the pixel at (m, n);
y(m, n) obeys a two-dimensional Gaussian distribution:

$$y(m,n) = e^{-\frac{(m - M/2)^{2} + (n - N/2)^{2}}{2\sigma^{2}}}$$

where σ is the width of the Gaussian kernel;
S622: converting the objective function into the frequency domain by fast Fourier transform, the optimal solution of the objective function is obtained as

$$W^{d} = \frac{Y \odot \bar{X}^{d}}{\sum_{i=1}^{D} X^{i} \odot \bar{X}^{i} + \lambda}$$

where Y is the Fourier transform of y, ⊙ denotes the Hadamard product, $W^{d}$ is the optimal solution for the d-th channel, $X^{i}$ is the Fourier transform of the depth feature x on the i-th channel, the bar denotes complex conjugation, and d ∈ {1, 2, …, D};
S623: given a candidate target region of the next frame image, for the depth feature z of the candidate region, the response map corresponding to the correlation filter is

$$f(z) = \mathcal{F}^{-1}\left( \sum_{d=1}^{D} W^{d} \odot Z^{d} \right)$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform and $Z^{d}$ is the Fourier transform of the depth feature z on the d-th channel.
Further, the parameters of the correlation filter in step S63 are updated according to

$$P_{t}^{d} = (1-\eta)\, P_{t-1}^{d} + \eta\, Y \odot \bar{X}_{t}^{d}$$

$$Q_{t} = (1-\eta)\, Q_{t-1} + \eta \sum_{i=1}^{D} X_{t}^{i} \odot \bar{X}_{t}^{i}$$

$$W_{t}^{d} = \frac{P_{t}^{d}}{Q_{t} + \lambda}$$

where $P_t$ and $Q_t$ are intermediate variables (the numerator and denominator of the filter), $W_t$ is the correlation filter for the t-th frame after updating, t is the video frame index, and η is the learning rate.
The invention has the beneficial effects that:
1) the invention provides a method for detecting an unmanned aerial vehicle based on the motion characteristics and deep learning characteristics of the unmanned aerial vehicle. The method can effectively detect the target under the conditions that the background is complex and the unmanned aerial vehicle is small.
2) The method improves the traditional deep neural network structure, and effectively solves the problem that the existing target detection algorithm based on the deep neural network is not suitable for small targets.
3) The method provides an online tracking algorithm based on multi-layer depth features and correlation filters, which can better track and predict the trajectory of the unmanned aerial vehicle and filter out false targets.
Drawings
In order to make the object, technical scheme and beneficial effect of the invention more clear, the invention provides the following drawings for explanation:
FIG. 1 is a schematic diagram of a method for detecting a small target of an unmanned aerial vehicle based on motion characteristics and deep learning characteristics according to the invention;
FIG. 2 is a schematic diagram of a video image stabilization algorithm;
FIG. 3 is a schematic diagram of a convolutional neural network structure;
FIG. 4 is a schematic diagram of candidate object generation using an improved area generation network;
FIG. 5 is a schematic diagram of a dual channel-based deep neural network;
FIG. 6 is a schematic diagram of an online tracking algorithm based on depth features.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
In the invention, a candidate target detection module based on motion features performs video stabilization on the original video and then extracts moving target regions through low-rank matrix analysis;
a candidate target detection module based on depth features extracts candidate targets from the video images through an improved candidate region generation network model;
the improved candidate region generation network modifies the network structure and the candidate region scales on the basis of the traditional region generation network, and replaces the network layer whose feature map is output;
the candidate region fusion module fuses the candidate regions obtained in steps S2 and S3;
the candidate target recognition module based on the dual-channel deep neural network improves the traditional deep neural network model according to the characteristics of small targets and classifies the candidate regions to obtain the final recognition result;
the online tracking algorithm based on depth features improves the traditional tracking algorithm based on hand-crafted features by using target features extracted by the convolutional neural network, which makes it more robust.
Fig. 1 is a schematic diagram of a method for detecting a small target of an unmanned aerial vehicle based on motion characteristics and deep learning characteristics, and as shown in the figure, the method specifically includes the following steps:
step S1: firstly, processing a data set of an input original video through a video image stabilization algorithm to compensate camera motion, wherein a specific flow chart is shown in fig. 2:
s101: and extracting key points from the image by using an SURF algorithm, and constructing a SURF feature point descriptor.
S102: and calculating Euclidean distances of the corresponding feature points of the two frames of images, then selecting the minimum distance, setting a threshold value, keeping the matching points when the distance of the corresponding feature points is smaller than the threshold value, and otherwise, removing the matching points.
S103: the two frames of images are subjected to bidirectional matching, and by repeating the step S12, when the matched feature point pair coincides with the result obtained in the step S12, the final feature matching point is obtained.
S104: and setting the camera motion as an affine transformation model, and calculating by using a least square method according to the feature matching points obtained in the step to obtain the affine transformation model between the two frames of images.
S105: and according to the obtained affine transformation model, registering the current frame and the set reference frame to obtain the compensated current frame, storing the compensated current frame into a new video, and finally obtaining the stable video.
S106: and calculating the offset degree between the compensated current frame and the reference frame, if the offset degree is greater than a threshold value, updating the current frame to be the reference frame, and otherwise, continuously reading the next frame.
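The following OpenCV sketch mirrors steps S101 to S105 for one frame pair. SURF requires the opencv-contrib (non-free) build, so an ORB fallback is included; the distance threshold and the use of RANSAC in place of a plain least-squares fit are illustrative assumptions, not the patent's exact choices.

```python
import cv2
import numpy as np

def compensate_frame(ref_gray, cur_gray):
    """Register the current frame to the reference frame with feature matching
    and an affine camera-motion model (steps S101 to S105)."""
    try:
        det = cv2.xfeatures2d.SURF_create(400)       # needs opencv-contrib / non-free
    except AttributeError:
        det = cv2.ORB_create(1000)                   # fallback feature detector
    kp_ref, des_ref = det.detectAndCompute(ref_gray, None)
    kp_cur, des_cur = det.detectAndCompute(cur_gray, None)
    norm = cv2.NORM_L2 if des_ref.dtype == np.float32 else cv2.NORM_HAMMING
    # crossCheck=True keeps only mutually consistent matches (bidirectional matching, S103)
    matches = cv2.BFMatcher(norm, crossCheck=True).match(des_ref, des_cur)
    # keep matches whose distance is close to the minimum distance (S102)
    d_min = min(m.distance for m in matches)
    good = [m for m in matches if m.distance <= max(2.0 * d_min, 1e-6)]
    src = np.float32([kp_cur[m.trainIdx].pt for m in good])   # points in current frame
    dst = np.float32([kp_ref[m.queryIdx].pt for m in good])   # matching points in reference
    # affine transformation model between the two frames (S104)
    A, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
    h, w = ref_gray.shape
    return cv2.warpAffine(cur_gray, A, (w, h))                # compensated frame (S105)
```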
Step S2: and detecting a motion candidate target region of the compensated video by a low-rank matrix analysis method, and removing fine noise points in the motion candidate region by an image post-processing module.
Step S3: dividing a data set into a training set and a testing set, and training by using the training data set to obtain an improved candidate area generation network model; candidate targets are generated for the video images of the test set by the trained improved region generation network. The improved area generation network model structure is shown in fig. 4:
in step S3, training the improved candidate area generation network model using the positive and negative samples of the data set; generating candidate targets for the video images of the test set through the trained improved area generation network, specifically comprising the following processes:
firstly, aiming at the characteristics of an unmanned aerial vehicle, improving the traditional candidate area generation network structure to obtain an improved candidate area generation network, wherein the improved candidate area generation network modifies the network structure and the scale size of characteristic extraction, and replaces the network layer of an output characteristic diagram; then training by utilizing a training data set to obtain an improved candidate region to generate the optimal weight of the network model; and finally, applying the optimal weight to the test data set to obtain a candidate target rectangular frame.
The improved area generation Network is mainly additionally provided with two full convolution layers on the basis of a Convolutional Neural Network (CNN), wherein one full convolution layer is an area classification layer and used for judging whether a candidate area is a foreground target or a background, and the other full convolution layer is an area frame regression layer and used for predicting the position coordinates of the candidate area. The convolutional neural network is composed of five convolutional layers, three pooling layers and two fully-connected layers, as shown in fig. 3. The traditional area generation network usually processes the feature map generated by the last convolutional layer, but the small target often depends more on the shallow feature because the shallow feature has higher resolution, so the method changes the shallow feature into the fourth convolutional layer. And the area generation network slides on the feature map output by the fourth convolution layer through a sliding network, the sliding network is fully connected with 9 windows with different scales on the feature map each time, then the low-dimensional vector is mapped, and finally the low-dimensional vector is sent to two fully-connected layers to obtain the category and the position of the candidate target. Compared with the traditional area generation network, the method has the advantage that the size of 9 scales is reduced compared with the original size, so that the method is more beneficial to the detection of small targets.
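As a sketch of the structure just described, the PyTorch module below implements a proposal head that slides on the conv4 feature map with two fully-convolutional branches; the channel widths and the three small anchor scales combined with three aspect ratios are illustrative assumptions rather than the patent's exact values.

```python
import torch
import torch.nn as nn

class SmallTargetRPNHead(nn.Module):
    """Proposal head sliding on the conv4 feature map: a 3x3 'sliding network'
    followed by a classification branch (foreground/background per anchor) and
    a box-regression branch (4 offsets per anchor)."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.slide = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(256, num_anchors * 2, kernel_size=1)   # fg/bg scores
        self.reg = nn.Conv2d(256, num_anchors * 4, kernel_size=1)   # (dx, dy, dw, dh)

    def forward(self, conv4_feat):
        t = self.relu(self.slide(conv4_feat))
        return self.cls(t), self.reg(t)

# 9 anchors = 3 small scales x 3 aspect ratios, in pixels on the input image;
# the scales are deliberately smaller than the usual (128, 256, 512) defaults.
ANCHOR_SCALES = (16, 32, 64)
ANCHOR_RATIOS = (0.5, 1.0, 2.0)
```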
Step S3 specifically includes:
s31: dividing a data set of a video into a training set and a testing set;
s32: for the data of the training set, extracting manually marked positive samples in the image, and then randomly sampling a plurality of areas as negative samples;
s33: training by using positive and negative samples of a training set to obtain an improved candidate area generation network model;
s34: and processing the video image of the test set through the improved area generation network model to generate a candidate target area II.
In step S32, negative samples are drawn from the image: the width and height of each sampled region are drawn from the range between the minimum and maximum width and height of the positive samples, and the overlap ratio between a negative sample region and the positive samples must not exceed a preset threshold, i.e.

$$\mathrm{IoU}(r_g, r_n) = \frac{\left| r_g \cap r_n \right|}{\left| r_g \cup r_n \right|} < \tau$$

where IoU is the overlap ratio, $r_g$ is a positive sample region, $r_n$ is a randomly sampled negative sample region, and τ is the preset overlap threshold.
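A minimal sketch of this negative sampling rule follows; the overlap threshold max_iou=0.3 and the sample count are assumed example values, since the patent does not state the exact threshold in the text.

```python
import random

def iou(a, b):
    """Overlap ratio IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def sample_negatives(positives, img_w, img_h, n_samples=64, max_iou=0.3):
    """Randomly sample negative boxes whose width/height lie in the range of
    the positive boxes and whose overlap with every positive stays below max_iou."""
    widths = [p[2] - p[0] for p in positives]
    heights = [p[3] - p[1] for p in positives]
    negatives, attempts = [], 0
    while len(negatives) < n_samples and attempts < 100 * n_samples:
        attempts += 1
        w = random.randint(min(widths), max(widths))
        h = random.randint(min(heights), max(heights))
        x1 = random.randint(0, max(img_w - w, 0))
        y1 = random.randint(0, max(img_h - h, 0))
        box = (x1, y1, x1 + w, y1 + h)
        if all(iou(box, p) < max_iou for p in positives):
            negatives.append(box)
    return negatives
```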
Step S4: and performing dense sampling on the candidate target area obtained in the step S2 to obtain a denser candidate target frame, and then obtaining a final candidate target by fusing the dense candidate target frame with the candidate target obtained in the step S3.
The specific fusion mode comprises the following steps:
S41: taking the motion candidate regions obtained in step S2 as seed candidate regions, and performing further dense sampling on them to obtain dense seed candidate regions;
S42: calculating the similarity between each seed candidate region and the candidate regions obtained in step S3; when the similarity is larger than μ (μ ∈ [0.6, 1)), the two candidate regions are combined, and all seed candidate regions are traversed to obtain the final candidate regions. The similarity Sim between a region A and a region B is calculated as:
$$\mathrm{Sim}(A, B) = \frac{\left| A \cap B \right|}{\left| A \cup B \right|}$$
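One possible reading of this fusion step in code form is sketched below; it reuses the iou() helper from the sampling sketch above as the similarity Sim, and the jitter-based dense sampling and the choice to merge regions into their enclosing bounding box are assumptions, since the patent does not spell out these details.

```python
def fuse_candidates(motion_boxes, rpn_boxes, mu=0.6, jitter=4):
    """Fuse motion candidates (step S2) with network candidates (step S3):
    each motion box is densely jittered into seed boxes, and whenever a seed
    overlaps an RPN box with similarity Sim > mu the two regions are merged."""
    fused = []
    for mb in motion_boxes:
        seeds = [(mb[0] + dx, mb[1] + dy, mb[2] + dx, mb[3] + dy)
                 for dx in range(-jitter, jitter + 1, 2)
                 for dy in range(-jitter, jitter + 1, 2)]
        for rb in rpn_boxes:
            if any(iou(s, rb) > mu for s in seeds):      # Sim taken as the overlap ratio
                # merge: take the bounding box enclosing both regions
                fused.append((min(mb[0], rb[0]), min(mb[1], rb[1]),
                              max(mb[2], rb[2]), max(mb[3], rb[3])))
    return fused
```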
step S5: according to the deep neural network model based on the double channels and aiming at the small target detection, the network model is obtained by training through a training data set, and then the network model is applied to candidate targets of a test set to obtain a recognition result. The structure of the deep neural network model based on the dual channels is shown in fig. 5:
the deep neural network model based on the double channels mainly comprises two parts, namely a front-end module and a rear-end module. The front-end module consists of two parallel deep neural network models, one of which takes a candidate target area as input directly and obtains 4096-dimensional characteristics through 6 convolution layers and 1 full-connection layer; and the other one takes an extended area of the target area which is 4 times that of the candidate target area as the center on the original drawing as input, and obtains 4096-dimensional characteristics through 6 convolutional layers and 1 full-connected layer. The back end module is used for inputting two 4096 characteristics obtained by the front end module in a string mode, and classification information of each candidate area is obtained through 2 full connection layers and 1 softmax layer to serve as a final classification result.
In step S5, the dual-channel deep neural network for small target detection proposed by the method is trained with the training data set, and the trained network model is then applied to the candidate targets of the test set to obtain the recognition results, which specifically includes:
S51: for the training data set, dividing the candidate target regions obtained in step S4 into positive and negative samples, and inputting them into the dual-channel deep neural network for training to obtain the optimal weights.
S52: and applying the optimal weight to the candidate target area of the test data set obtained in the step S4 for classification, so as to obtain a final recognition result.
Step S6: the depth-feature-based target tracking method proposed by the method is applied to the recognition results of step S5; the position of the target is predicted with a correlation filtering algorithm and stably matched targets are tracked, so that false targets are filtered out and the final position of the unmanned aerial vehicle is obtained. The specific flow of the depth-feature-based target tracking algorithm is shown in fig. 6:
s601: inputting a candidate target area of a previous frame of the current frame, firstly thinning a convolution feature map array obtained by the last three layers of convolution layers of the model by using the neural network model obtained by training in the step S5, and then extracting the depth feature of the target by using the thinned feature map;
s602: constructing corresponding correlation filters for the output characteristics of each convolution layer, performing convolution on the characteristics of each layer and the corresponding correlation filters from back to front, and calculating corresponding confidence scores to obtain the new position of the candidate target in the current frame;
s603: depth features are extracted around the new center position of the candidate object to update the parameters of the correlation filter.
S604: and in consideration of the stability and continuity of the target motion of the unmanned aerial vehicle, filtering the candidate target area track with the tracking frame number less than the threshold value, and finally obtaining the tracking target as the detection result of the unmanned aerial vehicle.
The threshold mentioned in step S604 has a value in the range of 5 to 20.
In step S6, constructing a corresponding correlation filter for the M × N × D output features specifically includes:
Firstly, let the depth feature of size M × N × D be x; the objective function of the corresponding correlation filter is constructed as

$$w^{*} = \arg\min_{w} \sum_{m,n} \left\| w \cdot x_{m,n} - y(m,n) \right\|^{2} + \lambda \left\| w \right\|_{2}^{2}$$

where λ (λ ≥ 0) is the regularization parameter and y(m, n) represents the label of the pixel at (m, n), which follows a two-dimensional Gaussian distribution:

$$y(m,n) = e^{-\frac{(m - M/2)^{2} + (n - N/2)^{2}}{2\sigma^{2}}}$$

Then, the objective function is converted into the frequency domain by fast Fourier transform, and the optimal solution of the objective function can be derived as

$$W^{d} = \frac{Y \odot \bar{X}^{d}}{\sum_{i=1}^{D} X^{i} \odot \bar{X}^{i} + \lambda}$$

where Y is the Fourier transform of y, ⊙ denotes the Hadamard product, $X^{i}$ is the Fourier transform of the depth feature x on the i-th channel, and the bar denotes complex conjugation;
Finally, given a candidate target region of the next frame image, for the depth feature z of the candidate region the response map corresponding to the correlation filter is

$$f(z) = \mathcal{F}^{-1}\left( \sum_{d=1}^{D} W^{d} \odot Z^{d} \right)$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform and $Z^{d}$ is the Fourier transform of z on the d-th channel.
Further, in step S6, the update strategy for the correlation filter parameters $W^{d}$ specifically is

$$P_{t}^{d} = (1-\eta)\, P_{t-1}^{d} + \eta\, Y \odot \bar{X}_{t}^{d}$$

$$Q_{t} = (1-\eta)\, Q_{t-1} + \eta \sum_{i=1}^{D} X_{t}^{i} \odot \bar{X}_{t}^{i}$$

$$W_{t}^{d} = \frac{P_{t}^{d}}{Q_{t} + \lambda}$$

where t is the video frame number and η is the learning rate.
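The closed-form learning, response and update equations above translate directly into NumPy. In the sketch below the feature size, σ, learning rate and regularization value are example settings, and the Gaussian label is kept centered so that the peak of the response map gives the target displacement from the window center.

```python
import numpy as np

def train_filter(x, Y, lam=1e-4):
    """Closed-form per-channel filters W^d for an M x N x D feature x,
    with Y the 2-D FFT of the Gaussian label y."""
    X = np.fft.fft2(x, axes=(0, 1))
    denom = np.sum(X * np.conj(X), axis=2).real + lam      # shared denominator
    return Y[..., None] * np.conj(X) / denom[..., None]

def response(W, z):
    """Response map f(z) for the depth feature z of a candidate region."""
    Z = np.fft.fft2(z, axes=(0, 1))
    return np.real(np.fft.ifft2(np.sum(W * Z, axis=2)))

def update_filter(P, Q, x_new, Y, eta=0.01, lam=1e-4):
    """Running update of the filter numerator P and denominator Q with learning rate eta."""
    X = np.fft.fft2(x_new, axes=(0, 1))
    P = (1 - eta) * P + eta * Y[..., None] * np.conj(X)
    Q = (1 - eta) * Q + eta * np.sum(X * np.conj(X), axis=2).real
    return P, Q, P / (Q[..., None] + lam)

# Gaussian label y(m, n) centered in an M x N window, sigma = kernel width
M, N, D, sigma = 32, 32, 16, 2.0
gm, gn = np.meshgrid(np.arange(M), np.arange(N), indexing='ij')
y = np.exp(-((gm - M / 2) ** 2 + (gn - N / 2) ** 2) / (2 * sigma ** 2))
Y = np.fft.fft2(y)

x = np.random.rand(M, N, D)            # stand-in for a sparsified depth feature
W = train_filter(x, Y)
score = response(W, x)                 # argmax of score -> new target center offset
```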
Finally, it is noted that the above-mentioned preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above-mentioned preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.

Claims (6)

1. A small target detection method of an unmanned aerial vehicle based on motion characteristics and deep learning characteristics is characterized in that: the method comprises the following steps:
s1: processing the input video data set through a video image stabilization algorithm to compensate the motion of a camera;
s2: detecting a motion candidate target area I from the video image after motion compensation by using a low-rank matrix analysis method, and removing tiny noise points in the motion candidate target area I by using an image post-processing module;
s3: dividing a data set of a video into a training set and a testing set, and training by using the training set to obtain an improved candidate region generation network model; processing the video image of the test set through the improved area generation network model to generate a candidate target area II;
s4: fusing the candidate target area I and the candidate target area II to obtain a candidate target area III;
s5: according to the candidate target area III, training by using a training set to obtain a dual-channel-based deep neural network, and then applying the dual-channel-based deep neural network to the candidate target of the test set to obtain a recognition result;
s6: predicting the position of a target by using a relevant filtering algorithm, tracking and matching the stable target, and filtering out false targets to obtain the position of the unmanned aerial vehicle;
in step S3, the improved candidate area generation network model includes five convolutional layers and two fully-connected layers connected in sequence, wherein pooling layers are disposed between the first convolutional layer and the second convolutional layer, between the second convolutional layer and the third convolutional layer, and between the fifth layer and the first fully-connected layer;
step S3 specifically includes:
s31: dividing a data set of a video into a training set and a testing set;
s32: for the data of the training set, extracting manually marked positive samples in the image, and then randomly sampling a plurality of areas as negative samples;
s33: training by using positive and negative samples of a training set to obtain an improved candidate area generation network model;
s34: processing the video image of the test set through the improved area generation network model to generate a candidate target area II;
the width and height of the randomly sampled regions in step S32 are determined by the width and height of the positive samples, and the overlap between a negative sample and a positive sample satisfies

$$\mathrm{IoU}(r_g, r_n) = \frac{\left| r_g \cap r_n \right|}{\left| r_g \cup r_n \right|} < \tau$$

where IoU is the overlap ratio, $r_g$ is a positive sample region, $r_n$ is a randomly sampled negative sample region, and τ is the preset overlap threshold;
the step S4 of obtaining the candidate target region III by fusion specifically includes:
S41: carrying out dense sampling on the candidate target region I to obtain dense seed candidate regions;
S42: calculating the similarity between the dense seed candidate regions and the candidate target region II, and when the similarity satisfies Sim > μ with μ ∈ [0.6, 1), combining the two candidate regions, where Sim is the similarity between the dense seed candidate region and the candidate target region II;
s43: traversing all the candidate target areas I to obtain a final candidate target area III;
in step S5, the dual-channel-based deep neural network includes a front-end module and a back-end module;
the front-end module consists of two parallel deep neural network models, wherein one of the models takes a candidate target area as input directly and passes through a 6-layer convolutional neural network and 1 full-connection layer; the other one takes the candidate target area as the center, establishes an expansion area on the original image target area as input, and passes through a 6-layer convolutional neural network and 1 full-connection layer;
the rear-end module takes the output of the two full-connection layers obtained by the front-end module as input, and obtains the classification information of each candidate area as a final classification result through 2 full-connection layers and 1 softmax layer;
step S5 specifically includes:
s51: for the training data set, dividing the training data set of the candidate target area III obtained in the step S4 into positive and negative samples, and inputting the positive and negative samples into a two-channel-based deep neural network for training to obtain optimal weight;
s52: and applying the optimal weight to the candidate target areas of the test set obtained in the step S4 for classification, so as to obtain a final recognition result.
2. The unmanned aerial vehicle small target detection method based on the motion characteristic and the deep learning characteristic as claimed in claim 1, characterized in that: in step S1, the video image stabilization algorithm includes:
s11: extracting characteristic points of each frame image by using an SURF algorithm;
s12: calculating an affine transformation model between two frames through the obtained feature matching points between the two frames of images;
s13: and compensating the current frame by using the obtained affine transformation model.
3. The unmanned aerial vehicle small target detection method based on the motion characteristic and the deep learning characteristic as claimed in claim 1, characterized in that: in step S2, the process of detecting the motion candidate target region I by using the low-rank matrix analysis method includes the following steps:
S21: vectorizing the input video sequence images f_1, f_2, ..., f_n to form an image matrix

$$C = [\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_n]$$

where n is the number of video frames, f_n is the video image matrix of the n-th frame, and $\tilde{f}_n$ is the vectorized form of f_n;
s22: decomposing the matrix C into a low-rank matrix L and a sparse matrix S through an RPCA algorithm, wherein the obtained low-rank matrix L represents a target background, and the sparse matrix S represents the obtained candidate moving target;
s23: and carrying out noise filtering processing on the obtained candidate moving target by utilizing morphological opening and closing operation, and filtering fine noise points in a moving candidate area.
4. The unmanned aerial vehicle small target detection method based on the motion characteristic and the deep learning characteristic as claimed in claim 1, characterized in that: step S6 specifically includes:
S61: given the known center position $(x_{t-1}, y_{t-1})$ of the target in the frame preceding the current frame t, for the improved candidate region generation network model obtained by training in step S5, sparsifying the convolution feature map arrays obtained from its last three convolutional layers, and then extracting the depth features of the target with the sparsified feature maps;
S62: constructing a correlation filter for the output features of each of the last three convolutional layers of the improved candidate region generation network model, convolving the features of each layer with the corresponding correlation filter from back to front, and calculating the corresponding confidence score f to obtain the new center position $(x_t, y_t)$ of the candidate target in the current frame;
S63: extracting depth features around the new center position, and updating the parameters of the correlation filter;
s64: and in consideration of the stability and continuity of the target motion of the unmanned aerial vehicle, filtering the candidate target area track with the tracking frame number less than the threshold value, and finally obtaining the tracking target as the detection result of the unmanned aerial vehicle.
5. The unmanned aerial vehicle small target detection method based on the motion characteristic and the deep learning characteristic as claimed in claim 4, wherein: the steps of constructing the correlation filter are:
S621: letting the size of the output feature be M × N × D and the depth feature be x, the objective function of the correlation filter is constructed as

$$w^{*} = \arg\min_{w} \sum_{m,n} \left\| w \cdot x_{m,n} - y(m,n) \right\|^{2} + \lambda \left\| w \right\|_{2}^{2}$$

where $w^{*}$ is the optimal solution of the objective function, w is the correlation filter, $x_{m,n}$ is the feature at pixel (m, n), λ is the regularization parameter (λ ≥ 0), and y(m, n) denotes the label of the pixel at (m, n);
y(m, n) obeys a two-dimensional Gaussian distribution:

$$y(m,n) = e^{-\frac{(m - M/2)^{2} + (n - N/2)^{2}}{2\sigma^{2}}}$$

where σ is the width of the Gaussian kernel;
S622: converting the objective function into the frequency domain by fast Fourier transform, the optimal solution of the objective function is obtained as

$$W^{d} = \frac{Y \odot \bar{X}^{d}}{\sum_{i=1}^{D} X^{i} \odot \bar{X}^{i} + \lambda}$$

where Y is the Fourier transform of y, ⊙ denotes the Hadamard product, $W^{d}$ is the optimal solution for the d-th channel, $X^{i}$ is the Fourier transform of the depth feature x on the i-th channel, the bar denotes complex conjugation, and d ∈ {1, 2, …, D};
S623: given a candidate target region of the next frame image, for the depth feature z of the candidate region, the response map corresponding to the correlation filter is

$$f(z) = \mathcal{F}^{-1}\left( \sum_{d=1}^{D} W^{d} \odot Z^{d} \right)$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform and $Z^{d}$ is the Fourier transform of the depth feature z on the d-th channel.
6. The unmanned aerial vehicle small target detection method based on the motion characteristic and the deep learning characteristic as claimed in claim 5, wherein: the parameters for updating the correlation filter in step S63 satisfy:
$$P_{t}^{d} = (1-\eta)\, P_{t-1}^{d} + \eta\, Y \odot \bar{X}_{t}^{d}$$

$$Q_{t} = (1-\eta)\, Q_{t-1} + \eta \sum_{i=1}^{D} X_{t}^{i} \odot \bar{X}_{t}^{i}$$

$$W_{t}^{d} = \frac{P_{t}^{d}}{Q_{t} + \lambda}$$

where $P_t$ and $Q_t$ are intermediate variables (the numerator and denominator of the filter), $W_t$ is the correlation filter for the t-th frame after updating, t is the video frame number, and η is the learning rate.
CN201711166232.4A 2017-11-21 2017-11-21 Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics Active CN107862705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711166232.4A CN107862705B (en) 2017-11-21 2017-11-21 Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711166232.4A CN107862705B (en) 2017-11-21 2017-11-21 Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics

Publications (2)

Publication Number Publication Date
CN107862705A CN107862705A (en) 2018-03-30
CN107862705B true CN107862705B (en) 2021-03-30

Family

ID=61702397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711166232.4A Active CN107862705B (en) 2017-11-21 2017-11-21 Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics

Country Status (1)

Country Link
CN (1) CN107862705B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549876A (en) * 2018-04-20 2018-09-18 重庆邮电大学 The sitting posture detecting method estimated based on target detection and human body attitude
CN110706193A (en) * 2018-06-21 2020-01-17 北京京东尚科信息技术有限公司 Image processing method and device
CN110633597B (en) * 2018-06-21 2022-09-30 北京京东尚科信息技术有限公司 Drivable region detection method and device
CN108846522B (en) * 2018-07-11 2022-02-11 重庆邮电大学 Unmanned aerial vehicle system combined charging station deployment and routing method
CN109255286B (en) * 2018-07-21 2021-08-24 哈尔滨工业大学 Unmanned aerial vehicle optical rapid detection and identification method based on deep learning network framework
CN108960190B (en) * 2018-07-23 2021-11-30 西安电子科技大学 SAR video target detection method based on FCN image sequence model
CN109272530B (en) 2018-08-08 2020-07-21 北京航空航天大学 Target tracking method and device for space-based monitoring scene
CN109325407B (en) * 2018-08-14 2020-10-09 西安电子科技大学 Optical remote sensing video target detection method based on F-SSD network filtering
CN109145906B (en) * 2018-08-31 2020-04-24 北京字节跳动网络技术有限公司 Target object image determination method, device, equipment and storage medium
CN109325967B (en) * 2018-09-14 2023-04-07 腾讯科技(深圳)有限公司 Target tracking method, device, medium, and apparatus
CN109472191B (en) * 2018-09-17 2020-08-11 西安电子科技大学 Pedestrian re-identification and tracking method based on space-time context
CN109359545B (en) * 2018-09-19 2020-07-21 北京航空航天大学 Cooperative monitoring method and device under complex low-altitude environment
CN109325490B (en) * 2018-09-30 2021-04-27 西安电子科技大学 Terahertz image target identification method based on deep learning and RPCA
CN111127509B (en) * 2018-10-31 2023-09-01 杭州海康威视数字技术股份有限公司 Target tracking method, apparatus and computer readable storage medium
CN109410149B (en) * 2018-11-08 2019-12-31 安徽理工大学 CNN denoising method based on parallel feature extraction
CN109708659B (en) * 2018-12-25 2021-02-09 四川九洲空管科技有限责任公司 Distributed intelligent photoelectric low-altitude protection system
CN109801317A (en) * 2018-12-29 2019-05-24 天津大学 The image matching method of feature extraction is carried out based on convolutional neural networks
CN109918988A (en) * 2018-12-30 2019-06-21 中国科学院软件研究所 A kind of transplantable unmanned plane detection system of combination imaging emulation technology
CN109859241B (en) * 2019-01-09 2020-09-18 厦门大学 Adaptive feature selection and time consistency robust correlation filtering visual tracking method
CN110287955B (en) * 2019-06-05 2021-06-22 北京字节跳动网络技术有限公司 Target area determination model training method, device and computer readable storage medium
CN110262529B (en) * 2019-06-13 2022-06-03 桂林电子科技大学 Unmanned aerial vehicle monitoring method and system based on convolutional neural network
CN110414375B (en) * 2019-07-08 2020-07-17 北京国卫星通科技有限公司 Low-altitude target identification method and device, storage medium and electronic equipment
CN110706252B (en) * 2019-09-09 2020-10-23 西安理工大学 Robot nuclear correlation filtering tracking algorithm under guidance of motion model
CN110631588B (en) * 2019-09-23 2022-11-18 电子科技大学 Unmanned aerial vehicle visual navigation positioning method based on RBF network
CN111006669B (en) * 2019-12-12 2022-08-02 重庆邮电大学 Unmanned aerial vehicle system task cooperation and path planning method
CN111247526B (en) * 2020-01-02 2023-05-02 香港应用科技研究院有限公司 Method and system for tracking position and direction of target object moving on two-dimensional plane
CN111242974B (en) * 2020-01-07 2023-04-11 重庆邮电大学 Vehicle real-time tracking method based on twin network and back propagation
CN111508002B (en) * 2020-04-20 2020-12-25 北京理工大学 Small-sized low-flying target visual detection tracking system and method thereof
CN111781599B (en) * 2020-07-16 2021-08-13 哈尔滨工业大学 SAR moving ship target speed estimation method based on CV-EstNet
CN112288655B (en) * 2020-11-09 2022-11-01 南京理工大学 Sea surface image stabilization method based on MSER region matching and low-rank matrix decomposition
CN112487892B (en) * 2020-11-17 2022-12-02 中国人民解放军军事科学院国防科技创新研究院 Unmanned aerial vehicle ground detection method and system based on confidence
CN114511793B (en) * 2020-11-17 2024-04-05 中国人民解放军军事科学院国防科技创新研究院 Unmanned aerial vehicle ground detection method and system based on synchronous detection tracking
CN116954264B (en) * 2023-09-08 2024-03-15 杭州牧星科技有限公司 Distributed high subsonic unmanned aerial vehicle cluster control system and method thereof
CN117079196B (en) * 2023-10-16 2023-12-29 长沙北斗产业安全技术研究院股份有限公司 Unmanned aerial vehicle identification method based on deep learning and target motion trail

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991396A (en) * 2017-04-01 2017-07-28 南京云创大数据科技股份有限公司 A kind of target relay track algorithm based on wisdom street lamp companion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"基于奇异值分解的红外弱小目标检测";田超等;《工程数学学报》;20150225;第32卷(第1期);全文 *
"森林背景下基于自适应区域生长法的烟雾检测";张炜程等;《重庆邮电大学学报( 自然科学版)》;20160225;第28 卷(第1期);全文 *
"自然场景文本区域定位";黄晓明等;《重庆邮电大学学报( 自然科学版)》;20171030;第27卷(第5期);全文 *

Also Published As

Publication number Publication date
CN107862705A (en) 2018-03-30

Similar Documents

Publication Publication Date Title
CN107862705B (en) Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics
CN110163110B (en) Pedestrian re-recognition method based on transfer learning and depth feature fusion
CN106802113B (en) Intelligent hit telling system and method based on many shell hole algorithm for pattern recognitions
Combinido et al. A convolutional neural network approach for estimating tropical cyclone intensity using satellite-based infrared images
CN103679674B (en) Method and system for splicing images of unmanned aircrafts in real time
CN104899866B (en) A kind of intelligentized infrared small target detection method
US10558186B2 (en) Detection of drones
CN103235830A (en) Unmanned aerial vehicle (UAV)-based electric power line patrol method and device and UAV
CN108875754B (en) Vehicle re-identification method based on multi-depth feature fusion network
CN103218831A (en) Video moving target classification and identification method based on outline constraint
CN108537122A (en) Image fusion acquisition system containing meteorological parameters and image storage method
CN110765948A (en) Target detection and identification method and system based on unmanned aerial vehicle
US10706516B2 (en) Image processing using histograms
Lin et al. Application research of neural network in vehicle target recognition and classification
CN116109950A (en) Low-airspace anti-unmanned aerial vehicle visual detection, identification and tracking method
CN110567324A (en) multi-target group threat degree prediction device and method based on DS evidence theory
Biswas et al. Small object difficulty (sod) modeling for objects detection in satellite images
CN115508821A (en) Multisource fuses unmanned aerial vehicle intelligent detection system
CN115700808A (en) Dual-mode unmanned aerial vehicle identification method for adaptively fusing visible light and infrared images
Valappil et al. CNN-SVM based vehicle detection for UAV platform
CN110458064B (en) Low-altitude target detection and identification method combining data driving type and knowledge driving type
Mohammed et al. Radio frequency fingerprint-based drone identification and classification using Mel spectrograms and pre-trained YAMNet neural
CN117911822A (en) Multi-sensor fusion unmanned aerial vehicle target detection method, system and application
CN113297982A (en) Target detection method for improving combination of KCF and DSST in aerial photography
CN113269099A (en) Vehicle re-identification method under heterogeneous unmanned system based on graph matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant