CN115272894A - Unmanned aerial vehicle-oriented image target detection method and device, electronic equipment and storage medium - Google Patents

Unmanned aerial vehicle-oriented image target detection method and device, electronic equipment and storage medium

Info

Publication number
CN115272894A
CN115272894A (application CN202210917031.8A)
Authority
CN
China
Prior art keywords
model
target detection
aerial vehicle
unmanned aerial
pruning
Prior art date
Legal status
Pending
Application number
CN202210917031.8A
Other languages
Chinese (zh)
Inventor
王素玉
张磊
张宏宇
周伯翔
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202210917031.8A priority Critical patent/CN115272894A/en
Publication of CN115272894A publication Critical patent/CN115272894A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/17 — Scenes; scene-specific elements; terrestrial scenes taken from planes or by drones (G: Physics › G06: Computing; calculating or counting › G06V: Image or video recognition or understanding)
    • G06N 3/08 — Learning methods (G06N: Computing arrangements based on specific computational models › G06N 3/02: Neural networks)
    • G06V 10/764 — Classification, e.g. of video objects (G06V 10/70: Recognition or understanding using pattern recognition or machine learning)
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting (G06V 10/77: Processing image or video features in feature spaces, e.g. PCA, ICA or SOM)

Abstract

The invention discloses an unmanned aerial vehicle-oriented image target detection method and device, electronic equipment and a storage medium. The method comprises: acquiring an unmanned aerial vehicle aerial image data set and processing the image data set; constructing an unmanned aerial vehicle image target detection model; and training the target detection model with the image data set to obtain a final model. The unmanned aerial vehicle image target detection model is constructed by: replacing the backbone network of the Yolov5s model with a MobileNetV3_Small network; cutting the original MobileNetV3_Small network by removing the last 4 layers originally designed for the classification task, wherein the last 4 layers comprise 3 convolutional layers and 1 pooling layer; and taking the spatial pyramid pooling structure SPPF as the last layer of the MobileNetV3_Small network. By constructing and training this lightweight MobileNetV3_Small network, the model guarantees both speed and precision.

Description

Unmanned aerial vehicle-oriented image target detection method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of images, in particular to an unmanned aerial vehicle-oriented image target detection method and device, electronic equipment and a storage medium.
Background
In deep neural networks, model performance is typically improved by stacking many convolutional layers, but increasing the number of network layers introduces a large number of redundant parameters. In practical applications, these redundant parameters increase prediction time and memory requirements, so model lightweighting becomes especially important on performance-limited devices such as unmanned aerial vehicles. At present, there are two main approaches to model lightweighting: compressing an existing complex model, or designing a lightweight network structure. MobileNetV3 is a lightweight network structure that effectively improves inference speed and performance on mobile and embedded terminals; pruning-based lightweighting improves the deployability of a model under limited hardware resources. However, a single lightweighting method can hardly balance precision and speed. The invention therefore designs multiple lightweighting strategies oriented to unmanned aerial vehicle aerial image target detection, reducing model parameters and computation while preserving target detection precision as much as possible.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an unmanned aerial vehicle-oriented image target detection method, an unmanned aerial vehicle-oriented image target detection device, electronic equipment and a storage medium.
The invention discloses an unmanned aerial vehicle-oriented image target detection method, which comprises the following steps:
acquiring an unmanned aerial vehicle aerial image data set, and processing the image data set;
constructing an unmanned aerial vehicle image target detection model;
training the target detection model by the image data set to obtain a final model;
wherein, the construction of the unmanned aerial vehicle image target detection model comprises the following steps:
replacing the backbone network of the Yolov5s model with a MobileNetV3_Small network;
cutting the original MobileNetV3_Small network by removing the last 4 layers originally designed for the classification task, wherein the last 4 layers comprise 3 convolutional layers and 1 pooling layer;
and taking the spatial pyramid pooling structure SPPF as the last layer of the MobileNetV3_Small network.
Preferably, training the target detection model with the image data set comprises:
performing sparse training on the target detection model by using L1 regularization;
after sparse training, calculating a pruning evaluation index from the BN layer scale parameter and the sum of absolute values of the filter weights;
based on a set pruning threshold, pruning channels corresponding to the pruning evaluation indexes lower than the pruning threshold;
and taking the cut model as a student model, taking the original Yolov5s model as a teacher model, and carrying out knowledge distillation training under the supervision of the teacher model to obtain the final model.
Preferably, performing sparse training on the target detection model using L1 regularization comprises:
determining the pruning channels from the BN layers, whose normalization formula is:
Ẑ = (Z_in − μ_c) / √(σ_c² + ε)
performing sparse training on the scale parameters of the BN layers so that their values continuously approach 0, the scale-and-shift formula being:
Z_out = γ·Ẑ + β
the loss function of the target detection model being:
L = Σ_{(x,y)} l(f(x, W), y) + λ·Σ_{γ∈Γ} g(γ)
in the formulas: Z_in and Z_out are the input features and output features, respectively; γ is the scale parameter; β is the bias; μ_c and σ_c are the mean and variance of the current batch; ε is a small constant preventing division by zero; l(f(x, W), y) is the original loss function; λ·Σ_{γ∈Γ} g(γ) is the penalty term on the convolutional layer weight parameters and the BN layer scale parameters; λ is the regularization coefficient; x and W are the input image features and the convolutional layer weight parameters, respectively; and g(γ) is the L1 norm, i.e. g(γ) = |γ|.
Preferably, after sparse training, calculating the pruning evaluation index from the BN layer scale parameter and the sum of absolute values of the filter weights comprises:
the sum of absolute values of a filter is:
K_m = Σ_{i=1..J} |W_i|
in the formula: K_m is the sum of the absolute values of the weight parameters of filter m; |W_i| is the absolute value of the weight of the i-th convolution kernel in filter m; J is the total number of convolution kernels in the current filter m;
the pruning evaluation index is calculated as:
s_i = γ × K_i
in the formula: s_i is the pruning score of filter i, γ is the scale parameter of the BN layer connected after filter i, and K_i is the sum of the absolute values of filter i.
Preferably, based on the set pruning threshold, pruning the channels whose pruning evaluation index is lower than the pruning threshold comprises:
sorting the pruning evaluation indexes in ascending order;
setting the pruning proportion to 50%, and taking the score at the 50% position of the ascending order as the pruning threshold;
and if all pruning evaluation indexes of a layer are smaller than the pruning threshold, retaining the two channels with the largest pruning evaluation indexes.
Preferably, taking the pruned model as the student model and the original Yolov5s model as the teacher model, and performing knowledge distillation training under the supervision of the teacher model to obtain the final model comprises the following steps:
the loss function of the student model comprises a regression loss function, a classification loss function and a confidence loss function;
the confidence loss function is:
f_obj = f_obj(o^gt, ô) + λ_D·f_obj(ô^T, ô)
the classification loss function is:
f_cl = f_cl(p^gt, p̂) + λ_D·ô^T·f_cl(p^T, p̂)
the regression loss function is:
f_bb = f_bb(b^gt, b̂) + λ_D·ô^T·f_bb(b^T, b̂)
the loss function of the student model is:
L_student = f_obj + f_cl + f_bb
in the formulas: o^gt, p^gt and b^gt are the confidence label, category label and prediction box position label of the student model, respectively; ô, p̂ and b̂ are the confidence score, category score and prediction box coordinates output by the student model, respectively; λ_D is a balance parameter; ô^T is the confidence predicted by the teacher model; and p^T and b^T are the category score and box coordinates predicted by the teacher model.
Preferably, processing the image data set comprises performing a normalization operation on each image, mapping all pixel values to a range of 0 to 1.
The invention also provides an image target detection device for the unmanned aerial vehicle, which comprises:
the acquisition module is used for acquiring an unmanned aerial vehicle aerial image data set and processing the image data set;
the construction module is used for constructing an unmanned aerial vehicle image target detection model;
the training module is used for training the target detection model with the image data set to obtain a final model;
wherein, the constructing of the unmanned aerial vehicle image target detection model comprises the following steps:
replacing the backbone network of the Yolov5s model with a MobileNetV3_Small network;
cutting the original MobileNetV3_Small network by removing the last 4 layers originally designed for the classification task, wherein the last 4 layers comprise 3 convolutional layers and 1 pooling layer;
and taking the spatial pyramid pooling structure SPPF as the last layer of the MobileNetV3_Small network.
The invention also provides an electronic device comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program which, when executed by the processing unit, causes the processing unit to perform the above-mentioned method.
The invention also provides a storage medium storing a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform the above-mentioned method.
Compared with the prior art, the invention has the beneficial effects that:
according to the invention, by constructing and training the lightweight MobileNetV3_Small network, the model guarantees both speed and precision.
Drawings
FIG. 1 is a schematic view of a flow structure of an unmanned aerial vehicle-oriented image target detection method according to the present invention;
FIG. 2 is a schematic structural diagram of the SPPF in the unmanned aerial vehicle-oriented image target detection method of the present invention;
fig. 3 is a network structure diagram of a target detection model in the unmanned aerial vehicle-oriented image target detection method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention is described in further detail below with reference to the attached drawing figures:
referring to fig. 1, the invention discloses an unmanned aerial vehicle-oriented image target detection method, which comprises the following steps:
acquiring an unmanned aerial vehicle aerial image data set, and processing the image data set;
in this embodiment, the VisDrone unmanned aerial vehicle aerial image data set is used. So that input images can be trained on and predicted normally, each image is preprocessed: it is normalized so that all pixel values are mapped into the range 0 to 1, and the input size is limited to 640 × 640 for convenient input into the model.
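A minimal sketch of this preprocessing step (OpenCV-based; plain resizing is assumed here for simplicity, whereas a deployed Yolov5-style pipeline would typically use letterbox padding):

    import cv2
    import numpy as np

    def preprocess(image_path: str, img_size: int = 640) -> np.ndarray:
        """Resize an aerial image to the model input size and map pixels to [0, 1]."""
        img = cv2.imread(image_path)                 # HWC, BGR, uint8
        img = cv2.resize(img, (img_size, img_size))  # limit input size to 640 x 640
        img = img[:, :, ::-1].transpose(2, 0, 1)     # BGR -> RGB, HWC -> CHW
        img = np.ascontiguousarray(img, dtype=np.float32) / 255.0  # normalize to [0, 1]
        return img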
Constructing an unmanned aerial vehicle image target detection model;
specifically, referring to fig. 2, constructing the unmanned aerial vehicle image target detection model includes:
replacing the backbone network of the Yolov5s model with a MobileNetV3_Small network;
cutting the original MobileNetV3_Small network by removing the last 4 layers originally designed for the classification task, wherein the last 4 layers comprise 3 convolutional layers and 1 pooling layer;
and taking the spatial pyramid pooling structure SPPF as the last layer of the MobileNetV3_Small network.
Referring to fig. 3, the MobileNetV3 inverted residual module, which is the main component of the MobileNetV3_Small backbone network, comprises a 1 × 1 convolution, a BN layer, a ReLU activation function, a 3 × 3 depthwise separable convolution, and an SE attention mechanism. Specifically, the 13th layer (a 1 × 1 convolutional layer), the 14th layer (a 7 × 7 global average pooling layer), and the 15th and 16th layers (1 × 1 convolutions) are deleted, and a modified spatial pyramid pooling structure (SPPF) is used instead as the last layer of the backbone network.
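An illustrative PyTorch sketch of this backbone modification (assumptions: torchvision's MobileNetV3-Small layout, where truncating features[:-1] drops the classification-oriented head; YOLOv5's published SPPF design; and the 96-channel width tied to that layout — not the patent's exact implementation):

    import torch
    import torch.nn as nn
    from torchvision.models import mobilenet_v3_small

    class SPPF(nn.Module):
        """Fast spatial pyramid pooling (YOLOv5 style): three chained 5x5 max-pools
        whose outputs are concatenated with the input and fused by a 1x1 conv."""
        def __init__(self, c1, c2, k=5):
            super().__init__()
            c_ = c1 // 2
            self.cv1 = nn.Sequential(nn.Conv2d(c1, c_, 1, bias=False),
                                     nn.BatchNorm2d(c_), nn.SiLU())
            self.cv2 = nn.Sequential(nn.Conv2d(c_ * 4, c2, 1, bias=False),
                                     nn.BatchNorm2d(c2), nn.SiLU())
            self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

        def forward(self, x):
            x = self.cv1(x)
            y1 = self.m(x)
            y2 = self.m(y1)
            return self.cv2(torch.cat([x, y1, y2, self.m(y2)], 1))

    # Truncate MobileNetV3-Small: torchvision keeps the final 1x1 convs and the
    # global pooling (the classification-oriented layers) at/after the end of
    # `features`, so taking features[:-1] approximates removing those 4 layers;
    # SPPF is then appended as the last layer of the backbone.
    mnet = mobilenet_v3_small(pretrained=True)           # torchvision >= 0.9
    backbone = nn.Sequential(*list(mnet.features)[:-1],  # ends at 96 channels
                             SPPF(96, 96))               # output width: assumption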
Training the target detection model by using the image data set to obtain a final model;
in this embodiment, training the target detection model with the image data set includes:
performing sparse training on a target detection model by using L1 regularization;
specifically, the pruning channels are determined from the BN layers, whose normalization formula is:
Ẑ = (Z_in − μ_c) / √(σ_c² + ε)
sparse training is performed on the scale parameters of the BN layers so that their values continuously approach 0, the scale-and-shift formula being:
Z_out = γ·Ẑ + β
in the formulas: the scale parameter γ and the bias β both participate in the back propagation of the detection network and are trainable parameters; performing sparse training on the BN scale parameter γ means adding an L1 regularization constraint so that its value continuously approaches 0; Z_in and Z_out are the input features and output features, respectively; μ_c and σ_c are the mean and variance of the current batch; and the parameter ε prevents the denominator from being 0.
The loss function of the target detection model is:
L = Σ_{(x,y)} l(f(x, W), y) + λ·Σ_{γ∈Γ} g(γ)
The first term of the formula is the original loss function; the second term is the penalty on the convolutional layer weight parameters and the BN layer scale parameters; λ is the regularization coefficient, i.e. the sparsity rate, and the larger it is, the stronger the constraint; x and W denote the input image features and the convolutional layer weight parameters, respectively; and g(γ) denotes the L1 norm, i.e. g(γ) = |γ|. During sparse training, the convolution kernel weight parameters and the BN layer scale parameters are constrained simultaneously: the smaller the convolution kernel weights and the BN layer scale parameter, the less important the corresponding information. After sparse training is completed, when a scale parameter γ approaches 0, the features of that channel are multiplied by a sufficiently small scale parameter regardless of the size of the previous layer's convolution output, so the output of the channel also becomes sufficiently small, cutting off the correspondence between the input and output feature channels. The method improves on pruning by scale parameters alone: the BN layer parameter is taken as the criterion for judging the importance of a pruning channel, while the filter's convolution parameters also participate in the judgment. Concretely, a penalty on the weight parameters is added to the loss function, so that many weights close to 0 appear in the filters after sparse training.
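A minimal PyTorch sketch of this sparse-training step (an assumption in network-slimming style: the L1 subgradient is added to the existing gradients after backpropagation; the hyperparameter value is illustrative):

    import torch
    import torch.nn as nn

    def add_l1_sparsity_grad(model: nn.Module, lamb: float = 1e-4) -> None:
        """Add the subgradient of lamb * (sum|W_conv| + sum|gamma_BN|) to the
        existing gradients, pushing BN scales and conv weights toward 0."""
        for m in model.modules():
            if isinstance(m, (nn.BatchNorm2d, nn.Conv2d)) and m.weight.grad is not None:
                m.weight.grad.add_(lamb * torch.sign(m.weight.detach()))

    # usage inside the training loop (lamb plays the role of the sparsity rate λ):
    #   loss = criterion(model(imgs), targets)
    #   loss.backward()
    #   add_l1_sparsity_grad(model, lamb=1e-4)
    #   optimizer.step()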
After sparse training, the pruning evaluation index is calculated from the BN layer scale parameter and the sum of absolute values of the filter weights;
specifically, the sum of absolute values within a filter represents the importance of that filter:
K_m = Σ_{i=1..J} |W_i|
in the formula: K_m is the sum of the absolute values of the weight parameters of filter m; |W_i| is the absolute value of the weight of the i-th convolution kernel in filter m; J is the total number of convolution kernels in the current filter m;
combining the sparsely trained scale parameter γ with the sum of absolute values of the filter fuses the judgment information of the filter and the BN layer into the final pruning evaluation index s:
s_i = γ × K_i
in the formula: s_i is the pruning score of filter i, γ is the scale parameter of the BN layer connected after filter i, and K_i is the sum of the absolute values of filter i.
Based on the set pruning threshold, channels whose pruning evaluation index is lower than the threshold are pruned;
specifically, after the above operations are completed, a set S of filter pruning scores is obtained. The scores in the set are sorted in ascending order, the pruning proportion is set to 50%, and the score at the 50% position of the ascending order is taken as the pruning threshold:
θ = sort_R(S)
in the formula: θ is the final threshold for the pruning operation, and sort_R(S) takes the score at ratio R of the ascending-sorted filter score set S.
If all pruning evaluation indexes of a layer are smaller than the pruning threshold, the two largest are retained. Filters below the threshold and their corresponding BN layer scale parameters γ are pruned away, and the feature maps corresponding to the remaining filters and scale parameters are reduced accordingly. Cutting the convolution kernels corresponding to these channels reduces both the parameters and the computation of the network through channel pruning, yielding a lightweight network model with less computation and a smaller memory footprint. The unmanned aerial vehicle image target detection method thus continues pruning on top of the MobileNet lightweight backbone replacement to obtain an even lighter target detection model; the specific experimental data after pruning are given in the experimental section.
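A minimal PyTorch sketch of this scoring-and-thresholding step (assumptions: per-layer application, the use of |γ|, and the exact form of the top-2 safeguard are illustrative, not the patent's exact implementation):

    import torch
    import torch.nn as nn

    def pruning_scores(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> torch.Tensor:
        """s_i = gamma_i * K_i: BN scale times the sum of absolute filter weights."""
        K = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one value per output filter
        gamma = bn.weight.detach().abs()                   # |gamma| per channel (assumption)
        return gamma * K

    def channels_to_keep(scores: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
        """theta = score at the `ratio` position of the ascending sort; channels
        below theta are pruned, but at least the two largest are always kept."""
        sorted_scores, _ = torch.sort(scores)              # ascending order
        theta = sorted_scores[int(len(scores) * ratio)]    # pruning threshold
        keep = scores >= theta
        if keep.sum() < 2:                                 # all below threshold:
            keep = torch.zeros_like(keep)                  # retain the two largest
            keep[torch.topk(scores, k=2).indices] = True
        return keep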
The pruned model is taken as the student model and the original Yolov5s model as the teacher model, and knowledge distillation training is performed under the supervision of the teacher model to obtain the final model.
Specifically, this comprises the following steps:
the loss function of the student model comprises a regression loss function, a classification loss function and a confidence loss function;
the confidence loss function is:
f_obj = f_obj(o^gt, ô) + λ_D·f_obj(ô^T, ô)
in the formula: o^gt, ô and ô^T denote the confidence label, the student model's confidence prediction and the teacher model's confidence prediction, respectively, and λ_D balances the two loss parts. The first term is the student model's original confidence loss function; the second term is the knowledge distillation loss, in which the confidence label o^gt is replaced by the teacher model's prediction output ô^T, thereby transferring knowledge to the student model;
the classification loss function is:
f_cl = f_cl(p^gt, p̂) + λ_D·ô^T·f_cl(p^T, p̂)
that is, the classification loss of the student model is likewise composed of two parts; the difference is that the second, teacher-supervised part is weighted by the confidence measure parameter ô^T, which expresses the probability that each anchor box contains a detected target. Its significance is that if an anchor box is background, its confidence value is small and the whole second (teacher) term becomes negligible, which prevents the student model from learning unimportant background information and thus accelerates convergence.
Similarly, the regression loss function is:
f_bb = f_bb(b^gt, b̂) + λ_D·ô^T·f_bb(b^T, b̂)
Finally, the single-stage detection algorithm computes the final loss on the output feature map of the last convolutional layer. The student model loss comprises the target detection loss function and the distillation loss function, so the loss function of the student model is finally:
L_student = f_obj + f_cl + f_bb
in the formulas: o^gt, p^gt and b^gt are the confidence label, category label and prediction box position label of the student model, respectively; ô, p̂ and b̂ are the confidence score, category score and prediction box coordinates output by the student model, respectively; λ_D is a balance parameter; ô^T is the confidence predicted by the teacher model; and p^T and b^T are the category score and box coordinates predicted by the teacher model.
Further, the student model is a single-stage target detection network, which differs from two-stage detection models based on candidate regions (the RCNN series): Yolo outputs detection box coordinates, category scores and confidence scores at the last layer of the network. Suppose the final prediction output feature matrix has the form M × (C + 5), where M is the number of anchor boxes generated by each cell of the prediction feature map, C is the number of preset classification categories, and the number 5 covers the 4 position coordinates of a detection box plus 1 confidence that the current anchor box contains a target. In candidate-region-based two-stage detection models, the knowledge distillation loss is applied by passing the features output by the teacher model's last convolutional layer directly to the student model. A single-stage detection algorithm, however, has no candidate regions: it predicts directly on the final output feature map, generating three anchors at each position. For example, if the final prediction branch outputs a 40 × 40 feature map, 4800 anchor boxes are generated, a large number of which contain no target and only background regions. If all of these were passed to the student model, the network would keep optimizing the coordinates and classification of background regions, making convergence of the student model difficult. A two-stage detection model has few candidate-region detection boxes, most of which contain targets to be detected; therefore a knowledge distillation method based on a confidence measure is designed for the single-stage detection algorithm. Since the single-stage final prediction includes confidence predictions, the knowledge distillation loss is constrained with the confidence score of the prediction output: the teacher model contributes to the student model's final loss function only where its predicted target confidence is high.
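The following is a minimal PyTorch sketch of such a confidence-gated distillation loss, not the patent's exact implementation: the dict-based interface, the choice of binary cross-entropy and smooth-L1 as the underlying terms f_obj/f_cl/f_bb, and sigmoid-probability outputs are all assumptions.

    import torch
    import torch.nn.functional as F

    def distillation_loss(stu, tea, gt, lambda_d=1.0):
        """Confidence-gated distillation for a single-stage detector (sketch).
        stu/tea/gt: dicts with per-anchor 'obj' (N,), 'cls' (N,C), 'box' (N,4);
        'obj' and 'cls' are assumed to be sigmoid probabilities."""
        t_obj = tea['obj'].detach()  # teacher confidence, gates the cls/box terms

        # confidence: ground-truth term + plain distillation term
        f_obj = F.binary_cross_entropy(stu['obj'], gt['obj']) \
              + lambda_d * F.binary_cross_entropy(stu['obj'], t_obj)

        # classification: teacher term weighted per-anchor by t_obj, so
        # background anchors (low teacher confidence) contribute almost nothing
        cls_kd = F.binary_cross_entropy(stu['cls'], tea['cls'].detach(),
                                        reduction='none')
        f_cl = F.binary_cross_entropy(stu['cls'], gt['cls']) \
             + lambda_d * (t_obj.unsqueeze(1) * cls_kd).mean()

        # box regression: same confidence gating
        box_kd = F.smooth_l1_loss(stu['box'], tea['box'].detach(),
                                  reduction='none')
        f_bb = F.smooth_l1_loss(stu['box'], gt['box']) \
             + lambda_d * (t_obj.unsqueeze(1) * box_kd).mean()

        return f_obj + f_cl + f_bb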
To complete the knowledge distillation process in the single-stage detection network, besides adding the distillation loss to the loss function, one further very important operation is needed: non-maximum suppression (NMS) on the feature map. The feature map finally generated during training of a single-stage detection algorithm comprises many cells, each generating three anchor boxes by default, and multiple anchor boxes can predict the same object during inference; after end-to-end prediction, highly overlapping detection boxes in the output of the last convolutional layer are filtered out by NMS. During knowledge distillation training, however, the teacher model's prediction process also produces excessively overlapping detection boxes, and passing this redundant information to the student model causes the student model to overfit. Therefore a feature-map-based NMS is used during teacher model inference: for example, if the last layer outputs n × n cells and adjacent cells predict the same category, they have a high probability of detecting the same target, so when adjacent cells output the same category, their corresponding distillation loss is set to 0; the neighbourhood size was set to 3 × 3 in the experiments. Finally, adding the distillation loss function to the student model and applying the feature-map-based NMS to the teacher model's prediction output completes the training process in which the teacher model guides the student model.
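A rough sketch of such a feature-map NMS mask under stated assumptions (the patent suppresses adjacent cells that predict the same class; this sketch approximates that by keeping only local confidence maxima in a 3 × 3 window; the function name and tensor shapes are hypothetical):

    import torch.nn.functional as F

    def distill_nms_mask(t_conf, window=3):
        """Keep a cell's distillation loss only where its teacher confidence is
        the maximum within its 3x3 neighbourhood, zeroing overlapping duplicates.
        t_conf: (B, 1, H, W) teacher confidence on the final feature map."""
        local_max = F.max_pool2d(t_conf, window, stride=1, padding=window // 2)
        return (t_conf == local_max).float()  # 1 = keep loss, 0 = suppress duplicate

    # usage: multiply the per-cell distillation loss map by the mask before reduction
    #   kd_loss = (distill_nms_mask(teacher_conf) * kd_loss_map).mean()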
Training: the optimal model parameters are stored, and the stored model parameters are loaded at prediction time.
The experimental environment was based on the Ubuntu 16.04 operating system, using Python 3.8 and PyTorch 1.7. The whole experiment was improved and tested on the basis of the yolov5 detection framework, and experimental comparisons and analyses were carried out for each method.
Table 1: influence of the lightweight backbone network MobileNetV3 on algorithm performance (table content not reproduced in the text)
Table 2: channel pruning lightweight experimental results based on convolutional layer weights and BN layer sparse training (table content not reproduced in the text)
In the experiments, the complex model YoloV5x is used to guide knowledge distillation training of the pruned lightweight model; the experimental data are shown in Table 3. A distillation strategy based on output response maps is adopted, the knowledge distillation loss function based on the confidence scale parameter is applied for training, and the knowledge learned by YoloV5x is distilled into Mobile-yolos and the pruned models. Compared with the data in Table 2, the parameter and computation amounts of Distillation-Mobile-yolo5s after distillation do not increase, while mAP50 increases by 1.7; the knowledge distillation gain is most obvious for the Mobile-yolos-30% pruned model, whose mAP improves by 3.9 over the model before distillation, with FPS reaching 16; distilling the Mobile-yolos-50% model yields Distillation-Mobile-yolos-50%, whose inference speed reaches 21 FPS. The experimental data show that guiding the training of the student model with the teacher model greatly improves the student model's generalization capability.
Table 3: knowledge distillation experimental data based on the confidence measure parameter (table content not reproduced in the text)
Analysis of the experimental results shows that using the lightweight backbone network MobileNetV3 together with pruning based on convolutional layer weight parameters and BN layer sparse training greatly reduces model parameters and computation, accompanied by a loss of detection precision. Knowledge distillation is then used, with the large model guiding the training of the small model, so that the precision of the small model approaches the detection precision of the large model.
The invention also provides an unmanned aerial vehicle-oriented image target detection device, which comprises:
the acquisition module is used for acquiring an unmanned aerial vehicle aerial image data set and processing the image data set;
the construction module is used for constructing an unmanned aerial vehicle image target detection model;
the training module is used for training the target detection model with the image data set to obtain a final model;
wherein, constructing the unmanned aerial vehicle image target detection model comprises:
replacing the backbone network of the Yolov5s model with a MobileNetV3_Small network;
cutting the original MobileNetV3_Small network by removing the last 4 layers originally designed for the classification task, wherein the last 4 layers comprise 3 convolutional layers and 1 pooling layer;
and taking the spatial pyramid pooling structure SPPF as the last layer of the MobileNetV3_Small network.
The invention also provides an electronic device comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program which, when executed by the processing unit, causes the processing unit to perform the above method.
The present invention also provides a storage medium storing a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform the above-mentioned method.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An unmanned aerial vehicle-oriented image target detection method is characterized by comprising the following steps:
acquiring an unmanned aerial vehicle aerial image data set, and processing the image data set;
constructing an unmanned aerial vehicle image target detection model;
training the target detection model by using the image data set to obtain a final model;
wherein, the constructing of the unmanned aerial vehicle image target detection model comprises the following steps:
replacing the backbone network of the Yolov5s model with a MobileNetV3_Small network;
cutting the original MobileNetV3_Small network by removing the last 4 layers originally designed for the classification task, wherein the last 4 layers comprise 3 convolutional layers and 1 pooling layer;
and taking the spatial pyramid pooling structure SPPF as the last layer of the MobileNetV3_Small network.
2. The unmanned aerial vehicle-oriented image target detection method of claim 1, wherein training the target detection model with the image data set comprises:
performing sparse training on the target detection model by using L1 regularization;
after sparse training, calculating a pruning evaluation index from the BN layer scale parameter and the sum of absolute values of the filter weights;
based on a set pruning threshold, pruning channels corresponding to the pruning evaluation indexes lower than the pruning threshold;
and taking the cut model as a student model, taking the original Yolov5s model as a teacher model, and carrying out knowledge distillation training under the supervision of the teacher model to obtain the final model.
3. The unmanned aerial vehicle-oriented image target detection method of claim 2, wherein performing sparse training on the target detection model using L1 regularization comprises:
determining the pruning channels from the BN layers, whose normalization formula is:
Ẑ = (Z_in − μ_c) / √(σ_c² + ε)
performing sparse training on the scale parameters of the BN layers so that their values continuously approach 0, the scale-and-shift formula being:
Z_out = γ·Ẑ + β
the loss function of the target detection model being:
L = Σ_{(x,y)} l(f(x, W), y) + λ·Σ_{γ∈Γ} g(γ)
in the formulas: Z_in and Z_out are the input features and output features, respectively; γ is the scale parameter; β is the bias; μ_c and σ_c are the mean and variance of the current batch; ε is a small constant preventing division by zero; l(f(x, W), y) is the original loss function; λ·Σ_{γ∈Γ} g(γ) is the penalty term on the convolutional layer weight parameters and the BN layer scale parameters; λ is the regularization coefficient; x and W are the input image features and the convolutional layer weight parameters, respectively; and g(γ) is the L1 norm, i.e. g(γ) = |γ|.
4. The unmanned aerial vehicle-oriented image target detection method of claim 3, wherein, after sparse training, calculating the pruning evaluation index from the BN layer scale parameter and the sum of absolute values of the filter weights comprises:
the sum of absolute values of a filter is:
K_m = Σ_{i=1..J} |W_i|
in the formula: K_m is the sum of the absolute values of the weight parameters of filter m; |W_i| is the absolute value of the weight of the i-th convolution kernel in filter m; J is the total number of convolution kernels in the current filter m;
the pruning evaluation index is calculated as:
s_i = γ × K_i
in the formula: s_i is the pruning score of filter i, γ is the scale parameter of the BN layer connected after filter i, and K_i is the sum of the absolute values of filter i.
5. The unmanned aerial vehicle-oriented image target detection method of claim 4, wherein based on a set pruning threshold, pruning channels for which the pruning evaluation index is lower than the pruning threshold comprises:
sorting the pruning evaluation indexes in ascending order;
setting the pruning proportion to 50%, and taking the score at the 50% position of the ascending order as the pruning threshold;
and if all pruning evaluation indexes of a layer are smaller than the pruning threshold, retaining the two channels with the largest pruning evaluation indexes.
6. The unmanned aerial vehicle-oriented image target detection method as claimed in claim 5, wherein taking the pruned model as a student model, taking the original Yolov5s model as a teacher model, and performing knowledge distillation training under the supervision of the teacher model to obtain the final model comprises:
the loss function of the student model comprises a regression loss function, a classification loss function and a confidence loss function;
the confidence loss function is:
f_obj = f_obj(o^gt, ô) + λ_D·f_obj(ô^T, ô)
the classification loss function is:
f_cl = f_cl(p^gt, p̂) + λ_D·ô^T·f_cl(p^T, p̂)
the regression loss function is:
f_bb = f_bb(b^gt, b̂) + λ_D·ô^T·f_bb(b^T, b̂)
and the loss function of the student model is:
L_student = f_obj + f_cl + f_bb
in the formulas: o^gt, p^gt and b^gt are the confidence label, category label and prediction box position label of the student model, respectively; ô, p̂ and b̂ are the confidence score, category score and prediction box coordinates output by the student model, respectively; λ_D is a balance parameter; ô^T is the confidence predicted by the teacher model; and p^T and b^T are the category score and box coordinates predicted by the teacher model.
7. An unmanned aerial vehicle-oriented image target detection method as claimed in claim 1, wherein processing the image data set comprises performing a normalization operation on each image to map all pixel values to a range of 0 to 1.
8. An unmanned aerial vehicle-oriented image target detection device, characterized by comprising:
the acquisition module is used for acquiring an unmanned aerial vehicle aerial image data set and processing the image data set;
the construction module is used for constructing an unmanned aerial vehicle image target detection model;
the training module is used for training the target detection model with the image data set to obtain a final model;
wherein, the constructing of the unmanned aerial vehicle image target detection model comprises the following steps:
replacing the backbone network of the Yolov5s model with a MobileNetV3_Small network;
cutting the original MobileNetV3_Small network by removing the last 4 layers originally designed for the classification task, wherein the last 4 layers comprise 3 convolutional layers and 1 pooling layer;
and taking the spatial pyramid pooling structure SPPF as the last layer of the MobileNetV3_Small network.
9. An electronic device, comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the method of any one of claims 1 to 7.
10. A storage medium storing a computer program executable by an electronic device, the program causing the electronic device to perform the method of any one of claims 1 to 7 when the program is run on the electronic device.
CN202210917031.8A 2022-08-01 2022-08-01 Unmanned aerial vehicle-oriented image target detection method and device, electronic equipment and storage medium Pending CN115272894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210917031.8A CN115272894A (en) 2022-08-01 2022-08-01 Unmanned aerial vehicle-oriented image target detection method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115272894A true CN115272894A (en) 2022-11-01

Family

ID=83746988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210917031.8A Pending CN115272894A (en) 2022-08-01 2022-08-01 Unmanned aerial vehicle-oriented image target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115272894A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229301A (en) * 2023-05-09 2023-06-06 南京瀚海伏羲防务科技有限公司 Lightweight unmanned aerial vehicle obstacle detection model, detection method and detection system
CN116229301B (en) * 2023-05-09 2023-10-27 南京瀚海伏羲防务科技有限公司 Lightweight unmanned aerial vehicle obstacle detection model, detection method and detection system
CN116721420A (en) * 2023-08-10 2023-09-08 南昌工程学院 Semantic segmentation model construction method and system for ultraviolet image of electrical equipment
CN116721420B (en) * 2023-08-10 2023-10-20 南昌工程学院 Semantic segmentation model construction method and system for ultraviolet image of electrical equipment
CN117456170A (en) * 2023-12-22 2024-01-26 苏州镁伽科技有限公司 Target detection method and device, electronic equipment and storage medium
CN117456170B (en) * 2023-12-22 2024-03-19 苏州镁伽科技有限公司 Target detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110598029B (en) Fine-grained image classification method based on attention transfer mechanism
CN115272894A (en) Unmanned aerial vehicle-oriented image target detection method and device, electronic equipment and storage medium
CN112150821B (en) Lightweight vehicle detection model construction method, system and device
CN110929577A (en) Improved target identification method based on YOLOv3 lightweight framework
CN112464911A (en) Improved YOLOv 3-tiny-based traffic sign detection and identification method
CN112541532B (en) Target detection method based on dense connection structure
CN110428413B (en) Spodoptera frugiperda imago image detection method used under lamp-induced device
CN115331172A (en) Workshop dangerous behavior recognition alarm method and system based on monitoring video
CN116187398B (en) Method and equipment for constructing lightweight neural network for unmanned aerial vehicle ocean image detection
CN116110022B (en) Lightweight traffic sign detection method and system based on response knowledge distillation
CN112819024B (en) Model processing method, user data processing method and device and computer equipment
CN113592825A (en) YOLO algorithm-based real-time coal gangue detection method
CN113436174A (en) Construction method and application of human face quality evaluation model
CN116385879A (en) Semi-supervised sea surface target detection method, system, equipment and storage medium
CN115393690A (en) Light neural network air-to-ground observation multi-target identification method
CN115346135A (en) Optical remote sensing image ship target identification method based on convolutional neural network
CN113420651A (en) Lightweight method and system of deep convolutional neural network and target detection method
CN117034090A (en) Model parameter adjustment and model application methods, devices, equipment and media
CN111178370B (en) Vehicle searching method and related device
CN113627240B (en) Unmanned aerial vehicle tree species identification method based on improved SSD learning model
CN115205573A (en) Image processing method, device and equipment
CN113947723A (en) High-resolution remote sensing scene target detection method based on size balance FCOS
CN113139612A (en) Image classification method, training method of classification network and related products
CN116861261B (en) Training method, deployment method, system, medium and equipment for automatic driving model
CN110728292A (en) Self-adaptive feature selection algorithm under multi-task joint optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination