CN114821265A - Optimized SSD detection model training method and small target detection method - Google Patents

Optimized SSD detection model training method and small target detection method

Info

Publication number
CN114821265A
Authority
CN
China
Prior art keywords
network
convolution
image
data
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210330727.0A
Other languages
Chinese (zh)
Other versions
CN114821265B (en)
Inventor
强俊
刘无纪
管萍
李习习
杜云龙
肖光磊
吴维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Polytechnic University
Original Assignee
Anhui Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Polytechnic University filed Critical Anhui Polytechnic University
Priority to CN202210330727.0A priority Critical patent/CN114821265B/en
Publication of CN114821265A publication Critical patent/CN114821265A/en
Application granted granted Critical
Publication of CN114821265B publication Critical patent/CN114821265B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/17Terrestrial scenes taken from planes or by drones
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

In the optimized SSD detection model training method and the small target detection method provided by one or more embodiments of the present specification, the VGG16 network in the SSD detection model is used as the main backbone network and, exploiting the strong feature-information expression capability of residual networks, a ResNet50 network is introduced as an auxiliary backbone network to improve the feature expression capability of the network. Feature fusion is performed on the convolution data obtained from the VGG16 network and the convolution data obtained from the ResNet50 network, and the fused convolution data are convolved again until a preset number of times is reached; the fused data are then input into the subsequent network of the SSD for classification and detection to obtain an improved SSD detection model, which is tested until the loss function tends to be stable, finally yielding an optimized SSD detection model. In this way, the detection capability of the SSD network for small targets can be improved while a high detection speed is maintained.

Description

Optimized SSD detection model training method and small target detection method
Technical Field
One or more embodiments of the present disclosure relate to the field of machine vision and image processing technologies, and in particular, to an optimized SSD detection model training method and a small target detection method.
Background
In recent years, unmanned aerial vehicle detection technology has begun to be widely applied in real traffic scenes, and vehicle and pedestrian detection, as an important component of this technology, has important research significance.
To date, target detection methods can generally be classified into traditional machine learning methods and deep learning methods. Traditional machine learning methods are all based on manually designed features, and the feature extraction process is overly complex, so these methods often suffer from poor generalization capability, low detection speed, low detection accuracy and the like, and are difficult to adapt to detection tasks in different scenes.
At present, target detection methods based on deep convolutional networks are generally divided into Two-Stage detection methods and Single-Stage detection methods. Among Two-Stage detection methods, Faster R-CNN performs best: it introduces a Region Proposal Network (RPN) that simultaneously predicts the target boundary and a target score at each position and generates high-quality region proposals through end-to-end training, thereby improving the detection accuracy of the network; however, the Faster R-CNN network has a complex structure and a very slow detection speed. In view of this efficiency problem, Single-Stage methods were subsequently proposed, represented by YOLO and SSD (Single Shot MultiBox Detector). YOLO uses a fully connected layer, which may cause problems such as loss of spatial information, positioning error, and missed target detection; its detection effect on small targets in particular is poor, which affects the final detection accuracy. The SSD borrows the anchor idea from Faster R-CNN and uses multiple feature maps of different scales for detection; since the feature maps have different receptive fields, the SSD adapts well to targets of different sizes, but the semantic information of its shallow feature maps is poor, so its detection effect on small targets is not ideal.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide an optimized SSD detection model training method and a small target detection method, which can effectively improve the detection accuracy of small target detection.
In view of the above, one or more embodiments of the present specification provide an optimized SSD detection model training method for detecting small targets, including:
acquiring a historical image;
dividing historical images into a training set and a test set;
respectively inputting the historical images in the training set into a VGG16 network and a ResNet50 network for convolution to respectively obtain VGG16 network image primary convolution data and ResNet50 network image primary convolution data;
performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to execute the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
As an alternative embodiment, the preset convolution number is three times.
As an optional implementation, feature fusion is performed on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data in an element addition manner:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
As another embodiment of the present invention, a small target detection method for an aerial image of an unmanned aerial vehicle is provided, including:
acquiring an aerial photography historical image of the unmanned aerial vehicle;
training an initial SSD detection model by using the aerial photography historical image of the unmanned aerial vehicle to obtain a trained optimized SSD detection model, wherein the initial SSD detection model takes a VGG16 network model as a main trunk and takes a ResNet50 network as an auxiliary trunk;
acquiring an aerial real-time image of the unmanned aerial vehicle;
and inputting the aerial real-time image of the unmanned aerial vehicle into the optimized SSD detection model so as to identify a small target in the aerial real-time image of the unmanned aerial vehicle.
As an optional implementation manner, the training of the initial SSD detection model by using the unmanned aerial vehicle aerial photography historical images to obtain the trained optimized SSD detection model includes:
dividing the unmanned aerial vehicle aerial photography historical image into a training set and a test set;
respectively inputting the unmanned aerial vehicle aerial photography historical images in the training set into a VGG16 network and a ResNet50 network for convolution, and respectively obtaining primary convolution data of VGG16 network images and primary convolution data of ResNet50 network images;
performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to execute the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
As an optional implementation manner, the unmanned aerial vehicle aerial photography historical image is subjected to data enhancement, and an image set obtained after enhancement is divided into a training set and a test set.
As an alternative embodiment, the data enhancement includes rotation, translation, cropping, and brightness adjustment.
As an alternative embodiment, the preset convolution number is three times.
As an optional implementation, feature fusion is performed on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data in an element addition manner:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
From the above description, it can be seen that in the optimized SSD detection model training method and the small target detection method provided by one or more embodiments of the present specification, the VGG16 network in the SSD detection model is used as the main backbone network and, exploiting the strong feature-information expression capability of the residual network (the ResNet50 network), the ResNet50 network is introduced as an auxiliary backbone network to improve the feature expression capability of the network. Feature fusion is performed on the convolution data obtained after convolution of the VGG16 network and the convolution data obtained after convolution of the ResNet50 network, and the fused convolution data are convolved again until the preset number of times is reached; the fused data are then input into the subsequent network of the SSD for classification and detection to obtain an improved SSD detection model, which is tested until the loss function tends to be stable, finally yielding an optimized SSD detection model. In this way, the detection capability of the SSD network for small targets can be improved while a high detection speed is maintained.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions of the present specification, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only one or more embodiments of the present specification, and that other drawings may be obtained by those skilled in the art without inventive effort from these drawings.
FIG. 1 is a logic diagram of an optimized SSD detection model training method in accordance with one or more embodiments of the present disclosure;
FIG. 2 is a logic diagram of a small target detection method in accordance with one or more embodiments of the present disclosure;
FIG. 3 is a schematic diagram of dual backbone network feature fusion in accordance with one or more embodiments of the present disclosure;
FIG. 4 is a schematic diagram of an improved SSD network model diagram in accordance with one or more embodiments of the present disclosure;
fig. 5-7 are diagrams illustrating examples of the detection effect according to one or more embodiments of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure is further described in detail below with reference to specific embodiments.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
As an embodiment of the present invention, there is provided an optimized SSD detection model training method for detecting small targets, including:
acquiring a historical image;
dividing historical images into a training set and a test set;
respectively inputting the historical images in the training set into a VGG16 network and a ResNet50 network for convolution to respectively obtain VGG16 network image primary convolution data and ResNet50 network image primary convolution data;
performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to execute the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
In the embodiment of the invention, the VGG16 network in the SSD detection model is used as the main backbone network and, exploiting the strong feature-information expression capability of the residual network (the ResNet50 network), the ResNet50 network is introduced as an auxiliary backbone network to improve the feature expression capability of the network. Feature fusion is performed on the convolution data obtained after convolution of the VGG16 network and the convolution data obtained after convolution of the ResNet50 network, and the fused data are convolved again until the preset number of times is reached; the fused data are then input into the subsequent network of the SSD for classification and detection to obtain an improved SSD detection model, which is tested until the loss function tends to be stable, finally yielding an optimized SSD detection model. The detection capability of the SSD network for small targets can thus be improved while a high detection speed is maintained.
As an embodiment of the present invention, as shown in fig. 1, there is provided an optimized SSD detection model training method for detecting small targets, including:
s100, acquiring a historical image;
s110, dividing the historical image into a training set and a test set;
s120, respectively inputting the historical images in the training set into a VGG16 network and a ResNet50 network for convolution, and respectively obtaining primary convolution data of VGG16 network images and primary convolution data of ResNet50 network images;
s130, performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
optionally, feature fusion is performed on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data in an element addition manner:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
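For illustration only, this element-addition fusion with a 1 × 1 channel-adjustment convolution could be sketched in PyTorch as follows; the module name, the channel count, and the assumption that the two feature maps already share the same spatial size and channel count are illustrative choices of this sketch, not details specified by this embodiment.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Element-wise addition of a VGG16 feature map (F_l) and a
    ResNet50 feature map (F_a), followed by a 1x1 convolution
    (epsilon) whose output F_OUT feeds the next VGG16 layer."""

    def __init__(self, channels: int):
        super().__init__()
        # epsilon: 1x1 convolution adjusting F_out to F_OUT
        self.epsilon = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_l: torch.Tensor, f_a: torch.Tensor) -> torch.Tensor:
        f_out = f_l + f_a            # F_out = F_l (+) F_a
        return self.epsilon(f_out)   # F_OUT = epsilon(F_out)

# usage sketch: two 512-channel 38x38 feature maps (shapes assumed)
f_l = torch.randn(1, 512, 38, 38)   # VGG16 output feature
f_a = torch.randn(1, 512, 38, 38)   # ResNet50 output feature
fused = FusionBlock(512)(f_l, f_a)  # F_OUT, same shape as the inputs
```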
S140, returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
optionally, the preset convolution number is three. Experiments show that three rounds of convolution and fusion already achieve an ideal small-target detection effect while avoiding excessive computation, thereby maximizing efficiency.
S150, inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and S160, testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities.
Corresponding to the optimized SSD detection model training method for detecting the small target, the embodiment of the invention also provides a small target detection method for the aerial image of the unmanned aerial vehicle, which comprises the following steps:
acquiring an aerial photography historical image of the unmanned aerial vehicle;
training an initial SSD detection model by using the aerial photography historical image of the unmanned aerial vehicle to obtain a trained optimized SSD detection model, wherein the initial SSD detection model takes a VGG16 network model as a main trunk and takes a ResNet50 network as an auxiliary trunk;
acquiring an aerial real-time image of the unmanned aerial vehicle;
and inputting the aerial real-time image of the unmanned aerial vehicle into the optimized SSD detection model so as to identify a small target in the aerial real-time image of the unmanned aerial vehicle.
In the embodiment of the invention, an SSD detection model that takes the VGG16 network model as the main backbone and the ResNet50 network as the auxiliary backbone is introduced. First, the SSD detection model exploits the strong feature-information expression capability of the residual network (the ResNet50 network) and uses it as the auxiliary backbone to improve the feature expression capability of the network; second, the dual-backbone feature fusion technique increases the bottom-layer feature information in the main backbone and enriches the feature information of small targets; third, the backbone of the SSD target detection network is replaced with the fused dual backbone, improving the detection effect of the SSD target detection network on small targets. The SSD detection model is therefore trained with unmanned aerial vehicle aerial photography historical images, and after the optimized SSD detection model is obtained, small targets can be identified accurately and quickly from unmanned aerial vehicle aerial real-time images.
The embodiment of the invention also provides a small target detection method for the aerial image of the unmanned aerial vehicle, which comprises the following steps as shown in fig. 2:
s200, acquiring an aerial photography historical image of the unmanned aerial vehicle;
s210, training an initial SSD detection model by using the unmanned aerial vehicle aerial photography historical image to obtain a trained optimized SSD detection model, wherein the initial SSD detection model takes a VGG16 network model as a main trunk and takes a ResNet50 network as an auxiliary trunk;
optionally, the S210 is configured to:
s211, dividing the unmanned aerial vehicle aerial photography historical images into a training set and a test set;
s212, inputting the aerial photography history images of the unmanned aerial vehicles in the training set into a VGG16 network and a ResNet50 network respectively for convolution, and obtaining primary convolution data of VGG16 network images and primary convolution data of ResNet50 network images respectively;
s213, performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
as an optional implementation, feature fusion is performed on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data in an element addition manner:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
S214, returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
optionally, the preset convolution number is three.
S215, inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
s216, testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with a loss function obtained by previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to the step of inputting the historical images in the training set into a VGG16 network and a ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
S220, acquiring an aerial real-time image of the unmanned aerial vehicle;
s230, inputting the unmanned aerial vehicle aerial real-time image into the optimized SSD detection model so as to identify a small target in the unmanned aerial vehicle aerial real-time image.
As an optional implementation manner, the method for detecting a small target in an aerial image of an unmanned aerial vehicle further includes:
s217, performing data enhancement on the unmanned aerial vehicle aerial photography historical images, and dividing the image set obtained after enhancement into a training set and a test set. S217 is used to replace S211.
Optionally, the data enhancement includes rotation, translation, cropping, and brightness adjustment.
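For illustration only, such an augmentation pipeline could be sketched with torchvision as follows; all magnitudes below (rotation angle, translation fraction, crop scale, brightness factor) are illustrative assumptions of this sketch, not values specified by this embodiment.

```python
from torchvision import transforms

# rotation, translation, cropping and brightness adjustment;
# all parameter values are illustrative assumptions
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                     # rotation
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translation
    transforms.RandomResizedCrop(size=300, scale=(0.8, 1.0)),  # cropping
    transforms.ColorJitter(brightness=0.3),                    # brightness
])
```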
It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present disclosure, and the devices may interact with each other to complete the method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The invention will now be further described with reference to the following examples and drawings:
1. SSD target detection network
Compared with other target detection networks, the SSD target detection network has the advantages of high detection speed and high detection accuracy and can be used for real-time target detection, so the embodiment of the invention adopts the SSD target detection network as the reference network.
The method selects an SSD model with VGG16 as the backbone network. VGG16 is a classical network with a depth of 16 layers that uses 3 × 3 convolution kernels of a single size; with the same receptive field, a stack of small convolution kernels performs better than one large convolution kernel and uses fewer parameters, giving a better effect. The SSD method is based on a feed-forward convolutional network that generates a set of prior boxes of fixed size together with scores for the object class instances present in these prior boxes, and then produces the final detection result by non-maximum suppression (NMS). The first few network layers are based on a standard architecture for high-quality image classification and are referred to as the base network. Feature extraction layers conv8_2, conv9_2, conv10_2 and conv11_2 are added on top of the base network. To ensure that the network has a good detection effect on small targets, detection is performed not only on the added feature maps but also on the base network feature maps conv4_3 and conv7, as laid out in the sketch below.
The SSD designs prior boxes of different numbers, sizes and aspect ratios for each feature map; the prior boxes consist of a series of target boxes of fixed number and size generated by a certain rule. The specific size of a prior box is determined by its scale and aspect ratio; each layer of feature map corresponds to one scale, generated according to formula (1):
$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m] \qquad (1)$$
where $s_k$ represents the scale of the prior box in the k-th feature map, $s_{min}$ is 0.2, $s_{max}$ is 0.9, and $m$ represents the number of feature maps used for detection; in the SSD, the value of $m$ is 6. Prior boxes of different numbers and sizes are set for each grid cell on each layer of feature map: each cell on the three feature maps Conv4_3, Conv10_2 and Conv11_2 yields 4 prior boxes, with aspect ratio $a_{r1}$ taken from {1, 2, 1/2}; each cell on the three feature maps Conv7, Conv8_2 and Conv9_2 yields 6 prior boxes, with aspect ratio $a_{r2}$ taken from {1, 2, 1/2, 3, 1/3}. After the scale and aspect ratio are determined, the prior box size can be derived from formulas (2) and (3).
$$w_k^a = s_k \sqrt{a_r} \qquad (2)$$

$$h_k^a = s_k / \sqrt{a_r} \qquad (3)$$

where $w_k^a$ and $h_k^a$ are the width and height of the prior box, respectively, and $a_r$ is taken as $a_{r1}$ or $a_{r2}$. For the prior box with an aspect ratio of 1, an additional scale $s'_k$ is added, calculated as shown in formula (4):

$$s'_k = \sqrt{s_k s_{k+1}} \qquad (4)$$
in the SSD, the number of the prior frames of the 1 st detection layer is 38 × 38 × 4 ═ 5776, the 2 nd layer is 19 × 19 × 6 ═ 2166, the 3 rd layer is 10 × 10 × 6 ═ 600, the 4 th layer is 5 × 5 × 6 ═ 150, the 5 th layer is 3 × 3 × 4 ═ 36, the 6 th layer is 1 × 1 × 4 ═ 4, and the total network output is 5776+2166+600+150+36+4 ═ 8732 prior frames.
2. Dual backbone network
Targets in unmanned aerial vehicle images are mostly small targets, which suffer from severe blurring and texture distortion; their features are not obvious, some networks struggle to extract key feature information, and the recognition capability of the classifier is affected. Therefore, a composite backbone network combining two common backbones is proposed, as shown in fig. 3 and 4. ResNet50, which preserves bottom-layer detail information, is selected as the auxiliary backbone network; while the main backbone network VGG16 is kept unchanged, the bottom-layer features extracted by ResNet50 are fused layer by layer into it, and the feature layers obtained after fusion replace those of the original backbone as new feature layers for the next convolution.
In the auxiliary backbone, the result of each stage can be viewed as a higher-level feature. The output of each feature level becomes part of the input of the main backbone and flows into the parallel stage of the subsequent backbone. In this way, multiple high-level and low-level features are fused to generate a richer feature representation. This process can be expressed as:
$$F_{out} = F_l \oplus F_a \qquad (5)$$

$$F_{OUT} = \varepsilon(F_{out}) \qquad (6)$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the main backbone at the current stage, $F_a$ represents the output feature of the auxiliary backbone, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the main backbone. The adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel. In theory, this composite connection can be applied at any backbone layer. Experiments with the most basic and most useful composite connection show that the proposed method is not limited by feature size; for simplicity of operation, the 150 × 150, 75 × 75 and 38 × 38 feature layers on the backbone are chosen, corresponding to the outputs of three layers of ResNet50.
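For illustration only, this layer-by-layer composite connection could be sketched as follows, assuming the main (VGG16) and auxiliary (ResNet50) backbones have been split into aligned stages whose outputs match in spatial size and channel count at the fused resolutions; the stage splitting and the channel matching are assumptions of this sketch, since in practice a projection would be needed wherever the two backbones' channel counts differ.

```python
import torch.nn as nn

class CompositeBackbone(nn.Module):
    """Dual-backbone fusion: the output of each auxiliary (ResNet50)
    stage is added element-wise to the parallel main (VGG16) stage
    output and adjusted by a 1x1 convolution (epsilon) before it
    replaces the feature layer fed to the next main stage."""

    def __init__(self, main_stages, aux_stages, channels):
        super().__init__()
        self.main_stages = nn.ModuleList(main_stages)
        self.aux_stages = nn.ModuleList(aux_stages)
        # one epsilon (1x1 convolution) per fused stage
        self.epsilons = nn.ModuleList(
            [nn.Conv2d(c, c, kernel_size=1) for c in channels])

    def forward(self, x):
        f_main, f_aux = x, x
        for main, aux, eps in zip(self.main_stages,
                                  self.aux_stages, self.epsilons):
            f_main = main(f_main)         # F_l: main-stage output
            f_aux = aux(f_aux)            # F_a: auxiliary-stage output
            f_main = eps(f_main + f_aux)  # F_OUT, formulas (5) and (6)
        return f_main
```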
3. Analysis of experiments
In order to verify the effectiveness of the algorithm, it is validated on the VisDrone2019 unmanned aerial vehicle aerial photography dataset.
To verify the effectiveness of the improvements to the SSD, the original SSD and the improved SSD were each evaluated on the dataset. The experimental results are shown in Table 1 and figs. 5-7; as the table shows, the accuracy of the improved SSD is improved to some extent compared with the original SSD, because the improved SSD has better network feature extraction capability.
TABLE 1 SSD and modified SSD comparative experiments
[Table 1 is provided as an image in the original publication and is not reproduced here.]
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more pieces of software and/or hardware in implementing one or more embodiments of the present description.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments of the present description as described above, which are not provided in detail for the sake of brevity.

Claims (9)

1. An optimized SSD detection model training method for detecting small targets, comprising:
acquiring a historical image;
dividing historical images into a training set and a test set;
respectively inputting the historical images in the training set into a VGG16 network and a ResNet50 network for convolution to respectively obtain VGG16 network image primary convolution data and ResNet50 network image primary convolution data;
performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to execute the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
2. The optimized SSD detection model training method for detecting small targets of claim 1, wherein the preset number of convolutions is three.
3. The method of claim 1, wherein the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data are feature fused by element addition:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
4. A small target detection method for aerial images of unmanned aerial vehicles is characterized by comprising the following steps:
acquiring an aerial photography historical image of the unmanned aerial vehicle;
training an initial SSD detection model by using the aerial photography historical image of the unmanned aerial vehicle to obtain a trained optimized SSD detection model, wherein the initial SSD detection model takes a VGG16 network model as a main trunk and takes a ResNet50 network as an auxiliary trunk;
acquiring an aerial real-time image of the unmanned aerial vehicle;
and inputting the aerial real-time image of the unmanned aerial vehicle into the optimized SSD detection model so as to identify a small target in the aerial real-time image of the unmanned aerial vehicle.
5. The method for detecting small objects in the aerial images of the unmanned aerial vehicle as claimed in claim 4, wherein training the initial SSD detection model by using the unmanned aerial vehicle aerial photography historical images to obtain the trained optimized SSD detection model comprises:
dividing the unmanned aerial vehicle aerial photography historical image into a training set and a test set;
inputting the aerial photography history images of the unmanned aerial vehicles in the training set into a VGG16 network and a ResNet50 network respectively for convolution, and obtaining primary convolution data of VGG16 network images and primary convolution data of ResNet50 network images respectively;
performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to execute the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
6. The method for detecting the small targets in the aerial image of the unmanned aerial vehicle as claimed in claim 5, further comprising performing data enhancement on the aerial historical image of the unmanned aerial vehicle, and dividing an image set obtained after enhancement into a training set and a test set.
7. The method of claim 6, wherein the data enhancement comprises rotation, translation, cropping, and brightness adjustment.
8. The method for detecting small objects in the aerial image of an unmanned aerial vehicle of claim 5, wherein the preset convolution number is three.
9. The method for detecting the small target of the aerial image of the unmanned aerial vehicle of claim 5, wherein the feature fusion is performed on the primary convolution data of the VGG16 network image and the primary convolution data of the ResNet50 network image in an element addition manner:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
CN202210330727.0A 2022-03-30 2022-03-30 Training method for optimizing SSD detection model and small target detection method Active CN114821265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210330727.0A CN114821265B (en) 2022-03-30 2022-03-30 Training method for optimizing SSD detection model and small target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210330727.0A CN114821265B (en) 2022-03-30 2022-03-30 Training method for optimizing SSD detection model and small target detection method

Publications (2)

Publication Number Publication Date
CN114821265A true CN114821265A (en) 2022-07-29
CN114821265B CN114821265B (en) 2024-07-19

Family

ID=82533191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210330727.0A Active CN114821265B (en) 2022-03-30 2022-03-30 Training method for optimizing SSD detection model and small target detection method

Country Status (1)

Country Link
CN (1) CN114821265B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704866A (en) * 2017-06-15 2018-02-16 清华大学 Multitask scene semantic understanding model based on a new neural network and its application
CN109492674A (en) * 2018-10-19 2019-03-19 北京京东尚科信息技术有限公司 The generation method and device of SSD frame for target detection
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD
CN112270347A (en) * 2020-10-20 2021-01-26 西安工程大学 Medical waste classification detection method based on improved SSD
JP6980958B1 (en) * 2021-06-23 2021-12-15 中国科学院西北生態環境資源研究院 Rural area classification garbage identification method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704866A (en) * 2017-06-15 2018-02-16 清华大学 Multitask scene semantic understanding model based on a new neural network and its application
CN109492674A (en) * 2018-10-19 2019-03-19 北京京东尚科信息技术有限公司 The generation method and device of SSD frame for target detection
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD
CN112270347A (en) * 2020-10-20 2021-01-26 西安工程大学 Medical waste classification detection method based on improved SSD
JP6980958B1 (en) * 2021-06-23 2021-12-15 中国科学院西北生態環境資源研究院 Rural area classification garbage identification method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
肖体刚; 蔡乐才; 汤科元; 高祥; 张超洋: "Safety helmet wearing detection method based on improved SSD" (改进SSD的安全帽佩戴检测方法), Journal of Sichuan University of Science & Engineering (Natural Science Edition), no. 04, 20 August 2020 (2020-08-20) *

Also Published As

Publication number Publication date
CN114821265B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
CN111126472B (en) SSD (solid State disk) -based improved target detection method
CN107169421B (en) Automobile driving scene target detection method based on deep convolutional neural network
CN109472298B (en) Deep bidirectional feature pyramid enhanced network for small-scale target detection
CN110414377B (en) Remote sensing image scene classification method based on scale attention network
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
CN110119728A (en) Remote sensing images cloud detection method of optic based on Multiscale Fusion semantic segmentation network
CN111460914A (en) Pedestrian re-identification method based on global and local fine-grained features
CN112365514A (en) Semantic segmentation method based on improved PSPNet
CN111626200A (en) Multi-scale target detection network and traffic identification detection method based on Libra R-CNN
CN112686304A (en) Target detection method and device based on attention mechanism and multi-scale feature fusion and storage medium
CN110633727A (en) Deep neural network ship target fine-grained identification method based on selective search
CN113610024B (en) Multi-strategy deep learning remote sensing image small target detection method
CN113807188A (en) Unmanned aerial vehicle target tracking method based on anchor frame matching and Simese network
CN111553414A (en) In-vehicle lost object detection method based on improved Faster R-CNN
CN110706235A (en) Far infrared pedestrian detection method based on two-stage cascade segmentation
CN112183649A (en) Algorithm for predicting pyramid feature map
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN111915558A (en) Pin state detection method for high-voltage transmission line
CN112101113B (en) Lightweight unmanned aerial vehicle image small target detection method
CN114067126A (en) Infrared image target detection method
CN116363361A (en) Automatic driving method based on real-time semantic segmentation network
CN113469287A (en) Spacecraft multi-local component detection method based on instance segmentation network
CN111832508B (en) DIE _ GA-based low-illumination target detection method
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens
Lin et al. Traffic sign detection algorithm based on improved YOLOv4

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant