CN114821265A - Optimized SSD detection model training method and small target detection method - Google Patents

Optimized SSD detection model training method and small target detection method

Info

Publication number
CN114821265A
Authority
CN
China
Prior art keywords
network
convolution
image
data
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210330727.0A
Other languages
Chinese (zh)
Other versions
CN114821265B (en)
Inventor
强俊
刘无纪
管萍
李习习
杜云龙
肖光磊
吴维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Polytechnic University
Original Assignee
Anhui Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Polytechnic University filed Critical Anhui Polytechnic University
Priority to CN202210330727.0A priority Critical patent/CN114821265B/en
Publication of CN114821265A publication Critical patent/CN114821265A/en
Application granted granted Critical
Publication of CN114821265B publication Critical patent/CN114821265B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/17Terrestrial scenes taken from planes or by drones
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

In the optimized SSD detection model training method and the small target detection method provided by one or more embodiments of the present specification, the VGG16 network in the SSD detection model is used as the main backbone network and, exploiting the strong feature-information expression capability of residual networks, a ResNet50 network is introduced as an auxiliary backbone network to improve the feature expression capability of the network. Feature fusion is performed on the convolution data obtained from the VGG16 network and the convolution data obtained from the ResNet50 network, and the fused convolution data are convolved again until a preset number of times is reached; the fused data are then input into the subsequent network of the SSD for classification and detection to obtain an improved SSD detection model, which is tested until the loss function tends to be stable, finally yielding an optimized SSD detection model. In this way, the detection capability of the SSD network for small targets can be improved while a high detection speed is maintained.

Description

Optimized SSD detection model training method and small target detection method
Technical Field
One or more embodiments of the present disclosure relate to the field of machine vision and image processing technologies, and in particular, to an optimized SSD detection model training method and a small target detection method.
Background
In recent years, unmanned aerial vehicle detection technology has begun to be widely applied in real traffic scenes, and vehicle and pedestrian detection, as an important component of this technology, has important research significance.
To date, target detection methods can generally be classified into traditional machine learning methods and deep learning methods. Traditional machine learning methods are all based on manually designed features, and the feature extraction process is overly complex, so these methods often suffer from poor generalization capability, low detection speed, low detection accuracy and the like, and are difficult to adapt to detection tasks in different scenes.
At present, target detection methods based on deep convolutional networks are generally divided into Two-Stage detection methods and Single-Stage detection methods. Among Two-Stage detection methods, Faster R-CNN performs best: it introduces a Region Proposal Network (RPN) that simultaneously predicts the target boundary and a target score at each position and generates high-quality region proposals through end-to-end training, thereby improving the detection accuracy of the network; however, the Faster R-CNN network has a complex structure and a very slow detection speed. In view of this efficiency problem, Single-Stage methods were subsequently proposed, represented by YOLO and SSD (Single Shot MultiBox Detector). YOLO uses a fully connected layer, which may cause problems such as loss of spatial information, positioning error, and missed target detection; its detection effect on small targets in particular is poor, which affects the final detection accuracy. The SSD borrows the anchor idea from Faster R-CNN and uses multiple feature maps of different scales for detection; since the feature maps have different receptive fields, the SSD adapts well to targets of different sizes, but the semantic information of its shallow feature maps is poor, so its detection effect on small targets is not ideal.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide an optimized SSD detection model training method and a small target detection method, which can effectively improve the detection accuracy of small target detection.
In view of the above, one or more embodiments of the present specification provide an optimized SSD detection model training method for detecting small targets, including:
acquiring a historical image;
dividing historical images into a training set and a test set;
respectively inputting the historical images in the training set into a VGG16 network and a ResNet50 network for convolution to respectively obtain VGG16 network image primary convolution data and ResNet50 network image primary convolution data;
performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to execute the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
As an alternative embodiment, the preset convolution number is three times.
As an optional implementation, feature fusion is performed on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data in an element addition manner:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
As another embodiment of the present invention, a small target detection method for an aerial image of an unmanned aerial vehicle is provided, including:
acquiring an aerial photography historical image of the unmanned aerial vehicle;
training an initial SSD detection model by using the aerial photography historical image of the unmanned aerial vehicle to obtain a trained optimized SSD detection model, wherein the initial SSD detection model takes a VGG16 network model as a main trunk and takes a ResNet50 network as an auxiliary trunk;
acquiring an aerial real-time image of the unmanned aerial vehicle;
and inputting the aerial real-time image of the unmanned aerial vehicle into the optimized SSD detection model so as to identify a small target in the aerial real-time image of the unmanned aerial vehicle.
As an optional implementation manner, the training of the initial SSD detection model by using the unmanned aerial vehicle aerial photography historical images to obtain the trained optimized SSD detection model includes:
dividing the unmanned aerial vehicle aerial photography historical image into a training set and a test set;
respectively inputting the unmanned aerial vehicle aerial photography historical images in the training set into a VGG16 network and a ResNet50 network for convolution, and respectively obtaining primary convolution data of VGG16 network images and primary convolution data of ResNet50 network images;
performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to execute the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
As an optional implementation manner, the unmanned aerial vehicle aerial photography historical image is subjected to data enhancement, and an image set obtained after enhancement is divided into a training set and a test set.
As an alternative embodiment, the data enhancement includes rotation, translation, cropping, and brightness adjustment.
As an alternative embodiment, the preset convolution number is three times.
As an optional implementation, feature fusion is performed on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data in an element addition manner:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
From the above description, it can be seen that in the optimized SSD detection model training method and the small target detection method provided by one or more embodiments of the present specification, the VGG16 network in the SSD detection model is used as the main backbone network and, exploiting the strong feature-information expression capability of the residual network (the ResNet50 network), the ResNet50 network is introduced as an auxiliary backbone network to improve the feature expression capability of the network. Feature fusion is performed on the convolution data obtained after convolution of the VGG16 network and the convolution data obtained after convolution of the ResNet50 network, and the fused convolution data are convolved again until the preset number of times is reached; the fused data are then input into the subsequent network of the SSD for classification and detection to obtain an improved SSD detection model, which is tested until the loss function tends to be stable, finally yielding an optimized SSD detection model. In this way, the detection capability of the SSD network for small targets can be improved while a high detection speed is maintained.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions of the present specification, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only one or more embodiments of the present specification, and that other drawings may be obtained by those skilled in the art without inventive effort from these drawings.
FIG. 1 is a logic diagram of an optimized SSD detection model training method in accordance with one or more embodiments of the present disclosure;
FIG. 2 is a logic diagram of a small target detection method in accordance with one or more embodiments of the present disclosure;
FIG. 3 is a schematic diagram of dual backbone network feature fusion in accordance with one or more embodiments of the present disclosure;
FIG. 4 is a schematic diagram of an improved SSD network model diagram in accordance with one or more embodiments of the present disclosure;
fig. 5-7 are diagrams illustrating examples of the detection effect according to one or more embodiments of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure is further described in detail below with reference to specific embodiments.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
As an embodiment of the present invention, there is provided an optimized SSD detection model training method for detecting small targets, including:
acquiring a historical image;
dividing historical images into a training set and a test set;
respectively inputting the historical images in the training set into a VGG16 network and a ResNet50 network for convolution to respectively obtain VGG16 network image primary convolution data and ResNet50 network image primary convolution data;
performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to execute the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
In the embodiment of the invention, the VGG16 network in the SSD detection model is used as the main backbone network and, exploiting the strong feature-information expression capability of the residual network (the ResNet50 network), the ResNet50 network is introduced as an auxiliary backbone network to improve the feature expression capability of the network. Feature fusion is performed on the convolution data obtained after convolution of the VGG16 network and the convolution data obtained after convolution of the ResNet50 network, and the fused data are convolved again until the preset number of times is reached; the fused data are then input into the subsequent network of the SSD for classification and detection to obtain an improved SSD detection model, which is tested until the loss function tends to be stable, finally yielding an optimized SSD detection model. The detection capability of the SSD network for small targets can thus be improved while a high detection speed is maintained.
As an embodiment of the present invention, as shown in fig. 1, there is provided an optimized SSD detection model training method for detecting small targets, including:
s100, acquiring a historical image;
s110, dividing the historical image into a training set and a test set;
s120, respectively inputting the historical images in the training set into a VGG16 network and a ResNet50 network for convolution, and respectively obtaining primary convolution data of VGG16 network images and primary convolution data of ResNet50 network images;
s130, performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
optionally, feature fusion is performed on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data in an element addition manner:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
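For illustration only, this element-addition fusion with a 1 × 1 channel-adjustment convolution could be sketched in PyTorch as follows; the module name, the channel count, and the assumption that the two feature maps already share the same spatial size and channel count are illustrative choices of this sketch, not details specified by this embodiment.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Element-wise addition of a VGG16 feature map (F_l) and a
    ResNet50 feature map (F_a), followed by a 1x1 convolution
    (epsilon) whose output F_OUT feeds the next VGG16 layer."""

    def __init__(self, channels: int):
        super().__init__()
        # epsilon: 1x1 convolution adjusting F_out to F_OUT
        self.epsilon = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_l: torch.Tensor, f_a: torch.Tensor) -> torch.Tensor:
        f_out = f_l + f_a            # F_out = F_l (+) F_a
        return self.epsilon(f_out)   # F_OUT = epsilon(F_out)

# usage sketch: two 512-channel 38x38 feature maps (shapes assumed)
f_l = torch.randn(1, 512, 38, 38)   # VGG16 output feature
f_a = torch.randn(1, 512, 38, 38)   # ResNet50 output feature
fused = FusionBlock(512)(f_l, f_a)  # F_OUT, same shape as the inputs
```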
S140, returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
optionally, the preset convolution number is three. Experiments show that three rounds of convolution and fusion already achieve an ideal small-target detection effect while avoiding excessive computation, thereby maximizing efficiency.
S150, inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and S160, testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities.
Corresponding to the optimized SSD detection model training method for detecting the small target, the embodiment of the invention also provides a small target detection method for the aerial image of the unmanned aerial vehicle, which comprises the following steps:
acquiring an aerial photography historical image of the unmanned aerial vehicle;
training an initial SSD detection model by using the aerial photography historical image of the unmanned aerial vehicle to obtain a trained optimized SSD detection model, wherein the initial SSD detection model takes a VGG16 network model as a main trunk and takes a ResNet50 network as an auxiliary trunk;
acquiring an aerial real-time image of the unmanned aerial vehicle;
and inputting the aerial real-time image of the unmanned aerial vehicle into the optimized SSD detection model so as to identify a small target in the aerial real-time image of the unmanned aerial vehicle.
In the embodiment of the invention, an SSD detection model that takes the VGG16 network model as the main backbone and the ResNet50 network as the auxiliary backbone is introduced. First, the SSD detection model exploits the strong feature-information expression capability of the residual network (the ResNet50 network) and uses it as the auxiliary backbone to improve the feature expression capability of the network; second, the dual-backbone feature fusion technique increases the bottom-layer feature information in the main backbone and enriches the feature information of small targets; third, the backbone of the SSD target detection network is replaced with the fused dual backbone, improving the detection effect of the SSD target detection network on small targets. The SSD detection model is therefore trained with unmanned aerial vehicle aerial photography historical images, and after the optimized SSD detection model is obtained, small targets can be identified accurately and quickly from unmanned aerial vehicle aerial real-time images.
The embodiment of the invention also provides a small target detection method for the aerial image of the unmanned aerial vehicle, which comprises the following steps as shown in fig. 2:
s200, acquiring an aerial photography historical image of the unmanned aerial vehicle;
s210, training an initial SSD detection model by using the unmanned aerial vehicle aerial photography historical image to obtain a trained optimized SSD detection model, wherein the initial SSD detection model takes a VGG16 network model as a main trunk and takes a ResNet50 network as an auxiliary trunk;
optionally, the S210 is configured to:
s211, dividing the unmanned aerial vehicle aerial photography historical images into a training set and a test set;
s212, inputting the aerial photography history images of the unmanned aerial vehicles in the training set into a VGG16 network and a ResNet50 network respectively for convolution, and obtaining primary convolution data of VGG16 network images and primary convolution data of ResNet50 network images respectively;
s213, performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
as an optional implementation, feature fusion is performed on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data in an element addition manner:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
S214, returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
optionally, the preset convolution number is three.
S215, inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
s216, testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with a loss function obtained by previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to the step of inputting the historical images in the training set into a VGG16 network and a ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
S220, acquiring an aerial real-time image of the unmanned aerial vehicle;
s230, inputting the unmanned aerial vehicle aerial real-time image into the optimized SSD detection model so as to identify a small target in the unmanned aerial vehicle aerial real-time image.
As an optional implementation manner, the method for detecting a small target in an aerial image of an unmanned aerial vehicle further includes:
s217, performing data enhancement on the unmanned aerial vehicle aerial photography historical images, and dividing the image set obtained after enhancement into a training set and a test set. S217 is used to replace S211.
Optionally, the data enhancement includes rotation, translation, cropping, and brightness adjustment.
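For illustration only, such an augmentation pipeline could be sketched with torchvision as follows; all magnitudes below (rotation angle, translation fraction, crop scale, brightness factor) are illustrative assumptions of this sketch, not values specified by this embodiment.

```python
from torchvision import transforms

# rotation, translation, cropping and brightness adjustment;
# all parameter values are illustrative assumptions
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                     # rotation
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translation
    transforms.RandomResizedCrop(size=300, scale=(0.8, 1.0)),  # cropping
    transforms.ColorJitter(brightness=0.3),                    # brightness
])
```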
It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present disclosure, and the devices may interact with each other to complete the method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The invention will now be further described with reference to the following examples and drawings:
1. SSD target detection network
Compared with other target detection networks, the SSD target detection network has the advantages of high detection speed and high detection accuracy and can be used for real-time target detection, so the embodiment of the invention adopts the SSD target detection network as the reference network.
The method selects an SSD model with VGG16 as the backbone network. VGG16 is a classical network with a depth of 16 layers that uses 3 × 3 convolution kernels of a single size; with the same receptive field, a stack of small convolution kernels performs better than one large convolution kernel and uses fewer parameters, giving a better effect. The SSD method is based on a feed-forward convolutional network that generates a set of prior boxes of fixed size together with scores for the object class instances present in these prior boxes, and then produces the final detection result by non-maximum suppression (NMS). The first few network layers are based on a standard architecture for high-quality image classification and are referred to as the base network. Feature extraction layers conv8_2, conv9_2, conv10_2 and conv11_2 are added on top of the base network. To ensure that the network has a good detection effect on small targets, detection is performed not only on the added feature maps but also on the base network feature maps conv4_3 and conv7, as laid out in the sketch below.
The SSD designs prior boxes of different numbers, sizes and aspect ratios for each feature map; the prior boxes consist of a series of target boxes of fixed number and size generated by a certain rule. The specific size of a prior box is determined by its scale and aspect ratio; each layer of feature map corresponds to one scale, generated according to formula (1):
$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m] \qquad (1)$$
where $s_k$ represents the scale of the prior box in the k-th feature map, $s_{min}$ is 0.2, $s_{max}$ is 0.9, and $m$ represents the number of feature maps used for detection; in the SSD, the value of $m$ is 6. Prior boxes of different numbers and sizes are set for each grid cell on each layer of feature map: each cell on the three feature maps Conv4_3, Conv10_2 and Conv11_2 yields 4 prior boxes, with aspect ratio $a_{r1}$ taken from {1, 2, 1/2}; each cell on the three feature maps Conv7, Conv8_2 and Conv9_2 yields 6 prior boxes, with aspect ratio $a_{r2}$ taken from {1, 2, 1/2, 3, 1/3}. After the scale and aspect ratio are determined, the prior box size can be derived from formulas (2) and (3).
$$w_k^a = s_k \sqrt{a_r} \qquad (2)$$

$$h_k^a = s_k / \sqrt{a_r} \qquad (3)$$

where $w_k^a$ and $h_k^a$ are the width and height of the prior box, respectively, and $a_r$ is taken as $a_{r1}$ or $a_{r2}$. For the prior box with an aspect ratio of 1, an additional scale $s'_k$ is added, calculated as shown in formula (4):

$$s'_k = \sqrt{s_k s_{k+1}} \qquad (4)$$
in the SSD, the number of the prior frames of the 1 st detection layer is 38 × 38 × 4 ═ 5776, the 2 nd layer is 19 × 19 × 6 ═ 2166, the 3 rd layer is 10 × 10 × 6 ═ 600, the 4 th layer is 5 × 5 × 6 ═ 150, the 5 th layer is 3 × 3 × 4 ═ 36, the 6 th layer is 1 × 1 × 4 ═ 4, and the total network output is 5776+2166+600+150+36+4 ═ 8732 prior frames.
2. Dual backbone network
Targets in unmanned aerial vehicle images are mostly small targets, which suffer from severe blurring and texture distortion; their features are not obvious, some networks struggle to extract key feature information, and the recognition capability of the classifier is affected. Therefore, a composite backbone network combining two common backbones is proposed, as shown in fig. 3 and 4. ResNet50, which preserves bottom-layer detail information, is selected as the auxiliary backbone network; while the main backbone network VGG16 is kept unchanged, the bottom-layer features extracted by ResNet50 are fused layer by layer into it, and the feature layers obtained after fusion replace those of the original backbone as new feature layers for the next convolution.
In the auxiliary backbone, the result of each stage can be viewed as a higher-level feature. The output of each feature level becomes part of the input of the main backbone and flows into the parallel stage of the subsequent backbone. In this way, multiple high-level and low-level features are fused to generate a richer feature representation. This process can be expressed as:
$$F_{out} = F_l \oplus F_a \qquad (5)$$

$$F_{OUT} = \varepsilon(F_{out}) \qquad (6)$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the main backbone at the current stage, $F_a$ represents the output feature of the auxiliary backbone, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the main backbone. The adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel. In theory, this composite connection can be applied at any backbone layer. Experiments with the most basic and most useful composite connection show that the proposed method is not limited by feature size; for simplicity of operation, the 150 × 150, 75 × 75 and 38 × 38 feature layers on the backbone are chosen, corresponding to the outputs of three layers of ResNet50.
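For illustration only, this layer-by-layer composite connection could be sketched as follows, assuming the main (VGG16) and auxiliary (ResNet50) backbones have been split into aligned stages whose outputs match in spatial size and channel count at the fused resolutions; the stage splitting and the channel matching are assumptions of this sketch, since in practice a projection would be needed wherever the two backbones' channel counts differ.

```python
import torch.nn as nn

class CompositeBackbone(nn.Module):
    """Dual-backbone fusion: the output of each auxiliary (ResNet50)
    stage is added element-wise to the parallel main (VGG16) stage
    output and adjusted by a 1x1 convolution (epsilon) before it
    replaces the feature layer fed to the next main stage."""

    def __init__(self, main_stages, aux_stages, channels):
        super().__init__()
        self.main_stages = nn.ModuleList(main_stages)
        self.aux_stages = nn.ModuleList(aux_stages)
        # one epsilon (1x1 convolution) per fused stage
        self.epsilons = nn.ModuleList(
            [nn.Conv2d(c, c, kernel_size=1) for c in channels])

    def forward(self, x):
        f_main, f_aux = x, x
        for main, aux, eps in zip(self.main_stages,
                                  self.aux_stages, self.epsilons):
            f_main = main(f_main)         # F_l: main-stage output
            f_aux = aux(f_aux)            # F_a: auxiliary-stage output
            f_main = eps(f_main + f_aux)  # F_OUT, formulas (5) and (6)
        return f_main
```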
3. Analysis of experiments
In order to verify the effectiveness of the algorithm, it is validated on the VisDrone2019 unmanned aerial vehicle aerial photography dataset.
To verify the effectiveness of the improvements to the SSD, the original SSD and the improved SSD were each evaluated on the dataset. The experimental results are shown in Table 1 and figs. 5-7; as the table shows, the accuracy of the improved SSD is improved to some extent compared with the original SSD, because the improved SSD has better network feature extraction capability.
TABLE 1 SSD and modified SSD comparative experiments
[Table 1 is provided as an image in the original publication and is not reproduced here.]
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more pieces of software and/or hardware in implementing one or more embodiments of the present description.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments of the present description as described above, which are not provided in detail for the sake of brevity.

Claims (9)

1. An optimized SSD detection model training method for detecting small targets, comprising:
acquiring a historical image;
dividing historical images into a training set and a test set;
respectively inputting the historical images in the training set into a VGG16 network and a ResNet50 network for convolution to respectively obtain VGG16 network image primary convolution data and ResNet50 network image primary convolution data;
performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to execute the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
2. The optimized SSD detection model training method for detecting small targets of claim 1, wherein the preset number of convolutions is three.
3. The method of claim 1, wherein the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data are feature fused by element addition:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
4. A small target detection method for aerial images of unmanned aerial vehicles is characterized by comprising the following steps:
acquiring an aerial photography historical image of the unmanned aerial vehicle;
training an initial SSD detection model by using the aerial photography historical image of the unmanned aerial vehicle to obtain a trained optimized SSD detection model, wherein the initial SSD detection model takes a VGG16 network model as a main trunk and takes a ResNet50 network as an auxiliary trunk;
acquiring an aerial real-time image of the unmanned aerial vehicle;
and inputting the aerial real-time image of the unmanned aerial vehicle into the optimized SSD detection model so as to identify a small target in the aerial real-time image of the unmanned aerial vehicle.
5. The method for detecting small objects in the aerial images of the unmanned aerial vehicle as claimed in claim 4, wherein training the initial SSD detection model by using the unmanned aerial vehicle aerial photography historical images to obtain the trained optimized SSD detection model comprises:
dividing the unmanned aerial vehicle aerial photography historical image into a training set and a test set;
inputting the aerial photography history images of the unmanned aerial vehicles in the training set into a VGG16 network and a ResNet50 network respectively for convolution, and obtaining primary convolution data of VGG16 network images and primary convolution data of ResNet50 network images respectively;
performing feature fusion on the VGG16 network image primary convolution data and the ResNet50 network image primary convolution data to obtain fused VGG16 network image primary fusion data;
returning the VGG16 network image primary fusion data and the ResNet50 network image primary convolution data to execute the step of respectively inputting the data into a VGG16 network and a ResNet50 network for convolution until the preset convolution times are reached, and obtaining VGG16 network image fusion data;
inputting the VGG16 network image fusion data into a subsequent network of the SSD for classification and detection to obtain an improved SSD detection model;
and testing the improved SSD detection model by using the test set, calculating a loss function, comparing the loss function with the loss function obtained by the previous training, and if the loss function is smaller than the loss function obtained by the previous training, returning to execute the step of inputting the historical images in the training set into the VGG16 network and the ResNet50 network respectively for convolution until the loss function tends to be stable, wherein the obtained SSD detection model is the optimized SSD detection model.
6. The method for detecting the small targets in the aerial image of the unmanned aerial vehicle as claimed in claim 5, further comprising performing data enhancement on the aerial historical image of the unmanned aerial vehicle, and dividing an image set obtained after enhancement into a training set and a test set.
7. The method of claim 6, wherein the data enhancement comprises rotation, translation, cropping, and brightness adjustment.
8. The method for detecting small objects in the aerial image of an unmanned aerial vehicle of claim 5, wherein the preset convolution number is three.
9. The method for detecting the small target of the aerial image of the unmanned aerial vehicle of claim 5, wherein the feature fusion is performed on the primary convolution data of the VGG16 network image and the primary convolution data of the ResNet50 network image in an element addition manner:
$$F_{out} = F_l \oplus F_a$$

$$F_{OUT} = \varepsilon(F_{out})$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output feature of the VGG16 network, $F_a$ represents the output feature of the ResNet50 network, and $F_{out}$ represents the feature fusion result; $F_{OUT}$ is used as the input value for the next layer of the VGG16 network, and the adjustment from $F_{out}$ to $F_{OUT}$ is a convolution operation $\varepsilon$ with a 1 × 1 kernel.
CN202210330727.0A 2022-03-30 2022-03-30 Training method for optimizing SSD detection model and small target detection method Active CN114821265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210330727.0A CN114821265B (en) 2022-03-30 2022-03-30 Training method for optimizing SSD detection model and small target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210330727.0A CN114821265B (en) 2022-03-30 2022-03-30 Training method for optimizing SSD detection model and small target detection method

Publications (2)

Publication Number Publication Date
CN114821265A true CN114821265A (en) 2022-07-29
CN114821265B CN114821265B (en) 2024-07-19

Family

ID=82533191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210330727.0A Active CN114821265B (en) 2022-03-30 2022-03-30 Training method for optimizing SSD detection model and small target detection method

Country Status (1)

Country Link
CN (1) CN114821265B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704866A (en) * 2017-06-15 2018-02-16 清华大学 Multitask scene semantic understanding model based on a new neural network and its application
CN109492674A (en) * 2018-10-19 2019-03-19 北京京东尚科信息技术有限公司 The generation method and device of SSD frame for target detection
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD
CN112270347A (en) * 2020-10-20 2021-01-26 西安工程大学 Medical waste classification detection method based on improved SSD
JP6980958B1 (en) * 2021-06-23 2021-12-15 中国科学院西北生態環境資源研究院 Rural area classification garbage identification method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704866A (en) * 2017-06-15 2018-02-16 清华大学 Multitask scene semantic understanding model based on a new neural network and its application
CN109492674A (en) * 2018-10-19 2019-03-19 北京京东尚科信息技术有限公司 The generation method and device of SSD frame for target detection
CN111126472A (en) * 2019-12-18 2020-05-08 南京信息工程大学 Improved target detection method based on SSD
CN112270347A (en) * 2020-10-20 2021-01-26 西安工程大学 Medical waste classification detection method based on improved SSD
JP6980958B1 (en) * 2021-06-23 2021-12-15 中国科学院西北生態環境資源研究院 Rural area classification garbage identification method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
肖体刚; 蔡乐才; 汤科元; 高祥; 张超洋: "Safety helmet wearing detection method based on improved SSD" (改进SSD的安全帽佩戴检测方法), Journal of Sichuan University of Science & Engineering (Natural Science Edition), no. 04, 20 August 2020 (2020-08-20) *

Also Published As

Publication number Publication date
CN114821265B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
CN111126472B (en) SSD (solid State disk) -based improved target detection method
CN107169421B (en) Automobile driving scene target detection method based on deep convolutional neural network
CN109472298B (en) Deep bidirectional feature pyramid enhanced network for small-scale target detection
CN110414377B (en) Remote sensing image scene classification method based on scale attention network
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
CN110119728A (en) Remote sensing images cloud detection method of optic based on Multiscale Fusion semantic segmentation network
CN111460914A (en) Pedestrian re-identification method based on global and local fine-grained features
CN112365514A (en) Semantic segmentation method based on improved PSPNet
CN111626200A (en) Multi-scale target detection network and traffic identification detection method based on Libra R-CNN
CN112686304A (en) Target detection method and device based on attention mechanism and multi-scale feature fusion and storage medium
CN110633727A (en) Deep neural network ship target fine-grained identification method based on selective search
CN113610024B (en) Multi-strategy deep learning remote sensing image small target detection method
CN113807188A (en) Unmanned aerial vehicle target tracking method based on anchor frame matching and Simese network
CN111553414A (en) In-vehicle lost object detection method based on improved Faster R-CNN
CN110706235A (en) Far infrared pedestrian detection method based on two-stage cascade segmentation
CN112183649A (en) Algorithm for predicting pyramid feature map
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN111915558A (en) Pin state detection method for high-voltage transmission line
CN112101113B (en) Lightweight unmanned aerial vehicle image small target detection method
CN114067126A (en) Infrared image target detection method
CN116363361A (en) Automatic driving method based on real-time semantic segmentation network
CN113469287A (en) Spacecraft multi-local component detection method based on instance segmentation network
CN111832508B (en) DIE _ GA-based low-illumination target detection method
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens
Lin et al. Traffic sign detection algorithm based on improved YOLOv4

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant