CN111126359B - High-definition image small target detection method based on self-encoder and YOLO algorithm - Google Patents

High-definition image small target detection method based on self-encoder and YOLO algorithm

Info

Publication number
CN111126359B
CN111126359B CN202010143805.7A CN202010143805A
Authority
CN
China
Prior art keywords
network
data
yolo
encoder
image
Prior art date
Legal status
Active
Application number
CN202010143805.7A
Other languages
Chinese (zh)
Other versions
CN111126359A (en
Inventor
吴宪云
孙力
李云松
王柯俨
刘凯
雷杰
郭杰
苏丽雪
王康
司鹏辉
Current Assignee
Nanjing Yixin Yiyi Information Technology Co ltd
Xidian University
Original Assignee
Nanjing Yixin Yiyi Information Technology Co ltd
Xidian University
Priority date
Filing date
Publication date
Application filed by Nanjing Yixin Yiyi Information Technology Co ltd, Xidian University filed Critical Nanjing Yixin Yiyi Information Technology Co ltd
Publication of CN111126359A publication Critical patent/CN111126359A/en
Application granted granted Critical
Publication of CN111126359B publication Critical patent/CN111126359B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-definition image small target detection method based on a self-encoder and the YOLO algorithm, and mainly solves the problem that existing methods cannot achieve both accuracy and speed in small target detection on high-definition images. The implementation steps are: 1) acquiring and labeling high-definition images to obtain a training set and a test set; 2) performing data expansion on the labeled training set; 3) generating corresponding Mask data according to the labeling information; 4) building a self-encoder model; 5) training it with the training set; 6) splicing the trained encoding network of the self-encoder with a YOLO-V3 detection network to obtain a hybrid network and training the hybrid network with the training set; 7) performing target detection on the test set with the trained hybrid network. The method reduces the amount of computation in target detection and raises detection speed, improving the detection accuracy for small targets in high-definition images while maintaining detection speed, and can be used for target recognition in aerial images captured by unmanned aerial vehicles.

Description

High-definition image small target detection method based on self-encoder and YOLO algorithm
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a method for detecting small targets in high-definition images, which can be used for target recognition in aerial images captured by unmanned aerial vehicles.
Background Art
Currently, with the development of target detection technology, and especially in recent years, deep-learning-based target detection algorithms such as Fast R-CNN, the SSD series and the YOLO series have been proposed; compared with conventional target detection algorithms, they greatly surpass the conventional algorithms in both accuracy and efficiency. However, current algorithms are optimized on existing data sets such as ImageNet and COCO. In practical applications such as target detection in unmanned aerial vehicle aerial images, because the flying height of the unmanned aerial vehicle is high, the acquired images are large and generally high-definition, and the targets in them are generally small, so target detection on high-definition images is mainly a problem of small target detection.
In target detection, there are two main processing modes for high-definition images: one is down-sampling size scaling, and the other is image cropping, as follows:
Joseph Redmon et al., in the non-patent document "YOLO9000: Better, Faster, Stronger" at the IEEE Conference on Computer Vision and Pattern Recognition, proposed an improvement to the YOLO network that removes the fully connected layer, so that the network can detect input images of different sizes. In their experimental results on the VOC2007+VOC2012 data set, scaling the input images to 288x288 by down-sampling size scaling reaches 91 FPS in speed but only 69.0 mAP in accuracy; if the input images are scaled to 544x544, the speed drops to 40 FPS while the accuracy rises to 78.6 mAP. The experiment shows that target detection on large-size input images inevitably increases the amount of computation and thus reduces detection speed, while down-sampling size scaling loses target spatial information and thus reduces detection accuracy. For small target detection in high-definition images, directly feeding the high-definition image into the network slows detection down even more severely, and detecting after size scaling reduces the feature information of small targets and hence the accuracy.
The second common mode is image cropping: the original high-definition image is cropped into small images, each small image is fed into the network for detection, and the results are merged after detection. Through cropping, the spatial information of the image is not lost, which benefits detection accuracy; however, because one image is cropped into several images, the target detection speed drops by the corresponding multiple.
In summary, how to perform fast and accurate target detection on a high-definition image in practical application becomes a problem to be solved.
Disclosure of Invention
The invention aims to provide a high-definition image small target detection method based on a self-encoder and the YOLO algorithm that overcomes the defects of existing methods, improving the detection accuracy for small targets in high-definition images while keeping the detection speed for high-definition images from dropping.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) Collecting high-definition image data to form a data set, labeling the data set to obtain correct label data, and dividing the data set and the label data into a training set and a test set according to a ratio of 8:2;
(2) Carrying out data expansion on the marked training set;
(3) For each piece of high-definition image data, generating target Mask data of a corresponding image according to the size of the image and the labeling information;
(4) Building a full convolution self-encoder model comprising an encoding network and a decoding network, wherein the encoding network is used for carrying out feature extraction and data compression on a high-definition image, and the decoding network is used for restoring a compressed feature map to an original size;
(5) Sending high-definition image training set data into a full convolution self-encoder model for training to obtain a trained full convolution self-encoder model:
(5a) Initializing the offset of the network to 0, initializing the weight parameters of the network by adopting the Kaiming Gaussian initialization method, and setting the iteration count T1 of the self-encoder according to the size of the high-definition image training set;
(5b) The partition-based mean square error loss function is defined as follows:
Mask-MSE-Loss(y, y_) = (1 / (W·H)) · Σ_{i=1..W} Σ_{j=1..H} [α·Mask(i, j) + β·(1 − Mask(i, j))] · (y(i, j) − y_(i, j))²
wherein Mask-MSE-Loss(y, y_) is the loss function to be calculated; y is the decoder output image; y_ is the input original high-definition image; α is the loss penalty weight of the target area, set to 0.9; β is the penalty weight of the background area, set to 0.1; W is the width of the self-encoder input image; H is the height of the self-encoder input image; Mask(i, j) is the value at the (i, j)-th position of the Mask data in (3);
(5c) Inputting high-definition image training set data into a full convolution self-coding network, carrying out forward propagation to obtain a coded feature map, and recovering the feature map through a decoder;
(5d) Calculating loss values of the input image and the output image by using the partition area-based mean square error loss function defined in the step (5 b);
(5e) Updating the weight and the offset of the full convolution self-encoder by using a back propagation algorithm to finish one iteration of training the full convolution self-encoder;
(5f) Repeating (5c)-(5e) until all T1 iterations of the self-encoder are completed, obtaining a trained full convolution self-encoder;
(6) Splicing the coding network of the trained full-convolution self-encoder with a YOLO-V3 detection network, and training the spliced network:
(6a) Splicing the coding network of the trained full-convolution self-encoder to the front of a YOLO-V3 detection network to form a spliced mixed network;
(6b) Training the spliced hybrid network:
(6b1) Reading parameters of the trained full-convolution self-encoder, initializing the coding network by using the read parameter values, and setting the parameters of the coding network in a non-trainable state;
(6b2) Setting the input image size of the YOLO-V3 network to be the same as the input size of the full-convolution self-encoder network;
(6b3) Downloading the parameters pre-trained on the ImageNet data set from the official YOLO website, initializing the parameters of the YOLO-V3 network with these parameters, and setting the iteration count T2 of the YOLO-V3 network according to the size of the data set acquired in step (1);
(6b4) Sending the high-definition image training set data into the spliced hybrid network for forward propagation to obtain an output detection result;
(6b5) Calculating a loss value between an output detection result and the correct label data marked in the step (1) by using a loss function in a YOLO-V3 algorithm;
(6b6) Updating the weight and the offset of the hybrid network by using a back propagation algorithm according to the loss value, and completing one iteration of training the hybrid network;
(6b7) Repeating (6b4)-(6b6) until all T2 iterations of the YOLO-V3 network are completed, obtaining a trained hybrid network;
(7) Inputting the test set data in step (1) into the trained hybrid network to obtain the final detection result.
Compared with the prior art, the invention has the following advantages:
the invention combines the coding network of the self-encoder with the YOLO-V3 detection network, compresses the high-definition image on the premise of little loss of the target area characteristics through the coding network, and detects the small target of the compressed image through the YOLO-V3 detection network, and the coding network only compresses the background characteristic information and retains the target characteristic information, thereby improving the precision of the small target detection in the high-definition image under the condition of ensuring the detection speed.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a labeling diagram of the high-definition image acquired in the invention;
FIG. 3 is a Mask data diagram generated by labeling information in the present invention;
FIG. 4 is a network architecture of the convolutional auto-encoder of the present invention;
FIG. 5 is a block diagram of an encoder in combination with a YOLO-V3 network according to the present invention;
FIG. 6 is a graph of simulated test results on a test specimen using the present invention;
FIG. 7 is a diagram of the simulated detection result on a test sample of the existing method that down-samples and compresses the high-definition image and then detects with YOLO-V3.
Detailed Description
The embodiments and effects of the present invention are described in further detail below with reference to the accompanying drawings. The embodiment is used for detecting small targets at sewage discharge outfalls in high-definition images captured by an unmanned aerial vehicle.
Referring to fig. 1, the implementation steps of this example include the following:
step 1, collecting high-definition images to obtain a training set and a test set.
Acquiring high-definition image data aerial photographed by an unmanned aerial vehicle, wherein the image width is 1920 pixels, and the image height is 1080 pixels;
performing target annotation on the acquired image data by using a common image annotation tool LabelImg to obtain correct label data, as shown in FIG. 2;
the data set and the label data are divided into a training set and a test set in a ratio of 8.
Step 2, performing data expansion on the labeled data set.
2.1 Respectively carrying out left-right turning, rotation, translation, noise addition, brightness adjustment, contrast adjustment and saturation adjustment on each high-definition image in the acquired unmanned aerial vehicle aerial photography training set;
2.2 Processed image data is added to the original training data set to obtain an expanded training data set.
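A minimal Python sketch of the seven expansions in step 2.1) is given below, assuming PIL images and the torchvision library; the specific angles, offsets and adjustment factors are illustrative assumptions, and the geometric transforms (flipping, rotation, translation) would also require the corresponding bounding-box labels to be transformed, which is omitted here.

```python
# Augmentation sketch for step 2: the seven variants listed in 2.1).
import numpy as np
from PIL import Image
import torchvision.transforms.functional as TF

def augment(img: Image.Image):
    """Return the seven augmented variants of one training image."""
    variants = {
        "flip":       TF.hflip(img),                          # left-right flip
        "rotate":     TF.rotate(img, angle=10),               # small rotation (assumed angle)
        "translate":  TF.affine(img, angle=0, translate=(50, 30), scale=1.0, shear=0),
        "brightness": TF.adjust_brightness(img, 1.3),
        "contrast":   TF.adjust_contrast(img, 1.3),
        "saturation": TF.adjust_saturation(img, 1.3),
    }
    # Additive Gaussian noise via NumPy (not provided by torchvision.functional).
    arr = np.asarray(img).astype(np.float32)
    noisy = np.clip(arr + np.random.normal(0, 10, arr.shape), 0, 255).astype(np.uint8)
    variants["noise"] = Image.fromarray(noisy)
    return variants
```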
Step 3, generating the target Mask data image of the corresponding image.
3.1 ) According to the size and labeling information of the high-definition image captured by unmanned aerial vehicle aerial photography, setting the Mask data image as binary image data whose width and height are the same as those of the captured high-definition image, i.e. 1920 pixels wide and 1080 pixels high, with 1 channel;
3.2 Read the position information of the pixel point in the original image, and set the value of the pixel point corresponding to the Mask data through the position information:
if the pixel point is in the background area, the value of the Mask data corresponding to the pixel position is set as 0,
if the pixel point is in the target area, the value of the corresponding pixel position of the Mask data is set as 1,
the formula is expressed as follows:
Mask(i, j) = 1, if pixel (i, j) is in the target area
Mask(i, j) = 0, if pixel (i, j) is in the background area
wherein (i, j) refers to the pixel at the i-th row and j-th column of the unmanned aerial vehicle aerial image data, and Mask(i, j) is the value of the Mask image data at position (i, j).
The Mask map generated from fig. 2 according to 3.2) is shown in fig. 3.
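A minimal sketch of this Mask generation follows, assuming the LabelImg annotations have already been parsed into (x_min, y_min, x_max, y_max) pixel bounding boxes; the function name and box format are assumptions for illustration.

```python
# Mask generation sketch for step 3: 1 inside any labeled target box, 0 elsewhere.
import numpy as np

def make_mask(boxes, width=1920, height=1080):
    """Binary Mask with the same width/height as the aerial image, single channel."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x_min, y_min, x_max, y_max in boxes:
        mask[y_min:y_max, x_min:x_max] = 1   # target region -> 1, background stays 0
    return mask
```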
Step 4, building a full convolution self-encoder model.
The full convolution self-encoder model comprises an encoding network and a decoding network, wherein the encoding network is used for carrying out feature extraction and data compression on a high-definition image, the decoding network is used for restoring a compressed feature map to an original size, and the building process comprises the following steps:
4.1 Build a coding network:
the coding network comprises 5 convolutional layers, wherein each convolutional layer is connected in series, and the parameters of each convolutional layer are set as follows:
a first layer: the convolution kernel size is 3 × 3, the number is 16, the convolution step size is 1, the activation function is ReLU, and the output feature map size is 1664 × 1664 × 16;
a second layer: the convolution kernel size is 3 × 3, the number is 32, the convolution step size is 2, the activation function is ReLU, and the output feature map size is 832 × 832 × 32;
a third layer: the convolution kernel size is 3 × 3, the number is 64, the convolution step size is 1, the activation function is ReLU, and the output feature map size is 832 × 832 × 64;
a fourth layer: the convolution kernel size is 3 × 3, the number is 128, the convolution step size is 2, the activation function is ReLU, and the output feature map size is 416 × 416 × 128;
a fifth layer: the convolution kernel size is 1 × 1, the number is 3, the convolution step size is 1, the activation function is Sigmoid, and the output feature map size is 416 × 416 × 3;
4.2 Build a decoding network:
the decoding network comprises 5 deconvolution layers, wherein each deconvolution layer is connected in series, and the parameters of each deconvolution layer are set as follows:
layer 1: the convolution kernel size is 1 × 1, the number is 128, the convolution step size is 1, the activation function is ReLU, and the output feature map size is 416 × 416 × 128;
layer 2: the convolution kernel size is 3 × 3, the number is 64, the convolution step size is 2, the activation function is ReLU, and the output feature map size is 832 × 832 × 64;
layer 3: the convolution kernel size is 3 × 3, the number is 32, the convolution step size is 1, the activation function is ReLU, and the output feature map size is 832 × 832 × 32;
layer 4: the convolution kernel size is 3 × 3, the number is 16, the convolution step size is 2, the activation function is ReLU, and the output feature map size is 1664 × 1664 × 16;
layer 5: the convolution kernel size is 3 × 3, the number is 3, the convolution step size is 1, the activation function is Sigmoid, and the output feature map size is 1664 × 1664 × 3;
the convolution kernel size is described in the form w × h, meaning the convolution kernel is w wide and h high;
the feature map size is described in the form w × h × c, meaning the feature map is w pixels wide and h pixels high with c channels;
the constructed full convolutional network is shown in fig. 4.
Step 5, training the built full convolution self-encoder model.
5.1 Initializing network parameters:
initializing the offset of the network to 0, and initializing the weight parameters of the network by adopting the Kaiming Gaussian initialization method, so that the weight parameters obey the following distribution:
W_l ~ N(0, 2 / ((1 + a²) · n_l))
wherein: W_l is the weight of the l-th layer; N is the Gaussian distribution, i.e. the normal distribution; a is the negative half-axis slope of the ReLU or Leaky ReLU activation function; n_l is the data dimension of each layer, n_l = (convolution kernel side length)² × channel, where channel is the number of input channels of each convolution layer;
the iteration times of the self-encoder are set to 8000 according to the size of the high-definition image training set;
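A minimal sketch of the initialization in 5.1), using PyTorch's built-in Kaiming normal initializer (a = 0 for a plain ReLU network) and zero biases; it assumes the ConvAutoEncoder sketch given after step 4.

```python
# Initialization sketch for 5.1): zero offsets, Kaiming Gaussian weights.
import torch.nn as nn

def init_weights(module):
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.kaiming_normal_(module.weight, a=0, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model = ConvAutoEncoder()
# model.apply(init_weights)
```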
5.2 Up-sampling the training set image data and making the size of the up-sampled training set image data the same as the input size of the full convolution network, i.e. 1664 pixels wide, 1664 pixels high and 3 channels;
5.3 Mask data is up-sampled, and the size of the up-sampled Mask data is the same as the data width and height of a full convolution network, namely the width is 1664 pixels, the height is 1664 pixels, and the number of channels is 1;
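A minimal sketch of the up-sampling in 5.2) and 5.3), assuming NCHW tensors; bilinear interpolation is used for the images and nearest-neighbor interpolation for the Mask so that it stays binary (the choice of interpolation modes is an assumption, as the text does not specify them).

```python
# Up-sampling sketch for 5.2)-5.3): resize images and Mask data to 1664x1664.
import torch
import torch.nn.functional as F

def upsample_batch(images, masks, size=(1664, 1664)):
    images_up = F.interpolate(images, size=size, mode='bilinear', align_corners=False)
    masks_up = F.interpolate(masks, size=size, mode='nearest')   # keeps Mask binary
    return images_up, masks_up
```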
5.4 Inputting the up-sampled image into a full convolution self-coding network, carrying out forward propagation to obtain a coded feature map, and recovering the feature map through a decoder;
5.5 Partition-based mean square error loss function is constructed as follows:
Mask-MSE-Loss(y, y_) = (1 / (W·H)) · Σ_{i=1..W} Σ_{j=1..H} [α·Mask(i, j) + β·(1 − Mask(i, j))] · (y(i, j) − y_(i, j))²
wherein Mask-MSE-Loss(y, y_) is the loss function to be calculated; y is the decoder output image; y_ is the input original high-definition image; α is the loss penalty weight of the target area, set to 0.9; β is the penalty weight of the background area, set to 0.1; W is the width of the encoder input data, i.e. 1664; H is the height of the encoder input data, i.e. 1664; Mask(i, j) is the value of the up-sampled Mask image data at position (i, j);
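A PyTorch sketch of this region-weighted mean square error loss follows: the per-pixel squared errors are weighted by α inside the target Mask and by β in the background and then averaged, which matches the formula above up to the averaging convention.

```python
# Region-weighted MSE loss sketch for 5.5).
import torch

def mask_mse_loss(y, y_, mask, alpha=0.9, beta=0.1):
    """y: decoder output, y_: original image, mask: up-sampled binary Mask.

    All tensors are expected to share the same spatial size (e.g. 1664x1664);
    the single-channel mask is broadcast over the image channels.
    """
    weight = alpha * mask + beta * (1.0 - mask)   # per-pixel penalty weight
    return torch.mean(weight * (y - y_) ** 2)     # averaged over all pixels
```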
5.6 Using a loss function of 5.5), computing loss values for the input image and the output image:
5.7 Using a back propagation algorithm to update the weight and the offset of the full convolution self-encoder, completing one iteration of training the full convolution self-encoder:
5.7.1 Using a back propagation algorithm to update the weights, the formula is as follows:
W_{t+1} = W_t − μ · ∂Mask-MSE-Loss/∂W_t
wherein: W_{t+1} is the updated weight; W_t is the weight before updating; μ is the learning rate of the back propagation algorithm, set here to 0.001; ∂Mask-MSE-Loss/∂W_t is the partial derivative of the loss function in 5.5) with respect to the weight W;
5.7.2 Using a back propagation algorithm to update the offset, which is formulated as follows:
b_{t+1} = b_t − μ · ∂Mask-MSE-Loss/∂b_t
wherein: b_{t+1} is the updated offset; b_t is the offset before updating; μ is the learning rate of the back propagation algorithm, with value 0.001; ∂Mask-MSE-Loss/∂b_t is the partial derivative of the loss function in 5.5) with respect to the offset b;
5.8 ) Repeating 5.2) to 5.7) until all iterations of the full convolution self-encoder are completed, obtaining the trained full convolution self-encoder.
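The explicit weight and offset updates in 5.7) are the standard gradient-descent rule, so a plain SGD optimizer with learning rate 0.001 reproduces them. The following training-loop sketch ties 5.2)–5.8) together; it reuses the ConvAutoEncoder, init_weights, upsample_batch and mask_mse_loss sketches above, and the data iterator `batches` is an assumption.

```python
# Training-loop sketch for step 5: forward pass, region-weighted loss, SGD update.
import torch

model = ConvAutoEncoder()
model.apply(init_weights)                                   # 5.1) initialization
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)   # mu = 0.001

for step in range(8000):                                    # T1 = 8000 iterations
    images, masks = next(batches)                           # assumed (image, Mask) iterator
    images, masks = upsample_batch(images, masks)           # 5.2)-5.3) up-sample to 1664x1664
    outputs = model(images)                                 # 5.4) forward propagation
    loss = mask_mse_loss(outputs, images, masks)            # 5.5)-5.6) loss value
    optimizer.zero_grad()
    loss.backward()                                         # back propagation
    optimizer.step()                                        # 5.7) weight/offset update
```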
Step 6, splicing the coding network of the full convolution self-encoder with the YOLO-V3 detection network, and training the spliced hybrid network:
6.1 ) Splicing the trained coding network of the full-convolution self-encoder to the front of the YOLO-V3 detection network to form the spliced hybrid network, as shown in fig. 5; a code sketch of this splicing is given after 6.2.7);
6.2 Training the spliced hybrid network:
6.2.1 Read the parameter of the trained full convolution self-encoder, initialize the coding network with the read parameter value, and set the parameter of the coding network to be in a non-trainable state;
6.2.2 Set the input image size of the YOLO-V3 network to be the same as the input size of the full-convolution self-encoder network;
6.2.3 ) Downloading the parameters pre-trained on the ImageNet data set from the official YOLO website, initializing the parameters of the YOLO-V3 network with these parameters, and setting the number of iterations of the YOLO-V3 network to 5000 according to the size of the data set collected in step (1);
6.2.4 Sending high-definition image training set data of unmanned aerial vehicle aerial photography into a spliced mixed network for forward propagation to obtain an output detection result;
6.2.5 Using a loss function in the YOLO-V3 algorithm, calculating a loss value between the output detection result and the correct tag data labeled in (1),
the loss function in the YOLO-V3 algorithm is expressed as follows:
Loss = λ_coord · Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
     + λ_coord · Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
     + Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{obj} (C_i − Ĉ_i)²
     + λ_noobj · Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{noobj} (C_i − Ĉ_i)²
     + Σ_{i=0}^{K²} 1_{i}^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²
wherein: λ_coord is the penalty weight of the predicted coordinate loss, set to 5;
λ_noobj is the penalty weight of the confidence loss when no target is detected, set to 0.5;
K is the scale size of the output feature map;
M is the number of bounding boxes;
1_{ij}^{obj} indicates whether the j-th bounding box of the i-th cell in the output feature map contains a target: its value is 1 if a target is contained and 0 otherwise;
1_{ij}^{noobj} is the opposite: its value is 0 if a target is contained and 1 otherwise;
x_i is the abscissa of the predicted bounding box center in the i-th cell of the feature map output by the YOLO-V3 network;
x̂_i is the abscissa of the actual bounding box center in the i-th cell;
y_i is the ordinate of the predicted bounding box center in the i-th cell of the feature map output by the YOLO-V3 network;
ŷ_i is the ordinate of the actual bounding box center in the i-th cell;
w_i is the width of the predicted bounding box in the i-th cell of the feature map output by the YOLO-V3 network;
ŵ_i is the width of the actual bounding box in the i-th cell;
h_i is the height of the predicted bounding box in the i-th cell of the feature map output by the YOLO-V3 network;
ĥ_i is the height of the actual bounding box in the i-th cell;
C_i is the confidence predicted by the YOLO-V3 network for the i-th cell;
Ĉ_i is the true confidence of the i-th cell;
p_i(c) is the probability that the class of the i-th cell in the feature map output by the YOLO-V3 network is c;
p̂_i(c) is the probability that the class of the i-th cell is actually c.
6.2.6 ) According to the loss value calculated in 6.2.5), updating the weight and the offset of the hybrid network using the back propagation algorithm, where the weight and offset updates use the same formulas as in 5.7), completing one iteration of training the hybrid network;
6.2.7 Repeating (6.2.4) - (6.2.6) until the iteration times of all YOLO-V3 are completed, and obtaining a trained mixed network;
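A sketch of the hybrid network assembly in step 6 follows: the trained encoding network is placed in front of a YOLO-V3 detector and its parameters are frozen (set non-trainable), so that only the detector is updated during training. `YoloV3` is a hypothetical placeholder for whichever YOLO-V3 implementation is used (into which the official pre-trained weights would be loaded), not a specific library API.

```python
# Hybrid-network sketch for step 6: encoder spliced in front of YOLO-V3,
# encoder parameters frozen (6.2.1), detector trained as usual (6.2.4-6.2.6).
import torch
import torch.nn as nn

class HybridNet(nn.Module):
    def __init__(self, trained_autoencoder: nn.Module, yolo_v3: nn.Module):
        super().__init__()
        self.encoder = trained_autoencoder.encoder    # coding network placed in front
        self.detector = yolo_v3                       # YOLO-V3 detection network
        for p in self.encoder.parameters():           # 6.2.1) non-trainable state
            p.requires_grad = False

    def forward(self, x):
        z = self.encoder(x)        # high-definition image -> compressed 416x416x3 feature map
        return self.detector(z)    # detection on the compressed representation

# Only the detector's parameters would be passed to the optimizer, e.g.:
# hybrid = HybridNet(trained_autoencoder, YoloV3())            # YoloV3 is hypothetical
# optimizer = torch.optim.SGD(hybrid.detector.parameters(), lr=0.001)
```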
and 7, using the trained network to detect the target.
Inputting the test set data in step 1 into the trained hybrid model to obtain the final detection result and detect the small targets in the image; the result is shown in fig. 6.
In fig. 6 and fig. 7, the areas where text is drawn indicate that a target was successfully detected there. From the result of the conventional method in fig. 7, two obvious small hidden-pipe targets in the lower left corner are not detected, and a fairly obvious small hidden-pipe target in the lower right corner is also missed. In the detection result of fig. 6, by contrast, the invention successfully detects the targets in the lower left and lower right corners, because the spatial characteristics of the targets are preserved during image compression. Compared with the prior art, the invention therefore has an obvious advantage in small target detection for high-definition images.

Claims (5)

1. A high-definition image small target detection method based on an auto-encoder and a YOLO algorithm is characterized by comprising the following steps:
(1) Collecting high-definition image data to form a data set, labeling the data set to obtain correct label data, and dividing the data set and the label data into a training set and a test set according to a ratio of 8:2;
(2) Carrying out data expansion on the marked training set;
(3) For each piece of high-definition image data, generating target Mask data of a corresponding image according to the size of the image and the labeling information;
(4) Building a full convolution self-encoder model comprising an encoding network and a decoding network, wherein the encoding network is used for carrying out feature extraction and data compression on a high-definition image, and the decoding network is used for restoring a compressed feature map to an original size;
(5) Sending high-definition image training set data into a full convolution self-encoder model for training to obtain a trained full convolution self-encoder model:
(5a) Initializing the offset of the network to 0, initializing the weight parameters of the network by adopting the Kaiming Gaussian initialization method, and setting the iteration count T1 of the self-encoder according to the size of the high-definition image training set;
(5b) The partition-based mean square error loss function is defined as follows:
Mask-MSE-Loss(y, y_) = (1 / (W·H)) · Σ_{i=1..W} Σ_{j=1..H} [α·Mask(i, j) + β·(1 − Mask(i, j))] · (y(i, j) − y_(i, j))²
wherein Mask-MSE-Loss(y, y_) is the loss function to be calculated; y is the decoder output image; y_ is the input original high-definition image; α is the loss penalty weight of the target area, set to 0.9; β is the penalty weight of the background area, set to 0.1; W is the width of the self-encoder input image; H is the height of the self-encoder input image; Mask(i, j) is the value at the (i, j)-th position of the Mask data in (3);
(5c) Inputting high-definition image training set data into a full convolution self-coding network, carrying out forward propagation to obtain a coded feature map, and recovering the feature map through a decoder;
(5d) Calculating loss values of the input image and the output image by using the partition area-based mean square error loss function defined in the step (5 b);
(5e) Updating the weight and the offset of the full convolution self-encoder by using a back propagation algorithm to finish one iteration of training the full convolution self-encoder;
(5f) Repeating (5c)-(5e) until all T1 iterations of the self-encoder are completed, obtaining a trained full convolution self-encoder;
(6) Splicing the coding network of the trained full-convolution self-encoder with a YOLO-V3 detection network, and training the spliced network:
(6a) Splicing the coding network of the trained full-convolution self-encoder to the front of a YOLO-V3 detection network to form a spliced mixed network;
(6b) Training the spliced hybrid network:
(6b1) Reading parameters of the trained full-convolution self-encoder, initializing the coding network by using the read parameter values, and setting the parameters of the coding network in a non-trainable state;
(6b2) Setting the input image size of the YOLO-V3 network to be the same as the input size of the full-convolution self-encoder network;
(6b3) Downloading the parameters pre-trained on the ImageNet data set from the official YOLO website, initializing the parameters of the YOLO-V3 network with these parameters, and setting the iteration count T2 of the YOLO-V3 network according to the size of the data set acquired in step (1);
(6b4) Sending the high-definition image training set data into the spliced hybrid network for forward propagation to obtain an output detection result;
(6b5) Calculating a loss value between an output detection result and the correct label data marked in the step (1) by using a loss function in a YOLO-V3 algorithm;
(6b6) Updating the weight and the offset of the hybrid network by using a back propagation algorithm according to the loss value, and completing one iteration of training the hybrid network;
(6b7) Repeating (6b4)-(6b6) until all T2 iterations of the YOLO-V3 network are completed, obtaining a trained hybrid network;
(7) Inputting the test set data in step (1) into the trained hybrid model to obtain the final detection result.
2. The method according to claim 1, wherein the step (2) of performing data expansion on the labeled training set comprises performing left-right flipping, rotation, translation, noise adding, brightness adjustment, contrast adjustment and saturation adjustment on each high-definition image in the original data set, and adding the processed image data into the original data set to obtain expanded data.
3. The method as claimed in claim 1, wherein for each high-definition image data, the step (3) generates target Mask data of the corresponding image according to the image size and the label information, which is implemented as follows:
(3a) Setting Mask data as binary image data, wherein the width and the height of the Mask data are the same as those of the acquired high-definition image;
(3b) Reading position information of pixel points in an original image according to the labeling data, and setting values of the pixel points corresponding to Mask data:
if the pixel point is in the target area, the value of the pixel point corresponding to the Mask data is set to be 1,
if the pixel point is in the background area, the value of the pixel point corresponding to the Mask data is set to be 0,
the formula is expressed as follows:
Mask(i, j) = 1, if pixel (i, j) is in the target area
Mask(i, j) = 0, if pixel (i, j) is in the background area
4. the method of claim 1, wherein the initializing of the weight parameters of the network in step (5 a) using a kaiming gaussian initialization method is randomly initializing the weights of the network to obey the following distribution:
W_l ~ N(0, 2 / ((1 + a²) · n_l))
wherein: W_l is the weight of the l-th layer; N is the Gaussian distribution, i.e. the normal distribution; a is the negative half-axis slope of the ReLU or Leaky ReLU activation function; n_l is the data dimension of each layer, n_l = (convolution kernel side length)² × channel, where channel is the number of input channels of each convolution layer.
5. The method of claim 1, wherein the loss function in the YOLO-V3 algorithm used in step (6 b 5) is expressed as follows:
Loss = λ_coord · Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
     + λ_coord · Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
     + Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{obj} (C_i − Ĉ_i)²
     + λ_noobj · Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{noobj} (C_i − Ĉ_i)²
     + Σ_{i=0}^{K²} 1_{i}^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²
wherein: λ_coord is the penalty weight of the predicted coordinate loss, set to 5;
λ_noobj is the penalty weight of the confidence loss when no target is detected, set to 0.5;
K is the scale size of the output feature map;
M is the number of bounding boxes;
1_{ij}^{obj} indicates whether the j-th bounding box of the i-th cell in the output feature map contains a target: its value is 1 if a target is contained and 0 otherwise;
1_{ij}^{noobj} is the opposite: its value is 0 if a target is contained and 1 otherwise;
x_i is the abscissa of the predicted bounding box center in the i-th cell of the feature map output by the YOLO-V3 network;
x̂_i is the abscissa of the actual bounding box center in the i-th cell;
y_i is the ordinate of the predicted bounding box center in the i-th cell of the feature map output by the YOLO-V3 network;
ŷ_i is the ordinate of the actual bounding box center in the i-th cell;
w_i is the width of the predicted bounding box in the i-th cell of the feature map output by the YOLO-V3 network;
ŵ_i is the width of the actual bounding box in the i-th cell;
h_i is the height of the predicted bounding box in the i-th cell of the feature map output by the YOLO-V3 network;
ĥ_i is the height of the actual bounding box in the i-th cell;
C_i is the confidence predicted by the YOLO-V3 network for the i-th cell;
Ĉ_i is the true confidence of the i-th cell;
p_i(c) is the probability that the class of the i-th cell in the feature map output by the YOLO-V3 network is c;
p̂_i(c) is the probability that the class of the i-th cell is actually c.
CN202010143805.7A 2019-11-15 2020-03-04 High-definition image small target detection method based on self-encoder and YOLO algorithm Active CN111126359B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911117690 2019-11-15
CN2019111176908 2019-11-15

Publications (2)

Publication Number Publication Date
CN111126359A CN111126359A (en) 2020-05-08
CN111126359B true CN111126359B (en) 2023-03-28

Family

ID=70493460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010143805.7A Active CN111126359B (en) 2019-11-15 2020-03-04 High-definition image small target detection method based on self-encoder and YOLO algorithm

Country Status (1)

Country Link
CN (1) CN111126359B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832513B (en) * 2020-07-21 2024-02-09 西安电子科技大学 Real-time football target detection method based on neural network
CN111986160A (en) * 2020-07-24 2020-11-24 成都恒创新星科技有限公司 Method for improving small target detection effect based on fast-RCNN
CN111881982A (en) * 2020-07-30 2020-11-03 北京环境特性研究所 Unmanned aerial vehicle target identification method
CN112287998B (en) * 2020-10-27 2024-06-21 佛山市南海区广工大数控装备协同创新研究院 Method for detecting target under low illumination condition
CN112396582B (en) * 2020-11-16 2024-04-26 南京工程学院 Mask RCNN-based equalizing ring skew detection method
CN112766223B (en) * 2021-01-29 2023-01-06 西安电子科技大学 Hyperspectral image target detection method based on sample mining and background reconstruction
CN112926637B (en) * 2021-02-08 2023-06-09 天津职业技术师范大学(中国职业培训指导教师进修中心) Method for generating text detection training set
CN113255830A (en) * 2021-06-21 2021-08-13 上海交通大学 Unsupervised target detection method and system based on variational self-encoder and Gaussian mixture model
CN115841522A (en) * 2021-09-18 2023-03-24 华为技术有限公司 Method, apparatus, storage medium, and program product for determining image loss value
CN114419395A (en) * 2022-01-20 2022-04-29 江苏大学 Online target detection model training method based on intermediate position coding
CN114743116A (en) * 2022-04-18 2022-07-12 蜂巢航宇科技(北京)有限公司 Barracks patrol scene-based unattended special load system and method
CN114818838B (en) * 2022-06-30 2022-09-13 中国科学院国家空间科学中心 Low signal-to-noise ratio moving point target detection method based on pixel time domain distribution learning
CN115542282B (en) * 2022-11-28 2023-04-07 南京航空航天大学 Radar echo detection method, system, device and medium based on deep learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399362B (en) * 2018-01-24 2022-01-07 中山大学 Rapid pedestrian detection method and device
CN109447033A (en) * 2018-11-14 2019-03-08 北京信息科技大学 Vehicle front obstacle detection method based on YOLO
CN109785333A (en) * 2018-12-11 2019-05-21 华北水利水电大学 Object detection method and device for parallel manipulator human visual system
CN110087092B (en) * 2019-03-11 2020-06-05 西安电子科技大学 Low-bit-rate video coding and decoding method based on image reconstruction convolutional neural network
CN109886359B (en) * 2019-03-25 2021-03-16 西安电子科技大学 Small target detection method and detection system based on convolutional neural network

Also Published As

Publication number Publication date
CN111126359A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN111126359B (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN111598030B (en) Method and system for detecting and segmenting vehicle in aerial image
CN111612008B (en) Image segmentation method based on convolution network
CN111709416B (en) License plate positioning method, device, system and storage medium
CN112308860A (en) Earth observation image semantic segmentation method based on self-supervision learning
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
CN113780296A (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN115147598B (en) Target detection segmentation method and device, intelligent terminal and storage medium
CN112464912B (en) Robot end face detection method based on YOLO-RGGNet
CN111242026B (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN115035295B (en) Remote sensing image semantic segmentation method based on shared convolution kernel and boundary loss function
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN112016512A (en) Remote sensing image small target detection method based on feedback type multi-scale training
CN116645592B (en) Crack detection method based on image processing and storage medium
CN114037640A (en) Image generation method and device
CN112686274A (en) Target object detection method and device
CN110310305A (en) A kind of method for tracking target and device based on BSSD detection and Kalman filtering
CN112101113B (en) Lightweight unmanned aerial vehicle image small target detection method
CN114913493A (en) Lane line detection method based on deep learning
CN116503709A (en) Vehicle detection method based on improved YOLOv5 in haze weather
CN112801021B (en) Method and system for detecting lane line based on multi-level semantic information
CN112785610B (en) Lane line semantic segmentation method integrating low-level features
CN116363610A (en) Improved YOLOv 5-based aerial vehicle rotating target detection method
CN118265998A (en) Dead pixel detection model training method, dead pixel detection method and dead pixel restoration method
CN115984568A (en) Target detection method in haze environment based on YOLOv3 network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211123

Address after: 710071 Taibai South Road, Yanta District, Xi'an, Shaanxi Province, No. 2

Applicant after: XIDIAN University

Applicant after: Nanjing Yixin Yiyi Information Technology Co.,Ltd.

Address before: 710071 No. 2 Taibai South Road, Shaanxi, Xi'an

Applicant before: XIDIAN University

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant