CN110991311B - Target detection method based on dense connection deep network - Google Patents


Publication number
CN110991311B
Authority
CN
China
Prior art keywords
network
dense
size
image
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911188895.5A
Other languages
Chinese (zh)
Other versions
CN110991311A (en)
Inventor
陈莹
潘志浩
化春键
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN201911188895.5A priority Critical patent/CN110991311B/en
Publication of CN110991311A publication Critical patent/CN110991311A/en
Application granted granted Critical
Publication of CN110991311B publication Critical patent/CN110991311B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a target detection method based on a dense connection deep network, and belongs to the technical field of target detection. The method fuses the dense connection mode into the yolo-tiny network, adds convolution layers to the yolo-tiny network, and improves the feature extraction network. The improved network first normalizes an input image to a fixed size, then extracts and fuses the features of each channel using a DenseBlock module, and then predicts with different prior boxes at different scales to complete the classification and localization of a target. Compared with the original algorithm, the improved algorithm raises precision by 15% and meets the requirement of real-time detection; the model is only 44.7 MB in size, so it also meets the memory-occupation and real-time requirements of actual use.

Description

Target detection method based on dense connection deep network
Technical Field
The invention relates to a target detection method based on a dense connection deep network, and belongs to the technical field of target detection.
Background
There are many deep-learning-based target detection algorithms at present, such as Fast R-CNN (Fast Region-based Convolutional Network), SSD (Single Shot MultiBox Detector), R-FCN (Region-based Fully Convolutional Network), yolo (You Only Look Once), yolo-tiny (You Only Look Once-Tiny) and so on. However, these algorithms still have many defects: Fast R-CNN, R-FCN, SSD and similar algorithms suffer from low detection speed and a complex system configuration environment; the yolov3 algorithm has high detection speed, but its model occupies a large amount of memory; and yolov3-tiny has the problem of excessively low detection precision.
Although the current yolov3-tiny detection network has high detection speed, various problems exist, such as inaccurate detection positioning, poor detection effect, and serious missed detection and false detection conditions. At present, a residual network structure is fused into yolov3-tiny in the literature, but the detection precision is only 60.92%.
The densely connected convolutional neural network (Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger. Densely Connected Convolutional Networks [C]. CVPR, 2017. DOI: 10.1109/CVPR.2017.243) is an independent and complete network, but it has the disadvantage that, owing to the settings of the output parameters of its different convolutional layers and the presence of fully connected layers, the amount of computation increases sharply and a large amount of GPU memory is consumed. This problem limits the use of the network in practical production.
Disclosure of Invention
In order to solve at least one of the above problems, the invention provides a target detection method based on a dense connection deep network. By improving the network structure of the yolov3-tiny algorithm, the method achieves high detection precision, high speed and a small model memory footprint, and meets the real-time requirements of practical use.
In the target detection method based on the dense connection deep network, a dense connection mode is integrated into a convolutional neural network, and every extracted feature is fully utilized by concatenating the outputs of the convolutional layers. The invention not only improves the feature utilization rate and information flow of the detection network, but also strengthens feature propagation and improves the detection effect.
The invention aims to provide a target detection method based on a dense connection deep network, which comprises the following steps:
step (1): reading image data in a Pascal VOC data set and extracting target data characteristics;
step (2): training a network model;
step (3): carrying out target detection.
Optionally, the method comprises the following steps:
step (1): reading in image data in the Pascal VOC data set and extracting target data characteristics: reading input image data by a network, firstly normalizing the resolution of the input image data to 416 x 416, and then extracting and fusing the characteristics of each channel through a series of convolution layers and a Dense connection module Dense Block;
step (2): training a network model: setting a network batch to 64, and repeating iterative training to obtain a detection model;
step (3): carrying out target detection: the network first extracts features from an input image through the feature extraction network to obtain a feature map of a certain size (assumed to be k × k), then divides the input image into k × k cells, and each cell predicts a fixed number (3) of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, namely how likely the region is to contain a target; non-maximum suppression (NMS) is then carried out, and finally the detection result is output.
Optionally, the step (1) includes:
(1) the dense connection mode is introduced, so that an L-layer network has L(L+1)/2 connections; the Dense connection module Dense Block is mainly composed of 1 × 1 and 3 × 3 convolution layers, wherein the 1 × 1 convolution, also called a bottleneck layer, reduces the number of input feature maps, improves the calculation efficiency and fuses the features of each channel; the 3 × 3 convolution is used for extracting image features; the input of each layer in the Dense Block comes from the outputs of all the preceding layers, which achieves a better effect with fewer parameters; the formula below indicates that the input of layer l is the concatenation of the outputs of all the preceding layers;
$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$
wherein $x_l$ represents the output of the l-th layer and $[x_0, x_1, \ldots, x_{l-1}]$ represents the concatenation of the outputs of layers 0, …, l-1. In the above formula, $H_l(\cdot)$ represents a composite function of three successive operations: Batch Normalization (BN), a rectified linear unit (ReLU) and a 3 × 3 convolutional layer;
(2) reducing the quantity of feature graphs output by the convolution layers in a Dense connection module Dense Block;
(3) in order to realize network down-sampling operation, a network is divided into a plurality of Dense connection modules, the output of feature maps in each Dense Block is set to be the same, and the number of feature maps in different scales is different.
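The dense connectivity described in items (1) to (3) can be illustrated with a minimal Python sketch. It only counts links and channels and performs no convolution; the function names, and the choice of 16 input channels with 16 new feature maps per layer, are illustrative assumptions rather than values fixed by this step.

```python
def dense_connections(num_layers):
    """List every (source, target) link in a densely connected block.

    Index 0 is the block input x_0; indices 1..num_layers are layer outputs.
    Each layer receives the outputs of all earlier indices.
    """
    return [(src, dst)
            for dst in range(1, num_layers + 1)
            for src in range(dst)]


def input_channels_per_layer(in_channels, maps_per_layer, num_layers):
    """Channels seen by each layer when all earlier outputs are concatenated.

    in_channels and maps_per_layer are hypothetical parameters.
    """
    return [in_channels + i * maps_per_layer for i in range(num_layers)]


links = dense_connections(4)
print(len(links))                           # 4 * (4 + 1) / 2 = 10
print(input_channels_per_layer(16, 16, 4))  # [16, 32, 48, 64]
```

The growing input width in the second printout is what the 1 × 1 bottleneck convolution is meant to keep in check.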
Optionally, the step (2) includes:
setting the learning rate of the network to 0.001, the momentum to 0.9 and the weight decay regularization term to 0.0005, setting the maximum number of iterations to 500200, and dividing the learning rate by 10 when the iteration count reaches 400000 and again when it reaches 450000; meanwhile, the network uses multi-scale training: after the network reads the data, the width and height of the normalized image resolution take a random value between 320 and 608 that is a multiple of 32, and this value is re-drawn once every 10 rounds.
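The learning-rate schedule above can be sketched in Python as follows. This is a sketch under the assumption that the rate is multiplied by 0.1 each time a listed iteration count is reached; the function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(iteration, base_lr=0.001, steps=(400000, 450000), factor=0.1):
    """Step schedule: scale the rate by `factor` once per step already reached."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= factor
    return lr


# Print the rate before, at, and after each decay point.
for it in (0, 399999, 400000, 450000, 500200):
    print(it, learning_rate(it))
```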
Optionally, the step (3) includes:
(1) yolov3-tiny uses the K-means clustering algorithm to cluster the ground-truth boxes in the data set; 3 prior boxes of different sizes are set for each of the two scales obtained by down-sampling, so that 6 prior boxes of different sizes are clustered in total;
the 6 prior box sizes for the two different scales are shown in table 1 below:
TABLE 1
[Table 1: sizes of the 6 prior boxes at the two scales; table image not reproduced]
(2) Predicting on feature maps of two different scales using the 6 different prior boxes (anchors); when the bounding box is predicted, the network adopts logistic regression in order to better model the data and support multi-label classification; the coordinate prediction formulas of the network bounding box are as follows:
$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$
wherein $t_x$, $t_y$, $t_w$ and $t_h$ are the raw values predicted by the model, $c_x$ and $c_y$ denote the coordinate offset of the grid cell, $p_w$ and $p_h$ are the width and height of the anchor box, and $b_x$, $b_y$, $b_w$ and $b_h$ are the center coordinates, width and height of the finally obtained bounding box; the coordinates are trained with a sum-of-squared-error loss.
(3) The threshold for non-maximum suppression (NMS) was set to 0.45.
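The bounding-box decoding in item (2) can be sketched in Python as below. The width and height terms use the exponential form of the yolov3 paper; the function names and the sample prior size are illustrative assumptions, and the result is expressed in grid-cell units.

```python
import math


def sigmoid(x):
    """Logistic function used to keep the center offset inside its cell."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Turn raw predictions (tx, ty, tw, th) into (bx, by, bw, bh)."""
    tx, ty, tw, th = t
    cx, cy = cell    # offset of the grid cell
    pw, ph = prior   # width and height of the prior box (anchor)
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# Zero predictions place the box at the cell center with the prior's size.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(3.0, 5.0)))
# (6.5, 6.5, 3.0, 5.0)
```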
The second purpose of the invention is to apply the target detection method based on the dense connection deep network to image target detection.
Optionally, the target detection method based on the dense connection deep network is applied to image target detection through the following specific steps: the network reads pedestrian image data from different scenes as training data, normalizes the image resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers and the Dense connection module (Dense Block); the corresponding detection model is obtained by training the network; the trained model, the network configuration file and the image to be detected are then loaded, and the network first extracts features from the input image through the feature extraction network. Because the method adopts multi-scale prediction, feature maps of 13 × 13 and 26 × 26 are obtained after feature extraction, and prediction is carried out at these two scales. The network divides the image to be detected into 13 × 13 and 26 × 26 cells, respectively, and each cell predicts a fixed number (3) of bounding boxes; logistic regression is used during prediction to predict the objectness score of each bounding box, i.e., how likely the region is to belong to the pedestrian category. Non-maximum suppression (NMS) is then carried out, and finally the detection result is output.
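The number of candidate boxes implied by the two prediction scales can be worked out with a short Python sketch. The strides of 32 and 16 are an assumption inferred from the 416 × 416 input and the 13 × 13 and 26 × 26 feature maps; the function names are hypothetical.

```python
def grid_sizes(input_size=416, strides=(32, 16)):
    """Side length k of each k x k prediction grid (strides are assumed)."""
    return [input_size // s for s in strides]


def total_boxes(input_size=416, strides=(32, 16), boxes_per_cell=3):
    """Bounding boxes predicted before non-maximum suppression."""
    return sum(k * k * boxes_per_cell for k in grid_sizes(input_size, strides))


print(grid_sizes())   # [13, 26]
print(total_boxes())  # (169 + 676) * 3 = 2535
```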
The third purpose of the invention is to apply the target detection method based on the dense connection deep network to video target detection.
Optionally, the target detection method based on the dense connection deep network is applied to video target detection through the following specific steps: the network reads video target image data from different scenes as training data, normalizes the image resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers and the Dense connection module (Dense Block); a detection model for the corresponding detection task is obtained by training the network; the trained model, the network configuration file and the image to be detected are loaded, and the network first extracts features from the input image through the feature extraction network; because the invention adopts multi-scale prediction, feature maps of 13 × 13 and 26 × 26 are obtained after feature extraction, and prediction is carried out at these two scales; the network then divides the image to be detected into 13 × 13 and 26 × 26 cells, respectively, and each cell predicts a fixed number (3) of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, namely how likely the region is to belong to the pedestrian category; non-maximum suppression (NMS) is then performed; because the network has a high detection speed, it achieves real-time detection, can be applied to real-time video detection, and outputs the detection result.
The invention has the beneficial effects that:
(1) the method of the invention makes full use of each extracted feature, thereby not only improving the feature utilization rate of the network, strengthening the feature propagation, but also enhancing the learning of the network on the detail information.
(2) The method reaches a detection precision of 65.93%, which is higher than the 49.19% of yolo-tiny.
(3) The method reaches a detection speed of 83 fps, and can be applied to various real-time target detection tasks in actual scenes.
(4) The size of the model adopted by the method is only 44.7MB, the requirement of the model on the memory of the computer is small, and the cost can be saved.
Drawings
Fig. 1 is a diagram of a dense connection network architecture.
Fig. 2 is an overall architecture diagram of the network.
Fig. 3 is the pedestrian detection results of the original algorithm in the Pascal VOC data set.
Figure 4 is the results of pedestrian detection in the Pascal VOC data set of example 2.
FIG. 5 is the detection result of the original algorithm in the Pascal VOC detection task.
FIG. 6 is the results of example 3 in the Pascal VOC detection task.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Example 1
Existing target detection methods achieve high detection precision, but they cannot meet the requirement of real-time detection in actual production, and the memory their models occupy makes them poorly portable. To address these problems, the invention provides a target detection method based on a dense connection deep network, which is described in detail below with reference to the accompanying drawings:
as shown in fig. 1, it is a structural diagram of a dense connection network of a target detection method based on a dense connection deep network provided by the present invention; fig. 2 is a network overall architecture diagram of a target detection method based on a dense connection depth network according to the present invention. In this embodiment, a target detection method based on a dense connection deep network includes the following steps:
a.1, reading in image data in a Pascal VOC data set and extracting target data characteristics: the network reads the input image data, firstly normalizes the resolution to 416 x 416, and then extracts and fuses the characteristics of each channel through a series of convolution layers and a Dense connection module (Dense Block).
The step A.1 comprises the following steps:
(1) the dense connection mode is introduced, so that an L-layer network has L(L+1)/2 connections. The Dense connection module Dense Block is mainly composed of 1 × 1 and 3 × 3 convolution layers, wherein the 1 × 1 convolution, called a bottleneck layer, reduces the number of input feature maps, improves the calculation efficiency and fuses the features of each channel. The 3 × 3 convolution is used to extract image features. The input of each layer in the Dense Block comes from the outputs of all preceding layers, which achieves a better effect with fewer parameters. The formula below indicates that the input of layer l is the concatenation of the outputs of all the preceding layers;
$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$
wherein $x_l$ represents the output of the l-th layer and $[x_0, x_1, \ldots, x_{l-1}]$ represents the concatenation of the outputs of layers 0, …, l-1. In the above formula, $H_l(\cdot)$ represents a composite function of three successive operations: Batch Normalization (BN), a rectified linear unit (ReLU) and a 3 × 3 convolutional layer.
(2) Reducing the number of feature maps output by the convolution layers in the Dense Block;
(3) in order to realize network down-sampling operation, a network is divided into a plurality of Dense connection modules, the output of feature maps in each Dense Block is set to be the same, and the number of feature maps in different scales is different.
B.1, training a network model: the network batch is set to 64, and iterative training is repeated to obtain a detection model.
The step B.1 comprises the following steps:
the learning rate of the network is set to 0.001, the momentum to 0.9 and the weight decay regularization term to 0.0005; the maximum number of iterations is 500200, and the learning rate is divided by 10 when the iteration count reaches 400000 and again when it reaches 450000. Meanwhile, the network uses multi-scale training: the width and height of the network input size take a random value between 320 and 608, and the input size is changed once every 10 rounds.
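The multi-scale training rule can be sketched in Python as follows. The restriction to multiples of 32 follows the description of the training step given earlier in this document; the function names, the use of `random.Random`, and the starting size of 416 are illustrative assumptions.

```python
import random


def candidate_sizes(low=320, high=608, multiple=32):
    """All admissible input sizes between low and high (inclusive)."""
    return list(range(low, high + 1, multiple))


def pick_input_size(rng, round_index, current, period=10):
    """Draw a new size every `period` rounds, otherwise keep the current one."""
    if round_index % period == 0:
        return rng.choice(candidate_sizes())
    return current


rng = random.Random(0)   # seeded only to make the sketch repeatable
size = 416
for round_index in range(30):
    size = pick_input_size(rng, round_index, size)
print(candidate_sizes())
```

The same value is used for width and height, so the input stays square while its resolution varies during training.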
C.1, target detection:
the network first extracts features from an input image through the feature extraction network to obtain a feature map of a certain size (assumed to be k × k), then divides the input image into k × k cells, and each cell predicts a fixed number (3) of bounding boxes. Logistic regression is used during prediction to predict the objectness score of each bounding box, i.e., how likely the region is to contain a target. Non-maximum suppression (NMS) is then carried out, and finally the detection result is output.
The step C.1 specifically comprises the following steps:
(1) yolov3-tiny clusters the ground-truth boxes in the data set using the K-means clustering algorithm; 3 prior boxes of different sizes are set for each down-sampling scale, so that 6 prior boxes of different sizes are clustered in total.
The 6 prior box sizes for the two different scales are shown in table 2 below:
TABLE 2
[Table 2: sizes of the 6 prior boxes at the two scales; table image not reproduced]
(2) Prediction is performed on feature maps of two different scales using the 6 different prior boxes (anchors). When predicting the bounding box, the network uses logistic regression for better data modeling and support of multi-label classification. The coordinate prediction formulas of the bounding box of the network are as follows:
$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$
wherein $t_x$, $t_y$, $t_w$ and $t_h$ are the raw values predicted by the model, $c_x$ and $c_y$ denote the coordinate offset of the grid cell, $p_w$ and $p_h$ are the width and height of the anchor box, and $b_x$, $b_y$, $b_w$ and $b_h$ are the center coordinates, width and height of the resulting bounding box. The coordinates are trained with a sum-of-squared-error loss.
(3) The threshold for non-maximum suppression (NMS) was set to 0.45.
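Non-maximum suppression with the 0.45 threshold of item (3) can be sketched in Python as below. This is the common greedy formulation and is an assumption, since the text gives only the threshold; the corner box format (x1, y1, x2, y2) and the function names are likewise illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.45):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of 81/119, about 0.68, so it is suppressed; the third box does not overlap and is kept.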
Example 2
This example is the process and results for pedestrian detection on the Pascal VOC data set. The method comprises the following specific steps:
a.1, reading pedestrian image data under different scenes in a Pascal VOC data set as training data and extracting pedestrian data characteristics: the network reads the input image, firstly normalizes the resolution to 416 x 416, and then extracts and fuses the characteristics of each channel through a convolution layer, a pooling layer and a Dense connection module (Dense Block).
The step A.1 comprises the following steps:
(1) the dense connection mode is introduced, so that an L-layer network has L(L+1)/2 connections. The Dense connection module Dense Block is mainly composed of 1 × 1 and 3 × 3 convolution layers, wherein the 1 × 1 convolution, called a bottleneck layer, reduces the number of input feature maps, improves the calculation efficiency and fuses the features of each channel. The 3 × 3 convolution is used to extract image features. The input of each layer in the Dense Block comes from the outputs of all preceding layers, which achieves a better effect with fewer parameters.
$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$
wherein $x_l$ represents the output of the l-th layer and $[x_0, x_1, \ldots, x_{l-1}]$ represents the concatenation of the outputs of layers 0, …, l-1. $H_l(\cdot)$ consists of Batch Normalization (BN), a rectified linear unit (ReLU) and a 3 × 3 convolutional layer.
The image data passes through the first Dense connection module (Dense Block 1) to give a 208 × 48 feature map. This feature map then passes through a 1 × 1 convolution layer, which reduces the number of input channels and the computational complexity of the network, and through a 2 × 2 pooling layer, which down-samples the feature map to obtain higher-level semantic information. The resulting output is used as the input to Dense Block 2.
(2) The number of feature maps output by the convolution layers in each Dense Block is kept small. Dense Block 1 sets the number of feature maps to 16, and Dense Blocks 2, 3, 4 and 5 set it to 32, 64, 128 and 256, respectively. The purpose of increasing the number of output feature maps from block to block is to enable the network to learn richer high-level semantic information from the pedestrian image data and to increase the positioning accuracy.
(3) In order to realize network downsampling operation, a network is divided into a plurality of Dense blocks, the output of feature maps in each Dense Block is set to be the same, and the number of feature maps in different scales is different.
B.1, training the network model: the network batch is set to 64, and iterative training is repeated to obtain a detection model.
The step B.1 comprises the following steps:
the learning rate of the network is set to 0.001, the momentum to 0.9 and the weight decay regularization term to 0.0005; the maximum number of iterations is 500200, and the learning rate is divided by 10 when the iteration count reaches 400000 and again when it reaches 450000. Meanwhile, the network uses multi-scale training: after the network reads in the pedestrian image data, the width and height of the normalized image resolution take a random value between 320 and 608 that is a multiple of 32, and this value is re-drawn once every 10 rounds.
C.1, target detection:
when detecting an image, the trained model, the network configuration file and the image data to be detected are loaded first, and the network extracts features from the input image through the feature extraction network. Because multi-scale prediction is adopted, feature maps of 13 × 13 and 26 × 26 are obtained after feature extraction, and prediction is carried out at these two scales. The network then divides the image to be detected into 13 × 13 and 26 × 26 cells, respectively, and each cell predicts a fixed number (3) of bounding boxes. Logistic regression is used during prediction to predict the objectness score of each bounding box, i.e., how likely the region is to belong to the pedestrian category. Non-maximum suppression (NMS) is then carried out, and finally the detection result is output.
The step C.1 specifically comprises the following steps:
(1) yolov3-tiny clusters the ground-truth boxes in the data set using the K-means clustering algorithm; 3 prior boxes of different sizes are set for each down-sampling scale, so that 6 prior boxes of different sizes are clustered in total. The 6 prior box sizes for the two different scales are shown in table 3 below:
TABLE 3
[Table 3: sizes of the 6 prior boxes at the two scales; table image not reproduced]
(2) Prediction is performed on feature maps of two different scales using the 6 different prior boxes (anchors). When predicting the bounding box, the network uses logistic regression for better data modeling and support of multi-label classification. The coordinate prediction formulas of the bounding box of the network are as follows:
$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$
wherein $t_x$, $t_y$, $t_w$ and $t_h$ are the raw values predicted by the model, $c_x$ and $c_y$ denote the coordinate offset of the grid cell, $p_w$ and $p_h$ are the width and height of the anchor box, and $b_x$, $b_y$, $b_w$ and $b_h$ are the center coordinates, width and height of the resulting bounding box. The coordinates are trained with a sum-of-squared-error loss.
(3) The threshold for non-maximum suppression (NMS) is set to 0.45; NMS filters out the overlapping boxes that occur during the prediction process.
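The K-means clustering of ground-truth boxes in item (1) can be sketched in Python as below. The text names only K-means; using 1 − IoU on box width and height as the distance is the usual yolo convention and is an assumption here, as are the function names, the toy boxes and the fixed initial centroids. The sketch clusters into 2 groups for brevity, whereas the method uses 6.

```python
def wh_iou(box, centroid):
    """IoU of two (width, height) boxes aligned at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, centroids, iterations=20):
    """Cluster (w, h) pairs; highest IoU means smallest 1 - IoU distance."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for box in boxes:
            best = max(range(len(centroids)),
                       key=lambda k: wh_iou(box, centroids[k]))
            clusters[best].append(box)
        updated = []
        for members, old in zip(clusters, centroids):
            if members:
                updated.append((sum(b[0] for b in members) / len(members),
                                sum(b[1] for b in members) / len(members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


boxes = [(10, 20), (12, 22), (100, 80), (110, 90)]  # toy (w, h) values
print(kmeans_anchors(boxes, centroids=[(10, 20), (100, 80)]))
# [(11.0, 21.0), (105.0, 85.0)]
```

Sorting by area makes it straightforward to assign the smaller priors to the finer grid and the larger ones to the coarser grid.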
FIG. 3 shows the pedestrian detection result of the original algorithm on the Pascal VOC data set; the detection accuracy for the pedestrian class is 65.1%. The original algorithm is taken from the literature (Redmon J, Farhadi A. YOLOv3: An Incremental Improvement [J]. arXiv preprint arXiv:1804.02767, 2018).
Fig. 4 shows the pedestrian detection result of example 2 on the Pascal VOC data set; the detection accuracy for the pedestrian class is 79.8%, an improvement of 14.7 percentage points over the original algorithm.
Example 3
This example describes the procedure and results for detection of the horse class on the Pascal VOC data set. The method comprises the following specific steps:
a.1, reading image data of horses in different scenes from the Pascal VOC data set as training data and extracting the features of the horse class: the network reads the input image, first normalizes the resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers and the Dense connection module (Dense Block).
The step A.1 comprises the following steps:
(1) the dense connection mode is introduced, so that an L-layer network has L(L+1)/2 connections. The Dense Block is mainly composed of 1 × 1 and 3 × 3 convolution layers, wherein the 1 × 1 convolution, called a bottleneck layer, reduces the number of input feature maps, improves the calculation efficiency and fuses the features of each channel. The 3 × 3 convolution is used to extract image features. The input of each layer in the Dense Block comes from the outputs of all preceding layers, which achieves a better effect with fewer parameters.
$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$
wherein $x_l$ represents the output of the l-th layer and $[x_0, x_1, \ldots, x_{l-1}]$ represents the concatenation of the outputs of layers 0, …, l-1. $H_l(\cdot)$ consists of Batch Normalization (BN), a rectified linear unit (ReLU) and a 3 × 3 convolutional layer.
The image data passes through the first Dense connection module (Dense Block 1) to give a 208 × 48 feature map. This feature map then passes through a 1 × 1 convolution layer, which reduces the number of input channels and the computational complexity of the network, and through a 2 × 2 pooling layer, which down-samples the feature map to obtain higher-level semantic information. The resulting output is used as the input to Dense Block 2.
(2) The number of feature maps output by the convolution layers in each Dense Block is kept small. Dense Block 1 sets the number of feature maps to 16, and Dense Blocks 2, 3, 4 and 5 set it to 32, 64, 128 and 256, respectively. The purpose of increasing the number of output feature maps from block to block is to enable the network to learn richer high-level semantic information from the horse image data and to increase the positioning accuracy.
(3) In order to realize network down-sampling operation, a network is divided into a plurality of Dense connection modules, the output of feature maps in each Dense Block is set to be the same, and the number of feature maps in different scales is different.
B.1, training the network model: the network batch is set to 64, and iterative training is repeated to obtain a detection model.
The step B.1 comprises the following steps:
the learning rate of the network is set to 0.001, the momentum to 0.9 and the weight decay regularization term to 0.0005; the maximum number of iterations is 500200, and the learning rate is divided by 10 when the iteration count reaches 400000 and again when it reaches 450000. Meanwhile, the network uses multi-scale training: after the network reads in the horse image data, the width and height of the normalized image resolution take a random value between 320 and 608 that is a multiple of 32, and this value is re-drawn once every 10 rounds.
C.1, target detection:
when detecting an image, the trained model, the network configuration file and the image data to be detected are loaded first, and the network extracts features from the input image through the feature extraction network. Because multi-scale prediction is adopted, feature maps of 13 × 13 and 26 × 26 are obtained after feature extraction, and prediction is carried out at these two scales. The network then divides the image to be detected into 13 × 13 and 26 × 26 cells, respectively, and each cell predicts a fixed number (3) of bounding boxes. Logistic regression is used during prediction to predict the objectness score of each bounding box, i.e., how likely the region is to belong to the horse category. Non-maximum suppression (NMS) is then carried out, and finally the detection result is output.
Step C.1 specifically comprises the following steps:
(1) yolov3-tiny clusters the ground-truth boxes in the dataset with the K-means clustering algorithm. Three prior boxes of different sizes are set for each down-sampling scale, giving 6 prior boxes of different sizes in total. The 6 prior box sizes for the two scales are shown in Table 4 below:
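TABLE 4
[Table 4 is provided as an image in the original publication (Figure BDA0002293076710000101): sizes of the 6 prior boxes for the two scales.]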
(2) Prediction is performed on the feature maps at the two different scales (13 × 13 and 26 × 26) using the 6 prior boxes (anchors). When predicting bounding boxes, the network uses logistic regression, which models the data better and supports multi-label classification. The coordinate prediction formulas for the bounding box are as follows:
b_x = \sigma(t_x) + c_x
b_y = \sigma(t_y) + c_y
b_w = p_w e^{t_w}
b_h = p_h e^{t_h}
where t_x, t_y, t_w and t_h are the raw values predicted by the model, c_x and c_y are the coordinate offsets of the grid cell, p_w and p_h are the width and height of the prior (anchor) box, and b_x, b_y, b_w and b_h are the center coordinates, width and height of the resulting bounding box. The coordinates are trained with a sum-of-squared-error loss.
(3) The threshold for non-maximum suppression (NMS) is set to 0.45; NMS filters out the overlapping boxes produced during prediction.
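Steps (1) to (3) above can be sketched together in plain Python. This is an illustrative sketch under stated assumptions, not the patented implementation: all function names are hypothetical, the K-means similarity is assumed to be IoU between (width, height) pairs (the text only says "K-means"), the width and height decoding follows the cited YOLOv3 paper (anchor size times the exponential of the raw prediction), and NMS is the usual greedy variant.

```python
import math
import random


def iou_wh(a, b):
    """IoU of two (width, height) boxes that share the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k=6, iterations=100, seed=0):
    """Step (1): cluster ground-truth (w, h) pairs into k prior boxes."""
    centroids = random.Random(seed).sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            nearest = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[nearest].append(box)
        updated = []
        for members, old in zip(clusters, centroids):
            if members:  # new centroid is the mean width and height
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                updated.append((w, h))
            else:        # keep the centroid of an empty cluster
                updated.append(old)
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda wh: wh[0] * wh[1])


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Step (2): map raw predictions to (b_x, b_y, b_w, b_h) in grid units."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))


def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.45):
    """Step (3): greedy NMS; returns indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou_xyxy(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```

A raw prediction of all zeros decodes to the center of its cell with exactly the anchor size, which is a convenient sanity check when wiring the anchors from step (1) into step (2).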
FIG. 5 shows the detection result of the original algorithm on the Pascal VOC detection task. As the image shows, the original network fails to detect some of the objects in the image, and missed detections occur. Its detection accuracy for the horse class is 63.2%. The original algorithm is taken from the literature (Redmon J, Farhadi A. YOLOv3: An incremental improvement [J]. arXiv preprint arXiv:1804.02767, 2018.).
FIG. 6 shows the detection result of Example 2 on the Pascal VOC detection task; the objects in the image are detected and located well. The detection accuracy for the horse class is 79.4%, which is 16.2 percentage points higher than that of the original algorithm.
The invention integrates dense connections into the yolo-tiny network, adds convolution layers to the yolo-tiny network and improves the feature extraction network. The improved network first normalizes the input image to a fixed size, then extracts and fuses the features of each channel using Dense Block modules, and finally predicts with different prior boxes at different scales to complete the classification and localization of the target. Compared with the original algorithm, the improved algorithm raises precision by 15% while still meeting the requirement of real-time detection. The model size is only 44.7 MB, which satisfies the memory and real-time requirements of practical use.
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. A target detection method based on a dense connection deep network is characterized by comprising the following steps:
step (1): reading image data in a Pascal VOC data set and extracting target data characteristics;
step (2): training a network model;
step (3): performing target detection;
the method comprises the following specific steps:
step (1): reading in image data from the Pascal VOC data set and extracting target data features: the network reads in the input image data and normalizes its resolution to 416 x 416; after a convolution layer and a pooling layer, a feature map of size 208 x 208 is obtained; features are extracted from the image to be detected by 5 dense connection modules to obtain a feature map of size 13x13; the extracted 13x13 feature map is up-sampled to obtain a feature map of size 26x26; wherein the convolution kernel size in the convolution layer is 3 x 3 with a stride of 1, and the pooling kernel size in the pooling layer is 2 x 2 with a stride of 2;
step (2): training a network model: setting a network batch to 64, and repeating iterative training to obtain a detection model;
step (3): performing target detection: the network first extracts features from the input image through the feature extraction network to obtain a feature map of a certain size k x k, and then divides the input image into k x k cells, each cell predicting a fixed number of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, i.e. how likely the region is to contain the target; non-maximum suppression (NMS) is then performed, and the detection result is finally output;
the step (1) further comprises:
a dense connection pattern is introduced, so that an L-layer network has L(L+1)/2 connections; the dense connection module Dense Block is mainly composed of 1 x 1 and 3 x 3 convolution layers, wherein the 1 x 1 convolution is called the bottleneck layer and the 3 x 3 convolution is used to extract image features; the input of each layer in the Dense Block comes from the outputs of all preceding layers; the following formula expresses that the input of the l-th layer is the concatenation of the outputs of all preceding layers:
x_l = H_l([x_0, x_1, \ldots, x_{l-1}])
wherein x_l denotes the output of the l-th layer, [x_0, x_1, \ldots, x_{l-1}] denotes the concatenation of the outputs of layers 0, ..., l-1, and H_l(\cdot) denotes a composite function of three consecutive operations: BN, ReLU and a 3 x 3 convolution layer;
reducing the number of feature maps output by the convolution layers in the dense connection module Dense Block; the number of feature maps is 16 for Dense Block1, and 32, 64, 128 and 256 for Dense Block2, Dense Block3, Dense Block4 and Dense Block5, respectively; the number of output feature maps is increased from block to block so that the network can learn richer high-level semantic information from the image data and improve localization accuracy;
thirdly, dividing the network into a plurality of dense connection modules Dense Block, wherein different Dense Blocks are set to different numbers of feature maps, the number of feature maps output by each Dense Block doubles from block to block, being 16, 32, 64, 128 and 256 respectively, and the feature maps obtained by convolution inside each Dense Block are set to the same size;
the step (2) comprises the following steps:
setting the learning rate of the network to 0.001, the momentum to 0.9 and the weight decay regularization term to 0.0005, setting the maximum number of iterations to 500200, and reducing the learning rate by a factor of 10 when the iteration count reaches 400000 and 450000; meanwhile, the network uses multi-scale training: after the network reads in the data, the width and height to which the image is normalized take random values between 320 and 608, the values being changed once every 10 rounds and always being multiples of 32;
the step (3) comprises the following steps:
yolov3-tiny clusters the ground-truth boxes in the data set using the K-means clustering algorithm, and sets 3 prior boxes of different sizes for each of the two feature-map scales 13x13 and 26x26 obtained in step (1), giving 6 prior boxes of different sizes in total;
the 6 prior box sizes for the two different scales are as follows:
[The table of the 6 prior box sizes is provided as an image in the original publication (Figure FDA0003212505610000021).]
predicting on feature maps of two different scales 13x13 and 26x26 by using 6 different prior boxes Anchors; when the bounding box is predicted, in order to better model data and support multi-label classification, a network adopts logistic regression; the coordinate prediction formula of the network bounding box is as follows:
b_x = \sigma(t_x) + c_x
b_y = \sigma(t_y) + c_y
b_w = p_w e^{t_w}
b_h = p_h e^{t_h}
wherein t_x, t_y, t_w and t_h are the raw values predicted by the model, c_x and c_y denote the coordinate offsets of the grid cell, p_w and p_h denote the width and height of the prior (anchor) box, and b_x, b_y, b_w and b_h are the center coordinates, width and height of the finally obtained bounding box; the coordinates are trained with a sum-of-squared-error loss;
and setting the threshold of non-maximum suppression (NMS) to 0.45.
2. The use of the object detection method based on the dense connection depth network as claimed in claim 1 in image object detection.
3. The use according to claim 2, wherein the specific steps are as follows: the network reads in pedestrian image data from different scenes as training data; the image resolution is first normalized to 416 x 416, and a feature map of size 208 x 208 is then obtained through a convolution layer and a pooling layer; features are extracted from the image to be detected by 5 dense connection modules to obtain a feature map of size 13x13, and the extracted 13x13 feature map is up-sampled to obtain a feature map of size 26x26; wherein the convolution kernel size in the convolution layer is 3 x 3 with a stride of 1, and the pooling kernel size in the pooling layer is 2 x 2 with a stride of 2; the corresponding detection model is obtained by training the network; the trained model, the network configuration file and the image to be detected are loaded, and the network first extracts features from the input image to be detected through the feature extraction network; feature maps of 13 x 13 and 26 x 26 are obtained after feature extraction, and prediction is performed at these two scales; the network then divides the image to be detected into 13 x 13 and 26 x 26 cells respectively, and each cell predicts a fixed number (3) of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, i.e. how likely the region is to belong to the pedestrian class; non-maximum suppression (NMS) is then performed, and the detection result is finally output.
4. The use of the dense connection depth network-based object detection method of claim 1 in video object detection.
5. The use according to claim 4, wherein the specific steps are as follows: the network reads in video target image data from different scenes as training data; the image resolution of the target image data is first normalized to 416 x 416; features are extracted from the image to be detected by 5 dense connection modules to obtain a feature map of size 13x13, and the extracted 13x13 feature map is up-sampled to obtain a feature map of size 26x26; wherein the convolution kernel size in the convolution layer is 3 x 3 with a stride of 1, and the pooling kernel size in the pooling layer is 2 x 2 with a stride of 2; the detection model for the corresponding detection task is obtained by training the network; the trained model, the network configuration file and the image to be detected are loaded, and the network first extracts features from the input image to be detected through the feature extraction network; feature maps of 13 x 13 and 26 x 26 are obtained after feature extraction, and prediction is performed at these two scales; the network then divides the image to be detected into 13 x 13 and 26 x 26 cells respectively, and each cell predicts a fixed number (3) of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, i.e. how likely the region is to belong to the target class; non-maximum suppression (NMS) is then performed, and the detection result is finally output.
CN201911188895.5A 2019-11-28 2019-11-28 Target detection method based on dense connection deep network Active CN110991311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911188895.5A CN110991311B (en) 2019-11-28 2019-11-28 Target detection method based on dense connection deep network

Publications (2)

Publication Number Publication Date
CN110991311A CN110991311A (en) 2020-04-10
CN110991311B true CN110991311B (en) 2021-09-24

Family

ID=70087704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911188895.5A Active CN110991311B (en) 2019-11-28 2019-11-28 Target detection method based on dense connection deep network

Country Status (1)

Country Link
CN (1) CN110991311B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553406B (en) * 2020-04-24 2023-04-28 上海锘科智能科技有限公司 Target detection system, method and terminal based on improved YOLO-V3
CN112287740B (en) * 2020-05-25 2022-08-30 国网江苏省电力有限公司常州供电分公司 Target detection method and device for power transmission line based on YOLOv3-tiny, and unmanned aerial vehicle
CN111723737B (en) * 2020-06-19 2023-11-17 河南科技大学 Target detection method based on multi-scale matching strategy deep feature learning
CN111832489A (en) * 2020-07-15 2020-10-27 中国电子科技集团公司第三十八研究所 Subway crowd density estimation method and system based on target detection
CN111862056A (en) * 2020-07-23 2020-10-30 东莞理工学院 Retinal vessel image segmentation method based on deep learning
CN111860681B (en) * 2020-07-30 2024-04-30 江南大学 Deep network difficulty sample generation method under double-attention mechanism and application
CN112132034B (en) * 2020-09-23 2024-04-16 平安国际智慧城市科技股份有限公司 Pedestrian image detection method, device, computer equipment and storage medium
CN112861919A (en) * 2021-01-15 2021-05-28 西北工业大学 Underwater sonar image target detection method based on improved YOLOv3-tiny
CN112949389A (en) * 2021-01-28 2021-06-11 西北工业大学 Haze image target detection method based on improved target detection network
CN113449806A (en) * 2021-07-12 2021-09-28 苏州大学 Two-stage forestry pest identification and detection system and method based on hierarchical structure
CN113705359B (en) * 2021-08-03 2024-05-03 江南大学 Multi-scale clothes detection system and method based on drum images of washing machine
CN113705583B (en) * 2021-08-16 2024-03-22 南京莱斯电子设备有限公司 Target detection and identification method based on convolutional neural network model
CN114998220B (en) * 2022-05-12 2023-06-13 湖南中医药大学 Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
CN115410184A (en) * 2022-08-24 2022-11-29 江西山水光电科技股份有限公司 Target detection license plate recognition method based on deep neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522966A (en) * 2018-11-28 2019-03-26 中山大学 A kind of object detection method based on intensive connection convolutional neural networks
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN109685008A (en) * 2018-12-25 2019-04-26 云南大学 A kind of real-time video object detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10614326B2 (en) * 2017-03-06 2020-04-07 Honda Motor Co., Ltd. System and method for vehicle control based on object and color detection

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Densely Connected Convolutional Networks; Gao Huang et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 20170726; pp. 2261-2269 *
YOLO-RD: A lightweight object detection network for range doppler radar images; ZHOU Long et al.; IOP conference series; 20190731; pp. 1-6 *
YOLOv3: An Incremental Improvement; Joseph Redmon et al.; arXiv:1804.02767; 20180408; pp. 1-6 *

Also Published As

Publication number Publication date
CN110991311A (en) 2020-04-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant