CN110991311A - Target detection method based on dense connection deep network - Google Patents
- Publication number
- CN110991311A CN201911188895.5A
- Authority
- CN
- China
- Prior art keywords
- network
- image
- training
- dense connection
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Probability & Statistics with Applications (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a target detection method based on a dense connection deep network and belongs to the technical field of target detection. The method fuses dense connections into the YOLO-tiny network, adds convolution layers to it, and improves its feature extraction network. The improved network first normalizes the input image to a fixed size, then extracts and fuses the features of each channel with DenseBlock modules, and finally predicts with different prior boxes at different scales to classify and localize targets. Compared with the original algorithm, the improved algorithm raises precision by 15% and still meets the requirement of real-time detection; the model is only 44.7 MB, which satisfies the memory-footprint and real-time requirements of practical use.
Description
Technical Field
The invention relates to a target detection method based on a dense connection deep network, and belongs to the technical field of target detection.
Background
Many target detection algorithms based on deep learning exist today, such as Fast R-CNN (Fast Region-based Convolutional Network), SSD (Single Shot MultiBox Detector), R-FCN (Region-based Fully Convolutional Network), YOLO (You Only Look Once) and YOLO-Tiny (You Only Look Once-Tiny). These algorithms still have many shortcomings. Fast R-CNN, R-FCN and SSD suffer from low detection speed and a complex system configuration environment; YOLOv3 detects quickly, but its model occupies a large amount of memory; and YOLOv3-tiny has very low detection precision.
Although the current YOLOv3-tiny detection network is fast, it has several problems: inaccurate localization, poor detection results, and frequent missed and false detections. A residual network structure has been fused into YOLOv3-tiny in the literature, but the detection precision reached only 60.92%.
The densely connected convolutional neural network (Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger. Densely Connected Convolutional Networks [C]. CVPR, 2017. DOI: 10.1109/CVPR.2017.243) is an independent and complete detection network. Its drawback is that the way the output parameters of the different convolutional layers are set, together with the presence of fully connected layers, sharply increases the amount of computation and consumes a large amount of GPU memory. This limits the use of the network in practical production.
Disclosure of Invention
To solve at least one of the above problems, the invention provides a target detection method based on a dense connection deep network. By improving the network structure of the YOLOv3-tiny algorithm, the method achieves high detection precision, high speed and a small model memory footprint, and can meet the real-time requirements of practical use.
In the target detection method based on the dense connection deep network, dense connections are integrated into a convolutional neural network, and every extracted feature is fully utilized by concatenating the outputs of the convolutional layers. The invention improves the feature utilization and information flow of the detection network, strengthens feature propagation, and improves the detection results.
The invention aims to provide a target detection method based on a dense connection deep network, which comprises the following steps:
step (1): reading in image data from the Pascal VOC data set and extracting target data features;
step (2): training a network model;
step (3): carrying out target detection.
Optionally, the method comprises the following steps:
step (1): reading in image data from the Pascal VOC data set and extracting target data features: the network reads the input image data, first normalizes its resolution to 416 × 416, and then extracts and fuses the features of each channel through a series of convolution layers and dense connection modules (Dense Blocks);
step (2): training a network model: the network batch size is set to 64, and training is iterated repeatedly to obtain a detection model;
step (3): carrying out target detection: the network first extracts features from the input image through the feature extraction network to obtain a feature map of a certain size (assumed to be k × k); the input image is then divided into k × k cells, and each cell predicts a fixed number (3) of bounding boxes. During prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the region is to contain a target. Non-maximum suppression (NMS) is then applied, and finally the detection result is output.
Optionally, the step (1) includes:
(1) a dense connection pattern is introduced, so that an L-layer network has L(L+1)/2 connections. The dense connection module (Dense Block) consists mainly of 1 × 1 and 3 × 3 convolution layers. The 1 × 1 convolution, also called a bottleneck layer, reduces the number of input feature maps, improves computational efficiency and fuses the features of each channel; the 3 × 3 convolution extracts image features. The input of each layer in a Dense Block comes from the outputs of all preceding layers, which gives better results with fewer parameters. The formula below states that the input of layer l is the concatenation of the outputs of all preceding layers:
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$
where $x_l$ denotes the output of layer l and $[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the outputs of layers 0, …, l-1. $H_l(\cdot)$ is a composite function of three successive operations: batch normalization (BN), a rectified linear unit (ReLU) and a 3 × 3 convolution layer;
(2) the number of feature maps output by the convolution layers in each Dense Block is reduced;
(3) to implement down-sampling, the network is divided into several Dense Blocks; the feature-map output within each Dense Block is set to be the same, and the number of feature maps differs across scales.
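The connectivity rule in item (1) can be illustrated with some simple channel bookkeeping. The following is a minimal sketch, not the patented network: the function names, `in_channels` and `growth_rate` are hypothetical, and it assumes every layer in a block emits the same number of feature maps.

```python
def dense_connection_count(num_layers):
    """Direct connections in a dense block with L layers: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2


def dense_block_input_channels(in_channels, growth_rate, num_layers):
    """Input width seen by each layer of a dense block.

    Layer l receives the concatenation [x_0, ..., x_{l-1}], so its input
    has the block's input channels plus growth_rate channels for every
    earlier layer (assumption: each layer outputs growth_rate maps).
    """
    return [in_channels + l * growth_rate for l in range(num_layers)]


if __name__ == "__main__":
    print(dense_connection_count(4))
    print(dense_block_input_channels(16, 16, 4))
```

Because the input width grows linearly with depth, a 1 × 1 bottleneck convolution before each 3 × 3 convolution keeps the cost of the later layers in check.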
Optionally, the step (2) includes:
the learning rate of the network is set to 0.001, the momentum to 0.9 and the weight-decay regularization term to 0.0005; the maximum number of iterations is 500200, and the learning rate is divided by 10 when the iteration count reaches 400000 and again at 450000. The network also uses multi-scale training: after the data are read, the width and height of the normalized image resolution take random values between 320 and 608, all multiples of 32, and the value is changed at random once every 10 rounds.
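The step schedule described above can be written down directly. This is a minimal sketch under one assumption: the decay takes effect as soon as the iteration count reaches each milestone. The function name and parameter names are hypothetical.

```python
def learning_rate(iteration, base_lr=0.001, milestones=(400000, 450000), factor=0.1):
    """Step learning-rate schedule: multiply by `factor` at each milestone.

    Assumption: the decay applies once `iteration` has reached the
    milestone (iteration >= milestone).
    """
    lr = base_lr
    for milestone in milestones:
        if iteration >= milestone:
            lr *= factor
    return lr
```

With the values from the text, the rate is 0.001 until iteration 400000, roughly 1e-4 until 450000, and roughly 1e-5 for the remaining iterations up to 500200.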
Optionally, the step (3) includes:
(1) YOLOv3-tiny uses the K-means clustering algorithm to cluster the ground-truth boxes in the data set; 3 prior boxes of different sizes are set for each of the two scales obtained by down-sampling, so 6 prior boxes of different sizes are clustered in total;
the 6 prior box sizes for the two different scales are shown in Table 1 below:
TABLE 1
(2) Prediction is performed on feature maps of two different scales using the 6 prior boxes (anchors). To model the data better and support multi-label classification, the network uses logistic regression when predicting bounding boxes. The bounding-box coordinates are predicted as:
$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
where $t_x$, $t_y$, $t_w$ and $t_h$ are the raw predictions of the model, $c_x$ and $c_y$ are the coordinate offsets of the grid cell, $p_w$ and $p_h$ are the width and height of the anchor box, and $b_x$, $b_y$, $b_w$ and $b_h$ are the center coordinates, width and height of the final bounding box. The coordinates are trained with a sum-of-squared-error loss.
(3) The threshold for non-maximum suppression (NMS) is set to 0.45.
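The coordinate decoding in item (2) can be sketched as follows. The text gives only the center equations; the width and height equations are not reproduced, so this sketch assumes the exponential form used in the cited YOLOv3 paper, b_w = p_w·exp(t_w) and b_h = p_h·exp(t_h). The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, squashing a raw prediction into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw predictions into a box, in grid-cell units.

    (cx, cy) is the offset of the grid cell; (pw, ph) is the anchor size.
    """
    bx = sigmoid(tx) + cx  # center stays inside the predicting cell
    by = sigmoid(ty) + cy
    # Assumption: exponential scaling of the anchor, as in the YOLOv3 paper.
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

For example, all-zero raw predictions in cell (3, 4) with a 2 × 5 anchor decode to a box centered at (3.5, 4.5) with the anchor's own width and height.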
The second purpose of the invention is to apply the target detection method based on the dense connection deep network to image target detection.
Optionally, the method is applied to image target detection through the following specific steps. The network reads pedestrian image data from different scenes as training data, normalizes the image resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers and dense connection modules (Dense Blocks). Training the network yields the corresponding detection model. For detection, the trained model, the network configuration file and the image to be detected are loaded, and the network first extracts features from the input image through the feature extraction network. Because the method uses multi-scale prediction, feature maps of 13 × 13 and 26 × 26 are obtained, and prediction is carried out at these two scales. The network divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts a fixed number (3) of bounding boxes. During prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the region is to belong to the pedestrian category. Non-maximum suppression (NMS) is then applied, and finally the detection result is output.
The third purpose of the invention is to apply the target detection method based on the dense connection deep network to video target detection.
Optionally, the method is applied to video target detection through the following specific steps. The network reads video target image data from different scenes as training data, normalizes the image resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers and dense connection modules (Dense Blocks). Training the network yields the detection model for the corresponding detection task. For detection, the trained model, the network configuration file and the image to be detected are loaded, and the network first extracts features from the input image through the feature extraction network. Because the invention uses multi-scale prediction, feature maps of 13 × 13 and 26 × 26 are obtained, and prediction is carried out at these two scales. The network divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts a fixed number (3) of bounding boxes. During prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the region is to belong to the pedestrian category. Non-maximum suppression (NMS) is then applied. Because the network detects quickly, it achieves real-time detection, so it can be applied to real-time video detection and output the detection result.
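The grid layout described above fixes how many candidate boxes the network emits per image. The sketch below is only an illustration of that arithmetic; the function names are hypothetical, and it assumes the 416 × 416 input divides evenly into the grid.

```python
def total_candidate_boxes(grid_sizes=(13, 26), boxes_per_cell=3):
    """Candidate boxes per image: every cell at every scale predicts 3 boxes."""
    return sum(k * k * boxes_per_cell for k in grid_sizes)


def cell_of_pixel(x, y, image_size=416, grid_size=13):
    """Grid cell (column, row) that contains pixel (x, y).

    Assumption: image_size is an exact multiple of grid_size.
    """
    stride = image_size // grid_size  # 32 pixels per cell at 13 x 13
    return x // stride, y // stride
```

With the two scales from the text this gives (13·13 + 26·26)·3 = 2535 candidate boxes, which is why a suppression step is needed before the result is output.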
The invention has the beneficial effects that:
(1) The method makes full use of every extracted feature, which improves the feature utilization of the network, strengthens feature propagation, and enhances the network's learning of detail information.
(2) The method reaches a detection precision of 65.93%, higher than the 49.19% of YOLO-tiny.
(3) The method reaches a detection speed of 83 fps and can be applied to various real-time target detection tasks in real scenes.
(4) The model used by the method is only 44.7 MB, so it places a small demand on computer memory and reduces cost.
Drawings
Fig. 1 is a diagram of a dense connection network architecture.
Fig. 2 is an overall architecture diagram of the network.
Fig. 3 is the pedestrian detection results of the original algorithm in the Pascal VOC data set.
Figure 4 is the results of pedestrian detection in the Pascal VOC data set of example 2.
FIG. 5 is the detection result of the original algorithm in the Pascal VOC detection task.
FIG. 6 is the results of example 3 in the Pascal VOC detection task.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Example 1
Existing target detection methods achieve high detection precision but cannot meet the requirement of real-time detection in actual production, and the memory their models occupy makes them poorly portable. To address these problems, the invention provides a target detection method based on a dense connection deep network, described in detail below with reference to the accompanying drawings:
Fig. 1 is a structural diagram of the dense connection network used by the method, and Fig. 2 is the overall network architecture diagram of the method. In this embodiment, the target detection method based on a dense connection deep network includes the following steps:
A.1, reading in image data from the Pascal VOC data set and extracting target data features: the network reads the input image data, first normalizes the resolution to 416 × 416, and then extracts and fuses the features of each channel through a series of convolution layers and dense connection modules (Dense Blocks).
The step A.1 comprises the following steps:
(1) A dense connection pattern is introduced, so that an L-layer network has L(L+1)/2 connections. The dense connection module (Dense Block) consists mainly of 1 × 1 and 3 × 3 convolution layers. The 1 × 1 convolution, also called a bottleneck layer, reduces the number of input feature maps, improves computational efficiency and fuses the features of each channel. The 3 × 3 convolution extracts image features. The input of each layer in a Dense Block comes from the outputs of all preceding layers, which gives better results with fewer parameters. The formula below states that the input of layer l is the concatenation of the outputs of all preceding layers:
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$
where $x_l$ denotes the output of layer l and $[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the outputs of layers 0, …, l-1. $H_l(\cdot)$ is a composite function of three successive operations: batch normalization (BN), a rectified linear unit (ReLU) and a 3 × 3 convolution layer.
(2) The number of feature maps output by the convolution layers in each Dense Block is reduced.
(3) To implement down-sampling, the network is divided into several Dense Blocks; the feature-map output within each Dense Block is set to be the same, and the number of feature maps differs across scales.
B.1, training a network model: the network batch size is set to 64, and training is iterated repeatedly to obtain a detection model.
The step B.1 comprises the following steps:
The learning rate of the network is set to 0.001, the momentum to 0.9 and the weight-decay regularization term to 0.0005; the maximum number of iterations is 500200, and the learning rate is divided by 10 when the iteration count reaches 400000 and again at 450000. The network also uses multi-scale training: the width and height of the network input size take random values between 320 and 608, and the input size is changed once every 10 rounds.
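The multi-scale training step can be sketched as a small sampler. This is an illustration under stated assumptions, not the training code: it assumes square inputs whose side is a multiple of 32 (the constraint the description gives for this scheme), a uniform choice among the admissible sizes, and a fixed seed for reproducibility. All names are hypothetical.

```python
import random


def admissible_sizes(low=320, high=608, multiple=32):
    """Input sizes between low and high (inclusive) that are multiples of 32."""
    return list(range(low, high + 1, multiple))


def size_schedule(num_rounds, interval=10, seed=0):
    """Input size used in each round, resampled once every `interval` rounds."""
    rng = random.Random(seed)  # assumption: seeded only to make the sketch repeatable
    choices = admissible_sizes()
    size = rng.choice(choices)
    schedule = []
    for round_index in range(num_rounds):
        if round_index > 0 and round_index % interval == 0:
            size = rng.choice(choices)
        schedule.append(size)
    return schedule
```

There are ten admissible sizes (320, 352, …, 608); at inference time the input is fixed at 416 × 416, which is one of them.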
C.1, target detection:
The network first extracts features from the input image through the feature extraction network to obtain a feature map of a certain size (assumed to be k × k); the input image is then divided into k × k cells, and each cell predicts a fixed number (3) of bounding boxes. During prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the region is to contain a target. Non-maximum suppression (NMS) is then applied, and finally the detection result is output.
The step C.1 specifically comprises the following steps:
(1) YOLOv3-tiny clusters the ground-truth boxes in the data set using the K-means clustering algorithm and sets 3 prior boxes of different sizes for each down-sampling scale, so 6 prior boxes of different sizes are clustered in total.
The 6 prior box sizes for the two different scales are shown in table 2 below:
TABLE 2
(2) Prediction is performed on feature maps of two different scales using the 6 prior boxes (anchors). To model the data better and support multi-label classification, the network uses logistic regression when predicting bounding boxes. The bounding-box coordinates are predicted as:
$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
where $t_x$, $t_y$, $t_w$ and $t_h$ are the raw predictions of the model, $c_x$ and $c_y$ are the coordinate offsets of the grid cell, $p_w$ and $p_h$ are the width and height of the anchor box, and $b_x$, $b_y$, $b_w$ and $b_h$ are the center coordinates, width and height of the final bounding box. The coordinates are trained with a sum-of-squared-error loss.
(3) The threshold for non-maximum suppression (NMS) is set to 0.45.
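The anchor clustering in item (1) can be sketched in plain Python. The text says only that K-means is used; the sketch assumes the similarity measure is the IoU between box sizes aligned at a common corner (the practice popularized by the YOLO papers), and the initialization, seed and function names are hypothetical.

```python
import random


def iou_wh(box, centroid):
    """IoU of two (width, height) pairs aligned at the same corner."""
    w, h = box
    cw, ch = centroid
    inter = min(w, cw) * min(h, ch)
    return inter / (w * h + cw * ch - inter)


def kmeans_anchors(boxes, k=6, iterations=100, seed=0):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumption: random initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the centroid it overlaps most.
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])
```

Sorting by area makes it straightforward to hand the three smaller anchors to the finer 26 × 26 scale and the three larger ones to the 13 × 13 scale, which is the usual convention and an assumption here.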
Example 2
This example is the process and results for pedestrian detection on the Pascal VOC data set. The method comprises the following specific steps:
A.1, reading in pedestrian image data from different scenes in the Pascal VOC data set as training data and extracting pedestrian data features: the network reads the input image, first normalizes the resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers and dense connection modules (Dense Blocks).
The step A.1 comprises the following steps:
(1) A dense connection pattern is introduced, so that an L-layer network has L(L+1)/2 connections. The dense connection module (Dense Block) consists mainly of 1 × 1 and 3 × 3 convolution layers. The 1 × 1 convolution, also called a bottleneck layer, reduces the number of input feature maps, improves computational efficiency and fuses the features of each channel. The 3 × 3 convolution extracts image features. The input of each layer in a Dense Block comes from the outputs of all preceding layers, which gives better results with fewer parameters.
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$
where $x_l$ denotes the output of layer l and $[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the outputs of layers 0, …, l-1. $H_l(\cdot)$ consists of batch normalization (BN), a rectified linear unit (ReLU) and a 3 × 3 convolution layer.
The first Dense Block produces a 208 × 48 feature map from the image data. This is passed through a 1 × 1 convolution layer, which reduces the number of input channels and the computational complexity of the network, and then through a 2 × 2 pooling layer, which down-samples the feature map to obtain higher-level semantic information. The resulting output is the input to Dense Block 2.
(2) The number of feature maps output by the convolution layers in each Dense Block is reduced. Dense Block 1 sets the number of feature maps to 16, and Dense Blocks 2, 3, 4 and 5 set it to 32, 64, 128 and 256 respectively. Increasing the number of output feature maps from block to block lets the network learn richer high-level semantic information from the pedestrian image data and improves localization accuracy.
(3) To implement down-sampling, the network is divided into several Dense Blocks; the feature-map output within each Dense Block is set to be the same, and the number of feature maps differs across scales.
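The relationship between the 416 × 416 input and the 26 × 26 and 13 × 13 prediction scales follows from repeated halving. The sketch below is only that arithmetic; it assumes every down-sampling stage halves the spatial size (a 2 × 2 pooling with stride 2), and it does not claim where in the network each halving sits. The function name is hypothetical.

```python
def spatial_sizes(input_size=416, num_halvings=5):
    """Feature-map side length after each successive halving.

    Assumption: each down-sampling stage divides the side length by 2.
    """
    sizes = [input_size]
    for _ in range(num_halvings):
        sizes.append(sizes[-1] // 2)
    return sizes


if __name__ == "__main__":
    print(spatial_sizes())
```

Four halvings of 416 give 26 and five give 13, matching the two prediction scales; the first halving gives 208, the spatial size quoted for the first Dense Block.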
B.1, training the network model: the network batch size is set to 64, and training is iterated repeatedly to obtain a detection model.
The step B.1 comprises the following steps:
The learning rate of the network is set to 0.001, the momentum to 0.9 and the weight-decay regularization term to 0.0005; the maximum number of iterations is 500200, and the learning rate is divided by 10 when the iteration count reaches 400000 and again at 450000. The network also uses multi-scale training: after the pedestrian image data are read, the width and height of the normalized image resolution take random values between 320 and 608, all multiples of 32, and the value is changed once every 10 rounds.
C.1, target detection:
To detect an image, the model, the network configuration file and the image data to be detected are loaded first. The network extracts features from the input image through the feature extraction network; because multi-scale prediction is used, feature maps of 13 × 13 and 26 × 26 are obtained, and prediction is carried out at these two scales. The network then divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts a fixed number (3) of bounding boxes. During prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the region is to belong to the pedestrian category. Non-maximum suppression (NMS) is then applied, and finally the detection result is output.
The step C.1 specifically comprises the following steps:
(1) YOLOv3-tiny clusters the ground-truth boxes in the data set using the K-means clustering algorithm and sets 3 prior boxes of different sizes for each down-sampling scale, so 6 prior boxes of different sizes are clustered in total. The 6 prior box sizes for the two different scales are shown in Table 3 below:
TABLE 3
(2) Prediction is performed on feature maps of two different scales using the 6 prior boxes (anchors). To model the data better and support multi-label classification, the network uses logistic regression when predicting bounding boxes. The bounding-box coordinates are predicted as:
$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
where $t_x$, $t_y$, $t_w$ and $t_h$ are the raw predictions of the model, $c_x$ and $c_y$ are the coordinate offsets of the grid cell, $p_w$ and $p_h$ are the width and height of the anchor box, and $b_x$, $b_y$, $b_w$ and $b_h$ are the center coordinates, width and height of the final bounding box. The coordinates are trained with a sum-of-squared-error loss.
(3) The threshold for non-maximum suppression (NMS) is set to 0.45; NMS filters out the overlapping boxes that arise during prediction.
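The suppression step in item (3) can be sketched as greedy, IoU-based NMS. The text gives only the 0.45 threshold; the greedy procedure, the corner box format (x1, y1, x2, y2) and the function names are assumptions made for this illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.45):
    """Return indices of the boxes kept, highest score first.

    A box is dropped when its IoU with an already kept box exceeds
    the threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in kept):
            kept.append(i)
    return kept
```

For instance, two 10 × 10 boxes offset by one pixel in each direction have an IoU of 81/119 ≈ 0.68, so the lower-scoring one is suppressed, while a distant third box is kept.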
Fig. 3 shows the pedestrian detection result of the original algorithm on the Pascal VOC data set; its accuracy on the pedestrian category is 65.1%. The original algorithm is taken from the literature (Redmon J, Farhadi A. YOLOv3: An Incremental Improvement [J]. arXiv preprint arXiv:1804.02767, 2018.).
Fig. 4 shows the pedestrian detection result of Example 2 on the Pascal VOC data set; its accuracy on the pedestrian category is 79.8%, an improvement of 14.7 percentage points over the original algorithm.
Example 3
This example is the process and results for detecting the horse category on the Pascal VOC data set. The method comprises the following specific steps:
A.1, reading in image data of horses from different scenes in the Pascal VOC data set as training data and extracting horse-category data features: the network reads the input image, first normalizes the resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers and dense connection modules (Dense Blocks).
The step A.1 comprises the following steps:
(1) A dense connection pattern is introduced, so that an L-layer network has L(L+1)/2 connections. The Dense Block consists mainly of 1 × 1 and 3 × 3 convolution layers. The 1 × 1 convolution, also called a bottleneck layer, reduces the number of input feature maps, improves computational efficiency and fuses the features of each channel. The 3 × 3 convolution extracts image features. The input of each layer in a Dense Block comes from the outputs of all preceding layers, which gives better results with fewer parameters.
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$
where $x_l$ denotes the output of layer l and $[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the outputs of layers 0, …, l-1. $H_l(\cdot)$ consists of batch normalization (BN), a rectified linear unit (ReLU) and a 3 × 3 convolution layer.
The first Dense Block produces a 208 × 48 feature map from the image data. This is passed through a 1 × 1 convolution layer, which reduces the number of input channels and the computational complexity of the network, and then through a 2 × 2 pooling layer, which down-samples the feature map to obtain higher-level semantic information. The resulting output is the input to Dense Block 2.
(2) The number of feature maps output by the convolution layers in each Dense Block is reduced. Dense Block 1 sets the number of feature maps to 16, and Dense Blocks 2, 3, 4 and 5 set it to 32, 64, 128 and 256 respectively. Increasing the number of output feature maps from block to block lets the network learn richer high-level semantic information from the horse image data and improves localization accuracy.
(3) To implement down-sampling, the network is divided into several Dense Blocks; the feature-map output within each Dense Block is set to be the same, and the number of feature maps differs across scales.
B.1, training the network model: the network batch size is set to 64, and training is iterated repeatedly to obtain a detection model.
The step B.1 comprises the following steps:
The learning rate of the network is set to 0.001, the momentum to 0.9 and the weight-decay regularization term to 0.0005; the maximum number of iterations is 500200, and the learning rate is divided by 10 when the iteration count reaches 400000 and again at 450000. The network also uses multi-scale training: after the horse image data are read, the width and height of the normalized image resolution take random values between 320 and 608, all multiples of 32, and the value is changed once every 10 rounds.
C.1, target detection:
when detecting the image, firstly loading the model, the network configuration file and the image data to be detected, firstly extracting the characteristics of the input image to be detected by the network through the characteristic extraction network, and obtaining the characteristic diagrams of 13 x 13 and 26 x 26 after extracting the characteristics due to the adoption of multi-scale prediction, and predicting under the two different scales. The network then divides the image to be detected into 13 × 13, 26 × 26 cells, respectively, each cell predicting a fixed number (3) of bounding boxes. Logistic regression is used in the prediction to predict the goal score of each bounding box, i.e., how likely the block is to be in the horse category. Then, non-maximum suppression (NMS) is carried out, and finally, a detection result is output.
The step C.1 specifically comprises the following steps:
(1) yolov3-tiny clusters the ground-truth boxes in the dataset using the K-means clustering algorithm, sets 3 prior boxes of different sizes for each down-sampling scale, and thus obtains 6 prior boxes of different sizes in total. The 6 prior box sizes for the two different scales are shown in Table 4 below:
TABLE 4
(2) Prediction is performed on the feature maps at the two different scales using the 6 different prior boxes (anchors). When predicting bounding boxes, the network uses logistic regression, which models the data better and supports multi-label classification. The bounding-box coordinate prediction formulas of the network are as follows:
$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$
where $t_x$, $t_y$, $t_w$ and $t_h$ are the raw predicted values of the model, $c_x$ and $c_y$ denote the coordinate offsets of the grid cell, $p_w$ and $p_h$ are the width and height of the anchor box, and $b_x$, $b_y$, $b_w$ and $b_h$ are the center coordinates, width and height of the final bounding box. Coordinate training uses a sum-of-squared-errors loss.
(3) The threshold for non-maximum suppression (NMS) is set to 0.45; NMS filters out the overlapping boxes that arise during prediction.
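The three sub-steps above can be sketched together in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: all function names are hypothetical; the K-means distance is assumed to be 1 − IoU between box sizes, as is common for YOLO-style anchor clustering; the width and height decoding assumes the YOLOv3 form $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$; and NMS is the usual greedy variant with the 0.45 threshold.

```python
import math
import random


def wh_iou(a, b):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(sizes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k prior boxes; the 1 - IoU distance is an assumption."""
    centroids = random.Random(seed).sample(sizes, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for s in sizes:
            nearest = max(range(k), key=lambda i: wh_iou(s, centroids[i]))
            groups[nearest].append(s)
        updated = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:  # assignments no longer change the centroids
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw outputs to a box centre and size, in grid-cell units."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    return (sigmoid(tx) + cx, sigmoid(ty) + cy, pw * math.exp(tw), ph * math.exp(th))


def box_iou(a, b):
    """IoU of two corner-format boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: visit boxes by descending score, keep those not overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    print(kmeans_anchors([(10, 10), (12, 12), (100, 100), (110, 90)], k=2))
    print(decode_box(0, 0, 0, 0, cx=3, cy=4, pw=2.0, ph=5.0))
    print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7]))
```

For the method described here, `kmeans_anchors` would be run with k = 6 on the ground-truth box sizes, and the resulting anchors split three per scale.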
FIG. 5 shows the detection result of the original algorithm on the Pascal VOC detection task. As the image shows, the original network does not detect the objects in the image well, and missed detections occur. Its accuracy on the horse class is 63.2%. The original algorithm is taken from the literature (Redmon J, Farhadi A. YOLOv3: An Incremental Improvement [J]. arXiv preprint arXiv:1804.02767, 2018).
FIG. 6 shows the detection result of Example 2 on the Pascal VOC detection task; the categories in the image are detected and located well. The accuracy on the horse class is 79.4%, an improvement of 16.2 percentage points over the original algorithm.
The invention integrates dense connections into the yolo-tiny network, adds convolution layers to it and improves the feature extraction network. The improved network first normalizes the input image to a fixed size, then extracts and fuses the features of each channel using Dense Block modules, and then predicts with different prior boxes at different scales to complete the classification and localization of the target. Compared with the original algorithm, the improved algorithm raises precision by 15% and can meet the requirement of real-time detection; the model size is only 44.7 MB, which meets practical requirements on memory footprint and real-time performance.
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (9)
1. A target detection method based on a dense connection deep network is characterized by comprising the following steps:
step (1): reading image data in a Pascal VOC data set and extracting target data characteristics;
step (2): training a network model;
step (3): carrying out target detection.
2. The method according to claim 1, characterized by the specific steps of:
step (1): reading in image data in the Pascal VOC data set and extracting target data characteristics: reading input image data by a network, firstly normalizing the resolution of the input image data to 416 x 416, and then extracting and fusing the characteristics of each channel through a series of convolution layers and a Dense connection module Dense Block;
step (2): training a network model: setting a network batch to 64, and repeating iterative training to obtain a detection model;
step (3): carrying out target detection: the network first extracts features of the input image through the feature extraction network to obtain a feature map of size k × k, and the input image is then divided into k × k cells, each cell predicting a fixed number of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, i.e., how likely the region is to be a target; non-maximum suppression (NMS) is then carried out, and finally the detection result is output.
3. The method of claim 1, wherein step (1) comprises:
(1) a dense connection pattern is introduced, so that an L-layer network has L(L+1)/2 connections; the dense connection module (Dense Block) consists mainly of 1 × 1 and 3 × 3 convolution layers, wherein the 1 × 1 convolution is called a bottleneck layer and the 3 × 3 convolution is used to extract image features; the input of each layer in the Dense Block comes from the outputs of all preceding layers; the formula below expresses that the input of the l-th layer is the concatenation of the outputs of all preceding layers:
$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$
wherein $x_l$ represents the output of the l-th layer and $[x_0, x_1, \ldots, x_{l-1}]$ represents the concatenation of the outputs of layers 0, …, l-1; $H_l(\cdot)$ in the above formula represents a composite function of three successive operations: BN, ReLU and a 3 × 3 convolution layer;
(2) the number of feature maps output by the convolution layers in the dense connection module (Dense Block) is reduced;
(3) the network is divided into a plurality of dense connection modules; within each dense connection module every layer outputs the same number of feature maps, while the number of feature maps differs between scales.
4. The method of claim 1, wherein step (2) comprises:
the learning rate of the network is set to 0.001, the momentum to 0.9 and the weight decay regularization term to 0.0005; the maximum number of iterations is 500200, and the learning rate is divided by 10 when the iteration count reaches 400000 and again when it reaches 450000; the network also uses multi-scale training: after the network reads in the data, the width and height of the normalized image resolution take random values between 320 and 608, all multiples of 32, and the value is re-drawn once every 10 rounds.
5. The method of claim 1, wherein step (3) comprises:
(1) yolov3-tiny uses the K-means clustering algorithm to cluster the ground-truth boxes in the data set, sets 3 prior boxes of different sizes for each of the two scales obtained by down-sampling, and thus obtains 6 prior boxes of different sizes in total;
the 6 prior box sizes for the two different scales are as follows:
(2) prediction is performed on the feature maps of the two different scales using the 6 different prior boxes (anchors); when predicting bounding boxes, the network uses logistic regression in order to model the data better and support multi-label classification; the bounding-box coordinate prediction formulas of the network are as follows:
$b_x = \sigma(t_x) + c_x$
$b_y = \sigma(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$
wherein $t_x$, $t_y$, $t_w$ and $t_h$ are the raw predicted values of the model, $c_x$ and $c_y$ denote the coordinate offsets of the grid cell, $p_w$ and $p_h$ are the width and height of the anchor box, and $b_x$, $b_y$, $b_w$ and $b_h$ are the center coordinates, width and height of the final bounding box; coordinate training uses a sum-of-squared-errors loss;
(3) the threshold for non-maximum suppression NMS is set to 0.45.
6. The use of the object detection method based on the dense connection depth network as claimed in claim 1 in image object detection.
7. The application of claim 6, wherein the specific application steps are as follows: the network reads pedestrian image data from different scenes as training data, normalizes the image resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers and the dense connection module (Dense Block); a corresponding detection model is obtained by training the network; the trained model, the network configuration file and the image to be detected are loaded, and the network first extracts features of the input image through the feature extraction network; feature maps of 13 × 13 and 26 × 26 are obtained after feature extraction, and prediction is carried out at these two scales; the network then divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts a fixed number (3) of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, i.e., how likely the region is to belong to the pedestrian category; non-maximum suppression (NMS) is then carried out, and finally the detection result is output.
8. The use of the dense connection depth network-based object detection method of claim 1 in video object detection.
9. The application of claim 8, wherein the specific application steps are as follows: the network reads video target image data from different scenes as training data, normalizes the image resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers and the dense connection module (Dense Block); a detection model for the corresponding detection task is obtained by training the network; the trained model, the network configuration file and the image to be detected are loaded, and the network first extracts features of the input image through the feature extraction network; feature maps of 13 × 13 and 26 × 26 are obtained after feature extraction, and prediction is carried out at these two scales; the network then divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts 3 bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, i.e., how likely the region is to belong to the target category; non-maximum suppression (NMS) is then carried out, and the detection result is output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911188895.5A CN110991311B (en) | 2019-11-28 | 2019-11-28 | Target detection method based on dense connection deep network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110991311A (en) | 2020-04-10 |
CN110991311B (en) | 2021-09-24 |
Family
ID=70087704
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911188895.5A Active CN110991311B (en) | 2019-11-28 | 2019-11-28 | Target detection method based on dense connection deep network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110991311B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190012551A1 (en) * | 2017-03-06 | 2019-01-10 | Honda Motor Co., Ltd. | System and method for vehicle control based on object and color detection |
CN109522966A (en) * | 2018-11-28 | 2019-03-26 | 中山大学 | A kind of object detection method based on intensive connection convolutional neural networks |
CN109685152A (en) * | 2018-12-29 | 2019-04-26 | 北京化工大学 | A kind of image object detection method based on DC-SPP-YOLO |
CN109685008A (en) * | 2018-12-25 | 2019-04-26 | 云南大学 | A kind of real-time video object detection method |
Non-Patent Citations (3)
Title |
---|
GAO HUANG et al.: "Densely Connected Convolutional Networks", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 * |
JOSEPH REDMON et al.: "YOLOv3: An Incremental Improvement", 《ARXIV:1804.02767》 * |
ZHOU LONG et al.: "YOLO-RD: A lightweight object detection network for range doppler radar images", 《IOP CONFERENCE SERIES》 * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111553406B (en) * | 2020-04-24 | 2023-04-28 | 上海锘科智能科技有限公司 | Target detection system, method and terminal based on improved YOLO-V3 |
CN111553406A (en) * | 2020-04-24 | 2020-08-18 | 上海锘科智能科技有限公司 | Target detection system, method and terminal based on improved YOLO-V3 |
CN112287740A (en) * | 2020-05-25 | 2021-01-29 | 国网江苏省电力有限公司常州供电分公司 | Target detection method and device for power transmission line based on YOLOv3-tiny, and unmanned aerial vehicle |
CN112287740B (en) * | 2020-05-25 | 2022-08-30 | 国网江苏省电力有限公司常州供电分公司 | Target detection method and device for power transmission line based on YOLOv3-tiny, and unmanned aerial vehicle |
CN111723737B (en) * | 2020-06-19 | 2023-11-17 | 河南科技大学 | Target detection method based on multi-scale matching strategy deep feature learning |
CN111723737A (en) * | 2020-06-19 | 2020-09-29 | 河南科技大学 | Target detection method based on multi-scale matching strategy deep feature learning |
CN111832489A (en) * | 2020-07-15 | 2020-10-27 | 中国电子科技集团公司第三十八研究所 | Subway crowd density estimation method and system based on target detection |
CN111862056A (en) * | 2020-07-23 | 2020-10-30 | 东莞理工学院 | Retinal vessel image segmentation method based on deep learning |
CN111860681A (en) * | 2020-07-30 | 2020-10-30 | 江南大学 | Method for generating deep network difficult sample under double-attention machine mechanism and application |
CN111860681B (en) * | 2020-07-30 | 2024-04-30 | 江南大学 | Deep network difficulty sample generation method under double-attention mechanism and application |
CN112132034A (en) * | 2020-09-23 | 2020-12-25 | 平安国际智慧城市科技股份有限公司 | Pedestrian image detection method and device, computer equipment and storage medium |
CN112132034B (en) * | 2020-09-23 | 2024-04-16 | 平安国际智慧城市科技股份有限公司 | Pedestrian image detection method, device, computer equipment and storage medium |
CN112861919A (en) * | 2021-01-15 | 2021-05-28 | 西北工业大学 | Underwater sonar image target detection method based on improved YOLOv3-tiny |
CN112949389A (en) * | 2021-01-28 | 2021-06-11 | 西北工业大学 | Haze image target detection method based on improved target detection network |
CN113449806A (en) * | 2021-07-12 | 2021-09-28 | 苏州大学 | Two-stage forestry pest identification and detection system and method based on hierarchical structure |
CN113705359B (en) * | 2021-08-03 | 2024-05-03 | 江南大学 | Multi-scale clothes detection system and method based on drum images of washing machine |
CN113705359A (en) * | 2021-08-03 | 2021-11-26 | 江南大学 | Multi-scale clothes detection system and method based on washing machine drum image |
CN113705583A (en) * | 2021-08-16 | 2021-11-26 | 南京莱斯电子设备有限公司 | Target detection and identification method based on convolutional neural network model |
CN113705583B (en) * | 2021-08-16 | 2024-03-22 | 南京莱斯电子设备有限公司 | Target detection and identification method based on convolutional neural network model |
CN113989939A (en) * | 2021-11-16 | 2022-01-28 | 河北工业大学 | Small-target pedestrian detection system based on improved YOLO algorithm |
CN113989939B (en) * | 2021-11-16 | 2024-05-14 | 河北工业大学 | Small target pedestrian detection system based on improved YOLO algorithm |
CN114998220A (en) * | 2022-05-12 | 2022-09-02 | 湖南中医药大学 | Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment |
CN115410184A (en) * | 2022-08-24 | 2022-11-29 | 江西山水光电科技股份有限公司 | Target detection license plate recognition method based on deep neural network |
Also Published As
Publication number | Publication date |
---|---|
CN110991311B (en) | 2021-09-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110991311B (en) | Target detection method based on dense connection deep network | |
CN109977943B (en) | Image target recognition method, system and storage medium based on YOLO | |
CN107103754B (en) | Road traffic condition prediction method and system | |
CN111626128A (en) | Improved YOLOv 3-based pedestrian detection method in orchard environment | |
CN111368636B (en) | Object classification method, device, computer equipment and storage medium | |
CN111753682B (en) | Hoisting area dynamic monitoring method based on target detection algorithm | |
CN109492596B (en) | Pedestrian detection method and system based on K-means clustering and regional recommendation network | |
US20240256377A1 (en) | Fault diagnosis method and apparatus, electronic device, and storage medium | |
US20220351502A1 (en) | Multiple object detection method and apparatus | |
CN111091101B (en) | High-precision pedestrian detection method, system and device based on one-step method | |
CN113487610B (en) | Herpes image recognition method and device, computer equipment and storage medium | |
CN114898327B (en) | Vehicle detection method based on lightweight deep learning network | |
CN110532959B (en) | Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network | |
WO2021185121A1 (en) | Model generation method and apparatus, object detection method and apparatus, device, and storage medium | |
CN112906865B (en) | Neural network architecture searching method and device, electronic equipment and storage medium | |
CN111461145A (en) | Method for detecting target based on convolutional neural network | |
CN112101113A (en) | Lightweight unmanned aerial vehicle image small target detection method | |
CN116310688A (en) | Target detection model based on cascade fusion, and construction method, device and application thereof | |
CN111340139B (en) | Method and device for judging complexity of image content | |
CN113177956A (en) | Semantic segmentation method for unmanned aerial vehicle remote sensing image | |
CN112132207A (en) | Target detection neural network construction method based on multi-branch feature mapping | |
CN116958809A (en) | Remote sensing small sample target detection method for feature library migration | |
CN114882490A (en) | Unlimited scene license plate detection and classification method based on point-guided positioning | |
CN115424012A (en) | Lightweight image semantic segmentation method based on context information | |
Zhihao et al. | Object detection algorithm based on dense connection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||