CN110991311B - Target detection method based on dense connection deep network

Info

Publication number: CN110991311B
Application number: CN201911188895.5A
Authority: CN
Inventors: 陈莹; 潘志浩; 化春键
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2019-11-28
Filing date: 2019-11-28
Publication date: 2021-09-24
Anticipated expiration: 2039-11-28
Also published as: CN110991311A

Abstract

The invention discloses a target detection method based on a dense connection deep network, and belongs to the technical field of target detection. The method fuses dense connections into the yolo-tiny network, adds convolution layers to it, and improves its feature extraction network. The improved network first normalizes an input image to a fixed size, then extracts and fuses the features of each channel using DenseBlock modules, and then predicts with different prior boxes at different scales to complete the classification and localization of targets. Compared with the original algorithm, the improved algorithm raises precision by 15% and still meets the requirement of real-time detection; the model size is only 44.7 MB, which satisfies the memory and real-time requirements of practical use.

Description

Target detection method based on dense connection deep network

Technical Field

The invention relates to a target detection method based on a dense connection deep network, and belongs to the technical field of target detection.

Background

There are many deep-learning-based target detection algorithms, such as Fast R-CNN (Fast Region-based Convolutional Network), SSD (Single Shot MultiBox Detector), R-FCN (Region-based Fully Convolutional Network), YOLO (You Only Look Once), and YOLO-Tiny (You Only Look Once-Tiny). However, these algorithms still have shortcomings: Fast R-CNN, R-FCN, and SSD suffer from low detection speed and complex environment configuration; YOLOv3 detects quickly, but its model occupies a large amount of memory; and YOLOv3-tiny has low detection precision.

Although the current YOLOv3-tiny detection network is fast, it has several problems, such as inaccurate localization, poor detection results, and frequent missed and false detections. A residual network structure has been fused into YOLOv3-tiny in the literature, but the resulting detection precision is only 60.92%.

The densely connected convolutional network (Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger. Densely Connected Convolutional Networks [C]. CVPR, 2017. DOI: 10.1109/CVPR.2017.243) is an independent and complete detection network. Its disadvantage is that, because of the way the output parameters of its convolution layers are set and because it contains fully connected layers, its computational cost rises sharply and it consumes a large amount of GPU memory. This limits the use of the network in practical production.

Disclosure of Invention

To solve at least one of the above problems, the invention provides a target detection method based on a dense connection deep network. By improving the network structure of the YOLOv3-tiny algorithm, it achieves high detection precision, high speed, and a small model memory footprint, and it can meet the real-time requirements of practical use.

In this method, dense connections are integrated into a convolutional neural network, and the outputs of the convolution layers are concatenated so that every extracted feature is fully used. This improves the feature utilization and information flow of the detection network, strengthens feature propagation, and improves the detection results.

The invention aims to provide a target detection method based on a dense connection deep network, which comprises the following steps:

step (1): reading image data in a Pascal VOC data set and extracting target data characteristics;

step (2): training a network model;

step (3): performing target detection.

Optionally, the method comprises the following steps:

step (1): reading in image data from the Pascal VOC data set and extracting target features: the network reads the input image data, first normalizes its resolution to 416 × 416, and then extracts and fuses the features of each channel through a series of convolution layers and Dense connection modules (Dense Blocks);

step (2): training the network model: setting the network batch size to 64 and training iteratively to obtain a detection model;

step (3): performing target detection: the network first extracts features from the input image through the feature extraction network to obtain a feature map of a certain size (assumed to be k × k), then divides the input image into k × k cells, each of which predicts a fixed number (3) of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the box region is to contain a target; non-maximum suppression (NMS) is then performed, and finally the detection result is output.

Optionally, the step (1) includes:

(1) dense connections are introduced, so that an L-layer network has L(L+1)/2 connections; the Dense connection module (Dense Block) consists mainly of 1 × 1 and 3 × 3 convolution layers, where the 1 × 1 convolution, also called the bottleneck layer, reduces the number of input feature maps, improves computational efficiency, and fuses the features of each channel; the 3 × 3 convolution extracts image features; the input of each layer in the Dense Block comes from the outputs of all preceding layers, which gives better results with fewer parameters; the following formula expresses that the input of layer l is the concatenation of the outputs of all preceding layers:

x_l = H_l([x_0, x_1, …, x_{l-1}])

where x_l is the output of layer l and [x_0, x_1, …, x_{l-1}] is the concatenation of the outputs of layers 0, …, l-1. H_l(·) is a composite function of three consecutive operations: Batch Normalization (BN), a rectified linear unit (ReLU), and a 3 × 3 convolution layer;

(2) reducing the quantity of feature graphs output by the convolution layers in a Dense connection module Dense Block;

(3) in order to realize network down-sampling operation, a network is divided into a plurality of Dense connection modules, the output of feature maps in each Dense Block is set to be the same, and the number of feature maps in different scales is different.
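The connection count and the channel bookkeeping implied by the concatenation formula can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and it assumes every layer in a block outputs the same number of feature maps (called growth here), which the text states only in general terms.

```python
from typing import List


def dense_connection_count(num_layers: int) -> int:
    """Direct connections in an L-layer densely connected network: L(L+1)/2.

    Layer l receives the outputs x_0 .. x_{l-1}, i.e. l inputs, so the total
    over layers 1..L is 1 + 2 + ... + L.
    """
    return num_layers * (num_layers + 1) // 2


def dense_block_input_channels(in_channels: int, growth: int, num_layers: int) -> List[int]:
    """Channels seen by each layer when all earlier outputs are concatenated.

    Assumption: every layer emits `growth` feature maps.
    """
    return [in_channels + i * growth for i in range(num_layers)]


def dense_block_output_channels(in_channels: int, growth: int, num_layers: int) -> int:
    """Channels of the concatenated block output [x_0, x_1, ..., x_L]."""
    return in_channels + num_layers * growth
```

For example, a 4-layer block has 10 direct connections, and with 16 input channels and 16 feature maps per layer the successive layers see 16, 32, and 48 input channels. The growing input width is why a 1 × 1 bottleneck convolution is useful before each 3 × 3 convolution.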

Optionally, the step (2) includes:

setting the learning rate of the network to 0.001, the momentum to 0.9, the weight decay regularization term to 0.0005, and the maximum number of iterations to 500200, and reducing the learning rate by a factor of 10 when the iteration count reaches 400000 and again at 450000; the network also uses multi-scale training: after the data is read, the width and height of the normalized image resolution take random values between 320 and 608, all multiples of 32, and a new value is drawn every 10 rounds.
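The step schedule and the multi-scale sampling described above can be sketched as below. This is an illustration under stated assumptions, not the training code of the invention: the function names are hypothetical, "reaches" is read as "greater than or equal to", and width and height are assumed to share one randomly drawn value.

```python
import random

BASE_LR = 0.001                # initial learning rate from the text
LR_STEPS = (400000, 450000)    # iterations at which the rate is divided by 10
MAX_ITERATIONS = 500200        # maximum number of iterations from the text


def learning_rate(iteration: int) -> float:
    """Step schedule: divide the base rate by 10 at each step already reached."""
    lr = BASE_LR
    for step in LR_STEPS:
        if iteration >= step:
            lr /= 10.0
    return lr


def candidate_sizes(low: int = 320, high: int = 608, stride: int = 32) -> list:
    """All input resolutions between low and high that are multiples of stride."""
    return list(range(low, high + 1, stride))


def pick_input_size(rng: random.Random) -> int:
    """Draw one resolution (assumed to apply to both width and height)."""
    return rng.choice(candidate_sizes())
```

With these settings there are ten candidate resolutions, from 320 to 608 in steps of 32, and the learning rate is 0.001 until iteration 400000, 0.0001 until iteration 450000, and 0.00001 afterwards.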

Optionally, the step (3) includes:

(1) yolov3-tiny uses K-means clustering algorithm to cluster the real frames in the data set, sets 3 prior frames with different sizes for two scales obtained by down sampling, and clusters 6 prior frames with different sizes in total;

the 6 prior box sizes for the two different scales are shown in table 1 below:

TABLE 1

(2) Predicting on the feature maps of the two different scales using the 6 different prior boxes (anchors); when predicting bounding boxes, the network uses logistic regression in order to model the data better and support multi-label classification; the coordinate prediction formulas for the bounding box are as follows:

b_x = σ(t_x) + c_x

b_y = σ(t_y) + c_y

b_w = p_w · e^{t_w}

b_h = p_h · e^{t_h}

where t_x, t_y, t_w, and t_h are the raw values predicted by the model, c_x and c_y are the coordinate offsets of the grid cell, p_w and p_h are the width and height of the prior box (anchor), and b_x, b_y, b_w, and b_h are the center coordinates, width, and height of the final bounding box; coordinate training uses a sum-of-squared-error loss.

(3) The threshold for non-maximum suppression (NMS) is set to 0.45.
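The decoding of one bounding box from the raw network outputs can be sketched as follows. The center terms follow the formulas above; the width and height terms use the YOLOv3 convention b_w = p_w · e^{t_w} and b_h = p_h · e^{t_h}. The function names are hypothetical, and all quantities are assumed to be expressed in grid-cell units.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, which keeps the center offset inside its grid cell."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Map raw predictions (t_*) to a box center and size.

    c_x, c_y: offsets of the grid cell; p_w, p_h: prior box (anchor) size.
    """
    b_x = sigmoid(t_x) + c_x
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)   # YOLOv3 convention for width
    b_h = p_h * math.exp(t_h)   # YOLOv3 convention for height
    return b_x, b_y, b_w, b_h
```

With all raw predictions equal to zero, the decoded box sits at the center of its cell and has exactly the size of the prior box, which is why well-chosen prior boxes make training easier.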

The second purpose of the invention is to apply the target detection method based on a dense connection deep network to image target detection.

Optionally, the target detection method based on a dense connection deep network according to the invention is applied to image target detection through the following specific steps: the network reads pedestrian image data from different scenes as training data, normalizes the image resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers, and Dense connection modules (Dense Blocks); the network is trained to obtain the corresponding detection model; the trained model, the network configuration file, and the image to be detected are then loaded, and the network first extracts features from the input image through the feature extraction network. Because the method uses multi-scale prediction, feature maps of 13 × 13 and 26 × 26 are obtained after feature extraction, and prediction is performed at these two scales. The network divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts a fixed number (3) of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the box region is to belong to the pedestrian category. Non-maximum suppression (NMS) is then performed, and finally the detection result is output.
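The number of candidate boxes produced by this two-scale prediction, and the grid cell responsible for a given pixel, can be worked out with a short sketch. The function names are hypothetical, and the cell lookup assumes a square input whose side is an exact multiple of the grid size, as with 416 and grids of 13 or 26.

```python
BOXES_PER_CELL = 3        # fixed number of boxes predicted by each cell
GRID_SIZES = (13, 26)     # feature-map sizes used for prediction


def boxes_per_scale(k: int, boxes_per_cell: int = BOXES_PER_CELL) -> int:
    """Candidate boxes on a k x k grid."""
    return k * k * boxes_per_cell


def total_boxes(grid_sizes=GRID_SIZES) -> int:
    """Candidate boxes over all prediction scales, before NMS."""
    return sum(boxes_per_scale(k) for k in grid_sizes)


def cell_of(x: float, y: float, image_size: int, k: int):
    """(column, row) of the grid cell that contains pixel (x, y)."""
    stride = image_size / k
    return int(x // stride), int(y // stride)
```

For a 416 × 416 input this gives 507 boxes on the 13 × 13 grid and 2028 on the 26 × 26 grid, 2535 in total, which non-maximum suppression then reduces to the final detections.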

The third purpose of the invention is to apply the target detection method based on a dense connection deep network to video target detection.

Optionally, the target detection method based on a dense connection deep network according to the invention is applied to video target detection through the following specific steps: the network reads video target image data from different scenes as training data, normalizes the image resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers, and Dense connection modules (Dense Blocks); the network is trained to obtain a detection model for the corresponding detection task; the trained model, the network configuration file, and the image to be detected are loaded, and the network first extracts features from the input image through the feature extraction network; because the invention uses multi-scale prediction, feature maps of 13 × 13 and 26 × 26 are obtained after feature extraction, and prediction is performed at these two scales; the network then divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts a fixed number (3) of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the region is to belong to the pedestrian category; non-maximum suppression (NMS) is then performed; because the network detects quickly, it can achieve real-time detection, so it can be applied to real-time video detection and output the detection results.

The invention has the beneficial effects that:

(1) the method of the invention makes full use of each extracted feature, thereby not only improving the feature utilization rate of the network, strengthening the feature propagation, but also enhancing the learning of the network on the detail information.

(2) The method can reach 65.93% in detection precision, which is higher than 49.19% of yolo-tiny.

(3) The method reaches a detection speed of 83 fps, and can be applied to various real-time target detection tasks in actual scenes.

(4) The size of the model adopted by the method is only 44.7MB, the requirement of the model on the memory of the computer is small, and the cost can be saved.

Drawings

Fig. 1 is a diagram of a dense connection network architecture.

Fig. 2 is an overall architecture diagram of the network.

Fig. 3 shows the pedestrian detection results of the original algorithm on the Pascal VOC data set.

Fig. 4 shows the pedestrian detection results of Example 2 on the Pascal VOC data set.

Fig. 5 shows the detection results of the original algorithm in the Pascal VOC detection task.

Fig. 6 shows the detection results of Example 3 in the Pascal VOC detection task.

Detailed Description

The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.

Example 1

Existing target detection methods achieve high detection precision, but they cannot meet the requirement of real-time detection in actual production, and the memory their models occupy makes them hard to port. To address these problems, the invention provides a target detection method based on a dense connection deep network, which is described in detail below with reference to the accompanying drawings:

as shown in fig. 1, it is a structural diagram of a dense connection network of a target detection method based on a dense connection deep network provided by the present invention; fig. 2 is a network overall architecture diagram of a target detection method based on a dense connection depth network according to the present invention. In this embodiment, a target detection method based on a dense connection deep network includes the following steps:

A.1, reading in image data from the Pascal VOC data set and extracting target features: the network reads the input image data, first normalizes its resolution to 416 × 416, and then extracts and fuses the features of each channel through a series of convolution layers and Dense connection modules (Dense Blocks).

The step A.1 comprises the following steps:

(1) Dense connections are introduced, so that an L-layer network has L(L+1)/2 connections. The Dense connection module (Dense Block) consists mainly of 1 × 1 and 3 × 3 convolution layers, where the 1 × 1 convolution is called the bottleneck layer; its purpose is to reduce the number of input feature maps, improve computational efficiency, and fuse the features of each channel. The 3 × 3 convolution extracts image features. The input of each layer in the Dense Block comes from the outputs of all preceding layers, which gives better results with fewer parameters. The following formula expresses that the input of layer l is the concatenation of the outputs of all preceding layers:

x_l = H_l([x_0, x_1, …, x_{l-1}])

where x_l is the output of layer l and [x_0, x_1, …, x_{l-1}] is the concatenation of the outputs of layers 0, …, l-1. H_l(·) is a composite function of three consecutive operations: Batch Normalization (BN), a rectified linear unit (ReLU), and a 3 × 3 convolution layer.

(2) The number of feature maps output by the convolution layers in the Dense Block is reduced;

B.1, training the network model: the network batch size is set to 64, and the network is trained iteratively to obtain a detection model.

The step B.1 comprises the following steps:

the learning rate of the network is set to 0.001, the momentum is set to 0.9, the weight attenuation regular term is 0.0005, the maximum iteration number of the network is 500200, and the learning rate of the network is attenuated by a factor of 10 when the iteration number reaches 400000 and 450000. Meanwhile, the network uses multi-scale training, the width and the height of the network input size are random values between 320 and 608, and the network input size is changed once every 10 rounds.

C.1, target detection:

The network first extracts features from the input image through the feature extraction network to obtain a feature map of a certain size (assumed to be k × k), then divides the input image into k × k cells, each of which predicts a fixed number (3) of bounding boxes. During prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the box region is to contain a target. Non-maximum suppression (NMS) is then performed, and finally the detection result is output.

The step C.1 specifically comprises the following steps:

(1) yolov3-tiny clusters the real frames in the dataset using the K-means clustering algorithm, sets 3 prior frames of different sizes for each downsampling scale, and clusters 6 prior frames of different sizes in total.

The 6 prior box sizes for the two different scales are shown in table 2 below:

TABLE 2

(2) Prediction is performed on the feature maps of the two different scales using the 6 different prior boxes (anchors). When predicting bounding boxes, the network uses logistic regression in order to model the data better and support multi-label classification. The coordinate prediction formulas for the bounding box are as follows:

b_x = σ(t_x) + c_x

b_y = σ(t_y) + c_y

b_w = p_w · e^{t_w}

b_h = p_h · e^{t_h}

where t_x, t_y, t_w, and t_h are the raw values predicted by the model, c_x and c_y are the coordinate offsets of the grid cell, p_w and p_h are the width and height of the prior box (anchor), and b_x, b_y, b_w, and b_h are the center coordinates, width, and height of the final bounding box. Coordinate training uses a sum-of-squared-error loss.

(3) The threshold for non-maximum suppression (NMS) is set to 0.45.
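The non-maximum suppression step can be sketched as a greedy procedure. This is a minimal illustration, assuming the 0.45 threshold is an intersection-over-union (IoU) threshold and that boxes are given as corner coordinates (x1, y1, x2, y2); the function names are hypothetical and per-class handling is omitted.

```python
def iou(a, b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold: float = 0.45):
    """Greedy NMS: visit boxes by descending score and keep a box only if its
    IoU with every box already kept does not exceed the threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```

As a usage example, two heavily overlapping boxes with scores 0.9 and 0.8 collapse to the higher-scoring one, while a distant third box is kept.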

Example 2

This example is the process and results for pedestrian detection on the Pascal VOC data set. The method comprises the following specific steps:

A.1, reading pedestrian image data from different scenes in the Pascal VOC data set as training data and extracting pedestrian features: the network reads the input image, first normalizes its resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers, and Dense connection modules (Dense Blocks).

The step A.1 comprises the following steps:

(1) Dense connections are introduced, so that an L-layer network has L(L+1)/2 connections. The Dense connection module (Dense Block) consists mainly of 1 × 1 and 3 × 3 convolution layers, where the 1 × 1 convolution is called the bottleneck layer; its purpose is to reduce the number of input feature maps, improve computational efficiency, and fuse the features of each channel. The 3 × 3 convolution extracts image features. The input of each layer in the Dense Block comes from the outputs of all preceding layers, which gives better results with fewer parameters.

x_l = H_l([x_0, x_1, …, x_{l-1}])

where x_l is the output of layer l and [x_0, x_1, …, x_{l-1}] is the concatenation of the outputs of layers 0, …, l-1. H_l(·) consists of Batch Normalization (BN), a rectified linear unit (ReLU), and a 3 × 3 convolution layer.

The image data passes through the first Dense connection module (Dense Block 1) to give a 208 × 48 feature map, then through a 1 × 1 convolution layer, which reduces the number of input channels and the computational complexity of the network, and then through a 2 × 2 pooling layer, which downsamples the feature map to obtain higher-level semantic information. The resulting output is used as the input of Dense Block 2.

(2) The number of feature maps output by the convolution layers in each Dense Block is reduced. Dense Block 1 sets the number of feature maps to 16, and Dense Blocks 2, 3, 4, and 5 set it to 32, 64, 128, and 256 respectively. The number of output feature maps increases from block to block so that the network can learn richer high-level semantic information from the pedestrian image data and localize more accurately.

(3) In order to realize network downsampling operation, a network is divided into a plurality of Dense blocks, the output of feature maps in each Dense Block is set to be the same, and the number of feature maps in different scales is different.
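The way spatial resolution and feature-map count evolve across the five Dense Blocks can be traced with a small sketch. It assumes, beyond what the text states explicitly, that the initial convolution and pooling halve the 416 × 416 input to 208 × 208 and that a stride-2 pooling follows each of the first four blocks (the text describes this transition only after Dense Block 1). The names are hypothetical.

```python
INPUT_SIZE = 416
BLOCK_FEATURE_MAPS = (16, 32, 64, 128, 256)   # per-block settings from the text


def block_resolutions(input_size: int = INPUT_SIZE, num_blocks: int = 5) -> list:
    """Spatial size of the feature map inside each Dense Block.

    Assumption: one halving before the first block and one after each block
    except the last.
    """
    size = input_size // 2          # initial convolution + 2 x 2 pooling
    sizes = []
    for block in range(num_blocks):
        sizes.append(size)
        if block < num_blocks - 1:
            size //= 2              # transition: 1 x 1 convolution + 2 x 2 pooling
    return sizes


def block_summary():
    """(block index, spatial size, feature maps per layer) for each block."""
    return [(i + 1, s, m) for i, (s, m) in
            enumerate(zip(block_resolutions(), BLOCK_FEATURE_MAPS))]
```

Under these assumptions the blocks operate at 208, 104, 52, 26, and 13 pixels per side, which matches the 13 × 13 feature map used for prediction; the feature-map count doubles each time the resolution halves.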

B.1, training the network model: the network batch size is set to 64, and the network is trained iteratively to obtain a detection model.

The step B.1 comprises the following steps:

the learning rate of the network is set to 0.001, the momentum is set to 0.9, the weight attenuation regular term is 0.0005, the maximum iteration number of the network is 500200, and the learning rate of the network is attenuated by a factor of 10 when the iteration number reaches 400000 and 450000. Meanwhile, the network uses multi-scale training, after the network reads in pedestrian image data, the width and the height of the image normalization resolution ratio take random values between 320 and 608, and the random values are changed once every 10 rounds, and are all multiples of 32.

C.1, target detection:

When detecting an image, the model, the network configuration file, and the image data to be detected are loaded first. The network extracts features from the input image through the feature extraction network; because multi-scale prediction is used, feature maps of 13 × 13 and 26 × 26 are obtained, and prediction is performed at these two scales. The network then divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts a fixed number (3) of bounding boxes. During prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the box region is to belong to the pedestrian category. Non-maximum suppression (NMS) is then performed, and finally the detection result is output.

The step C.1 specifically comprises the following steps:

(1) yolov3-tiny clusters the real frames in the dataset using the K-means clustering algorithm, sets 3 prior frames of different sizes for each downsampling scale, and clusters 6 prior frames of different sizes in total. The 6 prior box sizes for the two different scales are shown in table 3 below:

TABLE 3

b_x＝σ(t_x)+c_x

b_y＝σ(t_y)+c_y

(3) The threshold for non-maximum suppression (NMS) is set to 0.45; NMS filters out the overlapping boxes that occur during prediction.
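The clustering of ground-truth boxes into prior boxes can be sketched as follows. The text states only that K-means is used; this sketch assumes the similarity measure is the IoU between width-height pairs (the convention introduced with YOLOv2) and uses a deterministic initialization for reproducibility. The function names and the sample boxes are hypothetical.

```python
def wh_iou(box, centroid) -> float:
    """IoU of two (width, height) boxes aligned at a common corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k: int, iterations: int = 100):
    """Cluster (width, height) pairs into k prior boxes, sorted by area.

    Assumption: each box joins the centroid with the highest IoU, and
    centroids are updated as the mean width and height of their members.
    """
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    # Deterministic start: k boxes spread evenly over the area-sorted list.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append((sum(w for w, _ in members) / len(members),
                                sum(h for _, h in members) / len(members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])
```

In the method described here k would be 6, with the three smaller prior boxes assigned to one scale and the three larger ones to the other; which scale receives which group is not stated in this text.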

FIG. 3 shows the pedestrian detection result of the original algorithm on the Pascal VOC data set; its detection accuracy for the pedestrian category is 65.1%. The original algorithm comes from the literature (Redmon J, Farhadi A. YOLOv3: An Incremental Improvement [J]. arXiv preprint arXiv:1804.02767, 2018.).

Fig. 4 shows the pedestrian detection result of Example 2 on the Pascal VOC data set; the detection accuracy for the pedestrian category is 79.8%, an improvement of 14.7 percentage points over the original algorithm.

Example 3

This example is the procedure and results for the detection of horse classes on the Pascal VOC data set. The method comprises the following specific steps:

A.1, reading image data of horses in different scenes in the Pascal VOC data set as training data and extracting horse-category features: the network reads the input image, first normalizes its resolution to 416 × 416, and then extracts and fuses the features of each channel through convolution layers, pooling layers, and Dense connection modules (Dense Blocks).

The step A.1 comprises the following steps:

(1) Dense connections are introduced, so that an L-layer network has L(L+1)/2 connections. The Dense Block consists mainly of 1 × 1 and 3 × 3 convolution layers, where the 1 × 1 convolution is called the bottleneck layer; its purpose is to reduce the number of input feature maps, improve computational efficiency, and fuse the features of each channel. The 3 × 3 convolution extracts image features. The input of each layer in the Dense Block comes from the outputs of all preceding layers, which gives better results with fewer parameters.

x_l = H_l([x_0, x_1, …, x_{l-1}])

The image data passes through the first Dense connection module (Dense Block 1) to give a 208 × 48 feature map, then through a 1 × 1 convolution layer, which reduces the number of input channels and the computational complexity of the network, and then through a 2 × 2 pooling layer, which downsamples the feature map to obtain higher-level semantic information. The resulting output is used as the input of Dense Block 2.

(2) The number of feature maps output by the convolution layers in each Dense Block is reduced. Dense Block 1 sets the number of feature maps to 16, and Dense Blocks 2, 3, 4, and 5 set it to 32, 64, 128, and 256 respectively. The number of output feature maps increases from block to block so that the network can learn richer high-level semantic information from the horse image data and localize more accurately.

The step B.1 comprises the following steps:

the learning rate of the network is set to 0.001, the momentum is set to 0.9, the weight attenuation regular term is 0.0005, the maximum iteration number of the network is 500200, and the learning rate of the network is attenuated by a factor of 10 when the iteration number reaches 400000 and 450000. Meanwhile, the network uses multi-scale training, after the network reads in image data of a horse, the width and the height of the normalized resolution of the image take random values between 320 and 608, and the random values are changed once every 10 rounds, and are all multiples of 32.

C.1, target detection:

When detecting an image, the model, the network configuration file, and the image data to be detected are loaded first. The network extracts features from the input image through the feature extraction network; because multi-scale prediction is used, feature maps of 13 × 13 and 26 × 26 are obtained, and prediction is performed at these two scales. The network then divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts a fixed number (3) of bounding boxes. During prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the box region is to belong to the horse category. Non-maximum suppression (NMS) is then performed, and finally the detection result is output.

The step C.1 specifically comprises the following steps:

(1) yolov3-tiny clusters the real frames in the dataset using the K-means clustering algorithm, sets 3 prior frames of different sizes for each downsampling scale, and clusters 6 prior frames of different sizes in total. The 6 prior box sizes for the two different scales are shown in table 4 below:

TABLE 4

b_x＝σ(t_x)+c_x

b_y＝σ(t_y)+c_y

FIG. 5 shows the detection result of the original algorithm in the Pascal VOC detection task. As the image shows, the original network does not detect the categories in the image well, and missed detections occur. Its detection accuracy for the horse category is 63.2%. The original algorithm comes from the literature (Redmon J, Farhadi A. YOLOv3: An Incremental Improvement [J]. arXiv preprint arXiv:1804.02767, 2018.).

Fig. 6 shows the detection result of Example 3 in the Pascal VOC detection task; it detects and localizes the categories in the image well. The detection accuracy for the horse category is 79.4%, an improvement of 16.2 percentage points over the original algorithm.

The invention integrates the dense connection mode into the yolo-tiny network, increases the convolution layer of the yolo-tiny network and improves the characteristic extraction network. The improved network firstly normalizes an input image into a fixed size, then extracts and fuses the characteristics of each channel by using a Dense Block module, and then predicts by using different prior frames on different scales to finish the classification and positioning of a target. Compared with the original algorithm, the improved algorithm has the advantages that the precision is improved by 15%, and the requirement of real-time detection can be met; the size of the model is only 44.7MB, and the requirements of memory occupation and real-time performance in actual use can be met.

Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A target detection method based on a dense connection deep network is characterized by comprising the following steps:

step (2): training a network model;

step (3): carrying out target detection;

the method comprises the following specific steps:

step (1): reading in image data in the Pascal VOC data set and extracting target data characteristics: the method comprises the steps that input image data are read through a network, the resolution of the input image data is normalized to 416 x 416, a feature mapping graph with the output size of 208 x 208 is obtained after the input image data pass through a convolution layer and a pooling layer, feature extraction is carried out on an image to be detected through 5 dense connection modules, a feature graph with the size of 13x13 is obtained, the extracted feature graph with the size of 13x13 is subjected to up-sampling, and a feature mapping graph with the size of 26x26 is obtained; wherein the convolution kernel size in the convolution layer is 3x 3, and the step length is 1; the size of the pooled nuclei in the pooling layer was 2 x2, step size was 2;

step (3): carrying out target detection: the network first extracts features from the input image through the feature extraction network to obtain a feature map of size k × k, then divides the input image into k × k cells, each of which predicts a fixed number of bounding boxes; during prediction, logistic regression is used to predict the objectness score of each bounding box, that is, how likely the box region is to contain a target; non-maximum suppression (NMS) is then performed, and finally the detection result is output;

the step (1) further comprises:

the intensive connection mode is introduced, so that an L (L +1)/2 connections exist in an L-layer network; the Dense connection module Dense Block is mainly composed of convolution layers of 1 × 1 and 3 × 3, wherein the convolution operation of 1 × 1 is called a bottompiece layer; the 3-by-3 convolution is used for extracting image features; the input of each layer in the Dense connection module Dense Block comes from the output of all the previous layers; the formula indicates that the input of the l layer is the sum of the outputs of all the previous layers;

x_l = H_l([x_0, x_1, …, x_{l-1}])

wherein x_l denotes the output of the l-th layer, and [x_0, x_1, …, x_{l-1}] denotes the concatenation of the outputs of layers 0, …, l-1; H_l(·) in the above formula denotes a composite function of three successive operations, consisting of BN, ReLU and a 3 × 3 convolution layer;
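The effect of concatenation on layer widths, and the L(L+1)/2 connection count, can be traced with simple bookkeeping. The sketch below is illustrative only: the function names are hypothetical, and the example values (16 input channels, 16 new feature maps per layer, 4 layers) are assumptions chosen for the illustration, not figures taken from the claims.

```python
def dense_block_channels(in_channels, growth, num_layers):
    """Return the input width seen by each layer and the block's output width.

    Each layer H_l receives the concatenation [x_0, ..., x_{l-1}] and
    contributes `growth` new feature maps.
    """
    inputs = []
    channels = in_channels
    for _ in range(num_layers):
        inputs.append(channels)   # width of the concatenated input to this layer
        channels += growth        # this layer's output is appended to the stack
    return inputs, channels

def num_connections(num_layers):
    """Direct connections in a densely connected L-layer network: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2

# Hypothetical example values
inputs, out_channels = dense_block_channels(in_channels=16, growth=16, num_layers=4)
connections = num_connections(4)
```

With these example values the four layers see 16, 32, 48 and 64 input channels, and a 4-layer block has 10 direct connections.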

secondly, reducing the number of feature maps output by the convolution layers in the Dense Blocks: the number of feature maps is 16 for Dense Block1, and 32, 64, 128 and 256 for Dense Block2, Dense Block3, Dense Block4 and Dense Block5 respectively; the number of output feature maps is increased from block to block so that the network learns richer high-level semantic information from the image data and localizes targets more accurately;

thirdly, dividing the network into a plurality of Dense Blocks, wherein different Dense Blocks are given different numbers of feature maps, the number of output feature maps doubling from one Dense Block to the next (16, 32, 64, 128 and 256 respectively), and the feature maps obtained by convolution within each Dense Block having the same output size;

the step (2) comprises the following steps:

setting the learning rate of the network to 0.001, the momentum to 0.9, the weight-decay regularization term to 0.0005 and the maximum number of iterations to 500200, and dividing the learning rate by 10 when the iteration count reaches 400000 and again when it reaches 450000; the network also uses multi-scale training: after the network reads the data, the width and height of the normalized image resolution take random values between 320 and 608, all multiples of 32, and the values are re-drawn once every 10 rounds;
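The training schedule in step (2) can be sketched as follows. The code is an illustration under assumptions: it treats one "round" as one iteration, draws a single value for both width and height, and starts from 416 before the first draw; none of these details is fixed by the text. All function names are hypothetical.

```python
import random

BASE_LR = 0.001
LR_STEPS = (400000, 450000)   # iterations at which the rate is divided by 10
MAX_ITERS = 500200

def learning_rate(iteration):
    """Step schedule: divide the base rate by 10 at each step already reached."""
    lr = BASE_LR
    for step in LR_STEPS:
        if iteration >= step:
            lr /= 10.0
    return lr

def pick_resolution(rng):
    """Draw a training resolution that is a multiple of 32 in [320, 608]."""
    return 32 * rng.randint(10, 19)   # 32*10 = 320, 32*19 = 608

def resolution_schedule(num_iters, rng, every=10):
    """Resolution used at each iteration, re-drawn once every `every` iterations."""
    size = 416
    sizes = []
    for it in range(num_iters):
        if it % every == 0:
            size = pick_resolution(rng)
        sizes.append(size)
    return sizes

sizes = resolution_schedule(30, random.Random(0))
```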

the step (3) comprises the following steps:

yolov3-tiny clusters the ground-truth boxes in the data set with the K-means clustering algorithm, obtaining 6 prior boxes of different sizes in total, 3 for each of the two feature-map scales 13 × 13 and 26 × 26 obtained in step (1);
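A minimal sketch of clustering box sizes into 6 prior boxes is given below. The text names only K-means; the use of IoU as the similarity measure, the seeded random initialization, the sample box sizes, and the assignment of the three smaller priors to the 26 × 26 scale are all assumptions borrowed from common YOLO practice, not statements from the claims.

```python
import random

def iou_wh(box, centroid):
    """IoU of two (width, height) boxes aligned at a common corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k=6, iters=100, seed=0):
    """Cluster (width, height) pairs into k priors, sorted by area."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the centroid it overlaps most (assumed metric)
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                w = sum(b[0] for b in members) / len(members)
                h = sum(b[1] for b in members) / len(members)
                new_centroids.append((w, h))
            else:
                new_centroids.append(centroids[i])   # keep an empty cluster's centroid
        if new_centroids == centroids:
            break                                    # assignments have stabilized
        centroids = new_centroids
    return sorted(centroids, key=lambda c: c[0] * c[1])

# Hypothetical ground-truth box sizes in pixels
boxes = [(10, 14), (12, 16), (23, 27), (25, 30), (37, 58), (40, 60),
         (81, 82), (85, 80), (135, 169), (140, 170), (344, 319), (340, 320)]
anchors = kmeans_anchors(boxes)
small, large = anchors[:3], anchors[3:]   # assumed split: small -> 26x26, large -> 13x13
```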

the 6 prior box sizes for the two different scales are as follows:

prediction is performed on the feature maps of the two scales 13 × 13 and 26 × 26 using the 6 different prior boxes (anchors); when predicting bounding boxes, the network uses logistic regression in order to model the data better and to support multi-label classification; the coordinate prediction formulas for the network's bounding boxes are as follows:

b_x = σ(t_x) + c_x

b_y = σ(t_y) + c_y

wherein t_x, t_y, t_w and t_h are the raw values predicted by the model, c_x and c_y denote the coordinate offsets of the grid cell, p_w and p_h are the width and height of the prior (anchor) box, and b_x, b_y, b_w and b_h are the center coordinates, width and height of the finally obtained bounding box; the coordinates are trained with a sum-of-squared-errors loss;
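The decoding of a predicted box can be sketched as below. The two center equations follow the formulas given in the text. The width and height equations do not appear in the text; the exponential form used here is the standard YOLOv3 formulation and is an assumption. The function names are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function sigma(x)."""
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t, cell, prior):
    """Turn raw predictions into a box.

    t     -- (t_x, t_y, t_w, t_h), raw network outputs
    cell  -- (c_x, c_y), offset of the grid cell, in grid units
    prior -- (p_w, p_h), width and height of the prior box
    Returns (b_x, b_y, b_w, b_h): center in grid units, size in the prior's units.
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x     # as given in the text
    b_y = sigmoid(t_y) + c_y     # as given in the text
    b_w = p_w * math.exp(t_w)    # assumed: standard YOLOv3 form, not in the text
    b_h = p_h * math.exp(t_h)    # assumed: standard YOLOv3 form, not in the text
    return b_x, b_y, b_w, b_h

# Hypothetical values: zero raw outputs, cell (6, 6), prior of 81 x 82
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(81, 82))
```

With zero raw outputs the center falls in the middle of the cell and the box takes the size of the prior, which is why the sigmoid keeps each predicted center inside its own cell.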

and setting the threshold for non-maximum suppression (NMS) to 0.45.
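Non-maximum suppression with the stated threshold of 0.45 can be illustrated with a greedy procedure. This is a minimal sketch: the greedy, score-ordered variant and the corner-format boxes are assumptions, and the detections in the example are invented for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, threshold=0.45):
    """Greedy NMS over (score, box) pairs; returns kept pairs, best score first."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap any kept box above the threshold
        if all(iou(box, kept_box) <= threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

# Hypothetical detections: the first two overlap heavily, the third is separate
detections = [
    (0.9, (10, 10, 110, 110)),
    (0.8, (20, 20, 120, 120)),
    (0.7, (200, 200, 300, 300)),
]
kept = nms(detections)
```

In this example the second box overlaps the first with an IoU of about 0.68, which exceeds 0.45, so only the first and third detections are kept.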

2. Use of the target detection method based on a dense connection deep network as claimed in claim 1 in image target detection.

3. The use according to claim 2, wherein the specific application steps are as follows: the network reads pedestrian image data from different scenes as training data; the image resolution is first normalized to 416 × 416, and a convolution layer followed by a pooling layer then produces a feature map of size 208 × 208; 5 dense connection modules extract features from the image to be detected, giving a feature map of size 13 × 13, which is up-sampled to obtain a feature map of size 26 × 26; wherein the convolution kernel size in the convolution layer is 3 × 3 with a stride of 1, and the pooling kernel size in the pooling layer is 2 × 2 with a stride of 2; a corresponding detection model is obtained by training the network; the trained model, the network configuration file and the image to be detected are loaded, and the network first extracts features from the input image to be detected through the feature extraction network; feature maps of 13 × 13 and 26 × 26 are obtained after feature extraction, and prediction is performed at these two scales; the network then divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts a fixed number of 3 bounding boxes; during prediction, logistic regression is used to predict the target score of each bounding box, that is, how likely the region is to belong to the pedestrian category; non-maximum suppression (NMS) is then performed, and the detection result is finally output.

4. Use of the target detection method based on a dense connection deep network as claimed in claim 1 in video target detection.

5. The use according to claim 4, wherein the specific application steps are as follows: the network reads video target image data from different scenes as training data; the image resolution is first normalized to 416 × 416; 5 dense connection modules extract features from the image to be detected, giving a feature map of size 13 × 13, which is up-sampled to obtain a feature map of size 26 × 26; wherein the convolution kernel size in the convolution layer is 3 × 3 with a stride of 1, and the pooling kernel size in the pooling layer is 2 × 2 with a stride of 2; a detection model for the corresponding detection task is obtained by training the network; the trained model, the network configuration file and the image to be detected are loaded, and the network first extracts features from the input image to be detected through the feature extraction network; feature maps of 13 × 13 and 26 × 26 are obtained after feature extraction, and prediction is performed at these two scales; the network then divides the image to be detected into 13 × 13 and 26 × 26 cells respectively, and each cell predicts a fixed number of 3 bounding boxes; during prediction, logistic regression is used to predict the target score of each bounding box, that is, how likely the region is to belong to the target category; non-maximum suppression (NMS) is then performed, and the detection result is finally output.
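The grid sizes and the 3 boxes per cell recited in claims 3 and 5 fix the number of candidate boxes that reach the NMS stage. The short sketch below only performs that count; the function name `candidate_boxes` is hypothetical.

```python
BOXES_PER_CELL = 3
GRID_SIZES = (13, 26)

def candidate_boxes(grid_sizes=GRID_SIZES, per_cell=BOXES_PER_CELL):
    """Number of predicted boxes per scale: k * k cells times boxes per cell."""
    return {k: k * k * per_cell for k in grid_sizes}

counts = candidate_boxes()
total = sum(counts.values())
print(counts, total)
```

The 13 × 13 scale contributes 507 candidates and the 26 × 26 scale contributes 2028, for 2535 candidate boxes per image before suppression.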