CN109583456B - Infrared surface target detection method based on feature fusion and dense connection

Infrared surface target detection method based on feature fusion and dense connection

Info

Publication number
CN109583456B
CN109583456B (application CN201811386234.9A)
Authority
CN
China
Prior art keywords
feature
image
target
network
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811386234.9A
Other languages
Chinese (zh)
Other versions
CN109583456A (en)
Inventor
周慧鑫
施元斌
赵东
郭立新
张嘉嘉
秦翰林
王炳健
赖睿
李欢
宋江鲁奇
姚博
于跃
贾秀萍
周峻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201811386234.9A priority Critical patent/CN109583456B/en
Publication of CN109583456A publication Critical patent/CN109583456A/en
Application granted granted Critical
Publication of CN109583456B publication Critical patent/CN109583456B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an infrared target detection method based on feature fusion and dense connection. The method comprises: constructing an infrared image data set containing the targets to be identified, and calibrating the position and type of each target to obtain original known label images; dividing the data set into a training set and a verification set; preprocessing the images in the training set, performing feature extraction and feature fusion, and obtaining classification results and bounding boxes through a regression network; calculating a loss function from the classification results, the bounding boxes and the original known label images, and updating the parameter values of the convolutional neural network; iterating these updates until the error is small enough or the number of iterations reaches a set upper limit; and processing the images in the verification set with the trained network parameters to obtain the detection accuracy, the required time and the final target detection result diagram.

Description

Infrared surface target detection method based on feature fusion and dense connection
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an infrared surface target detection method based on feature fusion and dense connection.
Background
At present, mainstream target detection methods can be roughly divided into two types: methods based on background modeling and methods based on foreground modeling. Background-modeling methods construct a model of the background and judge regions of the image that differ greatly from it to be targets; because real backgrounds are complex, their detection performance is often unsatisfactory. Foreground-modeling methods extract feature information of the target and judge regions that match this information to be targets; the most representative of these are deep-learning-based detection methods. A deep-learning detector automatically extracts target features with a deep convolutional neural network and predicts the class and position of each target. The extracted features are then compared with the calibration information in the training set, a loss function is calculated, and gradient descent is used to improve the features extracted by the network so that they better match the actual targets; the parameters of the subsequent detection stage are updated at the same time, making the detection results more accurate. Training is repeated until the expected detection performance is reached.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a target detection method based on feature fusion and dense blocks.
The technical scheme adopted by the invention is as follows:
the embodiment of the invention provides an infrared target detection method based on feature fusion and dense connection, which is realized by the following steps:
step 1, constructing an infrared image data set containing a required identification target, and calibrating the position and the type of the required identification target in the infrared image data set to obtain an original known label image;
step 2, dividing the infrared image data set into a training set and a verification set;
step 3, performing image-enhancement preprocessing on the images in the training set;
step 4, carrying out feature extraction and feature fusion on the preprocessed image, and obtaining a classification result and a bounding box through a regression network; calculating a loss function from the classification result, the bounding box and the original known label image, back-propagating the prediction error through the convolutional neural network using the stochastic gradient descent method with momentum, and updating the parameter values of the convolutional neural network;
step 5, repeating the steps 3 and 4 to update the convolutional neural network parameters in an iterative manner until the error is small enough or the iteration times reach a set upper limit;
and 6, processing the image in the verification set through the trained convolutional neural network parameters to obtain the accuracy and the required time of target detection and a final target detection result diagram.
In the above scheme, in the step 4, feature extraction and feature fusion are performed on the preprocessed image, and a classification result and a bounding box are obtained through a regression network, specifically by the following steps:
step 401, randomly extracting a fixed number of images from the training set, and dividing each image into 10×10 regions;
step 402, inputting the image divided in the step 401 into a dense connection network for feature extraction;
step 403, performing feature fusion on the extracted feature map to obtain a fused feature map;
step 404, generating a fixed number of suggestion boxes for each region in the fused feature map;
and step 405, sending the fused feature map and the suggestion frame into a regression network to carry out classification and bounding box regression, and removing redundancy by using a non-maximum suppression method to obtain a classification result and a bounding box.
In the above scheme, the calculation method of the dense connection network in step 402 includes the following formula:
d_l = H_l([d_0, d_1, ..., d_{l-1}])
where d_l denotes the output of the l-th convolution layer in the densely connected network (if the network contains B convolution layers, l takes values from 0 to B), H_l is the combined operation of regularization, convolution and the linear rectification (ReLU) activation function, d_0 is the input image, and d_{l-1} is the output of layer l-1.
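As a concrete illustration, the formula above can be sketched as a small PyTorch module. This is a minimal sketch only: it assumes H_l is batch normalization followed by a 3×3 convolution and ReLU, and the channel parameters (`growth`, `num_layers`) are free choices rather than the exact configuration used in the patent.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Minimal dense block: layer l receives the concatenation [d_0, d_1, ..., d_{l-1}]."""

    def __init__(self, in_channels, growth, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            # H_l: regularization (batch norm) + convolution + linear rectification (ReLU)
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.Conv2d(channels, growth, kernel_size=3, padding=1, bias=False),
                nn.ReLU(inplace=True),
            ))
            channels += growth  # the next layer sees all previous outputs as extra channels

    def forward(self, d0):
        outputs = [d0]
        for layer in self.layers:
            outputs.append(layer(torch.cat(outputs, dim=1)))  # d_l = H_l([d_0, ..., d_{l-1}])
        return outputs[-1]
```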
In the above scheme, in the step 403, feature fusion is performed on the extracted feature images, which is to directly fuse the extracted feature images with different scales through a pooling method.
In the above scheme, in the step 403, feature fusion is performed on the extracted feature map, which is specifically implemented by the following steps:
step 4031, a first set of feature maps F_1 is converted into a new, smaller set of feature maps by a pooling operation and then fused with a second set of feature maps F_2 to obtain a new set of feature maps F_2';
step 4032, the new feature maps F_2' are pooled and then fused with a third set of feature maps F_3 to obtain a new set of feature maps F_3';
step 4033, the new feature maps F_2' and F_3' replace the second set F_2 and the third set F_3 and enter the regression network.
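A minimal sketch of steps 4031 to 4033 follows. It assumes that consecutive feature-map groups differ by a factor of 2 in spatial resolution, that "pooling" means 2×2 max pooling, and that fusion is concatenation along the channel axis; none of these details is fixed by the text.

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(f1, f2, f3):
    """Pooling-based fusion sketch: shallower maps are shrunk by pooling and
    concatenated (fused) with the next deeper map along the channel axis."""
    f2_new = torch.cat([F.max_pool2d(f1, kernel_size=2, stride=2), f2], dim=1)      # F_2'
    f3_new = torch.cat([F.max_pool2d(f2_new, kernel_size=2, stride=2), f3], dim=1)  # F_3'
    return f2_new, f3_new  # these replace F_2 and F_3 before the regression network
```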
In the above scheme, in step 405, the fused feature map and the suggestion box are sent to a regression network to perform classification and bounding box regression, and redundancy is removed by using a non-maximum suppression method, so as to obtain a classification result and a bounding box, which is specifically implemented through the following steps:
step 4051, dividing the feature map into 10×10 areas, and inputting the areas into a regression detection network;
step 4052, for each region, the regression detection network outputs the positions and types of 7 possible targets, wherein the total number of target types is A, which is determined by the setting of the training set, so that A class likelihoods are output for each candidate; the position parameters comprise the center coordinates, width and height of the target bounding box;
step 4053, the non-maximum suppression method calculates the intersection-over-union of bounding boxes of the same class using the following formula:
S = |M ∩ N| / |M ∪ N|
where S is the calculated intersection-over-union, M and N denote two bounding boxes of the same target class, M ∩ N denotes the intersection of M and N, and M ∪ N denotes their union. For any two bounding boxes with S greater than 0.75, the bounding box with the smaller classification score is eliminated.
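The per-class suppression rule above can be sketched as follows. The sketch assumes boxes are given in corner format (x1, y1, x2, y2), whereas the network predicts center, width and height, so a conversion would be needed; the 0.75 threshold is the one quoted above, and the caller is assumed to group boxes by class before calling `nms`.

```python
def iou(box_m, box_n):
    """Intersection-over-union S = |M ∩ N| / |M ∪ N| for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_m[0], box_n[0]), max(box_m[1], box_n[1])
    ix2, iy2 = min(box_m[2], box_n[2]), min(box_m[3], box_n[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_m = (box_m[2] - box_m[0]) * (box_m[3] - box_m[1])
    area_n = (box_n[2] - box_n[0]) * (box_n[3] - box_n[1])
    return inter / (area_m + area_n - inter + 1e-9)

def nms(boxes, scores, thresh=0.75):
    """Whenever two same-class boxes overlap with S > thresh, keep the higher-scoring one."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [k for k in order if iou(boxes[best], boxes[k]) <= thresh]
    return keep
```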
In the above scheme, in the step 4, a loss function is calculated from the classification result, the bounding box and the original known label image, the prediction error is back-propagated through the convolutional neural network using the stochastic gradient descent method with momentum, and the parameter values of the convolutional neural network are updated, specifically through the following steps:
step 401, calculating a loss function according to the classification result, the position and the type of the target in the bounding box, and the position and the type of the target to be identified calibrated in the training set, wherein the calculation formula of the loss function is as follows:
Loss = [loss-function formula, reproduced as an image in the original publication]
where 100 is the number of regions, 7 is the number of suggestion boxes and finally generated bounding boxes to be predicted for each region, i is the region index, j is the suggestion-box/bounding-box index, Loss is the error value, obj denotes that a target is present and noobj that no target is present; x and y are the predicted horizontal and vertical coordinates of the center of a suggestion box or bounding box, w and h are its predicted width and height, and C is the prediction of whether a suggestion box or bounding box contains a target, consisting of A values that give the likelihood of each of the A target classes; the corresponding hatted symbols in the formula are the labeled values, and the two indicator terms denote that the j-th suggestion box or bounding box of region i does or does not contain a target, respectively;
step 402, updating the weights according to the loss-function result using the stochastic gradient descent method with momentum.
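The momentum-SGD update loop can be sketched as below. Because the concrete loss is given only as an image in the original publication, a dummy network and a mean-squared-error loss stand in for the patent's detector and detection loss; the learning rate, momentum and weight decay values are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a tiny network and a dummy regression loss, used only to
# make the momentum-SGD update concrete; they are not the patent's network or loss.
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 5, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)

for _ in range(10):                              # a few illustrative iterations
    images = torch.randn(4, 1, 80, 80)           # toy batch standing in for infrared images
    targets = torch.randn(4, 5, 80, 80)          # toy targets standing in for label maps
    loss = criterion(model(images), targets)     # stands in for the detection loss
    optimizer.zero_grad()
    loss.backward()                              # back-propagate the prediction error
    optimizer.step()                             # momentum SGD update of the parameters
```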
In the above scheme, the preprocessing in step 3 is to expand the training set by random rotation, mirroring, flipping, scaling, translation, scale transformation, contrast transformation, noise disturbance and color change.
Compared with the prior art, the method has the advantages that the infrared image is learned, so that the target detection network can acquire the recognition capability for visible light and infrared targets, and meanwhile, the method has a better detection effect compared with the traditional deep learning method by improving the network structure.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a network architecture diagram of the present invention;
FIG. 3 is a graph showing the results of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the invention provides an infrared target detection method based on feature fusion and dense connection, which is realized by the following steps as shown in fig. 1:
step 1, constructing a data set
If the detection algorithm is required to have the ability to identify the infrared image, the infrared image needs to be added to the data set. The invention constructs a data set by using the infrared image and manually marks the image in the data set by using the boundary box.
Step 2, expanding training set
The training set is expanded by random rotation, mirroring, flipping, scaling, translation, scale transformation, contrast transformation, noise disturbance, color change, and the like. This compensates for the difficulty of acquiring infrared data sets and improves training on small data sets.
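A sketch of such an augmentation pipeline is shown below using torchvision; the parameter ranges are assumptions, not values taken from the patent, and torchvision is only one possible way to implement these operations.

```python
import torch
import torchvision.transforms as T

# Illustrative augmentation pipeline covering the operations listed above.
augment = T.Compose([
    T.RandomRotation(degrees=10),                        # random rotation
    T.RandomHorizontalFlip(p=0.5),                       # mirroring / flipping
    T.RandomAffine(degrees=0, translate=(0.1, 0.1),
                   scale=(0.8, 1.2)),                    # translation and scale transformation
    T.ColorJitter(brightness=0.2, contrast=0.3),         # contrast and intensity change
    T.ToTensor(),
    T.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # noise disturbance
])
```

Note that for detection training, the calibrated bounding boxes must be transformed consistently with the geometric operations; this image-only pipeline does not handle the box annotations.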
Step 3, dividing 10×10 regions
The original image is divided into 10×10 regions, and each region is responsible for detecting targets whose centers fall inside it, which greatly increases detection speed.
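The assignment of a target to its responsible region reduces to a simple index computation, sketched here for pixel-coordinate box centers:

```python
def responsible_cell(cx, cy, img_w, img_h, grid=10):
    """Return the (row, col) of the grid cell responsible for a target whose
    bounding-box centre is at pixel coordinates (cx, cy)."""
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row, col

# Example: in a 640x480 image, a target centred at (320, 240) is assigned
# to cell (5, 5) of the 10x10 grid.
print(responsible_cell(320, 240, img_w=640, img_h=480))  # -> (5, 5)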
Step 4, extracting features using the dense network
The feature extraction process comprises the following steps:
In the first step, the input image is processed by a convolution layer with 32 kernels of size 3×3, followed by a 2×2 pooling operation, to obtain feature map F_1.
In the second step, a dense block containing 64 3×3 convolution kernels and 64 1×1 convolution kernels is applied to F_1 to extract features; the residual is calculated and a 2×2 pooling operation is performed to obtain feature map F_2.
In the third step, a dense block containing 64 1×1 convolution kernels and 64 3×3 convolution kernels is applied to F_2 to extract features; the residual is calculated and a 2×2 pooling operation is performed to obtain feature map F_3.
In the fourth step, a dense block containing 64 1×1 convolution kernels and 64 3×3 convolution kernels is applied to F_3 to extract features; a 1×1 convolution is applied, the residual is calculated, and a 2×2 pooling operation is performed to obtain feature map F_4.
In the fifth step, a dense block containing 256 1×1 convolution kernels and 256 3×3 convolution kernels is applied to F_4 to extract features; a 1×1 convolution is applied, the residual is calculated, and a 2×2 pooling operation is performed to obtain feature map F_5.
In the sixth step, a dense block containing 1024 1×1 convolution kernels, 1024 3×3 convolution kernels and 1024 1×1 convolution kernels is applied to F_5 to extract features; a 1×1 convolution is applied and the residual is calculated to obtain feature map F_6.
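For orientation, the six steps above can be collapsed into a rough skeleton. This sketch flattens the dense blocks into plain convolution stacks and omits the residual computations; the single-channel infrared input is an assumption, and the channel widths simply follow the kernel counts quoted in the text.

```python
import torch.nn as nn

def conv_bn_relu(cin, cout, k):
    """1x1 or 3x3 convolution followed by batch norm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class FeatureExtractor(nn.Module):
    """Rough skeleton of the six-stage extraction path F_1..F_6 described above."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(conv_bn_relu(1, 32, 3), nn.MaxPool2d(2))     # -> F_1
        self.stage2 = nn.Sequential(conv_bn_relu(32, 64, 3),
                                    conv_bn_relu(64, 64, 1), nn.MaxPool2d(2))    # -> F_2
        self.stage3 = nn.Sequential(conv_bn_relu(64, 64, 1),
                                    conv_bn_relu(64, 64, 3), nn.MaxPool2d(2))    # -> F_3
        self.stage4 = nn.Sequential(conv_bn_relu(64, 64, 1), conv_bn_relu(64, 64, 3),
                                    conv_bn_relu(64, 64, 1), nn.MaxPool2d(2))    # -> F_4
        self.stage5 = nn.Sequential(conv_bn_relu(64, 256, 1), conv_bn_relu(256, 256, 3),
                                    conv_bn_relu(256, 256, 1), nn.MaxPool2d(2))  # -> F_5
        self.stage6 = nn.Sequential(conv_bn_relu(256, 1024, 1), conv_bn_relu(1024, 1024, 3),
                                    conv_bn_relu(1024, 1024, 1))                 # -> F_6

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        f6 = self.stage6(f5)
        return f4, f5, f6  # the three maps later used for feature fusion
```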
Step 5, carrying out feature fusion on the feature extraction result
The feature fusion method comprises the following steps:
First, the feature maps F_4, F_5 and F_6 obtained in step 4 are taken.
Second, F_4 is pooled four times with a 2×2 window, taking respectively the upper-left, upper-right, lower-left and lower-right points of each 2×2 neighborhood, to form a new feature map F_4'; F_4' and feature map F_5 are combined into feature map group F_7.
Third, F_7 is pooled four times with a 2×2 window in the same way, taking respectively the upper-left, upper-right, lower-left and lower-right points of each 2×2 neighborhood, to form a new feature map F_7'; F_7' and feature map F_6 are combined into feature map group F_8.
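Taking the four corner points of every 2×2 neighborhood is a space-to-depth rearrangement: the spatial size is halved and the channel count is quadrupled, so the shallower map can be stacked with the deeper one. A sketch, with toy tensor shapes that are illustrative only:

```python
import torch

def reorg_2x2(x):
    """Take the top-left, top-right, bottom-left and bottom-right point of every
    2x2 neighbourhood as four separate maps, halving height/width and
    quadrupling the number of channels (a space-to-depth style step)."""
    tl = x[..., 0::2, 0::2]
    tr = x[..., 0::2, 1::2]
    bl = x[..., 1::2, 0::2]
    br = x[..., 1::2, 1::2]
    return torch.cat([tl, tr, bl, br], dim=1)

# Toy tensors standing in for F_4 and F_5 (channel counts and sizes are assumptions):
f4 = torch.randn(1, 64, 40, 40)
f5 = torch.randn(1, 256, 20, 20)
f7 = torch.cat([reorg_2x2(f4), f5], dim=1)   # feature map group F_7
print(f7.shape)                              # torch.Size([1, 512, 20, 20])
```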
Step 6, regression detection to obtain classification result and boundary frame
The method for obtaining the classification result and the bounding box is as follows: for each region, the classification and regression detection network outputs the positions and types of 7 possible targets. The total number of target types is A, which is determined by the setting of the training set, so A class likelihoods are output for each candidate; the position parameters comprise the center coordinates, width and height of the target bounding box.
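One way to realize such an output is a 1×1-convolution head whose output tensor is reshaped to 7 candidates per grid cell, each carrying A class likelihoods plus the box center, width, height and a confidence value. This exact layout is an assumption; the text only fixes the 10×10 grid and the 7 candidates per region.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Illustrative detection head: for each of the 10x10 regions it predicts 7
    candidates, each described by A class likelihoods, the box centre x, y,
    width, height and a confidence value."""

    def __init__(self, in_channels, num_classes, boxes_per_cell=7):
        super().__init__()
        self.boxes = boxes_per_cell
        self.values = num_classes + 5               # A class scores + (x, y, w, h, confidence)
        self.conv = nn.Conv2d(in_channels, boxes_per_cell * self.values, kernel_size=1)

    def forward(self, fused):                       # fused: (batch, C, 10, 10)
        out = self.conv(fused)                      # (batch, 7 * (A + 5), 10, 10)
        return out.view(out.shape[0], self.boxes, self.values, 10, 10)

head = RegressionHead(in_channels=512, num_classes=2)     # e.g. A = 2: pedestrian, bicycle
print(head(torch.randn(1, 512, 10, 10)).shape)            # torch.Size([1, 7, 7, 10, 10])
```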
step 7, calculating the loss function and updating parameters
A loss function is calculated from the positions and types of the targets output in step 6 and the positions and types of the targets to be identified calibrated in the training set; this step is performed only during training. The loss function is calculated as follows:
Loss = [loss-function formula, reproduced as an image in the original publication]
where 100 is the number of regions, 7 is the number of suggestion boxes and finally generated bounding boxes to be predicted for each region, i is the region index, j is the suggestion-box/bounding-box index, Loss is the error value, obj denotes that a target is present and noobj that no target is present. x and y are the predicted horizontal and vertical coordinates of the center of a suggestion box or bounding box, w and h are its predicted width and height, and C is the prediction of whether a suggestion box or bounding box contains a target, consisting of A values that give the likelihood of each of the A target classes; the corresponding hatted symbols in the formula are the labeled values, and the two indicator terms denote that the j-th suggestion box or bounding box of region i does or does not contain a target, respectively. The weights are then updated from the loss-function result using the stochastic gradient descent method with momentum.
Repeating the steps 3-7 until the error meets the requirement or the iteration number reaches the set upper limit.
Step 8, testing with the verification set
The images in the verification set are processed with the target detection network trained in step 7 to obtain the detection accuracy, the required time and the final target detection result diagram.
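A toy timing harness for this step is sketched below; `detector` and `val_images` are hypothetical stand-ins for the trained network and the verification images. Computing the accuracy would additionally require matching the returned boxes against the calibrated labels (for example by an IoU threshold), which is omitted here because the patent does not specify the matching rule.

```python
import time

def evaluate(detector, val_images):
    """Run the trained detector over the verification set and report the
    average detection time per image; collected results can then be compared
    against the labels to compute accuracy."""
    results = []
    start = time.perf_counter()
    for image in val_images:
        results.append(detector(image))   # classification results and bounding boxes
    elapsed = time.perf_counter() - start
    print(f"average detection time per image: {elapsed / max(len(val_images), 1):.4f} s")
    return results
```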
The network structure of the present invention will be further described with reference to FIG. 2.
1. Network layer number setting
The neural network used in the invention is divided into two parts. The first part is the feature extraction network, which consists of 5 dense blocks and contains 25 convolutional layers in total. The second part is the feature fusion and regression detection network, which comprises 8 convolutional layers and 1 fully convolutional layer.
2. Dense block arrangement
The feature extraction network portion uses dense block settings as follows:
(1) Dense block 1 contains 2 convolutional layers: the first layer uses 64 convolution kernels of size 1×1 with stride 1; the second layer uses 64 convolution kernels of size 3×3 with stride 1. Dense block 1 is used 1 time.
(2) Dense block 2 contains 2 convolutional layers: the first layer uses 64 convolution kernels of size 3×3 with stride 1; the second layer uses 64 convolution kernels of size 1×1 with stride 1. Dense block 2 is used 1 time.
(3) Dense block 3 contains 2 convolutional layers: the first layer uses 64 convolution kernels of size 1×1 with stride 1; the second layer uses 64 convolution kernels of size 3×3 with stride 1. Dense block 3 is used 2 times.
(4) Dense block 4 contains 2 convolutional layers: the first layer uses 256 convolution kernels of size 1×1 with stride 1; the second layer uses 256 convolution kernels of size 3×3 with stride 1. Dense block 4 is used 4 times.
(5) Dense block 5 contains 3 convolutional layers: the first layer uses 1024 convolution kernels of size 1×1 with stride 1; the second layer uses 1024 convolution kernels of size 3×3 with stride 1; the third layer uses 1024 convolution kernels of size 1×1 with stride 1. Dense block 5 is used 2 times.
3. Feature fusion settings
The 3 sets of feature maps used for feature fusion are taken from the outputs of layer 9, layer 18 and layer 25 of the feature extraction network. The deeper feature maps are combined with the shallower feature maps by convolution and upsampling; the results are then processed by a 3×3 convolution layer and a 1×1 convolution layer, and the resulting three new feature maps are fused.
The simulation effect of the present invention will be further described with reference to fig. 3.
1. Simulation conditions:
The images to be detected in the simulation are 480×640 in size and contain pedestrians and bicycles.
2. Simulation results and analysis:
FIG. 3 is a graph showing the results of the present invention, wherein FIG. 3 (a) is the image to be tested; FIG. 3 (b) is an extracted feature map; FIG. 3 (c) is the detection result diagram.
Feature extraction is performed on FIG. 3 (a) using the dense network to obtain a series of feature maps; because there are too many intermediate feature maps, only two of them are shown, namely FIG. 3 (b) and FIG. 3 (c). FIG. 3 (b) is a feature map from a shallower layer of the network: the image is larger, with more detail information and less semantic information. FIG. 3 (c) is a feature map from a deeper layer: the image is smaller, with less detail information and more semantic information.
After the feature maps are fused and passed through regression detection, the positions of the pedestrians and bicycles are obtained and marked on the original image, giving the final result shown in FIG. 3 (c).
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention.

Claims (2)

1. The infrared target detection method based on feature fusion and dense connection is characterized by comprising the following steps of:
step 1, constructing an infrared image data set containing a required identification target, and calibrating the position and the type of the required identification target in the infrared image data set to obtain an original known label image;
step 2, dividing the infrared image data set into a training set and a verification set;
step 3, performing image-enhancement preprocessing on the images in the training set;
step 4, carrying out feature extraction and feature fusion on the preprocessed image, and obtaining a classification result and a bounding box through a regression network; calculating a loss function from the classification result, the bounding box and the original known label image, back-propagating the prediction error through the convolutional neural network using the stochastic gradient descent method with momentum, and updating the parameter values of the convolutional neural network;
step 5, repeating the steps 3 and 4 to update the convolutional neural network parameters in an iterative manner until the error is small enough or the iteration times reach a set upper limit;
step 6, processing the image in the verification set through the trained convolutional neural network parameters to obtain the accuracy and the required time of target detection and a final target detection result diagram;
in the step 4, feature extraction and feature fusion are performed on the preprocessed image, and a classification result and a bounding box are obtained through a regression network, specifically through the following steps:
step 401, randomly extracting a fixed number of images from the training set, and dividing each image into 10×10 regions;
step 402, inputting the image divided in the step 401 into a dense connection network for feature extraction;
step 403, performing feature fusion on the extracted feature map to obtain a fused feature map;
step 404, generating a fixed number of suggestion boxes for each region in the fused feature map;
step 405, sending the fused feature map and the suggestion frame into a regression network to perform classification and bounding box regression, and removing redundancy by using a non-maximum suppression method to obtain a classification result and a bounding box;
the calculation method of the dense connection network in step 402 is as follows:
d_l = H_l([d_0, d_1, ..., d_{l-1}])
where d_l denotes the output of the l-th convolution layer in the densely connected network (if the network contains B convolution layers, l takes values from 0 to B), H_l is the combined operation of regularization, convolution and the linear rectification (ReLU) activation function, d_0 is the input image, and d_{l-1} is the output of layer l-1;
in the step 403, feature fusion is performed on the extracted feature images, which is to directly fuse the extracted feature images with different scales through a pooling method;
in the step 403, feature fusion is performed on the extracted feature map, which is specifically implemented through the following steps:
step 4031, a first set of feature maps F_1 is converted into a new, smaller set of feature maps by a pooling operation and then fused with a second set of feature maps F_2 to obtain a new set of feature maps F_2';
step 4032, the new feature maps F_2' are pooled and then fused with a third set of feature maps F_3 to obtain a new set of feature maps F_3';
step 4033, the new feature maps F_2' and F_3' replace the second set F_2 and the third set F_3 and enter the regression network;
in the step 405, the fused feature map and the suggestion frame are sent to a regression network to perform classification and bounding box regression, and redundancy is removed by using a non-maximum suppression method, so as to obtain a classification result and a bounding box, which is realized specifically through the following steps:
step 4051, dividing the feature map into 10×10 areas, and inputting the areas into a regression detection network;
step 4052, for each region, the regression detection network outputs the positions and types of 7 possible targets, wherein the total number of target types is A, which is determined by the setting of the training set, so that A class likelihoods are output for each candidate; the position parameters comprise the center coordinates, width and height of the target bounding box;
step 4053, the non-maximum suppression method calculates the intersection-over-union of bounding boxes of the same class using the following formula:
S = |M ∩ N| / |M ∪ N|
where S is the calculated intersection-over-union, M and N denote two bounding boxes of the same target class, M ∩ N denotes the intersection of M and N, and M ∪ N denotes their union; for any two bounding boxes with S greater than 0.75, the bounding box with the smaller classification score is eliminated;
in the step 4, a loss function is calculated from the classification result, the bounding box and the original known label image, the prediction error is back-propagated through the convolutional neural network using the stochastic gradient descent method with momentum, and the parameter values of the convolutional neural network are updated, specifically through the following steps:
step 401, calculating a loss function according to the classification result, the position and the type of the target in the bounding box, and the position and the type of the target to be identified calibrated in the training set, wherein the calculation formula of the loss function is as follows:
Loss = [loss-function formula, reproduced as an image in the original publication]
where 100 is the number of regions, 7 is the number of suggestion boxes and finally generated bounding boxes to be predicted for each region, i is the region index, j is the suggestion-box/bounding-box index, Loss is the error value, obj denotes that a target is present and noobj that no target is present; x and y are the predicted horizontal and vertical coordinates of the center of a suggestion box or bounding box, w and h are its predicted width and height, and C is the prediction of whether a suggestion box or bounding box contains a target, consisting of A values that give the likelihood of each of the A target classes; the corresponding hatted symbols in the formula are the labeled values, and the two indicator terms denote that the j-th suggestion box or bounding box of region i does or does not contain a target, respectively;
step 402, updating the weights according to the loss-function result using the stochastic gradient descent method with momentum.
2. The method of claim 1, wherein the preprocessing of step 3 is to augment the training set by random rotation, mirroring, flipping, scaling, translation, scale transformation, contrast transformation, noise perturbation, and color change.
CN201811386234.9A 2018-11-20 2018-11-20 Infrared surface target detection method based on feature fusion and dense connection Active CN109583456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811386234.9A CN109583456B (en) 2018-11-20 2018-11-20 Infrared surface target detection method based on feature fusion and dense connection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811386234.9A CN109583456B (en) 2018-11-20 2018-11-20 Infrared surface target detection method based on feature fusion and dense connection

Publications (2)

Publication Number Publication Date
CN109583456A CN109583456A (en) 2019-04-05
CN109583456B true CN109583456B (en) 2023-04-28

Family

ID=65923459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811386234.9A Active CN109583456B (en) 2018-11-20 2018-11-20 Infrared surface target detection method based on feature fusion and dense connection

Country Status (1)

Country Link
CN (1) CN109583456B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197152B (en) * 2019-05-28 2022-08-26 南京邮电大学 Road target identification method for automatic driving system
CN110532914A (en) * 2019-08-20 2019-12-03 西安电子科技大学 Building analyte detection method based on fine-feature study
CN111461145B (en) * 2020-03-31 2023-04-18 中国科学院计算技术研究所 Method for detecting target based on convolutional neural network


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609525A (en) * 2017-09-19 2018-01-19 吉林大学 Remote Sensing Target detection method based on Pruning strategy structure convolutional neural networks
CN107818302A (en) * 2017-10-20 2018-03-20 中国科学院光电技术研究所 Non-rigid multiple dimensioned object detecting method based on convolutional neural networks
CN107808143A (en) * 2017-11-10 2018-03-16 西安电子科技大学 Dynamic gesture identification method based on computer vision
CN108182456A (en) * 2018-01-23 2018-06-19 哈工大机器人(合肥)国际创新研究院 A kind of target detection model and its training method based on deep learning
CN108038519A (en) * 2018-01-30 2018-05-15 浙江大学 A kind of uterine neck image processing method and device based on dense feature pyramid network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Densely Connected Convolutional Networks; Gao Huang et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-11-09; full text *
Infrared dim and small target tracking under complex sky background; Zhao Dong et al.; High Power Laser and Particle Beams; 2018-06-30; Vol. 30, No. 6; full text *

Also Published As

Publication number Publication date
CN109583456A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
WO2020102988A1 (en) Feature fusion and dense connection based infrared plane target detection method
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN104598885B (en) The detection of word label and localization method in street view image
CN106408030B (en) SAR image classification method based on middle layer semantic attribute and convolutional neural networks
CN107862261A (en) Image people counting method based on multiple dimensioned convolutional neural networks
CN114092832B (en) High-resolution remote sensing image classification method based on parallel hybrid convolutional network
CN109583456B (en) Infrared surface target detection method based on feature fusion and dense connection
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN106023154B (en) Multidate SAR image change detection based on binary channels convolutional neural networks
CN113920107A (en) Insulator damage detection method based on improved yolov5 algorithm
CN109284779A (en) Object detecting method based on the full convolutional network of depth
CN105160400A (en) L21 norm based method for improving convolutional neural network generalization capability
CN108960404B (en) Image-based crowd counting method and device
WO2015176305A1 (en) Human-shaped image segmentation method
CN110543906B (en) Automatic skin recognition method based on Mask R-CNN model
CN107967474A (en) A kind of sea-surface target conspicuousness detection method based on convolutional neural networks
CN106991411B (en) Remote Sensing Target based on depth shape priori refines extracting method
CN110716792B (en) Target detector and construction method and application thereof
CN111914902B (en) Traditional Chinese medicine identification and surface defect detection method based on deep neural network
CN108447057A (en) SAR image change detection based on conspicuousness and depth convolutional network
CN112766283B (en) Two-phase flow pattern identification method based on multi-scale convolution network
CN111161224A (en) Casting internal defect grading evaluation system and method based on deep learning
CN114897816A (en) Mask R-CNN mineral particle identification and particle size detection method based on improved Mask
CN108776777A (en) The recognition methods of spatial relationship between a kind of remote sensing image object based on Faster RCNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant