CN112861915A - Anchor-frame-free non-cooperative target detection method based on high-level semantic features - Google Patents
- Publication number: CN112861915A (application CN202110039412.6A)
- Authority
- CN
- China
- Prior art keywords: feature, detection, target, training, layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/253 — Pattern recognition; Fusion techniques of extracted features
- G06F18/254 — Pattern recognition; Fusion techniques of classification results, e.g. of results related to same input data
- G06N3/04 — Neural networks; Architecture, e.g. interconnection topology
- G06N3/08 — Neural networks; Learning methods
- G06V2201/07 — Target detection
Abstract
The invention relates to an anchor-frame-free non-cooperative target detection method based on high-level semantic features. The network comprises four parts: image input, a feature extraction and fusion module, a detection head module, and detection output. The detection method comprises the following steps: acquiring a target detection data set and preprocessing the input images to obtain the label information required by the network; setting the network model parameters carrying high-level semantic feature information and the various training parameters; performing step-wise fusion on the multi-scale feature layers produced by the feature extraction and fusion module to obtain a detection feature layer with high-level semantic information; feeding the resulting high-level semantic feature layer into the detection head to obtain a target center-point heat map and a target scale prediction map; performing classification and regression, computing the loss, back-propagating, and iteratively updating the network parameters; and, after network training is finished, testing in an actual scene.
Description
Technical Field
The invention relates to an anchor-frame-free non-cooperative target detection method based on high-level semantic features, suitable for non-cooperative target detection in complex high-definition-image scenes.
Background
Image feature detection is one of the most basic problems in computer vision and is generally regarded as a low-level task; it covers typical edge detection operators such as Canny and Sobel and region-of-interest detectors such as LoG and MSER. Image feature detection is of great significance to a variety of computer vision tasks, from image representation and image matching to three-dimensional scene reconstruction. In general, features are defined as the "interesting" parts of an image, so the purpose of feature detection is to compute an abstract representation of the image information, i.e. to make a local decision at each image point as to whether an image feature of a given type is present there. With the rapid development of computer vision work on such abstract representations, many researchers have begun to address object detection and segmentation using convolutional neural networks (CNNs). Compared with traditional algorithms, CNN-based algorithms are more efficient: no matching model needs to be established and inference is faster. At the same time, a CNN can abstract low-level features into high-level features, obtaining deeper information from which classification, detection or segmentation results are derived.
Current target detectors fall into one-stage and two-stage families. Two-stage means the detection algorithm works in two steps: candidate regions are generated first and then classified, as in the R-CNN series. One-stage detection, by contrast, is completed in a single pass: no candidate regions are searched separately, with SSD as a typical example. The Faster R-CNN algorithm, for instance, first generates candidate boxes and then classifies each candidate box (also refining its position). Such algorithms are relatively slow because the detection and classification stages must each be run. A one-stage detector predicts all bounding boxes in a single forward pass through the network, so it is faster and well suited to mobile deployment. All of the above algorithms, however, require well-designed anchor boxes responsible for detecting targets in different regions and at different sizes. Researchers generally consider these pre-defined parameters key to the success or failure of a target detection model, and related experiments have shown that anchor hyper-parameters have a significant impact on a model's predictive capability.
However, anchor-box-based target detection methods have the following disadvantages:
(1) Detection performance is very sensitive to hyper-parameters such as the size, aspect ratio and number of anchor boxes; for some models the AP can fluctuate by more than 4%. To obtain a good model, these hyper-parameters must therefore be carefully tuned and tested. Such hand-crafted settings also reduce the robustness of anchor-based models: the anchor hyper-parameters must be redesigned for every new data set, increasing the complexity of parameter debugging.
(2) Even with careful design, the aspect ratios and sizes of the anchor boxes are fixed when the model is built. This causes a serious problem: when the model faces a target set with large shape variation, especially small targets, detection becomes markedly more difficult.
(3) To improve recall, a model usually places dense anchors on every feature layer, but most anchors are labelled as negative samples during training. The excess of negative samples aggravates the imbalance between positive and negative samples, degrading the model's ability to discriminate targets from background.
Disclosure of Invention
The technical problem solved by the invention is as follows: to overcome the shortcomings of the prior art for non-cooperative targets (targets whose true position cannot be directly measured by a sensor, nor obtained accurately by any other technical means), the invention provides an anchor-frame-free non-cooperative target detection method based on high-level semantic features. The method improves detection accuracy for small targets under varying scales in complex scenes, abandons the traditional anchor-box-based paradigm, and raises the model's detection precision.
The technical solution of the invention is as follows: the anchor-frame-free non-cooperative target detection method based on the high-level semantic features comprises the following steps:
(1) and collecting shooting data images according to the target characteristics, and determining a training and testing data set.
(2) Constructing the anchor-frame-free target detection network architecture and setting parameters such as the algorithm's pre-training model, maximum number of iterations, learning rate, back-propagation method, training batch size batch_size, number of batches per iteration iter_size, momentum, and classification IOU threshold; the initial iteration count of the model is set to 0.
(3) Performing data amplification on the input samples.
(4) Extracting features of the input image with the feature extraction module; upsampling the last feature layer by bilinear interpolation to the size of the preceding feature layer, and fusing the two element-wise (eltw-sum) to form a new feature layer. The newly formed feature layer is again upsampled by bilinear interpolation to the size of the layer above it, and so on up to the shallowest feature layer.
(5) Applying L2 normalization to the feature layers of different sizes obtained by the cross-stage fusion, deconvolving them to the size of the shallowest feature layer, fusing them together by concatenation, and applying a 1 × 1 convolution to form the high-level semantic feature layer used for subsequent detection.
(6) Sending the high-level semantic feature layer to the detection head module: after a 3 × 3 convolutional layer, a BN layer and a ReLU activation function, two branches of 1 × 1 convolution kernels directly perform position estimation and scale estimation of the target.
(7) Computing the classification loss with a cross-entropy loss function and the regression loss with a Smooth L1 function; the total loss is the weighted sum of the two, and detection correctness is judged by the classification IOU threshold.
(8) Judging whether the iter_size batches of batch_size pictures set in step (2) have been processed; if so, going to step (9); otherwise, returning to step (7) to continue training the network model until the convolutional neural network model with the optimal parameter solution is obtained.
(9) Testing the test set with the convolutional neural network model obtained in step (8) to obtain the recognition accuracy. If the accuracy meets the actual engineering requirement, the model is applied to the actual target detection task and step (10) is executed; if not, steps (1), (2) and (3) are repeated until the requirement is met.
(10) Applying the parameters of the convolutional neural network model that meets the engineering requirement to the actual target detection scene, and recognizing the acquired target detection pictures.
In step (8), the convolutional neural network model with the optimal parameter solution is obtained as follows: when the per-step descent of the training-set loss function Loss does not exceed 0.001 and the validation-set loss begins to rise at a critical point, the model at that point is taken as the parameter-optimal convolutional neural network model.
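The stopping rule just described can be sketched in a few lines of Python. The function name, the two-point window, and the tolerance default are illustrative assumptions for the sketch, not part of the patent:

```python
def reached_optimum(train_losses, val_losses, tol=1e-3):
    """Stopping-rule sketch: the training loss has nearly plateaued
    (its latest descent is at most `tol`) while the validation loss
    has begun to rise, suggesting the parameter optimum was reached."""
    if len(train_losses) < 2 or len(val_losses) < 2:
        return False  # not enough history to judge
    train_drop = train_losses[-2] - train_losses[-1]
    val_rising = val_losses[-1] > val_losses[-2]
    return train_drop <= tol and val_rising
```

In practice one would smooth both loss curves over a longer window before applying such a test, but the two-point form matches the criterion as stated.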
In step (1), the data sets adopted are the VOC Dataset and an actual outdoor-scene target data set.
In step (2), the maximum number of iterations, learning rate, momentum and related parameters are set, and the back-propagation method is selected as follows:
Experimental environment: Ubuntu 18.04; GPU: RTX 3070; cuDNN 8.0.5; CPU: Intel(R) Core(TM) i7-10850K @ 3.60 GHz.
Maximum number of iterations: 160,000 times;
training batch size batch _ size: 16;
learning rate: initial learning rate 0.001, decayed by a factor of 10 at iterations 80,000 and 120,000;
back-propagation method: stochastic gradient descent (SGD);
momentum parameter: 0.9;
classification IOU threshold parameter: 0.5;
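The learning-rate settings above amount to a simple step-decay schedule; a minimal sketch follows (the function name and argument layout are assumptions of this sketch):

```python
def learning_rate(iteration, base_lr=0.001, decay_steps=(80_000, 120_000), factor=10):
    """Step-decay schedule from the text: start at 0.001 and divide
    by `factor` each time a decay milestone is passed."""
    lr = base_lr
    for step in decay_steps:
        if iteration >= step:
            lr /= factor
    return lr
```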
and (4) the backbone feature network adopted by the feature extraction module in the step (4) is ResNet-50.
The loss function adopted in step (7) regresses two kinds of error, the position error and the confidence error, and is a weighted sum of the two:

$$L = \lambda_c L_{center} + \lambda_s L_{scale}$$

where $\lambda_c$ and $\lambda_s$ denote the loss weight of the center-position prediction branch and the loss weight of the scale regression branch, set to 0.01 and 1 respectively in the experiments.
In order to reduce the influence of the background around the positive samples on the target, a 2D Gaussian mask G(·) is applied at the center of each target:

$$M_{ij} = \max_{k=1,\dots,K} \exp\!\left(-\frac{(i - x_k)^2}{2\sigma_{w_k}^2} - \frac{(j - y_k)^2}{2\sigma_{h_k}^2}\right)$$

where K is the number of objects in a single picture, $(x_k, y_k, w_k, h_k)$ are the center position, width and height of the k-th target, and the variances $(\sigma_w, \sigma_h)$ are proportional to each target's width and height.
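A pure-Python sketch of such a Gaussian mask over the feature grid. The function name and the proportionality constant `sigma_ratio` are assumptions of the sketch (the patent only states that the deviations are proportional to the target dimensions); overlapping targets are combined with an element-wise max:

```python
import math

def gaussian_mask(height, width, objects, sigma_ratio=0.25):
    """Sketch of a 2D Gaussian mask G(.) over an HxW grid.
    `objects` is a list of (xc, yc, w, h) center/size tuples in grid
    coordinates; sigma is taken proportional to each object's size."""
    mask = [[0.0] * width for _ in range(height)]
    for (xc, yc, w, h) in objects:
        sw, sh = sigma_ratio * w, sigma_ratio * h
        for j in range(height):
            for i in range(width):
                g = math.exp(-((i - xc) ** 2 / (2 * sw ** 2)
                               + (j - yc) ** 2 / (2 * sh ** 2)))
                mask[j][i] = max(mask[j][i], g)  # max over objects
    return mask
```

The mask peaks at 1 on each target center and falls off with distance, which is what lets the loss down-weight near-center negatives.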
For the center-point prediction branch, cross-entropy loss is used for the classification task; the classification loss function can be expressed as

$$L_{center} = -\frac{1}{K}\sum_{i=1}^{W/r}\sum_{j=1}^{H/r}\alpha_{ij}\,(1-\hat{p}_{ij})^{\gamma}\log(\hat{p}_{ij})$$

where

$$\hat{p}_{ij}=\begin{cases}p_{ij} & y_{ij}=1\\ 1-p_{ij} & \text{otherwise}\end{cases}\qquad \alpha_{ij}=\begin{cases}1 & y_{ij}=1\\ (1-M_{ij})^{\beta} & \text{otherwise}\end{cases}$$

In the above, the input image has size H × W and r denotes the down-sampling factor; $p_{ij} \in [0,1]$ is the predicted confidence that a target center lies at position (i, j), and $y_{ij} \in \{0,1\}$ is the ground-truth label information. γ is a hyper-parameter, set to 2 according to the experimental conditions; $y_{ij} = 1$ marks position (i, j) as a positive sample; $\alpha_{ij}$ is a weight, with β set to 4 according to the experimental conditions. When the position is a positive sample, $\alpha_{ij}$ is set to 1.
For the scale prediction branch, Smooth L1 loss is used, treating scale prediction as a regression task:

$$L_{scale} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{SmoothL1}(s_k, t_k)$$

where $s_k$ and $t_k$ respectively denote the network's prediction and the ground truth for each positive sample.
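The two ingredients of the total loss can be sketched directly from the definitions above; the function names are assumptions, and the center/scale losses are passed in as already-reduced scalars, with the branch weights from the experiments as defaults:

```python
def smooth_l1(pred, target):
    """Smooth L1 on a scalar pair: quadratic near zero, linear beyond 1."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5

def total_loss(center_loss, scale_loss, lambda_c=0.01, lambda_s=1.0):
    """Weighted sum L = lambda_c * L_center + lambda_s * L_scale,
    with the weights (0.01 and 1) reported in the experiments."""
    return lambda_c * center_loss + lambda_s * scale_loss
```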
Compared with the prior art, the invention has the advantages that:
(1) The invention discloses a novel target detection framework: target detection is reduced, through convolution, to a direct center and scale prediction task, abandoning the traditional anchor-box presets and eliminating the complex post-processing of conventional target detection.
(2) The invention down-samples the feature map several times after the backbone feature extraction network, ensuring recognition accuracy on multi-scale targets; the resulting multi-scale feature maps are fused with a simple and effective strategy, and classification and regression are performed directly on the fused feature layer carrying high-level semantic information.
(3) The high-level visual task of target detection is reduced to a semantic feature-point detection problem: a position on the center-point heat map corresponds to the center of a detection box, the predicted size corresponds to the size of that box, and the confidence on the heat map corresponds to the box's score. The detector has a simple structure: compared with lines of work such as R-CNN and SSD, the detection head module is greatly simplified, yielding a simple and efficient new detection approach.
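The center/scale correspondence described in advantage (3) implies a very simple decoding step at inference time. A sketch under stated assumptions: the function name is hypothetical, the heat map and scale map are nested lists on the down-sampled grid, and each cell above the confidence threshold is treated as a detection (a real detector would also suppress non-maximal neighbours):

```python
def decode_detections(heatmap, scale_map, r=4, conf_thresh=0.5):
    """Turn a center-point heat map and a scale map into boxes.
    heatmap[j][i] is the center confidence at grid cell (i, j);
    scale_map[j][i] is an assumed (w, h) pair predicted there.
    Returns (x, y, w, h, score) with centers mapped back to the
    input image by the down-sampling factor r."""
    boxes = []
    for j, row in enumerate(heatmap):
        for i, score in enumerate(row):
            if score >= conf_thresh:
                w, h = scale_map[j][i]
                boxes.append((i * r, j * r, w, h, score))
    return boxes
```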
Drawings
FIG. 1 is a structural block diagram of the anchor-frame-free non-cooperative target detection method based on high-level semantic features;
FIG. 2 is a block diagram of a model training and testing process of the non-cooperative target detection method without an anchor frame based on high-level semantic features in the embodiment of the present invention.
FIG. 3 is a graph illustrating the test results of the classification IOU threshold set to 0.5 on the VOC test data set according to the present invention.
Fig. 4 is a schematic diagram of a test result when the classification IOU threshold is set to 0.5 on an actual collected outdoor scene test data set according to the embodiment of the present invention.
Detailed Description
For better understanding of the technical solutions of the present invention, the following detailed description is provided for the embodiments of the present invention with reference to the accompanying drawings, but the embodiments of the present invention are not limited thereto.
The embodiment provides an anchor-frame-free non-cooperative target detection method based on high-level semantic features, a network structure of the method comprises an image input part, a feature extraction and fusion part, a detection head part and a detection output part, and fig. 1 shows a general block diagram of the network structure of the proposed method, which comprises the following steps in sequence:
(1) In the first step, the VOC target detection data set is obtained and the data set labelling information is converted into the labels required by the network.
(2) In the second step, the model trained on the ImageNet data set is used as the pre-trained model. The number of iterations is set to 160,000; the initial learning rate is 0.001 and decays by a factor of 10 at iterations 80,000 and 120,000, i.e. to 0.0001 and 0.00001; the optimization method is SGD (stochastic gradient descent); the training batch size batch_size is 16; the number of batches per iteration iter_size is 2; and the classification IOU threshold is 0.5. The initial iteration count of the model is set to 0.
(3) In the third step, the model's training iteration count is incremented by 1 and training continues.
(4) In the fourth step, 16 training pictures are drawn from the training set and data amplification is performed on the input images: the training sample set is expanded by operations such as flipping, translation, scaling, brightness change, cropping and color transformation.
(5) In the fifth step, the base network ResNet-50 performs feature extraction on the input image, with output feature maps at down-sampling rates of 2, 4, 8, 16 and 32. As shown in FIG. 1, FL2, FL3, FL4 and FL5 are selected as the multi-scale feature maps, with sizes 1/4, 1/8, 1/16 and 1/32 of the input image respectively.
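The spatial sizes of these multi-scale layers follow directly from the down-sampling rates; a small sketch (the function name is an assumption, and integer division stands in for the exact stride arithmetic of the backbone):

```python
def multiscale_sizes(height, width):
    """Spatial sizes of FL2..FL5 at 1/4, 1/8, 1/16 and 1/32 of the
    input, matching the ResNet-50 down-sampling rates in the text."""
    return {f"FL{n}": (height // (2 ** n), width // (2 ** n))
            for n in range(2, 6)}
```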
(6) In the sixth step, L2 normalization is applied to the multi-scale feature maps FL2, FL3, FL4 and FL5. FL5 is then upsampled by bilinear interpolation to the size of FL4 and fused with FL4 by an eltw-sum operation to form a new feature layer, denoted FL4_1; FL4_1 is upsampled by bilinear interpolation to the size of FL3 and fused with FL3 by an eltw-sum operation to form a new feature layer, denoted FL3_1; FL3_1 is upsampled by bilinear interpolation to the size of FL2 and fused with FL2 by an eltw-sum operation to form a new feature layer, denoted FL2_1.
(7) In the seventh step, FL5, FL4_1 and FL3_1 are brought to the size of FL2_1 by deconvolution; a Concatenate fusion operation is applied to the size-unified feature layers, and a 1 × 1 convolution kernel adjusts the number of channels to form a feature layer FL6 containing high-level semantic features.
(8) In the eighth step, the high-level semantic feature layer FL6 is sent to the detection head module; after activation through a 3 × 3 × 512 convolution kernel, a BN layer and a ReLU function, two detection branches are formed by two 1 × 1 convolution kernels, generating the target center-point heat map and the target scale prediction map respectively.
(9) In the ninth step, training starts with the above settings and selections: classification and regression are performed on the high-level semantic feature layer, detection correctness is judged by the classification IOU threshold, the classification loss is computed with the cross-entropy loss function, the regression loss with the Smooth L1 function, and the total loss is the weighted sum of the two.
(10) In the tenth step, it is judged whether the 16 input images have been iterated 2 times (iter_size); if so, go to the eleventh step; otherwise, return to the fourth step.
(11) In the eleventh step, the average of the losses from the 2 batches of 16 images is taken as the loss of the full iteration; back-propagation is performed with stochastic gradient descent, updating the network coefficients of the feature extraction and fusion module, the step-wise feature fusion module, and the detection head module.
(12) In the twelfth step, it is judged whether the total number of iterations has reached 160,000; if so, the finally trained weight coefficients are stored and model training is finished; otherwise, return to the third step to continue training.
(13) In the thirteenth step, the obtained network model parameters are imported into the network model for testing; FIG. 3 shows the actual test results on the VOC data set.
The experiment compares the mean average precision of the target detection results of the anchor-frame-free non-cooperative target detection method based on high-level semantic features of embodiment 1 (denoted HFAF_OD) with that of the existing YOLO v3 algorithm;
average precision mean comparison
Mean average precision (mAP), the mean of the per-class average precision (AP), is the main evaluation index of a target detection algorithm and describes the quality of a target detection model; a higher mAP indicates that the model detects better on the given data set;
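For reference, AP for one class can be computed from a confidence-ranked detection list; a sketch using all-point interpolation (the function name and argument layout are assumptions of this sketch, and VOC-style evaluation additionally matches detections to ground truth by IoU first):

```python
def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for one class.
    `scores`: detection confidences; `is_tp`: whether each detection
    matched a ground-truth box; `num_gt`: number of ground-truth boxes.
    mAP is this value averaged over classes."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    tp = fp = 0
    recalls, precisions = [], []
    for k in order:
        tp, fp = (tp + 1, fp) if is_tp[k] else (tp, fp + 1)
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # make the precision envelope non-increasing from right to left
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p  # area under the precision-recall curve
        prev_r = r
    return ap
```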
the VOC data set was tested by two methods, the test results are shown in figure 3, the mAP values were calculated, and the results are shown in table 1:
TABLE 1 Comparison of mAP values obtained by the HFAF_OD and YOLO v3 methods

| Experimental method | mAP-50 value |
| ------------------- | ------------ |
| HFAF_OD             | 84.1%        |
| YOLO v3             | 78.4%        |
The results show that the anchor-frame-free non-cooperative target detection method based on high-level semantic features of embodiment 1 (denoted HFAF_OD) achieves a higher mAP than the existing YOLO v3 algorithm, indicating that the method of the invention discriminates target categories more accurately than the anchor-box-based method.
Comparison of rectangular detection frame conditions
The experiment compares the numbers of unrecalled boxes, missed-detection boxes and misclassified boxes produced by the anchor-frame-free non-cooperative target detection method based on high-level semantic features of embodiment 1 (denoted HFAF_OD) and by the existing YOLO v3 algorithm;
the fewer the unrecalled frames, the missed detection frames and the classification error frames, the better the detection effect of the target detection model on the given data set is;
the two methods are used for carrying out target detection on 400 pictures in a collected and shot actual outdoor scene target data set, a schematic diagram of detection and comparison results is shown in figure 4, and the number of unrecalled frames, missed detection frames and classified error frames is calculated, and the results are shown in table 2.
TABLE 2 Comparison of rectangular detection box results for the HFAF_OD and YOLO v3 methods

| Experimental method | Unrecalled boxes | Missed-detection boxes | Misclassified boxes |
| ------------------- | ---------------- | ---------------------- | ------------------- |
| HFAF_OD             | 20               | 22                     | 20                  |
| YOLO v3             | 45               | 42                     | 46                  |
The results show that, compared with the existing anchor-box-based YOLO v3 algorithm, the HFAF_OD method of embodiment 1 markedly reduces the numbers of unrecalled boxes and misclassified boxes, and the number of missed-detection boxes also decreases. This demonstrates that the method attains higher accuracy than the traditional algorithm and alleviates the poor detection accuracy on targets under occlusion and motion blur.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.
Claims (11)
1. A non-cooperative target detection method based on advanced features and without an anchor frame is characterized by comprising the following steps:
s1, inputting an image, wherein the image is used for processing sample label information and sample amplification of the input image;
s2, a feature extraction and fusion module for extracting multi-level features of the image and fusing the multi-level features to obtain a high-level semantic feature layer of the image;
s3, a detection head module is used for classifying and regressing the high-level semantic feature layer to obtain optimal model parameters;
and S4, detecting results, and using the optimal model parameters for the actual application deployment of the non-cooperative target detection.
2. The network structure of claim 1, wherein the feature extraction and fusion module uses a ResNet-50 backbone network with down-sampling rates of 2, 4, 8, 16 and 32, yielding four multi-scale feature layers and, after step-wise fusion, a high-level semantic feature layer;
the detection head module is used for forming two detection branches and respectively generating a target central point heat map and a target scale prediction map.
3. A non-anchor frame non-cooperative target detection method based on advanced features is characterized by comprising the following training and testing steps:
acquiring a target detection data set for training and testing, and converting the labeling information into a format which can be directly read by a network model;
initializing a network training model, preprocessing a training sample, and performing floating point conversion to obtain a floating point image;
extracting a multi-scale feature map by using a feature extraction and fusion module, and fusing the multi-scale feature map to form a high-level semantic feature map;
extracting high-level semantic features by using a detection head module to obtain two detection branches for generating a target central point heat map and a target scale prediction map;
calculating classification and regression loss and performing back propagation to perform iterative updating of network parameters;
completing network training;
and applying the stored network model to the actual test data to complete the test.
4. The target detection method according to claim 3, wherein, during initialization of the training model, parameters such as the pre-training model parameters of the algorithm, the maximum number of iterations, the learning rate, the back-propagation method, the training batch size batch_size, the number of batches per iteration iter_size, the momentum parameter and the classification IOU threshold are set; and the input training sample set is expanded by operations such as flipping, translation, scaling, brightness change, cropping and color transformation.
5. The target detection method according to claim 3, wherein the input image is passed through the feature extraction and fusion module to obtain four feature layers FL2, FL3, FL4 and FL5 as the multi-scale feature maps, whose sizes are 1/4, 1/8, 1/16 and 1/32 of the input image size respectively. The four feature layers are fused stepwise, and the processed feature layers are then fused by concatenation to form a feature layer carrying high-level semantic information.
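The size bookkeeping in claim 5 can be sketched in a few lines (a hypothetical helper, assuming the input resolution divides evenly by 32):

```python
# For an H x W input, FL2..FL5 are 1/4, 1/8, 1/16 and 1/32 of the
# input resolution (successive down-sampling by factors of 2).
def feature_sizes(h, w):
    return {f"FL{i}": (h // s, w // s)
            for i, s in zip(range(2, 6), (4, 8, 16, 32))}

sizes = feature_sizes(512, 512)
# {'FL2': (128, 128), 'FL3': (64, 64), 'FL4': (32, 32), 'FL5': (16, 16)}
```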
6. The method of claim 3, wherein the detection head module first applies a 3 × 3 × 512 convolution kernel followed by a BN layer and a ReLU activation function, and then forms two detection branches with two 1 × 1 convolution kernels to generate the target center-point heat map and the target scale prediction map respectively.
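The shape flow through the head in claim 6 can be sketched as follows (the channel counts of the two 1 × 1 branches are illustrative assumptions; the claim does not state them):

```python
# Shape bookkeeping for the detection head: a shared 3 x 3 x 512
# convolution (padding preserves H x W) feeds two 1 x 1 convolutions
# that produce the center-point heat map and the (w, h) scale map.
def head_output_shapes(in_shape, num_classes=1):
    c, h, w = in_shape
    shared = (512, h, w)            # after 3x3 conv + BN + ReLU
    heatmap = (num_classes, h, w)   # classification branch
    scale = (2, h, w)               # regression branch: width, height
    return shared, heatmap, scale

shared, heatmap, scale = head_output_shapes((256, 128, 128))
# shared == (512, 128, 128); heatmap == (1, 128, 128); scale == (2, 128, 128)
```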
7. The feature-layer fusion method of claim 5, wherein the multi-scale feature maps FL2, FL3, FL4 and FL5 are each subjected to L2 normalization; FL5 is then upsampled by bilinear interpolation to the size of FL4 and fused with FL4 by an Eltw-sum (element-wise sum) operation to form a new feature layer denoted FL4_1; FL4_1 is upsampled by bilinear interpolation to the size of FL3 and fused with FL3 by an Eltw-sum operation to form a new feature layer denoted FL3_1; and FL3_1 is upsampled by bilinear interpolation to the size of FL2 and fused with FL2 by an Eltw-sum operation to form a new feature layer denoted FL2_1.
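One fusion step of claim 7 can be sketched in pure Python on single-channel 2-D maps (helper names are illustrative; a real implementation would operate on multi-channel tensors):

```python
import math

# L2-normalize a feature map (guard against an all-zero map).
def l2_normalize(fm):
    norm = math.sqrt(sum(v * v for row in fm for v in row)) or 1.0
    return [[v / norm for v in row] for row in fm]

# Bilinear resize with corner-aligned sample positions.
def bilinear_resize(fm, out_h, out_w):
    in_h, in_w = len(fm), len(fm[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = min(int(y), in_h - 2) if in_h > 1 else 0
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = min(int(x), in_w - 2) if in_w > 1 else 0
            wx = x - x0
            y1, x1 = min(y0 + 1, in_h - 1), min(x0 + 1, in_w - 1)
            row.append((1 - wy) * ((1 - wx) * fm[y0][x0] + wx * fm[y0][x1])
                       + wy * ((1 - wx) * fm[y1][x0] + wx * fm[y1][x1]))
        out.append(row)
    return out

# The "Eltw-sum" fusion: element-wise addition of equal-size maps.
def eltw_sum(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Fuse a 2x2 "FL5" into a 4x4 "FL4" to form FL4_1:
fl5, fl4 = [[1.0, 1.0], [1.0, 1.0]], [[0.0] * 4 for _ in range(4)]
fl4_1 = eltw_sum(bilinear_resize(l2_normalize(fl5), 4, 4), l2_normalize(fl4))
```

Repeating the same step for (FL4_1, FL3) and (FL3_1, FL2) yields FL3_1 and FL2_1.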
8. The method for obtaining high-level semantic features according to claim 5, wherein FL5, FL4_1 and FL3_1 are unified to the size of FL2_1 by deconvolution operations, a Concatenate fusion operation is performed on the size-unified feature layers, and the number of channels is adjusted by a 1 × 1 convolution kernel to form a feature layer FL6 containing high-level semantic features.
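The concatenate-then-mix step of claim 8 can be sketched as follows (helper names and the mixing weights are illustrative; layers are assumed already resized to a common spatial size):

```python
# Each layer is a list of 2-D channel maps; concatenation stacks
# them along the channel dimension.
def concatenate(*layers):
    return [ch for layer in layers for ch in layer]

# A 1x1 convolution is a per-pixel weighted sum across channels:
# weights[o][c] mixes input channel c into output channel o.
def conv1x1(channels, weights):
    h, w = len(channels[0]), len(channels[0][0])
    return [[[sum(wc * channels[c][i][j] for c, wc in enumerate(wo))
              for j in range(w)] for i in range(h)] for wo in weights]

a = [[[1, 1], [1, 1]]]                        # one-channel 2x2 map
b = [[[2, 2], [2, 2]]]                        # one-channel 2x2 map
stacked = concatenate(a, b)                   # 2 channels
fl6 = conv1x1(stacked, weights=[[0.5, 0.5]])  # mixed back to 1 channel
# fl6 == [[[1.5, 1.5], [1.5, 1.5]]]
```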
9. The detection method according to claim 3, wherein classification and regression are performed on the high-level semantic feature layer, detection errors are judged by the classification IOU threshold, the classification loss is calculated with the Cross Entropy Loss function, the regression loss is calculated with the Smooth L1 function, and the total loss is the weighted sum of the classification loss and the regression loss.
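A pure-Python sketch of the claim-9 loss (the regression weight `lambda_reg` is an illustrative assumption; the claim only says the total is a weighted sum):

```python
import math

# Binary cross-entropy for the center-point classification branch;
# p is the predicted probability of the positive class, y is 0 or 1.
def cross_entropy(p, y):
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

# Smooth L1 for the scale regression branch: quadratic near zero,
# linear beyond |d| = 1.
def smooth_l1(pred, target):
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5

# Total loss: weighted sum of mean classification and regression losses.
def total_loss(cls_pairs, reg_pairs, lambda_reg=1.0):
    cls = sum(cross_entropy(p, y) for p, y in cls_pairs) / len(cls_pairs)
    reg = sum(smooth_l1(p, t) for p, t in reg_pairs) / len(reg_pairs)
    return cls + lambda_reg * reg

loss = total_loss([(0.9, 1), (0.2, 0)], [(3.2, 3.0), (5.0, 6.5)])
```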
10. The detection method according to claim 3, wherein the convolutional neural network model with the optimal parameters is obtained when the loss function Loss on the training set no longer exceeds 0.001 while the loss function Loss on the validation set reaches its turning point and begins to rise.
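The stopping rule in claim 10 amounts to simple early stopping; a hypothetical sketch (the function name and epoch-level granularity are assumptions):

```python
# Pick the epoch whose model to keep: training loss is at or below the
# threshold and the validation loss has just started to rise, i.e. the
# previous epoch sat at the validation-loss turning point.
def pick_stop_epoch(train_losses, val_losses, train_thresh=0.001):
    for t in range(1, len(val_losses)):
        if train_losses[t] <= train_thresh and val_losses[t] > val_losses[t - 1]:
            return t - 1          # last epoch before validation loss rose
    return len(val_losses) - 1    # never triggered: keep the final model

epoch = pick_stop_epoch(
    train_losses=[0.01, 0.001, 0.0008, 0.0005],
    val_losses=[0.20, 0.15, 0.14, 0.16])
# epoch == 2
```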
11. The detection method according to claim 3, wherein the test pictures are input directly into the network model loaded with the parameters obtained by training, and the confidence and position of the target categories in the test pictures are obtained, thereby completing the test.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110039412.6A CN112861915A (en) | 2021-01-13 | 2021-01-13 | Anchor-frame-free non-cooperative target detection method based on high-level semantic features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112861915A true CN112861915A (en) | 2021-05-28 |
Family
ID=76003093
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110039412.6A Pending CN112861915A (en) | 2021-01-13 | 2021-01-13 | Anchor-frame-free non-cooperative target detection method based on high-level semantic features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112861915A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019232836A1 (en) * | 2018-06-04 | 2019-12-12 | 江南大学 | Multi-scale sensing pedestrian detection method based on improved full convolutional network |
CN109800628A (en) * | 2018-12-04 | 2019-05-24 | 华南理工大学 | A kind of network structure and detection method for reinforcing SSD Small object pedestrian detection performance |
CN111126472A (en) * | 2019-12-18 | 2020-05-08 | 南京信息工程大学 | Improved target detection method based on SSD |
CN111783685A (en) * | 2020-05-08 | 2020-10-16 | 西安建筑科技大学 | Target detection improved algorithm based on single-stage network model |
CN111860587A (en) * | 2020-06-12 | 2020-10-30 | 长安大学 | Method for detecting small target of picture |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113361369A (en) * | 2021-06-01 | 2021-09-07 | 南京南瑞信息通信科技有限公司 | Power field dressing standard detection method based on training sample selection |
CN113361369B (en) * | 2021-06-01 | 2022-08-09 | 南京南瑞信息通信科技有限公司 | Power field dressing standard detection method based on training sample selection |
CN113469100A (en) * | 2021-07-13 | 2021-10-01 | 北京航科威视光电信息技术有限公司 | Method, device, equipment and medium for detecting target under complex background |
CN113673510A (en) * | 2021-07-29 | 2021-11-19 | 复旦大学 | Target detection algorithm combining feature point and anchor frame joint prediction and regression |
CN113673510B (en) * | 2021-07-29 | 2024-04-26 | 复旦大学 | Target detection method combining feature point and anchor frame joint prediction and regression |
CN114462555A (en) * | 2022-04-13 | 2022-05-10 | 国网江西省电力有限公司电力科学研究院 | Multi-scale feature fusion power distribution network equipment identification method based on raspberry pi |
US11631238B1 (en) | 2022-04-13 | 2023-04-18 | Iangxi Electric Power Research Institute Of State Grid | Method for recognizing distribution network equipment based on raspberry pi multi-scale feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210528 |