CN111553212A - Remote sensing image target detection method based on smooth frame regression function

Remote sensing image target detection method based on smooth frame regression function

Info

Publication number
CN111553212A
Authority
CN
China
Prior art keywords
regression
frame
target detection
image
network
Prior art date
Legal status
Granted
Application number
CN202010302996.7A
Other languages
Chinese (zh)
Other versions
CN111553212B (en)
Inventor
申原
刘军
李洪忠
郭善昕
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202010302996.7A priority Critical patent/CN111553212B/en
Publication of CN111553212A publication Critical patent/CN111553212A/en
Priority to PCT/CN2020/140022 priority patent/WO2021208502A1/en
Application granted granted Critical
Publication of CN111553212B publication Critical patent/CN111553212B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/10 - Terrestrial scenes
    • G06V 20/13 - Satellite images
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

A remote sensing image target detection method based on a smooth bounding box regression function comprises: performing the necessary preprocessing on the training images and setting the hyper-parameters for network training; feeding each image into a target-detection convolutional neural network to obtain a feature map; feeding the feature map into a region proposal network to obtain candidate boxes; feeding the candidate boxes and the feature map into a region of interest pooling layer to obtain region of interest features, and classifying these features with a classifier; feeding the region of interest features into a fully connected layer to obtain predicted offsets, feeding the predicted offsets into the smooth bounding box regression function to obtain actual offsets, and correcting the candidate boxes to new positions; repeating these steps until training is finished; and, after preprocessing, feeding the image to be detected into the trained network to obtain the target detection result. The method effectively achieves high-precision bounding box regression and more accurate target detection under high IoU thresholds.

Description

Remote sensing image target detection method based on smooth frame regression function
Technical Field
The invention belongs to the fields of image processing and machine learning, and relates to a remote sensing image target detection method based on a smooth bounding box regression function.
Background
With the rapid development of remote sensing technology, the volume of remote sensing data has grown rapidly, and how to process raw remote sensing images quickly and efficiently so that they can be understood and used has become an important research topic. Target detection is one of the core tasks in remote sensing image understanding: its main purpose is to quickly find and accurately locate targets of interest in a remote sensing image, and it underpins many downstream tasks such as instance segmentation and image understanding. However, detection accuracy on remote sensing images remains limited: a detection is usually counted as correct as long as the intersection over union (IoU) between the predicted box and the ground-truth reference box exceeds 0.5. As algorithm performance improves, targets need to be detected under stricter localization requirements, that is, high-quality detection.
Traditional target detection techniques, when faced with massive and complex remote sensing data, lack sufficient accuracy, robustness and transferability; they cannot solve the problems described above and can hardly meet practical needs, so a more efficient and accurate method is urgently required. Deep learning is currently the most popular foundational technique in artificial intelligence; its strong representation learning ability allows features to be learned automatically from big data, and it is highly robust.
However, current deep-learning algorithms still suffer from inaccurate regression localization and unsatisfactory detection accuracy. In a typical network, a deep neural network extracts features from the image and a detector then performs detection on those features, but the detector is sensitive to feature fluctuations, so its robustness is limited and the regression results are poor.
In such networks, the regression process is mainly realized by a bounding box regression function, whose purpose is to move a candidate box to a position closer to the ground-truth reference box by minimizing the difference between them. The L2 loss used in R-CNN was replaced by the smooth L1 loss in Fast R-CNN. As training progresses, the quality of the candidate boxes gradually improves and they move closer to the ground-truth boxes; the differences become smaller, the regression fluctuates more and is harder to stabilize, and in particular it keeps oscillating around zero so that the regression fails to converge and its accuracy remains low.
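For context, the conventional bounding box parameterization and the smooth L1 loss mentioned above can be sketched as follows. This is a generic Python/NumPy illustration of the Fast R-CNN formulation, not code from the present invention.

```python
import numpy as np

def encode_offsets(p, g):
    """Conventional Fast R-CNN parameterization: offsets (tx, ty, tw, th)
    that map a candidate box p = (px, py, pw, ph) onto a ground-truth box
    g = (gx, gy, gw, gh); boxes are (center x, center y, width, height)."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return np.array([(gx - px) / pw,
                     (gy - py) / ph,
                     np.log(gw / pw),
                     np.log(gh / ph)])

def smooth_l1(x, beta=1.0):
    """Smooth L1 loss, used by Fast R-CNN in place of the L2 loss of R-CNN."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)
```

The regression target offsets shrink toward zero as candidate boxes approach the ground truth, which is the regime addressed by the smooth bounding box regression function introduced below.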
Disclosure of Invention
Based on the above, the invention provides a remote sensing image target detection method based on a smooth bounding box regression function. The aim is to make the regression process, which fluctuates when the difference between the candidate box and the ground-truth reference box becomes too small, more stable, thereby achieving higher regression accuracy and detection accuracy.
The invention provides a target detection method based on a smooth frame regression function, which comprises the following steps:
Step one, image preprocessing: perform the necessary preprocessing on the training images, including enhancement operations such as rotation and mirroring, image normalization and image resizing, and set the hyper-parameters for network training;
Step two, feature extraction: feed the image into a target-detection convolutional neural network to obtain a feature map; feed the feature map into a region proposal network to obtain candidate boxes; then feed the candidate boxes and the feature map together into a region of interest pooling layer to obtain region of interest features;
Step three, classification: feed the region of interest features obtained in step two into a softmax classifier for classification;
Step four, regression: feed the region of interest features obtained in step three into a fully connected layer to obtain predicted offsets, feed the predicted offsets into the smooth bounding box regression function to obtain actual offsets, and correct the candidate boxes to new positions according to the offsets;
Step five, correction: take the bounding box obtained after regression correction of the candidate box as a new candidate box, feed the new candidate box and the feature map together into the region of interest pooling layer to obtain region of interest features, and repeat steps three, four and five until training is finished, yielding a trained network;
Step six: after preprocessing, feed the image to be detected into the trained network to obtain the target detection result.
Further, the smooth bounding box regression function for regression is:
$\operatorname{sgn}(t_x/c_x)\times\left|t_x/c_x\right|^{4/3}\times p_w+p_x=G_x$
$\operatorname{sgn}(t_y/c_y)\times\left|t_y/c_y\right|^{4/3}\times p_h+p_y=G_y$
$\exp\!\left(\operatorname{sgn}(t_w/c_w)\times\left|t_w/c_w\right|^{4/3}\right)\times p_w=G_w$
$\exp\!\left(\operatorname{sgn}(t_h/c_h)\times\left|t_h/c_h\right|^{4/3}\right)\times p_h=G_h$
where sgn denotes the sign function, which ensures that negative values cause no error in the fractional-power operation, and exp is the exponential function; $c_x, c_y, c_w, c_h$ are the weight adjustment values of the regression; $t_x, t_y, t_w, t_h$ are the offsets predicted by the convolutional neural network; $p_x, p_y$ are the coordinates of the center point of the candidate box; $p_w, p_h$ are the width and height of the candidate box; $G_x, G_y$ are the coordinates of the center point of the regression-corrected bounding box; and $G_w, G_h$ are the width and height of the regression-corrected bounding box.
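For illustration, a minimal Python sketch of decoding predicted offsets with this smooth bounding box regression is given below. It is not part of the original disclosure: the function names are placeholders, the default weights (10, 10, 5, 5) are taken from the embodiment described later, and the 4/3 exponent is read as applying to the absolute value of the scaled offset, which is the interpretation consistent with the stated role of sgn.

```python
import numpy as np

def signed_power(x, exponent=4.0 / 3.0):
    # sgn(x) * |x|**(4/3): keeps the sign so that fractional powers of
    # negative offsets are well defined, and flattens the mapping near zero.
    return np.sign(x) * np.abs(x) ** exponent

def smooth_box_decode(t, p, c=(10.0, 10.0, 5.0, 5.0)):
    """Sketch of the smooth bounding-box regression formulas above.
    t = (tx, ty, tw, th): offsets predicted by the network.
    p = (px, py, pw, ph): candidate box (center x, center y, width, height).
    c: regression weight adjustment values.
    Returns the corrected box G = (Gx, Gy, Gw, Gh)."""
    tx, ty, tw, th = t
    px, py, pw, ph = p
    cx, cy, cw, ch = c
    gx = signed_power(tx / cx) * pw + px
    gy = signed_power(ty / cy) * ph + py
    gw = np.exp(signed_power(tw / cw)) * pw
    gh = np.exp(signed_power(th / ch)) * ph
    return gx, gy, gw, gh
```

Because $|x|^{4/3} < |x|$ when $|x| < 1$, the correction applied to a candidate box shrinks as the scaled offset approaches zero, which is the smoothing behaviour near the true box described above.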
Further, the target detection convolutional neural network includes, but is not limited to, Fast RCNN, YOLO v1, YOLO v2, YOLO v3, SSD, FPN, RetinaNet and Cascade RCNN.
The beneficial effect of the method is that constructing the smooth bounding box regression function enhances the stability of the regression process: the regression, which otherwise fluctuates when the difference between the candidate box and the ground-truth reference box becomes too small, is made more stable, the failure caused by continuous oscillation around zero is avoided, and higher detection accuracy is obtained, particularly under high IoU thresholds.
Drawings
FIG. 1 is a graph comparing the tuning ranges of a smooth bounding box regression function and an original bounding box regression function;
FIG. 2 is an enlarged comparison plot of the smoothed bounding box regression function and the original bounding box regression function near zero;
FIG. 3 is a feature map visualization of the convolutional neural network output;
FIG. 4 is a schematic diagram of a regional proposal network architecture;
FIG. 5 is a schematic diagram of a cascade detector;
FIG. 6 is a graph of the detection results of the Cascade RCNN method using the original bounding box regression function;
FIG. 7 is a graph of the detection results of the Cascade RCNN method using the smooth bounding box regression function provided by the present invention;
FIG. 8 is a schematic workflow diagram of the remote sensing image target detection method according to an embodiment of the invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects addressed by the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are merely illustrative and are not intended to limit the invention.
With reference to fig. 1 and 2, the smooth bounding box regression function is described:
$\operatorname{sgn}(t_x/c_x)\times\left|t_x/c_x\right|^{4/3}\times p_w+p_x=G_x$
$\operatorname{sgn}(t_y/c_y)\times\left|t_y/c_y\right|^{4/3}\times p_h+p_y=G_y$
$\exp\!\left(\operatorname{sgn}(t_w/c_w)\times\left|t_w/c_w\right|^{4/3}\right)\times p_w=G_w$
$\exp\!\left(\operatorname{sgn}(t_h/c_h)\times\left|t_h/c_h\right|^{4/3}\right)\times p_h=G_h$
where sgn denotes the sign function, ensuring that negative values cause no error in the fractional-power operation, and exp is the exponential function; $c_x, c_y, c_w, c_h$ are the weight adjustment values of the regression, whose default values are usually (10, 10, 5, 5); $t_x, t_y, t_w, t_h$ are the offsets predicted by the convolutional neural network; $p_x, p_y$ are the coordinates of the center point of the candidate box; $p_w, p_h$ are the width and height of the candidate box; $G_x, G_y$ are the coordinates of the center point of the regression-corrected bounding box; and $G_w, G_h$ are the width and height of the regression-corrected bounding box.
As shown in fig. 1 and 2, the straight line represents the original regression function and the curve represents the modified bounding box function; fig. 2 is an enlargement of fig. 1 near zero. The improved regression function is smoother near the true value, so that after regression the bounding box tends to move toward the true box without easily overshooting it, which improves convergence.
The steps of the embodiment of the present invention will be described in detail below with Cascade RCNN as the basic network structure, as shown in FIG. 8. Cascade RCNN uses three cascaded detectors to achieve target detection.
1. Introduction to data set
The experiments use the DOTA dataset. The original DOTA images are large and contain many objects; in accordance with the requirements of this method, images containing large numbers of densely arranged small targets such as airplanes, ships and vehicles are selected and then cropped to sizes between 600 and 800 pixels. The resulting dataset contains 15070 training images and 2700 test images.
2. Training process
1) Data pre-processing
Before the images are fed into the network, the dataset is first augmented by horizontal mirroring, rotation and similar operations; gray values are then normalized, and the images are scaled to the size set for training, with the shorter side usually set to 600 and the longer side to 1000; finally, images are screened and excluded if they contain no objects.
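A minimal sketch of this preprocessing is shown below for illustration only; the per-image mean and standard deviation normalization is one plausible reading of the gray-value normalization described above, and the actual interpolation step is omitted.

```python
import numpy as np

def preprocess(image, boxes, min_side=600, max_side=1000):
    """Illustrative preprocessing sketch: normalize gray values, then compute
    the scale so the shorter side becomes min_side without the longer side
    exceeding max_side. Augmentation (mirroring, rotation) and the actual
    resize of pixels and boxes are assumed to be done elsewhere."""
    image = image.astype(np.float32)
    image = (image - image.mean()) / (image.std() + 1e-8)   # gray-value normalization
    h, w = image.shape[:2]
    scale = min(min_side / min(h, w), max_side / max(h, w))
    return image, boxes, scale

def keep_sample(boxes):
    """Screen out training images that contain no objects."""
    return len(boxes) > 0
```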
2) Training parameter settings
Training is performed on 4 GPUs with the Caffe2 framework and a ResNet-101 backbone network. The shorter image side is set to 600 and the longer side limited to 1000. SGD with momentum is used, with momentum 0.9, an initial learning rate of 0.01 and a weight-decay (penalty term) coefficient of 0.0001. A stepwise schedule with 360000 iterations is used, with the learning rate decayed to 0.001 and 0.0001 at 240000 and 320000 iterations, respectively.
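For illustration, the stepwise learning-rate schedule listed above can be restated as the following sketch; it is a generic restatement of the hyper-parameters, not the original Caffe2 configuration.

```python
def learning_rate(iteration, base_lr=0.01, steps=(240000, 320000), gamma=0.1):
    """Stepwise schedule matching the settings above: the learning rate starts
    at 0.01 and is multiplied by 0.1 at 240000 and 320000 of 360000 iterations,
    giving 0.001 and 0.0001 respectively."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= gamma
    return lr

# SGD with momentum 0.9 and a weight-decay (penalty term) coefficient of 0.0001
# is then configured in the training framework.
```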
3) Feature extraction
The preprocessed images are fed in turn into the convolutional neural network layers, where convolution, pooling and related operations are applied to extract image features for the subsequent Cascade RCNN detector. Fig. 3 shows a visualization of a feature map output by the convolutional neural network layers.
4) Selection of candidate boxes
The features extracted by the convolutional neural network are input into the region proposal network, where a series of anchors is preset for all positions on the image in a sliding-window manner, as shown in fig. 4. All preset anchors are ranked by foreground confidence, and the anchors with the highest confidence are kept as candidate boxes.
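A simplified sketch of this ranking step follows for illustration; the cutoff value is an assumed parameter, and the non-maximum suppression normally applied in a region proposal network is omitted.

```python
import numpy as np

def select_candidates(anchors, fg_scores, top_n=2000):
    """Rank preset anchors by foreground confidence from the region proposal
    network and keep the highest-scoring ones as candidate boxes."""
    order = np.argsort(-fg_scores)       # descending by foreground score
    keep = order[:top_n]
    return anchors[keep], fg_scores[keep]
```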
5) Iterative regression of cascaded detectors
After the feature map is obtained from the convolutional neural network, it is sent to the detector to detect target objects. Fig. 5 shows the cascade detector structure of Cascade RCNN, where B0 is the candidate region selected by the region proposal network and conv represents the convolutional neural network. The candidate region and the feature map from the convolutional neural network are sent together into the RoI pooling layer to obtain region of interest features, which are then sent into a fully connected layer (H1); the output of the fully connected layer is sent both to a classifier (C1) for classification and to the smooth bounding box regression function provided by the present invention (B1) for fine positional adjustment. The network has three detector stages: the candidate box B1 from the previous stage, fine-tuned by the smooth bounding box regression function provided by the present invention, is taken as the new input to the next stage, until candidate box B3 is obtained. The error between B3 and the ground-truth box is computed as the loss and back-propagated to adjust the parameters of the convolutional neural network. This process is repeated until training is finished.
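A schematic sketch of this three-stage refinement loop is given below for illustration; the roi_pool, fc, classify and regress callables are hypothetical placeholders rather than the actual network interfaces, and smooth_box_decode stands for the smooth bounding box regression function described earlier.

```python
def cascade_forward(feature_map, proposals, stages, smooth_box_decode):
    """Sketch of the cascade: B0 comes from the region proposal network, and
    each detector stage (H1..H3) refines the boxes of the previous stage."""
    boxes = proposals                       # B0
    class_scores = []
    for stage in stages:                    # three detector stages
        roi_feats = stage.roi_pool(feature_map, boxes)
        fc_feats = stage.fc(roi_feats)
        class_scores.append(stage.classify(fc_feats))
        offsets = stage.regress(fc_feats)
        # refine boxes with the smooth bounding-box regression function
        boxes = smooth_box_decode(offsets, boxes)
    return boxes, class_scores              # B3 and per-stage classifications
```

During training, the loss between the final boxes and the ground truth is back-propagated through the stages; at test time, as described in the next section, the classification scores of the stages are combined.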
3. Test procedure
The test image is first preprocessed: it is scaled to the size set by the network and its gray values are normalized. The image is then fed into the convolutional neural network to extract features and obtain a feature map. The feature map is input into the region proposal network to obtain candidate boxes, and the candidate boxes and the feature map are input together into the region of interest pooling layer to obtain region of interest features. These features are fed into the first stage of the cascade detector to obtain regression offsets, from which the corrected bounding box position is computed with the smooth bounding box regression function; the corrected box, as a new candidate box, is input together with the feature map into the region of interest pooling layer to obtain new region of interest features, which are fed into the second-stage detector, and so on until the last stage. The bounding box obtained after regression correction by the last-stage detector is the final bounding box of the network. The region of interest features of the last stage are also input into the classifier of each detector stage, and the classification results of all classifiers are combined to produce the final classification result of the network.
To verify the effectiveness of the proposed remote sensing image target detection method, the DOTA dataset is used for testing. To verify the effect of the smooth bounding box regression function provided by the present invention, the YOLO v2, SSD, Faster RCNN, YOLO v3, RetinaNet, FPN and Cascade RCNN methods are tested: each is first run with its original bounding box regression function, and then run again with the original function replaced by the smooth bounding box regression function provided by the present invention.
Evaluation indexes are as follows:
in the field of machine learning, evaluating the performance of a classifier is generally measured by two quantities, Precision (Precision) and Recall (Recall). To calculate these two indexes, samples can be classified into four categories according to the situation between the real value and the predicted value of the sample: true cases (True Positives, TP): predicting a positive sample as a positive example; false positives (FalsePositives, FP): predicting the positive sample as a negative example; true negative examples (True negotives, TN): predicting negative examples as counter examples; false negative (False Negatives, FN): predicting negative examples as positive examples; these four types of relationships can be clearly represented by a confusion matrix (consisionmatrix).
Precision P and recall R are expressed as:
$P = \dfrac{TP}{TP + FP}, \qquad R = \dfrac{TP}{TP + FN}$
the higher the precision ratio and the recall ratio, the better, but the contradiction exists between the precision ratio and the recall ratio under the general condition, and the precision ratio is lower when the recall ratio is high; and when the precision ratio is high, the recall ratio is low. In general, we sort according to the scores of the classification, calculate the classification samples in sequence from high to low according to the scores, and count the current precision and recall. By taking the recall ratio as the horizontal axis and the precision ratio as the vertical axis, a curve can be made, which is called as a P-R curve. The performance can be reflected to a certain extent by calculating the area enclosed by the P-R curve and the transverse and longitudinal axes, and the performance is better when the area is higher.
In target detection, detector performance is measured by AP and mAP. For single-class target detection, average precision (AP) is generally used as the evaluation index; the AP of a class is the area enclosed by the P-R curve of that class and the axes. In target detection, determining the four sample types TP, FP, TN and FN requires computing IoU between each predicted box and the ground-truth reference box: a detection counts as a positive sample only when its IoU exceeds the set threshold.
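For reference, the IoU between a predicted box and a ground-truth reference box can be computed as in the following generic sketch, assuming corner-coordinate boxes; it is an illustration, not the evaluation code used in the experiments.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)
```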
For single-class target detection, the AP is expressed as the area under the P-R curve:
$AP = \int_0^1 P(R)\,dR$
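A minimal sketch of computing AP as the area under the P-R curve from score-sorted detections is shown below; it is a generic illustration under the matching rule just described, not the evaluation code used in the experiments.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class. Detections are
    sorted by score; is_true_positive marks whether each detection matches a
    ground-truth box with IoU above the chosen threshold (e.g. 0.5 for AP50)."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    tp = np.asarray(is_true_positive, dtype=np.float64)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = np.concatenate(([0.0], cum_tp / max(num_ground_truth, 1)))
    precision = np.concatenate(([1.0], cum_tp / np.maximum(cum_tp + cum_fp, 1e-8)))
    # accumulate precision over increments of recall (area under the P-R curve)
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
```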
in the present embodiment, the target detection accuracy is evaluated by using the respective matching methods using the indices AP, AP50, AP75, AP60, AP70, AP80, AP90, and the like. AP50 refers to the AP value when IoU was set to 0.5, the meaning of the remaining indicators is similar to AP 50. It can be seen that the difficulty is greater when IoU is higher, indicating that the accuracy of target detection is higher.
Fig. 6 and 7 show the target detection results of the original bounding box regression function and of the smooth bounding box regression function provided by the present invention, using Cascade RCNN as the basic network architecture: fig. 6 shows the results with the original function and fig. 7 the results with the smooth function. The localization accuracy of the method of the present invention is clearly higher than that of the original bounding box regression function.
Table 1 compares the detection results of the method of the present invention with the original method under other network architectures. A √ indicates the accuracy obtained with the smooth bounding box regression function provided by the present invention under the given network architecture; the absence of √ indicates that the original bounding box regression function is used.
TABLE 1 comparison of target detection accuracy under various network architectures
[Table 1 is provided as images in the original publication.]
AP denotes the overall average precision index, AP50 the average precision at an IoU threshold of 0.5, AP75 the average precision at an IoU threshold of 0.75, and so on. After adopting the smooth bounding box regression function provided by the present invention, the accuracy of each network architecture improves noticeably at every IoU level, and the improvement is largest at high IoU thresholds, indicating that the smooth bounding box regression function has a clear effect on localization accuracy, especially under high IoU thresholds.
With the technical scheme of the invention, the bounding box regression process in target detection is carried out more effectively, and the detected target boxes are more accurate under high IoU thresholds; compared with the original bounding box regression function, the smooth bounding box regression function provided by the present invention achieves more accurate target detection. The smooth bounding box regression function provided by the invention can be used in any target detection network framework.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (3)

1. A remote sensing image target detection method based on a smooth frame regression function is characterized by comprising the following steps:
Step one, image preprocessing: perform the necessary preprocessing on the training images, including enhancement operations such as rotation and mirroring, image normalization and image resizing, and set the hyper-parameters for network training;
Step two, feature extraction: feed the image into a target-detection convolutional neural network to obtain a feature map; feed the feature map into a region proposal network to obtain candidate boxes; then feed the candidate boxes and the feature map together into a region of interest pooling layer to obtain region of interest features;
Step three, classification: feed the region of interest features obtained in step two into a softmax classifier for classification;
Step four, regression: feed the region of interest features obtained in step three into a fully connected layer to obtain predicted offsets, feed the predicted offsets into the smooth bounding box regression function to obtain actual offsets, and correct the candidate boxes to new positions according to the offsets;
Step five, correction: take the bounding box obtained after regression correction of the candidate box as a new candidate box, feed the new candidate box and the feature map together into the region of interest pooling layer to obtain region of interest features, and repeat steps three, four and five until training is finished, yielding a trained network;
Step six: after preprocessing, feed the image to be detected into the trained network to obtain the target detection result.
2. The remote sensing image target detection method based on a smooth bounding box regression function according to claim 1, wherein the smooth bounding box regression function used for regression is:
$\operatorname{sgn}(t_x/c_x)\times\left|t_x/c_x\right|^{4/3}\times p_w+p_x=G_x$
$\operatorname{sgn}(t_y/c_y)\times\left|t_y/c_y\right|^{4/3}\times p_h+p_y=G_y$
$\exp\!\left(\operatorname{sgn}(t_w/c_w)\times\left|t_w/c_w\right|^{4/3}\right)\times p_w=G_w$
$\exp\!\left(\operatorname{sgn}(t_h/c_h)\times\left|t_h/c_h\right|^{4/3}\right)\times p_h=G_h$
where sgn denotes the sign function, which ensures that negative values cause no error in the fractional-power operation, and exp is the exponential function; $c_x, c_y, c_w, c_h$ are the weight adjustment values of the regression; $t_x, t_y, t_w, t_h$ are the offsets predicted by the convolutional neural network; $p_x, p_y$ are the coordinates of the center point of the candidate box; $p_w, p_h$ are the width and height of the candidate box; $G_x, G_y$ are the coordinates of the center point of the regression-corrected bounding box; and $G_w, G_h$ are the width and height of the regression-corrected bounding box.
3. The remote sensing image target detection method based on a smooth bounding box regression function according to claim 1, wherein the target detection convolutional neural network includes but is not limited to Fast RCNN, YOLO v1, YOLO v2, YOLO v3, SSD, FPN, RetinaNet and Cascade RCNN.
CN202010302996.7A 2020-04-16 2020-04-16 Remote sensing image target detection method based on smooth frame regression function Active CN111553212B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010302996.7A CN111553212B (en) 2020-04-16 2020-04-16 Remote sensing image target detection method based on smooth frame regression function
PCT/CN2020/140022 WO2021208502A1 (en) 2020-04-16 2020-12-28 Remote-sensing image target detection method based on smooth bounding box regression function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010302996.7A CN111553212B (en) 2020-04-16 2020-04-16 Remote sensing image target detection method based on smooth frame regression function

Publications (2)

Publication Number Publication Date
CN111553212A true CN111553212A (en) 2020-08-18
CN111553212B CN111553212B (en) 2022-02-22

Family

ID=72005720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010302996.7A Active CN111553212B (en) 2020-04-16 2020-04-16 Remote sensing image target detection method based on smooth frame regression function

Country Status (2)

Country Link
CN (1) CN111553212B (en)
WO (1) WO2021208502A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132033A (en) * 2020-09-23 2020-12-25 平安国际智慧城市科技股份有限公司 Vehicle type recognition method and device, electronic equipment and storage medium
CN112232180A (en) * 2020-10-14 2021-01-15 上海海洋大学 Night underwater fish target detection method
CN112464769A (en) * 2020-11-18 2021-03-09 西北工业大学 High-resolution remote sensing image target detection method based on consistent multi-stage detection
CN112560682A (en) * 2020-12-16 2021-03-26 重庆守愚科技有限公司 Valve automatic detection method based on deep learning
WO2021208502A1 (en) * 2020-04-16 2021-10-21 中国科学院深圳先进技术研究院 Remote-sensing image target detection method based on smooth bounding box regression function
CN115035552A (en) * 2022-08-11 2022-09-09 深圳市爱深盈通信息技术有限公司 Fall detection method and device, equipment terminal and readable storage medium
CN118196401A (en) * 2024-05-17 2024-06-14 南昌大学 Target detection method, target detection system, storage medium and electronic equipment

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920375B (en) * 2021-11-01 2024-06-25 国网新疆电力有限公司营销服务中心(资金集约中心、计量中心) Fusion characteristic typical load identification method and system based on combination of Faster R-CNN and SVM
CN114707532B (en) * 2022-01-11 2023-05-19 中铁隧道局集团有限公司 Improved Cascade R-CNN-based ground penetrating radar tunnel disease target detection method
CN114792300B (en) * 2022-01-27 2024-02-20 河南大学 X-ray broken needle detection method based on multi-scale attention
CN114529552A (en) * 2022-03-03 2022-05-24 北京航空航天大学 Remote sensing image building segmentation method based on geometric contour vertex prediction
CN114925387B (en) * 2022-04-02 2024-06-07 北方工业大学 Sorting system, method and readable storage medium based on end-edge cloud architecture
CN114757970B (en) * 2022-04-15 2024-03-08 合肥工业大学 Sample balance-based multi-level regression target tracking method and tracking system
CN115170883B (en) * 2022-07-19 2023-03-14 哈尔滨市科佳通用机电股份有限公司 Brake cylinder piston push rod opening pin loss fault detection method
CN116645523B (en) * 2023-07-24 2023-12-01 江西蓝瑞存储科技有限公司 Rapid target detection method based on improved RetinaNet

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020012451A1 (en) * 2000-06-13 2002-01-31 Ching-Fang Lin Method for target detection and identification by using proximity pixel information
CN108052940A (en) * 2017-12-17 2018-05-18 南京理工大学 SAR remote sensing images waterborne target detection methods based on deep learning
CN108564109A (en) * 2018-03-21 2018-09-21 天津大学 A kind of Remote Sensing Target detection method based on deep learning
CN109800755A (en) * 2018-12-14 2019-05-24 中国科学院深圳先进技术研究院 A kind of remote sensing image small target detecting method based on Analysis On Multi-scale Features
CN110211097A (en) * 2019-05-14 2019-09-06 河海大学 Crack image detection method based on fast R-CNN parameter migration
CN110288017A (en) * 2019-06-21 2019-09-27 河北数云堂智能科技有限公司 High-precision cascade object detection method and device based on dynamic structure optimization
CN110956157A (en) * 2019-12-14 2020-04-03 深圳先进技术研究院 Deep learning remote sensing image target detection method and device based on candidate frame selection

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553212B (en) * 2020-04-16 2022-02-22 中国科学院深圳先进技术研究院 Remote sensing image target detection method based on smooth frame regression function

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020012451A1 (en) * 2000-06-13 2002-01-31 Ching-Fang Lin Method for target detection and identification by using proximity pixel information
CN108052940A (en) * 2017-12-17 2018-05-18 南京理工大学 SAR remote sensing images waterborne target detection methods based on deep learning
CN108564109A (en) * 2018-03-21 2018-09-21 天津大学 A kind of Remote Sensing Target detection method based on deep learning
CN109800755A (en) * 2018-12-14 2019-05-24 中国科学院深圳先进技术研究院 A kind of remote sensing image small target detecting method based on Analysis On Multi-scale Features
CN110211097A (en) * 2019-05-14 2019-09-06 河海大学 Crack image detection method based on fast R-CNN parameter migration
CN110288017A (en) * 2019-06-21 2019-09-27 河北数云堂智能科技有限公司 High-precision cascade object detection method and device based on dynamic structure optimization
CN110956157A (en) * 2019-12-14 2020-04-03 深圳先进技术研究院 Deep learning remote sensing image target detection method and device based on candidate frame selection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAI ZHAOWEI et al.: "Cascade R-CNN: Delving into High Quality Object Detection", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021208502A1 (en) * 2020-04-16 2021-10-21 中国科学院深圳先进技术研究院 Remote-sensing image target detection method based on smooth bounding box regression function
CN112132033A (en) * 2020-09-23 2020-12-25 平安国际智慧城市科技股份有限公司 Vehicle type recognition method and device, electronic equipment and storage medium
CN112132033B (en) * 2020-09-23 2023-10-10 平安国际智慧城市科技股份有限公司 Vehicle type recognition method and device, electronic equipment and storage medium
CN112232180A (en) * 2020-10-14 2021-01-15 上海海洋大学 Night underwater fish target detection method
CN112464769A (en) * 2020-11-18 2021-03-09 西北工业大学 High-resolution remote sensing image target detection method based on consistent multi-stage detection
CN112560682A (en) * 2020-12-16 2021-03-26 重庆守愚科技有限公司 Valve automatic detection method based on deep learning
CN115035552A (en) * 2022-08-11 2022-09-09 深圳市爱深盈通信息技术有限公司 Fall detection method and device, equipment terminal and readable storage medium
CN118196401A (en) * 2024-05-17 2024-06-14 南昌大学 Target detection method, target detection system, storage medium and electronic equipment
CN118196401B (en) * 2024-05-17 2024-07-19 南昌大学 Target detection method, target detection system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111553212B (en) 2022-02-22
WO2021208502A1 (en) 2021-10-21

Similar Documents

Publication Publication Date Title
CN111553212B (en) Remote sensing image target detection method based on smooth frame regression function
CN110070074B (en) Method for constructing pedestrian detection model
CN112101430B (en) Anchor frame generation method for image target detection processing and lightweight target detection method
CN110796186A (en) Dry and wet garbage identification and classification method based on improved YOLOv3 network
CN104680542B (en) Remote sensing image variation detection method based on on-line study
CN111723798B (en) Multi-instance natural scene text detection method based on relevance hierarchy residual errors
CN111598156B (en) PM based on multi-source heterogeneous data fusion2.5Prediction method
CN111091095B (en) Method for detecting ship target in remote sensing image
CN111738258A (en) Pointer instrument reading identification method based on robot inspection
CN108446616B (en) Road extraction method based on full convolution neural network ensemble learning
CN109284779A (en) Object detection method based on deep full convolution network
CN114694178A (en) Method and system for monitoring safety helmet in power operation based on fast-RCNN algorithm
CN111860587A (en) Method for detecting small target of picture
CN115953666B (en) Substation site progress identification method based on improved Mask-RCNN
CN114998566A (en) Interpretable multi-scale infrared small and weak target detection network design method
CN115761534A (en) Method for detecting and tracking small target of infrared unmanned aerial vehicle under air background
CN115661569A (en) High-precision fine-grained SAR target detection method
CN114677602A (en) Front-view sonar image target detection method and system based on YOLOv5
CN114529552A (en) Remote sensing image building segmentation method based on geometric contour vertex prediction
CN117333845A (en) Real-time detection method for small target traffic sign based on improved YOLOv5s
CN115861204A (en) Smartphone panel surface defect detection method based on YOLO V5 model
CN110276358A (en) High similarity wooden unit cross section detection method under intensive stacking
CN114155411A (en) Intelligent detection and identification method for small and weak targets
CN114972153A (en) Bridge vibration displacement visual measurement method and system based on deep learning
Liu et al. Infrared dim small target detection based on regional refinement network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant