WO2021208502A1

WO2021208502A1 - Remote-sensing image target detection method based on smooth bounding box regression function

Info

Publication number: WO2021208502A1
Application number: PCT/CN2020/140022
Authority: WO
Inventors: 申原; 刘军; 李洪忠; 郭善昕
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2020-04-16
Filing date: 2020-12-28
Publication date: 2021-10-21
Also published as: CN111553212A; CN111553212B

Abstract

A remote-sensing image target detection method based on a smooth bounding box regression function, comprising: performing the necessary preprocessing on a training image and setting the hyperparameters for network training; inputting the image into a target detection convolutional neural network to obtain a feature map; inputting the feature map into a region proposal network to obtain candidate boxes; sending the candidate boxes and the feature map into a region-of-interest pooling layer to obtain region-of-interest features, and classifying those features in a classifier; sending the region-of-interest features into a fully connected layer to obtain a predicted offset, sending the predicted offset into the smooth bounding box regression function to obtain an actual offset, and correcting the candidate box to a new position; repeating these steps until training is finished; and preprocessing an image to be detected and inputting it into the trained network to obtain a target detection result. The method effectively achieves high-precision bounding box regression and higher-precision target detection under a high IoU threshold.

Description

A remote sensing image target detection method based on a smooth bounding box regression function

Technical field

The invention belongs to the field of image processing and machine learning, and relates to a remote sensing image target detection method based on a smooth bounding box regression function.

Background art

With the rapid development of remote sensing technology, the volume of remote sensing data is rising rapidly. Given increasingly large and complex remote sensing information, how to process raw remote sensing images quickly and efficiently into information that users can understand and use has become an important research topic. Remote sensing image target detection is one of the core tasks in remote sensing image understanding; its main purpose is to quickly find and accurately locate targets of interest in remote sensing images. Target detection is an important task in its own right and is also the basis of many other tasks, such as instance segmentation and image understanding. Previously, however, the detection accuracy on remote sensing images was low, and a detection was counted as correct as long as the intersection over union (IoU) between the predicted box and the ground-truth reference box exceeded 0.5. As algorithm performance improves, targets need to be detected with more precise localization in order to achieve high-quality detection.

Traditional target detection techniques fall short in detection accuracy, robustness, and transferability when facing huge and complex remote sensing information. They cannot solve the problems described above and struggle to meet practical needs, so more efficient and accurate methods are urgently needed. Deep learning is currently among the most popular and cutting-edge foundational artificial intelligence technologies. Its powerful representation learning ability allows it to learn features automatically from big data, and it is highly robust.

However, current deep learning algorithms suffer from problems such as inaccurate regression localization and poor detection accuracy. Typically, the network first uses a deep neural network to extract image features and then applies a detector to those features. Because the detector is sensitive to feature fluctuations, it is not robust enough, which leads to poor regression results.

In the network, the regression process is mainly carried out by the bounding box regression function. Bounding box regression moves the candidate box to a position closer to the ground-truth reference box, and is achieved by minimizing the gap between the candidate box and the ground-truth box. The L2 loss function used in RCNN was replaced by the smooth L1 loss function in Fast RCNN. As training progresses, the quality of the candidate boxes gradually improves and they get closer and closer to the ground-truth box. The gap then becomes small, so fluctuations are relatively larger and it is harder for the regression to stabilize. Near the zero point in particular, continuous oscillation can cause the regression to fail, resulting in low regression accuracy.

Summary of the invention

Based on this, the present invention provides a remote sensing image target detection method based on a smooth bounding box regression function. The technical problem to be solved is to provide a smooth bounding box regression function that stabilizes the regression process in cases where it would otherwise fluctuate because the gap between the candidate box and the ground-truth reference box is small, so as to obtain higher regression accuracy and detection accuracy.

The present invention provides a target detection method based on a smooth bounding box regression function, which includes the following steps:

Step 1. Image preprocessing: perform the necessary preprocessing on the training images, including augmentation operations such as image rotation and mirroring, image normalization, and image resizing, and set the hyperparameters for network training;

Step 2. Feature extraction: input the image into a target detection convolutional neural network to obtain a feature map; then input the feature map into the region proposal network to obtain candidate boxes; then send the candidate boxes and the feature map to the region-of-interest pooling layer to obtain the region-of-interest features;

Step 3. Classification: send the region-of-interest features obtained in step 2 to the softmax classifier for classification;

Step 4. Regression: send the region-of-interest features obtained in step 3 to the fully connected layer to obtain the predicted offset, send the predicted offset to the smooth bounding box regression function to obtain the actual offset, and use the actual offset to correct the candidate box to a new position;

Step 5. Correction: use the bounding box obtained after regression correction as the new candidate box and send it, together with the feature map, to the region-of-interest pooling layer to obtain the region-of-interest features; repeat steps 3, 4, and 5 until the training process ends and a trained network is obtained;

Step 6. Preprocess the image to be detected and input it into the trained network to obtain the target detection result.

Further, the smooth bounding box regression function used for regression is:

$$\operatorname{sgn}\!\left(\frac{t_x}{c_x}\right)\left|\frac{t_x}{c_x}\right|^{4/3} p_w + p_x = G_x$$

$$\operatorname{sgn}\!\left(\frac{t_y}{c_y}\right)\left|\frac{t_y}{c_y}\right|^{4/3} p_h + p_y = G_y$$

$$\exp\!\left(\operatorname{sgn}\!\left(\frac{t_w}{c_w}\right)\left|\frac{t_w}{c_w}\right|^{4/3}\right) p_w = G_w$$

$$\exp\!\left(\operatorname{sgn}\!\left(\frac{t_h}{c_h}\right)\left|\frac{t_h}{c_h}\right|^{4/3}\right) p_h = G_h$$

Here, sgn is the sign function, which ensures that negative values are handled without error; exp is the exponential function; $c_x$, $c_y$, $c_w$, $c_h$ are the weight adjustment values of the regression; $t_x$, $t_y$, $t_w$, $t_h$ are the offsets predicted by the convolutional neural network; $p_x$, $p_y$ are the coordinates of the center point of the candidate box; $p_w$, $p_h$ are the width and height of the candidate box; $G_x$, $G_y$ are the coordinates of the center point of the bounding box after regression correction; and $G_w$, $G_h$ are the width and height of the bounding box after regression correction.

Further, the target detection convolutional neural network includes but is not limited to Faster RCNN, YOLO v1, YOLO v2, YOLO v3, SSD, FPN, RetinaNet, and Cascade RCNN.

The beneficial effect of the present invention is that constructing a smooth bounding box regression function enhances the stability of the regression process. Regression that would otherwise fluctuate because the gap between the candidate box and the ground-truth reference box is too small becomes more stable, which solves the problem of regression failure caused by continuous oscillation near the zero point. This makes detection more accurate under a high IoU threshold and thus yields higher target detection accuracy.

Description of the drawings

Figure 1 is a comparison diagram of the adjustment range of the smooth bounding box regression function and the original bounding box regression function;

Figure 2 is an enlarged comparison diagram of the smooth bounding box regression function and the original bounding box regression function near the zero point;

Figure 3 is a visualization of the feature map output by the convolutional neural network;

Figure 4 is a schematic diagram of the region proposal network structure;

Figure 5 is a schematic diagram of the structure of a cascade detector;

Figure 6 is a detection result diagram of the Cascade RCNN method using the original bounding box regression function;

Figure 7 is a detection result diagram of the Cascade RCNN method using the smooth bounding box regression function provided by the present invention;

Figure 8 is a schematic diagram of the workflow of the remote sensing image target detection method according to an embodiment of the present invention.

Detailed description

In order to make the technical problems solved, the technical solutions, and the beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

The smooth bounding box regression function is explained below with reference to Figure 1 and Figure 2:

$$\operatorname{sgn}\!\left(\frac{t_x}{c_x}\right)\left|\frac{t_x}{c_x}\right|^{4/3} p_w + p_x = G_x$$

$$\operatorname{sgn}\!\left(\frac{t_y}{c_y}\right)\left|\frac{t_y}{c_y}\right|^{4/3} p_h + p_y = G_y$$

$$\exp\!\left(\operatorname{sgn}\!\left(\frac{t_w}{c_w}\right)\left|\frac{t_w}{c_w}\right|^{4/3}\right) p_w = G_w$$

$$\exp\!\left(\operatorname{sgn}\!\left(\frac{t_h}{c_h}\right)\left|\frac{t_h}{c_h}\right|^{4/3}\right) p_h = G_h$$

Here, sgn is the sign function, which ensures that negative values are handled without error; exp is the exponential function; $c_x$, $c_y$, $c_w$, $c_h$ are the weight adjustment values of the regression, which usually default to (10, 10, 5, 5); $t_x$, $t_y$, $t_w$, $t_h$ are the offsets predicted by the convolutional neural network; $p_x$, $p_y$ are the coordinates of the center point of the candidate box; $p_w$, $p_h$ are the width and height of the candidate box; $G_x$, $G_y$ are the coordinates of the center point of the bounding box after regression correction; and $G_w$, $G_h$ are the width and height of the bounding box after regression correction.
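As a minimal sketch of the four equations above, the following Python compares the smooth decoding with a conventional linear decoding. It assumes that the exponent 4/3 is applied to |t/c| with the sign restored by sgn, that boxes are (center x, center y, width, height) tuples, and that the weights default to (10, 10, 5, 5) as stated above. The function names `smooth_term`, `decode_smooth`, and `decode_linear` are illustrative and do not come from the source, and the linear form is the usual Faster RCNN-style decoding, which the source refers to only as the original function.

```python
import math


def smooth_term(t, c):
    """Return sgn(t/c) * |t/c|**(4/3); the sign is restored after the power."""
    u = t / c
    return math.copysign(abs(u) ** (4.0 / 3.0), u)


def decode_smooth(proposal, offsets, weights=(10.0, 10.0, 5.0, 5.0)):
    """Apply the smooth regression function to one candidate box.

    proposal: (p_x, p_y, p_w, p_h); offsets: (t_x, t_y, t_w, t_h);
    weights: (c_x, c_y, c_w, c_h). Returns (G_x, G_y, G_w, G_h).
    """
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    cx, cy, cw, ch = weights
    gx = smooth_term(tx, cx) * pw + px
    gy = smooth_term(ty, cy) * ph + py
    gw = math.exp(smooth_term(tw, cw)) * pw
    gh = math.exp(smooth_term(th, ch)) * ph
    return gx, gy, gw, gh


def decode_linear(proposal, offsets, weights=(10.0, 10.0, 5.0, 5.0)):
    """Conventional linear decoding, shown only for comparison (assumed form)."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    cx, cy, cw, ch = weights
    return (tx / cx * pw + px, ty / cy * ph + py,
            math.exp(tw / cw) * pw, math.exp(th / ch) * ph)


if __name__ == "__main__":
    box = (100.0, 100.0, 50.0, 40.0)
    tiny = (0.1, 0.0, 0.0, 0.0)  # a small predicted x offset
    print("smooth shift:", decode_smooth(box, tiny)[0] - box[0])
    print("linear shift:", decode_linear(box, tiny)[0] - box[0])
```

For a small predicted offset the smooth decoding moves the box center less than the linear decoding does, which is the damping behavior the description attributes to the function near zero.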

In Figure 1 and Figure 2, the straight line represents the original regression function and the curve represents the improved bounding box regression function. Figure 2 is an enlarged view of Figure 1 near zero. The improved regression function is smoother near the true value, so that after regression the box tends to move toward the ground-truth box without easily overshooting it, which improves convergence.
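The behavior near zero can be made concrete by comparing slopes. The following short derivation is a sketch that writes $u = t/c$ and assumes the exponent 4/3 acts on $|u|$; the names $f_{\text{orig}}$ and $f_{\text{smooth}}$ are introduced here for illustration only and do not appear in the source.

$$f_{\text{orig}}(u) = u, \qquad f_{\text{smooth}}(u) = \operatorname{sgn}(u)\,|u|^{4/3}$$

$$f_{\text{orig}}'(u) = 1, \qquad f_{\text{smooth}}'(u) = \tfrac{4}{3}\,|u|^{1/3} \;\to\; 0 \quad \text{as } u \to 0$$

The linear mapping keeps a constant slope of 1, while the slope of the smooth mapping vanishes at the origin, so small fluctuations in the predicted offset are damped. For example, at $u = 10^{-3}$ the linear mapping gives $10^{-3}$, whereas the smooth mapping gives $(10^{-3})^{4/3} = 10^{-4}$, a tenfold smaller adjustment. The two mappings coincide at $|u| = 1$, and the smooth mapping is the larger of the two for $|u| > 1$.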

The following uses Cascade RCNN as the basic network structure to describe in detail the steps of the embodiments of the present invention, as shown in FIG. 8. Cascade RCNN uses three cascaded detectors to achieve target detection.

1. Introduction to the data set

The data are selected from the DOTA data set. The original DOTA data set is large and contains many objects. For the purposes of the present invention, pictures containing large numbers of densely arranged small targets such as airplanes, ships, and cars were selected and then cropped to sizes between 600 and 800, yielding the required data set. The training set contains 15,070 pictures and the test set contains 2,700 pictures.

2. Training process

1) Data preprocessing

Before a picture is sent to the network, operations such as horizontal mirroring and rotation are first applied to augment the data set. The gray values of the picture are then normalized, and the picture is scaled to the size set for training; usually the minimum side is set to 600 and the maximum side to 1000. Finally, the pictures are filtered, and any picture that contains no target is excluded.
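One common way to combine a minimum-side target with a maximum-side limit is sketched below. The source gives only the two numbers (600 and 1000), so the rule for how they interact is an assumption (scale the short side to 600 unless that would push the long side past 1000), and the function names `resize_scale` and `resized_shape` are hypothetical.

```python
def resize_scale(height, width, min_side=600, max_side=1000):
    """Scale factor that brings the short side to min_side,
    reduced if the long side would then exceed max_side (assumed rule)."""
    short_side = min(height, width)
    long_side = max(height, width)
    scale = min_side / short_side
    if long_side * scale > max_side:
        scale = max_side / long_side
    return scale


def resized_shape(height, width, min_side=600, max_side=1000):
    """Integer (height, width) after applying resize_scale."""
    scale = resize_scale(height, width, min_side, max_side)
    return round(height * scale), round(width * scale)


if __name__ == "__main__":
    print(resized_shape(800, 800))    # square crop: short side drives the scale
    print(resized_shape(500, 1500))   # elongated crop: long side limit applies
```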

2) Training parameter settings

Training uses 4 GPUs with the Caffe2 framework and a ResNet-101 backbone network. During training, the minimum side of the image is set to 600 and the maximum side is limited to 1000. The optimizer is SGD with momentum, with the momentum set to 0.9, the initial learning rate set to 0.01, and the penalty term (weight decay) coefficient set to 0.0001. Training is staged over a total of 360,000 iterations, and the learning rate decays to 0.001 and 0.0001 at iterations 240,000 and 320,000, respectively.
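The staged schedule described above can be written as a small step function. This is a framework-independent sketch, not the Caffe2 configuration used by the authors; the function name `learning_rate` is hypothetical, and treating the decay as taking effect exactly at iterations 240,000 and 320,000 is an assumption.

```python
def learning_rate(iteration, base_lr=0.01, steps=(240_000, 320_000), gamma=0.1):
    """Step schedule: multiply base_lr by gamma once for every step boundary
    that the current iteration has reached."""
    drops = sum(1 for step in steps if iteration >= step)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    for it in (0, 239_999, 240_000, 320_000, 359_999):
        print(it, learning_rate(it))
```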

3) Feature extraction

The preprocessed pictures are sent in turn to the convolutional neural network layers, where convolution and pooling operations extract the features of the picture for use by the subsequent Cascade RCNN detectors. Figure 3 shows a visualization of the feature map output by the convolutional neural network layers.

4) Selection of candidate frame

The features extracted by the convolutional neural network are input into the region proposal network. In the region proposal network, a series of anchors is preset over all regions of the picture by means of a sliding window, as shown in Figure 4. All preset anchors are then ranked and filtered by foreground confidence, and the anchors with the highest confidence are kept as candidate boxes.
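The sliding-window anchor layout and the confidence-based filtering can be illustrated as follows. This is a simplified sketch: the stride, anchor sizes, and aspect ratios are assumed values not given in the source, the names `generate_anchors` and `select_proposals` are hypothetical, and steps that a full region proposal network would also perform (box refinement and non-maximum suppression) are omitted.

```python
from itertools import product


def generate_anchors(feat_h, feat_w, stride=16,
                     sizes=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Place one anchor per (size, ratio) pair at every feature-map cell.

    Anchors are (center x, center y, width, height) in image coordinates.
    ratio is width / height; each anchor keeps an area of size ** 2.
    """
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for size, ratio in product(sizes, ratios):
            w = size * ratio ** 0.5
            h = size / ratio ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors


def select_proposals(anchors, scores, top_k):
    """Keep the top_k anchors ranked by foreground confidence."""
    order = sorted(range(len(anchors)), key=lambda i: scores[i], reverse=True)
    return [anchors[i] for i in order[:top_k]]


if __name__ == "__main__":
    anchors = generate_anchors(2, 3)
    print(len(anchors), "anchors on a 2 x 3 feature map")
```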

5) Iterative regression of cascaded detectors

After the feature map is obtained from the convolutional neural network, it is sent to the detectors to detect target objects. Figure 5 shows the structure of the Cascade RCNN cascade detector, where B0 is the candidate region selected by the region proposal network and conv denotes the convolutional neural network. The candidate region is sent, together with the feature map obtained from the convolutional neural network, into the RoI pooling layer to obtain the region-of-interest features. These features are sent to the fully connected layer (H1), and the output of the fully connected layer is sent to the classifier (C1) for classification and to the smooth bounding box regression function provided by the present invention (B1) for fine-tuning of the localization. The network has three detectors. The candidate box B1, fine-tuned by the smooth bounding box regression function in the previous stage, is used as the new input to the detector of the next stage, and so on until the candidate box B3 is obtained. The error between B3 and the ground-truth box is computed as the loss and backpropagated to adjust the parameters of the convolutional neural network. This process is repeated until training ends.

3. Test process

First, the test picture is preprocessed: it is scaled to the size set by the network and its gray values are normalized. The pictures are then sent in turn to the convolutional neural network to extract feature maps. The extracted feature map is input into the region proposal network to obtain candidate boxes, and the candidate boxes and the feature map are input together into the region-of-interest pooling layer to obtain the region-of-interest features. These features are sent to the first stage of the cascade detector to obtain the regression offsets, from which the corrected bounding box position is calculated with the smooth bounding box regression function. This bounding box is used as a new candidate box and input, together with the feature map, into the region-of-interest pooling layer to obtain new region-of-interest features, which are input to the second-stage detector. This operation is repeated up to the last-stage detector. The bounding box obtained by the last-stage detector after bounding box regression correction is the final bounding box of the network. At the same time, the region-of-interest features of the last stage are input into the classifier of every stage of the detector to obtain classification results, and the results of all the classifiers are combined to obtain the final classification result of the network.
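The inference loop just described can be sketched for a single candidate box as follows. Several parts are assumptions made for illustration: each stage is modeled as a pair of hypothetical callables `(regress, classify)` that take a box and stand in for RoI pooling plus the stage head; the exponent 4/3 is applied to |t/c| with the sign restored afterwards; and the per-stage classification results are combined by simple averaging, whereas the source says only that they are combined.

```python
import math


def smooth_decode(box, offsets, weights=(10.0, 10.0, 5.0, 5.0)):
    """Smooth bounding box regression for one (cx, cy, w, h) box."""
    def term(t, c):
        u = t / c
        return math.copysign(abs(u) ** (4.0 / 3.0), u)

    px, py, pw, ph = box
    tx, ty, tw, th = offsets
    cx, cy, cw, ch = weights
    return (term(tx, cx) * pw + px, term(ty, cy) * ph + py,
            math.exp(term(tw, cw)) * pw, math.exp(term(th, ch)) * ph)


def cascade_inference(proposal, stages):
    """Refine one proposal through every stage, then combine class scores.

    stages: list of (regress, classify) callables; regress(box) returns
    (t_x, t_y, t_w, t_h) and classify(box) returns a list of class scores.
    """
    box = proposal
    last_input = proposal
    for regress, _classify in stages:
        last_input = box                      # box whose features feed this stage
        box = smooth_decode(box, regress(box))
    # Score the last stage's input with every stage's classifier and average.
    per_stage = [classify(last_input) for _regress, classify in stages]
    scores = [sum(column) / len(per_stage) for column in zip(*per_stage)]
    return box, scores


if __name__ == "__main__":
    stages = [
        (lambda b: (10.0, 0.0, 0.0, 0.0), lambda b: [0.2, 0.8]),
        (lambda b: (10.0, 0.0, 0.0, 0.0), lambda b: [0.4, 0.6]),
    ]
    print(cascade_inference((0.0, 0.0, 10.0, 10.0), stages))
```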

To verify the effectiveness of the remote sensing image target detection method of the present invention, the DOTA data set described above is used for testing. To verify the effect of the smooth bounding box regression function provided by the present invention, the YOLO v2, SSD, Faster RCNN, YOLO v3, RetinaNet, FPN, and Cascade RCNN methods are tested. Each method is first evaluated with its original bounding box regression function; the original function is then replaced with the smooth bounding box regression function provided by the present invention and the method is evaluated again.

Evaluation index:

In the field of machine learning, the performance of a classifier is generally measured by two quantities: precision and recall. To calculate these two indicators, the samples are divided into four categories according to their true and predicted values: true positives (TP), positive samples predicted as positive; false positives (FP), negative samples predicted as positive; true negatives (TN), negative samples predicted as negative; and false negatives (FN), positive samples predicted as negative. A confusion matrix presents the relationship among these four categories clearly.

The precision P and recall R are expressed as:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
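As a small worked example of these two quantities, the sketch below computes precision and recall from raw counts. The function name `precision_recall` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the source.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # 8 correct detections, 2 spurious detections, 8 missed targets
    print(precision_recall(tp=8, fp=2, fn=8))
```

With 8 true positives, 2 false positives, and 8 false negatives, precision is 8/10 and recall is 8/16: most reported detections are correct, but half of the targets are missed.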

The higher the precision and the recall, the better, but in general the two are in tension: when recall is high, precision tends to be lower, and when precision is high, recall tends to be lower. In practice, the samples are sorted by classification score and processed in order from the highest score to the lowest, computing the current precision and recall at each step. Plotting recall on the horizontal axis and precision on the vertical axis gives the "P-R curve". The area enclosed by the P-R curve and the two axes reflects performance to a certain extent: the larger the area, the better the performance.

The performance of a detector in target detection is measured by AP and mAP. For single-class target detection, average precision (AP) is usually taken as the evaluation index; the AP of a class is the area enclosed by the P-R curve of that class and the horizontal and vertical axes. In target detection, determining the four sample types TP, FP, TN, and FN requires calculating the IoU between each predicted box and the ground-truth reference box. Only when the IoU is greater than the set threshold is the sample judged to be a positive sample.
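The IoU test that decides whether a prediction counts as positive can be sketched as below. Boxes are assumed to be (center x, center y, width, height) tuples to match the notation used for the regression function, and the function name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Convert to corner coordinates and measure the overlap on each axis.
    inter_w = max(0.0, min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2))
    inter_h = max(0.0, min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    prediction = (1.0, 0.0, 2.0, 2.0)
    ground_truth = (0.0, 0.0, 2.0, 2.0)
    overlap = iou(prediction, ground_truth)
    print(overlap, "positive at 0.5:", overlap > 0.5)
```

A prediction shifted by half its width against an equally sized ground-truth box has an IoU of 1/3, so it already fails a 0.5 threshold; this is why small localization errors matter under higher IoU thresholds.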

For single-class target detection, AP is expressed as:

$$AP = \int_0^1 P(R)\,\mathrm{d}R$$
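The area under the P-R curve can be approximated from a ranked list of detections. The sketch below uses a plain step-wise sum without interpolation; benchmark toolkits use various interpolated variants, so this is one possible convention rather than the one used in the reported experiments. The function name `average_precision` and its inputs are hypothetical: each detection carries a score and a flag saying whether it matched a ground-truth box at the chosen IoU threshold.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Step-wise area under the precision-recall curve for one class."""
    if num_ground_truth == 0:
        return 0.0
    # Visit detections from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        ap += (recall - prev_recall) * precision  # area of this recall step
        prev_recall = recall
    return ap


if __name__ == "__main__":
    print(average_precision([0.9, 0.8, 0.7], [True, False, True], 2))
```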

In this embodiment, the indicators AP, AP50, AP75, AP60, AP70, AP80, and AP90 are used to evaluate the target detection accuracy of each method compared. AP50 is the AP value when the IoU threshold is set to 0.5, and the other indicators are defined analogously. The higher the IoU threshold, the more precise the localization required of a detection and the greater the difficulty.

Figures 6 and 7 show the target detection results obtained with Cascade RCNN as the basic network architecture, using the original bounding box regression function and the smooth bounding box regression function provided by the present invention, respectively. Figure 6 is the result with the original bounding box regression function, and Figure 7 is the result with the smooth bounding box regression function provided by the present invention. It can be clearly seen that the localization accuracy of the method of the present invention is higher than that obtained with the original bounding box regression function.

Table 1 compares the detection results of the method of the present invention with those of the original methods under other network architectures. A √ indicates that the accuracy was obtained with the smooth bounding box regression function provided by the present invention under the given network architecture; the absence of a √ indicates that the original bounding box regression function was used.

Table 1 Comparison of target detection accuracy under each network architecture

AP denotes the overall average precision index, AP50 denotes the average precision at an IoU threshold of 0.5, AP75 denotes the average precision at an IoU threshold of 0.75, and likewise for the other indicators. After the smooth bounding box regression function provided by the present invention is adopted, the accuracy of every network architecture improves significantly at every IoU level, and the improvement is larger under high IoU thresholds. This indicates that the smooth bounding box regression function provided by the present invention is effective in improving localization accuracy, particularly at high IoU thresholds.

With the technical solution of the present invention, the bounding box regression process in target detection can be better realized on the basis of the smooth bounding box regression function. Under a high IoU threshold, the detected target boxes are more accurate. Compared with the original bounding box regression function, the smooth bounding box regression function provided by the present invention achieves higher-precision target detection, and it can be used in any target detection network framework.

The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included within the scope of protection of the present invention.

Claims

A remote sensing image target detection method based on a smooth bounding box regression function, characterized in that it includes the following steps:

Step 1. Image preprocessing: perform the necessary preprocessing on the training images, including augmentation operations such as image rotation and mirroring, image normalization, and image resizing, and set the hyperparameters for network training;

Step 2. Feature extraction: input the image into a target detection convolutional neural network to obtain a feature map; then input the feature map into the region proposal network to obtain candidate boxes; then send the candidate boxes and the feature map to the region-of-interest pooling layer to obtain the region-of-interest features;

Step 3. Classification: send the region-of-interest features obtained in step 2 to the softmax classifier for classification;

Step 4. Regression: send the region-of-interest features obtained in step 3 to the fully connected layer to obtain the predicted offset, send the predicted offset to the smooth bounding box regression function to obtain the actual offset, and use the actual offset to correct the candidate box to a new position;

Step 5. Correction: use the bounding box obtained after regression correction as the new candidate box and send it, together with the feature map, to the region-of-interest pooling layer to obtain the region-of-interest features; repeat steps 3, 4, and 5 until the training process ends and a trained network is obtained;

Step 6. Preprocess the image to be detected and input it into the trained network to obtain the target detection result.
The remote sensing image target detection method based on a smooth bounding box regression function according to claim 1, wherein the smooth bounding box regression function used for regression is:

$$\operatorname{sgn}\!\left(\frac{t_x}{c_x}\right)\left|\frac{t_x}{c_x}\right|^{4/3} p_w + p_x = G_x$$

$$\operatorname{sgn}\!\left(\frac{t_y}{c_y}\right)\left|\frac{t_y}{c_y}\right|^{4/3} p_h + p_y = G_y$$

$$\exp\!\left(\operatorname{sgn}\!\left(\frac{t_w}{c_w}\right)\left|\frac{t_w}{c_w}\right|^{4/3}\right) p_w = G_w$$

$$\exp\!\left(\operatorname{sgn}\!\left(\frac{t_h}{c_h}\right)\left|\frac{t_h}{c_h}\right|^{4/3}\right) p_h = G_h$$

Here, sgn is the sign function, which ensures that negative values are handled without error; exp is the exponential function; $c_x$, $c_y$, $c_w$, $c_h$ are the weight adjustment values of the regression; $t_x$, $t_y$, $t_w$, $t_h$ are the offsets predicted by the convolutional neural network; $p_x$, $p_y$ are the coordinates of the center point of the candidate box; $p_w$, $p_h$ are the width and height of the candidate box; $G_x$, $G_y$ are the coordinates of the center point of the bounding box after regression correction; and $G_w$, $G_h$ are the width and height of the bounding box after regression correction.
The remote sensing image target detection method based on a smooth bounding box regression function according to claim 1, wherein the target detection convolutional neural network includes but is not limited to Faster RCNN, YOLO v1, YOLO v2, YOLO v3, SSD, FPN, RetinaNet, and Cascade RCNN.