CN114170526A - Remote sensing image multi-scale target detection and identification method based on lightweight network - Google Patents

Remote sensing image multi-scale target detection and identification method based on lightweight network

Info

Publication number
CN114170526A
Authority
CN
China
Prior art keywords
feature
target
candidate
convolution
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111388223.6A
Other languages
Chinese (zh)
Inventor
蒋丽婷
张志超
喻金桃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202111388223.6A priority Critical patent/CN114170526A/en
Publication of CN114170526A publication Critical patent/CN114170526A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image multi-scale target detection and identification method based on a lightweight network, which comprises the following steps: preprocessing the acquired remote sensing image; replacing the standard convolution operations in the feature extraction trunk and the detection branches with depthwise separable convolutions, and inputting the processed image into a convolutional neural network for feature extraction; extracting feature maps of different scales from different layers of the convolutional neural network with a feature pyramid for prediction, and fusing the detection results of different layers to obtain a multi-scale fused feature map; and setting candidate frames on the feature map, generating a prediction tensor, and predicting, on the basis of the prediction tensor, the confidence of the category to which the target belongs and its position information. By introducing depthwise separable convolution, model parameters are reduced and the network detection speed is improved; feature maps of multiple scales are extracted to meet the detection requirements of targets of different scales; and a high-level feature map with strong semantic information is fused with a low-level feature map with strong geometric information, improving small target detection performance.

Description

Remote sensing image multi-scale target detection and identification method based on lightweight network
Technical Field
The invention relates to the technical field of remote sensing image detection, in particular to a remote sensing image multi-scale target detection and identification method based on a lightweight network.
Background
At present, with the continuous development of remote sensing technology, the resolution of remote sensing images, including time resolution, spatial resolution, radiation resolution and spectral resolution, is continuously improved, high-quality remote sensing images are gradually and widely applied to the military and civil fields, and the target detection of remote sensing images based on deep learning also gradually becomes a research hotspot. In order to realize target detection in remote sensing images, researchers have proposed a plurality of valuable target detection methods one after another, and currently, mainstream detection algorithms can be generally classified into 2 types: a staged detection method and an end-to-end detection method.
The first type is a staged detection method, which generally obtains a candidate region through methods such as a sliding window and the like, extracts features to train a classifier, and judges whether a candidate frame contains a target through the classifier. At present, many target detection algorithms in the field of remote sensing are realized based on the method, and the method has the advantage of higher detection precision.
The second type is the end-to-end target detection algorithm based on regression, which effectively combines the candidate frame extraction stage and the category prediction stage; both YOLO and SSD complete target detection by regression, greatly increasing the speed of deep-learning-based target detection.
However, the defects of existing remote sensing image target detection methods mainly include the following. The staged detection method involves multiple stages, its implementation is complex, a large number of redundant candidate regions are extracted by the sliding window, the detection speed is slow, and it is difficult to meet the real-time requirements of remote sensing image processing. The end-to-end regression-based target detection algorithms are mainly designed for natural scenes, whereas remote sensing images cover a wide area, exhibit large differences in target scale, contain a high proportion of small targets, and have low resolution. Therefore, such methods cannot be directly applied to remote sensing images, and suffer from poor multi-scale target extraction capability, missed detection of small targets, and the like.
Therefore, how to introduce depthwise separable convolution to reduce model parameters and improve network detection speed, extract feature maps of multiple scales to meet the detection requirements of targets of different scales, and fuse high-level feature maps with strong semantic information and low-level feature maps with strong geometric information to improve small target detection performance is a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In view of the above, the invention provides a remote sensing image multi-scale target detection and identification method based on a lightweight network, which reduces model parameters and improves network detection speed by introducing depth separable convolution; extracting a plurality of scale feature maps to meet the detection requirements of targets with different scales; and a high-level feature map with strong semantic information features and a bottom-level feature map with strong geometric information features are fused, so that the small target detection performance is improved.
In order to achieve the purpose, the invention adopts the following technical scheme:
a remote sensing image multi-scale target detection and identification method based on a lightweight network comprises the following steps:
s1, preprocessing the acquired remote sensing image;
S2, replacing the standard convolution operations in the feature extraction trunk and the detection branches with depthwise separable convolutions, and inputting the processed image into a convolutional neural network for feature extraction;
S3, extracting feature maps of different scales from different layers of the convolutional neural network with a feature pyramid for prediction, and fusing the detection results of different layers to obtain a multi-scale fused feature map;
and S4, setting candidate frames on the feature maps, generating prediction tensors, and predicting the confidence of the category to which the target belongs and its position information on the basis of the prediction tensors.
Preferably, the step S2 specifically includes:
S21, applying a convolution with 32 3 × 3 convolution kernels at the initial layer to increase the number of network layers and improve the feature expression capability;
S22, alternately stacking block1 and block2 modules, wherein each block consists of a depthwise convolution and a pointwise convolution, batch normalization is applied after each convolution followed by a ReLU layer, the depthwise convolution stride of block1 is 1, and the depthwise convolution stride of block2 is 2;
and S23, after feature extraction through 13 convolution blocks, deconvolving the conv13 feature map and fusing it with the conv5 feature map, sending the fused feature map, the conv11 feature map and the conv13 feature map simultaneously into the target detection module for coordinate regression and classification, performing non-maximum suppression on the detection results of the multi-scale feature maps, and screening out the final result to complete feature extraction.
Preferably, the step S3 specifically includes:
S31, calculating the feature map receptive field

R_k = R_{k-1} + (K_k - 1) · ∏_{i=1}^{k-1} s_i

and selecting an appropriate feature map according to

f(x) = k, where R_{k-1} < x ≤ R_k

where R_k denotes the receptive field size of the k-th layer, the initial receptive field R_0 = 1, K_k denotes the convolution kernel size of the k-th layer, s_k denotes the convolution stride of the k-th layer, m is the total number of feature map layers, x is the sample length, and f(x) is the feature map corresponding to the sample length;
S32, dividing the samples into 3 scale levels, namely large, medium and small, and finally selecting 3 feature maps of different sizes accordingly;
S33, using a 2 × 2 deconvolution kernel with stride 2 to up-sample the 38 × 38 high-semantic-information feature map, applying batch normalization and a ReLU layer to the deconvolution output so that its resolution matches that of the 10 × 10 low-semantic-information feature map, concatenating the two feature maps into one multi-channel feature map, extracting features from the concatenated multi-channel feature map by multi-channel convolution, and realizing feature fusion with a 3 × 3 × 256 convolution kernel.
Preferably, the step S4 specifically includes:
S41, selecting the ratio of the effective receptive field to the theoretical receptive field as 1/3, and calculating the candidate frame size as

S_k = R_k / 3, k ∈ [1, m]

where R_k is the receptive field size of the k-th layer and m is the total number of feature map layers;

determining the aspect ratios of the candidate frames and calculating the size on the original image corresponding to each feature map candidate frame; when the aspect ratio is 1, two additional sizes, S_k and S'_k = sqrt(S_k · S_{k+1}), are added so that there are 6 candidate frames per feature map,

w_k = S_k · sqrt(r), h_k = S_k / sqrt(r)

where r is the aspect ratio coefficient, and w_k and h_k are respectively the length and width of the candidate frame of the k-th layer feature map;
S42, establishing a correspondence between the real labels and the candidate frames, wherein candidate frames are selected for the real labels according to the following matching principles: the candidate frame with the maximum intersection-over-union (IoU) with a real target in the image is matched to that target so that every real target is covered by a candidate frame, a candidate frame covering a real target is a positive sample, and a candidate frame without a matched target is a negative sample; when a candidate frame matches multiple real targets, the target with the largest IoU is selected;
S43, training the model with the loss function of the SSD, which is a weighted sum of the confidence error and the position error:

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))

where x indicates whether a candidate frame matches the target, 1 for a match and 0 for a mismatch; c is the predicted multi-class target confidence; l is the predicted position of the bounding box corresponding to the candidate frame; g is the position parameter of the real target; N is the number of candidate frames matched to real targets; and α is a parameter adjusting the ratio between the position error and the confidence error, usually set to 1;
for the position error, the Smooth_L1 loss between the real target and the candidate frame is used, defined as follows:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_{ij}^k · smooth_L1(l_i^m - ĝ_j^m)

ĝ_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^w, ĝ_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^h

ĝ_j^w = log(g_j^w / d_i^w), ĝ_j^h = log(g_j^h / d_i^h)

where x_{ij}^k indicates that the i-th candidate frame is matched to the j-th real target of category k; smooth_L1(·) denotes the smoothed L1 norm; i ∈ Pos denotes the i-th positive-sample prediction frame; cx, cy, w and h denote respectively the center coordinates and the width and height of the candidate frame; ĝ is obtained by encoding g; l denotes the offset of the prediction frame relative to the candidate frame; g denotes the real frame; and d denotes the candidate frame;
the confidence error is the softmax loss over the multi-class confidence c,

L_conf(x, c) = -Σ_{i∈Pos} x_{ij}^p · log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)

where i ∈ Neg denotes the i-th negative-sample prediction frame; x_{ij}^p is an indicator parameter which, when equal to 1, indicates that the i-th candidate frame matches the j-th real target and that the real target belongs to category p; ĉ_i^p is the predicted probability that the i-th candidate frame matched to the j-th real target of category p belongs to p, and the higher this probability the smaller the loss; ĉ_i^0 is the probability, generated by softmax, that the prediction frame contains no target and is background, and the higher this probability the smaller the loss.
According to the technical scheme, compared with the prior art, the remote sensing image multi-scale target detection and identification method based on the lightweight network is disclosed, and by introducing the depth separable convolution, the model parameters are reduced, and the network detection speed is increased; extracting a plurality of scale feature maps to meet the detection requirements of targets with different scales; and a high-level feature map with strong semantic information features and a bottom-level feature map with strong geometric information features are fused, so that the small target detection performance is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic diagram of a convolution process provided by the present invention.
FIG. 2 is a schematic diagram of a flow structure of the method provided by the present invention.
Fig. 3 is a schematic diagram of a network structure provided by the present invention.
Fig. 4 is a schematic structural diagram of a fusion module provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a remote sensing image multi-scale target detection and identification method based on a lightweight network, which comprises the following steps:
s1, preprocessing the acquired remote sensing image;
S2, replacing the standard convolution operations in the feature extraction trunk and the detection branches with depthwise separable convolutions, and inputting the processed image into a convolutional neural network for feature extraction;
S3, extracting feature maps of different scales from different layers of the convolutional neural network with a feature pyramid for prediction, and fusing the detection results of different layers to obtain a multi-scale fused feature map;
and S4, setting candidate frames on the feature maps, generating prediction tensors, and predicting the confidence of the category to which the target belongs and its position information on the basis of the prediction tensors.
To further optimize the above technical solution, step S2 specifically includes:
S21, applying a convolution with 32 3 × 3 convolution kernels at the initial layer to increase the number of network layers and improve the feature expression capability;
S22, alternately stacking block1 and block2 modules, wherein each block consists of a depthwise convolution and a pointwise convolution, batch normalization is applied after each convolution followed by a ReLU layer, the depthwise convolution stride of block1 is 1, and the depthwise convolution stride of block2 is 2;
and S23, after feature extraction through 13 convolution blocks, deconvolving the conv13 feature map and fusing it with the conv5 feature map, sending the fused feature map, the conv11 feature map and the conv13 feature map simultaneously into the target detection module for coordinate regression and classification, performing non-maximum suppression on the detection results of the multi-scale feature maps, and screening out the final result to complete feature extraction.
To further optimize the above technical solution, step S3 specifically includes:
S31, calculating the feature map receptive field

R_k = R_{k-1} + (K_k - 1) · ∏_{i=1}^{k-1} s_i

and selecting an appropriate feature map according to

f(x) = k, where R_{k-1} < x ≤ R_k

where R_k denotes the receptive field size of the k-th layer, the initial receptive field R_0 = 1, K_k denotes the convolution kernel size of the k-th layer, s_k denotes the convolution stride of the k-th layer, m is the total number of feature map layers, x is the sample length, and f(x) is the feature map corresponding to the sample length;
S32, dividing the samples into 3 scale levels, namely large, medium and small, and finally selecting 3 feature maps of different sizes accordingly;
S33, using a 2 × 2 deconvolution kernel with stride 2 to up-sample the 38 × 38 high-semantic-information feature map, applying batch normalization and a ReLU layer to the deconvolution output so that its resolution matches that of the 10 × 10 low-semantic-information feature map, concatenating the two feature maps into one multi-channel feature map, extracting features from the concatenated multi-channel feature map by multi-channel convolution, and realizing feature fusion with a 3 × 3 × 256 convolution kernel.
To further optimize the above technical solution, step S4 specifically includes:
S41, selecting the ratio of the effective receptive field to the theoretical receptive field as 1/3, and calculating the candidate frame size as

S_k = R_k / 3, k ∈ [1, m]

where R_k is the receptive field size of the k-th layer and m is the total number of feature map layers;

determining the aspect ratios of the candidate frames and calculating the size on the original image corresponding to each feature map candidate frame; when the aspect ratio is 1, two additional sizes, S_k and S'_k = sqrt(S_k · S_{k+1}), are added so that there are 6 candidate frames per feature map,

w_k = S_k · sqrt(r), h_k = S_k / sqrt(r)

where r is the aspect ratio coefficient, and w_k and h_k are respectively the length and width of the candidate frame of the k-th layer feature map;
S42, establishing a correspondence between the real labels and the candidate frames, wherein candidate frames are selected for the real labels according to the following matching principles: the candidate frame with the maximum intersection-over-union (IoU) with a real target in the image is matched to that target so that every real target is covered by a candidate frame, a candidate frame covering a real target is a positive sample, and a candidate frame without a matched target is a negative sample; when a candidate frame matches multiple real targets, the target with the largest IoU is selected;
S43, training the model with the loss function of the SSD, which is a weighted sum of the confidence error and the position error:

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))

where x indicates whether a candidate frame matches the target, 1 for a match and 0 for a mismatch; c is the predicted multi-class target confidence; l is the predicted position of the bounding box corresponding to the candidate frame; g is the position parameter of the real target; N is the number of candidate frames matched to real targets; and α is a parameter adjusting the ratio between the position error and the confidence error, usually set to 1;
for the position error, the Smooth_L1 loss between the real target and the candidate frame is used, defined as follows:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_{ij}^k · smooth_L1(l_i^m - ĝ_j^m)

ĝ_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^w, ĝ_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^h

ĝ_j^w = log(g_j^w / d_i^w), ĝ_j^h = log(g_j^h / d_i^h)

where x_{ij}^k indicates that the i-th candidate frame is matched to the j-th real target of category k; smooth_L1(·) denotes the smoothed L1 norm; i ∈ Pos denotes the i-th positive-sample prediction frame; cx, cy, w and h denote respectively the center coordinates and the width and height of the candidate frame; ĝ is obtained by encoding g; l denotes the offset of the prediction frame relative to the candidate frame; g denotes the real frame; and d denotes the candidate frame;
the confidence error is the softmax loss over the multi-class confidence c,

L_conf(x, c) = -Σ_{i∈Pos} x_{ij}^p · log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)

where i ∈ Neg denotes the i-th negative-sample prediction frame; x_{ij}^p is an indicator parameter which, when equal to 1, indicates that the i-th candidate frame matches the j-th real target and that the real target belongs to category p; ĉ_i^p is the predicted probability that the i-th candidate frame matched to the j-th real target of category p belongs to p, and the higher this probability the smaller the loss; ĉ_i^0 is the probability, generated by softmax, that the prediction frame contains no target and is background, and the higher this probability the smaller the loss.
1. Feature extraction
With the great success of convolutional neural networks in the field of computer vision, their depth and width have continuously increased, making the computation and model size of neural networks so large that the real-time requirements of remote sensing image target detection are difficult to meet; the compression and acceleration of neural networks have therefore gradually become a research hotspot. The invention introduces depthwise separable convolution to replace the standard convolution, greatly reducing the network parameters so that the network detection speed reaches a real-time level.
(1) Lightweighting method
The standard convolution and depthwise separable convolution processes are shown in fig. 1(a) and (b) respectively, and the parameter quantities of the two convolutions are given by formulas (2) and (3), where D_k is the convolution kernel size, M is the number of input channels and N is the number of output channels. The standard convolution parameter quantity is:

D_k × D_k × M × N (2)

The depthwise separable convolution parameter quantity is:

D_k × D_k × M + M × N (3)
The ratio of the depthwise separable convolution parameters to the standard convolution parameters is:

(D_k × D_k × M + M × N) / (D_k × D_k × M × N) = 1/N + 1/D_k² (4)

The depthwise separable convolution first applies a depthwise convolution to each input channel separately and then combines the outputs with a pointwise convolution; while preserving the overall effect, it reduces the parameter quantity to 1/N + 1/D_k² of the original, achieving a lightweight model. Moreover, because a large number of 1 × 1 convolutions are used, the convolution can be completed directly by highly optimized matrix multiplication, which reduces memory reorganization and greatly improves operational efficiency. Therefore, the invention replaces the standard convolution operations in the feature extraction trunk and the detection branches with depthwise separable convolutions, greatly reducing network parameters and enabling a real-time detection effect.
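For illustration only, a minimal sketch of such a depthwise separable convolution block, assuming a PyTorch-style implementation (the module and parameter names are not taken from the patent), is given below; the comments also check the parameter-count comparison of formulas (2)-(4) with example values.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution over each input channel, then a 1x1 pointwise
    convolution to combine channels; BN + ReLU after each convolution,
    matching the block1 (stride 1) / block2 (stride 2) description."""
    def __init__(self, in_channels, out_channels, stride=1, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# Parameter-count check for formulas (2)-(4) with example values
# D_k = 3, M = 32, N = 64: standard = 3*3*32*64 = 18432,
# separable = 3*3*32 + 32*64 = 2336, ratio = 2336/18432 ~ 0.127 = 1/64 + 1/9.
```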
(2) Lightweight feature extraction trunks
The invention adopts an SSD (Single Shot MultiBox Detector) detection framework, which mainly comprises two parts, feature extraction and target detection; the feature extraction part contains 13 convolution blocks, and the overall architecture is shown in fig. 3. First, a convolution with 32 3 × 3 kernels is applied at the initial layer to increase the network depth and improve the feature expression capability. The network is then built by alternately stacking block1 and block2 modules, each consisting of a depthwise convolution and a pointwise convolution, with batch normalization after each convolution followed by a ReLU layer; the depthwise convolution stride of block1 is 1 and that of block2 is 2. After feature extraction through the 13 convolution blocks, the conv13 feature map is deconvolved and fused with conv5, so that the fused feature map retains high-level semantic information while also containing low-level geometric information. The fused feature map, the conv11 feature map and the conv13 feature map are then sent simultaneously to the target detection module for coordinate regression and classification, non-maximum suppression is applied to the detection results on the multiple scale feature maps, and the final result is screened out, realizing multi-scale target detection. The low-level high-resolution feature map of this network structure carries more global information and stronger fitting capability, while the fitting capability of the high-level features is unchanged, so the problem of overfitting does not arise.
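A non-authoritative sketch of how such a 13-block trunk might be assembled is shown below, reusing the DepthwiseSeparableConv module from the previous sketch; the channel widths and exact block order are assumptions, since the patent specifies them only through fig. 3.

```python
import torch.nn as nn
# reuses the DepthwiseSeparableConv module from the previous sketch

def make_trunk():
    """Illustrative assembly of the lightweight trunk: an initial 3x3
    convolution with 32 kernels, then 13 alternating stride-1 (block1) and
    stride-2 (block2) depthwise separable blocks; channel widths are assumed."""
    cfg = [(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2),
           (512, 1), (512, 1), (512, 1), (512, 1), (512, 1), (1024, 2), (1024, 1)]
    layers = [nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
              nn.BatchNorm2d(32), nn.ReLU(inplace=True)]
    in_ch = 32
    for out_ch, stride in cfg:
        layers.append(DepthwiseSeparableConv(in_ch, out_ch, stride=stride))
        in_ch = out_ch
    return nn.Sequential(*layers)  # conv5 / conv11 / conv13 outputs would be tapped for detection
```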
2. Multi-scale feature fusion
Remote sensing images contain diverse targets whose scales differ greatly, and even targets of the same type can differ substantially in size. The method therefore adopts a feature pyramid to extract feature maps of different scales from different layers of the network for prediction, and combines the detection results of different layers to realize multi-scale target detection.
(1) Feature map selection
The receptive fields of unit pixels in feature maps of different sizes are different: the receptive field of a lower-layer feature map is smaller and is suitable for detecting smaller targets, while the receptive field of an upper layer is larger and is suitable for detecting larger targets. So that the feature map covers the target, the feature map receptive field is calculated with formula (5), and an appropriate feature map is selected according to formula (6).
R_k = R_{k-1} + (K_k - 1) · ∏_{i=1}^{k-1} s_i (5)

f(x) = k, where R_{k-1} < x ≤ R_k (6)

where R_k denotes the receptive field size of the k-th layer, the initial receptive field R_0 = 1, K_k denotes the convolution kernel size of the k-th layer, s_k denotes the convolution stride of the k-th layer, m is the total number of feature map layers, x is the sample length, and f(x) is the feature map corresponding to the sample length.
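As a small illustrative sketch of formulas (5) and (6), the receptive field of each layer can be accumulated from the kernel sizes and strides, and a sample of length x assigned to the first layer whose receptive field covers it; the kernel-size and stride lists in the usage comment are placeholders, not the network's actual configuration.

```python
def receptive_fields(kernel_sizes, strides):
    """Formula (5): R_k = R_{k-1} + (K_k - 1) * prod(s_1..s_{k-1}), with R_0 = 1."""
    r, stride_prod, fields = 1, 1, []
    for ks, s in zip(kernel_sizes, strides):
        r = r + (ks - 1) * stride_prod
        stride_prod *= s
        fields.append(r)
    return fields

def select_feature_map(x, fields):
    """Formula (6): f(x) = k such that R_{k-1} < x <= R_k (1-indexed layer)."""
    prev = 1
    for k, r in enumerate(fields, start=1):
        if prev < x <= r:
            return k
        prev = r
    return len(fields)  # fall back to the deepest layer for very large samples

# Example with placeholder kernels/strides:
# fields = receptive_fields([3] * 13, [2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1])
# k = select_feature_map(60, fields)
```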
Therefore, to meet the multi-scale target requirement, the method divides the samples into 3 scale levels, namely large, medium and small, and selects 3 feature maps of different sizes accordingly. According to the network structure defined in fig. 3, when the input size is 300 the feature map sizes of the upper layers are 38, 19 and 10 respectively (some feature maps share the same size); since the receptive field grows with the layer number, feature map 5, feature map 11 and feature map 13 are finally selected for detecting small, medium and large targets respectively, where feature map 5 is the fusion of the conv5 output with the deconvolved deep feature map, and feature map 11 and feature map 13 are the conv11 and conv13 output feature maps; their receptive fields, calculated with formula (5), are 43, 219 and 315 respectively. For an image with an input resolution of 300 × 300, the correspondence between the size of the feature map used for detection and the target scale is shown in table 1. Feature map 5 is used for detecting small targets; because a low-level feature map is large, many candidate frames are needed to cover the whole feature map, which slows detection, produces a large number of redundant candidate frames and easily causes false detection, so when small targets account for less than 1/4 of the data set the low-level feature map can be discarded.
Table 1 example of feature map selection
(2) Feature fusion
The low-level feature map has high resolution and retains abundant geometric information, so it can locate the target position more accurately; the high-level feature map, after multiple convolution layers, has a deeper level of target abstraction and contains rich semantic information, making it easier to judge the target category. In the SSD detection framework, medium and large targets are detected well, while the detection capability for small targets is weak. For small targets, pixel-level position information is lost after multiple convolution layers, making them difficult to detect; if a low-level feature map is selected instead, semantic information is lacking and a large number of false detections easily result.
The fusion process is shown in fig. 4. A 2 × 2 deconvolution kernel with stride 2 is used to up-sample the 38 × 38 high-semantic-information feature map, and the deconvolution output is batch-normalized and passed through a ReLU layer so that its resolution matches that of the 10 × 10 low-semantic-information feature map; the two feature maps are then concatenated into a multi-channel feature map, features are extracted from the concatenated multi-channel feature map with a multi-channel convolution, and feature fusion is realized with a 3 × 3 × 256 convolution kernel. Because the convolution kernel parameters can be adjusted through back-propagation learning, realizing feature fusion with a multi-channel convolution is more effective than directly adding the feature maps.
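A minimal sketch of the fusion module of fig. 4 is given below, assuming a PyTorch-style implementation in which the deeper high-semantic feature map is up-sampled by a 2 × 2, stride-2 deconvolution and then aligned to the shallower map before concatenation; the channel numbers and the explicit size alignment are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Deconvolve the deep high-semantic feature map (2x2 kernel, stride 2),
    apply BN + ReLU, concatenate it channel-wise with the shallower feature
    map, and fuse with a 3x3 convolution producing 256 channels."""
    def __init__(self, high_channels, low_channels, out_channels=256):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(high_channels, low_channels,
                                         kernel_size=2, stride=2, bias=False)
        self.bn = nn.BatchNorm2d(low_channels)
        self.fuse = nn.Conv2d(low_channels * 2, out_channels, kernel_size=3, padding=1)

    def forward(self, high, low):
        up = F.relu(self.bn(self.deconv(high)))   # 2x up-sampling of the deep map
        if up.shape[-2:] != low.shape[-2:]:       # size alignment (assumption, not in the patent)
            up = F.interpolate(up, size=low.shape[-2:], mode="bilinear", align_corners=False)
        merged = torch.cat([up, low], dim=1)      # channel-wise concatenation
        return F.relu(self.fuse(merged))
```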
The invention adopts skip connections to simplify model operation, reduce complexity and increase the number of output feature layers. The high-level and low-level feature maps are fused, and both position and semantic information are preserved across the channels, so that the contextual semantic information of the network is fully utilized and small target detection performance is improved.
3. Target detection
To detect targets of different scales on feature maps rich in semantic and geometric information, matched candidate frames need to be arranged on the feature maps of different scales, generating prediction tensors of different sizes together with the confidence of the category to which each predicted target belongs and its position information.
(1) Candidate frame design
Candidate frames of different sizes are convolved over the feature map and matched to the corresponding receptive fields, so that all targets in the original image are covered as far as possible and each target can be matched to a candidate frame. The smaller the receptive field of a unit pixel in the feature map, the denser the candidate frames need to be; the larger the detection scale, the sparser the candidate frames. The design of the candidate frames follows 2 principles:
a) the size of the candidate frame should be close to the receptive field of the feature map;
b) the aspect ratio of the candidate frame should be close to the aspect ratio of the target.
one unit in the convolutional neural network has two receptive fields. One is the theoretical receptive field, which represents the input area that theoretically could affect the unit value. However, not every pixel in the theoretical perceptual field contributes equally to the final output. Typically, the central pixel has a larger effect than the outer pixels, i.e. only a small area has an effective effect on the output value, called the effective receptive field. According to this theory, the candidate box should be significantly smaller than the theoretical receptive field to match the effective receptive field. Relevant researches show that as the number of network layers is deepened, the actual effective receptive field is increased in range level, the proportion of the effective receptive field to the theoretical receptive field is reduced according to the level, and referring to relevant experiments, the proportion of the effective receptive field to the theoretical receptive field is 1/3, so that the size of a candidate frame is calculated by adopting a formula (7).
S_k = R_k / 3, k ∈ [1, m] (7)

where R_k is the receptive field size of the k-th layer and m is the total number of feature map layers.
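A one-line illustration of formula (7), applied to the receptive fields 43/219/315 quoted above for feature maps 5/11/13 (the values in the comment are illustrative only):

```python
def candidate_sizes(receptive_fields, ratio=1.0 / 3.0):
    """Formula (7): S_k = R_k / 3 for each selected detection layer."""
    return [r * ratio for r in receptive_fields]

# Using the receptive fields quoted in the text for feature maps 5/11/13:
# candidate_sizes([43, 219, 315]) -> [14.33..., 73.0, 105.0]
```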
For the candidate frame aspect ratio, a set of default aspect ratios is used; for elongated targets such as ships or automobiles, more elongated aspect ratios may be adopted, and for particular data a suitable set of aspect ratios can be chosen by counting the length-width distribution of the samples. After the aspect ratios are determined, the length and width on the original image corresponding to each feature map candidate frame are calculated according to formula (8); when the aspect ratio is 1, two additional sizes, S_k and S'_k = sqrt(S_k · S_{k+1}), are added, so there are 6 candidate frames per feature map.

w_k = S_k · sqrt(r), h_k = S_k / sqrt(r) (8)

where r is the aspect ratio coefficient, and w_k and h_k are respectively the length and width of the k-th layer feature map candidate frame.
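An illustrative sketch of the candidate frame generation implied by formula (8): one box per aspect ratio with width S_k·sqrt(r) and height S_k/sqrt(r), plus the extra size sqrt(S_k·S_{k+1}) when the aspect ratio is 1, giving 6 boxes per feature map cell; the concrete aspect ratio set used here is an assumption.

```python
import math

def default_boxes(s_k, s_k_next, aspect_ratios=(1.0, 2.0, 0.5, 3.0, 1.0 / 3.0)):
    """Formula (8): (w, h) = (S_k * sqrt(r), S_k / sqrt(r)) per aspect ratio,
    plus an extra square box of side sqrt(S_k * S_{k+1}) for r = 1,
    so that each feature map cell has 6 candidate frames."""
    boxes = [(s_k * math.sqrt(r), s_k / math.sqrt(r)) for r in aspect_ratios]
    extra = math.sqrt(s_k * s_k_next)   # additional size used when r = 1
    boxes.append((extra, extra))
    return boxes

# Example: default_boxes(73.0, 105.0) returns 6 (w, h) pairs for one cell.
```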
(2) Training
The training stage mainly establishes the correspondence between the real labels and the candidate frames; candidate frames are selected for the real labels according to 2 matching principles: the candidate frame with the maximum Intersection-over-Union (IoU) with a real target in the image is matched to that target, ensuring that every real target is covered by a candidate frame; a candidate frame covering a real target is a positive sample, and a candidate frame without a matched target is a negative sample; when a candidate frame matches multiple real targets, the target with the largest IoU is taken.
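The matching rules above can be sketched as follows; the box format, tie-breaking details and the absence of an IoU threshold are assumptions rather than statements of the patented method.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(candidates, targets):
    """Return {candidate index: target index}; candidates absent from the
    dict are negative samples. Each real target takes its highest-IoU
    candidate, and a candidate claimed by several targets keeps the
    target with the largest IoU."""
    assignment = {}
    for j, g in enumerate(targets):
        best_i = max(range(len(candidates)), key=lambda i: iou(candidates[i], g))
        if best_i not in assignment or iou(candidates[best_i], g) > iou(
                candidates[best_i], targets[assignment[best_i]]):
            assignment[best_i] = j
    return assignment
```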
The invention trains the model with the loss function of the SSD, which is a weighted sum of the confidence error (conf) and the position error (localization error), as shown in formula (9):

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g)) (9)

where x indicates whether a candidate frame matches the target, 1 for a match and 0 for a mismatch; c is the predicted multi-class target confidence; l is the predicted value of the position of the bounding box corresponding to the candidate frame; g is the position parameter of the real target; N is the number of candidate frames matched to real targets; and α is a parameter adjusting the ratio between the position error and the confidence error, usually set to 1.
For the position error, the Smooth_L1 loss between the real target and the candidate frame is used, defined as follows:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_{ij}^k · smooth_L1(l_i^m - ĝ_j^m)

ĝ_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^w, ĝ_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^h

ĝ_j^w = log(g_j^w / d_i^w), ĝ_j^h = log(g_j^h / d_i^h)

where x_{ij}^k indicates that the i-th candidate frame is matched to the j-th real target of category k; smooth_L1(·) denotes the smoothed L1 norm; i ∈ Pos denotes the i-th positive-sample prediction frame; cx, cy, w and h denote respectively the center coordinates and the width and height of the candidate frame; ĝ is obtained by encoding g; l denotes the offset of the prediction frame relative to the candidate frame; g denotes the real frame; and d denotes the candidate frame.
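A sketch of the encoding of ĝ and of the Smooth_L1 position loss defined above, with the candidate frame d and real frame g given as (cx, cy, w, h) tuples; it follows the SSD convention that the patent references rather than any implementation disclosed here.

```python
import math

def encode(g, d):
    """Encode a real box g relative to a candidate box d (both (cx, cy, w, h)),
    producing the regression target g-hat used in the position loss."""
    return ((g[0] - d[0]) / d[2],
            (g[1] - d[1]) / d[3],
            math.log(g[2] / d[2]),
            math.log(g[3] / d[3]))

def smooth_l1(x):
    """Smoothed L1: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def loc_loss(pred_offsets, g, d):
    """Sum of Smooth_L1 terms over (cx, cy, w, h) for one matched pair."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, encode(g, d)))
```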
The confidence error is the softmax loss over the multi-class confidence c, as shown in formula (10):

L_conf(x, c) = -Σ_{i∈Pos} x_{ij}^p · log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p) (10)

where i ∈ Neg denotes the i-th negative-sample prediction frame; x_{ij}^p is an indicator parameter which, when equal to 1, indicates that the i-th candidate frame matches the j-th real target and that the real target belongs to category p; ĉ_i^p is the predicted probability that the i-th candidate frame matched to the j-th real target of category p belongs to p, and the higher this probability the smaller the loss; ĉ_i^0 is the probability, generated by softmax, that the prediction frame contains no target and is background, and the higher this probability the smaller the loss.
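Finally, a sketch of the confidence (softmax) loss and of the weighted total loss of formula (9), written with plain Python lists for clarity; treating class index 0 as the background class is an assumption.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def conf_loss(pos_logits_labels, neg_logits):
    """-sum over positives of log p(true class) - sum over negatives of
    log p(background), with background assumed to be class index 0."""
    loss = 0.0
    for logits, label in pos_logits_labels:   # positive candidate frames
        loss -= math.log(softmax(logits)[label] + 1e-12)
    for logits in neg_logits:                 # negative candidate frames
        loss -= math.log(softmax(logits)[0] + 1e-12)
    return loss

def total_loss(l_conf, l_loc, num_matched, alpha=1.0):
    """Formula (9): L = (1/N) * (L_conf + alpha * L_loc); returns 0 if N = 0."""
    return (l_conf + alpha * l_loc) / num_matched if num_matched else 0.0
```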
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A remote sensing image multi-scale target detection and identification method based on a lightweight network is characterized by comprising the following steps:
s1, preprocessing the acquired remote sensing image;
S2, replacing the standard convolution operations in the feature extraction trunk and the detection branches with depthwise separable convolutions, and inputting the processed image into a convolutional neural network for feature extraction;
S3, extracting feature maps of different scales from different layers of the convolutional neural network with a feature pyramid for prediction, and fusing the detection results of different layers to obtain a multi-scale fused feature map;
and S4, setting candidate frames on the feature maps, generating prediction tensors, and predicting the confidence of the category to which the target belongs and its position information on the basis of the prediction tensors.
2. The light-weight-network-based remote sensing image multi-scale target detection and identification method according to claim 1, wherein the step S2 specifically includes:
S21, applying a convolution with 32 3 × 3 convolution kernels at the initial layer to increase the number of network layers and improve the feature expression capability;
S22, alternately stacking block1 and block2 modules, wherein each block consists of a depthwise convolution and a pointwise convolution, batch normalization is applied after each convolution followed by a ReLU layer, the depthwise convolution stride of block1 is 1, and the depthwise convolution stride of block2 is 2;
and S23, after feature extraction through 13 convolution blocks, deconvolving the conv13 feature map and fusing it with the conv5 feature map, sending the fused feature map, the conv11 feature map and the conv13 feature map simultaneously into the target detection module for coordinate regression and classification, performing non-maximum suppression on the detection results of the multi-scale feature maps, and screening out the final result to complete feature extraction.
3. The light-weight-network-based remote sensing image multi-scale target detection and identification method according to claim 1, wherein the step S3 specifically includes:
S31, calculating the feature map receptive field

R_k = R_{k-1} + (K_k - 1) · ∏_{i=1}^{k-1} s_i

and selecting an appropriate feature map according to

f(x) = k, where R_{k-1} < x ≤ R_k

where R_k denotes the receptive field size of the k-th layer, the initial receptive field R_0 = 1, K_k denotes the convolution kernel size of the k-th layer, s_k denotes the convolution stride of the k-th layer, m is the total number of feature map layers, x is the sample length, and f(x) is the feature map corresponding to the sample length;
S32, dividing the samples into 3 scale levels, namely large, medium and small, and finally selecting 3 feature maps of different sizes accordingly;
S33, using a 2 × 2 deconvolution kernel with stride 2 to up-sample the 38 × 38 high-semantic-information feature map, applying batch normalization and a ReLU layer to the deconvolution output so that its resolution matches that of the 10 × 10 low-semantic-information feature map, concatenating the two feature maps into one multi-channel feature map, extracting features from the concatenated multi-channel feature map by multi-channel convolution, and realizing feature fusion with a 3 × 3 × 256 convolution kernel.
4. The method for detecting and identifying the remote sensing image multi-scale target based on the lightweight network as claimed in claim 1, wherein the step S4 specifically comprises:
S41, selecting the ratio of the effective receptive field to the theoretical receptive field as 1/3, and calculating the candidate frame size as

S_k = R_k / 3, k ∈ [1, m]

where R_k is the receptive field size of the k-th layer and m is the total number of feature map layers;

determining the aspect ratios of the candidate frames and calculating the size on the original image corresponding to each feature map candidate frame; when the aspect ratio is 1, two additional sizes, S_k and S'_k = sqrt(S_k · S_{k+1}), are added so that there are 6 candidate frames per feature map,

w_k = S_k · sqrt(r), h_k = S_k / sqrt(r)

where r is the aspect ratio coefficient, and w_k and h_k are respectively the length and width of the candidate frame of the k-th layer feature map;
S42, establishing a correspondence between the real labels and the candidate frames, wherein candidate frames are selected for the real labels according to the following matching principles: the candidate frame with the maximum intersection-over-union (IoU) with a real target in the image is matched to that target so that every real target is covered by a candidate frame, a candidate frame covering a real target is a positive sample, and a candidate frame without a matched target is a negative sample; when a candidate frame matches multiple real targets, the target with the largest IoU is selected;
S43, training the model with the loss function of the SSD, which is a weighted sum of the confidence error and the position error:

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))

where x indicates whether a candidate frame matches the target, 1 for a match and 0 for a mismatch; c is the predicted multi-class target confidence; l is the predicted position of the bounding box corresponding to the candidate frame; g is the position parameter of the real target; N is the number of candidate frames matched to real targets; and α is a parameter adjusting the ratio between the position error and the confidence error, usually set to 1;
for the position error, the Smooth_L1 loss between the real target and the candidate frame is used, defined as follows:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_{ij}^k · smooth_L1(l_i^m - ĝ_j^m)

ĝ_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^w, ĝ_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^h

ĝ_j^w = log(g_j^w / d_i^w), ĝ_j^h = log(g_j^h / d_i^h)

where x_{ij}^k indicates that the i-th candidate frame is matched to the j-th real target of category k; smooth_L1(·) denotes the smoothed L1 norm; i ∈ Pos denotes the i-th positive-sample prediction frame; cx, cy, w and h denote respectively the center coordinates and the width and height of the candidate frame; ĝ is obtained by encoding g; l denotes the offset of the prediction frame relative to the candidate frame; g denotes the real frame; and d denotes the candidate frame;
the confidence error is the softmax loss over the multi-class confidence c,

L_conf(x, c) = -Σ_{i∈Pos} x_{ij}^p · log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)

where i ∈ Neg denotes the i-th negative-sample prediction frame; x_{ij}^p is an indicator parameter which, when equal to 1, indicates that the i-th candidate frame matches the j-th real target and that the real target belongs to category p; ĉ_i^p is the predicted probability that the i-th candidate frame matched to the j-th real target of category p belongs to p, and the higher this probability the smaller the loss; ĉ_i^0 is the probability, generated by softmax, that the prediction frame contains no target and is background, and the higher this probability the smaller the loss.
CN202111388223.6A 2021-11-22 2021-11-22 Remote sensing image multi-scale target detection and identification method based on lightweight network Pending CN114170526A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111388223.6A CN114170526A (en) 2021-11-22 2021-11-22 Remote sensing image multi-scale target detection and identification method based on lightweight network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111388223.6A CN114170526A (en) 2021-11-22 2021-11-22 Remote sensing image multi-scale target detection and identification method based on lightweight network

Publications (1)

Publication Number Publication Date
CN114170526A true CN114170526A (en) 2022-03-11

Family

ID=80480075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111388223.6A Pending CN114170526A (en) 2021-11-22 2021-11-22 Remote sensing image multi-scale target detection and identification method based on lightweight network

Country Status (1)

Country Link
CN (1) CN114170526A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023173552A1 (en) * 2022-03-15 2023-09-21 平安科技(深圳)有限公司 Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN115223017A (en) * 2022-05-31 2022-10-21 昆明理工大学 Multi-scale feature fusion bridge detection method based on depth separable convolution
CN115223017B (en) * 2022-05-31 2023-12-19 昆明理工大学 Multi-scale feature fusion bridge detection method based on depth separable convolution


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination