CN114170526A - Remote sensing image multi-scale target detection and identification method based on lightweight network - Google Patents

Remote sensing image multi-scale target detection and identification method based on lightweight network

Info

Publication number
CN114170526A
Authority
CN
China
Prior art keywords
feature
target
candidate
convolution
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111388223.6A
Other languages
Chinese (zh)
Inventor
蒋丽婷
张志超
喻金桃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202111388223.6A priority Critical patent/CN114170526A/en
Publication of CN114170526A publication Critical patent/CN114170526A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image multi-scale target detection and identification method based on a lightweight network, which comprises the following steps: preprocessing the acquired remote sensing image; replacing the standard convolution operations in the feature extraction trunk and the detection branches with depthwise separable convolutions, and inputting the processed image into a convolutional neural network for feature extraction; extracting feature maps of different scales from different layers of the convolutional neural network with a feature pyramid for prediction, and fusing the detection results of different layers to obtain a multi-scale fused feature map; and setting candidate frames on the feature map, generating a prediction tensor, and predicting, on the basis of the prediction tensor, the confidence of the category to which the target belongs and its position information. By introducing depthwise separable convolution, model parameters are reduced and the network detection speed is improved; feature maps of multiple scales are extracted to meet the detection requirements of targets of different scales; and a high-level feature map with strong semantic information is fused with a low-level feature map with strong geometric information, improving small target detection performance.

Description

Remote sensing image multi-scale target detection and identification method based on lightweight network
Technical Field
The invention relates to the technical field of remote sensing image detection, in particular to a remote sensing image multi-scale target detection and identification method based on a lightweight network.
Background
At present, with the continuous development of remote sensing technology, the resolution of remote sensing images, including time resolution, spatial resolution, radiation resolution and spectral resolution, is continuously improved, high-quality remote sensing images are gradually and widely applied to the military and civil fields, and the target detection of remote sensing images based on deep learning also gradually becomes a research hotspot. In order to realize target detection in remote sensing images, researchers have proposed a plurality of valuable target detection methods one after another, and currently, mainstream detection algorithms can be generally classified into 2 types: a staged detection method and an end-to-end detection method.
The first type is a staged detection method, which generally obtains a candidate region through methods such as a sliding window and the like, extracts features to train a classifier, and judges whether a candidate frame contains a target through the classifier. At present, many target detection algorithms in the field of remote sensing are realized based on the method, and the method has the advantage of higher detection precision.
The second type is the end-to-end target detection algorithm based on regression, which effectively combines the candidate frame extraction stage and the category prediction stage; both YOLO and SSD complete target detection by regression, greatly increasing the speed of deep-learning-based target detection.
However, the defects of existing remote sensing image target detection methods mainly include the following. The staged detection method involves multiple stages, its implementation is complex, a large number of redundant candidate regions are extracted by the sliding window, the detection speed is slow, and it is difficult to meet the real-time requirements of remote sensing image processing. The end-to-end regression-based target detection algorithms are mainly designed for natural scenes, whereas remote sensing images cover a wide area, exhibit large differences in target scale, contain a high proportion of small targets, and have low resolution. Therefore, such methods cannot be directly applied to remote sensing images, and suffer from poor multi-scale target extraction capability, missed detection of small targets, and the like.
Therefore, how to introduce depthwise separable convolution to reduce model parameters and improve network detection speed, extract feature maps of multiple scales to meet the detection requirements of targets of different scales, and fuse high-level feature maps with strong semantic information and low-level feature maps with strong geometric information to improve small target detection performance is a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In view of the above, the invention provides a remote sensing image multi-scale target detection and identification method based on a lightweight network, which reduces model parameters and improves network detection speed by introducing depth separable convolution; extracting a plurality of scale feature maps to meet the detection requirements of targets with different scales; and a high-level feature map with strong semantic information features and a bottom-level feature map with strong geometric information features are fused, so that the small target detection performance is improved.
In order to achieve the purpose, the invention adopts the following technical scheme:
a remote sensing image multi-scale target detection and identification method based on a lightweight network comprises the following steps:
s1, preprocessing the acquired remote sensing image;
S2, replacing the standard convolution operations in the feature extraction trunk and the detection branches with depthwise separable convolutions, and inputting the processed image into a convolutional neural network for feature extraction;
S3, extracting feature maps of different scales from different layers of the convolutional neural network with a feature pyramid for prediction, and fusing the detection results of different layers to obtain a multi-scale fused feature map;
and S4, setting candidate frames on the feature maps, generating prediction tensors, and predicting the confidence of the category to which the target belongs and its position information on the basis of the prediction tensors.
Preferably, the step S2 specifically includes:
S21, applying a convolution with 32 3 × 3 convolution kernels at the initial layer to increase the number of network layers and improve the feature expression capability;
S22, alternately stacking block1 and block2 modules, wherein each block consists of a depthwise convolution and a pointwise convolution, batch normalization is applied after each convolution followed by a ReLU layer, the depthwise convolution stride of block1 is 1, and the depthwise convolution stride of block2 is 2;
and S23, after feature extraction through 13 convolution blocks, deconvolving the conv13 feature map and fusing it with the conv5 feature map, sending the fused feature map, the conv11 feature map and the conv13 feature map simultaneously into the target detection module for coordinate regression and classification, performing non-maximum suppression on the detection results of the multi-scale feature maps, and screening out the final result to complete feature extraction.
Preferably, the step S3 specifically includes:
S31, calculating the feature map receptive field

R_k = R_{k-1} + (K_k - 1) · ∏_{i=1}^{k-1} s_i

and selecting an appropriate feature map according to

f(x) = k, where R_{k-1} < x ≤ R_k

where R_k denotes the receptive field size of the k-th layer, the initial receptive field R_0 = 1, K_k denotes the convolution kernel size of the k-th layer, s_k denotes the convolution stride of the k-th layer, m is the total number of feature map layers, x is the sample length, and f(x) is the feature map corresponding to the sample length;
S32, dividing the samples into 3 scale levels, namely large, medium and small, and finally selecting 3 feature maps of different sizes accordingly;
S33, using a 2 × 2 deconvolution kernel with stride 2 to up-sample the 38 × 38 high-semantic-information feature map, applying batch normalization and a ReLU layer to the deconvolution output so that its resolution matches that of the 10 × 10 low-semantic-information feature map, concatenating the two feature maps into one multi-channel feature map, extracting features from the concatenated multi-channel feature map by multi-channel convolution, and realizing feature fusion with a 3 × 3 × 256 convolution kernel.
Preferably, the step S4 specifically includes:
S41, selecting the ratio of the effective receptive field to the theoretical receptive field as 1/3, and calculating the candidate frame size as

S_k = R_k / 3, k ∈ [1, m]

where R_k is the receptive field size of the k-th layer and m is the total number of feature map layers;

determining the aspect ratios of the candidate frames and calculating the size on the original image corresponding to each feature map candidate frame; when the aspect ratio is 1, two additional sizes, S_k and S'_k = sqrt(S_k · S_{k+1}), are added so that there are 6 candidate frames per feature map,

w_k = S_k · sqrt(r), h_k = S_k / sqrt(r)

where r is the aspect ratio coefficient, and w_k and h_k are respectively the length and width of the candidate frame of the k-th layer feature map;
S42, establishing a correspondence between the real labels and the candidate frames, wherein candidate frames are selected for the real labels according to the following matching principles: the candidate frame with the maximum intersection-over-union (IoU) with a real target in the image is matched to that target so that every real target is covered by a candidate frame, a candidate frame covering a real target is a positive sample, and a candidate frame without a matched target is a negative sample; when a candidate frame matches multiple real targets, the target with the largest IoU is selected;
S43, training the model with the loss function of the SSD, which is a weighted sum of the confidence error and the position error:

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))

where x indicates whether a candidate frame matches the target, 1 for a match and 0 for a mismatch; c is the predicted multi-class target confidence; l is the predicted position of the bounding box corresponding to the candidate frame; g is the position parameter of the real target; N is the number of candidate frames matched to real targets; and α is a parameter adjusting the ratio between the position error and the confidence error, usually set to 1;
for the position error, the Smooth_L1 loss between the real target and the candidate frame is used, defined as follows:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_{ij}^k · smooth_L1(l_i^m - ĝ_j^m)

ĝ_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^w, ĝ_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^h

ĝ_j^w = log(g_j^w / d_i^w), ĝ_j^h = log(g_j^h / d_i^h)

where x_{ij}^k indicates that the i-th candidate frame is matched to the j-th real target of category k; smooth_L1(·) denotes the smoothed L1 norm; i ∈ Pos denotes the i-th positive-sample prediction frame; cx, cy, w and h denote respectively the center coordinates and the width and height of the candidate frame; ĝ is obtained by encoding g; l denotes the offset of the prediction frame relative to the candidate frame; g denotes the real frame; and d denotes the candidate frame;
the confidence error is the softmax loss over the multi-class confidence c,

L_conf(x, c) = -Σ_{i∈Pos} x_{ij}^p · log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)

where i ∈ Neg denotes the i-th negative-sample prediction frame; x_{ij}^p is an indicator parameter which, when equal to 1, indicates that the i-th candidate frame matches the j-th real target and that the real target belongs to category p; ĉ_i^p is the predicted probability that the i-th candidate frame matched to the j-th real target of category p belongs to p, and the higher this probability the smaller the loss; ĉ_i^0 is the probability, generated by softmax, that the prediction frame contains no target and is background, and the higher this probability the smaller the loss.
According to the technical scheme, compared with the prior art, the remote sensing image multi-scale target detection and identification method based on the lightweight network is disclosed, and by introducing the depth separable convolution, the model parameters are reduced, and the network detection speed is increased; extracting a plurality of scale feature maps to meet the detection requirements of targets with different scales; and a high-level feature map with strong semantic information features and a bottom-level feature map with strong geometric information features are fused, so that the small target detection performance is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic diagram of a convolution process provided by the present invention.
FIG. 2 is a schematic diagram of a flow structure of the method provided by the present invention.
Fig. 3 is a schematic diagram of a network structure provided by the present invention.
Fig. 4 is a schematic structural diagram of a fusion module provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a remote sensing image multi-scale target detection and identification method based on a lightweight network, which comprises the following steps:
s1, preprocessing the acquired remote sensing image;
S2, replacing the standard convolution operations in the feature extraction trunk and the detection branches with depthwise separable convolutions, and inputting the processed image into a convolutional neural network for feature extraction;
S3, extracting feature maps of different scales from different layers of the convolutional neural network with a feature pyramid for prediction, and fusing the detection results of different layers to obtain a multi-scale fused feature map;
and S4, setting candidate frames on the feature maps, generating prediction tensors, and predicting the confidence of the category to which the target belongs and its position information on the basis of the prediction tensors.
To further optimize the above technical solution, step S2 specifically includes:
S21, applying a convolution with 32 3 × 3 convolution kernels at the initial layer to increase the number of network layers and improve the feature expression capability;
S22, alternately stacking block1 and block2 modules, wherein each block consists of a depthwise convolution and a pointwise convolution, batch normalization is applied after each convolution followed by a ReLU layer, the depthwise convolution stride of block1 is 1, and the depthwise convolution stride of block2 is 2;
and S23, after feature extraction through 13 convolution blocks, deconvolving the conv13 feature map and fusing it with the conv5 feature map, sending the fused feature map, the conv11 feature map and the conv13 feature map simultaneously into the target detection module for coordinate regression and classification, performing non-maximum suppression on the detection results of the multi-scale feature maps, and screening out the final result to complete feature extraction.
To further optimize the above technical solution, step S3 specifically includes:
S31, calculating the feature map receptive field

R_k = R_{k-1} + (K_k - 1) · ∏_{i=1}^{k-1} s_i

and selecting an appropriate feature map according to

f(x) = k, where R_{k-1} < x ≤ R_k

where R_k denotes the receptive field size of the k-th layer, the initial receptive field R_0 = 1, K_k denotes the convolution kernel size of the k-th layer, s_k denotes the convolution stride of the k-th layer, m is the total number of feature map layers, x is the sample length, and f(x) is the feature map corresponding to the sample length;
S32, dividing the samples into 3 scale levels, namely large, medium and small, and finally selecting 3 feature maps of different sizes accordingly;
S33, using a 2 × 2 deconvolution kernel with stride 2 to up-sample the 38 × 38 high-semantic-information feature map, applying batch normalization and a ReLU layer to the deconvolution output so that its resolution matches that of the 10 × 10 low-semantic-information feature map, concatenating the two feature maps into one multi-channel feature map, extracting features from the concatenated multi-channel feature map by multi-channel convolution, and realizing feature fusion with a 3 × 3 × 256 convolution kernel.
To further optimize the above technical solution, step S4 specifically includes:
S41, selecting the ratio of the effective receptive field to the theoretical receptive field as 1/3, and calculating the candidate frame size as

S_k = R_k / 3, k ∈ [1, m]

where R_k is the receptive field size of the k-th layer and m is the total number of feature map layers;

determining the aspect ratios of the candidate frames and calculating the size on the original image corresponding to each feature map candidate frame; when the aspect ratio is 1, two additional sizes, S_k and S'_k = sqrt(S_k · S_{k+1}), are added so that there are 6 candidate frames per feature map,

w_k = S_k · sqrt(r), h_k = S_k / sqrt(r)

where r is the aspect ratio coefficient, and w_k and h_k are respectively the length and width of the candidate frame of the k-th layer feature map;
S42, establishing a correspondence between the real labels and the candidate frames, wherein candidate frames are selected for the real labels according to the following matching principles: the candidate frame with the maximum intersection-over-union (IoU) with a real target in the image is matched to that target so that every real target is covered by a candidate frame, a candidate frame covering a real target is a positive sample, and a candidate frame without a matched target is a negative sample; when a candidate frame matches multiple real targets, the target with the largest IoU is selected;
S43, training the model with the loss function of the SSD, which is a weighted sum of the confidence error and the position error:

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))

where x indicates whether a candidate frame matches the target, 1 for a match and 0 for a mismatch; c is the predicted multi-class target confidence; l is the predicted position of the bounding box corresponding to the candidate frame; g is the position parameter of the real target; N is the number of candidate frames matched to real targets; and α is a parameter adjusting the ratio between the position error and the confidence error, usually set to 1;
for the position error, the Smooth_L1 loss between the real target and the candidate frame is used, defined as follows:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_{ij}^k · smooth_L1(l_i^m - ĝ_j^m)

ĝ_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^w, ĝ_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^h

ĝ_j^w = log(g_j^w / d_i^w), ĝ_j^h = log(g_j^h / d_i^h)

where x_{ij}^k indicates that the i-th candidate frame is matched to the j-th real target of category k; smooth_L1(·) denotes the smoothed L1 norm; i ∈ Pos denotes the i-th positive-sample prediction frame; cx, cy, w and h denote respectively the center coordinates and the width and height of the candidate frame; ĝ is obtained by encoding g; l denotes the offset of the prediction frame relative to the candidate frame; g denotes the real frame; and d denotes the candidate frame;
the confidence error is the softmax loss over the multi-class confidence c,

L_conf(x, c) = -Σ_{i∈Pos} x_{ij}^p · log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)

where i ∈ Neg denotes the i-th negative-sample prediction frame; x_{ij}^p is an indicator parameter which, when equal to 1, indicates that the i-th candidate frame matches the j-th real target and that the real target belongs to category p; ĉ_i^p is the predicted probability that the i-th candidate frame matched to the j-th real target of category p belongs to p, and the higher this probability the smaller the loss; ĉ_i^0 is the probability, generated by softmax, that the prediction frame contains no target and is background, and the higher this probability the smaller the loss.
1. Feature extraction
With the great success of convolutional neural networks in the field of computer vision, their depth and width have continuously increased, making the computation and model size of neural networks so large that the real-time requirements of remote sensing image target detection are difficult to meet; the compression and acceleration of neural networks have therefore gradually become a research hotspot. The invention introduces depthwise separable convolution to replace the standard convolution, greatly reducing the network parameters so that the network detection speed reaches a real-time level.
(1) Lightweighting method
The standard convolution and depthwise separable convolution processes are shown in fig. 1(a) and (b) respectively, and the parameter quantities of the two convolutions are given by formulas (2) and (3), where D_k is the convolution kernel size, M is the number of input channels and N is the number of output channels. The standard convolution parameter quantity is:

D_k × D_k × M × N (2)

The depthwise separable convolution parameter quantity is:

D_k × D_k × M + M × N (3)
The ratio of the depthwise separable convolution parameters to the standard convolution parameters is:

(D_k × D_k × M + M × N) / (D_k × D_k × M × N) = 1/N + 1/D_k² (4)

The depthwise separable convolution first applies a depthwise convolution to each input channel separately and then combines the outputs with a pointwise convolution; while preserving the overall effect, it reduces the parameter quantity to 1/N + 1/D_k² of the original, achieving a lightweight model. Moreover, because a large number of 1 × 1 convolutions are used, the convolution can be completed directly by highly optimized matrix multiplication, which reduces memory reorganization and greatly improves operational efficiency. Therefore, the invention replaces the standard convolution operations in the feature extraction trunk and the detection branches with depthwise separable convolutions, greatly reducing network parameters and enabling a real-time detection effect.
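For illustration only, a minimal sketch of such a depthwise separable convolution block, assuming a PyTorch-style implementation (the module and parameter names are not taken from the patent), is given below; the comments also check the parameter-count comparison of formulas (2)-(4) with example values.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution over each input channel, then a 1x1 pointwise
    convolution to combine channels; BN + ReLU after each convolution,
    matching the block1 (stride 1) / block2 (stride 2) description."""
    def __init__(self, in_channels, out_channels, stride=1, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# Parameter-count check for formulas (2)-(4) with example values
# D_k = 3, M = 32, N = 64: standard = 3*3*32*64 = 18432,
# separable = 3*3*32 + 32*64 = 2336, ratio = 2336/18432 ~ 0.127 = 1/64 + 1/9.
```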
(2) Lightweight feature extraction trunks
The invention adopts an SSD (Single Shot MultiBox Detector) detection framework, which mainly comprises two parts, feature extraction and target detection; the feature extraction part contains 13 convolution blocks, and the overall architecture is shown in fig. 3. First, a convolution with 32 3 × 3 kernels is applied at the initial layer to increase the network depth and improve the feature expression capability. The network is then built by alternately stacking block1 and block2 modules, each consisting of a depthwise convolution and a pointwise convolution, with batch normalization after each convolution followed by a ReLU layer; the depthwise convolution stride of block1 is 1 and that of block2 is 2. After feature extraction through the 13 convolution blocks, the conv13 feature map is deconvolved and fused with conv5, so that the fused feature map retains high-level semantic information while also containing low-level geometric information. The fused feature map, the conv11 feature map and the conv13 feature map are then sent simultaneously to the target detection module for coordinate regression and classification, non-maximum suppression is applied to the detection results on the multiple scale feature maps, and the final result is screened out, realizing multi-scale target detection. The low-level high-resolution feature map of this network structure carries more global information and stronger fitting capability, while the fitting capability of the high-level features is unchanged, so the problem of overfitting does not arise.
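A non-authoritative sketch of how such a 13-block trunk might be assembled is shown below, reusing the DepthwiseSeparableConv module from the previous sketch; the channel widths and exact block order are assumptions, since the patent specifies them only through fig. 3.

```python
import torch.nn as nn
# reuses the DepthwiseSeparableConv module from the previous sketch

def make_trunk():
    """Illustrative assembly of the lightweight trunk: an initial 3x3
    convolution with 32 kernels, then 13 alternating stride-1 (block1) and
    stride-2 (block2) depthwise separable blocks; channel widths are assumed."""
    cfg = [(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2),
           (512, 1), (512, 1), (512, 1), (512, 1), (512, 1), (1024, 2), (1024, 1)]
    layers = [nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
              nn.BatchNorm2d(32), nn.ReLU(inplace=True)]
    in_ch = 32
    for out_ch, stride in cfg:
        layers.append(DepthwiseSeparableConv(in_ch, out_ch, stride=stride))
        in_ch = out_ch
    return nn.Sequential(*layers)  # conv5 / conv11 / conv13 outputs would be tapped for detection
```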
2. Multi-scale feature fusion
Remote sensing images contain diverse targets whose scales differ greatly, and even targets of the same type can differ substantially in size. The method therefore adopts a feature pyramid to extract feature maps of different scales from different layers of the network for prediction, and combines the detection results of different layers to realize multi-scale target detection.
(1) Feature map selection
The receptive fields of unit pixels in feature maps of different sizes are different: the receptive field of a lower-layer feature map is smaller and is suitable for detecting smaller targets, while the receptive field of an upper layer is larger and is suitable for detecting larger targets. So that the feature map covers the target, the feature map receptive field is calculated with formula (5), and an appropriate feature map is selected according to formula (6).
R_k = R_{k-1} + (K_k - 1) · ∏_{i=1}^{k-1} s_i (5)

f(x) = k, where R_{k-1} < x ≤ R_k (6)

where R_k denotes the receptive field size of the k-th layer, the initial receptive field R_0 = 1, K_k denotes the convolution kernel size of the k-th layer, s_k denotes the convolution stride of the k-th layer, m is the total number of feature map layers, x is the sample length, and f(x) is the feature map corresponding to the sample length.
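As a small illustrative sketch of formulas (5) and (6), the receptive field of each layer can be accumulated from the kernel sizes and strides, and a sample of length x assigned to the first layer whose receptive field covers it; the kernel-size and stride lists in the usage comment are placeholders, not the network's actual configuration.

```python
def receptive_fields(kernel_sizes, strides):
    """Formula (5): R_k = R_{k-1} + (K_k - 1) * prod(s_1..s_{k-1}), with R_0 = 1."""
    r, stride_prod, fields = 1, 1, []
    for ks, s in zip(kernel_sizes, strides):
        r = r + (ks - 1) * stride_prod
        stride_prod *= s
        fields.append(r)
    return fields

def select_feature_map(x, fields):
    """Formula (6): f(x) = k such that R_{k-1} < x <= R_k (1-indexed layer)."""
    prev = 1
    for k, r in enumerate(fields, start=1):
        if prev < x <= r:
            return k
        prev = r
    return len(fields)  # fall back to the deepest layer for very large samples

# Example with placeholder kernels/strides:
# fields = receptive_fields([3] * 13, [2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1])
# k = select_feature_map(60, fields)
```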
Therefore, to meet the multi-scale target requirement, the method divides the samples into 3 scale levels, namely large, medium and small, and selects 3 feature maps of different sizes accordingly. According to the network structure defined in fig. 3, when the input size is 300 the feature map sizes of the upper layers are 38, 19 and 10 respectively (some feature maps share the same size); since the receptive field grows with the layer number, feature map 5, feature map 11 and feature map 13 are finally selected for detecting small, medium and large targets respectively, where feature map 5 is the fusion of the conv5 output with the deconvolved deep feature map, and feature map 11 and feature map 13 are the conv11 and conv13 output feature maps; their receptive fields, calculated with formula (5), are 43, 219 and 315 respectively. For an image with an input resolution of 300 × 300, the correspondence between the size of the feature map used for detection and the target scale is shown in table 1. Feature map 5 is used for detecting small targets; because a low-level feature map is large, many candidate frames are needed to cover the whole feature map, which slows detection, produces a large number of redundant candidate frames and easily causes false detection, so when small targets account for less than 1/4 of the data set the low-level feature map can be discarded.
Table 1 example of feature map selection
(2) Feature fusion
The low-level feature map has high resolution and retains abundant geometric information, so it can locate the target position more accurately; the high-level feature map, after multiple convolution layers, has a deeper level of target abstraction and contains rich semantic information, making it easier to judge the target category. In the SSD detection framework, medium and large targets are detected well, while the detection capability for small targets is weak. For small targets, pixel-level position information is lost after multiple convolution layers, making them difficult to detect; if a low-level feature map is selected instead, semantic information is lacking and a large number of false detections easily result.
The fusion process is shown in fig. 4. A 2 × 2 deconvolution kernel with stride 2 is used to up-sample the 38 × 38 high-semantic-information feature map, and the deconvolution output is batch-normalized and passed through a ReLU layer so that its resolution matches that of the 10 × 10 low-semantic-information feature map; the two feature maps are then concatenated into a multi-channel feature map, features are extracted from the concatenated multi-channel feature map with a multi-channel convolution, and feature fusion is realized with a 3 × 3 × 256 convolution kernel. Because the convolution kernel parameters can be adjusted through back-propagation learning, realizing feature fusion with a multi-channel convolution is more effective than directly adding the feature maps.
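A minimal sketch of the fusion module of fig. 4 is given below, assuming a PyTorch-style implementation in which the deeper high-semantic feature map is up-sampled by a 2 × 2, stride-2 deconvolution and then aligned to the shallower map before concatenation; the channel numbers and the explicit size alignment are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Deconvolve the deep high-semantic feature map (2x2 kernel, stride 2),
    apply BN + ReLU, concatenate it channel-wise with the shallower feature
    map, and fuse with a 3x3 convolution producing 256 channels."""
    def __init__(self, high_channels, low_channels, out_channels=256):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(high_channels, low_channels,
                                         kernel_size=2, stride=2, bias=False)
        self.bn = nn.BatchNorm2d(low_channels)
        self.fuse = nn.Conv2d(low_channels * 2, out_channels, kernel_size=3, padding=1)

    def forward(self, high, low):
        up = F.relu(self.bn(self.deconv(high)))   # 2x up-sampling of the deep map
        if up.shape[-2:] != low.shape[-2:]:       # size alignment (assumption, not in the patent)
            up = F.interpolate(up, size=low.shape[-2:], mode="bilinear", align_corners=False)
        merged = torch.cat([up, low], dim=1)      # channel-wise concatenation
        return F.relu(self.fuse(merged))
```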
The invention adopts skip connections to simplify model operation, reduce complexity and increase the number of output feature layers. The high-level and low-level feature maps are fused, and both position and semantic information are preserved across the channels, so that the contextual semantic information of the network is fully utilized and small target detection performance is improved.
3. Target detection
To detect targets of different scales on feature maps rich in semantic and geometric information, matched candidate frames need to be arranged on the feature maps of different scales, generating prediction tensors of different sizes together with the confidence of the category to which each predicted target belongs and its position information.
(1) Candidate frame design
Candidate frames of different sizes are convolved over the feature map and matched to the corresponding receptive fields, so that all targets in the original image are covered as far as possible and each target can be matched to a candidate frame. The smaller the receptive field of a unit pixel in the feature map, the denser the candidate frames need to be; the larger the detection scale, the sparser the candidate frames. The design of the candidate frames follows 2 principles:
a) the size of the candidate frame should be close to the receptive field of the feature map;
b) the aspect ratio of the candidate frame should be close to the aspect ratio of the target.
one unit in the convolutional neural network has two receptive fields. One is the theoretical receptive field, which represents the input area that theoretically could affect the unit value. However, not every pixel in the theoretical perceptual field contributes equally to the final output. Typically, the central pixel has a larger effect than the outer pixels, i.e. only a small area has an effective effect on the output value, called the effective receptive field. According to this theory, the candidate box should be significantly smaller than the theoretical receptive field to match the effective receptive field. Relevant researches show that as the number of network layers is deepened, the actual effective receptive field is increased in range level, the proportion of the effective receptive field to the theoretical receptive field is reduced according to the level, and referring to relevant experiments, the proportion of the effective receptive field to the theoretical receptive field is 1/3, so that the size of a candidate frame is calculated by adopting a formula (7).
S_k = R_k / 3, k ∈ [1, m] (7)

where R_k is the receptive field size of the k-th layer and m is the total number of feature map layers.
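A one-line illustration of formula (7), applied to the receptive fields 43/219/315 quoted above for feature maps 5/11/13 (the values in the comment are illustrative only):

```python
def candidate_sizes(receptive_fields, ratio=1.0 / 3.0):
    """Formula (7): S_k = R_k / 3 for each selected detection layer."""
    return [r * ratio for r in receptive_fields]

# Using the receptive fields quoted in the text for feature maps 5/11/13:
# candidate_sizes([43, 219, 315]) -> [14.33..., 73.0, 105.0]
```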
For the candidate frame aspect ratio, a set of default aspect ratios is used; for elongated targets such as ships or automobiles, more elongated aspect ratios may be adopted, and for particular data a suitable set of aspect ratios can be chosen by counting the length-width distribution of the samples. After the aspect ratios are determined, the length and width on the original image corresponding to each feature map candidate frame are calculated according to formula (8); when the aspect ratio is 1, two additional sizes, S_k and S'_k = sqrt(S_k · S_{k+1}), are added, so there are 6 candidate frames per feature map.

w_k = S_k · sqrt(r), h_k = S_k / sqrt(r) (8)

where r is the aspect ratio coefficient, and w_k and h_k are respectively the length and width of the k-th layer feature map candidate frame.
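An illustrative sketch of the candidate frame generation implied by formula (8): one box per aspect ratio with width S_k·sqrt(r) and height S_k/sqrt(r), plus the extra size sqrt(S_k·S_{k+1}) when the aspect ratio is 1, giving 6 boxes per feature map cell; the concrete aspect ratio set used here is an assumption.

```python
import math

def default_boxes(s_k, s_k_next, aspect_ratios=(1.0, 2.0, 0.5, 3.0, 1.0 / 3.0)):
    """Formula (8): (w, h) = (S_k * sqrt(r), S_k / sqrt(r)) per aspect ratio,
    plus an extra square box of side sqrt(S_k * S_{k+1}) for r = 1,
    so that each feature map cell has 6 candidate frames."""
    boxes = [(s_k * math.sqrt(r), s_k / math.sqrt(r)) for r in aspect_ratios]
    extra = math.sqrt(s_k * s_k_next)   # additional size used when r = 1
    boxes.append((extra, extra))
    return boxes

# Example: default_boxes(73.0, 105.0) returns 6 (w, h) pairs for one cell.
```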
(2) Training
The training stage mainly establishes the correspondence between the real labels and the candidate frames; candidate frames are selected for the real labels according to 2 matching principles: the candidate frame with the maximum Intersection-over-Union (IoU) with a real target in the image is matched to that target, ensuring that every real target is covered by a candidate frame; a candidate frame covering a real target is a positive sample, and a candidate frame without a matched target is a negative sample; when a candidate frame matches multiple real targets, the target with the largest IoU is taken.
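The matching rules above can be sketched as follows; the box format, tie-breaking details and the absence of an IoU threshold are assumptions rather than statements of the patented method.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(candidates, targets):
    """Return {candidate index: target index}; candidates absent from the
    dict are negative samples. Each real target takes its highest-IoU
    candidate, and a candidate claimed by several targets keeps the
    target with the largest IoU."""
    assignment = {}
    for j, g in enumerate(targets):
        best_i = max(range(len(candidates)), key=lambda i: iou(candidates[i], g))
        if best_i not in assignment or iou(candidates[best_i], g) > iou(
                candidates[best_i], targets[assignment[best_i]]):
            assignment[best_i] = j
    return assignment
```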
The invention trains the model with the loss function of the SSD, which is a weighted sum of the confidence error (conf) and the position error (localization error), as shown in formula (9):

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g)) (9)

where x indicates whether a candidate frame matches the target, 1 for a match and 0 for a mismatch; c is the predicted multi-class target confidence; l is the predicted value of the position of the bounding box corresponding to the candidate frame; g is the position parameter of the real target; N is the number of candidate frames matched to real targets; and α is a parameter adjusting the ratio between the position error and the confidence error, usually set to 1.
For the position error, the Smooth_L1 loss between the real target and the candidate frame is used, defined as follows:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_{ij}^k · smooth_L1(l_i^m - ĝ_j^m)

ĝ_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^w, ĝ_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^h

ĝ_j^w = log(g_j^w / d_i^w), ĝ_j^h = log(g_j^h / d_i^h)

where x_{ij}^k indicates that the i-th candidate frame is matched to the j-th real target of category k; smooth_L1(·) denotes the smoothed L1 norm; i ∈ Pos denotes the i-th positive-sample prediction frame; cx, cy, w and h denote respectively the center coordinates and the width and height of the candidate frame; ĝ is obtained by encoding g; l denotes the offset of the prediction frame relative to the candidate frame; g denotes the real frame; and d denotes the candidate frame.
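A sketch of the encoding of ĝ and of the Smooth_L1 position loss defined above, with the candidate frame d and real frame g given as (cx, cy, w, h) tuples; it follows the SSD convention that the patent references rather than any implementation disclosed here.

```python
import math

def encode(g, d):
    """Encode a real box g relative to a candidate box d (both (cx, cy, w, h)),
    producing the regression target g-hat used in the position loss."""
    return ((g[0] - d[0]) / d[2],
            (g[1] - d[1]) / d[3],
            math.log(g[2] / d[2]),
            math.log(g[3] / d[3]))

def smooth_l1(x):
    """Smoothed L1: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def loc_loss(pred_offsets, g, d):
    """Sum of Smooth_L1 terms over (cx, cy, w, h) for one matched pair."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, encode(g, d)))
```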
The confidence error is the softmax loss over the multi-class confidence c, as shown in formula (10):

L_conf(x, c) = -Σ_{i∈Pos} x_{ij}^p · log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p) (10)

where i ∈ Neg denotes the i-th negative-sample prediction frame; x_{ij}^p is an indicator parameter which, when equal to 1, indicates that the i-th candidate frame matches the j-th real target and that the real target belongs to category p; ĉ_i^p is the predicted probability that the i-th candidate frame matched to the j-th real target of category p belongs to p, and the higher this probability the smaller the loss; ĉ_i^0 is the probability, generated by softmax, that the prediction frame contains no target and is background, and the higher this probability the smaller the loss.
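Finally, a sketch of the confidence (softmax) loss and of the weighted total loss of formula (9), written with plain Python lists for clarity; treating class index 0 as the background class is an assumption.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def conf_loss(pos_logits_labels, neg_logits):
    """-sum over positives of log p(true class) - sum over negatives of
    log p(background), with background assumed to be class index 0."""
    loss = 0.0
    for logits, label in pos_logits_labels:   # positive candidate frames
        loss -= math.log(softmax(logits)[label] + 1e-12)
    for logits in neg_logits:                 # negative candidate frames
        loss -= math.log(softmax(logits)[0] + 1e-12)
    return loss

def total_loss(l_conf, l_loc, num_matched, alpha=1.0):
    """Formula (9): L = (1/N) * (L_conf + alpha * L_loc); returns 0 if N = 0."""
    return (l_conf + alpha * l_loc) / num_matched if num_matched else 0.0
```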
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A remote sensing image multi-scale target detection and identification method based on a lightweight network is characterized by comprising the following steps:
s1, preprocessing the acquired remote sensing image;
S2, replacing the standard convolution operations in the feature extraction trunk and the detection branches with depthwise separable convolutions, and inputting the processed image into a convolutional neural network for feature extraction;
S3, extracting feature maps of different scales from different layers of the convolutional neural network with a feature pyramid for prediction, and fusing the detection results of different layers to obtain a multi-scale fused feature map;
and S4, setting candidate frames on the feature maps, generating prediction tensors, and predicting the confidence of the category to which the target belongs and its position information on the basis of the prediction tensors.
2. The light-weight-network-based remote sensing image multi-scale target detection and identification method according to claim 1, wherein the step S2 specifically includes:
S21, applying a convolution with 32 3 × 3 convolution kernels at the initial layer to increase the number of network layers and improve the feature expression capability;
S22, alternately stacking block1 and block2 modules, wherein each block consists of a depthwise convolution and a pointwise convolution, batch normalization is applied after each convolution followed by a ReLU layer, the depthwise convolution stride of block1 is 1, and the depthwise convolution stride of block2 is 2;
and S23, after feature extraction through 13 convolution blocks, deconvolving the conv13 feature map and fusing it with the conv5 feature map, sending the fused feature map, the conv11 feature map and the conv13 feature map simultaneously into the target detection module for coordinate regression and classification, performing non-maximum suppression on the detection results of the multi-scale feature maps, and screening out the final result to complete feature extraction.
3. The light-weight-network-based remote sensing image multi-scale target detection and identification method according to claim 1, wherein the step S3 specifically includes:
S31, calculating the feature map receptive field

R_k = R_{k-1} + (K_k - 1) · ∏_{i=1}^{k-1} s_i

and selecting an appropriate feature map according to

f(x) = k, where R_{k-1} < x ≤ R_k

where R_k denotes the receptive field size of the k-th layer, the initial receptive field R_0 = 1, K_k denotes the convolution kernel size of the k-th layer, s_k denotes the convolution stride of the k-th layer, m is the total number of feature map layers, x is the sample length, and f(x) is the feature map corresponding to the sample length;
S32, dividing the samples into 3 scale levels, namely large, medium and small, and finally selecting 3 feature maps of different sizes accordingly;
S33, using a 2 × 2 deconvolution kernel with stride 2 to up-sample the 38 × 38 high-semantic-information feature map, applying batch normalization and a ReLU layer to the deconvolution output so that its resolution matches that of the 10 × 10 low-semantic-information feature map, concatenating the two feature maps into one multi-channel feature map, extracting features from the concatenated multi-channel feature map by multi-channel convolution, and realizing feature fusion with a 3 × 3 × 256 convolution kernel.
4. The method for detecting and identifying the remote sensing image multi-scale target based on the lightweight network as claimed in claim 1, wherein the step S4 specifically comprises:
S41, selecting the ratio of the effective receptive field to the theoretical receptive field as 1/3, and calculating the candidate frame size as

S_k = R_k / 3, k ∈ [1, m]

where R_k is the receptive field size of the k-th layer and m is the total number of feature map layers;

determining the aspect ratios of the candidate frames and calculating the size on the original image corresponding to each feature map candidate frame; when the aspect ratio is 1, two additional sizes, S_k and S'_k = sqrt(S_k · S_{k+1}), are added so that there are 6 candidate frames per feature map,

w_k = S_k · sqrt(r), h_k = S_k / sqrt(r)

where r is the aspect ratio coefficient, and w_k and h_k are respectively the length and width of the candidate frame of the k-th layer feature map;
S42, establishing a correspondence between the real labels and the candidate frames, wherein candidate frames are selected for the real labels according to the following matching principles: the candidate frame with the maximum intersection-over-union (IoU) with a real target in the image is matched to that target so that every real target is covered by a candidate frame, a candidate frame covering a real target is a positive sample, and a candidate frame without a matched target is a negative sample; when a candidate frame matches multiple real targets, the target with the largest IoU is selected;
S43, training the model with the loss function of the SSD, which is a weighted sum of the confidence error and the position error:

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))

where x indicates whether a candidate frame matches the target, 1 for a match and 0 for a mismatch; c is the predicted multi-class target confidence; l is the predicted position of the bounding box corresponding to the candidate frame; g is the position parameter of the real target; N is the number of candidate frames matched to real targets; and α is a parameter adjusting the ratio between the position error and the confidence error, usually set to 1;
for the position error, the Smooth_L1 loss between the real target and the candidate frame is used, defined as follows:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_{ij}^k · smooth_L1(l_i^m - ĝ_j^m)

ĝ_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^w, ĝ_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^h

ĝ_j^w = log(g_j^w / d_i^w), ĝ_j^h = log(g_j^h / d_i^h)

where x_{ij}^k indicates that the i-th candidate frame is matched to the j-th real target of category k; smooth_L1(·) denotes the smoothed L1 norm; i ∈ Pos denotes the i-th positive-sample prediction frame; cx, cy, w and h denote respectively the center coordinates and the width and height of the candidate frame; ĝ is obtained by encoding g; l denotes the offset of the prediction frame relative to the candidate frame; g denotes the real frame; and d denotes the candidate frame;
the confidence error is the softmax loss over the multi-class confidence c,

L_conf(x, c) = -Σ_{i∈Pos} x_{ij}^p · log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)

where i ∈ Neg denotes the i-th negative-sample prediction frame; x_{ij}^p is an indicator parameter which, when equal to 1, indicates that the i-th candidate frame matches the j-th real target and that the real target belongs to category p; ĉ_i^p is the predicted probability that the i-th candidate frame matched to the j-th real target of category p belongs to p, and the higher this probability the smaller the loss; ĉ_i^0 is the probability, generated by softmax, that the prediction frame contains no target and is background, and the higher this probability the smaller the loss.
CN202111388223.6A 2021-11-22 2021-11-22 Remote sensing image multi-scale target detection and identification method based on lightweight network Pending CN114170526A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111388223.6A CN114170526A (en) 2021-11-22 2021-11-22 Remote sensing image multi-scale target detection and identification method based on lightweight network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111388223.6A CN114170526A (en) 2021-11-22 2021-11-22 Remote sensing image multi-scale target detection and identification method based on lightweight network

Publications (1)

Publication Number Publication Date
CN114170526A true CN114170526A (en) 2022-03-11

Family

ID=80480075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111388223.6A Pending CN114170526A (en) 2021-11-22 2021-11-22 Remote sensing image multi-scale target detection and identification method based on lightweight network

Country Status (1)

Country Link
CN (1) CN114170526A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023173552A1 (en) * 2022-03-15 2023-09-21 平安科技(深圳)有限公司 Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN115223017A (en) * 2022-05-31 2022-10-21 昆明理工大学 Multi-scale feature fusion bridge detection method based on depth separable convolution
CN115223017B (en) * 2022-05-31 2023-12-19 昆明理工大学 Multi-scale feature fusion bridge detection method based on depth separable convolution


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination