CN117994503A - Heterogeneous remote sensing image target detection method based on iterative fusion - Google Patents

Heterogeneous remote sensing image target detection method based on iterative fusion

Info

Publication number
CN117994503A
Authority
CN
China
Prior art keywords
network
feature
remote sensing
convolution
sensing image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410226535.4A
Other languages
Chinese (zh)
Inventor
贺霖
梁文瑞
李颖琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202410226535.4A priority Critical patent/CN117994503A/en
Publication of CN117994503A publication Critical patent/CN117994503A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a heterogeneous remote sensing image target detection method based on iterative fusion, which comprises: reading input heterogeneous remote sensing images to obtain an image dataset; constructing a heterogeneous remote sensing image target detection network; setting network hyperparameters and initializing the weight and bias of each convolution in the convolutional neural network; inputting the training set of visible light and near infrared images into the heterogeneous remote sensing image target detection network, and obtaining through forward propagation a detection prediction result closer to the real situation; and loading the optimal weights and biases into the heterogeneous remote sensing image target detection network to obtain the optimal heterogeneous remote sensing image target detection network. Features rich in semantic and detail information are extracted by the scale-transformation-based feature enhancement module and the coupled adaptive feature sampling module, and their further extraction and fusion are completed by the dual-branch iterative fusion feature extraction network, effectively overcoming the insufficient expressive capability of single-source image scene information and improving the detection effect.

Description

Heterogeneous remote sensing image target detection method based on iterative fusion
Technical Field
The invention relates to the field of remote sensing image recognition, in particular to a heterogeneous remote sensing image target detection method based on iterative fusion.
Background
With the rapid development of deep learning, target detection methods based on deep convolutional networks have become the mainstream in the field of remote sensing image target detection. Supported by large numbers of data samples, such methods can accurately locate and identify targets by fully learning the shared characteristics among samples of the same class and the distinguishing characteristics among samples of different classes.
The usual target detection framework based on deep convolutional networks can be divided into three processes: trunk (backbone) feature extraction, neck feature enhancement, and head classification regression. Trunk feature extraction combines multiple convolutions to convert the pixel-level information of an image from the visual level into features carrying semantic information that a computer can distinguish and understand; neck feature enhancement further improves the discriminability of the features through multi-scale fusion, information weighting and similar means, improving robustness to scene changes; head classification regression, as the final process, converts the computer-discriminable features into the quantities people care about, completing the conversion from computer vision to human vision. This framework essentially sets the paradigm for target detection, and likewise sets the fundamental direction for improving detection performance.
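By way of illustration, this three-stage pipeline can be summarized in a few lines of PyTorch; the module interfaces below are generic placeholders assumed for the sketch, not the concrete networks disclosed later in this document.

```python
import torch.nn as nn

class Detector(nn.Module):
    """Generic three-stage detector sketch: backbone -> neck -> head."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # trunk feature extraction: pixels -> semantic features
        self.neck = neck          # neck feature enhancement: multi-scale fusion, re-weighting
        self.head = head          # head classification regression: features -> classes, boxes

    def forward(self, image):
        features = self.backbone(image)
        enhanced = self.neck(features)
        classes, boxes = self.head(enhanced)  # head assumed to return a (classes, boxes) pair
        return classes, boxes
```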
In recent years, with richer data types and growing sample numbers, the requirements on remote sensing image target detection performance have kept rising, while the traditional framework, limited by the scarce information of a single-source input image, sees its detection performance gradually approach an upper bound. In the face of complex background interference, low imaging quality and other adverse conditions, it is clearly very difficult to achieve the expected detection results relying only on the information contained in the red, green and blue bands of a conventional single-source visible light image. When target occlusion, target camouflage and similar situations deprive the model of target surface characteristics, this lack of information makes feature learning a struggle for the detection model, ultimately and frequently causing false alarms and missed detections.
Disclosure of Invention
Aiming at the shortcomings of the existing single-source remote sensing image target detection technology, and in order to fully mine scene information from abundant data, the invention provides a heterogeneous remote sensing image target detection method based on iterative fusion.
The invention adopts the following technical scheme:
A heterogeneous remote sensing image target detection method based on iterative fusion comprises the following steps:
Reading an input heterogeneous remote sensing image to obtain an image dataset, wherein the heterogeneous remote sensing image specifically refers to a visible light image and a near infrared image which are derived from different sensors and represent the same scene;
Constructing a heterogeneous remote sensing image target detection network, wherein the heterogeneous remote sensing image target detection network comprises a dual-branch iterative fusion feature extraction network, a feature enhancement network based on scale transformation and a coupled adaptive feature sampling network;
Setting network hyperparameters, and initializing the weight and bias of each convolution in the convolutional neural network;
Based on the constructed training set, inputting the visible light and near infrared image training sets into the heterogeneous remote sensing image target detection network, and obtaining through forward propagation a detection prediction result closer to the real situation;
Calculating errors of the predicted result and the real result, and optimizing until the errors are converged to a minimum value or reach preset training times to obtain optimal weights and biases;
And loading the optimal weights and biases into the heterogeneous remote sensing image target detection network to obtain the optimal heterogeneous remote sensing image target detection network.
Further, after reading the input heterogeneous remote sensing image, the method further comprises a preprocessing step, specifically:
Resetting the image size of the input visible light image and near infrared image to the hyperparameter-set value; converting the color space from RGB to HSV; and performing mosaic data enhancement, mixed (mixup) data enhancement and random order shuffling.
Further, the dual-branch iterative fusion feature extraction network specifically includes:
The feature extraction sub-network, which comprises a visible light feature extraction sub-network and a near infrared feature extraction sub-network of identical structure, and processes the input visible light image and near infrared image in parallel to obtain the preliminarily extracted features {f_co, f_ir};
A primary feature fusion network, which splices the preliminarily extracted features {f_co, f_ir} in the channel dimension to obtain a feature map C1, and then fuses it to obtain the fused feature output F1 of the primary feature fusion network;
A secondary feature fusion network, which splices and convolves F1 with f_co and f_ir respectively in the channel dimension, splices the processing results in the channel dimension again to obtain a spliced feature map C2, and then fuses it to obtain the fused feature output F2 of the secondary feature fusion network;
A three-level feature fusion network, which splices and convolves F2 with f_co and f_ir respectively in the channel dimension to obtain a spliced feature map C3, then fuses it to obtain the fused feature output, and performs a spatial pyramid pooling operation to obtain the output F3 of the three-level feature fusion network.
Further, the feature extraction sub-network comprises a three-layer convolution-batch normalization-activation function structure;
the primary feature fusion network comprises a splicing layer and five layers of convolution-batch normalization-activation function structures which are stacked in parallel;
The secondary feature fusion network comprises a splicing layer and five layers of convolution-batch normalization-activation function structures which are stacked in parallel;
the three-level feature fusion network comprises a splicing layer, five layers of convolution-batch normalization-activation function structures which are stacked in parallel and a spatial pyramid pooling layer.
Further, the feature enhancement network based on scale transformation includes:
A detail-preserving branch, which comprises a 1-layer convolution-batch normalization-activation function structure, wherein the convolution kernel size is 3×3, the stride is set to 1 and the activation function is SiLU; the process can be expressed as f_d = Conv(x), where x is the input of the branch and f_d is the output of the detail-preserving branch;
A semantic extraction branch, which comprises a sub-pixel upsampling structure and a 1-layer convolution-batch normalization-activation function structure;
For the input feature x of the branch, sub-pixel upsampling is first performed, expanding the feature spatial size to 2 times the input by filling in pixels from the 4 adjacent channels; the upsampled result then passes through a convolution-batch normalization-activation function structure with a 3×3 convolution kernel and stride 2, readjusting the feature spatial size to match the input, and f_s is the output of the semantic extraction branch;
A fused attention enhancement module, which comprises a splicing-convolution fusion layer and an attention enhancement layer; for the output f_d of the detail-preserving branch and the output f_s of the semantic extraction branch, the two are first fused through channel-dimension splicing and convolution, a channel attention mechanism is then introduced to adjust their channel number and channel weights, and finally the input feature is connected with the output of the attention mechanism through a residual connection to obtain the output F_out of the fused attention enhancement module.
Further, the coupled adaptive feature sampling network comprises an adaptive feature downsampling module and an adaptive feature upsampling module.
Further, the method comprises the steps of,
The adaptive feature downsampling module comprises max pooling, stride convolution and Focus downsampling structures;
the adaptive feature upsampling module includes bicubic interpolation, deconvolution, and sub-pixel upsampling structures.
Further, in calculating the errors between the predicted results and the real results, cross entropy loss and mean square error loss are selected as the loss functions for classification and localization, respectively.
Further, the optimization was performed using a gradient descent method.
Further, dataset samples are randomly drawn from the image dataset at an 8:2 ratio to construct a training set and a test set, respectively.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention designs a dual-branch iterative fusion feature extraction network structure to extract and fuse features from the heterogeneous remote sensing image input. Through multi-level iterative fusion, the structure makes full use of the consistency and complementarity of the heterogeneous features, thereby effectively capturing target features in complex scenes.
2. The invention designs a feature enhancement network based on scale transformation to enhance features during feature extraction. By feeding the input features into a detail-preserving branch and a semantic extraction branch, the module enhances the semantic information of the extracted features while preserving the richness of detail information, alleviating the loss of detail during feature extraction and enriching the information the features can carry.
3. The invention designs the coupled adaptive feature sampling module, giving the sampling method greater universality and adaptability to different occasions and scenes. By integrating multiple types of sampling methods and letting the network model assign adaptive weights to them in a data-driven way, the module enhances the universality of the sampling method.
4. Based on the dual-branch iterative fusion feature extraction network structure, the scale-transformation feature enhancement network and the coupled adaptive feature sampling network, the invention constructs a heterogeneous remote sensing image target detection framework based on iterative fusion. The framework can effectively exploit the features of heterogeneous image inputs, overcome false alarms and missed detections of targets under complex conditions or strong interference, and effectively improve detection performance. The framework is also easy to extend: as techniques evolve or data types grow richer, more convenient and effective methods can be integrated into the coupled adaptive feature sampling module, or the framework can be updated to suit target detection tasks on images from more sensor sources.
Drawings
FIG. 1 is a workflow diagram of the present invention;
FIG. 2 is a schematic diagram of a dual-branch iterative fusion feature extraction network in accordance with an embodiment of the present invention;
FIG. 3 is a block diagram of a feature enhancement network for scale transformation in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of a coupled adaptive feature sampling network according to an embodiment of the present invention;
FIG. 5 is a block diagram of an embodiment of the present invention;
FIG. 6 (a) shows the result of target detection using the Faster R-CNN method according to the embodiment of the invention;
FIG. 6 (b) is a diagram showing the result of detecting an object using the SSD method according to an embodiment of the present invention;
FIG. 6 (c) is a graph showing the result of detecting targets by CenterNet according to an embodiment of the present invention;
FIG. 6 (d) is a graph showing the result of detecting targets using the YOLOv L method according to an embodiment of the present invention;
FIG. 6 (e) shows the result of target detection using the present method;
Fig. 6 (f) shows the real result corresponding to the picture object.
Detailed Description
The present invention will be described in further detail with reference to examples, but embodiments of the present invention are not limited thereto.
As shown in FIGS. 1 to 5, the present embodiment provides a heterogeneous remote sensing image target detection method based on iterative fusion, which includes the following steps:
Step 1, reading an input heterogeneous remote sensing image to obtain an image data set, wherein the heterogeneous remote sensing image specifically refers to a visible light image and a near infrared image which are derived from different sensors and represent the same scene.
Reading visible light images and near infrared images from different sensors and preprocessing the visible light images and the near infrared images to form an image data set;
Two heterogeneous remote sensing image datasets are used, namely the VEDAI and CAMEL datasets; the input images are proportionally resized to 320×320 for the CAMEL dataset and to 512×512 for the VEDAI dataset.
Step 2, selecting samples to construct the training, validation and test sets according to (training set + validation set) : test set = 8:2 and training set : validation set = 9:1, and converting the original label files into YOLO format.
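A minimal sketch of this split, assuming the samples are held in a flat list of identifiers and that the random seed is free to choose:

```python
import random

def split_dataset(samples, seed=0):
    """(train+val) : test = 8:2, then train : val = 9:1, as stated above."""
    rng = random.Random(seed)  # seed is an illustrative assumption
    samples = list(samples)
    rng.shuffle(samples)
    n_test = int(0.2 * len(samples))
    test, trainval = samples[:n_test], samples[n_test:]
    n_val = int(0.1 * len(trainval))
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test
```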
The following steps 3, 4 and 5 form the heterogeneous remote sensing image target detection network.
Step 3, designing a dual-branch iterative fusion feature extraction network;
Step 4, designing a feature enhancement network based on scale transformation;
Step 5, designing a coupled adaptive feature sampling network;
Step 6, setting network hyperparameters, and initializing the weight and bias of each convolution in the convolutional neural network;
Specifically, 3×3 convolutions with stride 1 are uniformly used to extract the semantic features of the image; wherever a convolution performs feature map scale transformation, the stride is uniformly set to 2. Trunk feature extraction parameters are taken from a network model trained on the VOC dataset and migrated to the identically structured network to be trained; the remaining weights and biases are initialized with zero-mean Gaussian values.
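A sketch of this initialization in PyTorch; the Gaussian standard deviation and the strict=False partial loading of the VOC-pretrained backbone are assumptions the patent does not spell out:

```python
import torch
import torch.nn as nn

def init_weights(model, backbone_ckpt=None, std=0.01):
    """Zero-mean Gaussian init for all conv weights and biases, then migrate
    VOC-pretrained backbone parameters into the identically named layers."""
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.normal_(m.weight, mean=0.0, std=std)  # std is an assumption
            if m.bias is not None:
                nn.init.normal_(m.bias, mean=0.0, std=std)
    if backbone_ckpt is not None:
        state = torch.load(backbone_ckpt, map_location="cpu")
        model.load_state_dict(state, strict=False)  # only matching layers migrate
```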
Step 7, based on the constructed training set, the training samples of the visible light and the near infrared images are simultaneously input into a heterogeneous remote sensing image target detection network based on iterative fusion, and a prediction result is obtained through forward propagation;
Step 8, selecting the cross entropy loss and the mean square error loss as the loss functions for classification and localization respectively, calculating the errors between the predicted results and the real results, and reducing the difference between them using gradient descent. Steps 7 and 8 are repeated until the error converges to a minimum or the preset number of training iterations is reached, and the resulting optimal weights and biases are saved;
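A sketch of the training loop of steps 7 and 8, assuming a model that takes the visible and near-infrared tensors and returns classification and box predictions, a train_loader yielding aligned pairs with targets, an unweighted sum of the two losses, and illustrative SGD settings:

```python
import torch
import torch.nn as nn

cls_loss_fn = nn.CrossEntropyLoss()  # classification loss
loc_loss_fn = nn.MSELoss()           # localization (regression) loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(max_epochs):      # or until the error stops improving
    for rgb, nir, cls_target, box_target in train_loader:
        cls_pred, box_pred = model(rgb, nir)          # step 7: forward propagation
        loss = cls_loss_fn(cls_pred, cls_target) \
             + loc_loss_fn(box_pred, box_target)      # unweighted sum (assumption)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                              # step 8: gradient descent update
```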
Step 9, reloading the optimal parameters into the network structure; based on the constructed test set, the visible light and near infrared test samples are simultaneously input into the network, and the predicted target categories and positions in the test samples are output.
Further, the specific process of step 1 is: the input visible light image and near infrared image are resized to the hyperparameter-set value, the color space is converted from RGB to HSV, and mosaic data enhancement, mixed (mixup) data enhancement and random order shuffling are applied to the input images;
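A sketch of this preprocessing in Python with OpenCV; single-channel NIR loading and the mixup Beta parameter are assumptions, and mosaic augmentation is omitted for brevity:

```python
import cv2
import numpy as np

def preprocess_pair(rgb_path, nir_path, size):
    """Resize both modalities to the hyperparameter-set size and convert the
    visible image from RGB to HSV."""
    rgb = cv2.cvtColor(cv2.imread(rgb_path), cv2.COLOR_BGR2RGB)
    nir = cv2.imread(nir_path, cv2.IMREAD_GRAYSCALE)  # single-channel NIR (assumption)
    rgb = cv2.resize(rgb, (size, size))
    nir = cv2.resize(nir, (size, size))
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
    return hsv, nir

def mixup(img_a, img_b, alpha=0.5):
    """Mixed (mixup) data enhancement: a convex combination of two samples;
    the Beta(alpha, alpha) parameter is an assumption."""
    lam = np.random.beta(alpha, alpha)
    return (lam * img_a + (1.0 - lam) * img_b).astype(img_a.dtype)
```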
Further, the specific process of step 2 is: dataset samples are randomly drawn from the image dataset at an 8:2 ratio to construct a training set and a test set, respectively.
Further, the structure of the dual-branch iterative fusion feature extraction network in the step 3 is as follows:
The feature extraction sub-network comprises a visible light feature extraction sub-network and a near infrared feature extraction sub-network. The two sub-networks have the same structure and process the input visible light image and near infrared image in parallel; each contains a 3-layer convolution-batch normalization-activation function structure. The input visible light and near infrared images {x_co, x_ir} pass through the visible light and near infrared feature extraction sub-networks respectively to obtain the preliminarily extracted features {f_co, f_ir}, i.e. f_co = Φ_co(x_co) and f_ir = Φ_ir(x_ir), where Φ_co and Φ_ir denote the extraction processes of the visible light and near infrared feature extraction sub-networks, respectively.
The primary feature fusion network comprises a splicing layer and a 5-layer parallel-stacked convolution-batch normalization-activation function structure. The input visible light and near infrared preliminary features {f_co, f_ir} are spliced in the channel dimension to obtain a spliced feature map C1, and the fused feature output is then obtained through the 5 parallel-stacked convolution-batch normalization-activation function layers. The operation can be expressed as F1 = Conv3(Conv2(Conv1(C1))) + Conv2(Conv1(C1)) + Conv1(C1), where Conv1 denotes a 1-layer convolution-batch normalization-activation function structure, Conv2 denotes a 2-layer structure, Conv3 denotes a further 2-layer structure (5 layers in total), '+' denotes channel splicing, and F1 is the fused feature output of the primary feature fusion network.
The secondary feature fusion network comprises a splicing layer and a 5-layer parallel-stacked convolution-batch normalization-activation function structure. The output F1 of the primary feature fusion network is spliced and convolved in the channel dimension with f_co and f_ir respectively; the operation can be expressed as m_co = Conv(F1 + f_co) and m_ir = Conv(F1 + f_ir), i.e. the results of the splicing-and-convolution processing. These processing results are spliced in the channel dimension again to obtain a spliced feature map C2, and the fused feature output is then obtained through the 5 parallel-stacked convolution-batch normalization-activation function layers. The operation can be expressed as F2 = Conv3(Conv2(Conv1(C2))) + Conv2(Conv1(C2)) + Conv1(C2), where Conv1 denotes a 1-layer convolution-batch normalization-activation function structure, Conv2 denotes a 2-layer structure, Conv3 denotes a further 2-layer structure (5 layers in total), '+' denotes channel splicing, and F2 is the fused feature output of the secondary feature fusion network.
The three-level feature fusion network comprises a splicing layer, a 5-layer parallel-stacked convolution-batch normalization-activation function structure and a spatial pyramid pooling layer. The output F2 of the secondary feature fusion network is spliced and convolved in the channel dimension with f_co and f_ir respectively, and the processing results are spliced in the channel dimension again to obtain a spliced feature map C3. The fused feature output is then obtained through the 5 parallel-stacked convolution-batch normalization-activation function layers, and finally a spatial pyramid pooling operation yields the output F3 of the three-level feature fusion network. The operation can be expressed as F3 = F_spp(Conv3(Conv2(Conv1(C3))) + Conv2(Conv1(C3)) + Conv1(C3)), where Conv1 denotes a 1-layer convolution-batch normalization-activation function structure, Conv2 denotes a 2-layer structure, Conv3 denotes a further 2-layer structure (5 layers in total), '+' denotes channel splicing, and F_spp denotes the spatial pyramid pooling process.
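The fusion stages above can be sketched in PyTorch as follows; this is a reading of the description, with channel widths and the ConvBNAct defaults as assumptions rather than disclosed values:

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Sequential):
    """One 'convolution - batch normalization - activation function' layer."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

class FusionBlock(nn.Module):
    """Five stacked ConvBNAct layers with spliced shortcuts:
    F = Conv3(Conv2(Conv1(C))) ++ Conv2(Conv1(C)) ++ Conv1(C),
    where '++' is channel splicing, Conv1 is 1 layer and
    Conv2/Conv3 are 2 layers each (5 layers in total)."""
    def __init__(self, c_in, c):
        super().__init__()
        self.conv1 = ConvBNAct(c_in, c)
        self.conv2 = nn.Sequential(ConvBNAct(c, c), ConvBNAct(c, c))
        self.conv3 = nn.Sequential(ConvBNAct(c, c), ConvBNAct(c, c))

    def forward(self, x):
        y1 = self.conv1(x)
        y2 = self.conv2(y1)
        y3 = self.conv3(y2)
        return torch.cat([y3, y2, y1], dim=1)  # fused feature output

# Stage wiring sketch: C1 = f_co ++ f_ir; F1 = FusionBlock(C1);
# C2 = Conv(F1 ++ f_co) ++ Conv(F1 ++ f_ir); F2 = FusionBlock(C2); likewise for
# C3, with spatial pyramid pooling appended after the third stage.
```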
Further, the feature enhancement network structure based on the scale transformation in the step 4 is as follows:
The detail-preserving branch. It contains a 1-layer convolution-batch normalization-activation function structure, where the convolution kernel size is 3×3, the stride is set to 1, and the activation function is SiLU. The process can be expressed as f_d = Conv(x), where x is the input of the branch and f_d is the output of the detail-preserving branch.
The semantic extraction branch. It comprises a sub-pixel upsampling structure and a 1-layer convolution-batch normalization-activation function structure. For the input feature x of the branch, sub-pixel upsampling first expands the feature spatial size to 2 times the input by filling in pixels from the 4 adjacent channels. The upsampled result then passes through a convolution-batch normalization-activation function structure with a 3×3 convolution kernel and stride 2, readjusting the feature spatial size to match the input. The operation can be expressed as f_s = Conv_{s=2}(F_subpix(x)), where Conv_{s=2} is the convolution-batch normalization-activation function processing with stride 2, F_subpix is the sub-pixel upsampling process, and f_s is the output of the semantic extraction branch.
The fused attention enhancement module. It includes a splicing-convolution fusion layer and an attention enhancement layer. The output f_d of the detail-preserving branch and the output f_s of the semantic extraction branch are first fused through channel-dimension splicing and convolution; a channel attention mechanism is then introduced to adjust their channel number and channel weights; finally, the input feature is connected with the output of the attention mechanism through a residual connection to obtain the output of the fused attention enhancement module. The operation can be expressed as F_out = F_attention(Conv(f_d + f_s)) + x, where F_out is the output of the fused attention enhancement module and F_attention is the channel attention process.
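A sketch of this enhancement module, reusing the ConvBNAct helper from the fusion sketch above; the SE-style channel attention and the reduction ratio r are assumptions, since the patent only names a channel attention mechanism:

```python
import torch
import torch.nn as nn

class ScaleTransformEnhance(nn.Module):
    """Detail-preserving branch, semantic extraction branch (sub-pixel
    upsampling then a stride-2 conv), and a fused channel-attention output
    with a residual connection: F_out = F_attention(Conv(f_d ++ f_s)) + x.
    Requires the channel count c to be divisible by 4."""
    def __init__(self, c, r=16):
        super().__init__()
        self.detail = ConvBNAct(c, c, k=3, s=1)          # f_d = Conv(x)
        self.subpix = nn.PixelShuffle(2)                 # 2x spatial size, c/4 channels
        self.sem_conv = ConvBNAct(c // 4, c, k=3, s=2)   # back to the input size
        self.fuse = ConvBNAct(2 * c, c, k=1, s=1)        # splicing + convolution
        self.attn = nn.Sequential(                       # SE-style attention (assumption)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // r, 1), nn.SiLU(),
            nn.Conv2d(c // r, c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        f_d = self.detail(x)
        f_s = self.sem_conv(self.subpix(x))              # f_s = Conv_{s=2}(F_subpix(x))
        f = self.fuse(torch.cat([f_d, f_s], dim=1))
        return f * self.attn(f) + x                      # residual connection
```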
Further, the structure of the coupled adaptive feature sampling network in the step 5 is as follows:
The adaptive feature downsampling module. It includes max pooling, stride convolution and Focus downsampling structures. A feature to be downsampled is processed in 3 branches using max pooling, stride convolution and Focus downsampling respectively, so that the feature size obtained in each branch is reduced to 1/2 of the input. The features of each branch then undergo convolution-batch normalization-activation function processing to enhance feature extraction and complete channel adjustment. Finally, the features of the branches are spliced in the channel dimension and convolved to obtain the output of the adaptive feature downsampling module. The operation can be expressed as F_down = Conv(F_maxpool(x) + Conv_{s=2}(x) + F_Focus(x)), where F_maxpool denotes the max pooling process, Conv_{s=2} is the stride convolution, F_Focus is the Focus downsampling process, and F_down is the output of the adaptive feature downsampling module.
The adaptive feature upsampling module. It includes bicubic interpolation, deconvolution and sub-pixel upsampling structures. A feature to be upsampled is processed in 3 branches using bicubic interpolation, deconvolution and sub-pixel upsampling respectively, so that the feature size obtained in each branch is expanded to 2 times the input. The features of each branch then undergo convolution-batch normalization-activation function processing to enhance feature extraction and complete channel adjustment. Finally, the features of the branches are spliced in the channel dimension and convolved to obtain the output of the adaptive feature upsampling module. The operation can be expressed as F_up = Conv(F_Bicubic(x) + DeConv(x) + F_subpix(x)), where F_Bicubic denotes the bicubic interpolation process, DeConv is the deconvolution, F_subpix is the sub-pixel upsampling process, and F_up is the output of the adaptive feature upsampling module.
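A sketch of both sampling modules, again reusing ConvBNAct; the per-branch channel widths and the PixelUnshuffle-based Focus implementation are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveDownsample(nn.Module):
    """F_down = Conv(F_maxpool(x) ++ Conv_{s=2}(x) ++ F_Focus(x)): each branch
    halves the spatial size and gets a ConvBNAct for channel adjustment."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.pool_adj = ConvBNAct(c_in, c_out, k=1, s=1)
        self.stride_conv = ConvBNAct(c_in, c_out, k=3, s=2)
        self.focus = nn.PixelUnshuffle(2)                 # Focus as space-to-depth
        self.focus_adj = ConvBNAct(4 * c_in, c_out, k=1, s=1)
        self.fuse = ConvBNAct(3 * c_out, c_out, k=1, s=1)

    def forward(self, x):
        b1 = self.pool_adj(F.max_pool2d(x, kernel_size=2))
        b2 = self.stride_conv(x)
        b3 = self.focus_adj(self.focus(x))
        return self.fuse(torch.cat([b1, b2, b3], dim=1))

class AdaptiveUpsample(nn.Module):
    """F_up = Conv(F_Bicubic(x) ++ DeConv(x) ++ F_subpix(x)): each branch
    doubles the spatial size; c_in must be divisible by 4."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.bicubic_adj = ConvBNAct(c_in, c_out, k=1, s=1)
        self.deconv = nn.ConvTranspose2d(c_in, c_in, kernel_size=2, stride=2)
        self.deconv_adj = ConvBNAct(c_in, c_out, k=1, s=1)
        self.subpix = nn.PixelShuffle(2)
        self.subpix_adj = ConvBNAct(c_in // 4, c_out, k=1, s=1)
        self.fuse = ConvBNAct(3 * c_out, c_out, k=1, s=1)

    def forward(self, x):
        b1 = self.bicubic_adj(F.interpolate(x, scale_factor=2, mode="bicubic"))
        b2 = self.deconv_adj(self.deconv(x))
        b3 = self.subpix_adj(self.subpix(x))
        return self.fuse(torch.cat([b1, b2, b3], dim=1))
```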
Specifically, this embodiment uses VEDAI, a visible/near-infrared image target dataset for vehicle detection made from Utah AGRC satellite photography, for experimental verification. The dataset comprises 1246 scene images covering 9 target classes; each image has a resolution of 512×512, each pixel corresponds to a 25 cm × 25 cm area in the real scene, and the visible light and near infrared images share the same content, for a total of 2492 images.
Fig. 6 (a) shows the target detection result of the Faster R-CNN method, fig. 6 (b) of the SSD method, fig. 6 (c) of the CenterNet method, fig. 6 (d) of the YOLOv L method, fig. 6 (e) of the method described in this embodiment, and fig. 6 (f) the real result corresponding to the picture targets. As the figures show, whether it is the classical two-stage detection method Faster R-CNN, the single-stage detection method SSD, or the currently well-regarded YOLOv L method, all exhibit some degree of false alarms, missed detections and misclassifications when detecting small targets in complex scenes; the method of this embodiment achieves the best detection effect among all the methods, accurately detecting, identifying and locating the targets present in the real result, and significantly improves the detection effect.
The embodiments described above are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principles of the present invention shall be an equivalent replacement and is included within the scope of the present invention.

Claims (10)

1. The heterogeneous remote sensing image target detection method based on iterative fusion is characterized by comprising the following steps of:
Reading an input heterogeneous remote sensing image to obtain an image dataset, wherein the heterogeneous remote sensing image specifically refers to a visible light image and a near infrared image which are derived from different sensors and represent the same scene;
Constructing a heterogeneous remote sensing image target detection network, wherein the heterogeneous remote sensing image target detection network comprises a dual-branch iterative fusion feature extraction network, a feature enhancement network based on scale transformation and a coupled adaptive feature sampling network;
Setting network hyperparameters, and initializing the weight and bias of each convolution in the convolutional neural network;
Based on the constructed training set, inputting the visible light and near infrared image training sets into the heterogeneous remote sensing image target detection network, and obtaining through forward propagation a detection prediction result closer to the real situation;
Calculating errors of the predicted result and the real result, and optimizing until the errors are converged to a minimum value or reach preset training times to obtain optimal weights and biases;
And loading the optimal weights and biases into the heterogeneous remote sensing image target detection network to obtain the optimal heterogeneous remote sensing image target detection network.
2. The method for detecting a target of a heterogeneous remote sensing image according to claim 1, wherein the method further comprises a preprocessing step after reading the input heterogeneous remote sensing image, specifically:
Resetting the image size of the input visible light image and near infrared image to the hyperparameter-set value; converting the color space from RGB to HSV; and performing mosaic data enhancement, mixed (mixup) data enhancement and random order shuffling.
3. The heterogeneous remote sensing image target detection method according to claim 1, wherein the dual-branch iterative fusion feature extraction network specifically comprises:
The feature extraction sub-network, which comprises a visible light feature extraction sub-network and a near infrared feature extraction sub-network of identical structure, and processes the input visible light image and near infrared image in parallel to obtain the preliminarily extracted features {f_co, f_ir};
A primary feature fusion network, which splices the preliminarily extracted features {f_co, f_ir} in the channel dimension to obtain a feature map C1, and then fuses it to obtain the fused feature output F1 of the primary feature fusion network;
A secondary feature fusion network, which splices and convolves F1 with f_co and f_ir respectively in the channel dimension, splices the processing results in the channel dimension again to obtain a spliced feature map C2, and then fuses it to obtain the fused feature output F2 of the secondary feature fusion network;
A three-level feature fusion network, which splices and convolves F2 with f_co and f_ir respectively in the channel dimension to obtain a spliced feature map C3, then fuses it to obtain the fused feature output, and performs a spatial pyramid pooling operation to obtain the output F3 of the three-level feature fusion network.
4. The method for detecting a target in a heterogeneous remote sensing image according to claim 3, wherein the feature extraction sub-network comprises a three-layer convolution-batch normalization-activation function structure;
the primary feature fusion network comprises a splicing layer and five layers of convolution-batch normalization-activation function structures which are stacked in parallel;
The secondary feature fusion network comprises a splicing layer and five layers of convolution-batch normalization-activation function structures which are stacked in parallel;
the three-level feature fusion network comprises a splicing layer, five layers of convolution-batch normalization-activation function structures which are stacked in parallel and a spatial pyramid pooling layer.
5. The method for detecting a target in a heterogeneous remote sensing image according to claim 1, wherein the feature enhancement network based on scale transformation comprises:
A detail-preserving branch, which comprises a 1-layer convolution-batch normalization-activation function structure, wherein the convolution kernel size is 3×3, the stride is set to 1 and the activation function is SiLU; the process can be expressed as f_d = Conv(x), where x is the input of the branch and f_d is the output of the detail-preserving branch;
A semantic extraction branch, which comprises a sub-pixel upsampling structure and a 1-layer convolution-batch normalization-activation function structure;
For the input feature x of the branch, sub-pixel upsampling is first performed, expanding the feature spatial size to 2 times the input by filling in pixels from the 4 adjacent channels; the upsampled result then passes through a convolution-batch normalization-activation function structure with a 3×3 convolution kernel and stride 2, readjusting the feature spatial size to match the input, and f_s is the output of the semantic extraction branch;
A fused attention enhancement module, which comprises a splicing-convolution fusion layer and an attention enhancement layer; for the output f_d of the detail-preserving branch and the output f_s of the semantic extraction branch, the two are first fused through channel-dimension splicing and convolution, a channel attention mechanism is then introduced to adjust their channel number and channel weights, and finally the input feature is connected with the output of the attention mechanism through a residual connection to obtain the output F_out of the fused attention enhancement module.
6. The method for detecting a target of a heterogeneous remote sensing image according to claim 1, wherein the coupled adaptive feature sampling network comprises an adaptive feature downsampling module and an adaptive feature upsampling module.
7. The method for detecting a target in a remote sensing image according to claim 6,
The adaptive feature downsampling module comprises max pooling, stride convolution and Focus downsampling structures;
the adaptive feature upsampling module includes bicubic interpolation, deconvolution, and sub-pixel upsampling structures.
8. The method according to any one of claims 1-7, wherein, in calculating the errors between the predicted results and the real results, cross entropy loss and mean square error loss are selected as the loss functions for classification and localization, respectively.
9. The method of claim 1, wherein the optimization is performed using a gradient descent method.
10. The method of claim 1, wherein dataset samples are randomly drawn from the image dataset at an 8:2 ratio to construct a training set and a test set, respectively.
CN202410226535.4A 2024-02-29 2024-02-29 Heterogeneous remote sensing image target detection method based on iterative fusion Pending CN117994503A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410226535.4A CN117994503A (en) 2024-02-29 2024-02-29 Heterogeneous remote sensing image target detection method based on iterative fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410226535.4A CN117994503A (en) 2024-02-29 2024-02-29 Heterogeneous remote sensing image target detection method based on iterative fusion

Publications (1)

Publication Number Publication Date
CN117994503A true 2024-05-07

Family

ID=90898981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410226535.4A Pending CN117994503A (en) 2024-02-29 2024-02-29 Heterogeneous remote sensing image target detection method based on iterative fusion

Country Status (1)

Country Link
CN (1) CN117994503A (en)

Similar Documents

Publication Publication Date Title
CN111401384B (en) Transformer equipment defect image matching method
CN110363215B (en) Method for converting SAR image into optical image based on generating type countermeasure network
CN111582316B (en) RGB-D significance target detection method
CN111178316B (en) High-resolution remote sensing image land coverage classification method
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN112070111A (en) Multi-target detection method and system adaptive to multiband images
CN111291826B (en) Pixel-by-pixel classification method of multisource remote sensing image based on correlation fusion network
CN110544212B (en) Convolutional neural network hyperspectral image sharpening method based on hierarchical feature fusion
CN110929696A (en) Remote sensing image semantic segmentation method based on multi-mode attention and self-adaptive fusion
CN115331087A (en) Remote sensing image change detection method and system fusing regional semantics and pixel characteristics
US11941865B2 (en) Hyperspectral image classification method based on context-rich networks
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN111626267B (en) Hyperspectral remote sensing image classification method using void convolution
CN113378933A (en) Thyroid ultrasound image classification and segmentation network, training method, device and medium
CN113850324B (en) Multispectral target detection method based on Yolov4
CN114581456B (en) Multi-image segmentation model construction method, image detection method and device
CN111401455A (en) Remote sensing image deep learning classification method and system based on Capsules-Unet model
CN116523875A (en) Insulator defect detection method based on FPGA pretreatment and improved YOLOv5
CN114170526A (en) Remote sensing image multi-scale target detection and identification method based on lightweight network
CN116434045B (en) Intelligent identification method for tobacco leaf baking stage
CN113239828A (en) Face recognition method and device based on TOF camera module
Duan et al. Buildings extraction from remote sensing data using deep learning method based on improved U-Net network
CN116486431A (en) RGB-T multispectral pedestrian detection method based on target perception fusion strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination