CN116994164A - Multi-mode aerial image fusion and target detection combined learning method - Google Patents
- Publication number: CN116994164A
- Application number: CN202311058440.8A
- Authority
- CN
- China
- Prior art keywords
- image
- fusion
- target detection
- feature
- branch
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/17—Terrestrial scenes taken from planes or by drones
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
- Y02T10/40—Engine management systems
Abstract
The invention discloses a multi-mode aerial image fusion and target detection combined learning method, belonging to the technical field of remote sensing image processing, which comprises the following steps: preliminarily training an image fusion branch with paired visible light and infrared images to generate a first fusion image; designing a target detection branch with expert guiding information based on the first fusion image; preliminarily training the target detection branch with the first fusion image, and outputting an expert feature map through the preliminarily trained target detection branch; performing feature alignment processing on the expert feature map; fine-tuning the image fusion branch with the feature-aligned expert feature map to generate a second fusion image; and fine-tuning the target detection branch with the second fusion image to optimize the target detection task. The method simultaneously improves the performance of the two tasks of visible light and infrared aerial image fusion and target detection, and provides more accurate and efficient data analysis and decision support for unmanned aerial vehicle applications.
Description
Technical Field
The invention belongs to the technical field of remote sensing image processing, and particularly relates to a multi-mode aerial image fusion and target detection combined learning method.
Background
In recent years, multi-mode aerial images captured by unmanned aerial vehicles (Unmanned Aerial Vehicle, UAV) have received increasing attention and can be applied in many fields such as environmental survey, urban planning, and disaster relief. The fusion of visible light and infrared aerial images and target detection are two important tasks in unmanned aerial vehicle applications.
Visible light images exhibit rich texture details by capturing the light reflected by the scene, but are often susceptible to lighting conditions. In contrast, infrared images have strong anti-interference capability, capture thermal radiation information, and are suitable for various complex environments, but lack detailed texture information. The visible light and infrared image fusion task exploits the complementary information between the two modalities to generate a fusion image containing more information, thereby improving the performance of other high-level tasks. The target detection task can then quickly and accurately detect target objects from the generated high-quality fusion image using a deep-learning-based image processing algorithm.
Most existing multi-modal image fusion and target detection methods improve the fusion effect by designing networks and introducing constraint conditions, while neglecting the potential benefit of the target detection network. Zhao Wenda et al. designed a meta-feature embedding model in 2023 that enables the features of the target detection network to guide the visible and infrared image fusion network. However, that method adopts an inner-outer two-stage updating scheme, its training process is complex, and it focuses only on simple natural images without considering complex aerial images. Unlike natural images, images captured by unmanned aerial vehicles have a wide field of view, small and densely distributed targets, and complicated background noise, which increases the difficulty of target detection.
The progress of the large language model GPT-4 has stimulated great interest in foundation models in the field of computer vision. Segmentation base models such as SAM (Segment Anything Model) and MobileSAM, as novel interactive models specially designed for image segmentation tasks and subsequent downstream applications, provide a new way to address the complex backgrounds, small target sizes, and poor fusion and detection performance of multi-modal aerial images.
Therefore, how to use the segmentation base model to enhance the detection capability of the target detection branch for small targets in aerial images and to improve the performance of the two tasks of visible light and infrared aerial image fusion and target detection, so as to provide more accurate and efficient data analysis and decision support for unmanned aerial vehicle applications, is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a multi-mode aerial image fusion and target detection combined learning method, so as to at least solve some of the technical problems mentioned in the background art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a multi-mode aerial image fusion and target detection combined learning method comprises the following steps of
S1, adopting paired visible light images and infrared images to preliminarily train an image fusion branch to generate a first fusion image;
S2, designing a target detection branch with expert guiding information based on the first fusion image;
S3, performing preliminary training on the target detection branch by adopting the first fusion image, and outputting an expert feature map through the target detection branch after preliminary training;
S4, carrying out feature alignment processing on the expert feature map, and fine-tuning the image fusion branch by adopting the feature-aligned expert feature map to generate a second fusion image;
S5, fine-tuning the target detection branch by adopting the second fusion image, so that an optimized target detection task is realized.
Further, the step S2 specifically includes:
segmenting the first fused image into a plurality of image blocks using a segmentation base model;
and classifying and coding the plurality of image blocks according to preset area intervals and a segmentation coding module, and adaptively learning the coding features by adopting a mixture-of-experts gating mechanism, to form a target detection branch with expert guiding information.
Further, the setting process of the preset area interval includes:
respectively carrying out normalization processing on the paired visible light images and the target real frames in the infrared images, calculating the areas of the target real frames after the normalization processing, and selecting one area as a first clustering center;
calculating the shortest Euclidean distance between the area of the real frames of other targets and the first clustering center by adopting a K-means++ clustering algorithm; the areas of the other target real frames are the areas of the target real frames except the first clustering center;
calculating the probability that the area of each target real frame is selected as the next clustering center according to the shortest Euclidean distance until K clusters are clustered, and obtaining K sections of area intervals; and taking the K section area interval as a preset area interval.
Further, the segmentation coding module specifically includes:
respectively calculating the minimum circumscribed rectangular areas of a plurality of image blocks;
dividing the minimum circumscribed rectangular areas into K classes according to the preset area interval, setting the target area of each class to 1, setting other areas to 0, and obtaining a Mask matrix of the K channels;
flattening the Mask matrix, and mapping the flattened Mask matrix into a fixed dimension; based on the Patch embedding coding feature and the Position embedding coding feature, performing a self-attention operation by using a Transformer encoder to obtain a feature map;
and carrying out downsampling treatment on the feature map by adopting a convolution module to obtain the feature map after downsampling treatment.
Further, the feature map size after the downsampling process is the same as the feature map size of the target detection branch output.
Further, adaptively learning the coding features by adopting the mixture-of-experts gating mechanism specifically comprises:
firstly, splicing a feature image output by a target detection branch and a feature image after downsampling along the channel dimension; generating weights from the spliced feature images through a gating network; and finally, carrying out weighted linear combination on the two feature images and the corresponding weights to generate an expert feature image.
Further, the step S4 specifically includes:
performing feature alignment processing on the expert feature map; constructing a first loss function based on the expert feature map after feature alignment processing;
linearly combining the first loss function and the second loss function of the image fusion branch to obtain an image fusion loss function;
optimizing the image fusion loss function through back propagation until an optimal image fusion branch is obtained; and generating a second fusion image based on the optimal image fusion branch.
Compared with the prior art, the invention discloses a multi-mode aerial image fusion and target detection combined learning method, which has the following beneficial effects:
(1) The joint learning method adopted by the invention not only improves the fusion quality of visible light and infrared images, but also improves the target detection performance on aerial images, thereby realizing collaborative optimization of the two tasks.
(2) The invention focuses on the target detection task of multi-mode aerial images, fully exploits the ability of the segmentation base model to segment anything, builds the Mask matrix through a clustering algorithm to provide expert guiding information for the target detection branch, and enhances the detector's ability to detect small targets in the complex working environment of the unmanned aerial vehicle.
(3) The invention models the guiding information provided by the segmentation base model with a mixture-of-experts gating mechanism, which dynamically selects the degree of reliance on each expert, effectively suppresses the complicated background noise in aerial images, and allows the detector to focus on reliable target objects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a multi-mode aerial image fusion and target detection combined learning method provided by an embodiment of the invention.
Fig. 2 is a schematic diagram of a multi-mode aerial image fusion and target detection combined learning method framework provided by the embodiment of the invention.
Fig. 3 is a schematic diagram of SAM guiding target detection branches according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a SAM segmented image block and a corresponding 3-channel visual Mask matrix according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of the expert feature map of the target detection branch fine-tuning the image fusion branch through feature alignment according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a fusion image generated by the jointly optimized image fusion branch according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a detection effect of a target detection branch after joint optimization according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, the embodiment of the invention discloses a multi-mode aerial image fusion and target detection combined learning method, which is implemented based on the segmentation base model SAM (Segment Anything Model), and the implementation flow comprises the following steps:
s1, adopting paired visible light images and infrared images to preliminarily train image fusion branches to generate a first fusion image;
s2, designing a target detection branch with expert guiding information based on the first fusion image;
s3, performing preliminary training on the target detection branch by adopting a first fusion image; outputting an expert feature map through the target detection branch after preliminary training;
s4, performing feature alignment processing on the expert feature map; fine tuning the image fusion branch by adopting the expert feature map after feature alignment processing to generate a second fusion image;
s5, fine tuning is carried out on the target detection branch by adopting the second fusion image, so that the target detection task is optimized.
The respective steps described above are explained next.
In the above step S1, referring to fig. 2, the image fusion branch includes a plurality of feature fusion modules and an image reconstruction module. In the training process, the paired visible light images and infrared images are first input into the feature fusion modules, which sequentially extract the image features of the visible light image and the infrared image and perform image feature fusion processing to obtain image fusion features; the image reconstruction module then constructs a fusion image, namely the first fusion image, based on the obtained image fusion features.
in the embodiment of the invention, the image fusion branch comprises 3 feature fusion modules and 1 image reconstruction module; each feature fusion module consists of 23×3 convolutions and a Relu activation function, and features are fused among the modules in a splicing manner; the image reconstruction module consists of 6 3 x 3 convolutions and a Relu activation function.
In the above step S2, referring specifically to fig. 3, the first fusion image generated in step S1 is segmented into a plurality of image blocks using the segmentation base model; the plurality of image blocks are then classified and coded according to preset area intervals and a segmentation coding module, and the coding features are adaptively learned by a mixture-of-experts gating mechanism, forming a target detection branch with expert guiding information.
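As a hedged illustration of this segmentation step, the open-source segment-anything package can produce such image blocks automatically; the model type and checkpoint path below are placeholders, and MobileSAM exposes a similar interface:

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Build SAM from a checkpoint (placeholder path) and its automatic mask generator.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# The first fusion image, read as an RGB uint8 array.
fused = cv2.cvtColor(cv2.imread("first_fusion_image.png"), cv2.COLOR_BGR2RGB)

# Each entry describes one image block: its binary mask, bounding box (x, y, w, h) and area.
blocks = mask_generator.generate(fused)
print(len(blocks), blocks[0]["bbox"], blocks[0]["area"])
```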
the setting process of the preset area interval comprises the following steps:
(1) Respectively carrying out normalization processing on the target real frames in the paired visible light images and infrared images, expressed by the formula:

w' = w/W,  h' = h/H   (1)

wherein w represents the width of the target real frame; h represents the height of the target real frame; W represents the width of the visible light image or infrared image; H represents the height of the visible light image or infrared image; and w' and h' represent the normalized width and height;
then calculating the area of the target real frame after normalization processing, and selecting one area as a first clustering center;
(2) Calculating the shortest Euclidean distance d(s, c) between the areas of the other target real frames and the first clustering center by adopting the K-means++ clustering algorithm; the areas of the other target real frames are the areas of the target real frames other than the first clustering center; the shortest Euclidean distance d(s, c) is expressed as:

d(s, c) = √( Σ_n (s_n − c_n)² )   (2)

wherein s_n and c_n represent the n-th components of the area vector of the target real frame and of the clustering center, respectively; and n represents the dimension of the vector;
(3) Calculating, according to the shortest Euclidean distance, the probability P(s) that the area of each target real frame is selected as the next clustering center, until K clusters are obtained, yielding K area intervals; the K area intervals are taken as the preset area intervals;

The probability P(s) of being selected as the next clustering center is expressed as:

P(s) = d(s, c)² / Σ_{s'} d(s', c')²   (3)

where c' represents the other clustering centers.
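A compact sketch of this clustering step using scikit-learn's k-means++ initialisation; the choice of K = 3 mirrors the 3-channel Mask example of Fig. 4 and is an assumption, as are the example box sizes:

```python
import numpy as np
from sklearn.cluster import KMeans

def area_intervals(boxes_wh, image_wh, k=3):
    """Cluster the normalized ground-truth box areas with k-means++ and return
    K area intervals (min, max), one per cluster, sorted from small to large."""
    W, H = image_wh
    areas = (boxes_wh[:, 0] / W) * (boxes_wh[:, 1] / H)   # normalized areas s
    km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(areas.reshape(-1, 1))
    intervals = [(areas[km.labels_ == c].min(), areas[km.labels_ == c].max()) for c in range(k)]
    return sorted(intervals)

# Example: widths/heights of target real frames in pixels, in a 640x512 image.
boxes = np.array([[12, 10], [18, 15], [60, 40], [120, 90], [15, 12], [70, 55]], dtype=float)
print(area_intervals(boxes, (640, 512)))
```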
The above-mentioned segmentation encoding module specifically includes:
(1) Respectively calculating the minimum circumscribed rectangle areas of the image blocks; dividing the minimum circumscribed rectangle areas into K classes according to the preset area intervals, setting the target area of each class to 1 and the other areas to 0, and obtaining a K-channel Mask matrix; FIG. 4 shows SAM-segmented image blocks and the corresponding 3-channel visualized Mask matrix;
(2) Flattening the Mask matrix, and mapping the flattened Mask matrix into a fixed dimension; based on the Patch embedding coding feature and the Position embedding coding feature, performing a self-attention operation using a Transformer encoder to obtain a feature map; expressed as:

K = LN(Reshape(k, R1)·W)   (5)

V = LN(Reshape(v, R2)·W)   (6)

Attention(Q, K, V) = Softmax(QK^T/d)·V   (7)

wherein Q represents the query; K represents the key; V represents the value; LN represents layer normalization; R1 and R2 represent reduction ratios; W represents a linear mapping; and d represents a scale factor.
(3) Performing downsampling processing on the feature map by adopting a convolution module to obtain a feature map after downsampling processing; the feature map size after the downsampling process is the same as the feature map size of the target detection branch output.
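Two hedged sketches of this module follow. The first builds the K-channel Mask matrix from the SAM block bounding boxes and the preset area intervals; the second shows the flatten–embed–attend pipeline of equations (5)–(7), in which the embedding dimension, patch size, reduction ratio and the handling of the query Q (whose equation (4) is not reproduced above) are all illustrative assumptions, and the convolutional downsampling of item (3) is omitted for brevity:

```python
import numpy as np

def build_mask_matrix(blocks, image_wh, intervals):
    """Channel k of the Mask is 1 inside blocks whose normalized minimum-circumscribed-rectangle
    area falls into the k-th preset area interval, and 0 elsewhere."""
    W, H = image_wh
    mask = np.zeros((len(intervals), H, W), dtype=np.float32)
    for blk in blocks:
        x, y, w, h = blk["bbox"]                   # minimum circumscribed rectangle of the block
        area = (w / W) * (h / H)                   # normalized area
        for k, (lo, hi) in enumerate(intervals):
            if lo <= area <= hi:
                mask[k, y:y + h, x:x + w] = 1.0    # target area of this class set to 1
                break
    return mask
```

The Mask is then flattened, embedded and passed through a reduced self-attention block:

```python
import torch
import torch.nn as nn

class MaskEncoder(nn.Module):
    """Patch + Position embedding of the K-channel Mask followed by the
    reduced self-attention of equations (5)-(7)."""
    def __init__(self, k_channels=3, img_size=640, patch=16, dim=128, reduction=4):
        super().__init__()
        n = (img_size // patch) ** 2                       # number of patches (square input assumed)
        self.patch_embed = nn.Conv2d(k_channels, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n, dim))
        self.q = nn.Linear(dim, dim)
        self.w = nn.Linear(dim, dim)                       # linear mapping W shared by K and V
        self.ln = nn.LayerNorm(dim)
        self.reduce = nn.AvgPool1d(reduction)              # stands in for Reshape(., R1) / Reshape(., R2)
        self.scale = float(dim)                            # scale factor d

    def forward(self, mask):                               # mask: (B, K, H, W)
        x = self.patch_embed(mask).flatten(2).transpose(1, 2) + self.pos_embed   # (B, N, dim)
        q = self.q(x)
        kv = self.reduce(x.transpose(1, 2)).transpose(1, 2)                      # reduced tokens
        k = self.ln(self.w(kv))                            # K = LN(Reshape(k, R1) . W), eq. (5)
        v = self.ln(self.w(kv))                            # V = LN(Reshape(v, R2) . W), eq. (6)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)
        return attn @ v                                    # Attention(Q, K, V), eq. (7)
```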
The coding features are adaptively learned by the mixture-of-experts gating mechanism to provide expert guiding information for the target detection branch, which specifically comprises the following steps:
Firstly, the feature map output by the target detection branch and the downsampled feature map are spliced along the channel dimension; then, a weight is generated from the spliced feature map by the gating network; finally, the two feature maps are combined by a weighted linear combination with the corresponding weights to generate the expert feature map. The whole process is expressed as:
f_i = concatenate(F_di, F_si)   (8)

ω_i = g(flatten(f_i))   (9)

F_mi = ω_i·F_di + (1 − ω_i)·F_si   (10)

wherein i = 1, 2, 3, 4; F_di represents the feature map output by the target detection branch; F_si represents the feature map after downsampling; g represents the gating network function; f_i represents the spliced feature map; ω_i represents the weight; and F_mi represents the expert feature map.
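A minimal sketch of equations (8)–(10); the embodiment only states that a gating network produces the weight, so a single linear layer with a sigmoid is assumed here, as are the channel and spatial sizes in the usage example:

```python
import torch
import torch.nn as nn

class ExpertGate(nn.Module):
    """Mixture-of-experts gate: splice the detection feature F_d and the mask-encoder
    feature F_s along channels, predict a weight w, and return w*F_d + (1-w)*F_s."""
    def __init__(self, channels, spatial):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels * spatial * spatial, 1),   # g(flatten(f_i)), eq. (9)
            nn.Sigmoid(),
        )

    def forward(self, f_det, f_seg):                  # both: (B, C, H, W), same size
        f = torch.cat([f_det, f_seg], dim=1)          # f_i = concatenate(F_di, F_si), eq. (8)
        w = self.gate(f.flatten(1)).view(-1, 1, 1, 1)
        return w * f_det + (1.0 - w) * f_seg          # F_mi = w*F_di + (1-w)*F_si, eq. (10)

# Usage: one gate per pyramid level i = 1..4.
gate = ExpertGate(channels=256, spatial=20)
expert = gate(torch.randn(2, 256, 20, 20), torch.randn(2, 256, 20, 20))
```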
In the above step S3, referring to fig. 2, the target detection branch includes a backbone network (ResNet50), a feature pyramid network, a mixture-of-experts gate, and a detection head. In the training process, the first fusion image is input into the backbone network, which extracts image features of the first fusion image at different scales; the feature pyramid network fuses the image features of different scales to obtain fused feature maps with high resolution and strong semantics; the fused feature maps are input into the mixture-of-experts gating network, and target classification and regression are realized through the detection head in combination with the expert guiding information.
The target detection branch also comprises SAM (Segment Anything Model) and the segmentation coding module. The first fusion image is input into SAM and segmented into a plurality of image blocks; the segmentation coding module codes the segmented image blocks, further enhancing the semantics of the feature map; the feature maps output by the segmentation coding module are then downsampled and input into the mixture-of-experts gating network, providing expert guiding information for the target detection branch.
In the above step S4, because of the difference between the image fusion task and the target detection task, their features are incompatible, and the features output by the target detection branch cannot be used directly to assist the image fusion branch; a feature alignment module is therefore adopted to match the features of the two tasks. In the embodiment of the invention, the expert feature map is subjected to feature alignment processing by the feature alignment module, and a first loss function L_g is constructed based on the feature-aligned expert feature map to fine-tune the image fusion branch so that it generates more target semantic information, see in particular fig. 5.
the step S4 specifically comprises the following steps:
Constructing the first loss function L_g based on the expert feature map output by the target detection branch after preliminary training; the first loss function L_g is a SmoothL1 loss function, expressed as:

L_g = Σ_i SmoothL1(F_ui, F_vi)   (11)

wherein F_ui represents the features output by the feature alignment module; and F_vi represents the features of the image fusion branch;
Acquiring a second loss function L_f of the image fusion branch;

Linearly combining the first loss function and the second loss function to obtain an image fusion loss function L;

Continuously optimizing the image fusion loss function L through back propagation until the optimal image fusion branch is obtained; expressed as:

L = L_f + λ·L_g   (12)

L_f = θ_1·(1 − SSIM(I_f, I_1)) + θ_2·(1 − SSIM(I_f, I_2))   (13)

SSIM(I_f, I_j) = ((2·μ_f·μ_j + k_1)·(2·σ_fj + k_2)) / ((μ_f² + μ_j² + k_1)·(σ_f² + σ_j² + k_2)),  j ∈ {1, 2}   (14)

wherein λ, θ_1 and θ_2 represent weight ratios; SSIM represents the structural similarity loss; I_1 represents the visible light image; I_2 represents the infrared image; I_f represents the first fusion image or the second fusion image; μ_1 and μ_2 represent the average pixel values of the visible light image and the infrared image; μ_f represents the average pixel value of the fusion image; σ_1 and σ_2 represent the pixel standard deviations of the visible light image and the infrared image; σ_f represents the pixel standard deviation of the fusion image; σ_f1 and σ_f2 represent the pixel covariances of the fusion image with the visible light image and with the infrared image, respectively; the fusion image is the first fusion image or the second fusion image; and k_1 and k_2 represent constants.
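A sketch of the fusion loss of equations (11)–(14), using PyTorch's built-in SmoothL1 loss and the ssim function of the pytorch-msssim package as a stand-in SSIM implementation; the weights λ, θ_1, θ_2 and the tensor shapes are illustrative values, not taken from the embodiment:

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim   # structural similarity; any SSIM implementation would do

def fusion_loss(fused, vis, ir, aligned_feats, fusion_feats,
                lam=1.0, theta1=0.5, theta2=0.5):
    """L = L_f + lambda * L_g                                            (eq. 12)
       L_f = theta1*(1 - SSIM(I_f, I_1)) + theta2*(1 - SSIM(I_f, I_2))   (eq. 13)
       L_g = sum_i SmoothL1(F_ui, F_vi)                                  (eq. 11)"""
    l_f = theta1 * (1 - ssim(fused, vis, data_range=1.0)) \
        + theta2 * (1 - ssim(fused, ir, data_range=1.0))
    l_g = sum(F.smooth_l1_loss(fu, fv) for fu, fv in zip(aligned_feats, fusion_feats))
    return l_f + lam * l_g

# Example with random tensors standing in for images and multi-scale features.
fused = torch.rand(2, 1, 256, 256)
vis, ir = torch.rand_like(fused), torch.rand_like(fused)
feats_u = [torch.randn(2, 16, 64, 64), torch.randn(2, 16, 32, 32)]
feats_v = [torch.randn(2, 16, 64, 64), torch.randn(2, 16, 32, 32)]
print(fusion_loss(fused, vis, ir, feats_u, feats_v))
```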
After fine tuning of the image fusion branch, generating a second fusion image; since the image fusion branch has completed the optimization, the quality of the generated second fusion image is higher, see in particular fig. 6.
In the above step S5, the second fusion image is adopted to further fine-tune the target detection branch, so that the target detection task is optimized; the detection effect is shown in fig. 7. It can be seen from fig. 7 that the fine-tuned target detection branch can accurately distinguish easily confused target categories and can still identify small targets even under insufficient light.
In summary, the embodiment of the invention provides a multi-mode aerial image fusion and target detection combined learning method, which uses the expert guiding information provided by the segmentation base model to enhance the detection capability of the target detection branch for small targets in aerial images, and uses the guided target detection branch to assist the image fusion branch in generating more target semantic information. The method can simultaneously improve the performance of the two tasks of visible light and infrared aerial image fusion and target detection, and provides more accurate and efficient data analysis and decision support for unmanned aerial vehicle applications.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. A multi-mode aerial image fusion and target detection combined learning method, characterized by comprising the following steps:
S1, adopting paired visible light images and infrared images to preliminarily train an image fusion branch to generate a first fusion image;
S2, designing a target detection branch with expert guiding information based on the first fusion image;
S3, performing preliminary training on the target detection branch by adopting the first fusion image, and outputting an expert feature map through the target detection branch after preliminary training;
S4, carrying out feature alignment processing on the expert feature map, and fine-tuning the image fusion branch by adopting the feature-aligned expert feature map to generate a second fusion image;
S5, fine-tuning the target detection branch by adopting the second fusion image, so that an optimized target detection task is realized.
2. The multi-mode aerial image fusion and target detection joint learning method according to claim 1, wherein the step S2 specifically includes:
segmenting the first fused image into a plurality of image blocks using a segmentation base model;
and classifying and coding the plurality of image blocks according to preset area intervals and a segmentation coding module, and adaptively learning the coding features by adopting a mixture-of-experts gating mechanism, to form a target detection branch with expert guiding information.
3. The multi-mode aerial image fusion and target detection joint learning method according to claim 2, wherein the setting process of the preset area interval comprises the following steps:
respectively carrying out normalization processing on the paired visible light images and the target real frames in the infrared images, calculating the areas of the target real frames after the normalization processing, and selecting one area as a first clustering center;
calculating the shortest Euclidean distance between the area of the real frames of other targets and the first clustering center by adopting a K-means++ clustering algorithm; the areas of the other target real frames are the areas of the target real frames except the first clustering center;
calculating the probability that the area of each target real frame is selected as the next clustering center according to the shortest Euclidean distance until K clusters are clustered, and obtaining K sections of area intervals; and taking the K section area interval as a preset area interval.
4. The multi-modal aerial image fusion and target detection joint learning method according to claim 2, wherein the segmentation encoding module specifically comprises:
respectively calculating the minimum circumscribed rectangular areas of a plurality of image blocks;
dividing the minimum circumscribed rectangular areas into K classes according to the preset area interval, setting the target area of each class to 1, setting other areas to 0, and obtaining a Mask matrix of the K channels;
flattening the Mask matrix, and mapping the flattened Mask matrix into a fixed dimension; based on the Patch embedding coding feature and the Position embedding coding feature, performing a self-attention operation by using a Transformer encoder to obtain a feature map;
and carrying out downsampling treatment on the feature map by adopting a convolution module to obtain the feature map after downsampling treatment.
5. The multi-modal aerial image fusion and target detection joint learning method of claim 4, wherein the feature map size after the downsampling process is the same as the feature map size of the target detection branch output.
6. The multi-modal aerial image fusion and target detection joint learning method according to claim 4, wherein adaptively learning the coding features by adopting a mixture-of-experts gating mechanism specifically comprises:
firstly, splicing a feature image output by a target detection branch and a feature image after downsampling along the channel dimension; generating weights from the spliced feature images through a gating network; and finally, carrying out weighted linear combination on the two feature images and the corresponding weights to generate an expert feature image.
7. The multi-mode aerial image fusion and target detection joint learning method according to claim 1, wherein the step S4 specifically includes:
performing feature alignment processing on the expert feature map; constructing a first loss function based on the expert feature map after feature alignment processing;
linearly combining the first loss function and the second loss function of the image fusion branch to obtain an image fusion loss function;
optimizing the image fusion loss function through back propagation until an optimal image fusion branch is obtained; and generating a second fusion image based on the optimal image fusion branch.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311058440.8A CN116994164A (en) | 2023-08-22 | 2023-08-22 | Multi-mode aerial image fusion and target detection combined learning method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116994164A true CN116994164A (en) | 2023-11-03 |
Family
ID=88531998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311058440.8A Pending CN116994164A (en) | 2023-08-22 | 2023-08-22 | Multi-mode aerial image fusion and target detection combined learning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116994164A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117274826A (en) * | 2023-11-23 | 2023-12-22 | 山东锋士信息技术有限公司 | River and lake management violation problem remote sensing monitoring method based on large model and prompt guidance |
CN117274826B (en) * | 2023-11-23 | 2024-03-08 | 山东锋士信息技术有限公司 | River and lake management violation problem remote sensing monitoring method based on large model and prompt guidance |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |