CN116994164A - Multi-mode aerial image fusion and target detection combined learning method - Google Patents
- Publication number: CN116994164A
- Application number: CN202311058440.8A
- Authority
- CN
- China
- Prior art keywords
- image
- fusion
- target detection
- feature
- branch
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/17—Terrestrial scenes taken from planes or by drones
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
- Y02T10/40—Engine management systems
Abstract
The invention discloses a multi-mode aerial image fusion and target detection combined learning method, belonging to the technical field of remote sensing image processing, which comprises the following steps: preliminarily training an image fusion branch with paired visible light and infrared images to generate a first fusion image; designing a target detection branch with expert guiding information based on the first fusion image; preliminarily training the target detection branch with the first fusion image, and outputting an expert feature map through the preliminarily trained target detection branch; performing feature alignment processing on the expert feature map; fine-tuning the image fusion branch with the feature-aligned expert feature map to generate a second fusion image; and fine-tuning the target detection branch with the second fusion image to optimize the target detection task. The method simultaneously improves the performance of the two tasks of visible light and infrared aerial image fusion and target detection, and provides more accurate and efficient data analysis and decision support for unmanned aerial vehicle applications.
Description
Technical Field
The invention belongs to the technical field of remote sensing image processing, and particularly relates to a multi-mode aerial image fusion and target detection combined learning method.
Background
In recent years, multi-mode aerial images captured by unmanned aerial vehicles (Unmanned Aerial Vehicle, UAV) have received increasing attention and can be applied in many fields such as environmental survey, urban planning, and disaster relief. The fusion of visible light and infrared aerial images and target detection are two important tasks in unmanned aerial vehicle applications.
Visible light images exhibit rich texture details by capturing the light reflected by the scene, but are often susceptible to lighting conditions. In contrast, infrared images have strong anti-interference capability, capture thermal radiation information, and are suitable for various complex environments, but lack detailed texture information. The visible light and infrared image fusion task exploits the complementary information between the two modalities to generate a fusion image containing more information, thereby improving the performance of other high-level tasks. The target detection task can then quickly and accurately detect target objects from the generated high-quality fusion image using a deep-learning-based image processing algorithm.
Most existing multi-modal image fusion and target detection methods improve the fusion effect by designing networks and introducing constraint conditions, while neglecting the potential benefit of the target detection network. Zhao Wenda et al. designed a meta-feature embedding model in 2023 that enables the features of the target detection network to guide the visible and infrared image fusion network. However, that method adopts an inner-outer two-stage updating scheme, its training process is complex, and it focuses only on simple natural images without considering complex aerial images. Unlike natural images, images captured by unmanned aerial vehicles have a wide field of view, small and densely distributed targets, and complicated background noise, which increases the difficulty of target detection.
The progress of the large language model GPT-4 has stimulated great interest in foundation models in the field of computer vision. Segmentation base models such as SAM (Segment Anything Model) and MobileSAM, as novel interactive models specially designed for image segmentation tasks and subsequent downstream applications, provide a new way to address the complex backgrounds, small target sizes, and poor fusion and detection performance of multi-modal aerial images.
Therefore, how to use the segmentation base model to enhance the detection capability of the target detection branch for small targets in aerial images and to improve the performance of the two tasks of visible light and infrared aerial image fusion and target detection, so as to provide more accurate and efficient data analysis and decision support for unmanned aerial vehicle applications, is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a multi-mode aerial image fusion and target detection combined learning method, so as to at least solve some of the technical problems mentioned in the background art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a multi-mode aerial image fusion and target detection combined learning method comprises the following steps of
S1, adopting paired visible light images and infrared images to preliminarily train an image fusion branch to generate a first fusion image;
S2, designing a target detection branch with expert guiding information based on the first fusion image;
S3, performing preliminary training on the target detection branch by adopting the first fusion image, and outputting an expert feature map through the target detection branch after preliminary training;
S4, carrying out feature alignment processing on the expert feature map, and fine-tuning the image fusion branch by adopting the feature-aligned expert feature map to generate a second fusion image;
S5, fine-tuning the target detection branch by adopting the second fusion image, so that an optimized target detection task is realized.
Further, the step S2 specifically includes:
segmenting the first fused image into a plurality of image blocks using a segmentation base model;
and classifying and coding the plurality of image blocks according to preset area intervals and a segmentation coding module, and adaptively learning the coding features by adopting a mixture-of-experts gating mechanism, to form a target detection branch with expert guiding information.
Further, the setting process of the preset area interval includes:
respectively carrying out normalization processing on the paired visible light images and the target real frames in the infrared images, calculating the areas of the target real frames after the normalization processing, and selecting one area as a first clustering center;
calculating the shortest Euclidean distance between the area of the real frames of other targets and the first clustering center by adopting a K-means++ clustering algorithm; the areas of the other target real frames are the areas of the target real frames except the first clustering center;
calculating the probability that the area of each target real frame is selected as the next clustering center according to the shortest Euclidean distance until K clusters are clustered, and obtaining K sections of area intervals; and taking the K section area interval as a preset area interval.
Further, the segmentation coding module specifically includes:
respectively calculating the minimum circumscribed rectangular areas of a plurality of image blocks;
dividing the minimum circumscribed rectangular areas into K classes according to the preset area interval, setting the target area of each class to 1, setting other areas to 0, and obtaining a Mask matrix of the K channels;
flattening the Mask matrix, and mapping the flattened Mask matrix into a fixed dimension; based on the Patch embedding coding feature and the Position embedding coding feature, performing a self-attention operation by using a Transformer encoder to obtain a feature map;
and carrying out downsampling treatment on the feature map by adopting a convolution module to obtain the feature map after downsampling treatment.
Further, the feature map size after the downsampling process is the same as the feature map size of the target detection branch output.
Further, adaptively learning the coding features by adopting the mixture-of-experts gating mechanism specifically comprises:
firstly, splicing a feature image output by a target detection branch and a feature image after downsampling along the channel dimension; generating weights from the spliced feature images through a gating network; and finally, carrying out weighted linear combination on the two feature images and the corresponding weights to generate an expert feature image.
Further, the step S4 specifically includes:
performing feature alignment processing on the expert feature map; constructing a first loss function based on the expert feature map after feature alignment processing;
linearly combining the first loss function and the second loss function of the image fusion branch to obtain an image fusion loss function;
optimizing the image fusion loss function through back propagation until an optimal image fusion branch is obtained; and generating a second fusion image based on the optimal image fusion branch.
Compared with the prior art, the invention discloses a multi-mode aerial image fusion and target detection combined learning method, which has the following beneficial effects:
(1) The joint learning method adopted by the invention not only improves the fusion quality of visible light and infrared images, but also improves the target detection performance on aerial images, thereby realizing collaborative optimization of the two tasks.
(2) The invention focuses on the target detection task of multi-mode aerial images, fully exploits the ability of the segmentation base model to segment anything, builds the Mask matrix through a clustering algorithm to provide expert guiding information for the target detection branch, and enhances the detector's ability to detect small targets in the complex working environment of the unmanned aerial vehicle.
(3) The invention models the guiding information provided by the segmentation base model with a mixture-of-experts gating mechanism, which dynamically selects the degree of reliance on each expert, effectively suppresses the complicated background noise in aerial images, and allows the detector to focus on reliable target objects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a multi-mode aerial image fusion and target detection combined learning method provided by an embodiment of the invention.
Fig. 2 is a schematic diagram of a multi-mode aerial image fusion and target detection combined learning method framework provided by the embodiment of the invention.
Fig. 3 is a schematic diagram of SAM guiding target detection branches according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a SAM segmented image block and a corresponding 3-channel visual Mask matrix according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of the expert feature map of the target detection branch fine-tuning the image fusion branch through feature alignment according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a fusion image generated by the jointly optimized image fusion branch according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a detection effect of a target detection branch after joint optimization according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, the embodiment of the invention discloses a multi-mode aerial image fusion and target detection combined learning method, which is implemented based on the segmentation base model SAM (Segment Anything Model), and the implementation flow comprises the following steps:
s1, adopting paired visible light images and infrared images to preliminarily train image fusion branches to generate a first fusion image;
s2, designing a target detection branch with expert guiding information based on the first fusion image;
s3, performing preliminary training on the target detection branch by adopting a first fusion image; outputting an expert feature map through the target detection branch after preliminary training;
s4, performing feature alignment processing on the expert feature map; fine tuning the image fusion branch by adopting the expert feature map after feature alignment processing to generate a second fusion image;
s5, fine tuning is carried out on the target detection branch by adopting the second fusion image, so that the target detection task is optimized.
The respective steps described above are explained next.
In the above step S1, referring to fig. 2, the image fusion branch includes a plurality of feature fusion modules and an image reconstruction module. In the training process, the paired visible light images and infrared images are first input into the feature fusion modules, which sequentially extract the image features of the visible light image and the infrared image and perform image feature fusion processing to obtain image fusion features; the image reconstruction module then constructs a fusion image, namely the first fusion image, based on the obtained image fusion features.
in the embodiment of the invention, the image fusion branch comprises 3 feature fusion modules and 1 image reconstruction module; each feature fusion module consists of 23×3 convolutions and a Relu activation function, and features are fused among the modules in a splicing manner; the image reconstruction module consists of 6 3 x 3 convolutions and a Relu activation function.
In the above step S2, referring specifically to fig. 3, the first fusion image generated in step S1 is segmented into a plurality of image blocks using the segmentation base model; the plurality of image blocks are then classified and coded according to preset area intervals and a segmentation coding module, and the coding features are adaptively learned by a mixture-of-experts gating mechanism, forming a target detection branch with expert guiding information.
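As a hedged illustration of this segmentation step, the open-source segment-anything package can produce such image blocks automatically; the model type and checkpoint path below are placeholders, and MobileSAM exposes a similar interface:

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Build SAM from a checkpoint (placeholder path) and its automatic mask generator.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# The first fusion image, read as an RGB uint8 array.
fused = cv2.cvtColor(cv2.imread("first_fusion_image.png"), cv2.COLOR_BGR2RGB)

# Each entry describes one image block: its binary mask, bounding box (x, y, w, h) and area.
blocks = mask_generator.generate(fused)
print(len(blocks), blocks[0]["bbox"], blocks[0]["area"])
```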
the setting process of the preset area interval comprises the following steps:
(1) Respectively carrying out normalization processing on the target real frames in the paired visible light images and infrared images, expressed by the formula:

w' = w/W,  h' = h/H   (1)

wherein w represents the width of the target real frame; h represents the height of the target real frame; W represents the width of the visible light image or infrared image; H represents the height of the visible light image or infrared image; and w' and h' represent the normalized width and height;
then calculating the area of the target real frame after normalization processing, and selecting one area as a first clustering center;
(2) Calculating the shortest Euclidean distance d(s, c) between the areas of the other target real frames and the first clustering center by adopting the K-means++ clustering algorithm; the areas of the other target real frames are the areas of the target real frames other than the first clustering center; the shortest Euclidean distance d(s, c) is expressed as:

d(s, c) = √( Σ_n (s_n − c_n)² )   (2)

wherein s_n and c_n represent the n-th components of the area vector of the target real frame and of the clustering center, respectively; and n represents the dimension of the vector;
(3) Calculating, according to the shortest Euclidean distance, the probability P(s) that the area of each target real frame is selected as the next clustering center, until K clusters are obtained, yielding K area intervals; the K area intervals are taken as the preset area intervals;

The probability P(s) of being selected as the next clustering center is expressed as:

P(s) = d(s, c)² / Σ_{s'} d(s', c')²   (3)

where c' represents the other clustering centers.
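A compact sketch of this clustering step using scikit-learn's k-means++ initialisation; the choice of K = 3 mirrors the 3-channel Mask example of Fig. 4 and is an assumption, as are the example box sizes:

```python
import numpy as np
from sklearn.cluster import KMeans

def area_intervals(boxes_wh, image_wh, k=3):
    """Cluster the normalized ground-truth box areas with k-means++ and return
    K area intervals (min, max), one per cluster, sorted from small to large."""
    W, H = image_wh
    areas = (boxes_wh[:, 0] / W) * (boxes_wh[:, 1] / H)   # normalized areas s
    km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(areas.reshape(-1, 1))
    intervals = [(areas[km.labels_ == c].min(), areas[km.labels_ == c].max()) for c in range(k)]
    return sorted(intervals)

# Example: widths/heights of target real frames in pixels, in a 640x512 image.
boxes = np.array([[12, 10], [18, 15], [60, 40], [120, 90], [15, 12], [70, 55]], dtype=float)
print(area_intervals(boxes, (640, 512)))
```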
The above-mentioned segmentation encoding module specifically includes:
(1) Respectively calculating the minimum circumscribed rectangle areas of the image blocks; dividing the minimum circumscribed rectangle areas into K classes according to the preset area intervals, setting the target area of each class to 1 and the other areas to 0, and obtaining a K-channel Mask matrix; FIG. 4 shows SAM-segmented image blocks and the corresponding 3-channel visualized Mask matrix;
(2) Flattening the Mask matrix, and mapping the flattened Mask matrix into a fixed dimension; based on the Patch embedding coding feature and the Position embedding coding feature, performing a self-attention operation using a Transformer encoder to obtain a feature map; expressed as:

K = LN(Reshape(k, R1)·W)   (5)

V = LN(Reshape(v, R2)·W)   (6)

Attention(Q, K, V) = Softmax(QK^T/d)·V   (7)

wherein Q represents the query; K represents the key; V represents the value; LN represents layer normalization; R1 and R2 represent reduction ratios; W represents a linear mapping; and d represents a scale factor.
(3) Performing downsampling processing on the feature map by adopting a convolution module to obtain a feature map after downsampling processing; the feature map size after the downsampling process is the same as the feature map size of the target detection branch output.
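Two hedged sketches of this module follow. The first builds the K-channel Mask matrix from the SAM block bounding boxes and the preset area intervals; the second shows the flatten–embed–attend pipeline of equations (5)–(7), in which the embedding dimension, patch size, reduction ratio and the handling of the query Q (whose equation (4) is not reproduced above) are all illustrative assumptions, and the convolutional downsampling of item (3) is omitted for brevity:

```python
import numpy as np

def build_mask_matrix(blocks, image_wh, intervals):
    """Channel k of the Mask is 1 inside blocks whose normalized minimum-circumscribed-rectangle
    area falls into the k-th preset area interval, and 0 elsewhere."""
    W, H = image_wh
    mask = np.zeros((len(intervals), H, W), dtype=np.float32)
    for blk in blocks:
        x, y, w, h = blk["bbox"]                   # minimum circumscribed rectangle of the block
        area = (w / W) * (h / H)                   # normalized area
        for k, (lo, hi) in enumerate(intervals):
            if lo <= area <= hi:
                mask[k, y:y + h, x:x + w] = 1.0    # target area of this class set to 1
                break
    return mask
```

The Mask is then flattened, embedded and passed through a reduced self-attention block:

```python
import torch
import torch.nn as nn

class MaskEncoder(nn.Module):
    """Patch + Position embedding of the K-channel Mask followed by the
    reduced self-attention of equations (5)-(7)."""
    def __init__(self, k_channels=3, img_size=640, patch=16, dim=128, reduction=4):
        super().__init__()
        n = (img_size // patch) ** 2                       # number of patches (square input assumed)
        self.patch_embed = nn.Conv2d(k_channels, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n, dim))
        self.q = nn.Linear(dim, dim)
        self.w = nn.Linear(dim, dim)                       # linear mapping W shared by K and V
        self.ln = nn.LayerNorm(dim)
        self.reduce = nn.AvgPool1d(reduction)              # stands in for Reshape(., R1) / Reshape(., R2)
        self.scale = float(dim)                            # scale factor d

    def forward(self, mask):                               # mask: (B, K, H, W)
        x = self.patch_embed(mask).flatten(2).transpose(1, 2) + self.pos_embed   # (B, N, dim)
        q = self.q(x)
        kv = self.reduce(x.transpose(1, 2)).transpose(1, 2)                      # reduced tokens
        k = self.ln(self.w(kv))                            # K = LN(Reshape(k, R1) . W), eq. (5)
        v = self.ln(self.w(kv))                            # V = LN(Reshape(v, R2) . W), eq. (6)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)
        return attn @ v                                    # Attention(Q, K, V), eq. (7)
```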
The coding features are adaptively learned by the mixture-of-experts gating mechanism to provide expert guiding information for the target detection branch, which specifically comprises the following steps:
Firstly, the feature map output by the target detection branch and the downsampled feature map are spliced along the channel dimension; then, a weight is generated from the spliced feature map by the gating network; finally, the two feature maps are combined by a weighted linear combination with the corresponding weights to generate the expert feature map. The whole process is expressed as:
f_i = concatenate(F_di, F_si)   (8)

ω_i = g(flatten(f_i))   (9)

F_mi = ω_i·F_di + (1 − ω_i)·F_si   (10)

wherein i = 1, 2, 3, 4; F_di represents the feature map output by the target detection branch; F_si represents the feature map after downsampling; g represents the gating network function; f_i represents the spliced feature map; ω_i represents the weight; and F_mi represents the expert feature map.
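A minimal sketch of equations (8)–(10); the embodiment only states that a gating network produces the weight, so a single linear layer with a sigmoid is assumed here, as are the channel and spatial sizes in the usage example:

```python
import torch
import torch.nn as nn

class ExpertGate(nn.Module):
    """Mixture-of-experts gate: splice the detection feature F_d and the mask-encoder
    feature F_s along channels, predict a weight w, and return w*F_d + (1-w)*F_s."""
    def __init__(self, channels, spatial):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels * spatial * spatial, 1),   # g(flatten(f_i)), eq. (9)
            nn.Sigmoid(),
        )

    def forward(self, f_det, f_seg):                  # both: (B, C, H, W), same size
        f = torch.cat([f_det, f_seg], dim=1)          # f_i = concatenate(F_di, F_si), eq. (8)
        w = self.gate(f.flatten(1)).view(-1, 1, 1, 1)
        return w * f_det + (1.0 - w) * f_seg          # F_mi = w*F_di + (1-w)*F_si, eq. (10)

# Usage: one gate per pyramid level i = 1..4.
gate = ExpertGate(channels=256, spatial=20)
expert = gate(torch.randn(2, 256, 20, 20), torch.randn(2, 256, 20, 20))
```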
In the above step S3, referring to fig. 2, the target detection branch includes a backbone network (ResNet50), a feature pyramid network, a mixture-of-experts gate, and a detection head. In the training process, the first fusion image is input into the backbone network, which extracts image features of the first fusion image at different scales; the feature pyramid network fuses the image features of different scales to obtain fused feature maps with high resolution and strong semantics; the fused feature maps are input into the mixture-of-experts gating network, and target classification and regression are realized through the detection head in combination with the expert guiding information.
The target detection branch also comprises SAM (Segment Anything Model) and the segmentation coding module. The first fusion image is input into SAM and segmented into a plurality of image blocks; the segmentation coding module codes the segmented image blocks, further enhancing the semantics of the feature map; the feature maps output by the segmentation coding module are then downsampled and input into the mixture-of-experts gating network, providing expert guiding information for the target detection branch.
In the above step S4, because of the difference between the image fusion task and the target detection task, their features are incompatible, and the features output by the target detection branch cannot be used directly to assist the image fusion branch; a feature alignment module is therefore adopted to match the features of the two tasks. In the embodiment of the invention, the expert feature map is subjected to feature alignment processing by the feature alignment module, and a first loss function L_g is constructed based on the feature-aligned expert feature map to fine-tune the image fusion branch so that it generates more target semantic information, see in particular fig. 5.
the step S4 specifically comprises the following steps:
Constructing the first loss function L_g based on the expert feature map output by the target detection branch after preliminary training; the first loss function L_g is a SmoothL1 loss function, expressed as:

L_g = Σ_i SmoothL1(F_ui, F_vi)   (11)

wherein F_ui represents the features output by the feature alignment module; and F_vi represents the features of the image fusion branch;
Acquiring a second loss function L_f of the image fusion branch;

Linearly combining the first loss function and the second loss function to obtain an image fusion loss function L;

Continuously optimizing the image fusion loss function L through back propagation until the optimal image fusion branch is obtained; expressed as:

L = L_f + λ·L_g   (12)

L_f = θ_1·(1 − SSIM(I_f, I_1)) + θ_2·(1 − SSIM(I_f, I_2))   (13)

SSIM(I_f, I_j) = ((2·μ_f·μ_j + k_1)·(2·σ_fj + k_2)) / ((μ_f² + μ_j² + k_1)·(σ_f² + σ_j² + k_2)),  j ∈ {1, 2}   (14)

wherein λ, θ_1 and θ_2 represent weight ratios; SSIM represents the structural similarity loss; I_1 represents the visible light image; I_2 represents the infrared image; I_f represents the first fusion image or the second fusion image; μ_1 and μ_2 represent the average pixel values of the visible light image and the infrared image; μ_f represents the average pixel value of the fusion image; σ_1 and σ_2 represent the pixel standard deviations of the visible light image and the infrared image; σ_f represents the pixel standard deviation of the fusion image; σ_f1 and σ_f2 represent the pixel covariances of the fusion image with the visible light image and with the infrared image, respectively; the fusion image is the first fusion image or the second fusion image; and k_1 and k_2 represent constants.
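A sketch of the fusion loss of equations (11)–(14), using PyTorch's built-in SmoothL1 loss and the ssim function of the pytorch-msssim package as a stand-in SSIM implementation; the weights λ, θ_1, θ_2 and the tensor shapes are illustrative values, not taken from the embodiment:

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim   # structural similarity; any SSIM implementation would do

def fusion_loss(fused, vis, ir, aligned_feats, fusion_feats,
                lam=1.0, theta1=0.5, theta2=0.5):
    """L = L_f + lambda * L_g                                            (eq. 12)
       L_f = theta1*(1 - SSIM(I_f, I_1)) + theta2*(1 - SSIM(I_f, I_2))   (eq. 13)
       L_g = sum_i SmoothL1(F_ui, F_vi)                                  (eq. 11)"""
    l_f = theta1 * (1 - ssim(fused, vis, data_range=1.0)) \
        + theta2 * (1 - ssim(fused, ir, data_range=1.0))
    l_g = sum(F.smooth_l1_loss(fu, fv) for fu, fv in zip(aligned_feats, fusion_feats))
    return l_f + lam * l_g

# Example with random tensors standing in for images and multi-scale features.
fused = torch.rand(2, 1, 256, 256)
vis, ir = torch.rand_like(fused), torch.rand_like(fused)
feats_u = [torch.randn(2, 16, 64, 64), torch.randn(2, 16, 32, 32)]
feats_v = [torch.randn(2, 16, 64, 64), torch.randn(2, 16, 32, 32)]
print(fusion_loss(fused, vis, ir, feats_u, feats_v))
```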
After fine tuning of the image fusion branch, generating a second fusion image; since the image fusion branch has completed the optimization, the quality of the generated second fusion image is higher, see in particular fig. 6.
In the above step S5, the second fusion image is adopted to further fine-tune the target detection branch, so that the target detection task is optimized; the detection effect is shown in fig. 7. It can be seen from fig. 7 that the fine-tuned target detection branch can accurately distinguish easily confused target categories and can still identify small targets even under insufficient light.
In summary, the embodiment of the invention provides a multi-mode aerial image fusion and target detection combined learning method, which uses the expert guiding information provided by the segmentation base model to enhance the detection capability of the target detection branch for small targets in aerial images, and uses the guided target detection branch to assist the image fusion branch in generating more target semantic information. The method can simultaneously improve the performance of the two tasks of visible light and infrared aerial image fusion and target detection, and provides more accurate and efficient data analysis and decision support for unmanned aerial vehicle applications.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. A multi-mode aerial image fusion and target detection combined learning method, characterized by comprising the following steps:
S1, adopting paired visible light images and infrared images to preliminarily train an image fusion branch to generate a first fusion image;
S2, designing a target detection branch with expert guiding information based on the first fusion image;
S3, performing preliminary training on the target detection branch by adopting the first fusion image, and outputting an expert feature map through the target detection branch after preliminary training;
S4, carrying out feature alignment processing on the expert feature map, and fine-tuning the image fusion branch by adopting the feature-aligned expert feature map to generate a second fusion image;
S5, fine-tuning the target detection branch by adopting the second fusion image, so that an optimized target detection task is realized.
2. The multi-mode aerial image fusion and target detection joint learning method according to claim 1, wherein the step S2 specifically includes:
segmenting the first fused image into a plurality of image blocks using a segmentation base model;
and classifying and coding the plurality of image blocks according to preset area intervals and a segmentation coding module, and adaptively learning the coding features by adopting a mixture-of-experts gating mechanism, to form a target detection branch with expert guiding information.
3. The multi-mode aerial image fusion and target detection joint learning method according to claim 2, wherein the setting process of the preset area interval comprises the following steps:
respectively carrying out normalization processing on the paired visible light images and the target real frames in the infrared images, calculating the areas of the target real frames after the normalization processing, and selecting one area as a first clustering center;
calculating the shortest Euclidean distance between the area of the real frames of other targets and the first clustering center by adopting a K-means++ clustering algorithm; the areas of the other target real frames are the areas of the target real frames except the first clustering center;
calculating the probability that the area of each target real frame is selected as the next clustering center according to the shortest Euclidean distance until K clusters are clustered, and obtaining K sections of area intervals; and taking the K section area interval as a preset area interval.
4. The multi-modal aerial image fusion and target detection joint learning method according to claim 2, wherein the segmentation encoding module specifically comprises:
respectively calculating the minimum circumscribed rectangular areas of a plurality of image blocks;
dividing the minimum circumscribed rectangular areas into K classes according to the preset area interval, setting the target area of each class to 1, setting other areas to 0, and obtaining a Mask matrix of the K channels;
flattening the Mask matrix, and mapping the flattened Mask matrix into a fixed dimension; based on the Patch embedding coding feature and the Position embedding coding feature, performing a self-attention operation by using a Transformer encoder to obtain a feature map;
and carrying out downsampling treatment on the feature map by adopting a convolution module to obtain the feature map after downsampling treatment.
5. The multi-modal aerial image fusion and target detection joint learning method of claim 4, wherein the feature map size after the downsampling process is the same as the feature map size of the target detection branch output.
6. The multi-modal aerial image fusion and target detection joint learning method according to claim 4, wherein adaptively learning the coding features by adopting a mixture-of-experts gating mechanism specifically comprises:
firstly, splicing a feature image output by a target detection branch and a feature image after downsampling along the channel dimension; generating weights from the spliced feature images through a gating network; and finally, carrying out weighted linear combination on the two feature images and the corresponding weights to generate an expert feature image.
7. The multi-mode aerial image fusion and target detection joint learning method according to claim 1, wherein the step S4 specifically includes:
performing feature alignment processing on the expert feature map; constructing a first loss function based on the expert feature map after feature alignment processing;
linearly combining the first loss function and the second loss function of the image fusion branch to obtain an image fusion loss function;
optimizing the image fusion loss function through back propagation until an optimal image fusion branch is obtained; and generating a second fusion image based on the optimal image fusion branch.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311058440.8A CN116994164A (en) | 2023-08-22 | 2023-08-22 | Multi-mode aerial image fusion and target detection combined learning method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116994164A true CN116994164A (en) | 2023-11-03 |
Family
ID=88531998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311058440.8A Pending CN116994164A (en) | 2023-08-22 | 2023-08-22 | Multi-mode aerial image fusion and target detection combined learning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116994164A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117274826A (en) * | 2023-11-23 | 2023-12-22 | 山东锋士信息技术有限公司 | River and lake management violation problem remote sensing monitoring method based on large model and prompt guidance |
CN117274826B (en) * | 2023-11-23 | 2024-03-08 | 山东锋士信息技术有限公司 | River and lake management violation problem remote sensing monitoring method based on large model and prompt guidance |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |