CN112418163A - Multispectral target detection blind guiding system - Google Patents
Multispectral target detection blind guiding system
- Publication number
- CN112418163A (application CN202011426982.2A)
- Authority
- CN
- China
- Prior art keywords
- visible light
- infrared thermal
- thermal image
- candidate frame
- feature map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V20/00—Scenes; Scene-specific elements
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/22—Matching criteria, e.g. proximity measures
- G06F18/24—Classification techniques
- G06V10/40—Extraction of image or video features
- G06V2201/07—Indexing scheme relating to image or video recognition or understanding: Target detection
Abstract
The invention provides a multispectral target detection blind guiding system, comprising: a data input module for acquiring a visible light image and an infrared thermal image; a deformable feature extractor module for extracting a visible light image feature map and an infrared thermal image feature map respectively; a candidate frame extraction network for extracting visible light image candidate frames and infrared thermal image candidate frames; a candidate frame complementation module for adding the portions of the visible light image candidate frames not covered by the infrared thermal image candidate frames to the infrared thermal image candidate frames, and the portions of the infrared thermal image candidate frames not covered by the visible light image candidate frames to the visible light image candidate frames, obtaining a visible light image region feature map and an infrared thermal image region feature map; a cross-modal attention fusion module for fusing the visible light image region feature map into the infrared thermal image region feature map according to the similarity relations among region features, obtaining enhanced thermal image features; and a classification and regression module for obtaining the target detection result.
Description
Technical Field
The invention relates to the field of computers, in particular to a multispectral target detection blind guiding system.
Background
The tremendous development of computer vision in recent years has brought new opportunities and possibilities for blind guiding systems. Deep learning models based on convolutional neural networks (CNNs) have reached or even surpassed human-level performance on image classification (ImageNet dataset) and object detection (COCO dataset). Visual perception systems (especially object detection systems) based on deep learning have also performed well in applications such as autonomous driving. Using this technology to help the blind perceive their environment has therefore become a new trend. However, conventional target detection models are generally built on visible light color images, so their applicable scenes are limited by lighting conditions: they cannot be used at night or in places with excessively strong light. Likewise, blind guiding systems based on this technology cannot assist the blind in perceiving the environment around the clock.
Disclosure of Invention
The present invention aims to provide a multispectral target detection blind guiding system that overcomes, or at least partially solves, the above-mentioned problems.
To achieve this purpose, the technical solution of the invention is realized as follows:
one aspect of the present invention provides a multispectral target detection blind guiding system, comprising: a data input module for acquiring a visible light image and an infrared thermal image; a deformable feature extractor module for extracting image features of the visible light image and the infrared thermal image respectively using deformable convolution and outputting a visible light image feature map and an infrared thermal image feature map; a candidate frame extraction network for extracting candidate frames of target objects from the visible light image feature map and the infrared thermal image feature map to obtain visible light image candidate frames and infrared thermal image candidate frames; a candidate frame complementation module for adding the portions of the visible light image candidate frames not covered by the infrared thermal image candidate frames to the infrared thermal image candidate frames, and the portions of the infrared thermal image candidate frames not covered by the visible light image candidate frames to the visible light image candidate frames, to obtain a visible light image region feature map and an infrared thermal image region feature map; a cross-modal attention fusion module for taking the infrared thermal image region feature map as the query vectors and the visible light image region feature map as the key and value vectors and, following the self-attention mechanism, fusing the visible light image region feature map into the infrared thermal image region feature map according to the similarity relations among region features, to obtain thermal image features enhanced by the color image features; and a classification and regression module for performing convolution calculations on the thermal image features enhanced by the color image features and the visible light image region feature map to obtain a target detection result, where the target detection result includes the category of each region and the candidate box offsets.
The data input module is also used for determining the category and position of the training targets. The system further comprises a loss calculation module for computing, with a loss function, the model's combined prediction error on the two tasks of box regression and box classification from the target detection result and the training targets, back-propagating the gradient of the error, and updating the model parameters for model training; iterating continuously, the prediction error of the model keeps decreasing until convergence, yielding a deployable model.
Wherein the deformable feature extractor module comprises: a first deformable feature extractor for extracting image features of the visible light image to obtain the visible light image feature map; and a second deformable feature extractor for extracting image features of the infrared thermal image to obtain the infrared thermal image feature map; the visible light image feature map and the infrared thermal image feature map have the same size.
Wherein the deformable convolution formula is y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k, where y(p) = Σ_{k=1}^{K} w_k · x(p + p_k) is the conventional convolution operation formula, x represents the input feature map, y represents the output feature map, p is the pixel position (w_0, h_0) currently being computed, k denotes the position index within the convolution range, p_k is the position offset relative to p, w_k is the weight corresponding to position k, Δp_k is the additional learned position offset of point k in the convolution, and Δm_k is the additional learned weight of point k in the convolution.
Wherein the first and second deformable feature extractors learn w_k, Δp_k and Δm_k independently of each other; or the first and second deformable feature extractors learn w_k independently while sharing the learning of Δp_k and Δm_k.
Wherein the candidate box extraction network comprises: a first candidate frame extraction network, connected to the first deformable feature extractor, for extracting visible light image candidate frames containing objects from the visible light image feature map; and a second candidate frame extraction network, connected to the second deformable feature extractor, for extracting infrared thermal image candidate frames containing objects from the infrared thermal image feature map.
The candidate frame complementation module is specifically used for adding the portions of the visible light image candidate frames not covered by the infrared thermal image candidate frames to the infrared thermal image candidate frames, adding the portions of the infrared thermal image candidate frames not covered by the visible light image candidate frames to the visible light image candidate frames, extracting region features of different sizes at the corresponding positions of the initial feature maps according to the selected candidate frames, and unifying the region features to the same size through a region pooling layer, so as to obtain a visible light image region feature map and an infrared thermal image region feature map of the same size.
The cross-modal attention fusion module is specifically used for reducing the dimensionality of the infrared thermal image region feature map and the visible light image region feature map through independent convolutions, computing the pairwise similarity relations between the region features of the infrared thermal image region feature map and those of the visible light image region feature map to obtain a relation matrix, weight-normalizing the similarities, and multiplying the convolved features of the visible light image region feature map by the relation matrix to output bimodal complementary enhanced region features, i.e. the thermal image features enhanced by the color image features.
Wherein the cross-modal attention fusion module is further used for adding or concatenating the thermal image features enhanced by the color image features with the infrared thermal image region feature map; the classification and regression module is further configured to perform convolution calculations on the feature map obtained by adding or concatenating the enhanced thermal image features with the infrared thermal image region feature map, together with the visible light image region feature map, to obtain the target detection result, where the target detection result includes the category of each region and the candidate box offsets.
Therefore, the multispectral target detection blind guiding system provided by the invention combines visible light color images and infrared thermal images to build an all-weather, end-to-end multimodal/multispectral target detection blind guiding system, solving the problem that existing blind guiding systems are unsupported or perform poorly in scenes with no illumination, low illumination, or excessively strong illumination.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a multispectral target detection blind guiding system according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a multispectral target detection blind guiding system according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a cross-attention fusion module according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The core of the invention is that:
existing multispectral/multimodal object detection systems commonly assume that the color and thermal images of the same scene are perfectly aligned, but in practice this is not the case: the images of the two modalities tend to be offset in position. This false assumption causes detection systems to make errors or even fail. Moreover, fusion of multimodal data is currently carried out pixel by pixel, which reduces robustness to misalignment and limits the effectiveness of complementary feature fusion. The present invention aims to propose a new solution to these problems.
In the invention, the possible position offset between images of different modalities is taken into account in the network design: on the one hand, the network implicitly learns the alignment relation between the two modalities, avoiding errors that may occur in conventional systems; on the other hand, a region-of-interest (ROI) level feature fusion module is introduced to further improve robustness to the misalignment problem. In addition, the scheme requires no additional annotations, saving cost.
It should be noted that the fusion module is the core of a multispectral target detection system, because it determines how the system uses the information of multimodal images to improve prediction performance. In conventional systems, regardless of where the fusion module is placed, the fusion itself is very naive, such as element-wise addition, concatenation (concat), or weighting of corresponding positions; these approaches do not make effective use of the complementary information of the different modalities, which limits the generalization ability of the model in complex real scenes. For the feature fusion part, the invention provides a candidate frame complementation module and a cross-modal attention module, so that the relevant information of the two modalities is used more fully, the correlation between the two modalities is modeled more comprehensively, and the information exchange between the features of the two modalities within the network is promoted purposefully; this intercommunication improves the accuracy and generalization ability of the system.
Fig. 1 shows a schematic structural diagram of a multispectral target detection blind guiding system provided by an embodiment of the present invention, and referring to fig. 1, the multispectral target detection blind guiding system provided by the embodiment of the present invention includes:
the data input module is used for acquiring a visible light image and an infrared thermal image;
the deformable feature extractor module is used for respectively extracting image features of the visible light image and the infrared thermal image by adopting deformable convolution and outputting a visible light image feature map and an infrared thermal image feature map;
the candidate frame extraction network is used for extracting candidate frames of the target object according to the visible light image characteristic diagram and the infrared thermal image characteristic diagram to obtain a visible light image candidate frame and an infrared thermal image candidate frame;
the candidate frame complementation module is used for adding the part, which is not covered by the infrared thermal image candidate frame, in the visible light image candidate frame into the infrared thermal image candidate frame and adding the part, which is not covered by the visible light image candidate frame, in the infrared thermal image candidate frame into the visible light image candidate frame to obtain a visible light image area characteristic map and an infrared thermal image area characteristic map;
the cross-modal attention fusion module is used for taking the infrared thermal image region characteristic graph as a query vector, taking the visible light image region characteristic graph as a key vector and a value vector, and fusing the visible light image region characteristic graph into the infrared thermal image region characteristic graph according to the similarity relation among the region characteristics by referring to the self-attention module to obtain thermal image characteristics enhanced by color image characteristics;
the classification and regression module is used for performing convolution calculation on the thermal image characteristics enhanced by the color image characteristics and the visible light image area characteristic diagram to obtain a target detection result, wherein the target detection result comprises: the category of each region and the candidate box offset.
Therefore, the invention provides an all-weather, end-to-end multimodal/multispectral target detection blind guiding system combining visible light color images and infrared thermal images. The invention discloses an end-to-end target detection system that requires no position offset supervision and aggregates the information of the two modalities at the region feature level through a self-attention module. The model is based on a two-stage detection algorithm, namely Faster R-CNN: the feature extraction and region proposal network (RPN) stage has two independent branches that extract region features from the visible light and infrared images respectively; the region features of the two branches are then fused through the candidate frame complementation module and the cross-modal self-attention module, and finally the region categories and coordinates are predicted. In a specific embodiment, any general two-stage detection model such as FPN or R-FCN can be used as the base model; it is not limited to Faster R-CNN.
The following describes in detail the multispectral target detection blind guiding system provided by the embodiment of the present invention with reference to fig. 2 and fig. 3:
as an optional implementation manner of the embodiment of the present invention, the data input module is further configured to determine a category and a location of the training target. Therefore, the learning target can be input to the detection network in the training process to carry out model training.
Specifically, the method comprises the following steps:
a data input module: the input to the neural network comprises two parts, image input and detection target input. The image input is the bimodal pair of images to be detected, namely a visible light color image (RGB, three channels) and an infrared thermal image (a grayscale image; the raw data has only one channel). Assuming the input image has height and width H and W, the images fed to the network have sizes [N×3×H×W, N×1×H×W] (in some common implementations, the channel of the infrared thermal image may also be replicated three times to obtain a three-channel input, i.e. N×3×H×W), where N is the batch size. The detection target input is the category and position of each annotated target object in the original image, with the position represented by the coordinates [x1, y1, x2, y2] of the target object's circumscribed rectangle, where [x1, y1] and [x2, y2] are the coordinates of the top-left and bottom-right corners of the bounding box respectively. The detection targets are first annotated manually and are fed to the detection network as learning targets during training for model training.
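To make the shapes concrete, the following is a minimal sketch of the data-input format described above, assuming a PyTorch implementation; the tensor names and sample values are illustrative only and not part of the patent.

```python
import torch

N, H, W = 4, 512, 640                      # batch size and image resolution (example values)
rgb = torch.rand(N, 3, H, W)               # visible-light colour images, N x 3 x H x W
thermal = torch.rand(N, 1, H, W)           # infrared thermal images, N x 1 x H x W
# Optionally replicate the thermal channel three times to obtain a 3-channel input.
thermal_3ch = thermal.repeat(1, 3, 1, 1)   # N x 3 x H x W

# Detection targets: per image, class labels and boxes [x1, y1, x2, y2]
# (top-left and bottom-right corners of the ground-truth rectangle).
targets = [
    {"labels": torch.tensor([1, 3]),
     "boxes": torch.tensor([[40.0, 60.0, 120.0, 200.0],
                            [300.0, 80.0, 420.0, 260.0]])}
    for _ in range(N)
]
```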
As an optional implementation of the embodiment of the invention, the deformable feature extractor module comprises: the first deformable feature extractor is used for extracting image features of the visible light image to obtain a visible light image feature map; the second deformable feature extractor is used for extracting image features of the infrared thermal image to obtain an infrared thermal image feature map; the visible light image characteristic diagram and the infrared thermal image characteristic diagram have the same size.
As an optional implementation of the embodiment of the present invention, the deformable convolution formula is y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k, where y(p) = Σ_{k=1}^{K} w_k · x(p + p_k) is the general convolution operation formula, x represents the input feature map, y represents the output feature map, p is the pixel position (w_0, h_0) currently being computed, k denotes the position index within the convolution range, p_k is the position offset relative to p, w_k is the weight corresponding to position k, Δp_k is the additional learned position offset of point k in the convolution, and Δm_k is the additional learned weight of point k in the convolution.
As an optional implementation of the embodiments of the invention, the first and second deformable feature extractors learn w_k, Δp_k and Δm_k independently of each other; or the first and second deformable feature extractors learn w_k independently while sharing the learning of Δp_k and Δm_k.
Specifically, the method comprises the following steps:
a deformable feature extractor module: because the image inputs of the two modalities have a position offset, the invention uses deformable convolution in the feature extraction module so that the two branch networks implicitly achieve feature-level alignment while extracting image features independently. In current mainstream feature extractors such as ResNet-50, the convolutions use kernels with relatively fixed geometry, e.g. square 3×3 or 7×7 kernels, whose capacity to model geometric transformations is inherently limited. Deformable convolution adds a displacement variable to the conventional convolution; the displacement is learned automatically during model training, so the receptive field of the convolution after the displacement is no longer a square but becomes an arbitrary polygon adapted to the training data. Facing the position offset present in the multimodal images, the two feature extractors can automatically adjust their respective deformable convolutions through training, so that the extracted features are aligned at the level of the convolution receptive field. Meanwhile, no additional supervision information is needed to calibrate the images of the two modalities, which saves cost and is easy to implement. Taking the two modality images from the data input module as input, the feature maps finally output by the two feature extractors have size N×C×H′×W′, where H′ = H/16, W′ = W/16, and C is the feature dimension or number of channels, e.g. 512 or 2048.
The conventional convolution operation is given by formula (1), y(p) = Σ_{k=1}^{K} w_k · x(p + p_k), where x represents the input feature map, y the output feature map, and p the pixel position (w_0, h_0) currently being computed; k is the position index within the convolution range (for example, K = 9 in a 3×3 convolution); p_k is the position offset relative to p; and w_k, the weight for position k, is a learnable parameter. The deformable convolution, formula (2), y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k, adds two learnable parameters to formula (1): Δp_k, an additional position offset for point k in the convolution, and Δm_k, an additional weight for point k in the convolution. As a supplementary explanation, the invention provides two implementation forms of the deformable convolution network as embodiments. In the first, the feature extraction networks for the visible light images and the thermal infrared images are completely independent of each other, and the deformation parameters of the deformable convolution are learned by the respective feature extraction networks, i.e. the learnable parameters w_k, Δp_k, Δm_k of the deformable convolutions in the two branch networks are all independent. In the second, the feature extraction networks for the visible image and the thermal infrared image are only partially independent: the main computations of feature extraction are independent, i.e. the w_k are independent, but the deformation parameters Δp_k and Δm_k are shared; specifically, these latter two deformation parameters are learned by taking the fused features of the two modalities as input.
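As an illustration only, the sketch below shows how the modulated deformable convolution of formula (2) could be realized with torchvision's deform_conv2d operator (assumed available); the small offset/mask prediction convolutions, the `guide` argument used for the shared-deformation-parameter variant, and all layer sizes are assumptions of the sketch, not the patent's exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)  # w_k
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, 3, padding=1)         # predicts Δp_k
        self.mask_pred = nn.Conv2d(in_ch, k * k, 3, padding=1)               # predicts Δm_k
        self.k = k

    def forward(self, x, guide=None):
        # 'guide' (same channel count as x) lets the offsets be learned from a fused
        # bimodal feature, as in the second, shared-parameter variant; otherwise each
        # branch predicts its own offsets from its own feature map.
        src = x if guide is None else guide
        offset = self.offset_pred(src)              # additional position offsets Δp_k
        mask = torch.sigmoid(self.mask_pred(src))   # additional per-point weights Δm_k
        return deform_conv2d(x, offset, self.weight, padding=self.k // 2, mask=mask)

# Each branch applies its own deformable block to its own feature map.
rgb_feat = DeformableBlock(64, 128)(torch.rand(1, 64, 32, 40))   # visible-light branch
ir_feat = DeformableBlock(64, 128)(torch.rand(1, 64, 32, 40))    # thermal branch
```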
As an optional implementation manner of the embodiment of the present invention, the candidate box extraction network includes: the first candidate frame extraction network is used for connecting the first deformable feature extractor and extracting visible light image candidate frames with objects in the visible light image feature map; and the second candidate frame extraction network is connected with the second deformable feature extractor and is used for extracting the infrared thermal image candidate frames of the object in the infrared thermal image feature map.
Specifically, the method comprises the following steps:
candidate box extraction network: the candidate frame extraction network module takes the feature maps output by the feature extractors as input and extracts candidate frames that contain an object, i.e. predictions of the true circumscribed rectangles of the target objects, regardless of which specific class each object belongs to. Specifically, k anchor frames (anchors) of different sizes are generated for each pixel of the feature map; the feature map within the k anchor frames is then fed to the extraction network, which predicts the probability that each anchor frame contains an object, i.e. k×2 classification results, and the offset of the anchor frame relative to the real position of the object, i.e. k×4 regression results. For a feature map of size N×C×H′×W′, the extraction network outputs N×H′×W′×k×2 + N×H′×W′×k×4 results. Finally, through non-maximum suppression filtering, the M (usually M = 1024) candidate frames most likely to contain an object are selected and stored in a matrix of size M×4. The two branch networks each output M relatively independent candidate frames based on their respective feature maps.
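A minimal sketch of such a candidate-frame (RPN-style) head is given below, assuming PyTorch; the layer widths and anchor count are illustrative, and the non-maximum suppression step is only noted in a comment.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_ch, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.cls = nn.Conv2d(in_ch, k * 2, 1)   # object / background score per anchor (k x 2)
        self.reg = nn.Conv2d(in_ch, k * 4, 1)   # offsets w.r.t. each anchor (k x 4)

    def forward(self, feat):
        t = torch.relu(self.conv(feat))
        return self.cls(t), self.reg(t)

feat = torch.rand(1, 512, 32, 40)                # N x C x H' x W'
scores, deltas = RPNHead(512)(feat)              # 1 x 18 x 32 x 40 and 1 x 36 x 32 x 40
# After decoding the anchors and applying non-maximum suppression, the top
# M (e.g. M = 1024) boxes per branch are kept as an M x 4 matrix of [x1, y1, x2, y2].
```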
As an optional implementation manner of the embodiment of the present invention, the candidate frame complementation module is specifically configured to add a portion, which is not covered by the infrared thermal image candidate frame, of the visible light image candidate frame to the infrared thermal image candidate frame, add a portion, which is not covered by the visible light image candidate frame, of the infrared thermal image candidate frame to the visible light image candidate frame, extract, according to the selected candidate frame, region features with different sizes at corresponding positions on the initial feature map, and unify, through the region pooling layer, the region features to the same size, so as to obtain the visible light image region feature map and the infrared thermal image region feature map with the same size.
Specifically, the method comprises the following steps:
candidate frame complementation module: this module fuses the candidate frames extracted by the two branch networks to achieve complementation between the two modalities at the level of target object positions. Under poor lighting conditions, the candidate frames extracted by the color image branch are likely to miss objects, while those extracted by the thermal image branch are relatively stable; conversely, there are cases where the thermal image branch misses objects that the color image can detect, such as low-temperature poles on a cloudy day. The module uses both modalities to obtain a more complete set of candidate frames. Specifically, the module takes the candidate boxes of the two modalities obtained in the previous stage as input; the candidate boxes of one modality whose IoU with the boxes of the other modality is smaller than a threshold p ∈ [0.5, 0.8] (say m of them) are added to the other modality's candidate boxes, increasing the number of candidate boxes of each modality to M′ = M + m; the exact value of the threshold p may vary slightly between embodiments. According to the selected candidate frames, region features of different sizes are extracted at the corresponding positions of the initial feature maps and then unified to the same size L×L (e.g. L = 7) through a region pooling layer (ROI pooling). Finally, the module outputs M′ region feature maps for each of the two modalities, with a total size of 2×N×M′×C×L×L.
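The following hedged sketch illustrates the complementation rule and the subsequent region pooling using torchvision's box_iou and roi_align (both assumed available); the box values, the threshold, and the feature-map scale are made-up examples rather than the patent's configuration.

```python
import torch
from torchvision.ops import box_iou, roi_align

def complement(boxes_a, boxes_b, p=0.5):
    """Append to boxes_b those boxes of boxes_a not already covered by boxes_b."""
    iou = box_iou(boxes_a, boxes_b)                  # |A| x |B| IoU matrix
    uncovered = iou.max(dim=1).values < p            # boxes of A whose best IoU with B is below p
    return torch.cat([boxes_b, boxes_a[uncovered]], dim=0)

rgb_boxes = torch.tensor([[10., 10., 60., 80.], [200., 40., 260., 120.]])
ir_boxes = torch.tensor([[12., 12., 58., 78.]])
ir_boxes_aug = complement(rgb_boxes, ir_boxes)       # thermal set gains the box it missed
rgb_boxes_aug = complement(ir_boxes, rgb_boxes)      # and vice versa

# Extract L x L region features from the thermal feature map (L = 7 here);
# spatial_scale maps image coordinates onto the 1/16-resolution feature map.
ir_feat_map = torch.rand(1, 512, 32, 40)
ir_regions = roi_align(ir_feat_map, [ir_boxes_aug], output_size=7, spatial_scale=1 / 16)
```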
As an optional implementation manner of the embodiment of the present invention, the cross-modal attention fusion module is specifically configured to perform dimension reduction on the infrared thermal image area feature map and the visible light image area feature map through independent convolution, calculate two similarity relationships between each area feature in the infrared thermal image area feature map and the visible light image area feature map, obtain a relationship matrix, perform weight normalization on the similarity, perform convolution on the features in the visible light image area feature map, and multiply the features by the relationship matrix to output a bimodal complementary enhanced area feature, thereby obtaining a thermal image feature enhanced by a color image feature.
Specifically, the method comprises the following steps:
a cross-modal attention fusion module: to improve the effect of feature fusion, the invention introduces a bidirectional feature enhancement module. Because different modalities image the environment through different mechanisms, point-to-point feature fusion can hardly provide substantial help. For example, a pedestrian under dark lighting conditions is mostly dark and invisible in the color image (the entire upper body), with only a small part illuminated (below the lower legs), so the color image branch cannot extract discriminative features at those points, and fusing them point by point into the thermal image features according to position coordinates cannot provide much complementary or enhancing benefit. Fusion at the region level is a comparatively more effective approach. Specifically, with the help of the candidate frame complementation module, the color image branch can also obtain the candidate frame of the mostly invisible pedestrian in the example above; the corresponding region feature contains the information of the small illuminated part of the lower legs, and a feature covering only part of an object can be fused into the thermal image feature as useful supplementary information to enhance the latter's representation ability. Therefore, the module adopts a strategy of fusion at the region feature level.
In addition, different objects in a scene often have a certain dependency relationship, and the relationship among the objects can help to improve the representation capability of the model, so that the prediction accuracy is improved. For example, when one person appears on a zebra crossing, there are often others (because of the green light); the bench or trash can in the park is typically placed outside the lawn, rather than on the lawn. Therefore, the module also models the relationship between the regions/potential objects in the process of fusing the characteristics of the two modal regions, and uses the relationship between the objects to enhance the expression of the region characteristics.
Specifically, the module takes the infrared thermal image region features and the visible light color image region features as input, uses the infrared thermal image region features as the query vectors and the visible light color image region features as the key and value vectors, and, following the self-attention module, fuses the visible light color image region features into the infrared thermal image region features according to the similarity relations between region features, thereby achieving enhancement of the infrared features by the visible light features, or equivalently complementation of the dual-spectrum features, as shown in fig. 2. For convenience of description, take N = 1. The thermal image region features and the color image region features first pass through independent L×L convolutions for the same dimensionality reduction, yielding queries and keys of size M′×C×1×1. The pairwise similarities between the region features of the two modalities are then computed to obtain a relation matrix of size M′×M′; the similarity can be computed as the negative Euclidean distance, a matrix (dot) product, or another measure, and the similarities are then weight-normalized, by default with a row-wise softmax in this embodiment. Meanwhile, the color image region features are transformed by a 1×1 convolution into values of size M′×C×L×L; finally, matrix multiplication with the relation matrix outputs bimodal complementary enhanced region features of size M′×C×L×L, i.e. the thermal image features enhanced by the color image features at the region feature level.
This module models the relations between candidate regions rather than the relations between all pixel points as in the original method, so its computational complexity is much lower (the former is O(M′×M′) ~ 10^6, the latter O(H×W×H×W) ~ 10^8) and the computation is efficient.
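A hedged sketch of this region-level cross-modal attention is given below, assuming PyTorch; the reduced dimension d, the residual addition at the end, and the module name are assumptions of the sketch rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    def __init__(self, C, L=7, d=256):
        super().__init__()
        self.q = nn.Conv2d(C, d, L)       # L x L conv: thermal regions -> M' x d queries
        self.k = nn.Conv2d(C, d, L)       # L x L conv: colour regions -> M' x d keys
        self.v = nn.Conv2d(C, C, 1)       # 1 x 1 conv: colour regions -> M' x C x L x L values

    def forward(self, ir_regions, rgb_regions):
        M, C, L, _ = ir_regions.shape
        q = self.q(ir_regions).flatten(1)             # M' x d
        k = self.k(rgb_regions).flatten(1)            # M' x d
        rel = F.softmax(q @ k.t(), dim=1)             # M' x M' relation matrix, row-wise softmax
        v = self.v(rgb_regions).reshape(M, -1)        # M' x (C*L*L)
        fused = (rel @ v).reshape(M, C, L, L)         # colour-enhanced thermal region features
        return fused + ir_regions                     # residual add (concat is the alternative below)

ir_r = torch.rand(256, 512, 7, 7)                     # M' thermal region features
rgb_r = torch.rand(256, 512, 7, 7)                    # M' colour region features
enhanced = CrossModalAttention(512)(ir_r, rgb_r)
```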
As an optional implementation manner of the embodiment of the present invention, the cross-modal attention fusion module further adds or concatenates the thermal image features enhanced by the color image features with the infrared thermal image region feature map; the classification and regression module is further configured to perform convolution calculations on the feature map obtained by this addition or concatenation, together with the visible light image region feature map, to obtain the target detection result, where the target detection result includes the category of each region and the candidate box offsets. That is, the output of the cross-modal attention fusion module may be added to or concatenated (concat) with the thermal image region features.
As an optional implementation manner of the embodiment of the present invention, the system further includes: and the loss calculation module is used for calculating the comprehensive prediction error of the model in two tasks of frame regression and frame classification by adopting a loss function according to the target detection result and the training target, returning the gradient of the error, updating the model parameters, performing model training, and continuously iterating, wherein the prediction error of the model is continuously reduced until convergence, so that the model capable of being deployed is obtained.
Specifically, the method comprises the following steps:
a loss calculation module: this module takes the prediction results of the whole model and the corresponding training targets as input, computes with a conventional loss function the model's combined prediction error on the two tasks of box regression and box classification, then back-propagates the gradient of the error and updates the parameters of the whole model according to a given learning rate, realizing model training. The iteration is repeated and the prediction error of the model keeps decreasing until convergence, finally yielding a model that can be applied and deployed. During training, mixed precision training can be used for the network parameters to reduce GPU memory usage and accelerate training.
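The following is a minimal sketch of one training iteration with mixed precision via torch.cuda.amp (a CUDA device is assumed); the placeholder linear model and MSE loss merely stand in for the full detector and its box-regression/classification losses.

```python
import torch

model = torch.nn.Linear(10, 5).cuda()                 # placeholder for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()                  # scales the loss for mixed precision

def train_step(inputs, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        preds = model(inputs)
        # In the real system: classification loss + box-regression loss from the heads.
        loss = torch.nn.functional.mse_loss(preds, targets)
    scaler.scale(loss).backward()                     # back-propagate the error gradient
    scaler.step(optimizer)                            # update the model parameters
    scaler.update()
    return loss.item()

loss = train_step(torch.rand(8, 10, device="cuda"), torch.rand(8, 5, device="cuda"))
```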
Therefore, the multispectral target detection blind guiding system provided by the embodiment of the invention uses multispectral images to build an all-weather blind guiding system based on a target detection network; it applies deformable convolution in the feature extractors to implicitly learn the alignment relation between the multispectral images and cope with possible position offsets, and it provides a candidate frame complementation module and a cross-modal attention module that make full use of the complementary information of the multispectral images, thereby achieving more accurate all-weather target detection and further enhancing robustness to the feature misalignment problem. The system therefore supports all-weather blind guiding; the alignment relation of the multispectral images is learned implicitly, without additional annotation, which saves cost; the candidate frame complementation module and the cross-modal attention module exploit the complementary information of the multispectral images more fully, making the fusion more effective and the results better; and the cross-modal attention module has low computational complexity and high efficiency.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (9)
1. A multi-spectral target detection blind guide system, comprising:
the data input module is used for acquiring a visible light image and an infrared thermal image;
the deformable feature extractor module is used for respectively extracting the image features of the visible light image and the infrared thermal image by adopting deformable convolution and outputting a visible light image feature map and an infrared thermal image feature map;
the candidate frame extraction network is used for extracting candidate frames of the target object according to the visible light image characteristic diagram and the infrared thermal image characteristic diagram to obtain a visible light image candidate frame and an infrared thermal image candidate frame;
a candidate frame complementing module, configured to add, to the infrared thermal image candidate frame, a portion of the visible light image candidate frame that is not covered by the infrared thermal image candidate frame, and add, to the visible light image candidate frame, a portion of the infrared thermal image candidate frame that is not covered by the visible light image candidate frame, so as to obtain a visible light image area feature map and an infrared thermal image area feature map;
the cross-mode attention fusion module is used for taking the infrared thermal image region feature map as a query vector, taking the visible light image region feature map as a key vector and a value vector, and fusing the visible light image region feature map into the infrared thermal image region feature map according to the similarity relation among the region features by referring to the self-attention module to obtain thermal image features enhanced by color image features;
a classification and regression module, configured to perform convolution calculation on the thermal image features enhanced by the color image features and the visible light image region feature map to obtain a target detection result, where the target detection result includes: the category of each region and the candidate box offset.
2. The system of claim 1,
the data input module is also used for determining the category and the position of a training target;
the system further comprises:
and a loss calculation module, configured to compute, with a loss function, the model's combined prediction error on the two tasks of box regression and box classification from the target detection result and the training targets, back-propagate the gradient of the error, and update the model parameters to train the model, iterating continuously so that the prediction error of the model keeps decreasing until convergence, to obtain a model that can be applied and deployed.
3. The system of claim 1, wherein the deformable feature extractor module comprises:
the first deformable feature extractor is used for extracting image features of the visible light image to obtain a visible light image feature map;
the second deformable feature extractor is used for extracting image features of the infrared thermal image to obtain an infrared thermal image feature map;
the visible light image feature map and the infrared thermal image feature map are the same in size.
4. The system of claim 3, wherein the deformable convolution formula is:
y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k, wherein y(p) = Σ_{k=1}^{K} w_k · x(p + p_k) is the conventional convolution operation formula, x represents the input feature map, y represents the output feature map, p is the pixel position (w_0, h_0) currently being computed, k denotes the position index within the convolution range, p_k is the position offset relative to p, w_k is the weight corresponding to position k, Δp_k is the additional learned position offset of point k in the convolution, and Δm_k is the additional learned weight of point k in the convolution.
5. The system of claim 4, wherein the first deformable feature extractor and the second deformable feature extractor each independently learn w_k, Δp_k and Δm_k; or the first and second deformable feature extractors learn w_k independently while sharing the learning of Δp_k and Δm_k.
6. The system of claim 3, wherein the candidate box extraction network comprises:
a first candidate frame extraction network, connected to the first deformable feature extractor, for extracting the visible light image candidate frame in which an object exists in the visible light image feature map;
and the second candidate frame extraction network is connected with the second deformable feature extractor and is used for extracting the infrared thermal image candidate frames of the objects in the infrared thermal image feature map.
7. The system according to claim 1, wherein the candidate frame complementing module is specifically configured to add a portion of the visible-light image candidate frame that is not covered by the infrared thermal image candidate frame to the infrared thermal image candidate frame, add a portion of the infrared thermal image candidate frame that is not covered by the visible-light image candidate frame to the visible-light image candidate frame, extract, according to the selected candidate frame, the region features at different sizes of the corresponding positions on the initial feature map, and unify the region features to the same size through a region pooling layer, thereby obtaining the visible-light image region feature map and the infrared thermal image region feature map that have the same size.
8. The system of claim 1,
the cross-modal attention fusion module is specifically configured to reduce the dimensionality of the infrared thermal image region feature map and the visible light image region feature map through independent convolutions, compute the pairwise similarity relations between the region features of the infrared thermal image region feature map and those of the visible light image region feature map to obtain a relation matrix, weight-normalize the similarities, and convolve the features of the visible light image region feature map and multiply them by the relation matrix to output bimodal complementary enhanced region features, obtaining the thermal image features enhanced by the color image features.
9. The system of claim 1,
the cross-modal attention fusion module is used for adding or merging the thermal image characteristics enhanced by the color image characteristics and the infrared thermal image area characteristic map;
the classification and regression module is further configured to perform convolution calculation on the feature map obtained by adding or merging the thermal image features enhanced by the color image features with the infrared thermal image region feature map, together with the visible light image region feature map, to obtain a target detection result, where the target detection result includes: the category of each region and the candidate box offset.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011426982.2A CN112418163B (en) | 2020-12-09 | 2020-12-09 | Multispectral target detection blind guiding system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011426982.2A CN112418163B (en) | 2020-12-09 | 2020-12-09 | Multispectral target detection blind guiding system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112418163A true CN112418163A (en) | 2021-02-26 |
CN112418163B CN112418163B (en) | 2022-07-12 |
Family
ID=74775286
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011426982.2A Active CN112418163B (en) | 2020-12-09 | 2020-12-09 | Multispectral target detection blind guiding system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112418163B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113591770A (en) * | 2021-08-10 | 2021-11-02 | 中国科学院深圳先进技术研究院 | Multimode fusion obstacle detection method and device based on artificial intelligence blind guiding |
CN113688806A (en) * | 2021-10-26 | 2021-11-23 | 南京智谱科技有限公司 | Infrared and visible light image fused multispectral target detection method and system |
CN114359776A (en) * | 2021-11-25 | 2022-04-15 | 国网安徽省电力有限公司检修分公司 | Flame detection method and device integrating light imaging and thermal imaging |
CN115115919A (en) * | 2022-06-24 | 2022-09-27 | 国网智能电网研究院有限公司 | Power grid equipment thermal defect identification method and device |
CN115393684A (en) * | 2022-10-27 | 2022-11-25 | 松立控股集团股份有限公司 | Anti-interference target detection method based on automatic driving scene multi-mode fusion |
CN116468928A (en) * | 2022-12-29 | 2023-07-21 | 长春理工大学 | Thermal infrared small target detection method based on visual perception correlator |
CN116543378A (en) * | 2023-07-05 | 2023-08-04 | 杭州海康威视数字技术股份有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN117078920A (en) * | 2023-10-16 | 2023-11-17 | 昆明理工大学 | Infrared-visible light target detection method based on deformable attention mechanism |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109448035A (en) * | 2018-11-14 | 2019-03-08 | 重庆邮电大学 | Infrared image and visible light image registration method based on deep learning |
CN111210443A (en) * | 2020-01-03 | 2020-05-29 | 吉林大学 | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance |
CN111709902A (en) * | 2020-05-21 | 2020-09-25 | 江南大学 | Infrared and visible light image fusion method based on self-attention mechanism |
CN111861880A (en) * | 2020-06-05 | 2020-10-30 | 昆明理工大学 | Image super-fusion method based on regional information enhancement and block self-attention |
- 2020-12-09 CN CN202011426982.2A patent/CN112418163B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109448035A (en) * | 2018-11-14 | 2019-03-08 | 重庆邮电大学 | Infrared image and visible light image registration method based on deep learning |
CN111210443A (en) * | 2020-01-03 | 2020-05-29 | 吉林大学 | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance |
CN111709902A (en) * | 2020-05-21 | 2020-09-25 | 江南大学 | Infrared and visible light image fusion method based on self-attention mechanism |
CN111861880A (en) * | 2020-06-05 | 2020-10-30 | 昆明理工大学 | Image super-fusion method based on regional information enhancement and block self-attention |
Non-Patent Citations (3)
Title |
---|
KAILAI ZHOU 等: "Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems", 《ARXIV.ORG》 * |
XIZHOU ZHU 等: "Deformable ConvNets V2: More Deformable, Better Results", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 * |
官大衍: "可见光与长波红外图像融合的行人检测方法研究", 《中国优秀博硕士学位论文全文数据库(博士)信息科技辑》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113591770A (en) * | 2021-08-10 | 2021-11-02 | 中国科学院深圳先进技术研究院 | Multimode fusion obstacle detection method and device based on artificial intelligence blind guiding |
CN113591770B (en) * | 2021-08-10 | 2023-07-18 | 中国科学院深圳先进技术研究院 | Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding |
WO2023015799A1 (en) * | 2021-08-10 | 2023-02-16 | 中国科学院深圳先进技术研究院 | Multimodal fusion obstacle detection method and apparatus based on artificial intelligence blindness guiding |
CN113688806A (en) * | 2021-10-26 | 2021-11-23 | 南京智谱科技有限公司 | Infrared and visible light image fused multispectral target detection method and system |
CN114359776A (en) * | 2021-11-25 | 2022-04-15 | 国网安徽省电力有限公司检修分公司 | Flame detection method and device integrating light imaging and thermal imaging |
CN114359776B (en) * | 2021-11-25 | 2024-04-26 | 国网安徽省电力有限公司检修分公司 | Flame detection method and device integrating light and thermal imaging |
CN115115919B (en) * | 2022-06-24 | 2023-05-05 | 国网智能电网研究院有限公司 | Power grid equipment thermal defect identification method and device |
CN115115919A (en) * | 2022-06-24 | 2022-09-27 | 国网智能电网研究院有限公司 | Power grid equipment thermal defect identification method and device |
CN115393684A (en) * | 2022-10-27 | 2022-11-25 | 松立控股集团股份有限公司 | Anti-interference target detection method based on automatic driving scene multi-mode fusion |
CN116468928A (en) * | 2022-12-29 | 2023-07-21 | 长春理工大学 | Thermal infrared small target detection method based on visual perception correlator |
CN116468928B (en) * | 2022-12-29 | 2023-12-19 | 长春理工大学 | Thermal infrared small target detection method based on visual perception correlator |
CN116543378A (en) * | 2023-07-05 | 2023-08-04 | 杭州海康威视数字技术股份有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN116543378B (en) * | 2023-07-05 | 2023-09-29 | 杭州海康威视数字技术股份有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN117078920A (en) * | 2023-10-16 | 2023-11-17 | 昆明理工大学 | Infrared-visible light target detection method based on deformable attention mechanism |
CN117078920B (en) * | 2023-10-16 | 2024-01-23 | 昆明理工大学 | Infrared-visible light target detection method based on deformable attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN112418163B (en) | 2022-07-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112418163B (en) | Multispectral target detection blind guiding system | |
WO2021233029A1 (en) | Simultaneous localization and mapping method, device, system and storage medium | |
CN111259906B (en) | Method for generating remote sensing image target segmentation countermeasures under condition containing multilevel channel attention | |
CN109902548B (en) | Object attribute identification method and device, computing equipment and system | |
CN111582201A (en) | Lane line detection system based on geometric attention perception | |
CN110781262B (en) | Semantic map construction method based on visual SLAM | |
CN112950645B (en) | Image semantic segmentation method based on multitask deep learning | |
CN109509156B (en) | Image defogging processing method based on generation countermeasure model | |
CN114937083B (en) | Laser SLAM system and method applied to dynamic environment | |
CN113538401B (en) | Crowd counting method and system combining cross-modal information in complex scene | |
CN112767478B (en) | Appearance guidance-based six-degree-of-freedom pose estimation method | |
CN115100678A (en) | Cross-modal pedestrian re-identification method based on channel recombination and attention mechanism | |
CN113326735A (en) | Multi-mode small target detection method based on YOLOv5 | |
CN111767854B (en) | SLAM loop detection method combined with scene text semantic information | |
CN114066955A (en) | Registration method for registering infrared light image to visible light image | |
Shi et al. | An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds | |
CN112132013A (en) | Vehicle key point detection method | |
CN113112547A (en) | Robot, repositioning method thereof, positioning device and storage medium | |
CN112529011B (en) | Target detection method and related device | |
Yuan et al. | Dual attention and dual fusion: An accurate way of image-based geo-localization | |
CN116503618B (en) | Method and device for detecting remarkable target based on multi-mode and multi-stage feature aggregation | |
CN117218345A (en) | Semantic segmentation method for electric power inspection image | |
CN117152470A (en) | Space-sky non-cooperative target pose estimation method and device based on depth feature point matching | |
Gong et al. | Skipcrossnets: Adaptive skip-cross fusion for road detection | |
CN114882328B (en) | Target detection method combining visible light image and infrared image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||