CN112418163B - Multispectral target detection blind guiding system - Google Patents

Multispectral target detection blind guiding system

Info

Publication number
CN112418163B
Authority
CN
China
Prior art keywords
visible light
infrared thermal
thermal image
candidate frame
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011426982.2A
Other languages
Chinese (zh)
Other versions
CN112418163A (en)
Inventor
石德君
张树
俞益洲
李一鸣
乔昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Original Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenrui Bolian Technology Co Ltd, Shenzhen Deepwise Bolian Technology Co Ltd filed Critical Beijing Shenrui Bolian Technology Co Ltd
Priority to CN202011426982.2A priority Critical patent/CN112418163B/en
Publication of CN112418163A publication Critical patent/CN112418163A/en
Application granted granted Critical
Publication of CN112418163B publication Critical patent/CN112418163B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V20/00 Scenes; Scene-specific elements
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06V10/40 Extraction of image or video features
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multispectral target detection blind guiding system, comprising: a data input module for acquiring a visible light image and an infrared thermal image; a deformable feature extractor module for extracting a visible light image feature map and an infrared thermal image feature map respectively; a candidate frame extraction network for extracting visible light image candidate frames and infrared thermal image candidate frames; a candidate frame complementation module for adding the part of the visible light image candidate frames not covered by the infrared thermal image candidate frames to the infrared thermal image candidate frames, and the part of the infrared thermal image candidate frames not covered by the visible light image candidate frames to the visible light image candidate frames, so as to obtain a visible light image region feature map and an infrared thermal image region feature map; a cross-modal attention fusion module for fusing the visible light image region feature map into the infrared thermal image region feature map according to the similarity relations among the region features, so as to obtain enhanced thermal image features; and a classification and regression module for obtaining the target detection result.

Description

Multispectral target detection blind guiding system
Technical Field
The invention relates to the field of computers, in particular to a multispectral target detection blind guiding system.
Background
The tremendous development of computer vision in recent years has brought new opportunities and possibilities for blind guiding systems. Deep learning models based on Convolutional Neural Networks (CNN) have reached and even surpassed human-level performance on image classification (ImageNet dataset) and object detection (COCO dataset). Visual perception systems (in particular, object detection systems) based on deep learning have performed well in applications such as autonomous driving, so assisting the blind in perceiving their environment with this technology has become a new trend. However, previous object detection models are generally built on visible light color images, so the applicable scenes are constrained by lighting conditions: such models cannot be used at night or in places with overly strong light. Accordingly, blind guiding systems based on this technology cannot assist the blind in perceiving the environment at all times.
Disclosure of Invention
The present invention aims to provide a multispectral target detection blind guiding system that overcomes, or at least partially solves, the above-mentioned problems.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
one aspect of the present invention provides a multispectral target detection blind guiding system, comprising: the data input module is used for acquiring a visible light image and an infrared thermal image; the deformable feature extractor module is used for respectively extracting image features of the visible light image and the infrared thermal image by adopting deformable convolution and outputting a visible light image feature map and an infrared thermal image feature map; the candidate frame extraction network is used for extracting candidate frames of the target object according to the visible light image characteristic diagram and the infrared thermal image characteristic diagram to obtain a visible light image candidate frame and an infrared thermal image candidate frame; the candidate frame complementation module is used for adding the part, which is not covered by the infrared thermal image candidate frame, in the visible light image candidate frame into the infrared thermal image candidate frame and adding the part, which is not covered by the visible light image candidate frame, in the infrared thermal image candidate frame into the visible light image candidate frame to obtain a visible light image area characteristic map and an infrared thermal image area characteristic map; the cross-mode attention fusion module is used for taking the infrared thermal image area characteristic graph as a query vector, taking the visible light image area characteristic graph as a key vector and a value vector, and fusing the visible light image area characteristic graph into the infrared thermal image area characteristic graph according to the similarity relation among the area characteristics by referring to the self-attention module to obtain thermal image characteristics enhanced by color image characteristics; the classification and regression module is used for performing convolution calculation on the thermal image characteristics enhanced by the color image characteristics and the visible light image area characteristic diagram to obtain a target detection result, wherein the target detection result comprises: the category of each region and the candidate box offset.
The data input module is also used for determining the category and position of the training target. The system further comprises a loss calculation module for calculating, with a loss function, the model's combined prediction error on the two tasks of frame regression and frame classification according to the target detection result and the training target, back-propagating the error gradient, and updating the model parameters for model training; with continued iteration, the prediction error of the model keeps decreasing until convergence, yielding a deployable model.
Wherein the deformable feature extractor module comprises: the first deformable feature extractor is used for extracting image features of the visible light image to obtain a visible light image feature map; the second deformable feature extractor is used for extracting image features of the infrared thermal image to obtain an infrared thermal image feature map; the visible light image characteristic diagram and the infrared thermal image characteristic diagram have the same size.
Wherein the deformable convolution formula is:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$$

and the conventional convolution operation formula is:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k)$$

where x represents the input feature map, y represents the output feature map, p is the pixel position (w_0, h_0) currently being calculated, k denotes the position index within the convolution range, p_k is the position offset relative to p, w_k denotes the weight corresponding to position k, Δp_k represents the additional learned position offset of point k in the convolution, and Δm_k represents the additional learned weight of point k in the convolution.
Wherein the first deformable feature extractor and the second deformable feature extractor either learn w_k, Δp_k and Δm_k independently of each other, or learn w_k independently while sharing the learning of Δp_k and Δm_k.
Wherein the candidate box extraction network comprises: the first candidate frame extraction network is used for connecting the first deformable feature extractor and extracting visible light image candidate frames with objects in the visible light image feature map; and the second candidate frame extraction network is connected with the second deformable feature extractor and is used for extracting the infrared thermal image candidate frames of the object in the infrared thermal image feature map.
The candidate frame complementation module is specifically used for adding the part, which is not covered by the infrared thermal image candidate frame, of the visible light image candidate frame into the infrared thermal image candidate frame, adding the part, which is not covered by the visible light image candidate frame, of the infrared thermal image candidate frame into the visible light image candidate frame, extracting the region features with different sizes at the corresponding positions on the initial feature map according to the selected candidate frame, and unifying the region features to the same size through the region pooling layer to obtain the visible light image region feature map and the infrared thermal image region feature map with the same size.
The cross-modal attention fusion module is specifically used for reducing the dimensions of the infrared thermal image region feature map and the visible light image region feature map through independent convolutions, calculating the pairwise similarity between the region features of the infrared thermal image region feature map and those of the visible light image region feature map to obtain a relation matrix, normalizing the similarities, and multiplying the (convolved) features of the visible light image region feature map by the relation matrix to output bimodal complementary enhanced region features, i.e., the thermal image features enhanced by the color image features.
Wherein the cross-modal attention fusion module further adds or merges the thermal image features enhanced by the color image features with the infrared thermal image region feature map; the classification and regression module is further configured to perform convolution calculation on the feature map obtained by this addition or merging and on the visible light image region feature map to obtain a target detection result, where the target detection result includes: the category of each region and the candidate frame offsets.
Therefore, the multispectral target detection blind guiding system provided by the invention combines the visible light color image and the infrared thermal image to construct an all-weather, end-to-end multimodal/multispectral target detection blind guiding system, and solves the problem that existing blind guiding systems do not work, or work poorly, in scenes with no illumination, low illumination, or overly strong illumination.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a multispectral target detection blind guiding system according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a multispectral target detection blind guiding system according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a cross-attention fusion module according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The core of the invention is that:
existing multi-spectral/multi-modal object detection systems commonly assume that the color and thermal images for the same scene are perfectly aligned, but in reality this is not the case and the images of both modalities tend to be offset in position. Such false assumptions will cause the detection system to be erroneous or even fail. At present, the fusion of multi-modal data is carried out in a pixel-by-pixel mode, so that the alignment robustness is reduced, and the effectiveness of complementary feature fusion is influenced. The present invention aims to propose a new solution to the above problems.
In the invention, the situation that different modal images have position offset is considered in network design, on one hand, the network implicitly learns the alignment relation of the two modal images, thereby avoiding errors possibly occurring in the conventional system; and on the other hand, a region of interest (ROI) level feature fusion module is introduced to further improve the robustness to the misalignment problem. In addition, the scheme does not need additional labels, and the cost is saved.
It should be noted that the fusion module is the core of a multispectral target detection system, because it determines how the system uses the information of multimodal images to improve prediction performance. In conventional systems, no matter where the fusion module is placed, the fusion mode is very naive, such as addition, merging (concat) or weighting of corresponding positions; these modes do not effectively exploit the complementary information of different modalities, which limits the generalization capability of the model in complex real scenes. For the feature fusion part, the invention provides a candidate frame complementation module and a cross-modal attention module, so that the relevant information of the two modalities is used more fully, the correlation of the two modalities is modeled more comprehensively, and the information exchange between the features of the two modalities in the network is promoted purposefully; this intercommunication improves the precision and generalization capability of the system.
Fig. 1 shows a schematic structural diagram of a multispectral target detection blind guiding system provided by an embodiment of the present invention, and referring to fig. 1, the multispectral target detection blind guiding system provided by the embodiment of the present invention includes:
the data input module is used for acquiring a visible light image and an infrared thermal image;
the deformable feature extractor module is used for extracting image features of the visible light image and the infrared thermal image respectively using deformable convolution, and outputting a visible light image feature map and an infrared thermal image feature map;
the candidate frame extraction network is used for extracting candidate frames of the target object from the visible light image feature map and the infrared thermal image feature map, to obtain visible light image candidate frames and infrared thermal image candidate frames;
the candidate frame complementation module is used for adding the part of the visible light image candidate frames not covered by the infrared thermal image candidate frames to the infrared thermal image candidate frames, and the part of the infrared thermal image candidate frames not covered by the visible light image candidate frames to the visible light image candidate frames, to obtain a visible light image region feature map and an infrared thermal image region feature map;
the cross-modal attention fusion module is used for taking the infrared thermal image region feature map as the query vectors and the visible light image region feature map as the key and value vectors and, following the self-attention mechanism, fusing the visible light image region feature map into the infrared thermal image region feature map according to the similarity relations among the region features, to obtain thermal image features enhanced by the color image features;
the classification and regression module is used for performing convolution calculation on the thermal image features enhanced by the color image features and the visible light image region feature map to obtain a target detection result, wherein the target detection result comprises: the category of each region and the candidate frame offsets.
Therefore, the invention provides an all-weather, end-to-end multimodal/multispectral target detection blind guiding system combining a visible light color image and an infrared thermal image. The invention discloses an end-to-end target detection system that needs no position-offset supervision information and aggregates the information of the two modalities through a self-attention module at the region feature level. The model in the invention is based on a two-stage detection algorithm, namely Faster-RCNN: the feature extraction and region proposal network (RPN) stage has two independent branches that extract region features from the visible light image and the infrared image respectively, then the region features of the two branches are fused through a candidate frame complementation module and a cross-modal self-attention module, and finally the region category and coordinates are predicted. In a specific embodiment, a general two-stage detection model such as FPN or RFCN can also be used as the base model, which is not limited to Faster-RCNN.
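For illustration only, the way the modules described above could be wired together is sketched below in PyTorch (an assumption; the class and argument names are hypothetical placeholders, not the patent's actual implementation):

```python
import torch
from torch import nn


class MultispectralDetector(nn.Module):
    """Illustrative two-branch detector skeleton (hypothetical names)."""

    def __init__(self, rgb_backbone, thermal_backbone, rgb_rpn, thermal_rpn,
                 box_complement, cross_modal_fusion, head):
        super().__init__()
        self.rgb_backbone = rgb_backbone          # deformable feature extractor, visible light
        self.thermal_backbone = thermal_backbone  # deformable feature extractor, infrared
        self.rgb_rpn = rgb_rpn                    # candidate frame extraction network, visible light
        self.thermal_rpn = thermal_rpn            # candidate frame extraction network, infrared
        self.box_complement = box_complement      # candidate frame complementation module
        self.cross_modal_fusion = cross_modal_fusion  # cross-modal attention fusion module
        self.head = head                          # classification and regression module

    def forward(self, rgb_img, thermal_img):
        # 1. Independent deformable feature extraction for each modality
        f_rgb = self.rgb_backbone(rgb_img)           # N x C x H' x W'
        f_th = self.thermal_backbone(thermal_img)    # N x C x H' x W'
        # 2. Per-modality candidate boxes
        boxes_rgb = self.rgb_rpn(f_rgb)
        boxes_th = self.thermal_rpn(f_th)
        # 3. Complement the two candidate sets and pool region features
        roi_rgb, roi_th = self.box_complement(f_rgb, f_th, boxes_rgb, boxes_th)
        # 4. Enhance thermal region features with color region features
        roi_th_enhanced = self.cross_modal_fusion(roi_th, roi_rgb)
        # 5. Classify each region and regress the candidate-frame offsets
        return self.head(roi_th_enhanced, roi_rgb)
```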
The following describes the multispectral target detection blind guiding system provided by the embodiment of the present invention in detail with reference to fig. 2 and fig. 3:
as an optional implementation manner of the embodiment of the present invention, the data input module is further configured to determine a category and a location of the training target. Therefore, the learning target can be input to the detection network in the training process to carry out model training.
Specifically, the method comprises the following steps:
A data input module: the data input of the neural network comprises two parts, namely the image input and the detection target input. The image input is the bimodal paired images to be detected, i.e., a visible light color image (RGB, three channels) and an infrared thermal image (a gray-scale image; the original data has only one channel). Assuming the input image has height and width H and W, the images fed to the network are [N × 3 × H × W, N × 1 × H × W] (in some common implementations, the channel of the infrared thermal image may also be replicated three times to obtain a three-channel input, i.e., N × 3 × H × W), where N denotes the batch size. The detection target input is the category and position of the annotated target object in the original image, where the position is represented by the coordinates [x1, y1, x2, y2] of the circumscribed rectangular frame of the target object, with [x1, y1] and [x2, y2] being the coordinates of the upper-left and lower-right corners of the circumscribed frame respectively. The detection targets are first annotated manually and are input to the detection network as the learning targets during training for model training.
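A minimal sketch of how these inputs might be assembled is given below (PyTorch assumed; the batch size, image size, and variable names are illustrative values, not prescribed by the patent):

```python
import torch

N, H, W = 4, 512, 640  # illustrative batch size and image size

# Paired bimodal images: RGB color image and single-channel infrared thermal image
rgb_images = torch.rand(N, 3, H, W)      # N x 3 x H x W
thermal_images = torch.rand(N, 1, H, W)  # N x 1 x H x W

# Optionally replicate the thermal channel three times to obtain N x 3 x H x W
thermal_images_3ch = thermal_images.repeat(1, 3, 1, 1)

# Detection targets: per image, circumscribed boxes [x1, y1, x2, y2] and class labels
targets = [
    {
        "boxes": torch.tensor([[48.0, 120.0, 96.0, 300.0]]),  # upper-left / lower-right corners
        "labels": torch.tensor([1]),                           # manually annotated category
    }
    for _ in range(N)
]
```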
As an optional implementation of the embodiment of the invention, the deformable feature extractor module comprises: the first deformable feature extractor is used for extracting image features of the visible light image to obtain a visible light image feature map; the second deformable feature extractor is used for extracting image features of the infrared thermal image to obtain an infrared thermal image feature map; the size of the visible light image characteristic diagram is the same as that of the infrared thermal image characteristic diagram.
As an optional implementation of the embodiment of the present invention, the deformable convolution formula is:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$$

where $y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k)$ is the general convolution operation formula, x represents the input feature map, y represents the output feature map, p is the pixel position (w_0, h_0) currently being calculated, k denotes the position index within the convolution range, p_k is the position offset relative to p, w_k represents the weight corresponding to position k, Δp_k represents the additional learned position offset of point k in the convolution, and Δm_k represents the additional learned weight of point k in the convolution.

As an optional implementation of the embodiments of the invention, the first deformable feature extractor and the second deformable feature extractor either learn w_k, Δp_k and Δm_k independently of each other, or learn w_k independently while sharing the learning of Δp_k and Δm_k.
Specifically, the method comprises the following steps:
A deformable feature extractor module: because the image inputs of the two modalities have a position offset, the invention uses deformable convolution in the feature extraction module to allow the two branch networks to implicitly align at the feature level while independently extracting image features. In current mainstream feature extractors, such as Resnet50, convolution uses kernels with a relatively fixed geometry, such as square 3 × 3 or 7 × 7 kernels, whose capability for modeling geometric transformations is inherently limited. Deformable convolution adds a displacement variable on top of conventional convolution; the displacement is learned automatically during model training, so that the receptive field of the convolution after displacement is no longer a square but becomes an arbitrary polygon, depending on the training data. Given the position offset present in the multimodal images, the two feature extractors can automatically adjust their respective deformable convolutions through training, so that the extracted features are aligned at the level of the convolution receptive field. Meanwhile, the method needs no additional supervision information to calibrate the images of the two modalities, which saves cost and is easy to implement. Taking the two modality images from the data input module as input, the feature maps finally output by the two feature extractors have size N × C × H′ × W′, where H′ = H/16, W′ = W/16, and C denotes the feature dimension or channel number, such as 512 or 2048.
The conventional convolution operation is given by equation (1):

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k) \qquad (1)$$

where x represents the input feature map, y represents the output feature map, p is the pixel position (w_0, h_0) currently being calculated, k denotes the position index within the convolution range (for example, k ranges over 9 positions in a 3 × 3 convolution), p_k is the position offset relative to p, and w_k, the weight at position k, is a learnable parameter. The deformable convolution is given by equation (2), which adds two learnable parameters, Δp_k and Δm_k, to equation (1):

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k \qquad (2)$$

Δp_k represents the additional learned position offset of point k in the convolution, and Δm_k represents the additional learned weight of point k in the convolution. As a supplementary explanation, the invention provides two implementation forms of the deformable convolution network as implementation cases. In the first, the feature extraction networks for the visible light image and the thermal infrared image are completely independent of each other, and the deformation parameters of the deformable convolution are learned by the respective feature extraction networks, i.e., the learnable parameters of the deformable convolution, w_k, Δp_k, Δm_k, are all independent in the two branch networks. In the second, the feature extraction networks for the visible light image and the thermal infrared image are only partially independent: the main feature extraction calculations are independent, i.e., the w_k are independent of each other, but the deformation parameters Δp_k and Δm_k of the deformable convolution are shared; specifically, these two deformation parameters are learned by taking the fusion of the features of the two modalities as input.
As an optional implementation manner of the embodiment of the present invention, the candidate box extraction network includes: the first candidate frame extraction network is used for connecting the first deformable feature extractor and extracting visible light image candidate frames with objects in the visible light image feature map; and the second candidate frame extraction network is connected with the second deformable feature extractor and is used for extracting the infrared thermal image candidate frames of the object in the infrared thermal image feature map.
Specifically, the method comprises the following steps:
Candidate frame extraction network: the candidate frame extraction network module aims to extract candidate frames containing objects, i.e., predictions of the true circumscribed rectangular frame of a target object, taking the feature map output by a feature extractor as input and regardless of which category the object belongs to. Specifically, k anchor frames (anchors) of different sizes are generated for each pixel position in the feature map; the feature map within the k anchor frames is then fed to the extraction network, which predicts the probability that each anchor frame contains an object, i.e., k × 2 classification results, and the offset of the anchor frame relative to the real position of the object, i.e., k × 4 regression results. For a feature map of size N × C × H′ × W′, the extraction network outputs N × H′ × W′ × k × 2 classification results and N × H′ × W′ × k × 4 regression results. Finally, through non-maximum suppression, the M (usually M = 1024) candidate frames most likely to contain an object are selected and stored in a matrix of size M × 4. The two branch networks each output M relatively independent candidate frames based on their respective feature maps.
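A minimal sketch of such a candidate-frame head, producing k × 2 objectness scores and k × 4 offsets per feature-map position, might look as follows (an assumption in PyTorch; a full RPN as in Faster-RCNN also needs anchor generation, proposal decoding, and NMS around it):

```python
import torch
from torch import nn


class RPNHead(nn.Module):
    """Predicts, for each of the k anchors at every position, objectness and box offsets."""

    def __init__(self, in_ch, k=9):
        super().__init__()
        self.shared = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(in_ch, k * 2, kernel_size=1)  # object / no-object per anchor
        self.reg = nn.Conv2d(in_ch, k * 4, kernel_size=1)  # offsets relative to each anchor

    def forward(self, feat):                 # feat: N x C x H' x W'
        h = torch.relu(self.shared(feat))
        return self.cls(h), self.reg(h)      # N x 2k x H' x W',  N x 4k x H' x W'


# After scoring, non-maximum suppression keeps the M (e.g. 1024) most likely boxes,
# for example with torchvision.ops.nms(boxes, scores, iou_threshold).
```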
As an optional implementation manner of the embodiment of the present invention, the candidate frame complementation module is specifically configured to add a portion, which is not covered by the infrared thermal image candidate frame, of the visible light image candidate frame to the infrared thermal image candidate frame, add a portion, which is not covered by the visible light image candidate frame, of the infrared thermal image candidate frame to the visible light image candidate frame, extract, according to the selected candidate frame, region features with different sizes at corresponding positions on the initial feature map, and unify, through the region pooling layer, the region features to the same size, so as to obtain the visible light image region feature map and the infrared thermal image region feature map with the same size.
Specifically, the method comprises the following steps:
Candidate frame complementation module: this module is intended to fuse the candidate frames extracted by the two branch networks so as to make the two modalities complement each other at the level of target object positions. When lighting conditions are poor, the candidate frames extracted by the color image branch are likely to miss objects, while those extracted by the thermal image branch are relatively stable; conversely, there are also cases where the thermal image branch misses objects that the color image branch can detect, such as a pole with low temperature on a cloudy day. The module uses the two modalities to obtain a more complete set of candidate frames. Specifically, the module takes the candidate frames of the two modalities obtained in the previous stage as input; for each modality, the m candidate frames whose IoU with the candidate frames of the other modality is smaller than a threshold p ∈ [0.5, 0.8] are added to the other modality's candidate frames, increasing the number of candidate frames in each modality to M′ = M + m, where the value of the threshold p may vary slightly across embodiments. According to the selected candidate frames, region features of different sizes are extracted at the corresponding positions of the initial feature map and then unified to the same size L × L (for example, L = 7) through a region pooling layer (ROI pooling). Finally, the module outputs M′ region feature maps for each of the two modalities, with a total size of 2 × N × M′ × C × L × L.
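The complementation rule above (a box from one modality is copied over when its IoU with the boxes of the other modality falls below the threshold p) and the subsequent ROI pooling could be sketched as follows; torchvision's box_iou and roi_align are assumed, and the threshold, pooling size, and stride are illustrative:

```python
import torch
from torchvision.ops import box_iou, roi_align


def complement_boxes(boxes_a, boxes_b, p=0.5):
    """Return boxes_b extended with the boxes of boxes_a that boxes_b does not cover."""
    if boxes_a.numel() == 0:
        return boxes_b
    if boxes_b.numel() == 0:
        return torch.cat([boxes_b, boxes_a], dim=0)
    iou = box_iou(boxes_a, boxes_b)           # IoU of every box in A against every box in B
    uncovered = iou.max(dim=1).values < p     # boxes of A with no close counterpart in B
    return torch.cat([boxes_b, boxes_a[uncovered]], dim=0)   # M' = M + m boxes


def pool_region_features(feature_map, boxes, L=7, spatial_scale=1.0 / 16):
    """ROI pooling of the selected boxes to a uniform L x L size (batch size N = 1)."""
    rois = torch.cat([boxes.new_zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index
    return roi_align(feature_map, rois, output_size=(L, L), spatial_scale=spatial_scale)


# Bidirectional complementation:
# boxes_th_full  = complement_boxes(boxes_rgb, boxes_th, p)   # add uncovered color boxes
# boxes_rgb_full = complement_boxes(boxes_th, boxes_rgb, p)   # add uncovered thermal boxes
```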
As an optional implementation manner of the embodiment of the present invention, the cross-modal attention fusion module is specifically configured to reduce the dimensions of the infrared thermal image region feature map and the visible light image region feature map through independent convolutions, calculate the pairwise similarity between each region feature in the infrared thermal image region feature map and each region feature in the visible light image region feature map to obtain a relation matrix, weight-normalize the similarities, convolve the features in the visible light image region feature map and multiply them with the relation matrix, and output bimodal complementary enhanced region features, obtaining the thermal image features enhanced by the color image features.
Specifically, the method comprises the following steps:
A cross-modal attention fusion module: for better feature fusion, the invention introduces a bidirectional feature enhancement module. Because different modalities image the environment through different mechanisms, point-to-point feature fusion hardly provides substantial help. For example, pedestrians under dark lighting conditions are mostly dark and invisible in color images (the whole upper half of the body), with only a small part illuminated (below the lower legs), so the color image branch network cannot extract discriminative features from those locations, and fusing them into the thermal image features point by point according to position coordinates provides little complementary or enhancing benefit. Fusion at the region level is a comparatively more effective approach. Specifically, with the help of the candidate frame complementation module, the color image branch can also obtain the candidate frame of the mostly invisible pedestrian in the above example; the corresponding region feature contains the information of the small illuminated part of the lower legs, and a feature that covers only part of an object can be fused into the thermal image feature as useful supplementary information to enhance the latter's representation capability. Therefore, the module adopts a strategy of fusion at the region feature level.
In addition, different objects in a scene often have a certain dependency relationship, and the relationship among the objects can help to improve the representation capability of the model, so that the prediction accuracy is improved. For example, when one person appears on a zebra crossing, there are often others (because of the green light); the bench or trash can in the park is typically placed outside the lawn, rather than on the lawn. Therefore, the module also models the relationship between the regions/potential objects in the process of fusing the characteristics of the two modal regions, and uses the relationship between the objects to enhance the expression of the region characteristics.
Specifically, the module takes the infrared thermal image region features and the visible light color image region features as input; it uses the infrared thermal image region features as the query vectors and the visible light color image region features as the key and value vectors and, following the self-attention mechanism, fuses the visible light color image region features into the infrared thermal image region features according to the similarity relations among the region features, thereby enhancing the infrared features with the visible light features, i.e., achieving dual-light feature complementation, as shown in fig. 2. For convenience of description, take N = 1. The thermal image region features and the color image region features are each passed through an independent L × L convolution for the same dimensionality reduction, producing a query and a key of size M′ × C × 1 × 1. The pairwise similarity between the region features of the two modalities is then calculated to obtain a relation matrix of size M′ × M′; the similarity can be computed as the negative Euclidean distance, a matrix (dot) product, or another measure, and the similarities are then weight-normalized (this embodiment uses a row-wise softmax operation by default). Meanwhile, the color image region features are transformed by a 1 × 1 convolution into values of size M′ × C × L × L. Finally, matrix multiplication with the relation matrix outputs the bimodal complementary enhanced region features of size M′ × C × L × L, i.e., the thermal image features enhanced by the color image features at the region feature level.
The module models the relations between candidate regions rather than the relations between all pixel points as in the original method, so its computational complexity is much smaller (the former is O(M′ × M′) ≈ 10^6, the latter O(H × W × H × W) ≈ 10^8) and its computation is efficient.
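Under the assumptions above (N = 1, M′ region features per modality, each of size C × L × L), the cross-modal attention fusion could be sketched as follows in PyTorch (an assumption; the module name and the dot-product similarity choice are illustrative):

```python
import torch
from torch import nn


class CrossModalAttentionFusion(nn.Module):
    """Enhance thermal ROI features (query) with color ROI features (key/value)."""

    def __init__(self, channels, L=7):
        super().__init__()
        self.q_proj = nn.Conv2d(channels, channels, kernel_size=L)  # L x L conv -> 1 x 1
        self.k_proj = nn.Conv2d(channels, channels, kernel_size=L)  # L x L conv -> 1 x 1
        self.v_proj = nn.Conv2d(channels, channels, kernel_size=1)  # keeps L x L resolution

    def forward(self, roi_thermal, roi_color):
        # roi_thermal, roi_color: M' x C x L x L  (batch size N = 1 for brevity)
        m, c, L, _ = roi_color.shape
        q = self.q_proj(roi_thermal).flatten(1)       # M' x C       (query)
        k = self.k_proj(roi_color).flatten(1)         # M' x C       (key)
        relation = torch.softmax(q @ k.t(), dim=1)    # M' x M'      row-wise normalized similarity
        v = self.v_proj(roi_color).flatten(1)         # M' x (C*L*L) (value)
        enhanced = (relation @ v).view(m, c, L, L)    # M' x C x L x L
        # The enhanced features may then be added to (or concatenated with) roi_thermal.
        return enhanced + roi_thermal
```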
As an optional implementation manner of the embodiment of the present invention, the cross-modal attention fusion module further adds or merges the thermal image features enhanced by the color image features with the infrared thermal image region feature map; the classification and regression module is further configured to perform convolution calculation on the feature map obtained by this addition or merging and on the visible light image region feature map to obtain the target detection result, where the target detection result includes: the category of each region and the candidate frame offsets. That is, the output of the cross-modal attention fusion module may be added to or merged (concatenated) with the thermal image region features.
As an optional implementation manner of the embodiment of the present invention, the system further includes a loss calculation module for calculating, with a loss function, the model's combined prediction error on the two tasks of frame regression and frame classification according to the target detection result and the training target, back-propagating the error gradient, and updating the model parameters for model training; with continued iteration, the prediction error of the model keeps decreasing until convergence, yielding a deployable model.
Specifically, the method comprises the following steps:
A loss calculation module: this module takes the prediction results of the whole model and the corresponding training targets as input, uses a conventional loss function to calculate the model's combined prediction error on the two tasks of frame regression and frame classification, then back-propagates the error gradient and updates the parameters of the whole model at a certain learning rate to realize model training. With repeated iteration, the prediction error of the model keeps decreasing until convergence, finally yielding a model that can be applied and deployed. During training, mixed precision training can be used for the network parameters to reduce video memory usage and accelerate training.
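A training-step sketch matching this description (combined classification and regression losses, gradient back-propagation, and optional mixed precision via torch.cuda.amp) is given below; the specific loss functions and the model's output format are assumptions, not the patent's prescribed choices:

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast


def train_step(model, optimizer, scaler, rgb, thermal, targets):
    """One iteration: forward, combined frame-classification + frame-regression loss, update."""
    optimizer.zero_grad()
    with autocast():                                   # mixed precision training
        cls_logits, box_deltas = model(rgb, thermal)
        cls_loss = nn.functional.cross_entropy(cls_logits, targets["labels"])
        reg_loss = nn.functional.smooth_l1_loss(box_deltas, targets["box_deltas"])
        loss = cls_loss + reg_loss                     # combined prediction error
    scaler.scale(loss).backward()                      # back-propagate the error gradient
    scaler.step(optimizer)                             # update the model parameters
    scaler.update()
    return loss.item()


# Iterate until the prediction error converges, e.g.:
# scaler = GradScaler()
# for rgb, thermal, targets in dataloader:
#     train_step(model, optimizer, scaler, rgb, thermal, targets)
```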
Therefore, the multispectral target detection blind guiding system provided by the embodiment of the invention uses multispectral images to build an all-weather blind guiding system based on a target detection network; it applies deformable convolution in the feature extractors to implicitly learn the alignment relation of the multispectral images and cope with possible position offsets, and it provides a candidate frame complementation module and a cross-modal attention module that make full use of the complementary information of the multispectral images, thereby achieving more accurate all-weather target detection while further improving robustness to feature misalignment. The system thus supports all-weather blind guiding, implicitly learns the multispectral image alignment relation without extra annotation (saving cost), and, through the candidate frame complementation module and the cross-modal attention module, exploits the complementary information of the multispectral images more fully, making the fusion more effective and the results better. In addition, the cross-modal attention module has low computational complexity and high efficiency.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (9)

1. A multi-spectral target detection blind guide system, comprising:
the data input module is used for acquiring a visible light image and an infrared thermal image;
the deformable feature extractor module is used for respectively extracting the image features of the visible light image and the infrared thermal image by adopting deformable convolution and outputting a visible light image feature map and an infrared thermal image feature map;
the candidate frame extraction network is used for extracting candidate frames of the target object according to the visible light image characteristic diagram and the infrared thermal image characteristic diagram to obtain a visible light image candidate frame and an infrared thermal image candidate frame;
a candidate frame complementation module, configured to add, to the infrared thermal image candidate frames, the part of the visible light image candidate frames that is not covered by the infrared thermal image candidate frames, and to add, to the visible light image candidate frames, the part of the infrared thermal image candidate frames that is not covered by the visible light image candidate frames, so as to obtain a visible light image region feature map and an infrared thermal image region feature map;
a cross-modal attention fusion module, configured to take the infrared thermal image region feature map as the query vectors and the visible light image region feature map as the key and value vectors and, following the self-attention mechanism, fuse the visible light image region feature map into the infrared thermal image region feature map according to the similarity relations among the region features, to obtain thermal image features enhanced by color image features;
a classification and regression module, configured to perform convolution calculation on the thermal image features enhanced by the color image features and the visible light image region feature map to obtain a target detection result, where the target detection result includes: the category of each region and the candidate box offset.
2. The system of claim 1,
the data input module is also used for determining the category and the position of a training target;
the system further comprises:
a loss calculation module, configured to calculate, with a loss function, the model's combined prediction error on the two tasks of frame regression and frame classification according to the target detection result and the training target, back-propagate the error gradient, and update the model parameters to train the model; with continued iteration, the prediction error of the model keeps decreasing until convergence, so as to obtain a model that can be applied and deployed.
3. The system of claim 1, wherein the deformable feature extractor module comprises:
the first deformable feature extractor is used for extracting image features of the visible light image to obtain a visible light image feature map;
the second deformable feature extractor is used for extracting image features of the infrared thermal image to obtain an infrared thermal image feature map;
the visible light image feature map and the infrared thermal image feature map are the same in size.
4. The system of claim 3, wherein the deformable convolution formula is:
$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$$

wherein $y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k)$ is the conventional convolution operation formula, x represents the input feature map, y represents the output feature map, p is the pixel position (w_0, h_0) currently being calculated, k denotes the position index within the convolution range, p_k is the position offset relative to p, w_k represents the weight corresponding to position k, Δp_k represents the additional learned position offset of point k in the convolution, and Δm_k represents the additional learned weight of point k in the convolution.
5. The system of claim 4, wherein the first deformable feature extractor and the second deformable feature extractor each independently learn w_k, Δp_k and Δm_k; or the first and second deformable feature extractors learn w_k independently while sharing the learning of Δp_k and Δm_k.
6. The system of claim 3, wherein the candidate box extraction network comprises:
a first candidate frame extraction network, connected to the first deformable feature extractor, for extracting the visible light image candidate frame in which an object exists in the visible light image feature map;
and the second candidate frame extraction network is connected with the second deformable feature extractor and is used for extracting the infrared thermal image candidate frames of the objects in the infrared thermal image feature map.
7. The system of claim 1, wherein the candidate frame complementation module is specifically configured to add a portion of the visible-light image candidate frame that is not covered by the infrared thermal image candidate frame to the infrared thermal image candidate frame, add a portion of the infrared thermal image candidate frame that is not covered by the visible-light image candidate frame to the visible-light image candidate frame, extract region features with different sizes at corresponding positions on the initial feature map according to the selected candidate frame, and unify the region features to the same size through a region pooling layer to obtain the visible-light image region feature map and the infrared thermal image region feature map with the same size.
8. The system of claim 1,
the cross-modal attention fusion module is specifically configured to reduce the dimensions of the infrared thermal image region feature map and the visible light image region feature map through independent convolutions, calculate the pairwise similarity between the region features of the infrared thermal image region feature map and those of the visible light image region feature map to obtain a relation matrix, weight-normalize the similarities, convolve the features of the visible light image region feature map and matrix-multiply them with the relation matrix, and output bimodal complementary enhanced region features, obtaining the thermal image features enhanced by the color image features.
9. The system of claim 1,
the cross-modal attention fusion module is used for adding or merging the thermal image characteristics enhanced by the color image characteristics and the infrared thermal image area characteristic map;
the classification and regression module is further configured to perform convolution calculation on the feature map obtained by adding or merging the thermal image features enhanced by the color image features with the infrared thermal image region feature map, and on the visible light image region feature map, to obtain a target detection result, where the target detection result includes: the category of each region and the candidate frame offsets.
CN202011426982.2A 2020-12-09 2020-12-09 Multispectral target detection blind guiding system Active CN112418163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011426982.2A CN112418163B (en) 2020-12-09 2020-12-09 Multispectral target detection blind guiding system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011426982.2A CN112418163B (en) 2020-12-09 2020-12-09 Multispectral target detection blind guiding system

Publications (2)

Publication Number Publication Date
CN112418163A CN112418163A (en) 2021-02-26
CN112418163B (en) 2022-07-12

Family

ID=74775286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011426982.2A Active CN112418163B (en) 2020-12-09 2020-12-09 Multispectral target detection blind guiding system

Country Status (1)

Country Link
CN (1) CN112418163B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591770B (en) * 2021-08-10 2023-07-18 中国科学院深圳先进技术研究院 Multi-mode fusion obstacle detection method and device based on artificial intelligence blind guiding
CN113688806A (en) * 2021-10-26 2021-11-23 南京智谱科技有限公司 Infrared and visible light image fused multispectral target detection method and system
CN114359776B (en) * 2021-11-25 2024-04-26 国网安徽省电力有限公司检修分公司 Flame detection method and device integrating light and thermal imaging
CN115115919B (en) * 2022-06-24 2023-05-05 国网智能电网研究院有限公司 Power grid equipment thermal defect identification method and device
CN115393684B (en) * 2022-10-27 2023-01-24 松立控股集团股份有限公司 Anti-interference target detection method based on automatic driving scene multi-mode fusion
CN116468928B (en) * 2022-12-29 2023-12-19 长春理工大学 Thermal infrared small target detection method based on visual perception correlator
CN116543378B (en) * 2023-07-05 2023-09-29 杭州海康威视数字技术股份有限公司 Image recognition method and device, electronic equipment and storage medium
CN117078920B (en) * 2023-10-16 2024-01-23 昆明理工大学 Infrared-visible light target detection method based on deformable attention mechanism

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210443A (en) * 2020-01-03 2020-05-29 吉林大学 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111861880A (en) * 2020-06-05 2020-10-30 昆明理工大学 Image super-fusion method based on regional information enhancement and block self-attention

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448035A (en) * 2018-11-14 2019-03-08 重庆邮电大学 Infrared image and visible light image registration method based on deep learning
CN111709902B (en) * 2020-05-21 2023-04-18 江南大学 Infrared and visible light image fusion method based on self-attention mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210443A (en) * 2020-01-03 2020-05-29 吉林大学 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111861880A (en) * 2020-06-05 2020-10-30 昆明理工大学 Image super-fusion method based on regional information enhancement and block self-attention

Also Published As

Publication number Publication date
CN112418163A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112418163B (en) Multispectral target detection blind guiding system
WO2021233029A1 (en) Simultaneous localization and mapping method, device, system and storage medium
CN111259906B (en) Method for generating remote sensing image target segmentation countermeasures under condition containing multilevel channel attention
CN111582201A (en) Lane line detection system based on geometric attention perception
CN110781262B (en) Semantic map construction method based on visual SLAM
CN112950645B (en) Image semantic segmentation method based on multitask deep learning
CN113538401B (en) Crowd counting method and system combining cross-modal information in complex scene
CN114937083B (en) Laser SLAM system and method applied to dynamic environment
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
CN111767854B (en) SLAM loop detection method combined with scene text semantic information
CN114066955A (en) Registration method for registering infrared light image to visible light image
CN115100678A (en) Cross-modal pedestrian re-identification method based on channel recombination and attention mechanism
CN112767478A (en) Appearance guidance-based six-degree-of-freedom pose estimation method
CN113326735A (en) Multi-mode small target detection method based on YOLOv5
CN112132013A (en) Vehicle key point detection method
CN116612288A (en) Multi-scale lightweight real-time semantic segmentation method and system
Yuan et al. Dual attention and dual fusion: An accurate way of image-based geo-localization
JP6992099B2 (en) Information processing device, vehicle, vehicle control method, program, information processing server, information processing method
CN112529011A (en) Target detection method and related device
CN115620106A (en) Visible light and infrared multi-modal image fusion target identification method
CN114708321B (en) Semantic-based camera pose estimation method and system
CN113496194B (en) Information processing device, information processing method, vehicle, information processing server, and recording medium
CN111915599A (en) Flame significance detection method based on boundary perception
CN116503618B (en) Method and device for detecting remarkable target based on multi-mode and multi-stage feature aggregation
CN112733731B (en) Monocular-based multi-modal depth map generation method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant