CN116704273A - Self-adaptive infrared and visible light dual-mode fusion detection method - Google Patents

Self-adaptive infrared and visible light dual-mode fusion detection method Download PDF

Info

Publication number
CN116704273A
CN116704273A CN202310809010.9A
Authority
CN
China
Prior art keywords
target
visible light
infrared
network
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310809010.9A
Other languages
Chinese (zh)
Inventor
徐立新
辛栋
张睿恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202310809010.9A priority Critical patent/CN116704273A/en
Publication of CN116704273A publication Critical patent/CN116704273A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A self-adaptive infrared and visible light dual-mode fusion detection method belongs to the technical field of target detection. A weight matrix trained by a convolutional neural network is used as the feature fusion strategy, so that the fusion proportion of each region of the infrared and visible light pictures is adaptively and precisely adjusted during network training. A correlation weight calculation network structure is adopted: a correlation weight calculation module formed by a multi-layer convolutional network computes weights for different regions of the two input infrared and visible light feature maps, obtaining a weight matrix of the infrared target feature map and a weight matrix of the feature map of the corresponding visible light target. The weight matrices are multiplied with the corresponding positions of the original infrared and visible light feature maps, the products are superposed, and a fused feature map is output; a loss term for training the weight matrix is added, improving the multi-task joint loss function. The method is suitable for the field of target detection, better integrates the different advantage intervals of the multi-modal information sources during fusion, and improves the accuracy and environmental adaptability of target detection.

Description

Self-adaptive infrared and visible light dual-mode fusion detection method
Technical Field
The invention relates to a self-adaptive infrared and visible light dual-mode fusion detection method, and belongs to the technical field of target detection.
Background
As an important branch of computer vision, object detection addresses the problem of identifying and locating objects of interest in images or videos. Object detection technology has become part of everyday life and is widely used in fields such as automatic driving, security monitoring, smart homes, industrial automation and medical diagnosis, greatly promoting productivity. Deep-learning-based object detection methods can be divided into two types according to whether candidate object boxes are generated: two-stage methods and one-stage (single-stage) methods. A two-stage method splits detection into two steps: first, proposal regions (regions of interest) are generated; second, the generated proposal regions are classified and the positions of the proposal boxes are fine-tuned. Examples include the region-based convolutional neural network (Region-based Convolutional Neural Network, R-CNN), Fast R-CNN and Faster R-CNN. A single-stage method does not generate proposal regions explicitly; instead it maps the input image directly to object box locations, using regression to determine the size and position of the object boxes. Examples include the Single Shot MultiBox Detector (SSD) and the You Only Look Once (YOLO) series.
Under complex conditions the optical characteristics of the object to be detected vary. Around the clock, the visible light image of the object is easily affected by weather, time of day and other environmental factors, which lowers image quality, and a traditional target detection system based on single-modality visible light images is therefore easily limited by image quality, causing false detections and missed detections. To solve these problems, many researchers have proposed using multi-modal images as training data. Multi-modal images, such as infrared images and visible light images, have complementary advantages. The infrared image relies on the heat emitted by the target object and is not affected by illumination conditions, but it cannot capture the detailed information of the target. The visible light image clearly captures the texture features and detailed information of the target, but is easily affected by illumination conditions. Research on multi-modal target detection has therefore become a current research hotspot. Krotosky et al. used HOG as the feature extractor to extract and cascade-fuse features from the input dual-mode images, then used an SVM for classification to obtain detection results; Fayez Lahoud proposed a real-time feature fusion method that divides images into different levels for fusion; Xu Ningwen et al. used convolutional neural networks to perform feature-level fusion of infrared and visible light images on a small sample set. However, in these methods the weights are adjusted manually to the feature distribution and cannot adapt to changing environments; moreover, the features are adjusted globally, so the fusion of multi-modal features is insufficient and the complementary advantages among the multi-modal feature information still need to be exploited further.
Disclosure of Invention
Aiming at the problems that existing target detection methods cannot adapt to changing environments and that multi-modal features are not fully fused, the main purpose of the invention is to disclose a self-adaptive infrared and visible light dual-mode fusion detection method. A weight matrix trained by a convolutional neural network is used as the fusion strategy for feature fusion, so that the fusion proportion of each part of the infrared and visible light pictures is adaptively and precisely adjusted during network training. A correlation weight calculation network structure is adopted: a correlation weight calculation module formed by a multi-layer convolutional network computes weights for different regions of the two input infrared and visible light feature maps, obtaining a weight matrix of the infrared target feature map and a weight matrix of the feature map of the corresponding visible light target. The weight matrices are then multiplied with the corresponding positions of the original infrared and visible light feature maps and the products are superposed to output a fused feature map; a loss term for training the weight matrix is added, improving the multi-task joint loss function. The aim of better integrating the different advantage intervals of the multi-modal information sources during fusion is thereby achieved, and the accuracy and environmental adaptability of target detection are improved.
The invention aims at realizing the following technical scheme:
the invention discloses a self-adaptive infrared and visible light dual-mode fusion detection method, which specifically comprises the following steps:
Step 1, respectively extracting visible light image features and infrared image features by using two paths of trunk feature extraction networks.
Registered infrared and visible light images are input, and the visible light image features and infrared image features are extracted by the two backbone feature extraction networks respectively. The visible light image contains three channels of information and the infrared image contains a single channel of information, so four channels of information are extracted in total.
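By way of illustration, a minimal PyTorch sketch of the two-path backbone is given below. It assumes a torchvision ResNet-50 trunk (matching the five-stage residual structure described later in the embodiment) and simply swaps the first convolution of the infrared branch to accept a single channel; all module and variable names are illustrative and are not taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def make_backbone(in_channels: int) -> nn.Sequential:
    """Return the five-stage convolutional trunk of ResNet-50,
    with the first convolution adapted to the given channel count."""
    net = resnet50(weights=None)
    if in_channels != 3:
        # Infrared branch: single-channel input instead of RGB.
        net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                              stride=2, padding=3, bias=False)
    # Keep conv1..layer4 (stages 1-5) and drop the classification head.
    return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                         net.layer1, net.layer2, net.layer3, net.layer4)

rgb_backbone = make_backbone(3)   # visible light branch
ir_backbone = make_backbone(1)    # infrared branch

rgb = torch.randn(1, 3, 512, 640)   # registered visible light image
ir = torch.randn(1, 1, 512, 640)    # registered infrared image
f_rgb, f_ir = rgb_backbone(rgb), ir_backbone(ir)
print(f_rgb.shape, f_ir.shape)      # both (1, 2048, 16, 20), i.e. H/32 x W/32
```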
Step 2, calculating a characteristic weight matrix through a related weight calculation network to obtain a fusion characteristic diagram, and adaptively carrying out characteristic fusion;
In step 2, a correlation weight calculation network formed by a multi-layer convolutional network performs weight calculation on different regions of the visible light feature map and the infrared feature map obtained in step 1, to obtain a weight feature map W_IR of the infrared target feature map and a weight feature map W_RGB of the feature map of the corresponding visible light target. After the weight feature maps are obtained, they are multiplied with the corresponding positions of the original infrared and visible light feature maps, and the products are superposed to output the fused feature map f_m(w,h). The whole process is expressed as follows:
(W_IR, W_RGB) = f_fus(F_IR, F_RGB)
f_m(w,h) = W_IR ⊙ F_IR + W_RGB ⊙ F_RGB
wherein f_fus is the transformation performed by the convolutional network; W_IR is the weight feature map of the infrared target feature map; W_RGB is the weight feature map of the feature map of the corresponding visible light target; F_IR and F_RGB are the input infrared and visible light feature maps.
The fused feature map f_m(w,h) has the same size as the input infrared and visible light features F_IR and F_RGB. The core of the feature fusion network is the correlation weight calculation network, which is composed of multiple convolutional layers. The input infrared and visible light feature maps have size m × w × h, where m denotes the number of channels and w and h denote the width and height of the feature map. The correlation weight calculation network first performs the channel cascade operation on the input infrared feature map F_IR and the input visible light feature map F_RGB:
Z_concat = K_i * concat(F_IR, F_RGB)
where * denotes the convolution operation, K_i is the convolution kernel applied after the channel cascade, and Z_concat is the feature map obtained after the channel cascade operation, with size (m_1 + m_2) × w × h. After Z_concat is obtained, it is passed through several convolutional layers, each convolution followed by a batch normalization operation, to obtain a response map Z. Finally, compression and normalization are performed on Z along the channel dimension using a softmax function:
(W_IR, W_RGB) = softmax(Z)
yielding the weight feature map W_IR of the infrared target feature map and the weight feature map W_RGB of the feature map of the corresponding visible light target. Both have size w × h, and for each pair of elements [ω_IR, ω_RGB] at the same position in the two matrices, ω_IR + ω_RGB = 1.
Step 3, performing feature classification and bounding box regression by using the multi-task joint loss function;
and (3) after the fusion feature map is obtained based on the step (2), generating a region suggestion frame to preliminarily divide the feature map region, determining a suggestion frame containing the target by using a special classification module, and adjusting the position of the suggestion frame by using a bounding box regression to enable the position to be close to the position of a real target frame.
The feature classification module uses a flexible maximization function (softmax) as a classifier to calculate a probability value of a target contained within an initial detection boxDividing the detection frame into two types including a target and a target not, and thus preliminarily obtaining a candidate region including the target. Classification loss L cls (p,i)=-logP i
The center point coordinates and the width and height of a bounding box containing an object form a four-dimensional vector A = (A_x, A_y, A_w, A_h). The bounding box regression module learns a mapping F such that the regression box R = (R_x, R_y, R_w, R_h) approximates the ground-truth box G = (G_x, G_y, G_w, G_h) as closely as possible. The mapping relationship is:
F(A_x, A_y, A_w, A_h) = (R_x, R_y, R_w, R_h)
(R_x, R_y, R_w, R_h) ≈ (G_x, G_y, G_w, G_h)
The regression module is trained to learn parameters W_*^T; given the initial target bounding box parameters φ(A), the predicted value d_*(A) of the regression box is obtained:
d_*(A) = W_*^T · φ(A)
The finally learned parameter value Ŵ_* is:
Ŵ_* = argmin_{W_*} Σ_i (t_*^i − W_*^T · φ(A^i))² + λ‖W_*‖²
where the argmin function denotes the variable value at which the objective function is minimized, λ is a proportionality coefficient distributing the proportion of each branch, and t_* represents the true value.
The loss function of the regression module is the smooth L1 loss L_reg.
After training, the regression module outputs the translation and scaling of each anchor box relative to the ground-truth box and corrects the position of the initial target box. All region proposal boxes containing targets are obtained through the feature classification and bounding box regression operations, completing the target detection task so that the subsequent network can perform target recognition. The region proposal boxes of different scales and the original infrared-visible fused features are input into a pooling layer, which adjusts the proposal boxes of different sizes to a fixed W_p × H_p size and produces a fixed-length output.
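The anchor-to-ground-truth mapping and the smooth L1 regression loss can be sketched as follows. The (tx, ty, tw, th) parameterisation is the standard Faster R-CNN encoding, assumed here because the patent does not spell out the exact transform; boxes are given in center (x, y, w, h) form.

```python
import torch

def encode_targets(anchors: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Regression targets t* for anchors A = (x, y, w, h) against ground truth G."""
    tx = (gt[:, 0] - anchors[:, 0]) / anchors[:, 2]
    ty = (gt[:, 1] - anchors[:, 1]) / anchors[:, 3]
    tw = torch.log(gt[:, 2] / anchors[:, 2])
    th = torch.log(gt[:, 3] / anchors[:, 3])
    return torch.stack([tx, ty, tw, th], dim=1)

def smooth_l1(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Smooth L1 loss L_reg used by the bounding box regression branch."""
    diff = (pred - target).abs()
    loss = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.sum(dim=1).mean()

anchors = torch.tensor([[100., 100., 50., 80.]])
gt      = torch.tensor([[110., 105., 60., 90.]])
t_star  = encode_targets(anchors, gt)
print(smooth_l1(torch.zeros_like(t_star), t_star))  # loss of an untrained prediction
```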
Step 4, calculating specific categories of targets in the suggestion frame in the fusion feature map by using a fully-connected network, outputting prediction confidence, and adjusting the position of the suggestion frame by using bounding box regression again;
the fully-connected network is divided into a target classification module and a bounding box regression module based on self-adaptive weight distribution. The fusion feature images of different layers are aligned through the region of interest, interpolated and extracted into fusion feature images with uniform size, sequentially pass through two full-connection layers, and then are divided into two branches; each branch still passes through a full connection layer, and then is respectively input into a target classification module and a bounding box regression module. Wherein the activation function of each full connection layer is a ReLU function. The height and width of each feature map is H and W, B is the batch size, and C is the category number.
Region-of-interest alignment is used to aggregate region features; a bilinear interpolation method replaces the quantization operation, converting the stepwise-quantized feature aggregation into a continuous operation. The procedure traverses each candidate box, keeps the boundary coordinate points unchanged, subdivides the candidate region into K × K units, computes the values at four fixed coordinate positions in each unit by bilinear interpolation, and finally performs max pooling. The error back-propagation formula of region-of-interest alignment is:
where d(·,·) denotes the distance between two points, and Δh and Δw are the differences between the horizontal and vertical coordinates of a feature map pixel and the floating-point sampling point used during forward propagation; they serve as bilinear interpolation coefficients on the original gradient.
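Region-of-interest alignment with bilinear sampling is available as a library operator; the sketch below shows a proposal box being pooled from the fused feature map, assuming torchvision's roi_align and its (batch_index, x1, y1, x2, y2) box format. The feature-map stride and the box coordinates are illustrative.

```python
import torch
from torchvision.ops import roi_align

fused = torch.randn(1, 2048, 16, 20)                 # fused feature map, stride 32
boxes = torch.tensor([[0., 64., 32., 320., 288.]])   # one proposal in image coordinates
# Bilinear sampling (no quantisation), K x K output, here 7 x 7 as in the embodiment.
pooled = roi_align(fused, boxes, output_size=(7, 7),
                   spatial_scale=1.0 / 32, sampling_ratio=2, aligned=True)
print(pooled.shape)   # (1, 2048, 7, 7): fixed-length input for the fully connected head
```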
In order to train the dual-mode target fusion detector based on adaptive weight distribution and, at the same time, train the weight map of the adaptive weight fusion module, a multi-task joint loss function is defined as follows:
wherein: σ is a hyperparameter of the deep learning network;
ω is a hyperparameter of the correlation weight calculation network;
i denotes the anchor box index;
L_cls denotes the classification loss term;
L_reg denotes the regression loss term;
N_cls denotes the training batch size;
N_reg denotes the size of the feature map;
λ denotes a correlation coefficient balancing the weights of the classification branch and the regression branch;
η denotes the weight-matrix training coefficient balancing the overall loss;
t_i denotes the bounding box prediction;
t_i* denotes the value of the ground-truth box corresponding to an anchor box containing a target;
p_i denotes the target confidence output by the feature classification module;
p_i* denotes the prediction confidence of the ground-truth box;
L_cls(p_i, p_i*|σ) and L_reg(t_i, t_i*|σ) are the classification loss and regression loss, respectively;
L_ω(p, p_i*|ω) denotes the loss function of the weight matrix; when the hyperparameter of the correlation weight calculation network is ω, the error follows a Gaussian distribution, and the probability of the output being the true value is as follows:
In fused-network training, a mini-batch gradient descent method is used to minimize the loss, thereby obtaining the optimal network parameters (σ*, ω*):
so that the weight feature matrix W_IR of the infrared target feature map and the weight feature matrix W_RGB of the feature map of the corresponding visible light target reach the optimal distribution, with the whole process adjusted automatically.
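Because the joint loss is specified here only through its symbols, the following sketch assembles a plausible composition of the three terms with the balancing coefficients λ and η: a classification loss normalised by N_cls, a regression loss weighted by λ and normalised by N_reg, and an η-weighted weight-matrix loss. It is an assumption consistent with the definitions above, not the exact patented formula, and l_omega is a stand-in scalar for L_ω(p, p_i*|ω).

```python
import torch
import torch.nn.functional as F

def joint_loss(cls_logits, labels, reg_pred, reg_targets, pos_mask,
               l_omega, n_reg, lam=10.0, eta=0.1):
    """Assumed composition: L = L_cls / N_cls + lam * L_reg / N_reg + eta * L_omega."""
    n_cls = labels.numel()                                    # N_cls: training batch size
    l_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / n_cls
    # Regression loss only over anchors that contain a target (p_i* = 1).
    l_reg = F.smooth_l1_loss(reg_pred[pos_mask], reg_targets[pos_mask],
                             reduction="sum") / n_reg
    return l_cls + lam * l_reg + eta * l_omega

pos = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0], dtype=torch.bool)
loss = joint_loss(torch.randn(8, 2), torch.randint(0, 2, (8,)),
                  torch.randn(8, 4), torch.randn(8, 4), pos,
                  l_omega=torch.tensor(0.3), n_reg=16 * 20)
print(float(loss))
```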
Step 5, calculating the classification loss and the regression loss of the region proposal network using the classification loss function and the regression loss function; minimizing the loss with a mini-batch gradient descent method to train the target classification and bounding box regression modules.
Step 6, with the region proposal network trained in step 5, training the parameters of the whole target detection network again using the multi-task joint loss function, and obtaining the target boxes and the corresponding category confidences.
Step 7, using the target detector trained in step 6, fusion detection can be carried out on the complementary information acquired by different sensors; the advantages of the different information sources are combined, the accuracy and environmental adaptability of target detection are improved, and the problems of missed and false detections in complex environments are effectively alleviated.
The beneficial effects are that:
1. according to the self-adaptive infrared and visible light dual-mode fusion detection method disclosed by the invention, the self-adaptive characteristic fusion module and the subsequent multi-task classification regression module are used for calculating the optimal fusion coefficients at different positions of different characteristic diagrams to fuse the characteristics by using a self-adaptive adjustment method, so that the target detection precision is greatly improved on the basis of increasing the space-time adaptability.
2. According to the self-adaptive infrared and visible light dual-mode fusion detection method disclosed by the invention, the related weight calculation network adopts the learning capacity and the self-adaptive adjustment capacity of the deep learning network, and weights are optimized according to the prediction result, so that the all-weather self-adaptive capacity of the dual-mode fusion detection network is improved. Meanwhile, a weight map is used for replacing a single weight, so that the situation that the definition distribution of different target categories of a pair of infrared and visible light dual-mode pictures is uneven is better adapted, and the environmental adaptability of target detection is improved.
3. In the self-adaptive infrared and visible light dual-mode fusion detection method disclosed by the invention, a correlation weight calculation module formed by a multi-layer convolutional network computes weights for different regions of the two infrared and visible light feature maps, yielding a weight matrix of the infrared target feature map and a weight matrix of the corresponding visible light target feature map; the weight matrices are multiplied with the corresponding positions of the original infrared and visible light feature maps and then superposed to output the fused feature map, a loss term for training the weight matrix is added, the multi-task joint loss function is improved, and the detection precision of the target is raised.
Drawings
FIG. 1 is a flow chart of a method for detecting adaptive infrared and visible light dual-mode fusion in the invention;
FIG. 2 is a basic framework of a dual-mode target fusion detection network with adaptive weight distribution in the present embodiment;
FIG. 3 shows two different residual units in the present embodiment;
FIG. 4 is a detailed structure of the trunk feature extraction network in the present embodiment;
fig. 5 is a general structure of a feature fusion network for adaptive weight distribution in the present embodiment;
FIG. 6 is a related weight calculation network in the present embodiment;
FIG. 7 is a diagram showing a structure of an infrared-visible light fusion feature classification and bounding box regression module in the present embodiment;
FIG. 8 is a schematic diagram of the initial detection frame generation principle in the present embodiment;
FIG. 9 is a schematic diagram of a multi-task classification regression module based on adaptive weight distribution.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples. The technical problems and the beneficial effects solved by the technical proposal of the invention are also described, and the described embodiment is only used for facilitating the understanding of the invention and does not have any limiting effect.
Target detection algorithms based only on a single-modality image (infrared image or visible light image) cannot adapt to changes in weather, illumination and time. Therefore, to better identify targets around the clock in complex spatio-temporal environments (pedestrian monitoring, automatic driving, military reconnaissance and the like), complementary information collected by different sensors is used for fusion detection, so that the advantages of different information sources can be exploited and the problems of missed and false detections in complex environments can be effectively alleviated. In this embodiment, the complementary strengths of the detection information of the infrared camera and the visible light camera are utilized, and the infrared image and the visible light image of the target serve as the sources of the fused information.
The embodiment discloses a self-adaptive infrared and visible light dual-mode fusion detection method which, as shown in fig. 2, comprises backbone feature extraction, adaptive feature fusion, multi-task classification regression and target recognition. Two backbone feature extraction networks extract features from the infrared and visible light images respectively; the two feature streams are fused with adaptive weights by the fusion network and then fed into the classifier and the regressor for feature classification and bounding box regression; finally, the various vehicle target classes are determined in the target recognition stage to obtain the final detection result.
The embodiment discloses a self-adaptive infrared visible light dual-mode fusion detection method for identifying a vehicle, which comprises the following steps:
Step 1, respectively extracting visible light image features and infrared image features by using two paths of trunk feature extraction networks.
The backbone feature extraction network consists of two vertically symmetric multi-layer convolutional neural networks; the visible light image is a three-channel input and the infrared image a single-channel input, and apart from this the two network structures are essentially the same, so the feature extraction network of the visible light image is taken for analysis. The network has a five-layer structure. The first layer is relatively simple and preprocesses the input image; the remaining four layers are roughly alike and are all built from residual blocks, differing only in the number of residual blocks and the network depth. The residual blocks use shortcut connections, which effectively alleviate the vanishing-gradient problem in deep network training.
The backbone has a five-layer structure. The first layer of the backbone feature extraction network contains two modules: a convolutional layer and a pooling layer. The convolutional layer includes a batch normalization (Batch Normalization) operation, uses the ReLU activation function, and has a kernel size of 7 × 7 and a stride of 2. The pooling layer uses max pooling with a kernel size of 3 × 3 and a stride of 2. The second layer consists of three residual blocks built from two different residual units, CB (Conv Block) and IB (Identity Block), as shown in fig. 3. CB matches the dimension difference between the output and the input, while IB keeps the output and input dimensions the same and is used to deepen the network. The third, fourth and fifth layers contain 4, 6 and 3 residual blocks respectively; the detailed framework is shown in fig. 4. The channel number (channel), height (height) and width (width) of the visible light image fed into the network are expressed as a three-dimensional feature vector (C, H, W); the initial vector is (3, H, W), and after passing through the five network layers in turn it becomes (64, H/4, W/4), (256, H/4, W/4), (512, H/8, W/8), (1024, H/16, W/16) and (2048, H/32, W/32) respectively.
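The two residual units CB and IB of fig. 3 can be sketched as a single bottleneck module whose shortcut is either a 1 × 1 projection (CB, when dimensions change) or the identity (IB). The channel counts below follow the (64, H/4, W/4) → (256, H/4, W/4) progression stated above, but the exact block internals are assumptions.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual unit: acts as CB when a projection shortcut is needed, IB otherwise."""
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # CB: 1x1 projection matches the dimension difference between output and input.
        # IB: identity shortcut, keeps dimensions and only deepens the network.
        self.shortcut = (nn.Identity() if in_ch == out_ch and stride == 1 else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

cb = Bottleneck(64, 64, 256)        # Conv Block: 64 -> 256 channels
ib = Bottleneck(256, 64, 256)       # Identity Block: dimensions unchanged
x = torch.randn(1, 64, 128, 160)    # (64, H/4, W/4) after the first layer
print(ib(cb(x)).shape)              # (1, 256, 128, 160), i.e. (256, H/4, W/4)
```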
Step 2, calculating a characteristic weight matrix through a related weight calculation network, and adaptively carrying out characteristic fusion;
the input of the characteristic fusion network of the self-adaptive weight distribution is two paths of infrared visible light characteristics extracted by a main characteristic extraction network improved by an attention mechanismAnd->) The output is a feature map f correspondingly fused according to the weight map (weight matrix) m (w, h). The core idea of the feature fusion network for adaptive weight distribution is as follows: the weight calculation is carried out on different areas of the two paths of infrared visible light feature images transmitted by a related weight calculation module formed by a multi-layer convolution network, so that a weight feature image W of the infrared target feature image is obtained IR Weight feature map W of feature map of corresponding visible light target RGB . Obtaining a weight graph (weight matrix), multiplying the weight graph with the corresponding position of the original infrared visible light characteristic graph, and then superposing the weight graph to output a fusion characteristic graph f m (w, h). Let the transform made by the convolution network be f fus . The whole process is represented by the formula (1) and the formula (2), and the whole frame is shown in fig. 5:
as shown in fig. 5, the feature map f is fused m Size sum of (w, h)Input infrared visible light featuresAndthe same applies. The core of the feature fusion network is a related weight calculation network, which is composed of a plurality of layers of convolutional neural networks, and the specific structure of the feature fusion network is shown in fig. 6. The size of the input infrared visible light characteristic diagram is m multiplied by w multiplied by h, m represents the number of channels, and w and h are used for representing the width and the height of the characteristic diagram. The related weight calculation network firstly calculates the input infrared characteristic diagramAnd the visible light characteristic of the input +.>Corresponding to the channel cascade operation (concat), as shown in equation (3), where x represents the convolution operation, K i Z is the convolution kernel after the channel concat Is a feature map obtained after the concat operation, and has a size ((m) 1 +m 2 )×w×h)。
Obtaining Z concat Then, a plurality of convolution layers are input, batch normalization is carried out after each convolution operation (the convolution layers comprise batch integration operation (Batch Normalization)), the loss function is used as a ReLU, the kernel size is 3 multiplied by 3, and the step size is 2. Finally, compressing and normalizing the channel dimension by using a flexible maximization function (Softmax) as shown in a formula (4) to obtain a weight feature map W of the infrared target feature map IR Weight feature map W of feature map of corresponding visible light target RGB They are all of size (w×h) for each element [ ω ] in the matrix IRRGB ]All have omega IRRGB =1。
The design idea of the related weight calculation network is to use the learning ability and the self-adaptive adjustment ability of the deep learning network to optimize the weight according to the prediction result, thereby improving the all-weather self-adaptive ability of the dual-mode fusion detection network. Meanwhile, a weight graph (also called as a weight matrix) is used for replacing a single weight, so that the situation that the definition distribution of different target categories of a pair of infrared and visible light dual-mode pictures is uneven is better adapted, namely, the situation that infrared information of some target categories in a pair of pictures is richer and visible light information of other target categories is richer is better.
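As a quick check of the constraint stated above (complementing the fusion-module sketch given after step 2 of the summary), a softmax over a two-channel weight map makes the two weights at every spatial position sum to one:

```python
import torch

logits = torch.randn(1, 2, 16, 20)            # raw two-channel weight logits
w = torch.softmax(logits, dim=1)              # normalise across the modality channel
w_ir, w_rgb = w[:, 0], w[:, 1]
print(torch.allclose(w_ir + w_rgb, torch.ones_like(w_ir)))   # True: w_ir + w_rgb = 1
```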
Step 3, performing feature classification and bounding box regression by using the multi-task joint loss function;
the fusion characteristics are obtained from the fusion layer, firstly, an anchor frame division characteristic map area is generated, an anchor frame containing targets is determined by using a classification module, the positions of the anchor frames are adjusted to be close to the positions of real target frames by using a boundary frame regression module, the generated area suggestion frame contains targets to be identified, a target detection function is completed, and fixed-length output lays a foundation for subsequent target identification after pooling by a pooling layer. As will be described in detail below, the overall structure is shown in fig. 7.
And performing window sliding operation on the infrared and visible light dual-mode feature map extracted by the trunk feature extraction network by using the multi-scale anchor frame to generate an initial detection frame. As shown in fig. 8.
The feature classification module uses a softmax function as the classifier to calculate the probability value P_i (in the range [0, 1]) that a vehicle target is contained within an initial detection box, as shown in formula (5), and classifies the detection boxes into two classes, containing a target and not containing a target, thereby preliminarily obtaining the candidate regions that contain targets. The classification loss L_cls(p, i) is given by formula (6):
L_cls(p, i) = −log P_i (6)
The coordinates of the center point and the width and height of a bounding box containing an object are defined as a four-dimensional vector A = (A_x, A_y, A_w, A_h). The bounding box regression module learns a mapping F such that the regression box R = (R_x, R_y, R_w, R_h) approximates the ground-truth box G = (G_x, G_y, G_w, G_h) as closely as possible. The mapping relationship is shown in formulas (7) and (8).
F(A_x, A_y, A_w, A_h) = (R_x, R_y, R_w, R_h) (7)
(R_x, R_y, R_w, R_h) ≈ (G_x, G_y, G_w, G_h) (8)
The regression module is trained to learn parameters W_*^T; given the initial target bounding box parameters φ(A), the predicted value d_*(A) of the regression box is obtained, as shown in formula (9).
d_*(A) = W_*^T · φ(A) (9)
The finally learned parameter value Ŵ_* is shown in formula (10).
Ŵ_* = argmin_{W_*} Σ_i (t_*^i − W_*^T · φ(A^i))² + λ‖W_*‖² (10)
Wherein the argmin function denotes the variable value at which the objective function is minimized, and t_* represents the true value.
The loss function of the regression module is the smooth L1 loss L_reg, as shown in formula (11).
After training, the regression module outputs the translation and scaling of each anchor box relative to the ground-truth box to correct the position of the initial target box. All region proposal boxes containing vehicle targets are obtained through the feature classification and bounding box regression operations, completing the target detection task so that the subsequent network can perform target recognition. The region proposal boxes of different scales and the original infrared-visible fused features are input into the pooling layer, which adjusts the proposal boxes of different sizes to a fixed W_p × H_p size and outputs a fixed-length result.
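The sliding-window generation of initial detection boxes over the fused feature map (fig. 8) can be sketched as a grid of multi-scale, multi-ratio anchors. The stride, scales and aspect ratios below are illustrative assumptions, since the patent does not list them.

```python
import torch

def make_anchors(feat_h, feat_w, stride=32,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors (cx, cy, w, h) centred on every feature-map cell."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:              # r is the assumed width/height ratio
                    w, h = s * (r ** 0.5), s / (r ** 0.5)
                    anchors.append([cx, cy, w, h])
    return torch.tensor(anchors)

anchors = make_anchors(16, 20)
print(anchors.shape)   # (16 * 20 * 9, 4) initial detection boxes
```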
Step 4, performing target recognition and classification using the fully connected network
As shown in fig. 9, the fully connected network can be divided into a target classification module and a bounding box regression module based on adaptive weight distribution. The fused feature maps of different layers are interpolated and extracted by a region-of-interest alignment layer (ROI Align) into fused feature maps of the uniform size 7 × 7, pass through two fully connected layers (FC layers) in turn, and are then split into two branches; each branch passes through one further fully connected layer before being fed into the target classification module and the bounding box regression module respectively, the activation function of every fully connected layer being the ReLU function. The parameter design of the target classification module and the bounding box regression network based on adaptive weight distribution is shown in Table 1. The dimensions of the feature maps are given in the table; each feature map has height H and width W, B is the batch size (Batch Size), and C is the number of categories.
Table 1 design of target classification module and bounding box regression network parameters based on adaptive weight distribution
Region-of-interest alignment (ROI Align) is used to aggregate region features; a bilinear interpolation method replaces the quantization operation, converting the stepwise-quantized feature aggregation into a continuous operation. The procedure traverses each candidate box, keeps the boundary coordinate points unchanged, subdivides the candidate region into K × K units, computes the values at four fixed coordinate positions in each unit by bilinear interpolation, and finally performs max pooling (Max pooling). The error back-propagation formula of ROI Align is shown in formula (12), where d(·,·) denotes the distance between two points, and Δh and Δw are the differences between the horizontal and vertical coordinates of a feature map pixel and the floating-point sampling point used in forward propagation; they serve as bilinear interpolation coefficients on the original gradient.
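A sketch of the head of fig. 9, with two shared fully connected layers followed by a classification branch and a bounding box regression branch (each behind one further fully connected layer with ReLU activation), might look as follows. The layer widths and the class count are assumptions, since Table 1 is not reproduced here.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Target classification + bounding box regression on 7x7 pooled features."""
    def __init__(self, in_ch=2048, hidden=1024, num_classes=5):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_ch * 7 * 7, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.cls_branch = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, num_classes + 1))   # + background
        self.reg_branch = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 4 * num_classes))   # per-class boxes

    def forward(self, pooled):                       # pooled: (B, C, 7, 7)
        x = self.shared(pooled)
        return self.cls_branch(x), self.reg_branch(x)

head = DetectionHead()
scores, deltas = head(torch.randn(3, 2048, 7, 7))
print(scores.shape, deltas.shape)    # (3, 6) class scores, (3, 20) box refinements
```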
In order to train the dual-mode vehicle target fusion detector based on adaptive weight distribution and, at the same time, train the weight map (weight matrix) of the adaptive weight fusion module, a multi-task joint loss function is defined as shown in formula (13).
Wherein: σ is a hyperparameter of the deep learning network;
ω is a hyperparameter of the correlation weight calculation network;
i denotes the anchor box index;
L_cls denotes the classification loss term;
L_reg denotes the regression loss term;
N_cls denotes the training batch size (batch size);
N_reg denotes the size of the feature map;
λ denotes a correlation coefficient balancing the weights of the classification branch and the regression branch;
η denotes the weight-matrix training coefficient balancing the overall loss;
t_i denotes the bounding box prediction;
t_i* denotes the value of the ground-truth box corresponding to an anchor box containing a target;
p_i denotes the target confidence output by the feature classification module;
p_i* denotes the prediction confidence of the ground-truth box;
L_cls(p_i, p_i*|σ) and L_reg(t_i, t_i*|σ) are the classification loss and regression loss, respectively;
L_ω(p, p_i*|ω), the loss function of the weight matrix, is given by formula (14).
When the hyperparameter of the correlation weight calculation network is ω, the error follows a Gaussian distribution, and the probability of the output being the true value is given by formula (15).
In fused-network training, the MBGD (mini-batch gradient descent) method is used to minimize the loss, thereby obtaining the optimal network parameters (σ*, ω*), as shown in formula (16), so that the weight feature matrix W_IR of the infrared target feature map and the weight feature matrix W_RGB of the feature map of the corresponding visible light target reach the optimal distribution, with the whole process adjusted automatically.
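Minimising the joint loss with mini-batch gradient descent to obtain (σ*, ω*) amounts to an ordinary stochastic optimisation loop over the detector parameters (σ) and the correlation-weight-network parameters (ω) together. The sketch below uses stand-in modules and an illustrative SGD configuration; it shows only the update structure, not the actual networks or loss.

```python
import torch
import torch.nn as nn

# Stand-ins: `detector` plays the role of the detection network (parameters sigma)
# and `fusion` the correlation weight calculation network (parameters omega).
detector = nn.Linear(10, 2)
fusion = nn.Linear(10, 10)

optimizer = torch.optim.SGD(list(detector.parameters()) + list(fusion.parameters()),
                            lr=0.005, momentum=0.9, weight_decay=1e-4)

for step in range(3):                                # three illustrative mini-batches
    x = torch.randn(8, 10)                           # stands in for a batch of fused features
    labels = torch.randint(0, 2, (8,))
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(detector(fusion(x)), labels)  # stand-in for the joint loss
    loss.backward()                                  # gradients w.r.t. sigma and omega jointly
    optimizer.step()                                 # one mini-batch gradient descent update
    print(step, float(loss))
```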
Step 5, training and predicting the detection network
The training process of the self-adaptive dual-mode fusion detector comprises three steps: training the target classification and bounding box regression module, acquiring the proposal regions, and training the whole target detector. By dividing the training steps reasonably and setting the related parameters, the network is made to converge as quickly as possible.
(1) Training object classification and bounding box regression module
Training is performed using the multi-task joint loss function of formula (13); the training process is one in which the loss continuously decreases.
(2) Acquiring a suggested region
After the target classification and bounding box regression network has been trained, feature maps are input into the trained network to obtain the region proposal boxes containing targets, and this information is saved for subsequent network training; the flow is similar to that of target detection.
(3) Training an entire infrared visible light vehicle target detector
The previously saved proposal boxes for the different regions of the vehicle targets are passed into the fully connected network, the specific categories are judged with a softmax function, and a smooth L1 loss layer is used for accurate bounding box regression. This completes the training of the final softmax layer and of the bounding box regression layer, and consequently the training of the whole detector network.
After the detector network has been trained, its detection flow on a picture is as follows: a picture is input, the backbone feature extraction network produces the feature maps, candidate boxes containing targets are generated, the pooling layer yields a fixed-length output, and the fully connected layers produce target boxes with category confidences and fine-tuned positions. A large number of overlapping bounding boxes are usually generated when the candidate boxes are produced; non-maximum suppression is used to eliminate the bounding boxes with locally low confidence and keep those with high confidence. The accuracy and environmental adaptability of target detection are thus improved.
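The non-maximum suppression step at the end of the detection flow can be sketched with torchvision's nms operator; the boxes, scores and IoU threshold below are illustrative.

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[100., 100., 200., 200.],
                      [105.,  98., 205., 198.],   # heavy overlap with the first box
                      [400., 300., 480., 380.]])
scores = torch.tensor([0.92, 0.75, 0.88])
keep = nms(boxes, scores, iou_threshold=0.5)      # suppress low-confidence overlaps
print(keep)                                       # tensor([0, 2]): the overlapping box is removed
```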
In order to verify the effectiveness of the self-adaptive infrared and visible light dual-mode fusion detection method, the detection performance of single-mode target detection methods (Faster R-CNN and YOLOv4) and of the self-adaptive infrared and visible light dual-mode fusion detection method was compared on complex urban road vehicle data sets collected at different places and in different time periods (containing 5 target categories: truck, sedan, van, bus and suv). Performance is measured by the multi-class mean average precision (mAP); the larger the mAP value, the better the performance. As Table 2 shows, the mAP values of the method are larger than those of the two traditional single-mode detection methods, demonstrating its advantages in detection accuracy and environmental adaptability over the traditional single-mode detection methods.
Table 2 quantitative evaluation results on test sets at different sites at different time periods
The foregoing detailed description has set forth the objects, aspects and advantages of the invention in further detail, it should be understood that the foregoing description is only illustrative of the invention and is not intended to limit the scope of the invention, but is to be accorded the full scope of the invention as defined by the appended claims.

Claims (6)

1. A self-adaptive infrared and visible light dual-mode fusion detection method, characterized in that the method comprises the following steps:
step 1, respectively extracting visible light image features and infrared image features by using two paths of trunk feature extraction networks;
step 2, calculating a characteristic weight matrix through a related weight calculation network to obtain a fusion characteristic diagram, and adaptively carrying out characteristic fusion;
step 3, performing feature classification and bounding box regression by using the multi-task joint loss function;
step 4, calculating specific categories of targets in the suggestion frame in the fusion feature map by using a fully-connected network, outputting prediction confidence, and adjusting the position of the suggestion frame by using bounding box regression again;
step 5, calculating the classification loss and the regression loss of the regional suggestion network by using the classification loss function and the regression loss function; minimizing loss by using a small batch gradient descent method, and training a target classification and bounding box regression module;
and 6, training the whole target detection network parameters by using the multi-task joint loss function again through the region suggestion network trained in the step 5, and obtaining a target frame and a corresponding category confidence coefficient.
2. The adaptive infrared-visible light dual-mode fusion detection method as defined in claim 1, wherein the method comprises the following steps: the method also comprises a step 7, wherein the trained target detector in the step 6 can be used for fusion detection aiming at complementary information acquired by different sensors, so that the advantages of different information sources are synthesized, the accuracy and the environmental adaptability of target detection are improved, and the problems of target missing detection and false detection in a complex environment are effectively solved.
3. The adaptive infrared-visible light dual-mode fusion detection method as defined in claim 1, wherein the method comprises the following steps: the implementation method of the first step is that,
inputting registered infrared and visible light images, and respectively extracting visible light image features and infrared image features by using two paths of trunk feature extraction networks; the visible light image contains three channels of information, the infrared image contains single channel information, and four channels of information are extracted altogether.
4. The adaptive infrared-visible light dual-mode fusion detection method as defined in claim 2, wherein the method comprises the following steps: the implementation method of the step 2 is that,
in step 2, a correlation weight calculation network formed by a multi-layer convolutional network performs weight calculation on different regions of the visible light feature map and the infrared feature map obtained in step 1, to obtain a weight feature map W_IR of the infrared target feature map and a weight feature map W_RGB of the feature map of the corresponding visible light target; after the weight feature maps are obtained, they are multiplied with the corresponding positions of the original infrared and visible light feature maps and the products are superposed to output the fused feature map f_m(w,h); the whole process is expressed as follows:
(W_IR, W_RGB) = f_fus(F_IR, F_RGB)
f_m(w,h) = W_IR ⊙ F_IR + W_RGB ⊙ F_RGB
wherein f_fus is the transformation performed by the convolutional network; W_IR is the weight feature map of the infrared target feature map; W_RGB is the weight feature map of the feature map of the corresponding visible light target; F_IR and F_RGB are the input infrared and visible light feature maps;
the fused feature map f_m(w,h) has the same size as the input infrared and visible light features F_IR and F_RGB; the core of the feature fusion network is the correlation weight calculation network, which is composed of multiple convolutional layers; the input infrared and visible light feature maps have size m × w × h, where m denotes the number of channels and w and h denote the width and height of the feature map; the correlation weight calculation network first performs the channel cascade operation on the input infrared feature map F_IR and the input visible light feature map F_RGB:
Z_concat = K_i * concat(F_IR, F_RGB)
where * denotes the convolution operation, K_i is the convolution kernel applied after the channel cascade, and Z_concat is the feature map obtained after the channel cascade operation, with size (m_1 + m_2) × w × h; after Z_concat is obtained, it is passed through several convolutional layers, each convolution followed by a batch normalization operation, to obtain a response map Z; finally, compression and normalization are performed on Z along the channel dimension using a softmax function:
(W_IR, W_RGB) = softmax(Z)
so as to obtain the weight feature map W_IR of the infrared target feature map and the weight feature map W_RGB of the feature map of the corresponding visible light target; both have size w × h, and for each pair of elements [ω_IR, ω_RGB] at the same position, ω_IR + ω_RGB = 1.
5. The adaptive infrared-visible light dual-mode fusion detection method as defined in claim 3, wherein the method comprises the following steps of: the implementation method of the step 3 is that,
after the fused feature map is obtained based on step 2, region proposal boxes are generated to preliminarily partition the feature map, a feature classification module determines the proposal boxes containing a target, and bounding box regression adjusts the positions of the proposal boxes so that they approach the positions of the real target boxes;
the feature classification module uses a softmax function as a classifier to calculate the probability value P_i that a target is contained within an initial detection box, and classifies the detection boxes into two classes, containing a target and not containing a target, thereby preliminarily obtaining the candidate regions containing targets; the classification loss is L_cls(p, i) = −log P_i;
the center point coordinates and the width and height of a bounding box containing an object form a four-dimensional vector A = (A_x, A_y, A_w, A_h); the bounding box regression module learns a mapping F such that the regression box R = (R_x, R_y, R_w, R_h) approximates the ground-truth box G = (G_x, G_y, G_w, G_h) as closely as possible; the mapping relationship is as follows:
F(A_x, A_y, A_w, A_h) = (R_x, R_y, R_w, R_h)
(R_x, R_y, R_w, R_h) ≈ (G_x, G_y, G_w, G_h)
the regression module is trained to learn parameters W_*^T; given the initial target bounding box parameters φ(A), the predicted value d_*(A) of the regression box is obtained:
d_*(A) = W_*^T · φ(A)
the finally learned parameter value Ŵ_* is:
Ŵ_* = argmin_{W_*} Σ_i (t_*^i − W_*^T · φ(A^i))² + λ‖W_*‖²
wherein the argmin function denotes the variable value at which the objective function is minimized, λ is a proportionality coefficient distributing the proportion of each branch, and t_* represents the true value;
the loss function of the regression module is the smooth L1 loss L_reg;
after training, the regression module outputs the translation and scaling of each anchor box relative to the ground-truth box and corrects the position of the initial target box; all region proposal boxes containing targets are obtained through the feature classification and bounding box regression operations, completing the target detection task so that the subsequent network can perform target recognition; the region proposal boxes of different scales and the original infrared-visible fused features are input into a pooling layer, which adjusts the proposal boxes of different sizes to a fixed W_p × H_p size and produces a fixed-length output.
6. The adaptive infrared-visible light dual-mode fusion detection method as defined in claim 4, wherein the method comprises the following steps: the implementation method of the step 4 is that,
the fully-connected network is divided into a target classification module and a bounding box regression module based on self-adaptive weight distribution; the fusion feature images of different layers are aligned through the region of interest, interpolated and extracted into fusion feature images with uniform size, sequentially pass through two full-connection layers, and then are divided into two branches; each branch still passes through a full-connection layer, and then is respectively input into a target classification module and a bounding box regression module; wherein the activation function of each full-connection layer is a ReLU function; the height and width of each feature map are H and W, B is the batch size, and C is the category number;
the region-of-interest alignment is used to aggregate region features; a bilinear interpolation method replaces the quantization operation, converting the stepwise-quantized feature aggregation into a continuous operation; the procedure traverses each candidate box, keeps the boundary coordinate points unchanged, subdivides the candidate region into K × K units, computes the values at four fixed coordinate positions in each unit by bilinear interpolation, and finally performs max pooling; the error back-propagation formula of the region-of-interest alignment is:
wherein d(·,·) denotes the distance between two points, and Δh and Δw are the differences between the horizontal and vertical coordinates of a feature map pixel and the floating-point sampling point used during forward propagation, serving as bilinear interpolation coefficients on the original gradient;
in order to train the dual-mode target fusion detector based on adaptive weight distribution and, at the same time, train the weight map of the adaptive weight fusion module, a multi-task joint loss function is defined as follows:
wherein: σ is a hyperparameter of the deep learning network;
ω is a hyperparameter of the correlation weight calculation network;
i denotes the anchor box index;
L_cls denotes the classification loss term;
L_reg denotes the regression loss term;
N_cls denotes the training batch size;
N_reg denotes the size of the feature map;
λ denotes a correlation coefficient balancing the weights of the classification branch and the regression branch;
η denotes the weight-matrix training coefficient balancing the overall loss;
t_i denotes the bounding box prediction;
t_i* denotes the value of the ground-truth box corresponding to an anchor box containing a target;
p_i denotes the target confidence output by the feature classification module;
p_i* denotes the prediction confidence of the ground-truth box;
L_cls(p_i, p_i*|σ) and L_reg(t_i, t_i*|σ) are the classification loss and regression loss, respectively;
L_ω(p, p_i*|ω) denotes the loss function of the weight matrix; when the hyperparameter of the correlation weight calculation network is ω, the error follows a Gaussian distribution, and the probability of the output being the true value is as follows:
in fused-network training, a mini-batch gradient descent method is used to minimize the loss, thereby obtaining the optimal network parameters (σ*, ω*):
so that the weight feature matrix W_IR of the infrared target feature map and the weight feature matrix W_RGB of the feature map of the corresponding visible light target reach the optimal distribution, with the whole process adjusted automatically.
CN202310809010.9A 2023-07-03 2023-07-03 Self-adaptive infrared and visible light dual-mode fusion detection method Pending CN116704273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310809010.9A CN116704273A (en) 2023-07-03 2023-07-03 Self-adaptive infrared and visible light dual-mode fusion detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310809010.9A CN116704273A (en) 2023-07-03 2023-07-03 Self-adaptive infrared and visible light dual-mode fusion detection method

Publications (1)

Publication Number Publication Date
CN116704273A true CN116704273A (en) 2023-09-05

Family

ID=87833990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310809010.9A Pending CN116704273A (en) 2023-07-03 2023-07-03 Self-adaptive infrared and visible light dual-mode fusion detection method

Country Status (1)

Country Link
CN (1) CN116704273A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274899A (en) * 2023-09-20 2023-12-22 中国人民解放军海军航空大学 Storage hidden danger detection method based on visible light and infrared light image feature fusion
CN117274899B (en) * 2023-09-20 2024-05-28 中国人民解放军海军航空大学 Storage hidden danger detection method based on visible light and infrared light image feature fusion
CN117528233A (en) * 2023-09-28 2024-02-06 哈尔滨航天恒星数据系统科技有限公司 Zoom multiple identification and target re-identification data set manufacturing method
CN117528233B (en) * 2023-09-28 2024-05-17 哈尔滨航天恒星数据系统科技有限公司 Zoom multiple identification and target re-identification data set manufacturing method
CN117726785A (en) * 2024-02-01 2024-03-19 江苏智仁景行新材料研究院有限公司 Target identification positioning system and method for initiating explosive device cleaning
CN117726785B (en) * 2024-02-01 2024-05-07 江苏智仁景行新材料研究院有限公司 Target identification positioning system and method for initiating explosive device cleaning

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN116704273A (en) Self-adaptive infrared and visible light dual-mode fusion detection method
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN109034184B (en) Grading ring detection and identification method based on deep learning
CN113963240B (en) Comprehensive detection method for multi-source remote sensing image fusion target
CN114972213A (en) Two-stage mainboard image defect detection and positioning method based on machine vision
CN109492700B (en) Complex background target identification method based on multi-dimensional information fusion
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
CN112288758B (en) Infrared and visible light image registration method for power equipment
CN116452937A (en) Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN107045630B (en) RGBD-based pedestrian detection and identity recognition method and system
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN116486287A (en) Target detection method and system based on environment self-adaptive robot vision system
CN115131503A (en) Health monitoring method and system for iris three-dimensional recognition
CN116645563A (en) Typical traffic event detection system based on deep learning
CN113378638B (en) Method for identifying abnormal behavior of turbine operator based on human body joint point detection and D-GRU network
CN112508863B (en) Target detection method based on RGB image and MSR image double channels
CN103235943A (en) Principal component analysis-based (PCA-based) three-dimensional (3D) face recognition system
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7
CN116721398A (en) Yolov5 target detection method based on cross-stage route attention module and residual information fusion module
CN116486431A (en) RGB-T multispectral pedestrian detection method based on target perception fusion strategy
CN116309270A (en) Binocular image-based transmission line typical defect identification method
CN113537397B (en) Target detection and image definition joint learning method based on multi-scale feature fusion
CN115171001A (en) Method and system for detecting vehicle on enhanced thermal infrared image based on improved SSD

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination