CN111382683B - Target detection method based on feature fusion of color camera and infrared thermal imager


Info

Publication number
CN111382683B
Authority
CN
China
Legal status: Active
Application number
CN202010135485.0A
Other languages
Chinese (zh)
Other versions
CN111382683A (en)
Inventor
殷国栋
吴愿
薛培林
耿可可
庄伟超
黄文涵
沈童
于晨风
邹伟
卢彦博
王金湘
张宁
陈建松
任祖平
Current Assignee
Southeast University
Original Assignee
Southeast University
Application filed by Southeast University
Priority to CN202010135485.0A
Publication of CN111382683A
Application granted
Publication of CN111382683B

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing

Abstract

The invention discloses a target detection method based on feature fusion of a color camera and a thermal infrared imager, which comprises the following steps: a. obtain a color data set with a color camera and a thermal infrared data set with an infrared thermal imager; b. input the bimodal data set into a bimodal YOLOv3 neural network and extract the color and temperature features of the target; fuse the features of the two modalities at a certain layer of the backbone network through a fusion function and a 1×1 convolution block, then continue feature extraction on the fused feature map to obtain the fused extracted feature maps; c. input the fused extracted feature maps into the subsequent convolution layers to classify targets, and output the trained bimodal neural network model. By fusing temperature and color information inside the bimodal backbone network and predicting targets in the classification layers, the method enriches the feature information available for each target and improves target recognition accuracy.

Description

Target detection method based on feature fusion of color camera and infrared thermal imager
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a target detection method based on feature fusion of a color camera and a thermal infrared imager.
Background
In complex terrain environments (changeable backgrounds, occlusion by terrain and ground objects) and in low-visibility natural conditions such as rain, haze and darkness, traditional target recognition schemes achieve low accuracy and cannot meet the requirements of automatic driving. A color camera is highly sensitive to illumination; when lighting is poor the neural network cannot extract complete target features, so improving the network algorithm alone cannot raise recognition accuracy. To ensure that an automatic driving vehicle perceives potential safety hazards in the road environment promptly and accurately, takes measures quickly and avoids traffic accidents, a multi-sensor joint observation method is often adopted.
The existing automatic driving perception platform mainly comprises sensors such as a lidar, a color camera, a thermal infrared imager and a millimeter-wave radar. The lidar scans the vehicle's surroundings to obtain three-dimensional distance information and senses the driving road conditions through distance analysis and recognition; it offers high measurement accuracy and is little affected by the environment, but it is expensive and cannot capture the morphological characteristics of a target. The millimeter-wave radar mainly obtains target distance information; it features short wavelength, wide bandwidth and strong penetration, with small size and high ranging precision, but it likewise cannot detect target morphology and is relatively expensive, so it is not well suited to target detection and recognition. The color camera and the infrared thermal imager are mainly responsible for identifying targets, with the lowest cost, convenient installation, small size and low energy consumption: the color camera captures rich texture information, and the thermal infrared imager captures the temperature information of the target.
Joint target detection with a color camera and an infrared thermal imager is therefore widely applied, and how to fuse the temperature information of the thermal infrared camera with the color information of the color camera to achieve the best detection effect is the problem to be solved.
Disclosure of Invention
To solve the above problems, the invention provides a target detection method based on feature fusion of a color camera and an infrared thermal imager. Aimed at scenes where a color camera alone cannot reliably identify targets, the method fuses the temperature information of the thermal infrared camera with the color information of the color camera: a dual-channel neural network learns the target features of each modality separately, the features are fused at a certain layer of the bimodal network, and the fused feature information is then passed to the classification layers of the network for target prediction. This enriches the feature information describing each target and improves recognition accuracy.
The technical scheme is as follows: the invention provides a target detection method based on feature fusion of a color camera and a thermal infrared imager, which comprises the following steps:
a. obtaining a color data set through a color camera, and obtaining a thermal infrared data set through an infrared thermal imager;
the step a comprises the following steps:
a.1. Fix the color camera and the infrared thermal imager on a sensor bracket so that their viewing angles coincide; then jointly calibrate the two sensors to obtain the intrinsic and extrinsic parameter matrices of each, completing spatial synchronization;
a.2. capture the environment with the infrared thermal imager and the color camera simultaneously in real time to acquire a color data set and a thermal infrared data set with synchronized timestamps, completing temporal synchronization;
a.3. undistort the captured color and thermal infrared data sets according to the intrinsic and extrinsic parameter matrices of the two sensors, then register them so that the pixels of the color data set correspond one-to-one with the pixels of the thermal infrared data set.
In step a.1, a Zhang Zhengyou calibration board is used for the joint calibration, and the intrinsic and extrinsic parameter matrices of the two sensors are obtained by the Zhang Zhengyou calibration method.
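A minimal sketch of steps a.1-a.3 in Python with OpenCV, assuming the joint Zhang calibration has already produced the intrinsic matrices, distortion coefficients and a color-to-thermal homography; all variable names here are illustrative, not from the patent:

    import cv2

    def register_pair(color_img, ir_img, K_color, dist_color, K_ir, dist_ir, H):
        # Step a.3: remove lens distortion with each sensor's intrinsic
        # parameters obtained from the Zhang Zhengyou calibration (step a.1)
        color_u = cv2.undistort(color_img, K_color, dist_color)
        ir_u = cv2.undistort(ir_img, K_ir, dist_ir)
        # Warp the thermal frame into the color frame so that pixels
        # correspond one-to-one; H is the assumed planar homography
        h, w = color_u.shape[:2]
        ir_reg = cv2.warpPerspective(ir_u, H, (w, h))
        return color_u, ir_reg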
b. Input the bimodal data set consisting of the color data set and the thermal infrared data set into the bimodal YOLOv3 neural network simultaneously, and extract the color and temperature features of the target; fuse the features of the two modalities at a certain layer of the YOLOv3 backbone network through a fusion function and a 1×1 convolution block, then continue backbone feature extraction on the fused feature map to obtain the fused extracted feature maps;
The bimodal YOLOv3 neural network of step b has a dual-channel input layer: one channel takes the color data set and the other takes the thermal infrared data set.
The network consists of a backbone and subsequent convolution layers; the backbone is Darknet-53 with 52 convolution layers in total, and the subsequent convolution layers total 23 layers.
The fusion function of step b is y_i = f(p_i, q_i),
where p_i is the feature map matrix of the color data set at the given layer, with dimension n×c1×h×w, and q_i is the feature map matrix of the thermal infrared data set at that layer, with dimension n×c2×h×w;
n is the number of images, h the height and w the width of the feature map matrix, c1 the number of channels of the color feature map matrix and c2 the number of channels of the thermal infrared feature map matrix;
the fused matrix y_i has dimension n×c0×h×w, where c0 = c1 + c2.
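Concretely, the fusion function amounts to concatenation along the channel axis. A minimal TensorFlow sketch using the channel-first n×c×h×w layout of the text (the shapes are illustrative):

    import tensorflow as tf

    def fusion(p, q):
        # p: n x c1 x h x w color features; q: n x c2 x h x w thermal features.
        # Concatenating along axis 1 gives y with dimension n x (c1 + c2) x h x w.
        return tf.concat([p, q], axis=1)

    p = tf.random.normal([2, 3, 416, 416])  # n = 2, c1 = 3
    q = tf.random.normal([2, 3, 416, 416])  # n = 2, c2 = 3
    y = fusion(p, q)
    print(y.shape)  # (2, 6, 416, 416), i.e. c0 = c1 + c2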
The step b may adopt any one of the following schemes:
Scheme one:
b.1. Fuse with the 1×1 convolution block and the fusion function at layer 1 of the backbone network:
the color data set and the thermal infrared data set are input simultaneously into the first layer of the bimodal YOLOv3 network, and the images of the two modalities are linearly stacked by the fusion function to obtain a stacked data set; the color data set has dimension n×c1×h×w, the thermal infrared data set n×c2×h×w, and the stacked data set n×c0×h×w, where c0 = c1 + c2.
The 1×1 convolution block comprises 3 convolution kernels of dimension c0×1×1 and an activation function.
When each kernel extracts image features, the c0×1×1 kernel is weighted-summed with the c0×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h×w.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the fused image matrix n×3×h×w is output.
b.2. The fused image matrix is then fed through the 52 layers of the original backbone for feature extraction, from shallow edge features such as single lines and colors to the deep semantic features of image parts. Because one 1×1 convolution layer has been added to the network, the indices of the remaining convolution layers each increase by 1: layer 26 outputs the first extracted feature map, layer 43 the second, and layer 52 the third. The first extracted feature map has matrix dimension n×256×h/8×w/8, the second n×512×h/16×w/16 and the third n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
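A sketch of the scheme-one fuse step under stated assumptions: the two 3-channel inputs are concatenated, then a 1×1 convolution with 3 kernels and an activation restores the n×3×h×w shape expected by the unmodified Darknet-53 input. The channels-last layout is used because that is the tf.keras default, and leaky ReLU is taken from the activation named later in the embodiment:

    import tensorflow as tf

    # 1x1 convolution block of scheme one: 3 kernels of dimension c0 x 1 x 1
    # plus an activation, collapsing c0 = 6 channels back to 3
    fuse_1x1 = tf.keras.layers.Conv2D(filters=3, kernel_size=1,
                                      activation=tf.nn.leaky_relu)

    color = tf.random.normal([2, 416, 416, 3])      # n x h x w x c1
    thermal = tf.random.normal([2, 416, 416, 3])    # n x h x w x c2
    stacked = tf.concat([color, thermal], axis=-1)  # c0 = 6 channels
    fused = fuse_1x1(stacked)
    print(fused.shape)  # (2, 416, 416, 3): ready for the standard backbone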
Scheme two:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with the first 25 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8.
b.2. Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps of the two modalities output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.3. The matrix of the first extracted feature map is then fed into the remaining convolution layers of the backbone for further feature extraction; since one 1×1 convolution layer has been inserted at layer 26, the indices of the subsequent backbone layers each increase by one. Layer 43 outputs the second extracted feature map and layer 52 the third; the second has matrix dimension n×512×h/16×w/16 and the third n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
Scheme three:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with the first 42 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts.
b.2. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8.
Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block used at layer 25 comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.3. After the first 42 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×512×h/16×w/16.
Fuse the two modal feature maps output by layer 42 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 42 are linearly stacked by the fusion function into a stacked data set of dimension n×1024×h/16×w/16.
The 1×1 convolution block used at layer 42 comprises 512 convolution kernels of dimension 1024×1×1 and an activation function.
When each kernel extracts image features, the 1024×1×1 kernel is weighted-summed with the 1024×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/16×w/16.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the second extracted feature map, n×512×h/16×w/16, is output.
b.4. The matrix of the second extracted feature map is then fed into the remaining convolution layers of the backbone for further feature extraction until the third extracted feature map is output; its matrix dimension is n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
Scheme four:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with all 52 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts.
b.2. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8;
after the 42nd layer the output matrices of both data sets are n×512×h/16×w/16;
after the 51st layer the output matrices of both data sets are n×1024×h/32×w/32.
b.3. Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block used at layer 25 comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.4. Fuse the two modal feature maps output by layer 42 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 42 are linearly stacked by the fusion function into a stacked data set of dimension n×1024×h/16×w/16.
The 1×1 convolution block used at layer 42 comprises 512 convolution kernels of dimension 1024×1×1 and an activation function.
When each kernel extracts image features, the 1024×1×1 kernel is weighted-summed with the 1024×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/16×w/16.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the second extracted feature map, n×512×h/16×w/16, is output.
b.5. Fuse the two modal feature maps output by layer 51 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 51 are linearly stacked by the fusion function into a stacked data set of dimension n×2048×h/32×w/32.
The 1×1 convolution block used at layer 51 comprises 1024 convolution kernels of dimension 2048×1×1 and an activation function.
When each kernel extracts image features, the 2048×1×1 kernel is weighted-summed with the 2048×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/32×w/32.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the third extracted feature map, n×1024×h/32×w/32, is output. (A structural sketch of how the four schemes place the fusion step follows below.)
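The four schemes differ only in where the fuse step sits: layer 1 (scheme one), layer 25 (scheme two), layers 25 and 42 (scheme three), or layers 25, 42 and 51 (scheme four). A structural sketch of the mid-level case, scheme two, with stand-in segments replacing the real Darknet-53 stages (everything named here is an illustrative assumption, not the patented network):

    import tensorflow as tf

    def fuse_block(c_out):
        # fusion function + 1x1 convolution block: channel concatenation
        # followed by c_out kernels of size 1x1 and a leaky-ReLU activation
        conv = tf.keras.layers.Conv2D(c_out, 1, activation=tf.nn.leaky_relu)
        return lambda p, q: conv(tf.concat([p, q], axis=-1))

    def segment(c_out):
        # stand-in for a backbone segment; a real Darknet-53 stage goes here
        return tf.keras.layers.Conv2D(c_out, 3, strides=2, padding="same",
                                      activation=tf.nn.leaky_relu)

    color_in = tf.keras.Input([416, 416, 3])
    ir_in = tf.keras.Input([416, 416, 3])
    p = segment(256)(color_in)        # color stream, "layers 1-25"
    q = segment(256)(ir_in)           # thermal stream, "layers 1-25"
    y = fuse_block(256)(p, q)         # fusion at "layer 25" (scheme two)
    out = segment(512)(y)             # single fused stream, "layers 26-52"
    model = tf.keras.Model([color_in, ir_in], out)

Schemes three and four repeat the fuse_block call at the later split points, keeping both streams alive until the last fusion.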
c. Input the fused extracted feature maps into the subsequent convolution layers to classify targets, and finally output the trained bimodal neural network model.
Step c comprises: input the third extracted feature map into the subsequent convolution layers, which alternately execute 1×1 and 3×3 convolutions; after u layers, the first prediction feature map matrix is output. After layer u+1 the channel count of the third extracted feature map is reduced to 256 and a channel-wise linear stacking with the second extracted feature map is performed; after 5 further convolution layers, i.e. after layer u+6, the second prediction feature map matrix is generated. After layer u+7 the channel count is reduced to 128 and a channel-wise linear stacking with the first extracted feature map is performed; after layer u+12, the third prediction feature map matrix is generated. The first, second and third prediction feature map matrices have dimensions M×h/32×w/32, M×h/16×w/16 and M×h/8×w/8 respectively,
where M = 3×(5+m) and m is the number of predicted target classes;
when step b selects scheme one, scheme two or scheme three, u=58;
when scheme four is selected in step b, u=57.
d. The bimodal neural network model is evaluated with the following parameters:
parameter one: the training time required by the YOLOv3 neural network under each of the four fusion schemes of step b;
parameter two: color and thermal infrared data sets not used in training are input into the bimodal network models obtained with the four fusion schemes after the same number of training iterations, and the value of the network loss function is measured;
parameter three: the detection threshold of the model is set to 0.3 and 0.5 respectively, giving the mAP value of the network under each threshold;
parameter four: targets in the untrained data set are detected with the model at thresholds 0.3 and 0.5, giving the numbers of correctly and incorrectly predicted labels (a counting sketch follows this list);
parameter five: the real-time performance of target detection for the model obtained with each of the four fusion schemes.
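A counting sketch for parameter four, with an assumed detection format (box, class, confidence) and a plain IoU matcher; none of this is prescribed by the patent:

    def iou(a, b):
        # intersection-over-union of two (x1, y1, x2, y2) boxes
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def count_labels(detections, truths, conf_thresh, iou_thresh=0.5):
        correct = wrong = 0
        for box, cls, conf in detections:
            if conf < conf_thresh:
                continue  # below the detection threshold: not counted
            hit = any(cls == t_cls and iou(box, t_box) >= iou_thresh
                      for t_box, t_cls in truths)
            correct += hit
            wrong += not hit
        return correct, wrong

    dets = [((10, 10, 50, 50), 0, 0.9), ((60, 60, 90, 90), 1, 0.4)]
    gts = [((12, 11, 49, 52), 0)]
    for t in (0.3, 0.5):  # the two thresholds of parameters three and four
        print(t, count_labels(dets, gts, t))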
Beneficial effects: the invention inputs a bimodal data set consisting of a color data set and a thermal infrared data set into a neural network for training, uses the bimodal network to extract the color and temperature information of a target simultaneously, and performs 1×1 convolution fusion at a certain layer of the backbone, realizing target detection based on feature fusion of a color camera and an infrared thermal imager. Unlike traditional single-modality detection with a color camera alone, the infrared camera adds the temperature dimension of the target and can improve recognition accuracy.
The feature extraction process is in essence multidimensional matrix multiplication: when an image matrix enters the network, the various convolution kernels are multiplied with it, and stacking the resulting matrices along the channel dimension yields a feature map. The color camera and the infrared thermal imager are multidimensional descriptions of the same target; as the dimensionality of the feature description grows, the convolution kernels extract more features, the feature maps become richer, and classification in the classification layers becomes easier.
The fusion method adopted by the invention not only changes the number of input channels but also replaces the simple linear stacking of channels with a nonlinear output, which lets the algorithm extract more features of the same target, giving stronger prediction ability and robustness.
The bimodal YOLOv3 network extracts the temperature and color features of the same target together with the deep semantic features hidden in the image; through training it fits the nonlinear color features well and the nonlinear infrared features even better, so that in dim environments the network can judge the category from the features of the infrared image alone, increasing recognition accuracy.
Drawings
FIG. 1 is a schematic diagram of the structure of the bimodal YOLOv3 neural network algorithm of the present invention;
FIG. 2 is a schematic diagram of a backbone network algorithm structure of a first fusion scheme;
fig. 3 is a schematic diagram of a backbone network algorithm structure of a second fusion scheme;
fig. 4 is a schematic diagram of a backbone network algorithm structure of a third fusion scheme;
fig. 5 is a schematic diagram of a backbone network algorithm structure of the fusion scheme four.
Detailed Description
Referring to fig. 1, the invention discloses a target detection method based on feature fusion of a color camera and a thermal infrared imager, which comprises the following steps:
a. obtaining a color data set through a color camera, and obtaining a thermal infrared data set through an infrared thermal imager;
the step a comprises the following steps:
a.1. Fix the color camera and the infrared thermal imager on a sensor bracket so that their viewing angles coincide; then jointly calibrate the two sensors to obtain the intrinsic and extrinsic parameter matrices of each, completing spatial synchronization;
a.2. capture the environment with the infrared thermal imager and the color camera simultaneously in real time to acquire a color data set and a thermal infrared data set with synchronized timestamps, completing temporal synchronization;
a.3. undistort the captured color and thermal infrared data sets according to the intrinsic and extrinsic parameter matrices of the two sensors, then register them so that the pixels of the color data set correspond one-to-one with the pixels of the thermal infrared data set.
In step a.1, a Zhang Zhengyou calibration board is used for the joint calibration, and the intrinsic and extrinsic parameter matrices of the two sensors are obtained by the Zhang Zhengyou calibration method.
b. 70% of the bimodal data set consisting of the color data set and the thermal infrared data set is input simultaneously into the bimodal YOLOv3 neural network, and the color and temperature features of the target are extracted;
the characteristic extraction process is actually multiplication of a multidimensional matrix, when an image matrix is input into a neural network, various convolution kernels are multiplied by the matrix, and when the matrix obtained by multiplying the convolution kernels by the image matrix is overlapped by the channel number, a characteristic diagram about characteristics is obtained; the color camera and the infrared thermal imager are multidimensional description of the same target, when the dimension of the feature description is increased, the features extracted by using the convolution kernel are increased, and at the moment, the features on the feature map are richer, so that the classification of the classification layer is facilitated;
fuse the features of the two modalities at a certain layer of the YOLOv3 backbone network through a fusion function and a 1×1 convolution block, then continue backbone feature extraction on the fused feature map to obtain the fused extracted feature maps;
the bimodal YOLOv3 neural network of step b has a dual-channel input layer: one channel takes the color data set and the other takes the thermal infrared data set;
the network consists of a backbone and subsequent convolution layers; the backbone is Darknet-53 with 52 convolution layers in total, and the subsequent convolution layers total 23 layers.
The fusion function of step b is y_i = f(p_i, q_i),
where p_i is the feature map matrix of the color data set at the given layer, with dimension n×c1×h×w, and q_i is the feature map matrix of the thermal infrared data set at that layer, with dimension n×c2×h×w;
n is the number of images, h the height and w the width of the feature map matrix, c1 the number of channels of the color feature map matrix and c2 the number of channels of the thermal infrared feature map matrix;
the fused matrix y_i has dimension n×c0×h×w, where c0 = c1 + c2.
The 52 convolution layers of YOLOv3 consist of several plain convolution layers plus 5 groups of repeated residual blocks. After an image enters the network, one convolution layer is applied first, then a convolution with stride 2, followed by the repeated residual blocks; the five groups repeat 1, 2, 8, 8 and 4 times respectively. Each residual block consists of two convolution layers plus a residual addition: the two convolutions execute 1×1 then 3×3, halving and then restoring the channel count relative to the previous layer, while the residual path adds the block input without any convolution. For example, for a 416×416 input the five stride-2 convolutions reduce the image to 208×208, 104×104, 52×52, 26×26 and 13×13, a reduction by a factor of 2^5 = 32 in each dimension over the 5 downsamplings; the residual groups with repeat counts 8, 8 and 4 yield the feature maps 256×52×52, 512×26×26 and 1024×13×13 respectively, and these feature maps are passed on to the subsequent classification layers.
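A skeleton of that layout (repeat counts 1, 2, 8, 8 and 4; stride-2 convolutions for the five downsamplings). Batch normalization is omitted, so this is a simplified sketch rather than the exact Darknet-53:

    import tensorflow as tf
    from tensorflow.keras import layers

    def conv(x, filters, kernel, stride=1):
        return layers.Conv2D(filters, kernel, strides=stride, padding="same",
                             activation=tf.nn.leaky_relu)(x)

    def residual_block(x):
        # two-layer unit: 1x1 halves the channels, 3x3 restores them, and the
        # residual path adds the block input without any convolution
        c = x.shape[-1]
        y = conv(x, c // 2, 1)
        y = conv(y, c, 3)
        return x + y

    inputs = tf.keras.Input([416, 416, 3])
    x = conv(inputs, 32, 3)                          # initial convolution
    scales = []
    for filters, repeats in [(64, 1), (128, 2), (256, 8), (512, 8), (1024, 4)]:
        x = conv(x, filters, 3, stride=2)            # 416 -> 208 -> ... -> 13
        for _ in range(repeats):
            x = residual_block(x)
        scales.append(x)
    # the last three scales are the 52x52x256, 26x26x512 and 13x13x1024 maps
    print([s.shape for s in scales[-3:]])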
Step b may adopt any one of the following four schemes; in this embodiment the color data sets input to all four schemes have dimension n×3×416×416, and the thermal infrared data sets likewise n×3×416×416.
Scheme one is shown in fig. 2:
b.1. Fuse with the 1×1 convolution block and the fusion function at layer 1 of the backbone network:
the color data set and the thermal infrared data set are input simultaneously into the first layer of the bimodal YOLOv3 network, and the images of the two modalities are linearly stacked by the fusion function to obtain a stacked data set; the color data set has dimension n×c1×h×w, the thermal infrared data set n×c2×h×w, and the stacked data set n×c0×h×w, where c0 = c1 + c2.
The 1×1 convolution block comprises 3 convolution kernels of dimension c0×1×1 and an activation function.
When each kernel extracts image features, the c0×1×1 kernel is weighted-summed with the c0×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h×w.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the fused image matrix n×3×h×w is output.
The activation function ensures that the output of a convolution layer is not merely a linear combination of the previous layer but a nonlinear functional mapping; with purely linear combinations, any number of convolution layers would be equivalent to a single layer, shrinking the family of functions the network can fit until only linearly separable classifications remain. For the activation operation this embodiment calls the tf.nn.leaky_relu function, an activation commonly used in neural networks: it only needs to test whether the input is greater than zero, scaling values below zero by a small negative-slope coefficient while keeping the positive part linear.
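For instance, tf.nn.leaky_relu scales negative inputs by the slope alpha, 0.2 by default:

    import tensorflow as tf

    x = tf.constant([-2.0, -0.5, 0.0, 1.0])
    print(tf.nn.leaky_relu(x).numpy())             # [-0.4 -0.1  0.  1.]
    print(tf.nn.leaky_relu(x, alpha=0.1).numpy())  # [-0.2 -0.05 0.  1.]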
The above fusion method not only changes the number of input channels but also replaces the simple linear stacking of channels with a nonlinear output, which lets the algorithm extract more features of the same target, giving stronger prediction ability and robustness.
b.2. The fused image matrix is then fed through the 52 layers of the original backbone for feature extraction, from shallow edge features such as single lines and colors to the deep semantic features of image parts. Because one 1×1 convolution layer has been added to the network, the indices of the remaining convolution layers each increase by 1: layer 26 outputs the first extracted feature map, layer 43 the second, and layer 52 the third. The first extracted feature map has matrix dimension n×256×h/8×w/8, the second n×512×h/16×w/16 and the third n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
scheme two is shown in figure 3:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with the first 25 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8.
b.2. Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps of the two modalities output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.3. The matrix of the first extracted feature map is then fed into the remaining convolution layers of the backbone for further feature extraction; since one 1×1 convolution layer has been inserted at layer 26, the indices of the subsequent backbone layers each increase by one. Layer 43 outputs the second extracted feature map and layer 52 the third; the second has matrix dimension n×512×h/16×w/16 and the third n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
Scheme three is shown in fig. 4:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with the first 42 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts.
b.2. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8.
Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block used at layer 25 comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.3. After the first 42 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×512×h/16×w/16.
Fuse the two modal feature maps output by layer 42 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 42 are linearly stacked by the fusion function into a stacked data set of dimension n×1024×h/16×w/16.
The 1×1 convolution block used at layer 42 comprises 512 convolution kernels of dimension 1024×1×1 and an activation function.
When each kernel extracts image features, the 1024×1×1 kernel is weighted-summed with the 1024×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/16×w/16.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the second extracted feature map, n×512×h/16×w/16, is output.
b.4. The matrix of the second extracted feature map is then fed into the remaining convolution layers of the backbone for further feature extraction until the third extracted feature map is output; its matrix dimension is n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
scheme four is shown in fig. 5:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with all 52 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts.
b.2. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8;
after the 42nd layer the output matrices of both data sets are n×512×h/16×w/16;
after the 51st layer the output matrices of both data sets are n×1024×h/32×w/32.
b.3. Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block used at layer 25 comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.4. Fuse the two modal feature maps output by layer 42 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 42 are linearly stacked by the fusion function into a stacked data set of dimension n×1024×h/16×w/16.
The 1×1 convolution block used at layer 42 comprises 512 convolution kernels of dimension 1024×1×1 and an activation function.
When each kernel extracts image features, the 1024×1×1 kernel is weighted-summed with the 1024×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/16×w/16.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the second extracted feature map, n×512×h/16×w/16, is output.
b.5. Fuse the two modal feature maps output by layer 51 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 51 are linearly stacked by the fusion function into a stacked data set of dimension n×2048×h/32×w/32.
The 1×1 convolution block used at layer 51 comprises 1024 convolution kernels of dimension 2048×1×1 and an activation function.
When each kernel extracts image features, the 2048×1×1 kernel is weighted-summed with the 2048×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/32×w/32.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the third extracted feature map, n×1024×h/32×w/32, is output.
c. Input the fused extracted feature maps into the subsequent convolution layers to classify targets, and finally output the trained bimodal neural network model.
Step c comprises: input the third extracted feature map into the subsequent convolution layers, which alternately execute 1×1 and 3×3 convolutions; after u layers, the first prediction feature map matrix is output. After layer u+1 the channel count of the third extracted feature map is reduced to 256 and a channel-wise linear stacking with the second extracted feature map is performed; after 5 further convolution layers, i.e. after layer u+6, the second prediction feature map matrix is generated. After layer u+7 the channel count is reduced to 128 and a channel-wise linear stacking with the first extracted feature map is performed; after layer u+12, the third prediction feature map matrix is generated. The first, second and third prediction feature map matrices have dimensions M×h/32×w/32, M×h/16×w/16 and M×h/8×w/8 respectively,
where M = 3×(5+m) and m is the number of predicted target classes; in this embodiment m is 2, the predicted targets being classified as human or vehicle;
when step b selects scheme one, scheme two or scheme three, u=58;
when scheme four is selected in step b, u=57.
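A worked check of the prediction-map dimensions for this embodiment (m = 2 classes, 416×416 input, so h/32 = 13, h/16 = 26 and h/8 = 52):

    # M = 3 x (5 + m): three anchor boxes per grid cell, each predicting
    # 4 box offsets, 1 objectness score and m class scores
    m = 2                   # classes in this embodiment: human, vehicle
    M = 3 * (5 + m)         # = 21 channels
    h = w = 416
    for stride in (32, 16, 8):
        print((M, h // stride, w // stride))
    # (21, 13, 13), (21, 26, 26), (21, 52, 52)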
The bimodal YOLOv3 network thus extracts the temperature and color features of the same target together with the deep semantic features hidden in the image; in practice, through training the network fits the nonlinear color features well and the nonlinear infrared features even better, so that in dim environments it can judge the category from the features of the infrared image alone, increasing recognition accuracy.
d. The bimodal neural network model is evaluated with the following parameters:
parameter one: the training time required by the YOLOv3 neural network under each of the four fusion schemes of step b;
parameter two: the remaining 30% of the color and thermal infrared data sets, unseen during training, are input into the bimodal network models obtained with the four fusion schemes after the same number of training iterations, and the value of the network loss function is measured;
parameter three: the detection threshold of the model is set to 0.3 and 0.5 respectively, giving the mAP value of the network under each threshold;
parameter four: targets in the untrained data set are detected with the model at thresholds 0.3 and 0.5, giving the numbers of correctly and incorrectly predicted labels;
parameter five: the real-time performance of target detection for the model obtained with each of the four fusion schemes.

Claims (6)

1. A target detection method based on feature fusion of a color camera and an infrared thermal imager, characterized by comprising the following steps:
a. obtaining a color data set through a color camera, and obtaining a thermal infrared data set through an infrared thermal imager;
b. input a bimodal data set consisting of the color data set and the thermal infrared data set simultaneously into a bimodal YOLOv3 neural network, and extract the color and temperature features of a target; fuse the features of the two modalities at a certain layer of the YOLOv3 backbone network through a fusion function and a 1×1 convolution block, then continue backbone feature extraction on the fused feature map to obtain the fused extracted feature maps;
c. input the fused extracted feature maps into the subsequent convolution layers to classify targets, and finally output the trained bimodal neural network model;
the bimodal YOLOv3 neural network algorithm comprises a dual-channel input layer; one channel of the input layer inputs a color dataset and the other channel inputs a thermal infrared dataset;
the bimodal YOLOv3 neural network comprises a backbone network and subsequent convolution layers; the backbone is Darknet-53 with 52 convolution layers in total, and the subsequent convolution layers total 23 layers;
the fusion function of step b is y_i = f(p_i, q_i),
where p_i is the feature map matrix of the color data set at the given layer, with dimension n×c1×h×w, and q_i is the feature map matrix of the thermal infrared data set at that layer, with dimension n×c2×h×w;
n is the number of images, h the height and w the width of the feature map matrix, c1 the number of channels of the color feature map matrix and c2 the number of channels of the thermal infrared feature map matrix;
the fused matrix y_i has dimension n×c0×h×w, where c0 = c1 + c2.
2. The target detection method based on feature fusion of a color camera and a thermal infrared imager according to claim 1, wherein step b adopts any one of the following schemes:
Scheme one:
b.1. fuse with the 1×1 convolution block and the fusion function at layer 1 of the backbone network:
the color data set and the thermal infrared data set are input simultaneously into the first layer of the bimodal YOLOv3 network, and the images of the two modalities are linearly stacked by the fusion function to obtain a stacked data set; the color data set has dimension n×c1×h×w, the thermal infrared data set n×c2×h×w, and the stacked data set n×c0×h×w, where c0 = c1 + c2;
the 1×1 convolution block comprises 3 convolution kernels of dimension c0×1×1 and an activation function;
when each kernel extracts image features, the c0×1×1 kernel is weighted-summed with the c0×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h×w;
after every image of the stacked data set has been weighted-summed and passed through the activation function, the fused image matrix n×3×h×w is output;
b.2. the fused image matrix is then fed through the 52 layers of the original backbone for feature extraction, from shallow edge features such as single lines and colors to the deep semantic features of image parts; because one 1×1 convolution layer has been added, the indices of the remaining convolution layers each increase by 1: layer 26 outputs the first extracted feature map, layer 43 the second, and layer 52 the third; the first extracted feature map has matrix dimension n×256×h/8×w/8, the second n×512×h/16×w/16 and the third n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing;
Scheme II,
b.1, inputting the color data set and the thermal infrared data set into a bimodal YOLOv3 neural network algorithm at the same time, respectively extracting features of the bimodal data set by using the first 25 convolution layers of a main network, extracting marginal features such as single lines and colors from shallow layers by convolution operation, and then extracting deep semantic features of a certain part on a deep image; after the front 25 layers of convolution, the characteristic diagram output matrixes of the color data set and the thermal infrared data set are n multiplied by 256 multiplied by h/8 multiplied by w/8;
b.2, the dataset images of the two modalities output by the 25th layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×512×h/8×w/8;
the 1×1 convolution block comprises 256 convolution kernels of dimension 512×1×1 and an activation function;
when each convolution kernel extracts image features, the 512×1×1 kernel is weighted and summed with the 512×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/8×w/8;
each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the first extracted feature map is output as n×256×h/8×w/8;
b.3, the matrix of the first extracted feature map is then input into the remaining convolution layers of the backbone network for further feature extraction; because one 1×1 convolution layer has been added as the 26th layer, the sequence numbers of the convolution layers after it each increase by one; the 43rd layer outputs the second extracted feature map and the 52nd layer outputs the third extracted feature map, with matrix dimensions n×512×h/16×w/16 and n×1024×h/32×w/32 respectively, at which point execution of the Darknet-53 convolution layers ends;
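A hedged sketch of the mid-level fusion block used in Scheme Two (the class name and activation choice are assumptions; the inputs stand in for the layer-25 outputs of the two modality branches):

```python
import torch
import torch.nn as nn

class MidFusion(nn.Module):
    """Scheme Two: fuse the two 256-channel feature maps from layer 25.

    Concatenation gives 512 channels; 256 kernels of 512 x 1 x 1 then
    restore the n x 256 x h/8 x w/8 shape the 26th layer expects.
    """
    def __init__(self, channels: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.LeakyReLU(0.1),
        )

    def forward(self, f_color, f_thermal):
        # f_color, f_thermal: n x 256 x h/8 x w/8 from each branch's layer 25
        return self.fuse(torch.cat([f_color, f_thermal], dim=1))
```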
Scheme Three:
b.1, the color dataset and the thermal infrared dataset are input simultaneously into the bimodal YOLOv3 neural network, and the first 42 convolution layers of the backbone network extract features from each modality separately; the convolution operations extract shallow edge features such as single lines and colors, then deep semantic features of a certain part of the image;
b.2, after the 25th layer of convolution, the feature map output matrices of the color dataset and the thermal infrared dataset are each n×256×h/8×w/8;
the dataset images of the two modalities output by the 25th layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×512×h/8×w/8;
the 1×1 convolution block used at the 25th layer comprises 256 convolution kernels of dimension 512×1×1 and an activation function; when each convolution kernel extracts image features, the 512×1×1 kernel is weighted and summed with the 512×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/8×w/8; each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the first extracted feature map is output as n×256×h/8×w/8;
b.3, after the 42nd layer of convolution, the feature map output matrices of the color dataset and the thermal infrared dataset are each n×512×h/16×w/16;
the dataset images of the two modalities output by the 42nd layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×1024×h/16×w/16;
the 1×1 convolution block used at the 42nd layer comprises 512 convolution kernels of dimension 1024×1×1 and an activation function; when each convolution kernel extracts image features, the 1024×1×1 kernel is weighted and summed with the 1024×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/16×w/16; each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the second extracted feature map is output as n×512×h/16×w/16;
b.4, the matrix of the second extracted feature map is then input into the remaining convolution layers of the backbone network for further feature extraction until the third extracted feature map is output; the matrix dimension of the third extracted feature map is n×1024×h/32×w/32, at which point execution of the Darknet-53 convolution layers ends;
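Scheme Three applies the same concat-plus-1×1 pattern at two depths; a generic block parameterized by channel count covers both fusion points. The sketch below assumes the per-modality backbone segments are available as callables and that the layer 26–42 segments continue from each branch's own layer-25 output (a routing the claim implies but does not state explicitly):

```python
import torch
import torch.nn as nn

def fusion_block(channels: int) -> nn.Sequential:
    """`channels` kernels of (2*channels) x 1 x 1 restore the channel
    count after the two modality maps are concatenated."""
    return nn.Sequential(
        nn.Conv2d(2 * channels, channels, kernel_size=1),
        nn.LeakyReLU(0.1),
    )

fuse25 = fusion_block(256)   # layer 25 tap: 512 -> 256 at h/8 x w/8
fuse42 = fusion_block(512)   # layer 42 tap: 1024 -> 512 at h/16 x w/16

def scheme_three(color, thermal, color_net, thermal_net, tail):
    # color_net / thermal_net: dicts of per-modality backbone segments;
    # tail: the shared remaining layers after the 42nd-layer fusion
    fc25, ft25 = color_net["l1_25"](color), thermal_net["l1_25"](thermal)
    first = fuse25(torch.cat([fc25, ft25], dim=1))    # n x 256 x h/8 x w/8
    fc42, ft42 = color_net["l26_42"](fc25), thermal_net["l26_42"](ft25)
    second = fuse42(torch.cat([fc42, ft42], dim=1))   # n x 512 x h/16 x w/16
    third = tail(second)                              # n x 1024 x h/32 x w/32
    return first, second, third
```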
Scheme Four:
b.1, the color dataset and the thermal infrared dataset are input simultaneously into the bimodal YOLOv3 neural network, and all 52 convolution layers of the backbone network extract features from each modality separately; the convolution operations extract shallow edge features such as single lines and colors, then deep semantic features of a certain part of the image;
b.2, after the 25th layer of convolution, the feature map output matrices of the color dataset and the thermal infrared dataset are each n×256×h/8×w/8;
after the 42nd layer of convolution, the feature map output matrices of the color dataset and the thermal infrared dataset are each n×512×h/16×w/16;
after the 51st layer of convolution, the feature map output matrices of the color dataset and the thermal infrared dataset are each n×1024×h/32×w/32;
b.3, the dataset images of the two modalities output by the 25th layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×512×h/8×w/8;
the 1×1 convolution block used at the 25th layer comprises 256 convolution kernels of dimension 512×1×1 and an activation function; when each convolution kernel extracts image features, the 512×1×1 kernel is weighted and summed with the 512×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/8×w/8; each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the first extracted feature map is output as n×256×h/8×w/8;
b.4, the dataset images of the two modalities output by the 42nd layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×1024×h/16×w/16;
the 1×1 convolution block used at the 42nd layer comprises 512 convolution kernels of dimension 1024×1×1 and an activation function; when each convolution kernel extracts image features, the 1024×1×1 kernel is weighted and summed with the 1024×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/16×w/16; each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the second extracted feature map is output as n×512×h/16×w/16;
b.5, the dataset images of the two modalities output by the 51st layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×2048×h/32×w/32;
the 1×1 convolution block used at the 51st layer comprises 1024 convolution kernels of dimension 2048×1×1 and an activation function; when each convolution kernel extracts image features, the 2048×1×1 kernel is weighted and summed with the 2048×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/32×w/32; each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the third extracted feature map is output as n×1024×h/32×w/32.
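Scheme Four keeps the two modality backbones fully separate and taps fused maps at three scales; a hedged sketch follows (the module layout, tap-dictionary interface, and activation choice are assumptions, not the patent's implementation):

```python
import torch
import torch.nn as nn

class LateFusionTaps(nn.Module):
    """Scheme Four: both modalities run all 52 backbone layers; fused
    feature maps are tapped at layers 25, 42 and 51 for the head."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.ModuleDict({
            "l25": nn.Sequential(nn.Conv2d(512, 256, 1), nn.LeakyReLU(0.1)),
            "l42": nn.Sequential(nn.Conv2d(1024, 512, 1), nn.LeakyReLU(0.1)),
            "l51": nn.Sequential(nn.Conv2d(2048, 1024, 1), nn.LeakyReLU(0.1)),
        })

    def forward(self, taps_color, taps_thermal):
        # taps_*: dicts of per-modality feature maps at layers 25/42/51
        return tuple(
            self.fuse[k](torch.cat([taps_color[k], taps_thermal[k]], dim=1))
            for k in ("l25", "l42", "l51")
        )  # n x 256 x h/8 x w/8, n x 512 x h/16 x w/16, n x 1024 x h/32 x w/32
```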
3. The target detection method based on feature fusion of a color camera and a thermal infrared imager according to claim 2, wherein step c comprises: inputting the third extracted feature map into the subsequent convolution layers, which alternately execute 1×1 and 3×3 convolution operations, and outputting the first prediction feature map matrix after the u-th layer; after the (u+1)-th layer, the channel number of the original third extracted feature map becomes 256 and it is linearly superimposed channel-wise with the second extracted feature map; after five further convolution layers, the second prediction feature map matrix is generated at the (u+6)-th convolution layer; then, after the (u+7)-th layer, the channel number becomes 128 and a channel-wise linear superposition with the first extracted feature map is performed; the third prediction feature map matrix is generated after the (u+12)-th layer; the first, second and third prediction feature map matrices are M×h/32×w/32, M×h/16×w/16 and M×h/8×w/8, respectively;
where M = 3×(5+m), m representing the number of classes of the predicted targets;
when step b selects Scheme One, Scheme Two or Scheme Three, u = 58;
when step b selects Scheme Four, u = 57.
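As a quick worked check of the channel count (the class count here is illustrative, not from the patent): with m = 2 target classes, M = 3×(5+2) = 21, i.e. 3 anchor boxes per grid cell, each predicting 4 bounding-box offsets, 1 objectness score and 2 class scores, so the first prediction feature map would be 21×h/32×w/32.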
4. The target detection method based on feature fusion of a color camera and a thermal infrared imager according to claim 3, wherein step a comprises:
a.1, fixing the color camera and the thermal infrared imager on a sensor bracket so that their viewing angles are consistent; then jointly calibrating the color camera and the thermal infrared imager to obtain the intrinsic and extrinsic parameter matrices of each, completing spatial synchronization;
a.2, acquiring the environment simultaneously and in real time with the thermal infrared imager and the color camera to obtain a color dataset and a thermal infrared dataset with synchronized timestamps, completing temporal synchronization;
a.3, de-distorting the captured color dataset and thermal infrared dataset according to the intrinsic and extrinsic parameter matrices of the color camera and the thermal infrared imager, then registering them so that the pixel points in the color dataset correspond one-to-one with the pixel points in the thermal infrared dataset.
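A minimal sketch of the de-distortion and registration step using OpenCV (the homography-based warp is one plausible realization; the patent does not specify the registration model, and all variable names are assumptions):

```python
import cv2
import numpy as np

def undistort_and_register(color_img, thermal_img,
                           K_c, dist_c, K_t, dist_t, H_t2c):
    """De-distort both frames with their own intrinsics, then warp the
    thermal frame onto the color frame with a precomputed homography
    H_t2c so pixels correspond one-to-one."""
    color_u = cv2.undistort(color_img, K_c, dist_c)
    thermal_u = cv2.undistort(thermal_img, K_t, dist_t)
    h, w = color_u.shape[:2]
    thermal_reg = cv2.warpPerspective(thermal_u, H_t2c, (w, h))
    return color_u, thermal_reg
```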
5. The target detection method based on feature fusion of a color camera and a thermal infrared imager according to claim 4, wherein a Zhang Zhengyou calibration board is used for the joint calibration in step a.1, and the intrinsic and extrinsic parameter matrices of the two sensors are obtained by the Zhang Zhengyou calibration method.
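Zhang Zhengyou's method is what OpenCV's `calibrateCamera` implements; a sketch for one sensor (the checkerboard dimensions, square size, and image list are assumptions; for the thermal imager the board typically needs to be thermally distinct so corners are visible, a practical detail the claim does not state):

```python
import cv2
import numpy as np

def zhang_calibrate(images, pattern=(9, 6), square=0.025):
    """Zhang-style calibration from checkerboard views of one sensor.
    Returns the intrinsic matrix K, distortion coefficients, and the
    per-view extrinsics (rvecs, tvecs)."""
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts = [], []
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img
        ok, corners = cv2.findChessboardCorners(gray, pattern)
        if ok:
            obj_pts.append(objp)
            img_pts.append(corners)
    _, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_pts, img_pts, gray.shape[::-1], None, None)
    return K, dist, rvecs, tvecs
```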
6. The target detection method based on feature fusion of a color camera and a thermal infrared imager according to any one of claims 2 to 5, further comprising:
d. evaluating the algorithm model of the bimodal neural network by the following parameters:
parameter one: the training time required by the YOLOv3 neural network algorithm under each of the four fusion schemes of step b;
parameter two: color datasets and thermal infrared datasets not used in training are input into the bimodal neural network models obtained from the four fusion schemes at the same number of training iterations, and the value of the network loss function is obtained from the test;
parameter three: the detection thresholds of the bimodal neural network models are set to 0.3 and 0.5 respectively, and the mAP values of the networks under the two thresholds are obtained;
parameter four: the models with thresholds of 0.3 and 0.5 are used to detect targets in an untrained dataset, yielding the numbers of correctly and incorrectly predicted labels (a minimal counting sketch follows this list);
parameter five: the real-time performance of target detection by the bimodal neural network models obtained from each of the four fusion schemes.
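Parameters three and four hinge on matching predicted boxes to ground truth at a threshold; below is a minimal counting sketch (greedy IoU matching is assumed, and the claim does not specify whether 0.3/0.5 are confidence or IoU thresholds — IoU is assumed here; all names are illustrative):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def count_tp_fp(pred_boxes, gt_boxes, thr=0.5):
    """Greedily match predictions to ground truth; returns (TP, FP)."""
    unmatched = list(gt_boxes)
    tp = 0
    for p in pred_boxes:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            tp += 1
    return tp, len(pred_boxes) - tp
```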