CN111382683B - Target detection method based on feature fusion of color camera and infrared thermal imager


Info

Publication number
CN111382683B
Authority
CN
China
Legal status: Active
Application number
CN202010135485.0A
Other languages
Chinese (zh)
Other versions
CN111382683A (en)
Inventor
殷国栋
吴愿
薛培林
耿可可
庄伟超
黄文涵
沈童
于晨风
邹伟
卢彦博
王金湘
张宁
陈建松
任祖平
Current Assignee
Southeast University
Original Assignee
Southeast University
Application filed by Southeast University
Priority to CN202010135485.0A
Publication of CN111382683A
Application granted
Publication of CN111382683B

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing

Abstract

The invention discloses a target detection method based on feature fusion of a color camera and a thermal infrared imager, which comprises the following steps: a. obtain a color data set with a color camera and a thermal infrared data set with an infrared thermal imager; b. input the bimodal data set into a bimodal YOLOv3 neural network and extract the color and temperature features of the target; fuse the features of the two modalities at a certain layer of the backbone network through a fusion function and a 1×1 convolution block, then continue feature extraction on the fused feature map to obtain the fused extracted feature maps; c. input the fused extracted feature maps into the subsequent convolution layers to classify targets, and output the trained bimodal neural network model. By fusing temperature and color information inside the bimodal backbone network and predicting targets in the classification layers, the method enriches the feature information available for each target and improves target recognition accuracy.

Description

Target detection method based on feature fusion of color camera and infrared thermal imager
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a target detection method based on feature fusion of a color camera and a thermal infrared imager.
Background
In complex terrain environments (changeable backgrounds, occlusion by terrain and ground objects) and in low-visibility natural conditions such as rain, haze and darkness, traditional target recognition schemes achieve low accuracy and cannot meet the requirements of automatic driving. A color camera is highly sensitive to illumination; when lighting is poor the neural network cannot extract complete target features, so improving the network algorithm alone cannot raise recognition accuracy. To ensure that an automatic driving vehicle perceives potential safety hazards in the road environment promptly and accurately, takes measures quickly and avoids traffic accidents, a multi-sensor joint observation method is often adopted.
The existing automatic driving perception platform mainly comprises sensors such as a lidar, a color camera, a thermal infrared imager and a millimeter-wave radar. The lidar scans the vehicle's surroundings to obtain three-dimensional distance information and senses the driving road conditions through distance analysis and recognition; it offers high measurement accuracy and is little affected by the environment, but it is expensive and cannot capture the morphological characteristics of a target. The millimeter-wave radar mainly obtains target distance information; it features short wavelength, wide bandwidth and strong penetration, with small size and high ranging precision, but it likewise cannot detect target morphology and is relatively expensive, so it is not well suited to target detection and recognition. The color camera and the infrared thermal imager are mainly responsible for identifying targets, with the lowest cost, convenient installation, small size and low energy consumption: the color camera captures rich texture information, and the thermal infrared imager captures the temperature information of the target.
Joint target detection with a color camera and an infrared thermal imager is therefore widely applied, and how to fuse the temperature information of the thermal infrared camera with the color information of the color camera to achieve the best detection effect is the problem to be solved.
Disclosure of Invention
To solve the above problems, the invention provides a target detection method based on feature fusion of a color camera and an infrared thermal imager. Aimed at scenes where a color camera alone cannot reliably identify targets, the method fuses the temperature information of the thermal infrared camera with the color information of the color camera: a dual-channel neural network learns the target features of each modality separately, the features are fused at a certain layer of the bimodal network, and the fused feature information is then passed to the classification layers of the network for target prediction. This enriches the feature information describing each target and improves recognition accuracy.
The technical scheme is as follows: the invention provides a target detection method based on feature fusion of a color camera and a thermal infrared imager, which comprises the following steps:
a. obtaining a color data set through a color camera, and obtaining a thermal infrared data set through an infrared thermal imager;
the step a comprises the following steps:
a.1. Fix the color camera and the infrared thermal imager on a sensor bracket so that their viewing angles coincide; then jointly calibrate the two sensors to obtain the intrinsic and extrinsic parameter matrices of each, completing spatial synchronization;
a.2. capture the environment with the infrared thermal imager and the color camera simultaneously in real time to acquire a color data set and a thermal infrared data set with synchronized timestamps, completing temporal synchronization;
a.3. undistort the captured color and thermal infrared data sets according to the intrinsic and extrinsic parameter matrices of the two sensors, then register them so that the pixels of the color data set correspond one-to-one with the pixels of the thermal infrared data set.
In step a.1, a Zhang Zhengyou calibration board is used for the joint calibration, and the intrinsic and extrinsic parameter matrices of the two sensors are obtained by the Zhang Zhengyou calibration method.
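A minimal sketch of steps a.1-a.3 in Python with OpenCV, assuming the joint Zhang calibration has already produced the intrinsic matrices, distortion coefficients and a color-to-thermal homography; all variable names here are illustrative, not from the patent:

    import cv2

    def register_pair(color_img, ir_img, K_color, dist_color, K_ir, dist_ir, H):
        # Step a.3: remove lens distortion with each sensor's intrinsic
        # parameters obtained from the Zhang Zhengyou calibration (step a.1)
        color_u = cv2.undistort(color_img, K_color, dist_color)
        ir_u = cv2.undistort(ir_img, K_ir, dist_ir)
        # Warp the thermal frame into the color frame so that pixels
        # correspond one-to-one; H is the assumed planar homography
        h, w = color_u.shape[:2]
        ir_reg = cv2.warpPerspective(ir_u, H, (w, h))
        return color_u, ir_reg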
b. Input the bimodal data set consisting of the color data set and the thermal infrared data set into the bimodal YOLOv3 neural network simultaneously, and extract the color and temperature features of the target; fuse the features of the two modalities at a certain layer of the YOLOv3 backbone network through a fusion function and a 1×1 convolution block, then continue backbone feature extraction on the fused feature map to obtain the fused extracted feature maps;
The bimodal YOLOv3 neural network of step b has a dual-channel input layer: one channel takes the color data set and the other takes the thermal infrared data set.
The network consists of a backbone and subsequent convolution layers; the backbone is Darknet-53 with 52 convolution layers in total, and the subsequent convolution layers total 23 layers.
The fusion function of step b is y_i = f(p_i, q_i),
where p_i is the feature map matrix of the color data set at the given layer, with dimension n×c1×h×w, and q_i is the feature map matrix of the thermal infrared data set at that layer, with dimension n×c2×h×w;
n is the number of images, h the height and w the width of the feature map matrix, c1 the number of channels of the color feature map matrix and c2 the number of channels of the thermal infrared feature map matrix;
the fused matrix y_i has dimension n×c0×h×w, where c0 = c1 + c2.
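Concretely, the fusion function amounts to concatenation along the channel axis. A minimal TensorFlow sketch using the channel-first n×c×h×w layout of the text (the shapes are illustrative):

    import tensorflow as tf

    def fusion(p, q):
        # p: n x c1 x h x w color features; q: n x c2 x h x w thermal features.
        # Concatenating along axis 1 gives y with dimension n x (c1 + c2) x h x w.
        return tf.concat([p, q], axis=1)

    p = tf.random.normal([2, 3, 416, 416])  # n = 2, c1 = 3
    q = tf.random.normal([2, 3, 416, 416])  # n = 2, c2 = 3
    y = fusion(p, q)
    print(y.shape)  # (2, 6, 416, 416), i.e. c0 = c1 + c2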
The step b may adopt any one of the following schemes:
Scheme one:
b.1. Fuse with the 1×1 convolution block and the fusion function at layer 1 of the backbone network:
the color data set and the thermal infrared data set are input simultaneously into the first layer of the bimodal YOLOv3 network, and the images of the two modalities are linearly stacked by the fusion function to obtain a stacked data set; the color data set has dimension n×c1×h×w, the thermal infrared data set n×c2×h×w, and the stacked data set n×c0×h×w, where c0 = c1 + c2.
The 1×1 convolution block comprises 3 convolution kernels of dimension c0×1×1 and an activation function.
When each kernel extracts image features, the c0×1×1 kernel is weighted-summed with the c0×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h×w.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the fused image matrix n×3×h×w is output.
b.2. The fused image matrix is then fed through the 52 layers of the original backbone for feature extraction, from shallow edge features such as single lines and colors to the deep semantic features of image parts. Because one 1×1 convolution layer has been added to the network, the indices of the remaining convolution layers each increase by 1: layer 26 outputs the first extracted feature map, layer 43 the second, and layer 52 the third. The first extracted feature map has matrix dimension n×256×h/8×w/8, the second n×512×h/16×w/16 and the third n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
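A sketch of the scheme-one fuse step under stated assumptions: the two 3-channel inputs are concatenated, then a 1×1 convolution with 3 kernels and an activation restores the n×3×h×w shape expected by the unmodified Darknet-53 input. The channels-last layout is used because that is the tf.keras default, and leaky ReLU is taken from the activation named later in the embodiment:

    import tensorflow as tf

    # 1x1 convolution block of scheme one: 3 kernels of dimension c0 x 1 x 1
    # plus an activation, collapsing c0 = 6 channels back to 3
    fuse_1x1 = tf.keras.layers.Conv2D(filters=3, kernel_size=1,
                                      activation=tf.nn.leaky_relu)

    color = tf.random.normal([2, 416, 416, 3])      # n x h x w x c1
    thermal = tf.random.normal([2, 416, 416, 3])    # n x h x w x c2
    stacked = tf.concat([color, thermal], axis=-1)  # c0 = 6 channels
    fused = fuse_1x1(stacked)
    print(fused.shape)  # (2, 416, 416, 3): ready for the standard backbone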
Scheme two:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with the first 25 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8.
b.2. Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps of the two modalities output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.3. The matrix of the first extracted feature map is then fed into the remaining convolution layers of the backbone for further feature extraction; since one 1×1 convolution layer has been inserted at layer 26, the indices of the subsequent backbone layers each increase by one. Layer 43 outputs the second extracted feature map and layer 52 the third; the second has matrix dimension n×512×h/16×w/16 and the third n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
Scheme three:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with the first 42 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts.
b.2. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8.
Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block used at layer 25 comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.3. After the first 42 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×512×h/16×w/16.
Fuse the two modal feature maps output by layer 42 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 42 are linearly stacked by the fusion function into a stacked data set of dimension n×1024×h/16×w/16.
The 1×1 convolution block used at layer 42 comprises 512 convolution kernels of dimension 1024×1×1 and an activation function.
When each kernel extracts image features, the 1024×1×1 kernel is weighted-summed with the 1024×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/16×w/16.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the second extracted feature map, n×512×h/16×w/16, is output.
b.4. The matrix of the second extracted feature map is then fed into the remaining convolution layers of the backbone for further feature extraction until the third extracted feature map is output; its matrix dimension is n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
Scheme four:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with all 52 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts.
b.2. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8;
after the 42nd layer the output matrices of both data sets are n×512×h/16×w/16;
after the 51st layer the output matrices of both data sets are n×1024×h/32×w/32.
b.3. Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block used at layer 25 comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.4. Fuse the two modal feature maps output by layer 42 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 42 are linearly stacked by the fusion function into a stacked data set of dimension n×1024×h/16×w/16.
The 1×1 convolution block used at layer 42 comprises 512 convolution kernels of dimension 1024×1×1 and an activation function.
When each kernel extracts image features, the 1024×1×1 kernel is weighted-summed with the 1024×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/16×w/16.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the second extracted feature map, n×512×h/16×w/16, is output.
b.5. Fuse the two modal feature maps output by layer 51 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 51 are linearly stacked by the fusion function into a stacked data set of dimension n×2048×h/32×w/32.
The 1×1 convolution block used at layer 51 comprises 1024 convolution kernels of dimension 2048×1×1 and an activation function.
When each kernel extracts image features, the 2048×1×1 kernel is weighted-summed with the 2048×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/32×w/32.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the third extracted feature map, n×1024×h/32×w/32, is output. (A structural sketch of how the four schemes place the fusion step follows below.)
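The four schemes differ only in where the fuse step sits: layer 1 (scheme one), layer 25 (scheme two), layers 25 and 42 (scheme three), or layers 25, 42 and 51 (scheme four). A structural sketch of the mid-level case, scheme two, with stand-in segments replacing the real Darknet-53 stages (everything named here is an illustrative assumption, not the patented network):

    import tensorflow as tf

    def fuse_block(c_out):
        # fusion function + 1x1 convolution block: channel concatenation
        # followed by c_out kernels of size 1x1 and a leaky-ReLU activation
        conv = tf.keras.layers.Conv2D(c_out, 1, activation=tf.nn.leaky_relu)
        return lambda p, q: conv(tf.concat([p, q], axis=-1))

    def segment(c_out):
        # stand-in for a backbone segment; a real Darknet-53 stage goes here
        return tf.keras.layers.Conv2D(c_out, 3, strides=2, padding="same",
                                      activation=tf.nn.leaky_relu)

    color_in = tf.keras.Input([416, 416, 3])
    ir_in = tf.keras.Input([416, 416, 3])
    p = segment(256)(color_in)        # color stream, "layers 1-25"
    q = segment(256)(ir_in)           # thermal stream, "layers 1-25"
    y = fuse_block(256)(p, q)         # fusion at "layer 25" (scheme two)
    out = segment(512)(y)             # single fused stream, "layers 26-52"
    model = tf.keras.Model([color_in, ir_in], out)

Schemes three and four repeat the fuse_block call at the later split points, keeping both streams alive until the last fusion.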
c. Input the fused extracted feature maps into the subsequent convolution layers to classify targets, and finally output the trained bimodal neural network model.
Step c comprises: input the third extracted feature map into the subsequent convolution layers, which alternately execute 1×1 and 3×3 convolutions; after u layers, the first prediction feature map matrix is output. After layer u+1 the channel count of the third extracted feature map is reduced to 256 and a channel-wise linear stacking with the second extracted feature map is performed; after 5 further convolution layers, i.e. after layer u+6, the second prediction feature map matrix is generated. After layer u+7 the channel count is reduced to 128 and a channel-wise linear stacking with the first extracted feature map is performed; after layer u+12, the third prediction feature map matrix is generated. The first, second and third prediction feature map matrices have dimensions M×h/32×w/32, M×h/16×w/16 and M×h/8×w/8 respectively,
where M = 3×(5+m) and m is the number of predicted target classes;
when step b selects scheme one, scheme two or scheme three, u=58;
when scheme four is selected in step b, u=57.
d. The bimodal neural network model is evaluated with the following parameters:
parameter one: the training time required by the YOLOv3 neural network under each of the four fusion schemes of step b;
parameter two: color and thermal infrared data sets not used in training are input into the bimodal network models obtained with the four fusion schemes after the same number of training iterations, and the value of the network loss function is measured;
parameter three: the detection threshold of the model is set to 0.3 and 0.5 respectively, giving the mAP value of the network under each threshold;
parameter four: targets in the untrained data set are detected with the model at thresholds 0.3 and 0.5, giving the numbers of correctly and incorrectly predicted labels (a counting sketch follows this list);
parameter five: the real-time performance of target detection for the model obtained with each of the four fusion schemes.
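A counting sketch for parameter four, with an assumed detection format (box, class, confidence) and a plain IoU matcher; none of this is prescribed by the patent:

    def iou(a, b):
        # intersection-over-union of two (x1, y1, x2, y2) boxes
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def count_labels(detections, truths, conf_thresh, iou_thresh=0.5):
        correct = wrong = 0
        for box, cls, conf in detections:
            if conf < conf_thresh:
                continue  # below the detection threshold: not counted
            hit = any(cls == t_cls and iou(box, t_box) >= iou_thresh
                      for t_box, t_cls in truths)
            correct += hit
            wrong += not hit
        return correct, wrong

    dets = [((10, 10, 50, 50), 0, 0.9), ((60, 60, 90, 90), 1, 0.4)]
    gts = [((12, 11, 49, 52), 0)]
    for t in (0.3, 0.5):  # the two thresholds of parameters three and four
        print(t, count_labels(dets, gts, t))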
Beneficial effects: the invention inputs a bimodal data set consisting of a color data set and a thermal infrared data set into a neural network for training, uses the bimodal network to extract the color and temperature information of a target simultaneously, and performs 1×1 convolution fusion at a certain layer of the backbone, realizing target detection based on feature fusion of a color camera and an infrared thermal imager. Unlike traditional single-modality detection with a color camera alone, the infrared camera adds the temperature dimension of the target and can improve recognition accuracy.
The feature extraction process is in essence multidimensional matrix multiplication: when an image matrix enters the network, the various convolution kernels are multiplied with it, and stacking the resulting matrices along the channel dimension yields a feature map. The color camera and the infrared thermal imager are multidimensional descriptions of the same target; as the dimensionality of the feature description grows, the convolution kernels extract more features, the feature maps become richer, and classification in the classification layers becomes easier.
The fusion method adopted by the invention not only changes the number of input channels but also replaces the simple linear stacking of channels with a nonlinear output, which lets the algorithm extract more features of the same target, giving stronger prediction ability and robustness.
The bimodal YOLOv3 network extracts the temperature and color features of the same target together with the deep semantic features hidden in the image; through training it fits the nonlinear color features well and the nonlinear infrared features even better, so that in dim environments the network can judge the category from the features of the infrared image alone, increasing recognition accuracy.
Drawings
FIG. 1 is a schematic diagram of the structure of the bimodal YOLOv3 neural network algorithm of the present invention;
FIG. 2 is a schematic diagram of a backbone network algorithm structure of a first fusion scheme;
fig. 3 is a schematic diagram of a backbone network algorithm structure of a second fusion scheme;
fig. 4 is a schematic diagram of a backbone network algorithm structure of a third fusion scheme;
fig. 5 is a schematic diagram of a backbone network algorithm structure of the fusion scheme four.
Detailed Description
Referring to fig. 1, the invention discloses a target detection method based on feature fusion of a color camera and a thermal infrared imager, which comprises the following steps:
a. obtaining a color data set through a color camera, and obtaining a thermal infrared data set through an infrared thermal imager;
the step a comprises the following steps:
a.1. Fix the color camera and the infrared thermal imager on a sensor bracket so that their viewing angles coincide; then jointly calibrate the two sensors to obtain the intrinsic and extrinsic parameter matrices of each, completing spatial synchronization;
a.2. capture the environment with the infrared thermal imager and the color camera simultaneously in real time to acquire a color data set and a thermal infrared data set with synchronized timestamps, completing temporal synchronization;
a.3. undistort the captured color and thermal infrared data sets according to the intrinsic and extrinsic parameter matrices of the two sensors, then register them so that the pixels of the color data set correspond one-to-one with the pixels of the thermal infrared data set.
In step a.1, a Zhang Zhengyou calibration board is used for the joint calibration, and the intrinsic and extrinsic parameter matrices of the two sensors are obtained by the Zhang Zhengyou calibration method.
b. 70% of the bimodal data set consisting of the color data set and the thermal infrared data set is input simultaneously into the bimodal YOLOv3 neural network, and the color and temperature features of the target are extracted;
the characteristic extraction process is actually multiplication of a multidimensional matrix, when an image matrix is input into a neural network, various convolution kernels are multiplied by the matrix, and when the matrix obtained by multiplying the convolution kernels by the image matrix is overlapped by the channel number, a characteristic diagram about characteristics is obtained; the color camera and the infrared thermal imager are multidimensional description of the same target, when the dimension of the feature description is increased, the features extracted by using the convolution kernel are increased, and at the moment, the features on the feature map are richer, so that the classification of the classification layer is facilitated;
fuse the features of the two modalities at a certain layer of the YOLOv3 backbone network through a fusion function and a 1×1 convolution block, then continue backbone feature extraction on the fused feature map to obtain the fused extracted feature maps;
the bimodal YOLOv3 neural network of step b has a dual-channel input layer: one channel takes the color data set and the other takes the thermal infrared data set;
the network consists of a backbone and subsequent convolution layers; the backbone is Darknet-53 with 52 convolution layers in total, and the subsequent convolution layers total 23 layers.
The fusion function of step b is y_i = f(p_i, q_i),
where p_i is the feature map matrix of the color data set at the given layer, with dimension n×c1×h×w, and q_i is the feature map matrix of the thermal infrared data set at that layer, with dimension n×c2×h×w;
n is the number of images, h the height and w the width of the feature map matrix, c1 the number of channels of the color feature map matrix and c2 the number of channels of the thermal infrared feature map matrix;
the fused matrix y_i has dimension n×c0×h×w, where c0 = c1 + c2.
The 52 convolution layers of YOLOv3 consist of several plain convolution layers plus 5 groups of repeated residual blocks. After an image enters the network, one convolution layer is applied first, then a convolution with stride 2, followed by the repeated residual blocks; the five groups repeat 1, 2, 8, 8 and 4 times respectively. Each residual block consists of two convolution layers plus a residual addition: the two convolutions execute 1×1 then 3×3, halving and then restoring the channel count relative to the previous layer, while the residual path adds the block input without any convolution. For example, for a 416×416 input the five stride-2 convolutions reduce the image to 208×208, 104×104, 52×52, 26×26 and 13×13, a reduction by a factor of 2^5 = 32 in each dimension over the 5 downsamplings; the residual groups with repeat counts 8, 8 and 4 yield the feature maps 256×52×52, 512×26×26 and 1024×13×13 respectively, and these feature maps are passed on to the subsequent classification layers.
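A skeleton of that layout (repeat counts 1, 2, 8, 8 and 4; stride-2 convolutions for the five downsamplings). Batch normalization is omitted, so this is a simplified sketch rather than the exact Darknet-53:

    import tensorflow as tf
    from tensorflow.keras import layers

    def conv(x, filters, kernel, stride=1):
        return layers.Conv2D(filters, kernel, strides=stride, padding="same",
                             activation=tf.nn.leaky_relu)(x)

    def residual_block(x):
        # two-layer unit: 1x1 halves the channels, 3x3 restores them, and the
        # residual path adds the block input without any convolution
        c = x.shape[-1]
        y = conv(x, c // 2, 1)
        y = conv(y, c, 3)
        return x + y

    inputs = tf.keras.Input([416, 416, 3])
    x = conv(inputs, 32, 3)                          # initial convolution
    scales = []
    for filters, repeats in [(64, 1), (128, 2), (256, 8), (512, 8), (1024, 4)]:
        x = conv(x, filters, 3, stride=2)            # 416 -> 208 -> ... -> 13
        for _ in range(repeats):
            x = residual_block(x)
        scales.append(x)
    # the last three scales are the 52x52x256, 26x26x512 and 13x13x1024 maps
    print([s.shape for s in scales[-3:]])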
Step b may adopt any one of the following four schemes; in this embodiment the color data sets input to all four schemes have dimension n×3×416×416, and the thermal infrared data sets likewise n×3×416×416.
Scheme one is shown in fig. 2:
b.1. Fuse with the 1×1 convolution block and the fusion function at layer 1 of the backbone network:
the color data set and the thermal infrared data set are input simultaneously into the first layer of the bimodal YOLOv3 network, and the images of the two modalities are linearly stacked by the fusion function to obtain a stacked data set; the color data set has dimension n×c1×h×w, the thermal infrared data set n×c2×h×w, and the stacked data set n×c0×h×w, where c0 = c1 + c2.
The 1×1 convolution block comprises 3 convolution kernels of dimension c0×1×1 and an activation function.
When each kernel extracts image features, the c0×1×1 kernel is weighted-summed with the c0×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h×w.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the fused image matrix n×3×h×w is output.
The activation function ensures that the output of a convolution layer is not merely a linear combination of the previous layer but a nonlinear functional mapping; with purely linear combinations, any number of convolution layers would be equivalent to a single layer, shrinking the family of functions the network can fit until only linearly separable classifications remain. For the activation operation this embodiment calls the tf.nn.leaky_relu function, an activation commonly used in neural networks: it only needs to test whether the input is greater than zero, scaling values below zero by a small negative-slope coefficient while keeping the positive part linear.
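For instance, tf.nn.leaky_relu scales negative inputs by the slope alpha, 0.2 by default:

    import tensorflow as tf

    x = tf.constant([-2.0, -0.5, 0.0, 1.0])
    print(tf.nn.leaky_relu(x).numpy())             # [-0.4 -0.1  0.  1.]
    print(tf.nn.leaky_relu(x, alpha=0.1).numpy())  # [-0.2 -0.05 0.  1.]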
The above fusion method not only changes the number of input channels but also replaces the simple linear stacking of channels with a nonlinear output, which lets the algorithm extract more features of the same target, giving stronger prediction ability and robustness.
b.2. The fused image matrix is then fed through the 52 layers of the original backbone for feature extraction, from shallow edge features such as single lines and colors to the deep semantic features of image parts. Because one 1×1 convolution layer has been added to the network, the indices of the remaining convolution layers each increase by 1: layer 26 outputs the first extracted feature map, layer 43 the second, and layer 52 the third. The first extracted feature map has matrix dimension n×256×h/8×w/8, the second n×512×h/16×w/16 and the third n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
scheme two is shown in figure 3:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with the first 25 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8.
b.2. Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps of the two modalities output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.3. The matrix of the first extracted feature map is then fed into the remaining convolution layers of the backbone for further feature extraction; since one 1×1 convolution layer has been inserted at layer 26, the indices of the subsequent backbone layers each increase by one. Layer 43 outputs the second extracted feature map and layer 52 the third; the second has matrix dimension n×512×h/16×w/16 and the third n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
Scheme three is shown in fig. 4:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with the first 42 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts.
b.2. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8.
Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block used at layer 25 comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.3. After the first 42 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×512×h/16×w/16.
Fuse the two modal feature maps output by layer 42 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 42 are linearly stacked by the fusion function into a stacked data set of dimension n×1024×h/16×w/16.
The 1×1 convolution block used at layer 42 comprises 512 convolution kernels of dimension 1024×1×1 and an activation function.
When each kernel extracts image features, the 1024×1×1 kernel is weighted-summed with the 1024×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/16×w/16.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the second extracted feature map, n×512×h/16×w/16, is output.
b.4. The matrix of the second extracted feature map is then fed into the remaining convolution layers of the backbone for further feature extraction until the third extracted feature map is output; its matrix dimension is n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing.
scheme four is shown in fig. 5:
b.1. Input the color data set and the thermal infrared data set simultaneously into the bimodal YOLOv3 network and extract features from each modality separately with all 52 convolution layers of the backbone, the convolutions proceeding from shallow edge features such as single lines and colors to the deep semantic features of image parts.
b.2. After the first 25 layers of convolution, the feature map output matrices of the color and thermal infrared data sets are both n×256×h/8×w/8;
after the 42nd layer the output matrices of both data sets are n×512×h/16×w/16;
after the 51st layer the output matrices of both data sets are n×1024×h/32×w/32.
b.3. Fuse the two modal feature maps output by layer 25 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 25 are linearly stacked by the fusion function into a stacked data set of dimension n×512×h/8×w/8.
The 1×1 convolution block used at layer 25 comprises 256 convolution kernels of dimension 512×1×1 and an activation function.
When each kernel extracts image features, the 512×1×1 kernel is weighted-summed with the 512×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/8×w/8.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the first extracted feature map, n×256×h/8×w/8, is output.
b.4. Fuse the two modal feature maps output by layer 42 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 42 are linearly stacked by the fusion function into a stacked data set of dimension n×1024×h/16×w/16.
The 1×1 convolution block used at layer 42 comprises 512 convolution kernels of dimension 1024×1×1 and an activation function.
When each kernel extracts image features, the 1024×1×1 kernel is weighted-summed with the 1024×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/16×w/16.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the second extracted feature map, n×512×h/16×w/16, is output.
b.5. Fuse the two modal feature maps output by layer 51 with the fusion function and a 1×1 convolution block:
the feature maps output by layer 51 are linearly stacked by the fusion function into a stacked data set of dimension n×2048×h/32×w/32.
The 1×1 convolution block used at layer 51 comprises 1024 convolution kernels of dimension 2048×1×1 and an activation function.
When each kernel extracts image features, the 2048×1×1 kernel is weighted-summed with the 2048×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h/32×w/32.
After every image of the stacked data set has been weighted-summed and passed through the activation function, the matrix of the third extracted feature map, n×1024×h/32×w/32, is output.
c. Input the fused extracted feature maps into the subsequent convolution layers to classify targets, and finally output the trained bimodal neural network model.
Step c comprises: input the third extracted feature map into the subsequent convolution layers, which alternately execute 1×1 and 3×3 convolutions; after u layers, the first prediction feature map matrix is output. After layer u+1 the channel count of the third extracted feature map is reduced to 256 and a channel-wise linear stacking with the second extracted feature map is performed; after 5 further convolution layers, i.e. after layer u+6, the second prediction feature map matrix is generated. After layer u+7 the channel count is reduced to 128 and a channel-wise linear stacking with the first extracted feature map is performed; after layer u+12, the third prediction feature map matrix is generated. The first, second and third prediction feature map matrices have dimensions M×h/32×w/32, M×h/16×w/16 and M×h/8×w/8 respectively,
where M = 3×(5+m) and m is the number of predicted target classes; in this embodiment m is 2, the predicted targets being classified as human or vehicle;
when step b selects scheme one, scheme two or scheme three, u=58;
when scheme four is selected in step b, u=57.
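A worked check of the prediction-map dimensions for this embodiment (m = 2 classes, 416×416 input, so h/32 = 13, h/16 = 26 and h/8 = 52):

    # M = 3 x (5 + m): three anchor boxes per grid cell, each predicting
    # 4 box offsets, 1 objectness score and m class scores
    m = 2                   # classes in this embodiment: human, vehicle
    M = 3 * (5 + m)         # = 21 channels
    h = w = 416
    for stride in (32, 16, 8):
        print((M, h // stride, w // stride))
    # (21, 13, 13), (21, 26, 26), (21, 52, 52)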
The bimodal YOLOv3 network thus extracts the temperature and color features of the same target together with the deep semantic features hidden in the image; in practice, through training the network fits the nonlinear color features well and the nonlinear infrared features even better, so that in dim environments it can judge the category from the features of the infrared image alone, increasing recognition accuracy.
d. The bimodal neural network model is evaluated with the following parameters:
parameter one: the training time required by the YOLOv3 neural network under each of the four fusion schemes of step b;
parameter two: the remaining 30% of the color and thermal infrared data sets, unseen during training, are input into the bimodal network models obtained with the four fusion schemes after the same number of training iterations, and the value of the network loss function is measured;
parameter three: the detection threshold of the model is set to 0.3 and 0.5 respectively, giving the mAP value of the network under each threshold;
parameter four: targets in the untrained data set are detected with the model at thresholds 0.3 and 0.5, giving the numbers of correctly and incorrectly predicted labels;
parameter five: the real-time performance of target detection for the model obtained with each of the four fusion schemes.

Claims (6)

1. A target detection method based on feature fusion of a color camera and an infrared thermal imager, characterized by comprising the following steps:
a. obtaining a color data set through a color camera, and obtaining a thermal infrared data set through an infrared thermal imager;
b. input a bimodal data set consisting of the color data set and the thermal infrared data set simultaneously into a bimodal YOLOv3 neural network, and extract the color and temperature features of a target; fuse the features of the two modalities at a certain layer of the YOLOv3 backbone network through a fusion function and a 1×1 convolution block, then continue backbone feature extraction on the fused feature map to obtain the fused extracted feature maps;
c. input the fused extracted feature maps into the subsequent convolution layers to classify targets, and finally output the trained bimodal neural network model;
the bimodal YOLOv3 neural network algorithm comprises a dual-channel input layer; one channel of the input layer inputs a color dataset and the other channel inputs a thermal infrared dataset;
the bimodal YOLOv3 neural network comprises a backbone network and subsequent convolution layers; the backbone is Darknet-53 with 52 convolution layers in total, and the subsequent convolution layers total 23 layers;
the fusion function of step b is y_i = f(p_i, q_i),
where p_i is the feature map matrix of the color data set at the given layer, with dimension n×c1×h×w, and q_i is the feature map matrix of the thermal infrared data set at that layer, with dimension n×c2×h×w;
n is the number of images, h the height and w the width of the feature map matrix, c1 the number of channels of the color feature map matrix and c2 the number of channels of the thermal infrared feature map matrix;
the fused matrix y_i has dimension n×c0×h×w, where c0 = c1 + c2.
2. The target detection method based on feature fusion of a color camera and a thermal infrared imager according to claim 1, wherein step b adopts any one of the following schemes:
Scheme one:
b.1. fuse with the 1×1 convolution block and the fusion function at layer 1 of the backbone network:
the color data set and the thermal infrared data set are input simultaneously into the first layer of the bimodal YOLOv3 network, and the images of the two modalities are linearly stacked by the fusion function to obtain a stacked data set; the color data set has dimension n×c1×h×w, the thermal infrared data set n×c2×h×w, and the stacked data set n×c0×h×w, where c0 = c1 + c2;
the 1×1 convolution block comprises 3 convolution kernels of dimension c0×1×1 and an activation function;
when each kernel extracts image features, the c0×1×1 kernel is weighted-summed with the c0×1×1 local matrix at each unit position of the stacked image, producing a 1×1 output; after the weighted summation the matrix of a single image becomes 1×h×w;
after every image of the stacked data set has been weighted-summed and passed through the activation function, the fused image matrix n×3×h×w is output;
b.2. the fused image matrix is then fed through the 52 layers of the original backbone for feature extraction, from shallow edge features such as single lines and colors to the deep semantic features of image parts; because one 1×1 convolution layer has been added, the indices of the remaining convolution layers each increase by 1: layer 26 outputs the first extracted feature map, layer 43 the second, and layer 52 the third; the first extracted feature map has matrix dimension n×256×h/8×w/8, the second n×512×h/16×w/16 and the third n×1024×h/32×w/32, at which point the Darknet-53 convolution layers finish executing;
Scheme II,
b.1, inputting the color data set and the thermal infrared data set into a bimodal YOLOv3 neural network algorithm at the same time, respectively extracting features of the bimodal data set by using the first 25 convolution layers of a main network, extracting marginal features such as single lines and colors from shallow layers by convolution operation, and then extracting deep semantic features of a certain part on a deep image; after the front 25 layers of convolution, the characteristic diagram output matrixes of the color data set and the thermal infrared data set are n multiplied by 256 multiplied by h/8 multiplied by w/8;
b.2, the dataset images of the two modalities output by the 25th layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×512×h/8×w/8;
the 1×1 convolution block comprises 256 convolution kernels of dimension 512×1×1 and an activation function;
when each convolution kernel extracts image features, the 512×1×1 kernel is weighted and summed with the 512×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/8×w/8;
each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the first extracted feature map is output as n×256×h/8×w/8;
b.3, the matrix of the first extracted feature map is then input into the remaining convolution layers of the backbone network for further feature extraction; because one 1×1 convolution layer has been added as the 26th layer, the sequence numbers of the convolution layers after it each increase by one; the 43rd layer outputs the second extracted feature map and the 52nd layer outputs the third extracted feature map, with matrix dimensions n×512×h/16×w/16 and n×1024×h/32×w/32 respectively, at which point execution of the Darknet-53 convolution layers ends;
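A hedged sketch of the mid-level fusion block used in Scheme Two (the class name and activation choice are assumptions; the inputs stand in for the layer-25 outputs of the two modality branches):

```python
import torch
import torch.nn as nn

class MidFusion(nn.Module):
    """Scheme Two: fuse the two 256-channel feature maps from layer 25.

    Concatenation gives 512 channels; 256 kernels of 512 x 1 x 1 then
    restore the n x 256 x h/8 x w/8 shape the 26th layer expects.
    """
    def __init__(self, channels: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.LeakyReLU(0.1),
        )

    def forward(self, f_color, f_thermal):
        # f_color, f_thermal: n x 256 x h/8 x w/8 from each branch's layer 25
        return self.fuse(torch.cat([f_color, f_thermal], dim=1))
```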
Scheme Three:
b.1, the color dataset and the thermal infrared dataset are input simultaneously into the bimodal YOLOv3 neural network, and the first 42 convolution layers of the backbone network extract features from each modality separately; the convolution operations extract shallow edge features such as single lines and colors, then deep semantic features of a certain part of the image;
b.2, after the 25th layer of convolution, the feature map output matrices of the color dataset and the thermal infrared dataset are each n×256×h/8×w/8;
the dataset images of the two modalities output by the 25th layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×512×h/8×w/8;
the 1×1 convolution block used at the 25th layer comprises 256 convolution kernels of dimension 512×1×1 and an activation function; when each convolution kernel extracts image features, the 512×1×1 kernel is weighted and summed with the 512×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/8×w/8; each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the first extracted feature map is output as n×256×h/8×w/8;
b.3, after the 42nd layer of convolution, the feature map output matrices of the color dataset and the thermal infrared dataset are each n×512×h/16×w/16;
the dataset images of the two modalities output by the 42nd layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×1024×h/16×w/16;
the 1×1 convolution block used at the 42nd layer comprises 512 convolution kernels of dimension 1024×1×1 and an activation function; when each convolution kernel extracts image features, the 1024×1×1 kernel is weighted and summed with the 1024×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/16×w/16; each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the second extracted feature map is output as n×512×h/16×w/16;
b.4, the matrix of the second extracted feature map is then input into the remaining convolution layers of the backbone network for further feature extraction until the third extracted feature map is output; the matrix dimension of the third extracted feature map is n×1024×h/32×w/32, at which point execution of the Darknet-53 convolution layers ends;
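Scheme Three applies the same concat-plus-1×1 pattern at two depths; a generic block parameterized by channel count covers both fusion points. The sketch below assumes the per-modality backbone segments are available as callables and that the layer 26–42 segments continue from each branch's own layer-25 output (a routing the claim implies but does not state explicitly):

```python
import torch
import torch.nn as nn

def fusion_block(channels: int) -> nn.Sequential:
    """`channels` kernels of (2*channels) x 1 x 1 restore the channel
    count after the two modality maps are concatenated."""
    return nn.Sequential(
        nn.Conv2d(2 * channels, channels, kernel_size=1),
        nn.LeakyReLU(0.1),
    )

fuse25 = fusion_block(256)   # layer 25 tap: 512 -> 256 at h/8 x w/8
fuse42 = fusion_block(512)   # layer 42 tap: 1024 -> 512 at h/16 x w/16

def scheme_three(color, thermal, color_net, thermal_net, tail):
    # color_net / thermal_net: dicts of per-modality backbone segments;
    # tail: the shared remaining layers after the 42nd-layer fusion
    fc25, ft25 = color_net["l1_25"](color), thermal_net["l1_25"](thermal)
    first = fuse25(torch.cat([fc25, ft25], dim=1))    # n x 256 x h/8 x w/8
    fc42, ft42 = color_net["l26_42"](fc25), thermal_net["l26_42"](ft25)
    second = fuse42(torch.cat([fc42, ft42], dim=1))   # n x 512 x h/16 x w/16
    third = tail(second)                              # n x 1024 x h/32 x w/32
    return first, second, third
```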
Scheme Four:
b.1, the color dataset and the thermal infrared dataset are input simultaneously into the bimodal YOLOv3 neural network, and all 52 convolution layers of the backbone network extract features from each modality separately; the convolution operations extract shallow edge features such as single lines and colors, then deep semantic features of a certain part of the image;
b.2, after the 25th layer of convolution, the feature map output matrices of the color dataset and the thermal infrared dataset are each n×256×h/8×w/8;
after the 42nd layer of convolution, the feature map output matrices of the color dataset and the thermal infrared dataset are each n×512×h/16×w/16;
after the 51st layer of convolution, the feature map output matrices of the color dataset and the thermal infrared dataset are each n×1024×h/32×w/32;
b.3, the dataset images of the two modalities output by the 25th layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×512×h/8×w/8;
the 1×1 convolution block used at the 25th layer comprises 256 convolution kernels of dimension 512×1×1 and an activation function; when each convolution kernel extracts image features, the 512×1×1 kernel is weighted and summed with the 512×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/8×w/8; each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the first extracted feature map is output as n×256×h/8×w/8;
b.4, the dataset images of the two modalities output by the 42nd layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×1024×h/16×w/16;
the 1×1 convolution block used at the 42nd layer comprises 512 convolution kernels of dimension 1024×1×1 and an activation function; when each convolution kernel extracts image features, the 1024×1×1 kernel is weighted and summed with the 1024×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/16×w/16; each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the second extracted feature map is output as n×512×h/16×w/16;
b.5, the dataset images of the two modalities output by the 51st layer are fused by the fusion function and a 1×1 convolution block; the two outputs are linearly superimposed by the fusion function to obtain a superimposed dataset of dimension n×2048×h/32×w/32;
the 1×1 convolution block used at the 51st layer comprises 1024 convolution kernels of dimension 2048×1×1 and an activation function; when each convolution kernel extracts image features, the 2048×1×1 kernel is weighted and summed with the 2048×1×1 local matrix of each unit region on the superimposed dataset image, and the dimension of the output matrix is 1×1; the matrix dimension of a single image after weighted summation becomes 1×h/32×w/32; each image of the superimposed dataset is weighted and summed in this way, and after the activation function operation the matrix of the third extracted feature map is output as n×1024×h/32×w/32.
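Scheme Four keeps the two modality backbones fully separate and taps fused maps at three scales; a hedged sketch follows (the module layout, tap-dictionary interface, and activation choice are assumptions, not the patent's implementation):

```python
import torch
import torch.nn as nn

class LateFusionTaps(nn.Module):
    """Scheme Four: both modalities run all 52 backbone layers; fused
    feature maps are tapped at layers 25, 42 and 51 for the head."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.ModuleDict({
            "l25": nn.Sequential(nn.Conv2d(512, 256, 1), nn.LeakyReLU(0.1)),
            "l42": nn.Sequential(nn.Conv2d(1024, 512, 1), nn.LeakyReLU(0.1)),
            "l51": nn.Sequential(nn.Conv2d(2048, 1024, 1), nn.LeakyReLU(0.1)),
        })

    def forward(self, taps_color, taps_thermal):
        # taps_*: dicts of per-modality feature maps at layers 25/42/51
        return tuple(
            self.fuse[k](torch.cat([taps_color[k], taps_thermal[k]], dim=1))
            for k in ("l25", "l42", "l51")
        )  # n x 256 x h/8 x w/8, n x 512 x h/16 x w/16, n x 1024 x h/32 x w/32
```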
3. The target detection method based on feature fusion of a color camera and a thermal infrared imager according to claim 2, wherein step c comprises: inputting the third extracted feature map into the subsequent convolution layers, which alternately execute 1×1 and 3×3 convolution operations, and outputting the first prediction feature map matrix after the u-th layer; after the (u+1)-th layer, the channel number of the original third extracted feature map becomes 256 and it is linearly superimposed channel-wise with the second extracted feature map; after five further convolution layers, the second prediction feature map matrix is generated at the (u+6)-th convolution layer; then, after the (u+7)-th layer, the channel number becomes 128 and a channel-wise linear superposition with the first extracted feature map is performed; the third prediction feature map matrix is generated after the (u+12)-th layer; the first, second and third prediction feature map matrices are M×h/32×w/32, M×h/16×w/16 and M×h/8×w/8, respectively;
where M = 3×(5+m), m representing the number of classes of the predicted targets;
when step b selects Scheme One, Scheme Two or Scheme Three, u = 58;
when step b selects Scheme Four, u = 57.
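As a quick worked check of the channel count (the class count here is illustrative, not from the patent): with m = 2 target classes, M = 3×(5+2) = 21, i.e. 3 anchor boxes per grid cell, each predicting 4 bounding-box offsets, 1 objectness score and 2 class scores, so the first prediction feature map would be 21×h/32×w/32.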
4. The target detection method based on feature fusion of a color camera and a thermal infrared imager according to claim 3, wherein step a comprises:
a.1, fixing the color camera and the thermal infrared imager on a sensor bracket so that their viewing angles are consistent; then jointly calibrating the color camera and the thermal infrared imager to obtain the intrinsic and extrinsic parameter matrices of each, completing spatial synchronization;
a.2, acquiring the environment simultaneously and in real time with the thermal infrared imager and the color camera to obtain a color dataset and a thermal infrared dataset with synchronized timestamps, completing temporal synchronization;
a.3, de-distorting the captured color dataset and thermal infrared dataset according to the intrinsic and extrinsic parameter matrices of the color camera and the thermal infrared imager, then registering them so that the pixel points in the color dataset correspond one-to-one with the pixel points in the thermal infrared dataset.
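A minimal sketch of the de-distortion and registration step using OpenCV (the homography-based warp is one plausible realization; the patent does not specify the registration model, and all variable names are assumptions):

```python
import cv2
import numpy as np

def undistort_and_register(color_img, thermal_img,
                           K_c, dist_c, K_t, dist_t, H_t2c):
    """De-distort both frames with their own intrinsics, then warp the
    thermal frame onto the color frame with a precomputed homography
    H_t2c so pixels correspond one-to-one."""
    color_u = cv2.undistort(color_img, K_c, dist_c)
    thermal_u = cv2.undistort(thermal_img, K_t, dist_t)
    h, w = color_u.shape[:2]
    thermal_reg = cv2.warpPerspective(thermal_u, H_t2c, (w, h))
    return color_u, thermal_reg
```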
5. The target detection method based on feature fusion of a color camera and a thermal infrared imager according to claim 4, wherein a Zhang Zhengyou calibration board is used for the joint calibration in step a.1, and the intrinsic and extrinsic parameter matrices of the two sensors are obtained by the Zhang Zhengyou calibration method.
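Zhang Zhengyou's method is what OpenCV's `calibrateCamera` implements; a sketch for one sensor (the checkerboard dimensions, square size, and image list are assumptions; for the thermal imager the board typically needs to be thermally distinct so corners are visible, a practical detail the claim does not state):

```python
import cv2
import numpy as np

def zhang_calibrate(images, pattern=(9, 6), square=0.025):
    """Zhang-style calibration from checkerboard views of one sensor.
    Returns the intrinsic matrix K, distortion coefficients, and the
    per-view extrinsics (rvecs, tvecs)."""
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts = [], []
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img
        ok, corners = cv2.findChessboardCorners(gray, pattern)
        if ok:
            obj_pts.append(objp)
            img_pts.append(corners)
    _, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_pts, img_pts, gray.shape[::-1], None, None)
    return K, dist, rvecs, tvecs
```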
6. The target detection method based on feature fusion of a color camera and a thermal infrared imager according to any one of claims 2 to 5, further comprising:
d. evaluating the algorithm model of the bimodal neural network by the following parameters:
parameter one: the training time required by the YOLOv3 neural network algorithm under each of the four fusion schemes of step b;
parameter two: color datasets and thermal infrared datasets not used in training are input into the bimodal neural network models obtained from the four fusion schemes at the same number of training iterations, and the value of the network loss function is obtained from the test;
parameter three: the detection thresholds of the bimodal neural network models are set to 0.3 and 0.5 respectively, and the mAP values of the networks under the two thresholds are obtained;
parameter four: the models with thresholds of 0.3 and 0.5 are used to detect targets in an untrained dataset, yielding the numbers of correctly and incorrectly predicted labels (a minimal counting sketch follows this list);
parameter five: the real-time performance of target detection by the bimodal neural network models obtained from each of the four fusion schemes.
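Parameters three and four hinge on matching predicted boxes to ground truth at a threshold; below is a minimal counting sketch (greedy IoU matching is assumed, and the claim does not specify whether 0.3/0.5 are confidence or IoU thresholds — IoU is assumed here; all names are illustrative):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def count_tp_fp(pred_boxes, gt_boxes, thr=0.5):
    """Greedily match predictions to ground truth; returns (TP, FP)."""
    unmatched = list(gt_boxes)
    tp = 0
    for p in pred_boxes:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            tp += 1
    return tp, len(pred_boxes) - tp
```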