CN114359687A - Target detection method, device, equipment and medium based on multi-mode data dual fusion - Google Patents

Target detection method, device, equipment and medium based on multi-mode data dual fusion Download PDF

Info

Publication number
CN114359687A
CN114359687A
Authority
CN
China
Prior art keywords
image
frequency sub
band
fused
visible light
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111483806.7A
Other languages
Chinese (zh)
Other versions
CN114359687B (en)
Inventor
张浪文
张晋凯
解宇敏
刘洁耿
施唯
彭雄峰
胡俊嘉
罗其昆
时佰仟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202111483806.7A priority Critical patent/CN114359687B/en
Publication of CN114359687A publication Critical patent/CN114359687A/en
Application granted granted Critical
Publication of CN114359687B publication Critical patent/CN114359687B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a target detection method and device, electronic equipment and a storage medium based on multi-modal data dual fusion, wherein the method comprises the following steps: generating a fused image according to any pair of visible light and infrared images in a visible light-infrared target detection data set; forming data samples from the visible light image, the infrared image and the fused image; training detectors with the data samples to obtain trained detectors for the different modalities; generating a fused image to be detected according to a pair of visible light and infrared images to be detected; respectively inputting the visible light image, the infrared image and the fused image to be detected into the trained detectors of the corresponding modalities to obtain detection results; and fusing the detection results to obtain a final detection result. The invention comprehensively utilizes the advantages of pixel-level fusion and decision-level fusion and makes full use of the information of the visible light and infrared modalities, thereby achieving better all-weather detection performance.

Description

Target detection method, device, equipment and medium based on multi-mode data dual fusion
Technical Field
The invention relates to the technical field of target detection, in particular to a target detection method and device based on multi-mode data dual fusion, electronic equipment and a storage medium.
Background
In recent years, target detection technology based on visible light images has matured and is widely used in daily life and industrial production. However, visible light images are not expressive enough under poor lighting, so the technology adapts poorly to changes in environmental illumination and its detection performance drops sharply when light is insufficient. To address this problem, the current solution is to introduce infrared images alongside the original visible light images to weaken the influence of illumination conditions on detection performance. Infrared images are insensitive to illumination and can distinguish a target from the background by differences in thermal radiation, while visible light images, which are consistent with the human visual system, provide texture details with high spatial resolution and clarity. The two are therefore complementary: combining the thermal radiation information in the infrared image with the detailed texture information in the visible light image provides more robust target representation for the detection task and effectively improves target detection performance under all-weather conditions.
At present, target detection methods based on visible light and infrared data fusion fall into two main types: detection methods that fuse before detection, and detection methods that detect before fusion. In fusion-before-detection methods, the visible light and infrared images are first fused at the pixel level to obtain a fused image, which is then sent to a detector. Such methods retain most of the information of the original modalities, but they place high demands on image alignment and may introduce partial information redundancy. In detection-before-fusion methods, the detection results are fused: detectors of the corresponding modalities first detect the visible light image and the infrared image respectively, and the detection results are then combined while the optimal decision result is retained. These methods have the lowest degree of information redundancy, but the quality of the detection result depends heavily on the choice of detectors, and it is difficult for existing decision algorithms to break through the performance ceiling of a given detector after fusing the results.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a target detection method, device, equipment and medium based on multi-modal data dual fusion. The method comprehensively utilizes the advantages of pixel-level fusion and decision-level fusion, so that the information of the visible light modality and the infrared modality is used as fully as possible, giving better all-weather detection performance.
The invention aims to provide a target detection method based on multi-modal data dual fusion.
The second purpose of the invention is to provide an object detection device based on multi-modal data dual fusion.
A third object of the present invention is to provide an electronic apparatus.
It is a fourth object of the present invention to provide a storage medium.
The first purpose of the invention can be achieved by adopting the following technical scheme:
a method for target detection based on multi-modal data dual fusion, the method comprising:
generating a fusion image according to any pair of visible light images and infrared images in the visible light-infrared target detection data set; forming a data sample by the visible light image, the infrared image and the fused image;
training a detector by using the data sample to obtain trained detectors with different modes;
generating a fusion image to be detected according to a pair of visible light images and infrared images to be detected; inputting the pair of visible light images and infrared images to be detected and the fused image to be detected into a trained detector with a corresponding mode respectively to obtain a detection result;
and fusing the detection results to obtain a final detection result.
Further, the generating of the fused image according to any pair of visible light and infrared images in the visible light-infrared target detection data set adopts an image fusion algorithm based on multi-scale transformation, and specifically comprises:
processing the visible light image and the infrared image by adopting wavelet transformation to generate a low-frequency sub-band and a high-frequency sub-band;
respectively fusing the low-frequency sub-band and the high-frequency sub-band to obtain a fused low-frequency sub-band and a fused high-frequency sub-band;
and reconstructing the fused low-frequency sub-band and the fused high-frequency sub-band to generate a fused image.
Further, the processing the visible light image and the infrared image by using wavelet transform to generate a low frequency sub-band and a high frequency sub-band specifically includes:
when i is 1, carrying out ith decomposition on the visible light image to generate an ith low-frequency sub-band and t ith high-frequency sub-bands of the visible light image; carrying out ith decomposition on the infrared image to generate an ith low-frequency sub-band and t ith high-frequency sub-bands of the infrared image; wherein i is a positive integer greater than or equal to 1 and less than or equal to c, t is a first set threshold, and c is a second set threshold;
when i is larger than 1 and is smaller than or equal to c, carrying out ith decomposition on the i-1 th low-frequency sub-band of the visible light image to generate the ith low-frequency sub-band and t ith high-frequency sub-bands of the visible light image; and carrying out ith decomposition on the i-1 th low-frequency sub-band of the infrared image to generate the ith low-frequency sub-band and t ith high-frequency sub-bands of the infrared image.
Further, the fusing the low-frequency sub-band and the high-frequency sub-band respectively to obtain a fused low-frequency sub-band and a fused high-frequency sub-band specifically includes:
fusing the low-frequency sub-bands by adopting a window fusion rule to obtain a c-th fused low-frequency sub-band;
and fusing the high-frequency sub-bands by adopting a region characteristic energy fusion method to obtain fused high-frequency sub-bands.
Further, the low-frequency sub-bands include the c-th low-frequency sub-band LF_c^RGB(x, y) of the visible light image and the c-th low-frequency sub-band LF_c^IR(x, y) of the infrared image;
The method comprises the following steps of adopting a window fusion rule to fuse the low-frequency sub-bands to obtain a c-th fused low-frequency sub-band, and specifically comprising the following steps:
obtaining the c-th fused low-frequency sub-band by using the following formula:
LF_c^F(x, y) = α_1 · LF_c^RGB(x, y) + α_2 · LF_c^IR(x, y)
wherein x and y are the horizontal and vertical coordinates of the processing point on the image, α_1 and α_2 are the fusion coefficients of the visible light image and the infrared image respectively, and α_1 + α_2 = 1;
The t ith high-frequency sub-bands comprise ith high-frequency sub-bands in the horizontal direction, the vertical direction and the diagonal direction;
the high-frequency sub-bands comprise the high-frequency sub-bands in the ith horizontal direction, the vertical direction and the diagonal direction of the visible light image and the high-frequency sub-bands in the ith horizontal direction, the vertical direction and the diagonal direction of the infrared image;
fusing the high-frequency sub-bands by adopting a region characteristic energy fusion method to obtain fused high-frequency sub-bands, which specifically comprises the following steps:
respectively extracting edge image features of the first image and the second image by using a canny operator to obtain edge feature images, calculating the regional variance energy features through a sliding window, and respectively obtaining the regional energy values RGB_E and IR_E of the first image and the second image at the (x, y) position; wherein the first image and the second image are respectively the high-frequency sub-band in the i-th horizontal direction of the visible light image and the high-frequency sub-band in the i-th horizontal direction of the infrared image, the high-frequency sub-band in the i-th vertical direction of the visible light image and the high-frequency sub-band in the i-th vertical direction of the infrared image, and the high-frequency sub-band in the i-th diagonal direction of the visible light image and the high-frequency sub-band in the i-th diagonal direction of the infrared image;
selective fusion is performed by regional energy comparison, and a fusion formula is as follows:
DF_i^F(x, y) = DF_i^RGB(x, y), if RGB_E ≥ IR_E
DF_i^F(x, y) = DF_i^IR(x, y), if RGB_E < IR_E
wherein DF_i^RGB(x, y) and DF_i^IR(x, y) are the first image and the second image respectively, and DF_i^F(x, y) is the fused image;
after the fusion, the fusion high-frequency sub-band in the ith horizontal direction, the fusion high-frequency sub-band in the ith vertical direction and the fusion high-frequency sub-band in the ith diagonal direction are respectively obtained.
Further, reconstructing the fused low-frequency subband and the fused high-frequency subband to generate a fused image specifically includes:
when i = 1, reconstructing the c-th fused low-frequency sub-band, the c-th fused high-frequency sub-band in the horizontal direction, the c-th fused high-frequency sub-band in the vertical direction and the c-th fused high-frequency sub-band in the diagonal direction to generate the (c-1)-th fused low-frequency sub-band;
when i is greater than 1 and less than or equal to c, reconstructing the (c+1-i)-th fused low-frequency sub-band, the (c+1-i)-th fused high-frequency sub-band in the horizontal direction, the (c+1-i)-th fused high-frequency sub-band in the vertical direction and the (c+1-i)-th fused high-frequency sub-band in the diagonal direction to generate the (c-i)-th fused low-frequency sub-band;
the 0 th fused low-frequency sub-band is the fused image.
Further, the detection results comprise a visible light mode detection result, an infrared mode detection result and a fusion mode detection result, wherein each mode detection result comprises a target boundary box coordinate, a class cls to which the target belongs and a confidence score;
the fusing the detection results to obtain a final detection result specifically comprises:
all the detection results are processed as follows:
for target bounding boxes with the same value of cls, then:
calculating the intersection ratio of the two targets IoU pairwise, and when IoU is more than or equal to a third set threshold, the two target bounding boxes are the same target; if the intersection ratio IoU of the two frames is less than a third set threshold, the two target boundary frames are different targets;
if the two target bounding boxes are the same target, fusing the coordinates and the confidence score of the target bounding boxes through a Bayesian decision level fusion algorithm, and putting the fused result into a set B; if the two target bounding boxes are different targets, putting the coordinates and confidence scores of the target bounding boxes into a set B;
for target bounding boxes with different cls values, putting coordinates and confidence scores of the target bounding boxes into a set B;
the set B is the final detection result.
Further, the fusing the target bounding box coordinates and the confidence score by using a bayesian decision-level fusion algorithm specifically includes:
fusing the confidence scores of all the modes together through a Bayesian rule to obtain fused confidence scores;
and calculating the average value of the coordinate values of the target boundary frames which represent the same target in different modes to obtain the coordinate of the fused target boundary frame.
The second purpose of the invention can be achieved by adopting the following technical scheme:
an object detection apparatus based on multi-modal data dual fusion, the apparatus comprising:
the data sample acquisition module is used for generating a fusion image according to any pair of visible light images and infrared images in the visible light-infrared target detection data set; forming a data sample by the visible light image, the infrared image and the fused image;
the detector training module is used for respectively training the detectors by using the data samples to obtain the trained detectors with different modes;
the detection result generation module is used for generating a fusion image to be detected according to the pair of visible light images and the infrared image to be detected; inputting the pair of visible light images and infrared images to be detected and the fused image to be detected into a trained detector with a corresponding mode respectively to obtain a detection result;
and the detection result fusion module is used for fusing the detection results to obtain a final detection result.
The third purpose of the invention can be achieved by adopting the following technical scheme:
an electronic device comprises a processor and a memory for storing a program executable by the processor, wherein the processor executes the program stored in the memory to realize the target detection method.
The fourth purpose of the invention can be achieved by adopting the following technical scheme:
a storage medium stores a program that, when executed by a processor, implements the object detection method described above.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention designs an image fusion algorithm based on multi-scale transformation, obtains a fusion image by utilizing a visible light image and an infrared image, forms three data sources with the visible light image and the infrared image, and adopts a multi-mode data dual fusion strategy at a data input level, thereby retaining original information to the maximum extent.
2. In the invention, on the aspect of detection result output, the detection results output by the three detectors according to the three modes are fused through a Bayesian decision level fusion algorithm, and the results of different detectors are integrated, so that the final output fusion result has a more accurate detection result compared with any detector.
3. The dual fusion combines the advantages of two levels of fusion, and has more excellent all-weather detection performance compared with the single use of pixel level fusion or decision level fusion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
Fig. 1 is a schematic diagram of a target detection method based on multi-modal data dual fusion according to embodiment 1 of the present invention.
Fig. 2 is a flowchart of a target detection method based on multi-modal data dual fusion according to embodiment 1 of the present invention.
Fig. 3 is a flowchart of an image fusion algorithm based on multi-scale transformation according to embodiment 1 of the present invention.
Fig. 4 is a schematic diagram of the distribution of fused subbands after image fusion according to embodiment 1 of the present invention.
Fig. 5 is a flowchart of the bayesian decision-level fusion algorithm in embodiment 1 of the present invention.
Fig. 6 is a block diagram of a target detection apparatus based on multi-modal data dual fusion according to embodiment 2 of the present invention.
Fig. 7 is a block diagram of an electronic device according to embodiment 3 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts based on the embodiments of the present invention belong to the protection scope of the present invention. It should be understood that the description of the specific embodiments is intended to be illustrative only and is not intended to be limiting.
Example 1:
the application provides a target detection method based on multi-modal data dual fusion, which mainly relates to two key technologies: pixel level fusion algorithm: an image fusion algorithm based on multi-scale transformation is designed, and the existing visible light and infrared images are utilized to obtain a fusion image, so that the data source is enriched; (II) a decision-level fusion algorithm: and a Bayesian decision level fusion algorithm is designed, the results of independent detection input by the three detectors according to the three modes are fused, and the results of different detectors are synthesized. The method mainly comprises the following steps: generating a fused image, training a detector, predicting the detector and fusing a detection result, wherein: generating a fusion image, namely generating the fusion image according to the input visible light and infrared images through a pixel level fusion algorithm; training a detector, namely training the detector of a corresponding mode by utilizing a visible light image, an infrared image and a fusion image respectively; detector prediction, namely, respectively carrying out target detection on three image inputs by using the three detectors obtained in the previous step; and (4) fusion of detection results, namely merging the three groups of detection results obtained in the last step by using a decision-level fusion algorithm to obtain a final detection result.
As shown in fig. 1 and fig. 2, the present embodiment provides a target detection method based on multi-modal data dual fusion, including the following steps:
s201, generating a fusion image according to any pair of visible light images and infrared images in the visible light-infrared target detection data set; and forming a data sample by the visible light image, the infrared image and the fused image.
A public visible light-infrared target detection data set is obtained; the visible light image and the infrared image in the data set are fused to obtain a fused image, which forms three groups of data samples together with the visible light image and the infrared image.
As shown in fig. 3, the fused image is obtained by fusing the visible light image and the infrared image through an image fusion algorithm based on multi-scale transformation, which includes decomposing the images to generate sub-bands, processing the sub-band information by frequency band, and reconstructing the fused image.
The visible light image and the infrared image are fused with the image fusion algorithm based on multi-scale transformation: wavelet transformation is used to obtain a multi-scale representation of each input image, namely the low-frequency and high-frequency sub-bands of the two images; the low-frequency and high-frequency sub-bands of the two images are fused with different methods to obtain the fused sub-bands; and finally the fused sub-bands undergo the multi-scale inverse transformation, namely the inverse wavelet transformation, to obtain the fused image.
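For illustration, the following is a minimal, non-limiting sketch of this pixel-level fusion pipeline. It assumes the PyWavelets (pywt) and OpenCV libraries, the "db1" (Haar) wavelet and a simple max-absolute selection for the high-frequency sub-bands; the region characteristic energy rule actually used in this embodiment is sketched further below, after step S2012.

```python
# Hedged sketch of the pixel-level fusion pipeline (library choice and the
# max-absolute high-frequency rule are illustrative assumptions).
import cv2
import numpy as np
import pywt

def fuse_pair(rgb_path: str, ir_path: str, wavelet: str = "db1", level: int = 3) -> np.ndarray:
    rgb = cv2.imread(rgb_path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    ir = cv2.imread(ir_path, cv2.IMREAD_GRAYSCALE).astype(np.float32)

    # Multi-scale decomposition: [LF_c, (DF_c^H, DF_c^V, DF_c^D), ..., (DF_1^H, DF_1^V, DF_1^D)]
    c_rgb = pywt.wavedec2(rgb, wavelet, level=level)
    c_ir = pywt.wavedec2(ir, wavelet, level=level)

    # Low-frequency sub-band: average fusion (window fusion rule, alpha1 = alpha2 = 0.5)
    fused = [0.5 * c_rgb[0] + 0.5 * c_ir[0]]

    # High-frequency sub-bands: keep the coefficient with the larger response (placeholder rule)
    for (h1, v1, d1), (h2, v2, d2) in zip(c_rgb[1:], c_ir[1:]):
        fused.append(tuple(np.where(np.abs(a) >= np.abs(b), a, b)
                           for a, b in ((h1, h2), (v1, v2), (d1, d2))))

    # Multi-scale inverse transform (inverse wavelet transform) gives the fused image
    return pywt.waverec2(fused, wavelet)
```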
Further, step S201 includes:
and S2011, processing the visible light image and the infrared image by adopting wavelet transformation to generate a high-frequency sub-band and a low-frequency sub-band.
The present embodiment is described by taking the visible light image as an example. For an input image I, the whole image is first taken as the target of decomposition, yielding a low-frequency sub-band LF_1 and three high-frequency sub-bands DF_1^H, DF_1^V and DF_1^D. From the second decomposition onwards, only the low-frequency sub-band obtained from the previous decomposition is decomposed, again yielding one low-frequency sub-band and three high-frequency sub-bands (for example, the second decomposition decomposes the low-frequency sub-band LF_1 obtained from the first decomposition into the low-frequency sub-band LF_2 and the high-frequency sub-bands DF_2^H, DF_2^V and DF_2^D). The wavelet transform thus yields one low-frequency image and three pieces of high-frequency image information in the horizontal, vertical and diagonal directions.
The wavelet transform generates the corresponding scale and displacement functions through a wavelet basis, namely the inner product of a square-integrable function f(t) and a wavelet function ψ(t), wherein the expression of the wavelet basis is as follows:
ψ_{α,β}(t) = |α|^(-1/2) · ψ((t − β)/α)
where α is the scale factor, β is the displacement factor, t is time, ψ(t) is the wavelet function, i.e. the mother wavelet, and ψ_{α,β}(t) is the family of functions (the wavelet basis) generated by shifting and scaling the mother wavelet ψ(t). Varying the scale factor α stretches (α > 1) or shrinks (α < 1) the function ψ_{α,β}(t); changing the displacement factor β shifts the analysis of the function f(t) around the point β.
Continuous wavelet transform:
W(α, β) = ⟨f, ψ_{α,β}⟩ = |α|^(-1/2) ∫ f(t) ψ*((t − β)/α) dt
where W(α, β) is the continuous wavelet transform of f(t).
A digital image is a discrete signal, so the scale factor α and the displacement factor β are discretized as
α = α_0^i,  β = j · α_0^i · β_0
where i and j are integers, α_0 is a constant greater than 1 and β_0 is a constant greater than 0. The discrete wavelet transform is then
W(i, j) = ∫ f(t) ψ_{i,j}(t) dt,  with  ψ_{i,j}(t) = α_0^(-i/2) · ψ(α_0^(-i) t − j β_0)
where W(i, j) is the discrete wavelet transform of f(t) and ψ_{i,j}(t) is the discretized wavelet basis.
In the embodiment, the Mallat decomposition algorithm is used to decompose the input original image to obtain a high frequency part and a low frequency part.
In this embodiment, the original image is decomposed three times by using the Mallat decomposition algorithm, giving three low-frequency parts and nine high-frequency parts. The mathematical expression of the algorithm is:
LF_i = L_r L_c LF_{i-1}
DF_i^H = L_r H_c LF_{i-1}
DF_i^V = H_r L_c LF_{i-1}
DF_i^D = H_r H_c LF_{i-1}
wherein i = 1, 2, 3; r and c represent the row and column values, i.e. the dimensions, of the input image; x and y are the horizontal and vertical coordinate values of the processing point on the image; LF_0 is the input original image; LF_i is the low-frequency part of the image after the i-th decomposition, so LF_1, LF_2 and LF_3 are the low-frequency parts after the first, second and third decompositions respectively; H_r and H_c are high-pass filters; L_r and L_c are low-pass filters; DF_i^H, DF_i^V and DF_i^D are the high-frequency parts in the horizontal, vertical and diagonal directions and can be expressed as LH, HL and HH respectively.
The infrared image is processed in the same way to obtain the corresponding low-frequency and high-frequency parts. The low-frequency parts thus comprise the low-frequency part LF_i^RGB(x, y) of the visible light image and the low-frequency part LF_i^IR(x, y) of the infrared image.
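As an illustration of the three-level decomposition, the following sketch uses the PyWavelets library; the "db1" (Haar) wavelet and the random test image are assumptions for demonstration only.

```python
# Hedged sketch of a three-level Mallat decomposition with PyWavelets.
import numpy as np
import pywt

image = np.random.rand(256, 256).astype(np.float32)   # stand-in for the input image LF_0

# wavedec2 returns [LF_3, (DF_3^H, DF_3^V, DF_3^D), (DF_2^H, DF_2^V, DF_2^D), (DF_1^H, DF_1^V, DF_1^D)]
coeffs = pywt.wavedec2(image, wavelet="db1", level=3)

LF3 = coeffs[0]                      # third (deepest) low-frequency part
DF3_H, DF3_V, DF3_D = coeffs[1]      # third-level horizontal, vertical, diagonal details
DF1_H, DF1_V, DF1_D = coeffs[-1]     # first-level details (finest scale)
```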
S2012, the high-frequency sub-band and the low-frequency sub-band are respectively fused to obtain a fused high-frequency sub-band and a fused low-frequency sub-band.
(1) For the low frequency part, a window fusion rule is adopted.
For the low-frequency part, the window fusion rule is used to fuse the N-th low-frequency sub-bands obtained after the visible light image and the infrared image are each decomposed N times (N = 3 in this embodiment); the low-frequency sub-bands of the first N-1 decompositions are decomposed again and therefore do not need to be fused. The N-th low-frequency sub-band fusion formula is:
LF_N^F(x, y) = α_1 · LF_N^RGB(x, y) + α_2 · LF_N^IR(x, y)
wherein α_1 and α_2 are the fusion coefficients of the visible light image and the infrared image respectively.
This embodiment uses average fusion for the low-frequency part, i.e. α_1 = 0.5 and α_2 = 0.5. Since the total number of decompositions is three, the third low-frequency sub-band fusion formula is:
LF_3^F(x, y) = 0.5 · LF_3^RGB(x, y) + 0.5 · LF_3^IR(x, y)
which gives the third fused low-frequency sub-band LF_3^F.
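A minimal sketch of this average (window) fusion of the third low-frequency sub-bands, assuming the sub-bands are NumPy arrays of equal shape:

```python
import numpy as np

def fuse_low(LF3_rgb: np.ndarray, LF3_ir: np.ndarray,
             alpha1: float = 0.5, alpha2: float = 0.5) -> np.ndarray:
    # Window fusion rule with alpha1 + alpha2 = 1; average fusion in this embodiment
    assert abs(alpha1 + alpha2 - 1.0) < 1e-6
    return alpha1 * LF3_rgb + alpha2 * LF3_ir
```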
(2) For the high frequency part, a region characteristic energy fusion method is adopted.
And for the fusion of the high-frequency sub-bands, a region characteristic energy fusion method is adopted, edge characteristics are extracted from the high-frequency components through a canny operator, a region energy value is calculated to serve as a threshold condition, and the fusion is selected through a threshold after up-sampling from top to bottom.
Specifically, for the fusion of the high-frequency part, edge feature extraction is carried out by using a canny operator, the variance energy feature of a region is calculated through a sliding window, the variance feature value is used as a threshold value for comparison, and a sub-band of an image corresponding to a larger feature value is selected as a fusion sub-band.
Taking the horizontal sub-bands as an example, the calculation flow for the visible light image sub-band DF_i^{H,RGB} and the infrared image sub-band DF_i^{H,IR} is as follows:
(1-1) extracting edge image features from the visible light and infrared sub-bands to be fused by using a canny operator to obtain edge feature images, calculating the regional variance energy features through a sliding window, and obtaining the regional energy values RGB_E and IR_E at the (x, y) position respectively;
(1-2) performing threshold selection fusion through regional energy comparison, wherein the fusion formula is:
DF_i^{H,F}(x, y) = DF_i^{H,RGB}(x, y), if RGB_E ≥ IR_E
DF_i^{H,F}(x, y) = DF_i^{H,IR}(x, y), if RGB_E < IR_E
wherein DF_i^{H,RGB}(x, y) is the high-frequency sub-band in the horizontal direction after the i-th decomposition of the visible light image and DF_i^{H,IR}(x, y) is the high-frequency sub-band in the horizontal direction after the i-th decomposition of the infrared image. After the fusion, the fused high-frequency sub-band in the horizontal direction, DF_i^{H,F}, is obtained.
By the same method, the fused high-frequency sub-bands in the vertical and diagonal directions, DF_i^{V,F} and DF_i^{D,F}, are obtained, as shown in fig. 4.
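The following sketch illustrates the region characteristic energy fusion for one pair of high-frequency sub-bands, assuming OpenCV; the Canny thresholds and the 3×3 sliding window size are illustrative assumptions, since the embodiment does not fix these values.

```python
# Hedged sketch of the region characteristic energy fusion rule.
import cv2
import numpy as np

def region_energy(subband: np.ndarray, ksize: int = 3) -> np.ndarray:
    # Edge feature image from the Canny operator (expects 8-bit input)
    band8 = cv2.normalize(subband, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    edges = cv2.Canny(band8, 50, 150).astype(np.float32)
    # Regional variance energy via a sliding window: E[x^2] - (E[x])^2
    mean = cv2.blur(edges, (ksize, ksize))
    mean_sq = cv2.blur(edges * edges, (ksize, ksize))
    return mean_sq - mean * mean

def fuse_high(df_rgb: np.ndarray, df_ir: np.ndarray) -> np.ndarray:
    rgb_e = region_energy(df_rgb)
    ir_e = region_energy(df_ir)
    # Keep the coefficient whose region has the larger energy value
    return np.where(rgb_e >= ir_e, df_rgb, df_ir)
```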
S2013, reconstructing the fused high-frequency sub-band and the fused low-frequency sub-band by adopting a Mallat algorithm to obtain a fused image.
The reconstruction is the inverse process of the decomposition: the fused low-frequency and high-frequency sub-bands of one level are reconstructed into the fused low-frequency sub-band of the previous level (for example, in this embodiment the sub-bands obtained by the third decomposition are fused, and the fused low-frequency sub-band and fused high-frequency sub-bands are then reconstructed into the fused low-frequency sub-band of the second decomposition, and so on). When the fused low-frequency sub-band LF_1^F of the first decomposition is reconstructed together with the fused high-frequency sub-bands DF_1^{H,F}, DF_1^{V,F} and DF_1^{D,F}, the resulting LF_0^F is the fused image.
The image reconstruction process of the Mallat algorithm can be described as follows: for i = 2, 1, 0, the fused low-frequency sub-band LF_i^F is obtained by up-sampling LF_{i+1}^F, DF_{i+1}^{H,F}, DF_{i+1}^{V,F} and DF_{i+1}^{D,F} and filtering them with the corresponding synthesis low-pass and high-pass filters L_r, L_c, H_r and H_c; the final LF_0^F is the fused image. Here r and c represent the row and column values, i.e. the dimensions, of the input image, and m and n are the horizontal and vertical coordinate values of the processing point on the generated image, with 0 ≤ m ≤ 2r and 0 ≤ n ≤ 2c.
And fusing each pair of images in the visible light-infrared data set by adopting the image fusion algorithm based on multi-scale transformation to obtain a fused image, and forming three groups of data samples in different modes with the visible light-infrared data set.
S202, training the detector by using the data sample to obtain the trained detectors with different modes.
Based on the YOLOv5 target detection framework, the detector is trained by using the visible light image, the infrared image and the corresponding fusion image respectively, so as to obtain trained detectors of three different modalities.
Further, the YOLOv5 target detection algorithm is used as the framework, three detectors with random weights are initialized, and the three detectors are trained respectively with the data samples of the visible light, infrared and fusion modalities to obtain the optimal model weight parameters, i.e. the trained detectors corresponding to the three modalities.
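As an illustration only, one way to launch the three trainings on top of the public ultralytics/yolov5 repository is sketched below; the dataset YAML file names and the train.py command-line flags shown are assumptions based on common YOLOv5 usage, not part of this disclosure.

```python
# Hedged sketch: one training run per modality using the yolov5 train.py script.
import subprocess

MODALITIES = {
    "visible": "data/visible.yaml",    # assumed dataset configuration files
    "infrared": "data/infrared.yaml",
    "fused": "data/fused.yaml",
}

for name, data_yaml in MODALITIES.items():
    subprocess.run(
        ["python", "train.py",
         "--data", data_yaml,
         "--cfg", "yolov5s.yaml",      # model definition; random initial weights
         "--weights", "",              # empty weights -> train from scratch
         "--epochs", "300",
         "--name", f"detector_{name}"],
        check=True,
    )
```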
S203, generating a fusion image to be detected according to the pair of visible light image and infrared image to be detected; and respectively inputting the pair of visible light images and infrared images to be detected and the fused images to be detected into the trained detectors with the corresponding modes to obtain detection results.
Using the detectors trained in step S202, the pair of visible light and infrared images and the fused image obtained from them through step S201 are input into the corresponding detectors respectively to obtain three groups of detection results.
For a pair of visible light and infrared images to be detected, a fused image is obtained through step S201; the visible light image, the infrared image and the fused image are respectively input into the corresponding detectors for detection, yielding three groups of detection results. Each detection result contains three pieces of information, namely the coordinates of the target bounding box, the class to which the target belongs, and the confidence score (i.e. the probability that the target belongs to each of the given classes), which are respectively marked as bbox, cls and conf.
Specifically, a visible light image, an infrared image and a fused image describing the same scene are respectively input into the corresponding detectors to obtain three groups of detection results, and each target in the detection results is described by the three pieces of information {bbox, cls, conf}, wherein bbox represents the coordinates of the bounding box of the target, cls represents the most likely category to which the target belongs, and conf represents the probability that the target belongs to each of the given categories.
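For illustration, one possible in-memory representation of a single detected target (an assumption about data layout, not prescribed by this embodiment) is:

```python
# Illustrative structure for one detected target output by a single-modality detector.
detection = {
    "bbox": (48.0, 120.0, 96.0, 260.0),   # (x_min, y_min, x_max, y_max)
    "cls": 0,                              # most likely class index, e.g. 0 = pedestrian
    "conf": [0.82, 0.10, 0.08],            # probability for each given class
}
```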
And S204, fusing the detection results to obtain a final detection result.
As shown in fig. 5, the three groups of detection results are merged by using a bayesian decision-level fusion algorithm to obtain a final detection result.
Specifically, three groups of detection results are put into a set A, and whether each overlapped bounding box represents the same target is judged according to the bounding box intersection ratio (IoU): taking a threshold value thres, when some boundary frames in the set are overlapped in pairs and the intersection ratio IoU of the two frames is more than or equal to thres, considering that the two frames represent the same target, fusing the classification confidence score and the boundary frame coordinate representing the same target by using a decision-level fusion algorithm, and putting the result into the set B; if the intersection ratio IoU of the two boxes is less than thres, the two boxes represent different targets, and the confidence score and the bounding box coordinate of the two boxes are reserved according to the original result and are placed in the set B. And the result in the set B is the final detection result.
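A minimal sketch of the pairwise IoU test used for this matching, assuming bounding boxes are given as (x_min, y_min, x_max, y_max) tuples:

```python
def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def same_target(box_a, box_b, thres: float = 0.5) -> bool:
    return iou(box_a, box_b) >= thres   # IoU >= thres -> same target
```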
Further, step S204 includes:
(1) Placing the three groups of detection results in a set A, calculating the intersection over union IoU of the bounding boxes with the same cls value pairwise, and setting the threshold thres to 0.5; when the IoU of two boxes is greater than or equal to 0.5, the two boxes are considered to represent the same target; if the IoU of the two boxes is less than thres, the two boxes are considered to represent different targets.
(2) And according to the IoU result, correspondingly processing the bounding box in the set in the following way:
box processing = fusion, if IoU ≥ thres
box processing = reserve, if IoU < thres
wherein fusion means that, when the two boxes represent the same target, the target bounding box coordinates and confidence scores are fused through the Bayesian decision-level fusion algorithm; reserve means that, when the two boxes do not represent the same target, no processing is performed on the bounding boxes and the original results are retained.
Further, the step (2) specifically comprises:
and (2-1) fusing the confidence scores of all the modes together through a Bayesian rule to obtain a fused confidence score.
The fusion of the confidence scores is based on the condition independence of the detection process of each mode and the combination of the Bayes rule and the prediction characteristics of the YOLOv5 algorithm, and the fused confidence scores and the original confidence scores have the following relations:
conf_F(y = k) = ∏_{j=1..3} p̂(y = k | x_j) / Σ_k ∏_{j=1..3} p̂(y = k | x_j)
wherein y represents the object category annotated in the visible light-infrared target detection data set, including but not limited to typical object categories such as pedestrians, bicycles and automobiles (when k = 0, 1, 2 the object is a pedestrian, a bicycle or an automobile, and so on); x_j is the input information of the target of category y in a certain modality; p(y | x_j) is the probability that a certain object belongs to class k (y = k) in the label y given x_j; and p̂(y | x_j) is the probability distribution of p(y | x_j) predicted under the YOLOv5 target detection framework.
And multiplying the prediction probabilities of the same target from different modes, and dividing the result by the sum of the probabilities of all the categories to perform normalization to obtain a fusion confidence score. The principle of confidence score fusion is as follows:
Given the label y of a certain object, if the prediction made from the visible light modality information x_1 does not change given the infrared modality information x_2 or the fusion modality information x_3, then conditional independence holds. According to this independence condition, the following probability relationship holds:
p(x_1, x_2, x_3 | y) = p(x_1 | y) p(x_2 | y) p(x_3 | y)
The probability that a certain target belongs to category y is predicted from the multi-modal information; by the Bayes rule:
p(y | x_1, x_2, x_3) = p(x_1, x_2, x_3 | y) p(y) / p(x_1, x_2, x_3)
According to the independence condition, the above formula can be rewritten as the following relation, which is called the Bayesian probability fusion formula:
p(y | x_1, x_2, x_3) ∝ p(y | x_1) p(y | x_2) p(y | x_3)
According to the prediction principle of the YOLOv5 target detection framework, when a certain modality information x_j is input, the predicted probability distribution (confidence score) that a certain target belongs to class k (y = k) in the label y can be expressed as p̂(y = k | x_j).
Substituting into the Bayesian probability fusion formula gives the fused confidence score:
conf_F(y = k) = ∏_{j=1..3} p̂(y = k | x_j) / Σ_k ∏_{j=1..3} p̂(y = k | x_j)
in summary, the process of bayesian probability fusion can be described simply as the normalization by multiplying the predicted probabilities from different modalities for the same target, and dividing by the sum of the probabilities of all classes.
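A minimal sketch of this product-and-normalize fusion of the confidence scores, assuming each score vector is the per-class probability output by one modality's detector for the same matched target:

```python
import numpy as np

def fuse_confidence(scores):
    # scores: list of per-class probability vectors, one per modality, for one target
    product = np.prod(np.stack(scores, axis=0), axis=0)  # multiply across modalities
    return product / product.sum()                       # normalise over all classes

# Example: one target scored by the visible, infrared and fused-image detectors
fused_conf = fuse_confidence([np.array([0.7, 0.2, 0.1]),
                              np.array([0.6, 0.3, 0.1]),
                              np.array([0.8, 0.1, 0.1])])
```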
(2-2) The bounding box coordinate fusion adopts the principle of simple averaging, i.e. the coordinate values of the bounding boxes from different modalities that represent the same target are averaged. Suppose the three detectors respectively detect the three images of different modalities but identical content, and n of them detect the same target (for example, the same automobile); the predicted coordinates of the target bounding boxes are expressed as
(x_min^j, y_min^j, x_max^j, y_max^j), j = 1, ..., n
wherein n is an integer and n ≤ 3; (x_min^j, y_min^j) are the horizontal and vertical coordinates of the upper-left corner of the bounding box, and (x_max^j, y_max^j) are the horizontal and vertical coordinates of the lower-right corner of the bounding box, taken from the prediction result of the j-th detector. The coordinates of the fused bounding box are then expressed as:
bbox_F = ( (1/n) Σ_j x_min^j, (1/n) Σ_j y_min^j, (1/n) Σ_j x_max^j, (1/n) Σ_j y_max^j )
the method of averaging the coordinates of the bounding box can properly reduce the prediction error of the coordinates of the detection box, so that the finally obtained fusion bounding box is closer to the real label.
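A minimal sketch of this coordinate averaging, assuming each box is an (x_min, y_min, x_max, y_max) prediction for the same matched target:

```python
import numpy as np

def fuse_boxes(boxes):
    # boxes: list of (x_min, y_min, x_max, y_max) predictions of the same target, n <= 3
    return tuple(np.mean(np.asarray(boxes, dtype=np.float32), axis=0))
```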
(3) Performing the above processing on all the prediction results in the set A, if the intersection ratio of the two frames is IoU < thres, considering that the two frames represent different targets, reserving the confidence score and the boundary frame coordinate of the two frames according to the original result, and putting the confidence score and the boundary frame coordinate into the set B; otherwise, the two frames represent the same target, the classification confidence score and the boundary frame coordinate representing the same target are fused, and the result is put into the set B.
Those skilled in the art will appreciate that all or part of the steps in the method for implementing the above embodiments may be implemented by a program to instruct associated hardware, and the corresponding program may be stored in a computer-readable storage medium.
It should be noted that although the method operations of the above-described embodiments are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the depicted steps may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
Example 2:
as shown in fig. 6, the present embodiment provides an object detection apparatus based on multi-modal data dual fusion, the apparatus includes a data sample acquisition module 601, a detector training module 602, a detection result generation module 603, and a detection result fusion module 604, wherein:
the data sample acquisition module 601 is configured to generate a fused image according to any pair of visible light images and infrared images in the visible light-infrared target detection data set; forming a data sample by the visible light image, the infrared image and the fused image;
a detector training module 602, configured to train detectors respectively by using the data samples, so as to obtain trained detectors in different modalities;
a detection result generating module 603, configured to generate a fused image to be detected according to a pair of visible light images and infrared images to be detected; inputting the pair of visible light images and infrared images to be detected and the fused image to be detected into a trained detector with a corresponding mode respectively to obtain a detection result;
and a detection result fusion module 604, configured to fuse the detection results to obtain a final detection result.
The specific implementation of each module in this embodiment may refer to embodiment 1, which is not described herein any more; it should be noted that, the apparatus provided in this embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure is divided into different functional modules to complete all or part of the functions described above.
Example 3:
this embodiment provides an electronic device, which may be a computer, as shown in fig. 7, and includes a processor 702, a memory, an input device 703, a display 704, and a network interface 705 that are connected by a system bus 701, where the processor is used to provide computing and control capabilities, the memory includes a nonvolatile storage medium 706 and an internal memory 707, the nonvolatile storage medium 706 stores an operating system, computer programs, and a database, the internal memory 707 provides an environment for the operating system and the computer programs in the nonvolatile storage medium to run, and when the processor 702 executes the computer programs stored in the memory, the object detection method of embodiment 1 is implemented as follows:
generating a fusion image according to any pair of visible light images and infrared images in the visible light-infrared target detection data set; forming a data sample by the visible light image, the infrared image and the fused image;
training a detector by using the data sample to obtain trained detectors with different modes;
generating a fusion image to be detected according to a pair of visible light images and infrared images to be detected; inputting the pair of visible light images and infrared images to be detected and the fused image to be detected into a trained detector with a corresponding mode respectively to obtain a detection result;
and fusing the detection results to obtain a final detection result.
Example 4:
the present embodiment provides a storage medium, which is a computer-readable storage medium, and stores a computer program, and when the computer program is executed by a processor, the method for detecting the target of embodiment 1 is implemented as follows:
generating a fusion image according to any pair of visible light images and infrared images in the visible light-infrared target detection data set; forming a data sample by the visible light image, the infrared image and the fused image;
training a detector by using the data sample to obtain trained detectors with different modes;
generating a fusion image to be detected according to a pair of visible light images and infrared images to be detected; inputting the pair of visible light images and infrared images to be detected and the fused image to be detected into a trained detector with a corresponding mode respectively to obtain a detection result;
and fusing the detection results to obtain a final detection result.
It should be noted that the computer readable storage medium of the present embodiment may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In conclusion, the invention designs an image fusion algorithm based on multi-scale transformation, obtains a fusion image by utilizing a visible light image and an infrared image, forms three data sources with the visible light image and the infrared image, and furthest retains original information; meanwhile, three modal detection results output by the three detectors are fused through a Bayesian decision level fusion algorithm, and the results of different detectors are integrated, so that the final output fusion result has a more accurate detection result compared with any detector. The invention combines the advantages of two levels of fusion, and has more excellent all-weather detection performance compared with the single use of pixel level fusion or decision level fusion.
The above description is only for the preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto, and any person skilled in the art can substitute or change the technical solution and the inventive concept of the present invention within the scope of the present invention.

Claims (10)

1. A target detection method based on multi-modal data dual fusion is characterized by comprising the following steps:
generating a fusion image according to any pair of visible light images and infrared images in the visible light-infrared target detection data set; forming a data sample by the visible light image, the infrared image and the fused image;
training a detector by using the data sample to obtain trained detectors with different modes;
generating a fusion image to be detected according to a pair of visible light images and infrared images to be detected; inputting the pair of visible light images and infrared images to be detected and the fused image to be detected into a trained detector with a corresponding mode respectively to obtain a detection result;
and fusing the detection results to obtain a final detection result.
2. The target detection method according to claim 1, wherein the generating of the fused image according to any pair of the visible light image and the infrared image in the visible light-infrared target detection data set adopts an image fusion algorithm based on multi-scale transformation, and specifically comprises:
processing the visible light image and the infrared image by adopting wavelet transformation to generate a low-frequency sub-band and a high-frequency sub-band;
respectively fusing the low-frequency sub-band and the high-frequency sub-band to obtain a fused low-frequency sub-band and a fused high-frequency sub-band;
and reconstructing the fused low-frequency sub-band and the fused high-frequency sub-band to generate a fused image.
3. The object detection method according to claim 2, wherein the processing the visible light image and the infrared image by using wavelet transform to generate a low frequency subband and a high frequency subband specifically comprises:
when i is 1, carrying out ith decomposition on the visible light image to generate an ith low-frequency sub-band and t ith high-frequency sub-bands of the visible light image; carrying out ith decomposition on the infrared image to generate an ith low-frequency sub-band and t ith high-frequency sub-bands of the infrared image; wherein i is a positive integer greater than or equal to 1 and less than or equal to c, t is a first set threshold, and c is a second set threshold;
when i is larger than 1 and is smaller than or equal to c, carrying out ith decomposition on the i-1 th low-frequency sub-band of the visible light image to generate the ith low-frequency sub-band and t ith high-frequency sub-bands of the visible light image; and carrying out ith decomposition on the i-1 th low-frequency sub-band of the infrared image to generate the ith low-frequency sub-band and t ith high-frequency sub-bands of the infrared image.
4. The target detection method according to claim 3, wherein the fusing the low-frequency subband and the high-frequency subband respectively to obtain a fused low-frequency subband and a fused high-frequency subband specifically comprises:
fusing the low-frequency sub-bands by adopting a window fusion rule to obtain a c-th fused low-frequency sub-band;
and fusing the high-frequency sub-bands by adopting a region characteristic energy fusion method to obtain fused high-frequency sub-bands.
5. The object detection method of claim 4, wherein the low-frequency sub-bands include the c-th low-frequency sub-band LF_c^RGB(x, y) of the visible light image and the c-th low-frequency sub-band LF_c^IR(x, y) of the infrared image;
the fusing the low-frequency sub-bands by adopting a window fusion rule to obtain a c-th fused low-frequency sub-band specifically comprises:
obtaining the c-th fused low-frequency sub-band by using the following formula:
LF_c^F(x, y) = α_1 · LF_c^RGB(x, y) + α_2 · LF_c^IR(x, y)
wherein x and y are the horizontal and vertical coordinates of the processing point on the image, α_1 and α_2 are the fusion coefficients of the visible light image and the infrared image respectively, and α_1 + α_2 = 1;
The t ith high-frequency sub-bands comprise ith high-frequency sub-bands in the horizontal direction, the vertical direction and the diagonal direction;
the high-frequency sub-bands comprise the high-frequency sub-bands in the ith horizontal direction, the vertical direction and the diagonal direction of the visible light image and the high-frequency sub-bands in the ith horizontal direction, the vertical direction and the diagonal direction of the infrared image;
fusing the high-frequency sub-bands by adopting a region characteristic energy fusion method to obtain fused high-frequency sub-bands, which specifically comprises the following steps:
respectively extracting edge image features of the first image and the second image by using a canny operator to obtain edge feature images, calculating the regional variance energy features through a sliding window, and respectively obtaining the regional energy values RGB_E and IR_E of the first image and the second image at the (x, y) position; wherein the first image and the second image are respectively the high-frequency sub-band in the i-th horizontal direction of the visible light image and the high-frequency sub-band in the i-th horizontal direction of the infrared image, the high-frequency sub-band in the i-th vertical direction of the visible light image and the high-frequency sub-band in the i-th vertical direction of the infrared image, and the high-frequency sub-band in the i-th diagonal direction of the visible light image and the high-frequency sub-band in the i-th diagonal direction of the infrared image;
selective fusion is performed by regional energy comparison, and a fusion formula is as follows:
DF_i^F(x, y) = DF_i^RGB(x, y), if RGB_E ≥ IR_E
DF_i^F(x, y) = DF_i^IR(x, y), if RGB_E < IR_E
wherein DF_i^RGB(x, y) and DF_i^IR(x, y) are the first image and the second image respectively, and DF_i^F(x, y) is the fused image;
after the fusion, the fusion high-frequency sub-band in the ith horizontal direction, the fusion high-frequency sub-band in the ith vertical direction and the fusion high-frequency sub-band in the ith diagonal direction are respectively obtained.
6. The target detection method according to claim 5, wherein the reconstructing the fused low-frequency subband and the fused high-frequency subband to generate a fused image specifically comprises:
when i = 1, reconstructing the c-th fused low-frequency sub-band, the c-th fused high-frequency sub-band in the horizontal direction, the c-th fused high-frequency sub-band in the vertical direction and the c-th fused high-frequency sub-band in the diagonal direction to generate the (c-1)-th fused low-frequency sub-band;
when i is greater than 1 and less than or equal to c, reconstructing the (c+1-i)-th fused low-frequency sub-band, the (c+1-i)-th fused high-frequency sub-band in the horizontal direction, the (c+1-i)-th fused high-frequency sub-band in the vertical direction and the (c+1-i)-th fused high-frequency sub-band in the diagonal direction to generate the (c-i)-th fused low-frequency sub-band;
the 0 th fused low-frequency sub-band is the fused image.
7. The target detection method according to claim 1, wherein the detection results comprise visible light mode detection results, infrared mode detection results and fusion mode detection results, wherein each mode detection result comprises target bounding box coordinates, class cls to which the target belongs and a confidence score;
the fusing the detection results to obtain a final detection result specifically comprises:
all the detection results are processed as follows:
for target bounding boxes with the same value of cls, then:
calculating the intersection ratio of the two targets IoU pairwise, and when IoU is more than or equal to a third set threshold, the two target bounding boxes are the same target; if the intersection ratio IoU of the two frames is less than a third set threshold, the two target boundary frames are different targets;
if the two target bounding boxes are the same target, fusing the coordinates and the confidence score of the target bounding boxes through a Bayesian decision level fusion algorithm, and putting the fused result into a set B; if the two target bounding boxes are different targets, putting the coordinates and confidence scores of the target bounding boxes into a set B;
for target bounding boxes with different cls values, putting coordinates and confidence scores of the target bounding boxes into a set B;
the set B is the final detection result.
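The grouping step of this claim can be sketched as follows; the threshold value (0.5), the (x1, y1, x2, y2) box format and the greedy grouping around a seed box are assumptions made for illustration, since the claim only specifies a pairwise IoU comparison against a third set threshold within each class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def group_detections(detections, threshold=0.5):
    """detections: list of (box, cls, score) from the three modal detectors.
    Returns groups of detections judged to be the same target; singleton
    groups go into set B unchanged, multi-member groups are fused (claim 8)."""
    groups, used = [], [False] * len(detections)
    for i, (box_i, cls_i, _) in enumerate(detections):
        if used[i]:
            continue
        group, used[i] = [detections[i]], True
        for j in range(i + 1, len(detections)):
            box_j, cls_j, _ = detections[j]
            # boxes of different classes are always treated as different targets
            if not used[j] and cls_j == cls_i and iou(box_i, box_j) >= threshold:
                group.append(detections[j])
                used[j] = True
        groups.append(group)
    return groups
```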
8. The method for detecting the target according to claim 7, wherein the fusing the target bounding box coordinates and the confidence score by a Bayesian decision level fusion algorithm specifically comprises:
fusing the confidence scores of all the modes together through a Bayesian rule to obtain fused confidence scores;
and calculating the average value of the coordinates of the target bounding boxes that represent the same target in different modes to obtain the coordinates of the fused target bounding box.
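A sketch of the per-group fusion is given below. The claim only states that the confidence scores are combined "through a Bayesian rule"; the independence-based product form used here is an assumption, while the coordinate averaging follows the claim directly.

```python
import numpy as np

def bayes_fuse(group):
    """group: list of (box, cls, score) entries judged to be the same target."""
    boxes = np.array([g[0] for g in group], dtype=np.float64)
    scores = np.array([g[2] for g in group], dtype=np.float64)
    pos = np.prod(scores)
    neg = np.prod(1.0 - scores)
    fused_score = pos / (pos + neg + 1e-9)   # assumed Bayesian combination of modal scores
    fused_box = boxes.mean(axis=0)           # claim 8: average the bounding-box coordinates
    return fused_box, float(fused_score)
```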
9. An object detection device based on multi-modal data dual fusion, the device comprising:
the data sample acquisition module is used for generating a fusion image according to any pair of visible light images and infrared images in the visible light-infrared target detection data set; forming a data sample by the visible light image, the infrared image and the fused image;
the detector training module is used for respectively training the detectors by using the data samples to obtain the trained detectors with different modes;
the detection result generation module is used for generating a fused image to be detected according to a pair of visible light image and infrared image to be detected, and for respectively inputting the visible light image and infrared image to be detected and the fused image to be detected into the trained detectors of the corresponding modes to obtain detection results;
and the detection result fusion module is used for fusing the detection results to obtain a final detection result.
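For illustration only, the four modules of the device could be wired together as in the skeleton below; the class and parameter names are hypothetical, and the pixel-level and decision-level fusion steps are passed in as callables (e.g. the wavelet fusion of the earlier claims and the IoU/Bayesian fusion sketched above) rather than reproduced here.

```python
class DualFusionDetector:
    """Skeleton of the device: three single-mode detectors plus the two fusion stages."""
    def __init__(self, detectors, pixel_fuse, decision_fuse):
        self.detectors = detectors          # {"visible": ..., "infrared": ..., "fused": ...}
        self.pixel_fuse = pixel_fuse        # pixel-level fusion of a visible/infrared pair
        self.decision_fuse = decision_fuse  # decision-level fusion of all modal detections

    def detect(self, vis_img, ir_img):
        fused_img = self.pixel_fuse(vis_img, ir_img)
        inputs = {"visible": vis_img, "infrared": ir_img, "fused": fused_img}
        results = []
        for mode, image in inputs.items():
            results.extend(self.detectors[mode](image))   # (box, cls, score) tuples
        return self.decision_fuse(results)                # final detection result (set B)
```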
10. A storage medium storing a program, wherein the program, when executed by a processor, implements the target detection method of any one of claims 1 to 8.
CN202111483806.7A 2021-12-07 2021-12-07 Target detection method, device, equipment and medium based on multi-mode data double fusion Active CN114359687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111483806.7A CN114359687B (en) 2021-12-07 2021-12-07 Target detection method, device, equipment and medium based on multi-mode data double fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111483806.7A CN114359687B (en) 2021-12-07 2021-12-07 Target detection method, device, equipment and medium based on multi-mode data double fusion

Publications (2)

Publication Number Publication Date
CN114359687A true CN114359687A (en) 2022-04-15
CN114359687B CN114359687B (en) 2024-04-09

Family

ID=81098072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111483806.7A Active CN114359687B (en) 2021-12-07 2021-12-07 Target detection method, device, equipment and medium based on multi-mode data double fusion

Country Status (1)

Country Link
CN (1) CN114359687B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451984A (en) * 2017-07-27 2017-12-08 桂林电子科技大学 A kind of infrared and visual image fusion algorithm based on mixing multiscale analysis
CN111754447A (en) * 2020-07-06 2020-10-09 江南大学 Infrared and visible light image fusion method based on multi-state context hidden Markov model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Wei et al.: "Infrared and Visible Image Fusion Combining NSST and LC Saliency", Electronic Technology & Software Engineering, no. 08, 15 April 2020 (2020-04-15) *
Qiu Wenjia et al.: "Infrared and Visible Image Fusion Based on NSCT and SLIP Model", Command Information System and Technology, no. 02, 8 May 2018 (2018-05-08) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543283A (en) * 2023-07-05 2023-08-04 合肥工业大学 Multimode target detection method considering modal uncertainty
CN116543283B (en) * 2023-07-05 2023-09-15 合肥工业大学 Multimode target detection method considering modal uncertainty
CN117773405A (en) * 2024-02-28 2024-03-29 茌平鲁环汽车散热器有限公司 Method for detecting brazing quality of automobile radiator
CN117773405B (en) * 2024-02-28 2024-05-14 茌平鲁环汽车散热器有限公司 Method for detecting brazing quality of automobile radiator

Also Published As

Publication number Publication date
CN114359687B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
Xu et al. Inter/intra-category discriminative features for aerial image classification: A quality-aware selection model
CN111882002B (en) MSF-AM-based low-illumination target detection method
Tang et al. DIVFusion: Darkness-free infrared and visible image fusion
CN110956126B (en) Small target detection method combined with super-resolution reconstruction
CN111539343B (en) Black smoke vehicle detection method based on convolution attention network
CN110533046B (en) Image instance segmentation method and device, computer readable storage medium and electronic equipment
Raza et al. IR-MSDNet: Infrared and visible image fusion based on infrared features and multiscale dense network
CN111899203B (en) Real image generation method based on label graph under unsupervised training and storage medium
CN110807384A (en) Small target detection method and system under low visibility
Jiang et al. A self-attention network for smoke detection
CN112966747A (en) Improved vehicle detection method based on anchor-frame-free detection network
CN115565043A (en) Method for detecting target by combining multiple characteristic features and target prediction method
Hongmeng et al. A detection method for deepfake hard compressed videos based on super-resolution reconstruction using CNN
CN114359687B (en) Target detection method, device, equipment and medium based on multi-mode data double fusion
CN116311005A (en) Apparatus, method and storage medium for moving image processing
CN111368634A (en) Human head detection method, system and storage medium based on neural network
Shao et al. Generative image inpainting with salient prior and relative total variation
Chen et al. Real-time lane detection model based on non bottleneck skip residual connections and attention pyramids
Liang et al. An Interpretable Image Denoising Framework Via Dual Disentangled Representation Learning
Zhang et al. Trustworthy image fusion with deep learning for wireless applications
CN116883303A (en) Infrared and visible light image fusion method based on characteristic difference compensation and fusion
CN116958911A (en) Traffic monitoring image target detection method oriented to severe weather
CN111598841A (en) Example significance detection method based on regularized dense connection feature pyramid
CN112446292B (en) 2D image salient object detection method and system
CN114332754A (en) Cascade R-CNN pedestrian detection method based on multi-metric detector

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant