CN114694001A - Target detection method and device based on multi-modal image fusion - Google Patents

Target detection method and device based on multi-modal image fusion Download PDF

Info

Publication number
CN114694001A
CN114694001A (application CN202210137919.XA)
Authority
CN
China
Prior art keywords
vector
layer
image
module
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210137919.XA
Other languages
Chinese (zh)
Inventor
张树
马杰超
俞益洲
李一鸣
乔昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Original Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenrui Bolian Technology Co Ltd, Shenzhen Deepwise Bolian Technology Co Ltd filed Critical Beijing Shenrui Bolian Technology Co Ltd
Priority to CN202210137919.XA priority Critical patent/CN114694001A/en
Publication of CN114694001A publication Critical patent/CN114694001A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method and device based on multi-modal image fusion. The method comprises the following steps: acquiring a video image and an infrared image in real time, and inputting them into a target detection model built from Transformers; extracting global features from the video image and the infrared image respectively; fusing the extracted video image features and infrared image features; and inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers, which outputs the target category and the target position. Because the target detection model is built from a pure Transformer, the model advantages brought by the overall Transformer structure can be fully exploited. The invention performs target detection on fused features of the video image and the infrared image, so targets can be detected under any illumination condition, solving the problem that existing detection systems perform poorly in dark environments such as night scenes.

Description

Target detection method and device based on multi-modal image fusion
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a target detection method and device based on multi-modal image fusion.
Background
How to help visually impaired and otherwise vulnerable people achieve better mobility has long been a social problem of wide concern. Timely and correct perception of the surrounding environment is an indispensable condition for improving the safety and quality of life of these individuals. With the rapid development of computer vision in recent years, deep learning models based on convolutional neural networks (CNNs) have shown outstanding ability in real-time recognition of natural-scene images, at times exceeding human accuracy and stability, and have been successfully deployed in products such as the autonomous driving systems that have recently achieved excellent results.
Wearable vision-assistance devices developed for visually impaired people also benefit from these advances: a miniature camera or sensor on the device collects image or video data of the real-time scene, and an on-board model performs the corresponding computation to provide the wearer with scene target detection results. However, most target detection models are modeled on visible-light color images with sufficient brightness. When such a model receives visible-light input captured under poor ambient lighting (for example at night or in a dark indoor space), its performance drops sharply and it cannot achieve adequate recognition, so the corresponding vision-assistance device cannot warn the wearer of danger in time.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a target detection method and apparatus based on multi-modal image fusion.
In order to achieve the above object, the present invention adopts the following technical solutions.
In a first aspect, the present invention provides a target detection method based on multi-modal image fusion, including the following steps:
acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting the video image and the infrared image into a target detection model built from Transformers;
respectively extracting global features of the video image and the infrared image by using a feature encoding module composed of Transformer encoders;
fusing the extracted video image features and the infrared image features by using a feature fusion module composed of Transformer decoders;
and inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers, and outputting the target category and the target position.
Further, before global feature extraction, the method further comprises performing the following operations on the input video image and the input infrared image respectively:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
Furthermore, the feature encoding module is formed by stacking Transformer encoders, where each Transformer encoder comprises a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result. The multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
Furthermore, the feature fusion module is formed by stacking Transformer decoders, where each Transformer decoder comprises a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively. The key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
Further, the method further comprises: judging dangerous targets and their directions according to the target category and the target position, and issuing danger warning information.
In a second aspect, the present invention provides an object detection apparatus based on multi-modal image fusion, including:
the image acquisition module is used for acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting them into a target detection model built from Transformers;
the feature extraction module is used for respectively extracting global features of the video image and the infrared image by utilizing a feature encoding module composed of Transformer encoders;
the feature fusion module is used for fusing the extracted video image features and the infrared image features by utilizing a feature fusion module composed of Transformer decoders;
and the target prediction module is used for inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers and outputting the target category and the target position.
Further, the apparatus also includes a vector embedding module to:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
Furthermore, the feature encoding module is formed by stacking Transformer encoders, where each Transformer encoder comprises a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result. The multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
Furthermore, the feature fusion module is formed by stacking Transformer decoders, where each Transformer decoder comprises a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively. The key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
Furthermore, the device also comprises a danger early-warning module, used for judging dangerous targets and their directions according to the target category and the target position, and issuing danger warning information.
Compared with the prior art, the invention has the following beneficial effects.
According to the method, a video image and an infrared image are acquired in real time, a target detection model composed of a pure Transformer extracts global features from the video image and the infrared image respectively, the extracted video image features and infrared image features are fused, and target categories are predicted from the fused features, realizing target detection based on multi-modal image fusion. Because the target detection model is built from a pure Transformer, the model advantages brought by the overall Transformer structure can be fully exploited. The invention performs target detection on fused features of the video image and the infrared image, so targets can be detected under any illumination condition, solving the problem that existing detection systems perform poorly in dark environments such as night scenes.
Drawings
Fig. 1 is a flowchart of a target detection method based on multi-modal image fusion according to an embodiment of the present invention.
Fig. 2 is a schematic view of an overall structure of a target detection model according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the self-attention mechanism.
Fig. 4 is a schematic diagram of the cascading of two Transformer decoders.
Fig. 5 is a block diagram of an object detection apparatus based on multi-modal image fusion according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and the detailed description. It should be understood that the described embodiments are merely some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a target detection method for multi-modal image fusion according to an embodiment of the present invention, including the following steps:
step 101, acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting them into a target detection model built from Transformers;
step 102, respectively extracting global features from the video image and the infrared image by using a feature encoding module composed of Transformer encoders;
step 103, fusing the extracted video image features and the infrared image features by using a feature fusion module composed of Transformer decoders;
and step 104, inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers, and outputting the target category and the target position.
In this embodiment, step 101 is mainly used for acquiring a video image and an infrared image in real time. Most existing target detection models for assisting visually impaired people are modeled on visible-light color images with sufficient brightness, so their performance drops greatly when receiving visible-light input captured under poor ambient lighting (for example at night or in a dark space), and they cannot achieve the required recognition capability. For this reason, the present embodiment acquires an infrared image at the same time as the video image. Because infrared imaging is not affected by illumination conditions, the collected infrared image powerfully supplements scene target information in dark environments, so a target detection model based on fusion of the video image and the infrared image retains a high level of generalization in both bright and dark scenes. The target detection model of this embodiment adopts a pure Transformer structure, fully exploiting the model advantages brought by the overall Transformer architecture, and can achieve better accuracy and generalization than a convolutional neural network (CNN) on image recognition tasks. The overall structure of the target detection model is shown in Fig. 2.
In this embodiment, step 102 is mainly used for image feature extraction. Feature extraction is performed on the video image and the infrared image respectively by a feature encoding module composed of Transformer encoders. The Transformer encoder adopts an attention mechanism and is mainly composed of multi-head self-attention modules; it extracts global features of the input image, which greatly improves target detection accuracy compared with a CNN, which can only extract local image features.
In this embodiment, step 103 is mainly used for multi-modal feature fusion. This embodiment fuses the extracted video image features and infrared image features with a feature fusion module composed of Transformer decoders. Existing CNN-based network models mainly use three fusion schemes for the multi-modal image fusion task, called early, middle and late fusion. In early fusion, images from multiple modalities are directly concatenated along the channel dimension at the model input and fed to the whole network. In middle fusion, each modality has its own feature extractor, and feature maps of each modality at a certain stage are fused using some defined fusion computation. In late fusion, the final results of all modalities, each passed through an independent feature extractor, are fused for prediction. All of these fusion methods choose the fusion point largely by trial and error, without sufficient theoretical justification or task directivity; they also assume that features of the different modalities correspond one-to-one in spatial position, while convolution performs only local fusion computation. In practice, images of different modalities, and even their feature maps, have some positional offset, so purely local computation may fail to align the corresponding features, leading to inefficient fusion and poor detection. In this embodiment, the Transformer provides an attention-based multi-modal fusion method (the Transformer decoder contains a multi-head self-attention module and a multi-head mutual-attention module) in place of the CNN, so that information from different modalities can attend to each other over a global scope; the fusion is therefore not limited by positional offsets, is more effective, and has stronger theoretical support.
In this embodiment, step 104 is mainly used to predict the target category. The fused features of the video image and the infrared image are input into a prediction module composed of Transformer fully-connected layers, which predicts the target category. The targets in this embodiment are dangerous objects that may threaten movement, and target categories are graded by danger level; for example, a tunnel or utility pole directly ahead is high risk, while a bicycle parked to the side is medium risk. The prediction module generally outputs the target position along with the target category. It consists of two fully-connected branches: one branch, made up of N1 fully-connected layers, predicts the target category, and the other branch, made up of N2 fully-connected layers, regresses the target position (the coordinates of the upper-left and lower-right corners of the detection box), thereby completing the target detection task. Both branches take the same input: the fused features finally output by the feature fusion module.
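As an illustration only, the following PyTorch-style sketch shows how such a two-branch prediction module could be organized; the class name PredictionModule, the layer counts N1/N2, the hidden sizes, and the normalization of box corners to [0, 1] via a sigmoid are hypothetical choices for the example, not values stated in this disclosure.

```python
import torch
import torch.nn as nn


def mlp(in_dim: int, hidden: int, out_dim: int, num_layers: int) -> nn.Sequential:
    """Stack of fully-connected layers with ReLU between them."""
    dims = [in_dim] + [hidden] * (num_layers - 1) + [out_dim]
    layers = []
    for i in range(num_layers):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < num_layers - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)


class PredictionModule(nn.Module):
    def __init__(self, d_prime: int, num_classes: int, n1: int = 3, n2: int = 3):
        super().__init__()
        # One branch of N1 fully-connected layers for category, one of N2 layers for box.
        self.cls_branch = mlp(d_prime, d_prime, num_classes, n1)
        self.box_branch = mlp(d_prime, d_prime, 4, n2)   # (x1, y1, x2, y2)

    def forward(self, fused: torch.Tensor):
        # Both branches receive the same fused features (B, N', d') from the fusion module.
        cls_logits = self.cls_branch(fused)              # target category scores per query
        boxes = self.box_branch(fused).sigmoid()         # corner coordinates, normalized (example choice)
        return cls_logits, boxes
```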
As an alternative embodiment, before performing global feature extraction, the method further includes the following operations performed on the input video image and the input infrared image respectively:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
This embodiment provides a technical scheme for vector embedding of the input video image and infrared image. The input video image and infrared image must be converted by embedded encoding into the sequence-type input a Transformer accepts. Specifically, an image of size C × H × W is cut into patches; assuming each patch has spatial size h × w, this yields N = (H/h) × (W/w) patches of size C × h × w. Each patch is flattened along the channel dimension C into a vector of length C × h × w, and the resulting N × (C × h × w) matrix is fed to a linear fully-connected layer that maps the dimension to d. In addition, so that the patch encoding carries two-dimensional position information rather than being permutation-invariant, fixed d-dimensional sine and cosine position codes are computed for the row and column directions respectively and added to the output of the linear layer, finally giving an N × d matrix, i.e. the linear embedded encoding of the input image, where the d-dimensional vector in each row is the representative vector of one patch, and the number of rows N can be called the number of representative vectors. Note that N depends on the chosen patch size and can be set flexibly according to the actual requirements of a specific task.
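For illustration, the following PyTorch-style sketch shows one possible reading of this embedding step: patch slicing, a linear fully-connected projection to dimension d, and fixed sine/cosine codes over the row and column directions. The names PatchEmbedding and row_col_position_encoding, and the particular way the row and column codes split the d dimensions, are assumptions made for the example rather than details given in this disclosure.

```python
import torch
import torch.nn as nn


def row_col_position_encoding(rows: int, cols: int, d: int) -> torch.Tensor:
    """Fixed sinusoidal codes for the row and column directions, each taking half of d."""
    assert d % 4 == 0  # keeps the example simple: d is split evenly between row and column codes

    def sincos(n: int, dim: int) -> torch.Tensor:
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)       # (n, 1)
        i = torch.arange(0, dim, 2, dtype=torch.float32)              # (dim/2,)
        angle = pos / torch.pow(10000.0, i / dim)
        pe = torch.zeros(n, dim)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        return pe

    half = d // 2
    row_pe = sincos(rows, half)                                       # (rows, d/2)
    col_pe = sincos(cols, half)                                       # (cols, d/2)
    # Patch (r, c) receives [row code ; column code], so the code carries 2-D position.
    pe = torch.cat([row_pe.unsqueeze(1).expand(rows, cols, half),
                    col_pe.unsqueeze(0).expand(rows, cols, half)], dim=-1)
    return pe.reshape(rows * cols, d)                                 # (N, d)


class PatchEmbedding(nn.Module):
    def __init__(self, channels: int, patch_h: int, patch_w: int, d: int):
        super().__init__()
        self.patch_h, self.patch_w = patch_h, patch_w
        self.proj = nn.Linear(channels * patch_h * patch_w, d)        # linear fully-connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:               # x: (B, C, H, W)
        b, c, H, W = x.shape
        rows, cols = H // self.patch_h, W // self.patch_w
        # Cut into N = rows*cols patches and flatten each along the channel dimension.
        x = x.unfold(2, self.patch_h, self.patch_h).unfold(3, self.patch_w, self.patch_w)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, rows * cols, -1)   # (B, N, C*h*w)
        tokens = self.proj(x)                                         # (B, N, d)
        pe = row_col_position_encoding(rows, cols, tokens.shape[-1]).to(tokens.device)
        return tokens + pe                                            # N x d encoding per image
```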
As an optional embodiment, the feature encoding module is formed by stacking Transformer encoders, where each Transformer encoder comprises a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result. The multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
This embodiment provides a specific technical solution for feature extraction. Feature extraction is realized by the feature encoding module, which is obtained by stacking Transformer encoders; the exact number of stacked layers can be tuned for the specific task, and the encoders of the two image branches are mutually independent and may or may not use the same number of layers. Each Transformer encoder consists (in order) of a multi-head self-attention module layer and a feed-forward module layer, with residual connections and normalization applied at each layer. The computation of the self-attention mechanism is illustrated in Fig. 3: the input N × d encoding matrix is transformed by three linear mapping functions W_q, W_k, W_v to obtain a Query vector, a Key vector and a Value vector of size N × d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain the attention weight matrix, expressed by the formula
α = softmax(Q · K^T / √d')
where α is the weight matrix, Q is the query vector, and K^T is the transpose of the key vector. The weight matrix is then multiplied by the value vector (equivalently, the value vectors are weighted column-wise by the attention weights and summed to obtain each entry of the result matrix). Multi-head self-attention repeats this process several times independently, concatenates the results, and maps them back to the original feature dimension d'. The feed-forward module layer is a multi-layer perceptron (MLP) with one hidden layer. Through the Transformer encoder, the input image encodes features over a self-modeled global scope, i.e. the encoding of each representative vector includes its computed similarity to all other representative vectors, a global property that CNN feature extraction does not have.
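A minimal PyTorch-style sketch of the multi-head self-attention computation described above (and in Fig. 3) is given below; the head count and the split of d' across heads are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d: int, d_prime: int, num_heads: int):
        super().__init__()
        assert d_prime % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_prime // num_heads
        # Three different linear transformations giving query, key and value of size N x d'.
        self.w_q = nn.Linear(d, d_prime)
        self.w_k = nn.Linear(d, d_prime)
        self.w_v = nn.Linear(d, d_prime)
        self.out = nn.Linear(d_prime, d_prime)   # map the concatenated heads back to d'

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, N, d)
        b, n, _ = x.shape

        def split(t: torch.Tensor) -> torch.Tensor:           # (B, N, d') -> (B, heads, N, head_dim)
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # Scaled dot-product similarity, softmax-normalized into attention weights alpha.
        alpha = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads = alpha @ v                                      # weighted sum of the value vectors
        heads = heads.transpose(1, 2).reshape(b, n, -1)        # concatenate the heads
        return self.out(heads)                                 # back to dimension d'
```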
As an optional embodiment, the feature fusion module is formed by stacking Transformer decoders, where each Transformer decoder comprises a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively. The key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
This embodiment provides a specific technical solution for feature fusion. Fusion of the image features of the two modalities is realized by the feature fusion module, which is stacked from Transformer decoders; Fig. 4 shows the structure of two Transformer decoders stacked in sequence. As with the encoders, the number of stacked decoder layers can be tuned for the specific task. Each Transformer decoder consists (in order) of a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward module layer, with residual connections and normalization applied at each layer. The multi-head self-attention module layer and the feed-forward module layer are the same as in the Transformer encoder. The multi-head mutual-attention module layer uses the same computation as self-attention; the only difference is that its query vector comes from the output of the multi-head self-attention module layer, while its key vector and value vector come respectively from the video image feature A and the infrared image feature B output by the feature encoding module. Notably, the order in which the key and value vectors of adjacent decoders connect to the image features A and B is exactly opposite: for example, if the key and value vectors of the current decoder connect to A and B, then those of the previous and next decoders connect to B and A respectively, so that the query vector alternately attends to and fuses the features of the two modalities. This design effectively balances information deviations that may exist between the two modalities, including positional offsets, extracting effective content with similar distributions and modeling key global interrelations. Note also that a specially defined query vector must be separately initialized as the input to the first Transformer decoder layer. This query vector is a set of learnable parameters that implicitly learns how to extract the position encoding of regions where targets exist in the multi-modal images; it acts as an intermediary in the fusion, has good task directivity and prior, and is a key component for completing both the target detection task and the multi-modal fusion task. The dimension of the query vector is the same as that of the modal image encoding, but its size N' (i.e. its number of rows in the encoding matrix) is much smaller than the number N of modal image encodings, N' << N, while being slightly larger than the maximum number of targets expected in a data image, which reduces missed detections; during attention computation only the necessary features interact, reducing information redundancy and greatly reducing computational cost.
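The alternating connection of the key and value vectors to features A and B can be sketched as follows, assuming PyTorch and nn.MultiheadAttention; the class names, layer composition and hyper-parameters are an approximation of the described decoder under these assumptions, not the exact implementation of this disclosure.

```python
import torch
import torch.nn as nn


class FusionDecoderLayer(nn.Module):
    def __init__(self, d_prime: int, num_heads: int, ff_dim: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_prime, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_prime, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_prime, ff_dim), nn.ReLU(), nn.Linear(ff_dim, d_prime))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_prime) for _ in range(3))

    def forward(self, q: torch.Tensor, key_src: torch.Tensor, val_src: torch.Tensor) -> torch.Tensor:
        q = self.norm1(q + self.self_attn(q, q, q)[0])               # multi-head self-attention
        q = self.norm2(q + self.cross_attn(q, key_src, val_src)[0])  # multi-head mutual attention
        return self.norm3(q + self.ff(q))                            # feed-forward + residual


class FeatureFusion(nn.Module):
    def __init__(self, num_layers: int, num_queries: int, d_prime: int, num_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(FusionDecoderLayer(d_prime, num_heads, 4 * d_prime)
                                    for _ in range(num_layers))
        # Learnable query vectors of size N' x d', with N' << N, slightly above the expected target count.
        self.queries = nn.Parameter(torch.randn(num_queries, d_prime))

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(feat_a.shape[0], -1, -1)   # (B, N', d')
        for i, layer in enumerate(self.layers):
            # Adjacent decoders connect key/value to (A, B) and (B, A) in opposite order.
            key_src, val_src = (feat_a, feat_b) if i % 2 == 0 else (feat_b, feat_a)
            q = layer(q, key_src, val_src)
        return q                                                        # fused features, (B, N', d')
```

The swap of key_src and val_src between consecutive layers is what realizes the alternation described above, so the learnable queries gather information from both modalities as they pass through the stack.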
As an alternative embodiment, the method further comprises: judging dangerous targets and their directions according to the output target category and target position, and issuing danger warning information.
This embodiment provides a technical scheme for danger warning. Danger warning is a post-processing step: dangerous targets are identified from the target category and target position output by the prediction module, the direction (and distance) of each target relative to the user is computed, and finally warning information is delivered to the user through a voice module to draw attention or prompt avoidance.
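As a toy illustration of this post-processing step (the disclosure does not specify the computation in this detail), the sketch below derives a coarse direction and proximity from each detection box; the thresholds, the proximity heuristic and the risky_categories set are hypothetical.

```python
from typing import List, Tuple

Detection = Tuple[str, float, float, float, float]   # (category, x1, y1, x2, y2) in pixels


def danger_warnings(detections: List[Detection],
                    image_width: int,
                    image_height: int,
                    risky_categories: set) -> List[str]:
    """Map each risky detection to a coarse direction and proximity message."""
    messages = []
    for category, x1, y1, x2, y2 in detections:
        if category not in risky_categories:
            continue
        center_x = (x1 + x2) / 2.0
        if center_x < image_width / 3:
            direction = "to the left"
        elif center_x > 2 * image_width / 3:
            direction = "to the right"
        else:
            direction = "straight ahead"
        # Box height relative to the image as a very rough distance proxy:
        # taller boxes usually mean the object is closer to the wearer.
        proximity = "close" if (y2 - y1) > 0.4 * image_height else "farther away"
        messages.append(f"Warning: {category} {direction}, {proximity}")
    return messages


# Example: a utility pole detected in the middle of a 640x640 frame yields
# ["Warning: utility pole straight ahead, close"]
print(danger_warnings([("utility pole", 300, 100, 380, 600)], 640, 640, {"utility pole"}))
```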
Fig. 5 is a schematic composition diagram of an object detection apparatus for multi-modal image fusion according to an embodiment of the present invention, where the apparatus includes:
the image acquisition module 11, used for acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting them into a target detection model built from Transformers;
the feature extraction module 12, configured to perform global feature extraction on the video image and the infrared image respectively by using a feature encoding module composed of Transformer encoders;
the feature fusion module 13, configured to fuse the extracted video image features and infrared image features by using a feature fusion module composed of Transformer decoders;
and the target prediction module 14, used for inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers and outputting the target category and the target position.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again. The same applies to the following embodiments, which are not further described.
As an alternative embodiment, the apparatus further comprises a vector embedding module configured to:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
As an optional embodiment, the feature encoding module is formed by stacking Transformer encoders, where each Transformer encoder comprises a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result. The multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
As an optional embodiment, the feature fusion module is formed by stacking Transformer decoders, where each Transformer decoder comprises a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer. The query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, while the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively. The key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
As an optional embodiment, the device further includes a danger early-warning module, configured to determine dangerous targets and their directions according to the target category and the target position, and issue danger warning information.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A target detection method based on multi-modal image fusion is characterized by comprising the following steps:
acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting the video image and the infrared image into a target detection model built from Transformers;
respectively extracting global features of the video image and the infrared image by using a feature encoding module composed of Transformer encoders;
fusing the extracted video image features and the infrared image features by using a feature fusion module composed of Transformer decoders;
and inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers, and outputting the target category and the target position.
2. The method for target detection based on multi-modal image fusion as claimed in claim 1, further comprising the following operations respectively performed on the input video image and the infrared image before performing the global feature extraction:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
3. The method of claim 2, wherein the feature encoding module is formed by stacking Transformer encoders, each Transformer encoder comprising a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer; the N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result; and the multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
4. The method of claim 3, wherein the feature fusion module is formed by stacking Transformer decoders, each Transformer decoder comprising a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer; the query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, and the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, and the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively; the key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
5. The method for target detection based on multi-modal image fusion as claimed in claim 1, further comprising: judging dangerous targets and their directions according to the target category and the target position, and issuing danger warning information.
6. A target detection device based on multi-modal image fusion, comprising:
the image acquisition module, used for acquiring, in real time, a video image and an infrared image captured respectively by a video camera and an infrared camera, and inputting the video image and the infrared image into a target detection model built from Transformers;
the feature extraction module, used for respectively extracting global features of the video image and the infrared image by utilizing a feature encoding module composed of Transformer encoders;
the feature fusion module, used for fusing the extracted video image features and the infrared image features by utilizing a feature fusion module composed of Transformer decoders;
and the target prediction module, used for inputting the fused features of the video image and the infrared image into a prediction module composed of Transformer fully-connected layers and outputting the target category and the target position.
7. The multi-modal image fusion based object detection apparatus of claim 6, further comprising a vector embedding module configured to:
cutting the image into N slices;
flattening each slice along the channel dimension and inputting it to a linear fully-connected layer to obtain a d-dimensional vector;
and calculating sine and cosine position codes along the slice row and column directions, and adding them to the output of the linear fully-connected layer to obtain an N×d encoding matrix.
8. The apparatus of claim 7, wherein the feature encoding module is formed by stacking Transformer encoders, each Transformer encoder comprising a multi-head self-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer; the N×d encoding matrix of a video image or infrared image input to the multi-head self-attention module undergoes three different linear transformations to obtain a query vector, a key vector and a value vector of size N×d'; the similarity between the query vector and the key vector is computed as a scaled vector dot product and normalized by a softmax function to obtain an attention weight matrix, and multiplying the weight matrix by the value vector yields one attention result; and the multi-head attention results are concatenated and mapped back to the original dimension d' to obtain the feature encoding of the video image or the infrared image.
9. The apparatus of claim 8, wherein the feature fusion module is formed by stacking Transformer decoders, each Transformer decoder comprising a multi-head self-attention module layer, a multi-head mutual-attention module layer and a feed-forward network layer, with a normalization layer and a residual unit connected to each layer; the query vector Q_i of the multi-head mutual-attention module layer of the i-th Transformer decoder comes from the output of the multi-head self-attention module layer, and the key vector K_i and the value vector V_i come respectively from a video image feature A and an infrared image feature B output by the feature encoding module; the query vector Q_{i+1} of the multi-head mutual-attention module layer of the (i+1)-th Transformer decoder comes from the output of the multi-head self-attention module layer, and the key vector K_{i+1} and the value vector V_{i+1} come from B and A, respectively; the key vector K_i and the value vector V_i are both N×d' matrices, and the query vector Q_i is an N'×d' matrix, N' < N; i = 1, 2, ….
10. The target detection device based on multi-modal image fusion as claimed in claim 6, further comprising a danger early-warning module for judging dangerous targets and their directions according to the target category and the target position and issuing danger warning information.
CN202210137919.XA 2022-02-15 2022-02-15 Target detection method and device based on multi-modal image fusion Pending CN114694001A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210137919.XA CN114694001A (en) 2022-02-15 2022-02-15 Target detection method and device based on multi-modal image fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210137919.XA CN114694001A (en) 2022-02-15 2022-02-15 Target detection method and device based on multi-modal image fusion

Publications (1)

Publication Number Publication Date
CN114694001A true CN114694001A (en) 2022-07-01

Family

ID=82137295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210137919.XA Pending CN114694001A (en) 2022-02-15 2022-02-15 Target detection method and device based on multi-modal image fusion

Country Status (1)

Country Link
CN (1) CN114694001A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596614A (en) * 2022-03-03 2022-06-07 清华大学 Anti-photo attack face recognition system and method
CN115240042A (en) * 2022-07-05 2022-10-25 抖音视界有限公司 Multi-modal image recognition method and device, readable medium and electronic equipment
CN115205179A (en) * 2022-07-15 2022-10-18 小米汽车科技有限公司 Image fusion method and device, vehicle and storage medium
CN116246213A (en) * 2023-05-08 2023-06-09 腾讯科技(深圳)有限公司 Data processing method, device, equipment and medium
CN116246213B (en) * 2023-05-08 2023-07-28 腾讯科技(深圳)有限公司 Data processing method, device, equipment and medium
CN116401961A (en) * 2023-06-06 2023-07-07 广东电网有限责任公司梅州供电局 Method, device, equipment and storage medium for determining pollution grade of insulator
CN116401961B (en) * 2023-06-06 2023-09-08 广东电网有限责任公司梅州供电局 Method, device, equipment and storage medium for determining pollution grade of insulator
CN116740662A (en) * 2023-08-15 2023-09-12 贵州中南锦天科技有限责任公司 Axle recognition method and system based on laser radar
CN116740662B (en) * 2023-08-15 2023-11-21 贵州中南锦天科技有限责任公司 Axle recognition method and system based on laser radar
CN117726991A (en) * 2024-02-07 2024-03-19 金钱猫科技股份有限公司 High-altitude hanging basket safety belt detection method and terminal
CN117726991B (en) * 2024-02-07 2024-05-24 金钱猫科技股份有限公司 High-altitude hanging basket safety belt detection method and terminal

Similar Documents

Publication Publication Date Title
CN114694001A (en) Target detection method and device based on multi-modal image fusion
CN112801027B (en) Vehicle target detection method based on event camera
CN111523378B (en) Human behavior prediction method based on deep learning
CN112200057A (en) Face living body detection method and device, electronic equipment and storage medium
CN112288776B (en) Target tracking method based on multi-time step pyramid codec
CN116758130A (en) Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion
CN114266938A (en) Scene recognition method based on multi-mode information and global attention mechanism
CN116468714A (en) Insulator defect detection method, system and computer readable storage medium
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN115393404A (en) Double-light image registration method, device and equipment and storage medium
CN115908896A (en) Image identification system based on impulse neural network with self-attention mechanism
CN116092178A (en) Gesture recognition and tracking method and system for mobile terminal
CN114170304B (en) Camera positioning method based on multi-head self-attention and replacement attention
CN111898671B (en) Target identification method and system based on fusion of laser imager and color camera codes
Jing et al. SmokeSeger: A Transformer-CNN coupled model for urban scene smoke segmentation
CN116402811B (en) Fighting behavior identification method and electronic equipment
CN116258756B (en) Self-supervision monocular depth estimation method and system
CN117115855A (en) Human body posture estimation method and system based on multi-scale transducer learning rich visual features
Xiong et al. MLP-Pose: Human pose estimation by MLP-mixer
CN114399628B (en) Insulator high-efficiency detection system under complex space environment
CN115331301A (en) 6D attitude estimation method based on Transformer
CN116311493A (en) Two-stage human-object interaction detection method based on coding and decoding architecture
CN115100680A (en) Pedestrian detection method based on multi-source image fusion
CN114648755A (en) Text detection method for industrial container in light-weight moving state
CN114596614A (en) Anti-photo attack face recognition system and method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination