CN116452937A - Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism - Google Patents

Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism

Info

Publication number
CN116452937A
Authority
CN
China
Prior art keywords
feature
convolution
attention
module
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310454888.5A
Other languages
Chinese (zh)
Inventor
许国良
王钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202310454888.5A priority Critical patent/CN116452937A/en
Publication of CN116452937A publication Critical patent/CN116452937A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-modal feature target detection method based on dynamic convolution and an attention mechanism, and belongs to the field of image recognition. In the method, two data streams are provided at the initial stage of the YOLOv5 Backbone, taking visible light images and infrared images as inputs respectively, and a dynamic convolution module ODConv, a multispectral convolutional attention feature fusion module MS-CBAM and a residual network are used to perform feature extraction. The advantage of the invention is that the features of the visible light image and the infrared image are combined, and by combining multiple attention mechanisms and architectures the detection accuracy for multi-modal and small targets is greatly improved, solving the problem of weak target detection performance in dim environments. Compared with other multi-modal fusion target detection methods, the method trains faster and consumes fewer hardware resources.

Description

Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
Technical Field
The invention belongs to the field of image recognition, and relates to a multi-modal feature target detection method based on dynamic convolution and an attention mechanism.
Background
Target detection is a very important technology in computer vision tasks, and its performance directly affects the detection accuracy and operating efficiency of related tasks. Accordingly, the field has long received attention from both academia and industry. The target detection discussed in the present invention aims to improve overall network performance by using new modality data and new modality fusion methods. For example, at night a traffic system is likely to face a shortage of light for its surveillance video, making it difficult to photograph most illegal behaviors from a single-spectrum data source or to provide monitoring functions such as automatic alarms for pedestrians, vehicles and traffic accidents. Infrared images captured by an infrared camera can complement the visible light features of objects such as vehicles and pedestrians at night, and can greatly improve detection accuracy for nighttime targets. Therefore, how to use large amounts of multispectral image data to improve the performance of target recognition and detection models is a task of great research value and challenge. A multi-modal feature fusion dual-stream neural network integrates information from the two different modalities into a deep learning network, greatly improving training precision and accuracy on such problems. However, the receptive field of existing CNN convolutions can only fuse information within a local area, a dual-stream convolutional neural network cannot fully exploit the complementarity between modalities, and simply superimposing feature maps increases the learning difficulty of the network and aggravates modal imbalance, so performance degrades. The invention modifies the existing YOLOv5 neural network model and introduces improved channel attention, spatial attention and dynamic convolution to form a modal fusion module, so that the two modalities can be more fully cross-modally fused, learned and predicted under multiple kinds of attention. Meanwhile, the NWD localization loss function is used to enhance small target detection accuracy.
Disclosure of Invention
It is therefore an object of the present invention to provide a method for multi-modal feature object detection based on dynamic convolution and attention mechanisms.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A multi-modal feature target detection method based on dynamic convolution and an attention mechanism comprises the following steps:
S1: establishing a neural network model of a dual-stream convolution detection network based on YOLOv5, wherein the Backbone adopts convolution operations and feature fusion modules to perform modal fusion and feature learning;
S2: constructing a multispectral module MS-CBAM from channel attention and spatial attention, wherein channel attention performs feature weighting on the visible light and infrared feature maps separately, the infrared and visible light feature maps are then stacked together, spatial attention performs feature weighting on the stacked feature map, and a residual network then refines the features;
S3: introducing a multi-head attention mechanism into the convolution structure, and establishing the dynamic convolution module ODConv by assigning different attention coefficient matrices to the convolution along the four dimensions of input channel, output channel, space and convolution kernel;
S4: placing the MS-CBAM module at the position outputting the larger 80×80×256 feature map, and placing the ODConv modules at the positions outputting the medium and small 40×40×512 and 20×20×1024 feature maps; the three feature maps of different sizes enter the Neck layer, i.e. the feature pyramid, for further feature extraction, and predictions are made on the output features to produce the detection result;
S5: in the training stage, the visible light and infrared data enter the dual-stream neural network for training after Mosaic data augmentation, adaptive anchor box computation and adaptive image scaling; the network is initialized with YOLOv5l pre-trained weights, and its parameters are learned with stochastic gradient descent;
in the prediction stage, a softmax classifier is used to obtain the final classification probability of the category to which the target belongs;
in the optimization stage, the error between the ground truth and the prediction is reduced by jointly optimizing the localization loss, classification loss and confidence loss, and NWD is introduced into the localization loss to improve small target detection accuracy; step S5 is repeated until the number of iterations reaches the preset value, at which point model training is complete and the model can perform target detection tasks.
Optionally, in the step S1, the input of the dual-flow convolutional target detection network frame based on YOLO v5 is an image pair of different modes, and the backup is a dual-flow convolutional network, and the dual-flow neural network model includes Backbone, neck and a prediction layer;
let the input visible light feature map be X_V and the input infrared feature map be X_T, where the height, width and channel number of the feature maps are H, W and C respectively;
the feature extraction network uses three feature fusion modules and a residual network to form a three-cycle feature extraction and refinement structure, and the ith feature fusion is computed as:
f^i = σ(F(X_V, X_T))
wherein σ is a feature fusion function, X_V is the visible light input feature map, X_T is the infrared input feature map, and F is the feature fusion module performing the batch normalization operation; the fused feature map f^i has height H, width W and channel number 2C. A residual network is then constructed by combining the fused feature with the original features:
f_v^i = X_V + f^i
f_t^i = X_T + f^i
obtaining the new visible light and infrared feature maps f_v^i and f_t^i.
Optionally, in step S2, channel attention is computed separately on the visible light and infrared input feature maps, the two maps are then stacked along the channel dimension, and the stacked feature map is fed into the spatial attention for further processing;
the computation of the MS-CBAM module is expressed as:
X = M_S[concat[M_C(X_V), M_C(X_T)]]
wherein M_C denotes the channel attention mechanism, M_S denotes the spatial attention mechanism, and concat denotes stacking the feature maps along the channel dimension;
the residual network constructed from X then refines the features, expressed as:
X'_V = X_V + X
X'_T = X_T + X
and the finally obtained feature maps X'_V ∈ V^(B×C×H×W) and X'_T ∈ T^(B×C×H×W) represent the final output of the MS-CBAM module.
Optionally, in step S3, a multi-head self-attention mechanism is introduced into the convolution process, and ODConv assigns different attention coefficient matrices to the convolution along the four dimensions of input channel, output channel, space and convolution kernel, improving the feature extraction capability; the overall operation of the ODConv module is expressed as:
X' = ODConv(concat(X_V, X_T))
wherein X_V and X_T are the input feature maps of the visible light and infrared modalities respectively, concat denotes stacking the two inputs along the channel dimension, and ODConv denotes the dynamic convolution operation;
the dynamic convolution formula integrating the four dimensions is:
y = (α_w1 ⊙ α_f1 ⊙ α_c1 ⊙ α_s1 ⊙ W_1 + ... + α_wn ⊙ α_fn ⊙ α_cn ⊙ α_sn ⊙ W_n) * x
wherein α_wi is the attention coefficient matrix of the convolution kernel W_i, and α_si, α_ci and α_fi are the dynamic convolution attention coefficient matrices along the spatial dimension, input channel dimension and output channel dimension of W_i respectively; ⊙ denotes element-wise multiplication along the corresponding dimension of the kernel space, and * denotes the convolution operation.
Optionally, the input and output of the MS-CBAM module and the ODConv module are both visible light and infrared light characteristic diagrams, and the output and the input form a residual error network;
the loss functions of the positioning loss, the classification loss and the confidence loss are expressed as follows:
L total =L box +L cls +L conf
wherein the positioning loss adopts an NWD loss function; the NWD loss function calculates the similarity by introducing Normalized Wasserstein Distance calculation method through the corresponding gaussian distribution.
The invention has the beneficial effects that it effectively improves target detection when the whole image, or part of it, is insufficiently lit, and offers advantages in prediction accuracy and reliability when applied to a target detection system.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of the overall architecture of the present invention;
FIG. 2 is a flow chart of the Backbone of the present invention;
FIG. 3 is a block diagram of a dynamic convolution feature fusion module;
FIG. 4 is a block diagram of an MS-CBAM feature fusion module.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other, different embodiments, and the details of the present description may be modified or varied in various ways without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention, and the following embodiments and the features in the embodiments may be combined with each other without conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
The invention provides a target detection method based on dynamic convolution, an attention mechanism and a YOLO v5 dual-stream network, as shown in FIG. 1, comprising the following steps:
step 1: in the step 1, a basic double-flow neural network model is constructed, as shown in fig. 2, YOLO v5 comprises data processing, backbone, neck and a prediction layer, and the invention designs a feature extraction and fusion idea based on dynamic convolution feature fusion module and MS-CBAM multi-mode cyclic fusion and refinement, and the fusion operation is repeated for a plurality of times and residual processing is carried out to increase the consistency of multispectral features.
Step 2: the dynamic convolution ODConv is constructed, which assigns different attention coefficient matrices to the convolution along the four dimensions of input channel, output channel, space and convolution kernel; as shown in FIG. 3, a multi-head attention mechanism and a parallel strategy are used to learn modal-complementary attention over the four dimensions of the kernel space.
Step 3: an MS-CBAM module based on channel attention and spatial attention is constructed, as shown in FIG. 4; the feature map is weighted in the channel dimension and the spatial dimension respectively, and feature refinement is carried out using a residual network.
Step 4: the MS-CBAM module is set to output as a position with a larger characteristic diagram of 80 multiplied by 256, and the ODConv module is set to output as a position with a medium size and a small size of 40 multiplied by 512 and 20 multiplied by 1024. Inputting the feature map into a feature pyramid of YOLO v5, and continuing feature fusion and prediction of YOLO v 5;
Step 5: after the parameters are determined, the neural network is trained on the training samples until the training conditions are met, and the trained network is tested on the test set;
In step 1, the invention predicts with the YOLO v5 Neck and Head layers and establishes a baseline network based on a dual-stream convolutional network, in which the convolutional network extracts the local features of the visible light and infrared modalities separately, and a feature weighting fusion module then performs the feature weighting fusion operation.
First, the visible light and infrared images processed in step 1 each pass through three convolution operations, and the resulting visible light and infrared feature maps are denoted X_V and X_T.
The invention designs a feature fusion operation using a residual network formed with the MS-CBAM module and the ODConv module. As shown in FIG. 2, these modules and the residual network jointly construct the cyclic fusion and refinement scheme for feature extraction and fusion. The feature fusion operation is carried out at the three positions of sizes 80×80×256, 40×40×512 and 20×20×1024 in the YOLO v5 network, that is, the large, medium and small feature maps denoted P3, P4 and P5 in FIG. 2 that enter the feature pyramid. The cyclic fusion and refinement structure of the invention increases the consistency of the multispectral features. In the ith fusion module, to obtain the new fused feature f^i, the visible light image feature X_V and the infrared image feature X_T are combined as:
f^i = σ(F(X_V, X_T))
wherein σ is a feature fusion function and F is the feature fusion module.
To avoid overfitting, the operation F shares weights across all cycles; the fused feature is then combined with the original features to construct a residual network:
f_v^i = X_V + f^i
f_t^i = X_T + f^i
To prevent the vanishing-gradient problem when learning the network parameters and to better perform multispectral feature fusion, an auxiliary semantic segmentation task is used to bring separate information to each refined spectral feature.
The similarity between modalities increases with the number of cycles, and as the similarity between spectral features increases, their consistency increases while their complementarity decreases. Consistency between multispectral features is important, but too much consistency can lead to a sharp rise or fall in the feature values, and superfluous cycles of fusion are meaningless. Experiments show that the feature fusion performance starts to decrease after the fourth cycle, so in practice three cycles are chosen to balance consistency and complementarity.
Meanwhile, the three feature fusion modules feed the three processed feature maps (large, medium and small) into the feature pyramid at their three respective positions, as sketched below.
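The following is a minimal PyTorch-style sketch of the cyclic fusion-and-refinement structure described above; it is an illustrative assumption rather than the patented implementation. The fusion module F (MS-CBAM or ODConv) is assumed to return a map with the same channel count as each modality stream so that the residual addition is shape-compatible, and the batch normalization plays the role of σ:

import torch
import torch.nn as nn

class CyclicFusion(nn.Module):
    """Cyclic fusion and refinement: a shared fusion module F produces a fused map f_i,
    which is added back to each modality stream as a residual (hypothetical sketch)."""
    def __init__(self, channels, fusion_module, num_cycles=3):
        super().__init__()
        self.fusion = fusion_module          # F, shared weights across all cycles
        self.bn = nn.BatchNorm2d(channels)   # sigma: batch normalization of the fused map
        self.num_cycles = num_cycles

    def forward(self, x_v, x_t):
        for _ in range(self.num_cycles):
            f = self.bn(self.fusion(x_v, x_t))   # fused feature f_i, shape (B, C, H, W)
            x_v = x_v + f                        # residual refinement, visible stream
            x_t = x_t + f                        # residual refinement, infrared stream
        return x_v, x_t

Three cycles are used in this sketch, matching the cycle count chosen above.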
Further, in step 2, the dynamic convolution applies multi-head self-attention operations over the dimensions of the convolution kernel. The dynamic convolution layer uses a linear combination of n convolution kernels, dynamically weighted by the attention mechanism, making the convolution operation dependent on the input feature map. The overall operation of ODConv can be expressed as:
X' = ODConv(concat(X_V, X_T))
wherein X_V and X_T are the input feature maps of the visible light and infrared modalities respectively, concat denotes stacking the two inputs along the channel dimension, and ODConv denotes the dynamic convolution operation.
Specifically, in mathematical terms, a dynamic convolution operation in a single dimension can be defined as:
y = (α_w1 W_1 + ... + α_wn W_n) * x
wherein x and y denote the input and output feature map matrices with height h, width w and channel numbers c_in and c_out respectively; W_i denotes the ith convolution kernel composed of the output convolution filters W_i^m, m = 1, ..., c_out; α_wi is the attention scalar of the convolution kernel dimension, computed by an attention function π_wi(x) conditioned on the input features; * denotes the convolution operation, and the bias term is omitted here.
According to the dynamic convolution equation, dynamic convolution has two basic components: the n convolution kernels W_i and the attention function that computes their attention scalars α_wi. The corresponding kernel space has four dimensions: the k×k spatial kernel size, the input channel number c_in and the output channel number c_out of each convolution kernel, and the number of kernels n.
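As an illustration of the kernel-dimension-only case above, the following is a small sketch (an assumption, in PyTorch) of a dynamic convolution that mixes n candidate kernels with input-conditioned attention scalars before applying a single convolution; the class name and the reduction ratio are hypothetical:

import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelDynamicConv(nn.Module):
    """y = (a_1*W_1 + ... + a_n*W_n) * x, with a_i = softmax(pi_w(x)) (illustrative sketch)."""
    def __init__(self, c_in, c_out, k=3, n=4, reduction=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n, c_out, c_in, k, k) * 0.02)  # n candidate kernels
        hidden = max(c_in // reduction, 4)
        self.attn = nn.Sequential(                     # pi_w(x): GAP -> FC -> ReLU -> FC
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c_in, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, n))
        self.k = k

    def forward(self, x):
        b, c, h, w = x.shape
        alpha = F.softmax(self.attn(x), dim=1)                         # (B, n) kernel attention
        kernels = torch.einsum('bn,noikl->boikl', alpha, self.weight)  # per-sample mixed kernel
        kernels = kernels.reshape(b * kernels.size(1), c, self.k, self.k)
        y = F.conv2d(x.reshape(1, b * c, h, w), kernels,
                     padding=self.k // 2, groups=b)                    # per-sample convolution
        return y.reshape(b, -1, h, w)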
The ODConv module in the invention simultaneously considers the convolution kernel dimension, the spatial dimension, the input channel dimension and the output channel dimension, so that the multi-modal feature fusion within the convolution operation is more comprehensive; the formula for each dimension is analogous to the kernel-dimension dynamic convolution. As shown in FIG. 3, the dynamic convolution formula that integrates the four dimensions can be expressed as:
y = (α_w1 ⊙ α_f1 ⊙ α_c1 ⊙ α_s1 ⊙ W_1 + ... + α_wn ⊙ α_fn ⊙ α_cn ⊙ α_sn ⊙ W_n) * x
wherein α_wi is the attention coefficient matrix of the convolution kernel W_i, and α_si, α_ci and α_fi are the dynamic convolution attention coefficient matrices along the spatial dimension, input channel dimension and output channel dimension of W_i respectively; ⊙ denotes element-wise multiplication along the corresponding dimension of the kernel space.
Here α_si assigns a different attention scalar to each convolution filter at the k×k spatial locations; α_ci assigns different attention values to the c_in channels of each convolution filter W_i^m; α_fi assigns different attention values to the c_out output channels of each convolution kernel W_i; and α_wi assigns an attention scalar to the whole convolution kernel. Multiplying the attention coefficient matrices of these four dimensions with the corresponding dimensions of the n convolution kernels yields the output of the module.
Specifically, the input x is first compressed by a global average pooling operation into a feature vector of length c_in, which then passes through a fully connected layer and a ReLU unit; the fully connected layer maps the compressed feature vector into a low-dimensional space with reduction rate r. Four branches follow, one per dimension, with FC layers of output sizes k×k, c_in×1, c_out×1 and n×1 respectively, each followed by a Softmax or Sigmoid function to generate the normalized attention coefficient matrices α_si, α_ci, α_fi and α_wi.
Since these four dimensions are complementary, they capture rich contextual cues; ODConv can therefore significantly enhance the feature extraction capability of the basic CNN convolution operation.
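The following is a sketch (an assumption, not the patented code) of the four-branch attention head just described: global average pooling compresses the input to a c_in-length vector, an FC+ReLU projects it to a low-dimensional space with reduction rate r, and four FC branches emit the spatial, input-channel, output-channel and kernel-wise attention coefficients; the class name and the normalization choice per branch are hypothetical:

import torch
import torch.nn as nn

class ODAttention(nn.Module):
    """Generates the four ODConv attention coefficient sets alpha_s, alpha_c, alpha_f, alpha_w."""
    def __init__(self, c_in, c_out, k=3, n=4, r=16):
        super().__init__()
        hidden = max(c_in // r, 4)
        self.squeeze = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(c_in, hidden), nn.ReLU(inplace=True))
        self.fc_spatial = nn.Linear(hidden, k * k)   # alpha_s: k x k spatial positions
        self.fc_in = nn.Linear(hidden, c_in)         # alpha_c: input channels
        self.fc_out = nn.Linear(hidden, c_out)       # alpha_f: output channels
        self.fc_kernel = nn.Linear(hidden, n)        # alpha_w: n candidate kernels

    def forward(self, x):
        z = self.squeeze(x)                                   # (B, hidden) compressed descriptor
        a_s = torch.sigmoid(self.fc_spatial(z))               # Sigmoid-normalized coefficients
        a_c = torch.sigmoid(self.fc_in(z))
        a_f = torch.sigmoid(self.fc_out(z))
        a_w = torch.softmax(self.fc_kernel(z), dim=1)         # Softmax over the n kernels
        return a_s, a_c, a_f, a_w

These coefficients would then be broadcast onto the corresponding dimensions of the n convolution kernels before the convolution is applied, as in the formula above.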
Further, in step 3, an MS-CBAM module based on channel attention and spatial attention is established, weighting is performed on the feature map in the channel dimension and the spatial dimension respectively, and feature refinement is performed by using a residual network.
For the input feature maps X_V ∈ V^(B×C×H×W) and X_T ∈ T^(B×C×H×W), where V denotes the visible image, T denotes the infrared image, B denotes the batch size, C denotes the channel number, and H and W denote the height and width of the feature map in pixels, the computation of the MS-CBAM module can be expressed as:
X = M_S[concat[M_C(X_V), M_C(X_T)]]
wherein M_C denotes the channel attention mechanism, M_S denotes the spatial attention mechanism, concat denotes stacking the feature maps along the channel dimension, and X is the module output. Feature weighting in the channel and spatial dimensions is performed by channel attention and spatial attention, which mitigates the adverse effect of relying on a single type of pooling operation and improves the accuracy of the neural network.
The channel attention module (Channel Attention Module, CAM) improves the representational capacity of the feature map by learning the interactions between channels. Specifically, the channel attention module first performs maximum pooling and average pooling over each channel of the input feature map to obtain the max-pooled and average-pooled feature vectors. These are then taken as input, the weight of each channel is obtained through two fully connected layers and a Sigmoid function, and the channel weights are multiplied with the original feature map to obtain the weighted feature map. The channel attention mechanism can be expressed as:
M_C(X) = Sigmoid(MLP(AvgPool(X)) + MLP(MaxPool(X)))
wherein AvgPool and MaxPool denote average pooling and maximum pooling respectively.
The spatial attention module (Spatial Attention Module, SAM) improves the representational capacity of the feature map by learning the interactions between pixels. The input of this module is the feature map output by the channel attention module. The spatial attention module first performs maximum pooling and average pooling on the input feature map to obtain the max-pooled and average-pooled maps, concatenates the two maps, obtains the weight of each pixel through a convolution layer and a Sigmoid function, and multiplies the pixel weights with the original feature map to obtain the weighted feature map. Concretely, average pooling and maximum pooling are applied along the channel dimension of the stacked visible light and infrared feature map, producing two single-channel maps; these are concatenated in the channel dimension, the concatenated map is reduced to one channel by a 7×7 convolution, and a Sigmoid activation then generates the spatial attention features.
And finally, multiplying the output features of the spatial attention with the input features element by element to obtain the finally generated features. The spatial attention mechanism can be expressed as:
in the method, in the process of the invention,and->Mean pooling and maximum pooling are indicated, respectively.
The invention applies channel attention and spatial attention, and then refines the features with the residual network constructed from X; the process can be expressed as:
X'_V = X_V + X
X'_T = X_T + X
The finally obtained feature maps X'_V ∈ V^(B×C×H×W) and X'_T ∈ T^(B×C×H×W) represent the final output of the MS-CBAM module.
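The following is a minimal sketch (a PyTorch assumption, not the patented implementation) of the MS-CBAM fusion described above: per-modality channel attention, channel-wise concatenation, spatial attention, and a residual connection back to each stream. A 1×1 convolution is assumed here to map the 2C-channel fused map back to C channels so that the residual addition is shape-compatible:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(inplace=True), nn.Linear(c // r, c))

    def forward(self, x):
        b, c, _, _ = x.shape
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        return x * w.view(b, c, 1, 1)                  # channel-weighted feature map

class SpatialAttention(nn.Module):
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))    # pixel-weighted feature map

class MSCBAM(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.ca_v, self.ca_t = ChannelAttention(c), ChannelAttention(c)
        self.sa = SpatialAttention()
        self.proj = nn.Conv2d(2 * c, c, 1)             # assumed 2C -> C projection for the residual

    def forward(self, x_v, x_t):
        fused = self.sa(torch.cat([self.ca_v(x_v), self.ca_t(x_t)], dim=1))
        fused = self.proj(fused)
        return x_v + fused, x_t + fused                # X'_V, X'_T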
Further, in step 4, the feature maps at the three positions P3, P4 and P5 in FIG. 2, whose sizes H×W×C are 80×80×256, 40×40×512 and 20×20×1024, undergo multi-modal feature fusion with MS-CBAM, ODConv and ODConv respectively, and the three large, medium and small feature maps are then input into the YOLO v5 Neck feature pyramid for further feature fusion and extraction.
In step 5, the loss function is divided into localization loss, classification loss and confidence loss, which can be expressed as:
L_total = L_box + L_cls + L_conf
wherein the localization loss uses NWD and the other losses use the YOLO v5 default loss functions.
the NWD uses a measurement mode based on Wasserstein distance, so that the detection performance of a small target is greatly improved.
For small objects there will always be some background pixels in the bounding box, since a real object is never exactly a rectangle. Within the bounding box, foreground pixels are typically concentrated at the center and background pixels near the edges. To better weight each pixel in the bounding box, the bounding box can be modeled as a 2D Gaussian distribution. Specifically, for a horizontal bounding box R = (cx, cy, w, h), the inscribed ellipse can be expressed as:
(x − μ_x)² / σ_x² + (y − μ_y)² / σ_y² = 1
wherein (μ_x, μ_y) is the center point of the ellipse and (σ_x, σ_y) are the semi-axis lengths along the x and y axes. For the bounding box this corresponds to:
μ_x = cx, μ_y = cy, σ_x = w/2, σ_y = h/2
the probability density function of the 2D gaussian distribution is:
wherein X, μ, Σ represent coordinates (X, y), mean and variance, respectively. When:
this ellipse is one distribution profile of a 2D gaussian distribution. Thus, the horizontal bounding box r= (cx, cy, w, h) can be modeled as a 2D gaussian distribution:
in this way, the similarity between two bounding boxes can be expressed in terms of the distance between the two gaussian distributions.
Next, the invention uses the Wasserstein distance from optimal transport theory to compute the distance between the two distributions. For two 2D Gaussian distributions μ_1 = N(m_1, Σ_1) and μ_2 = N(m_2, Σ_2), their 2nd-order Wasserstein distance can be defined as:
W_2²(μ_1, μ_2) = ||m_1 − m_2||_2² + ||Σ_1^(1/2) − Σ_2^(1/2)||_F²
wherein ||·||_F denotes the Frobenius norm. For two bounding boxes A = (cx_a, cy_a, w_a, h_a) and B = (cx_b, cy_b, w_b, h_b) modeled as Gaussian distributions N_a and N_b, this reduces to:
W_2²(N_a, N_b) = ||[cx_a, cy_a, w_a/2, h_a/2]^T − [cx_b, cy_b, w_b/2, h_b/2]^T||_2²
However, this is a distance metric and cannot be used directly as a similarity. A normalized exponential is applied to obtain a new metric, the Normalized Wasserstein Distance:
NWD(N_a, N_b) = exp(−√(W_2²(N_a, N_b)) / C)
wherein C is a constant related to the dataset.
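A brief sketch (an assumption in PyTorch, not the patented code) of the NWD similarity between two horizontal boxes in (cx, cy, w, h) format follows: each box is modeled as a 2D Gaussian, the 2nd-order Wasserstein distance is computed in closed form, and the exponential normalizes it into a (0, 1] similarity; using 1 − NWD as the localization loss term is one possible choice:

import torch

def nwd(box_a, box_b, C):
    """Normalized Wasserstein Distance between boxes in (cx, cy, w, h) format.
    C is the dataset-related constant from the formula above (must be chosen per dataset)."""
    ga = torch.stack([box_a[..., 0], box_a[..., 1], box_a[..., 2] / 2, box_a[..., 3] / 2], dim=-1)
    gb = torch.stack([box_b[..., 0], box_b[..., 1], box_b[..., 2] / 2, box_b[..., 3] / 2], dim=-1)
    w2 = torch.norm(ga - gb, dim=-1)          # W_2, the square root of W_2^2
    return torch.exp(-w2 / C)                 # similarity in (0, 1]

def nwd_loss(pred_boxes, target_boxes, C=12.8):   # C = 12.8 is only a placeholder assumption
    return (1.0 - nwd(pred_boxes, target_boxes, C)).mean()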
The constructed model is then trained on the input dataset; after each epoch, the model parameters of the current epoch are stored and its classification accuracy is compared with that of the previous best model. When the set maximum number of epochs is reached, the target recognition model with the best recognition accuracy is output. The trained model can detect and identify objects in poorly lit environments, including people, animals, cars, other vehicles and obstacles.
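The epoch loop just described could look like the following sketch (an assumption; the evaluate_fn helper, loader format and file names are hypothetical): after each epoch the current weights are saved, and the checkpoint with the best accuracy so far is kept until the maximum epoch is reached:

import torch

def train(model, train_loader, val_loader, optimizer, criterion, evaluate_fn, max_epochs, device="cuda"):
    best_acc = 0.0
    for epoch in range(max_epochs):
        model.train()
        for (rgb, ir), targets in train_loader:             # paired visible/infrared batches
            optimizer.zero_grad()
            loss = criterion(model(rgb.to(device), ir.to(device)), targets.to(device))
            loss.backward()
            optimizer.step()
        acc = evaluate_fn(model, val_loader, device)        # user-supplied accuracy/mAP evaluation
        torch.save(model.state_dict(), f"epoch_{epoch}.pt") # store current-epoch parameters
        if acc > best_acc:                                  # compare with previous best model
            best_acc = acc
            torch.save(model.state_dict(), "best.pt")
    return best_acc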
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (6)

1. A multi-modal feature target detection method based on dynamic convolution and an attention mechanism, characterized by comprising the following steps:
S1: establishing a neural network model of a dual-stream convolution detection network based on YOLOv5, wherein the Backbone adopts convolution operations and feature fusion modules to perform modal fusion and feature learning;
S2: constructing a multispectral module MS-CBAM from channel attention and spatial attention, wherein channel attention performs feature weighting on the visible light and infrared feature maps separately, the infrared and visible light feature maps are then stacked together, spatial attention performs feature weighting on the stacked feature map, and a residual network then refines the features;
S3: introducing a multi-head attention mechanism into the convolution structure, and establishing the dynamic convolution module ODConv by assigning different attention coefficient matrices to the convolution along the four dimensions of input channel, output channel, space and convolution kernel;
S4: placing the MS-CBAM module at the position outputting the larger 80×80×256 feature map, and placing the ODConv modules at the positions outputting the medium and small 40×40×512 and 20×20×1024 feature maps; the three feature maps of different sizes enter the Neck layer, i.e. the feature pyramid, for further feature extraction, and predictions are made on the output features to produce the detection result;
S5: in the training stage, the visible light and infrared data enter the dual-stream neural network for training after Mosaic data augmentation, adaptive anchor box computation and adaptive image scaling; the network is initialized with YOLOv5l pre-trained weights, and its parameters are learned with stochastic gradient descent;
in the prediction stage, a softmax classifier is used to obtain the final classification probability of the category to which the target belongs;
in the optimization stage, the error between the ground truth and the prediction is reduced by jointly optimizing the localization loss, classification loss and confidence loss, and NWD is introduced into the localization loss to improve small target detection accuracy; step S5 is repeated until the number of iterations reaches the preset value, at which point model training is complete and the model can perform target detection tasks.
2. The multi-modal feature target detection method based on dynamic convolution and an attention mechanism according to claim 1, characterized in that: in step S1, the input of the YOLOv5-based dual-stream convolutional target detection network is an image pair of different modalities, the Backbone is a dual-stream convolutional network, and the dual-stream neural network model comprises the Backbone, the Neck and a prediction layer;
let the input visible light feature map be X_V and the input infrared feature map be X_T, where the height, width and channel number of the feature maps are H, W and C respectively;
the feature extraction network uses three feature fusion modules and a residual network to form a three-cycle feature extraction and refinement structure, and the ith feature fusion is computed as:
f^i = σ(F(X_V, X_T))
wherein σ is a feature fusion function, X_V is the visible light input feature map, X_T is the infrared input feature map, and F is the feature fusion module performing the batch normalization operation; the fused feature map f^i has height H, width W and channel number 2C; a residual network is then constructed by combining the fused feature with the original features:
f_v^i = X_V + f^i
f_t^i = X_T + f^i
obtaining the new visible light and infrared feature maps f_v^i and f_t^i.
3. The multi-modal feature target detection method based on dynamic convolution and an attention mechanism according to claim 2, characterized in that: in step S2, channel attention is computed separately on the visible light and infrared input feature maps, the two maps are then stacked along the channel dimension, and the stacked feature map is fed into the spatial attention for further processing;
the computation of the MS-CBAM module is expressed as:
X = M_S[concat[M_C(X_V), M_C(X_T)]]
wherein M_C denotes the channel attention mechanism, M_S denotes the spatial attention mechanism, and concat denotes stacking the feature maps along the channel dimension;
the residual network constructed from X then refines the features, expressed as:
X'_V = X_V + X
X'_T = X_T + X
and the finally obtained feature maps X'_V ∈ V^(B×C×H×W) and X'_T ∈ T^(B×C×H×W) represent the final output of the MS-CBAM module.
4. The multi-modal feature target detection method based on dynamic convolution and an attention mechanism according to claim 3, characterized in that: in step S3, a multi-head self-attention mechanism is introduced into the convolution process, and ODConv assigns different attention coefficient matrices to the convolution along the four dimensions of input channel, output channel, space and convolution kernel, improving the feature extraction capability; the overall operation of the ODConv module is expressed as:
X' = ODConv(concat(X_V, X_T))
wherein X_V and X_T are the input feature maps of the visible light and infrared modalities respectively, concat denotes stacking the two inputs along the channel dimension, and ODConv denotes the dynamic convolution operation;
the dynamic convolution formula integrating the four dimensions is:
y = (α_w1 ⊙ α_f1 ⊙ α_c1 ⊙ α_s1 ⊙ W_1 + ... + α_wn ⊙ α_fn ⊙ α_cn ⊙ α_sn ⊙ W_n) * x
wherein α_wi is the attention coefficient matrix of the convolution kernel W_i, and α_si, α_ci and α_fi are the dynamic convolution attention coefficient matrices along the spatial dimension, input channel dimension and output channel dimension of W_i respectively; ⊙ denotes element-wise multiplication along the corresponding dimension of the kernel space, and * denotes the convolution operation.
5. The multi-modal feature target detection method based on dynamic convolution and an attention mechanism according to claim 4, characterized in that: the inputs and outputs of the MS-CBAM module and the ODConv module are both visible light and infrared feature maps, and the output and the input form a residual network;
the total loss combining the localization loss, classification loss and confidence loss is expressed as:
L_total = L_box + L_cls + L_conf
wherein the localization loss adopts the NWD loss function, which introduces the Normalized Wasserstein Distance to compute similarity between the corresponding Gaussian distributions.
6. The multi-modal feature target detection method based on dynamic convolution and an attention mechanism according to claim 5, characterized in that: the NWD loss function is expressed as:
NWD(N_a, N_b) = exp(−√(W_2²(N_a, N_b)) / C)
wherein W_2²(N_a, N_b) is the Wasserstein distance between the Gaussian distributions N_a and N_b corresponding to the two bounding boxes, and C is a fixed constant related to the dataset, thereby improving the detection performance for small targets.
CN202310454888.5A 2023-04-25 2023-04-25 Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism Pending CN116452937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310454888.5A CN116452937A (en) 2023-04-25 2023-04-25 Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310454888.5A CN116452937A (en) 2023-04-25 2023-04-25 Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism

Publications (1)

Publication Number Publication Date
CN116452937A true CN116452937A (en) 2023-07-18

Family

ID=87125416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310454888.5A Pending CN116452937A (en) 2023-04-25 2023-04-25 Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism

Country Status (1)

Country Link
CN (1) CN116452937A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665176A (en) * 2023-07-21 2023-08-29 石家庄铁道大学 Multi-task network road target detection method for vehicle automatic driving
CN116665176B (en) * 2023-07-21 2023-09-26 石家庄铁道大学 Multi-task network road target detection method for vehicle automatic driving
CN116883825A (en) * 2023-07-26 2023-10-13 南京信息工程大学 Underwater target detection method combining multi-mode data fusion and multiplexing
CN116883825B (en) * 2023-07-26 2024-08-02 南京信息工程大学 Underwater target detection method combining multi-mode data fusion and Multiplemix
CN116977880A (en) * 2023-08-25 2023-10-31 内蒙古农业大学 Grassland rat hole detection method based on unmanned aerial vehicle image
CN117690161A (en) * 2023-12-12 2024-03-12 上海工程技术大学 Pedestrian detection method, device and medium based on image fusion
CN117690161B (en) * 2023-12-12 2024-06-04 上海工程技术大学 Pedestrian detection method, device and medium based on image fusion
CN117893475A (en) * 2023-12-15 2024-04-16 航天科工空天动力研究院(苏州)有限责任公司 High-precision PCB micro defect detection algorithm based on multidimensional attention mechanism
CN117935012A (en) * 2024-01-31 2024-04-26 广东海洋大学 Infrared and visible light image fusion network based on distributed structure
CN117893537A (en) * 2024-03-14 2024-04-16 深圳市普拉托科技有限公司 Decoloring detection method and system for tray surface material
CN117893537B (en) * 2024-03-14 2024-05-28 深圳市普拉托科技有限公司 Decoloring detection method and system for tray surface material
CN118521837A (en) * 2024-07-23 2024-08-20 诺比侃人工智能科技(成都)股份有限公司 Rapid iteration method of intelligent detection model for defects of contact net

Similar Documents

Publication Publication Date Title
CN116452937A (en) Multi-mode characteristic target detection method based on dynamic convolution and attention mechanism
CN111767882B (en) Multi-mode pedestrian detection method based on improved YOLO model
CN110298262B (en) Object identification method and device
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109902806B (en) Method for determining target bounding box of noise image based on convolutional neural network
CN108229468B (en) Vehicle appearance feature recognition and vehicle retrieval method and device, storage medium and electronic equipment
CN114220124A (en) Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN111291809B (en) Processing device, method and storage medium
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN110222718B (en) Image processing method and device
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN113807464A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLO V5
CN110097029B (en) Identity authentication method based on high way network multi-view gait recognition
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN115512251A (en) Unmanned aerial vehicle low-illumination target tracking method based on double-branch progressive feature enhancement
WO2022217434A1 (en) Cognitive network, method for training cognitive network, and object recognition method and apparatus
CN113298032A (en) Unmanned aerial vehicle visual angle image vehicle target detection method based on deep learning
CN116912485A (en) Scene semantic segmentation method based on feature fusion of thermal image and visible light image
CN117372898A (en) Unmanned aerial vehicle aerial image target detection method based on improved yolov8
CN113205103A (en) Lightweight tattoo detection method
CN114170526A (en) Remote sensing image multi-scale target detection and identification method based on lightweight network
CN113870160A (en) Point cloud data processing method based on converter neural network
Barodi et al. An enhanced artificial intelligence-based approach applied to vehicular traffic signs detection and road safety enhancement
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN118196544A (en) Unmanned aerial vehicle small target detection method and system based on information enhancement and feature fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination