CN115410162A - Multi-target detection and tracking algorithm under complex urban road environment - Google Patents
Multi-target detection and tracking algorithm under complex urban road environment
- Publication number
- CN115410162A (application number CN202210862496.8A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- feature
- size
- target
- characteristic diagram
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/54 — Surveillance or monitoring of activities, e.g. for recognising suspicious objects, of traffic, e.g. cars on the road, trains or boats
- G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/26 — Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/62 — Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
- G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
- G06V20/40 — Scenes; Scene-specific elements in video content
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V2201/07 — Target detection
Abstract
The invention discloses a multi-target detection and tracking method for complex urban road environments, comprising the following steps. Step 1: construct a training set and a test set. Step 2: add a feature fusion module layer by layer to the existing DLA34 backbone network to fuse deep and shallow network features of the input image. Step 3: use a Transformer encoding module to extract long-range feature dependencies in the feature map. Step 4: perform further feature fusion and logistic regression processing. Step 5: use a multi-target tracking module to perform target association and tracking, obtaining a tracking feature map with target detection boxes. Step 6: obtain the trained multi-target detection and tracking model. Step 7: input the video data to be detected into the trained multi-target detection and tracking model to obtain a tracking feature map with target detection boxes. The invention can accurately detect and track multiple targets in complex urban road environments and can stably identify targets whose apparent scale changes significantly.
Description
Technical Field
The invention belongs to the technical field of automatic driving, and relates to a traffic target detection and tracking method.
Background
Intelligent transportation has become an important direction for future traffic development, and automatic driving is its typical representative. Automatic driving is a comprehensive technology spanning multiple disciplines and fields: it requires not only the autonomous driving capability of the participating vehicles, but also accurate perception of the complex traffic environment, high-precision maps, navigation and positioning, vehicle dynamics control and other technologies to build a complete vehicle-road cooperative traffic system. In recent years, schemes combining 5G networks with cloud computing can endow traditional infrastructure with road perception capability through advanced artificial intelligence technology, and further improve the environment perception capability of a single vehicle through the Internet of Things and cloud computing. Whether for single-vehicle intelligence or vehicle-road cooperation, sensors are needed to collect information about the external environment. Commonly used sensors include lidar, millimeter-wave radar and cameras; compared with the other sensors, the camera, with its unique cost-effectiveness, has become the preferred visual sensor for environment perception, and camera-based artificial intelligence has become an indispensable key technology for intelligent traffic. Multi-target detection and tracking is therefore of great significance for perceiving complex traffic environments.
First, most traffic-scene images are captured by cameras fixed at high positions, so targets far from the camera are generally small and carry little feature information, while a single frame in a traffic scene often contains many targets of widely varying sizes. The convolutional neural networks commonly used in current research down-sample and encode the image during forward propagation, so the model easily loses small-area targets, which increases the difficulty of capturing them. Second, with the development of deep learning a large number of results have been obtained on multi-target tracking, but because of factors such as changes in target appearance and scale, occlusion, and blur caused by rapid motion, existing tracking algorithms still fall short of the ideal. For multi-target detection and tracking in traffic scenes, the industry commonly combines a target detection algorithm with a two-stage tracking network based on Kalman filtering and the Hungarian algorithm. Such models have several problems: the detection and tracking modules are independent of each other and cannot be trained jointly; tracking performance is bounded by detection accuracy, so training and optimization hit a bottleneck; and targets with large inter-frame displacement cannot be tracked stably.
Disclosure of Invention
The invention aims to provide a multi-target detection and tracking algorithm for complex urban road environments, so as to solve the problems in the prior art that target detection accuracy is low and targets with large inter-frame displacement cannot be tracked stably.
In order to achieve the purpose, the invention provides the following technical scheme:
A multi-target detection and tracking method for complex urban road environments specifically comprises the following steps:
Step 1: select a public data set, apply data enhancement, and construct a training set and a test set;
Step 2: add a feature fusion module layer by layer to the existing DLA34 backbone network to fuse deep and shallow network features of the input image, obtaining two-dimensional feature maps after three feature fusions;
Step 3: based on the feature-fused two-dimensional feature map, use a Transformer encoding module to extract the long-range feature dependencies in the feature map, obtaining a feature map with the dependencies extracted;
Step 4: generate a heat map and target bounding boxes through further feature fusion and logistic regression processing;
Step 5: use a multi-target tracking module for target association and tracking, obtaining a tracking feature map with target detection boxes;
Step 6: train the multi-target detection and tracking model formed by steps 2, 3, 4 and 5 with the training set of step 1 and test it with the test set, finally obtaining the trained multi-target detection and tracking model;
Step 7: input the video data to be detected into the trained multi-target detection and tracking model to obtain a tracking feature map with target detection boxes.
Further, in step 1, VisDrone_mot from the mainstream traffic target detection data set VisDrone is selected as the data set of the present invention.
Further, the step 2 specifically includes the following sub-steps:
step 21, inputting the images in the training set into a DLA34 network, performing convolution operation with convolution kernel of 3 × 3 twice on the original image through a BatchNorm layer and a ReLU layer to obtain two feature maps, inputting the two feature maps after convolution into an aggregation node for feature fusion, and obtaining a feature map with resolution of 1/4 of the original input feature map;
step 22, carrying out 2-time down-sampling on the feature map with the size of 1/4 obtained in the step 21 to obtain a new feature map, repeating the convolution operation and the aggregation operation in the step 21 on the feature map twice to obtain two feature maps, and carrying out the aggregation operation again by taking the aggregation node obtained in the step 21 as a common input to obtain the feature map with the resolution of 1/8 of the original input feature map;
step 23, obtaining a feature map with the size of 1/16 from the feature map with the size of 1/8 according to the same manner of obtaining the feature map with the size of 1/8 from the feature map with the size of 1/4 in the step 22, and obtaining a feature map with the size of 1/32 from the feature map with the size of 1/16;
and step 24, as shown in fig. 2, sequentially adopting a feature fusion module to perform feature fusion on the adjacent feature maps of the obtained feature maps with the size of 1/4, the feature maps with the size of 1/8, the feature maps with the size of 1/16 and the feature maps with the size of 1/32 to respectively obtain new feature maps with the size of 1/4, the size of 1/8 and the size of 1/16.
Further, in step 24, the feature fusion module is configured to implement the following operations:
step 241, performing deformable convolution processing with a convolution kernel of 3 × 3 on the feature map F1, and obtaining a mapped feature map by passing a result obtained by the processing through a BatchNorm layer and a ReLU layer;
step 242, replacing the transposed convolution in the DLA34 backbone network with a direct interpolation upsampling and convolution processing mode, and performing 2-fold upsampling on the feature map obtained after mapping in step 241 to obtain a feature map F1';
step 243, adding the channel values corresponding to the characteristic diagram F1' and the characteristic diagram F2 obtained in the step 242 to obtain a combined characteristic diagram;
step 244, performing 3 × 3 deformable convolution processing on the merged feature map obtained in step 243, and then sequentially passing through the BatchNorm layer and the ReLU layer to obtain a two-dimensional feature map F2';
when the feature map F1 and the feature map F2 are respectively a feature map with the size of 1/4 and a feature map with the size of 1/8, the obtained two-dimensional feature map F2' is a feature map with the size of 1/4;
when the feature map F1 and the feature map F2 are respectively a feature map with the size of 1/8 and a feature map with the size of 1/16, the obtained two-dimensional feature map F2' is a feature map with the size of 1/8;
when the feature maps F1 and F2 are feature maps of 1/16 size and 1/32 size, respectively, the two-dimensional feature map F2' obtained is a feature map of 1/16 size.
Further, the step 3 specifically includes the following sub-steps:
step 31, collapsing the two-dimensional characteristic diagram of 1/16 size finally obtained in the step 2 into a one-dimensional sequence, and performing convolution to form a K, V and Q characteristic diagram;
step 32, respectively adding the position code and the feature map K and the feature map Q obtained in the step 31 pixel by pixel to obtain two feature maps with position information, inputting the two feature maps and the feature map V into the multi-head attention module as common input, and processing to obtain a new feature map;
step 33, performing fusion operation and LayerNorm operation of adding corresponding values among feature maps on the new feature map obtained in step 32 and the V, K and Q feature maps obtained in step 31;
and step 34, processing the result obtained in the step 33 in a feed-forward neural network, and connecting and outputting through residual errors to obtain a new characteristic diagram.
Further, the position code in step 32 is obtained by the following formula:
PE_{(pos,2i)} = sin(pos / 10000^{2i/d})
PE_{(pos,2i+1)} = cos(pos / 10000^{2i/d})
where PE(·) is the position-encoding matrix, with the same resolution as the input feature map, pos is the position of the vector in the sequence, i is the channel index, and d is the number of channels of the input feature map.
Further, the step 4 specifically includes the following sub-steps:
and step 41, performing 2 times of upsampling on the feature map finally obtained in the step 3 to obtain a new feature map.
Step 42, performing feature fusion on the feature map with the size of 1/4 and the size of 1/8 obtained in the step 24 by using the same feature fusion module as that in the step 24 to obtain a new feature map with the size of 1/4;
step 43, performing feature fusion on the feature maps with the size of 1/8 and the size of 1/16 obtained in the step 24 by using a feature fusion module, and performing pixel-by-pixel addition on the feature maps obtained in the step 41 to obtain a new feature map with the size of 1/8;
step 44, performing feature fusion on the feature map with the size of 1/4 obtained in the step 42 and the feature map with the size of 1/8 obtained in the step 43 by using a feature fusion module to generate a heat map with the resolution of 1/4 of the original image;
Step 45: perform logistic regression on the heat map obtained in step 44 against the heat-map labels containing the target center points in the data set of step 1, obtaining the center point (x̂, ŷ) of each predicted target;
Step 46: obtain the coordinates of the upper-left and lower-right points of the bounding box of each target through formula (3) and generate the target bounding boxes:
(x̂ + δx̂ − ŵ/2, ŷ + δŷ − ĥ/2), (x̂ + δx̂ + ŵ/2, ŷ + δŷ + ĥ/2)    (3)
where (x̂, ŷ) is the center point of the predicted target obtained in step 45, (δx̂, δŷ) is the predicted offset of the center point relative to the target center point, and (ŵ, ĥ) is the predicted size of the bounding box corresponding to the target.
Further, the step 5 specifically includes the following sub-steps:
Step 51: take the image input in step 2 as the frame T-1 image and select the next frame as the frame T image; with the frame T and frame T-1 images as input, generate feature maps f_T and f_{T-1} respectively through the CenterTrack backbone network;
Step 52: send the feature maps f_T and f_{T-1} to the cost space module shown in FIG. 5 for target association processing, obtaining an output feature map f'_T;
Step 53: take the Hadamard product of the heat map obtained in step 4 and the feature map f_{T-1} obtained in step 51 to generate a masked feature map; apply a deformable convolution to this feature map together with the feature map f'_T obtained in step 52 to generate a fused feature map;
Step 54: apply three 1×1 convolution operations and a down-sampling operation in turn to the fused feature map to generate the frame T-1 feature maps; apply three 1×1 convolutions to the feature map f_T obtained in step 51 to generate the frame T feature maps;
Step 55: input the frame T feature maps obtained in step 54 together with the frame T-1 feature maps into the attention propagation module for feature propagation, obtaining a tracking feature map V'_T with target detection boxes.
Further, the step 52 specifically includes the following operations:
Step 521: send the feature maps f_T and f_{T-1} to a weight-shared three-layer convolution structure in the cost space module to generate feature maps e_T and e_{T-1}, i.e. the appearance coding vectors of the targets;
Step 522: apply a max-pooling operation to the feature maps e_T and e_{T-1} to obtain e'_T and e'_{T-1}, reducing model complexity; compute the cost space matrix C from the product of e'_T and the transpose of e'_{T-1}; with the position of a target of the current frame on the cost space matrix C being (i, j), extract from C the two-dimensional cost matrix C_{i,j} containing the positional information, in the previous frame image, of that current-frame target, and take the maximum value of C_{i,j} along the horizontal and vertical directions respectively to obtain the feature maps in the corresponding directions;
Step 523: define the offset templates
G_{i,j,l} = (l − j) × s,  1 ≤ l ≤ W_C    (4)
M_{i,j,k} = (k − i) × s,  1 ≤ k ≤ H_C    (5)
where s is the down-sampling factor of the feature map relative to the original image, W_C and H_C are the width and height of the feature map, G_{i,j,l} is the offset with which the target (i, j) of the frame T image appears at horizontal position l in the frame T-1 image, and M_{i,j,k} is the offset with which the frame T target (i, j) appears at vertical position k in the frame T-1 image;
Step 524: multiply the directional feature maps obtained in step 522 by the offset templates G and M defined in step 523, and superimpose the products along the channel dimension to obtain a feature map O_T, which represents the offset templates of the target in the horizontal and vertical directions; up-sample O_T by a factor of 2 to restore it to size H_F × W_F; superimpose the horizontal and vertical channels of O_T on the feature maps f_T and f_{T-1} obtained in step 51 respectively along the channel dimension, and form, by convolution, two feature maps of the same size with 9 channels in the horizontal and vertical directions; superimpose these two feature maps along the channel dimension to obtain the output feature map f'_T.
Compared with the prior art, the invention has the beneficial effects that:
(1) in the invention, the resolution of the input picture of the adopted data set is properly increased, and the size of the final characteristic diagram is ensured to reserve more detailed information;
(2) in the multi-target detection module, a deep feature map containing more semantic information and a shallow feature map containing more detailed information are fused through the feature fusion module, so that the detection capability of the model on small targets is improved;
(3) in the invention, a Transformer coding module self-attention mechanism is introduced into a multi-target detection module, the dependency relationship on a long distance is captured, the characteristic potential relation in a characteristic diagram is explored, and a target with larger appearance dimension change can be stably identified;
(4) the multi-target tracking algorithm based on cost space and inter-frame information fusion is provided, a cost space matrix is used for predicting the position of a target of a current frame in a previous frame, the targets between two frames can be correlated, and the tracking effect is realized;
(5) in the multi-target tracking module, an attention propagation module is introduced to fuse the characteristics of multi-frame targets, so that the problem of target space dislocation caused by inter-frame target motion is solved, and the model can still accurately realize tracking under the condition that the target is shielded.
Drawings
FIG. 1 is a schematic diagram of a multi-target detection module of the present invention;
FIG. 2 is a schematic diagram of a feature fusion module in the multi-target detection module;
FIG. 3 is a schematic diagram of a Transformer encoding module in a multi-target detection module;
FIG. 4 is a schematic diagram of the multi-target tracking module of the present invention;
FIG. 5 is a schematic diagram of a cost space module in the multi-target tracking module;
FIG. 6 is a schematic diagram of the experimental results of the multi-target detection module of the present invention; the result diagrams are respectively a target center point and a target boundary box result diagram obtained by detecting a small target and a large target by the module.
FIG. 7 is a schematic diagram of the experimental results of the multi-target tracking module of the present invention. The four pictures of the two sections of test cases are respectively the 0 th frame, the 5 th frame, the 10 th frame and the 15 th frame.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
The multi-target detection and tracking model is divided into two parts. First, the multi-target detection module, whose framework is shown in FIG. 1, takes an improved DLA34 as the backbone and adds feature fusion modules to obtain a feature map fusing deep and shallow network features; a Transformer encoding module then performs self-attention encoding on the fused feature map, alleviating the limited large-target semantic extraction capability caused by excessive differences in target feature scale; finally, a target heat map is generated and the bounding box of each corresponding target is regressed, realizing traffic target detection. Second, the target tracking module, whose framework is shown in FIG. 4, generates feature maps through the CenterTrack backbone network and uses a cost space matrix to associate and track targets between two frames; an attention propagation module fuses and complements the target information of the preceding and following frames, realizing accurate tracking even when a target is blurred or occluded.
The invention relates to a multi-target detection and tracking method under a complex urban road environment, which specifically comprises the following steps:
step 1: and selecting a public data set for data enhancement to obtain a data set, and constructing a training set and a test set.
Specifically, VisDrone_mot from the VisDrone benchmark is selected as the data set of the invention. The VisDrone_mot data set collects street overhead views of multiple Chinese cities from an unmanned aerial vehicle and provides 96 video sequences: 56 training sequences comprising 24201 frames, 7 validation sequences comprising 2819 frames, and 33 test sequences comprising 12968 frames, with the bounding box of every identified object manually annotated in each video frame. The resolution of the input pictures of the VisDrone_mot data set is increased to 1024 × 1024 so that the final feature map output by the multi-target detection module is 256 × 256 and retains more detail; training samples are extended with data enhancement combining random flipping, random scaling between 0.6 and 1.3 times, random cropping and color jitter, as sketched below.
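As a rough illustration of this preprocessing, the sketch below resizes a frame to 1024 × 1024 and applies random flipping, random scaling in the 0.6-1.3 range, random cropping and a simple colour jitter. The function name, the jitter ranges and the crop/pad strategy are assumptions for illustration only.

```python
import random
import numpy as np
import cv2  # OpenCV, assumed available for resizing and flipping


def augment_frame(img: np.ndarray) -> np.ndarray:
    """Hypothetical data-enhancement sketch for a 3-channel uint8 frame:
    resize to 1024x1024, random flip, random scale in [0.6, 1.3],
    crop/pad back to 1024x1024, then a simple brightness/contrast jitter."""
    img = cv2.resize(img, (1024, 1024))

    # random horizontal flip
    if random.random() < 0.5:
        img = cv2.flip(img, 1)

    # random scaling in [0.6, 1.3], then crop or zero-pad back to 1024x1024
    s = random.uniform(0.6, 1.3)
    h, w = img.shape[:2]
    img = cv2.resize(img, (int(w * s), int(h * s)))
    canvas = np.zeros((1024, 1024, 3), dtype=img.dtype)
    ch, cw = min(1024, img.shape[0]), min(1024, img.shape[1])
    y0 = random.randint(0, img.shape[0] - ch)
    x0 = random.randint(0, img.shape[1] - cw)
    canvas[:ch, :cw] = img[y0:y0 + ch, x0:x0 + cw]

    # simple colour jitter: random contrast (alpha) and brightness (beta)
    alpha = random.uniform(0.8, 1.2)
    beta = random.uniform(-20, 20)
    canvas = np.clip(alpha * canvas.astype(np.float32) + beta, 0, 255)
    return canvas.astype(np.uint8)
```

In practice the annotated bounding boxes must be transformed with the same flip, scale and crop parameters so that the labels stay aligned with the augmented frames; that bookkeeping is omitted here.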
Step 2: add a feature fusion module layer by layer to the existing DLA34 backbone network to fuse deep and shallow network features of the input image, obtaining the two-dimensional feature maps after three feature fusions. As shown in FIG. 1, this specifically includes the following sub-steps:
and step 21, inputting the images in the training set into a DLA34 network, performing convolution operation with convolution kernel of 3 × 3 twice on the original image through a BatchNorm layer and a ReLU layer to obtain two feature maps, inputting the two feature maps after convolution into an aggregation node for feature fusion, and obtaining a feature map with resolution of 1/4 of the original input feature map. Wherein the characteristics of the aggregate nodes are fused as shown in formula (1):
N(X_1, ..., X_n) = σ(BN(Σ w_i x_i + b), ..., BN(Σ w_i x_i + b))    (1)
where N(·) denotes the aggregation node, σ(·) denotes feature aggregation, w_i x_i + b denotes a convolution operation, BN denotes a BatchNorm operation, and X_{i=1...n} are the outputs of the convolution modules.
And step 22, carrying out 2-time down-sampling on the feature map with the size of 1/4 obtained in the step 21 to obtain a new feature map, repeating the convolution operation and the aggregation operation in the step 21 on the feature map twice to obtain two feature maps, and carrying out the aggregation operation again by taking the aggregation node obtained in the step 21 as a common input to obtain the feature map with the resolution of 1/8 of the original input feature map. This step is intended to pass the network shallow feature information to the network deep layer.
Step 23, obtaining a feature map with the size of 1/16 from the feature map with the size of 1/8 according to the same manner of obtaining the feature map with the size of 1/8 from the feature map with the size of 1/4 in the step 22, and obtaining a feature map with the size of 1/32 from the feature map with the size of 1/16;
step 24, as shown in fig. 2, sequentially adopting a feature fusion module to perform feature fusion on the adjacent feature maps of the obtained feature map with the size of 1/4, the feature map with the size of 1/8, the feature map with the size of 1/16 and the feature map with the size of 1/32 to respectively obtain new feature maps with the size of 1/4, the size of 1/8 and the size of 1/16;
the feature fusion module is used for realizing the following operations (a code sketch of this module is given after the sub-steps below):
step 241, performing deformable convolution processing with a convolution kernel of 3 × 3 on the feature map F1, and obtaining a mapped feature map by passing a result obtained by the processing through a BatchNorm layer and a ReLU layer;
step 242, replacing the transposed convolution in the DLA34 backbone network with a direct interpolation upsampling and convolution processing mode, and performing 2 times upsampling on the feature map obtained after mapping in step 241 to obtain a feature map F1', so as to obtain more target position information and reduce the model parameters;
step 243, adding the channel values corresponding to the characteristic diagram F1' and the characteristic diagram F2 obtained in the step 242 to obtain a combined characteristic diagram;
step 244, performing 3 × 3 deformable convolution processing on the merged feature map obtained in step 243, and then sequentially passing through the BatchNorm layer and the ReLU layer to obtain a two-dimensional feature map F2';
when the feature map F1 and the feature map F2 are respectively a feature map with the size of 1/4 and a feature map with the size of 1/8, the obtained two-dimensional feature map F2' is a feature map with the size of 1/4;
when the feature map F1 and the feature map F2 are respectively a feature map with the size of 1/8 and a feature map with the size of 1/16, the obtained two-dimensional feature map F2' is a feature map with the size of 1/8;
when the feature map F1 and the feature map F2 are respectively a feature map of 1/16 size and a feature map of 1/32 size, the obtained two-dimensional feature map F2' is a feature map of 1/16 size.
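A minimal PyTorch-style sketch of the two building blocks just described is given below. It follows the text — per-branch convolution + BatchNorm with a fused activation for formula (1), and 3×3 convolution, 2× interpolation up-sampling and channel-wise addition for steps 241-244 — but the channel counts are placeholders and an ordinary convolution stands in for the deformable convolution of steps 241/244, so treat it as an illustrative approximation rather than the exact module of the invention. Here f1 denotes the map that is convolved and 2×-up-sampled and f2 the map it is added to.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AggregationNode(nn.Module):
    """Aggregation node of formula (1): each input branch passes through a
    convolution + BatchNorm; the branches are summed and a ReLU is applied
    (the fusion sigma is taken to be summation + activation here)."""
    def __init__(self, channels: int, n_inputs: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels))
            for _ in range(n_inputs)])
        self.act = nn.ReLU(inplace=True)

    def forward(self, *xs: torch.Tensor) -> torch.Tensor:
        out = self.branches[0](xs[0])
        for branch, x in zip(self.branches[1:], xs[1:]):
            out = out + branch(x)
        return self.act(out)


class FusionModule(nn.Module):
    """Feature fusion module of steps 241-244.  A plain 3x3 convolution is
    used where the patent specifies a deformable convolution."""
    def __init__(self, c1: int, c2: int):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(c1, c2, 3, padding=1),
                                 nn.BatchNorm2d(c2), nn.ReLU(inplace=True))
        self.post = nn.Sequential(nn.Conv2d(c2, c2, 3, padding=1),
                                  nn.BatchNorm2d(c2), nn.ReLU(inplace=True))

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        f1 = self.pre(f1)                               # step 241: conv + BN + ReLU
        f1 = F.interpolate(f1, scale_factor=2.0,
                           mode="bilinear", align_corners=False)  # step 242
        merged = f1 + f2      # step 243: assumes f2 has 2x the spatial size of f1
        return self.post(merged)                        # step 244
```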
Step 3: according to the feature-fused feature map obtained in step 2, extract the long-range feature dependencies in the feature map with a Transformer encoding module, obtaining a feature map with the dependencies extracted. As shown in FIG. 3, this specifically includes the following sub-steps (a code sketch follows the sub-steps):
step 31, collapsing the two-dimensional characteristic diagram of 1/16 size finally obtained in the step 2 into a one-dimensional sequence, and performing convolution to form three K (Key), V (Value) and Q (Query) characteristic diagrams;
Step 32: add the position code pixel by pixel to the feature maps K and Q obtained in step 31 to obtain two feature maps carrying position information, and feed them, together with the feature map V, as common input into the multi-head attention module to obtain a new feature map, so as to capture long-range dependencies in the image. The position code is obtained by formulas (1) and (2):
PE_{(pos,2i)} = sin(pos / 10000^{2i/d})    (1)
PE_{(pos,2i+1)} = cos(pos / 10000^{2i/d})    (2)
where PE(·) is the position-encoding matrix, with the same resolution as the input feature map, pos is the position of the vector in the sequence, i is the channel index, and d is the number of channels of the input feature map.
Step 33, performing fusion operation and LayerNorm (LN) operation of adding corresponding values between feature maps on the new feature map obtained in step 32 and the V, K, Q feature maps obtained in step 31 to avoid information loss;
and step 34, processing the result obtained in the step 33 in a feed-forward neural network, and connecting and outputting through residual errors to obtain a new characteristic diagram.
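The sketch below mirrors steps 31-34: the 1/16-scale map is flattened to a sequence, projected to Q/K/V, given the sinusoidal position code of formulas (1)/(2), passed through multi-head attention with a residual + LayerNorm, and then through a feed-forward network with a second residual. The head count and feed-forward width are placeholders, the embedding width is assumed even and divisible by the head count, and the residual of step 33 adds only V (the text fuses with V, K and Q), so this is an approximation of the module in FIG. 3.

```python
import torch
import torch.nn as nn


def sinusoidal_position_code(length: int, d: int) -> torch.Tensor:
    """Position code of formulas (1)/(2); assumes an even channel count d."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)                 # (d/2,)
    angle = pos / torch.pow(torch.tensor(10000.0), i / d)          # (L, d/2)
    pe = torch.zeros(length, d)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe


class TransformerEncodingModule(nn.Module):
    """Steps 31-34: flatten -> Q/K/V projections -> position code on Q, K ->
    multi-head attention -> add & LayerNorm -> feed-forward -> residual."""
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.q_proj = nn.Conv2d(d, d, 1)
        self.k_proj = nn.Conv2d(d, d, 1)
        self.v_proj = nn.Conv2d(d, d, 1)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(inplace=True),
                                 nn.Linear(4 * d, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q_proj(x).flatten(2).transpose(1, 2)   # (B, HW, C)  step 31
        k = self.k_proj(x).flatten(2).transpose(1, 2)
        v = self.v_proj(x).flatten(2).transpose(1, 2)
        pe = sinusoidal_position_code(h * w, c).to(x.device)        # step 32
        out, _ = self.attn(q + pe, k + pe, v)
        out = self.norm1(out + v)                                   # step 33
        out = self.norm2(out + self.ffn(out))                       # step 34
        return out.transpose(1, 2).reshape(b, c, h, w)
```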
Step 4: generate a heat map and target bounding boxes through further feature fusion and logistic regression processing of the feature maps obtained in steps 2 and 3. This specifically comprises the following sub-steps:
Step 41: perform 2× up-sampling on the feature map finally obtained in step 3 to obtain a new feature map;
Step 42: perform feature fusion on the 1/4-size and 1/8-size feature maps obtained in step 24, using the same feature fusion module as in step 24, to obtain a new 1/4-size feature map;
Step 43: perform feature fusion on the 1/8-size and 1/16-size feature maps obtained in step 24 with the feature fusion module, and add the result pixel by pixel to the feature map obtained in step 41 to obtain a new 1/8-size feature map;
Step 44: perform feature fusion on the 1/4-size feature map obtained in step 42 and the 1/8-size feature map obtained in step 43 with the feature fusion module to generate a heat map whose resolution is 1/4 of the original image;
Step 45: perform logistic regression on the heat map obtained in step 44 against the heat-map labels containing the target center points in the data set of step 1, obtaining the center point (x̂, ŷ) of each predicted target;
Step 46: obtain the coordinates of the upper-left and lower-right points of the bounding box of each target through formula (3) and generate the target bounding boxes:
(x̂ + δx̂ − ŵ/2, ŷ + δŷ − ĥ/2), (x̂ + δx̂ + ŵ/2, ŷ + δŷ + ĥ/2)    (3)
where (x̂, ŷ) is the center point of the predicted target obtained in step 45, (δx̂, δŷ) is the predicted offset of the center point relative to the target center point, and (ŵ, ĥ) is the predicted size of the bounding box corresponding to the target. A decoding sketch of steps 45-46 is given below.
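The center-point decoding of steps 45-46 can be sketched as follows: local maxima of the sigmoid heat map are taken as predicted centers, and the box corners follow formula (3) from the predicted offset and size maps. The score threshold and the max-pooling peak extraction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def decode_boxes(heatmap: torch.Tensor, offset: torch.Tensor,
                 size: torch.Tensor, threshold: float = 0.3):
    """heatmap: (C, H, W) class heat map; offset: (2, H, W); size: (2, H, W).
    Returns (x1, y1, x2, y2, class_id, score) tuples in heat-map coordinates
    (multiply by the down-sampling factor, 4, for image coordinates)."""
    heat = torch.sigmoid(heatmap)
    # keep only local maxima (a common peak-extraction trick, assumed here)
    pooled = F.max_pool2d(heat.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    heat = heat * (heat == pooled).float()

    boxes = []
    cls, ys, xs = torch.nonzero(heat > threshold, as_tuple=True)
    for c, y, x in zip(cls.tolist(), ys.tolist(), xs.tolist()):
        dx, dy = offset[0, y, x].item(), offset[1, y, x].item()
        w, h = size[0, y, x].item(), size[1, y, x].item()
        cx, cy = x + dx, y + dy                  # refined center, formula (3)
        boxes.append((cx - w / 2, cy - h / 2,    # upper-left corner
                      cx + w / 2, cy + h / 2,    # lower-right corner
                      c, heat[c, y, x].item()))
    return boxes
```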
Step 5: according to the image input in step 2 and the heat map obtained in step 4, perform target association processing and tracking with the multi-target tracking module to obtain a tracking feature map with target detection boxes. As shown in FIG. 4, this specifically includes the following sub-steps:
Step 51: take the image input in step 2 as the frame T-1 image and select the next frame as the frame T image; with the frame T and frame T-1 images as input, generate feature maps f_T and f_{T-1} respectively through the CenterTrack backbone network;
Step 52: send the feature maps f_T and f_{T-1} to the cost space module shown in FIG. 5 for target association processing, obtaining an output feature map f'_T. This specifically comprises the following operations:
Step 521: send the feature maps f_T and f_{T-1} to a weight-shared three-layer convolution structure in the cost space module to generate feature maps e_T and e_{T-1}, i.e. the appearance coding vectors of the targets;
Step 522: apply a max-pooling operation to the feature maps e_T and e_{T-1} to obtain e'_T and e'_{T-1}, reducing model complexity; compute the cost space matrix C from the product of e'_T and the transpose of e'_{T-1}, so as to store the similarity of corresponding points between the two frame feature maps; with the position of a target of the current frame on the cost space matrix C being (i, j), extract from C the two-dimensional cost matrix C_{i,j} containing the positional information, in the previous frame image, of that current-frame target, and take the maximum value of C_{i,j} along the horizontal and vertical directions respectively to obtain the feature maps in the corresponding directions;
Step 523: define the offset templates
G_{i,j,l} = (l − j) × s,  1 ≤ l ≤ W_C    (4)
M_{i,j,k} = (k − i) × s,  1 ≤ k ≤ H_C    (5)
where s is the down-sampling factor of the feature map relative to the original image, W_C and H_C are the width and height of the feature map, G_{i,j,l} is the offset with which the target (i, j) of the frame T image appears at horizontal position l in the frame T-1 image, and M_{i,j,k} is the offset with which the frame T target (i, j) appears at vertical position k in the frame T-1 image;
Step 524: multiply the directional feature maps obtained in step 522 by the offset templates G and M defined in step 523, and superimpose the products along the channel dimension to obtain a feature map O_T, which represents the offset templates of the target in the horizontal and vertical directions; up-sample O_T by a factor of 2 to restore it to size H_F × W_F; superimpose the horizontal and vertical channels of O_T on the feature maps f_T and f_{T-1} obtained in step 51 respectively along the channel dimension, and form, by convolution, two feature maps of the same size with 9 channels in the horizontal and vertical directions; superimpose these two feature maps along the channel dimension to obtain the output feature map f'_T.
Step 53: take the Hadamard product of the heat map obtained in step 4 and the feature map f_{T-1} obtained in step 51 to generate a masked feature map; apply a deformable convolution to this feature map together with the feature map f'_T obtained in step 52 to generate a fused feature map;
Step 54: apply three 1×1 convolution operations and a down-sampling operation in turn to the fused feature map to generate the frame T-1 feature maps (q_{t-1}, k_{t-1}, v_{t-1}); apply three 1×1 convolutions to the feature map f_T obtained in step 51 to generate the frame T feature maps (Q_t, K_t, V_t);
Step 55: input the frame T feature maps obtained in step 54 together with the frame T-1 feature maps into the attention propagation module for feature propagation, obtaining a tracking feature map V'_T with target detection boxes. The calculation of the attention propagation module is shown in formula (6):
V'_T = softmax(φ(Q_t) · φ(k_{t-1})^T / √d_k) · φ(v_{t-1}) + V_t    (6)
where φ(·) is a 1 × 1 convolution, d_k is the dimension of the feature maps Q and K, and Q_t, k_{t-1}, v_{t-1}, V_t are the feature maps obtained in step 54. A code sketch of the cost-space association and attention propagation is given below.
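The compressed sketch below keeps the main ideas of steps 52-55 — shared appearance embeddings, a cost volume between frames, row/column maxima weighted by the offset templates of formulas (4)/(5), and scaled dot-product attention propagation in the spirit of formula (6) — but collapses channel counts, the pooling size and the 9-channel per-direction encoding of step 524 into simple placeholders, and omits the heat-map masking and deformable convolution of step 53, so it is only an approximation of FIG. 4/FIG. 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CostSpaceAssociation(nn.Module):
    """Sketch of steps 521-524: shared 3-layer appearance embedding, cost
    volume, and direction-wise maxima weighted by the offset templates."""
    def __init__(self, channels: int, stride: int = 4):
        super().__init__()
        self.stride = stride                    # down-sampling factor s
        self.embed = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))

    def forward(self, f_t: torch.Tensor, f_tm1: torch.Tensor) -> torch.Tensor:
        h, w = f_t.shape[-2:]
        e_t = F.max_pool2d(self.embed(f_t), 2)      # e'_T      (B, C, hc, wc)
        e_tm1 = F.max_pool2d(self.embed(f_tm1), 2)  # e'_{T-1}  (B, C, hc, wc)
        hc, wc = e_t.shape[-2:]

        # cost volume: similarity of current cell (i, j) with previous (k, l)
        cost = torch.einsum("bcij,bckl->bijkl", e_t, e_tm1)  # (B,hc,wc,hc,wc)

        # C_{i,j} maxima along the two directions (step 522), normalised
        horiz = torch.softmax(cost.max(dim=3).values, dim=-1)  # over l
        vert = torch.softmax(cost.max(dim=4).values, dim=-1)   # over k

        # offset templates of formulas (4)/(5)
        idx_w = torch.arange(wc, device=f_t.device, dtype=torch.float32)
        idx_h = torch.arange(hc, device=f_t.device, dtype=torch.float32)
        G = (idx_w.view(1, 1, 1, wc) - idx_w.view(1, 1, wc, 1)) * self.stride
        M = (idx_h.view(1, 1, 1, hc) - idx_h.view(1, hc, 1, 1)) * self.stride

        # expected horizontal/vertical displacement of each current-frame cell
        off_x = (horiz * G).sum(dim=-1)             # (B, hc, wc)
        off_y = (vert * M).sum(dim=-1)              # (B, hc, wc)
        o_t = torch.stack([off_x, off_y], dim=1)    # simplified O_T
        o_t = F.interpolate(o_t, size=(h, w), mode="bilinear",
                            align_corners=False)    # back to H_F x W_F

        # step 524 simplified to a channel concatenation of both frames'
        # features with the offset map
        return torch.cat([f_t, f_tm1, o_t], dim=1)


class AttentionPropagation(nn.Module):
    """Sketch of step 55 / formula (6): scaled dot-product attention from the
    previous frame's key/value maps onto the current frame's query map, with a
    residual connection to V_t."""
    def __init__(self, channels: int):
        super().__init__()
        self.phi_q = nn.Conv2d(channels, channels, 1)
        self.phi_k = nn.Conv2d(channels, channels, 1)
        self.phi_v = nn.Conv2d(channels, channels, 1)

    def forward(self, q_t, k_tm1, v_tm1, v_t):
        b, c, h, w = q_t.shape
        q = self.phi_q(q_t).flatten(2).transpose(1, 2)    # (B, HW, C)
        k = self.phi_k(k_tm1).flatten(2)                  # (B, C, HW)
        v = self.phi_v(v_tm1).flatten(2).transpose(1, 2)  # (B, HW, C)
        attn = torch.softmax(q @ k / (c ** 0.5), dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + v_t                                  # tracking map V'_T
```

In a full model the concatenated output of CostSpaceAssociation would first be reduced by the convolutions of steps 53-54 to the q/k/v maps before entering AttentionPropagation.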
Step 6: train the multi-target detection and tracking model formed by steps 2, 3, 4 and 5 with the training set of step 1, and test it with the test set, finally obtaining the trained multi-target detection and tracking model.
Step 7: input the video data to be detected into the trained multi-target detection and tracking model to obtain a tracking feature map with target detection boxes.
In order to verify the feasibility and effectiveness of the invention, the invention carries out the following experiments:
First, the multi-target detection module of the model (i.e. steps 2-4) is evaluated with average precision and recall. The average precision is derived from the precision; the precision P and recall R are given by formulas (7) and (8):
P = TP / (TP + FP)    (7)
R = TP / (TP + FN)    (8)
where P is the proportion of targets that should be retrieved (TP) among all retrieved targets (TP + FP), and R is the proportion of targets that should be retrieved (TP) among all targets that should have been retrieved (TP + FN).
In the detection task, the precision reflects the model's ability to avoid false detections and the recall reflects its ability to avoid missed detections. The two indices constrain each other; the relative balance between them is found through the average precision (AP) over different confidence thresholds, which yields a two-dimensional PR curve with recall and precision as its coordinates. The average precision (AP) is the area enclosed by the PR curve, equivalent to averaging the precision. A counting sketch of these indices follows.
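In counting form, formulas (7)/(8) and the AP as area under the PR curve can be sketched as follows; the trapezoid integration is an illustrative choice rather than the exact interpolation scheme used in the experiments.

```python
from typing import List, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Formulas (7)/(8): P = TP / (TP + FP), R = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def average_precision(pr_points: List[Tuple[float, float]]) -> float:
    """AP as the area under a PR curve given as (recall, precision) points,
    integrated with a simple trapezoid rule."""
    pts = sorted(pr_points)
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2.0
    return ap
```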
The multi-target detection module is first analysed quantitatively by comparing it with baseline models on the VisDrone_mot data set, and the experiments additionally compare per-category performance between the proposed method and the various baselines. Compared with well-performing common models, the proposed method achieves the best recognition performance for large targets, with precision reaching 42.16 and 33.10, and shows good detection capability.
Meanwhile, to reflect the performance of the whole multi-target detection module intuitively, the module is analysed qualitatively. The results in FIG. 6 show that the model detects targets of different scales well, that after the Transformer module is added it captures long-range dependencies more stably, and that recognition of large targets remains robust while small targets are also recognised well.
Next, the multi-target tracking module (i.e. step 5) is evaluated with indices such as MOTA (↑), MOTP (↑), IDF1 (↑), MT (↑), ML (↓), FP (↓), FN (↓), Frag (↓) and IDSW (↓), where ↑ indicates that a larger value means better model performance and ↓ indicates that a smaller value means better model performance.
MOTA denotes the multi-object tracking accuracy; it measures the ability of the algorithm to track targets continuously and counts the accumulation of errors during tracking, as given in formula (9):
MOTA = 1 − Σ_t (m_t + fp_t + mme_t) / Σ_t g_t    (9)
where m_t, corresponding to FP, denotes the false positives in the prediction results, i.e. predicted positions in frame t with no corresponding tracked target matching them; fp_t, corresponding to FN, denotes the false negatives (missed detections), i.e. targets in frame t with no corresponding predicted position matching them; mme_t, corresponding to IDSW, denotes the number of mismatches, i.e. the number of ID switches of tracked targets in frame t; and g_t is the total number of real targets in the frame. MOTA thus jointly accounts for false detections, missed detections and ID switches along the target trajectories.
MOTP also directly reflects the tracking quality of the model: it reflects the distance between the tracking results and the labelled trajectories, as expressed in formula (10):
MOTP = Σ_{t,i} d_{t,i} / Σ_t c_t    (10)
where c_t is the number of matches in frame t and d_{t,i} is the track error computed for each matched pair i in frame t; the errors are summed over all matches to obtain the final value. The larger this index, the better the model performance and the smaller the track error.
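As a counting sketch of formulas (9)/(10): MOTA accumulates the per-frame errors against the total number of ground-truth targets, and MOTP averages the per-match track error. The per-frame list inputs are an assumed interface for illustration.

```python
from typing import List


def mota(fp_per_frame: List[int], fn_per_frame: List[int],
         idsw_per_frame: List[int], gt_per_frame: List[int]) -> float:
    """Formula (9): MOTA = 1 - sum_t(m_t + fp_t + mme_t) / sum_t g_t."""
    errors = sum(fp_per_frame) + sum(fn_per_frame) + sum(idsw_per_frame)
    return 1.0 - errors / max(sum(gt_per_frame), 1)


def motp(match_errors: List[List[float]]) -> float:
    """Formula (10): MOTP = sum_{t,i} d_{t,i} / sum_t c_t, where d_{t,i} is
    the track error of match i in frame t and c_t the number of matches."""
    total = sum(sum(frame) for frame in match_errors)
    matches = sum(len(frame) for frame in match_errors)
    return total / max(matches, 1)
```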
MT is the mostly-tracked number (Mostly Tracked), i.e. the number of trajectories that cover more than 80% of their labelled trajectory; larger is better. ML is the mostly-lost number (Mostly Lost), i.e. the number of trajectories that lose more than 80% of their labelled trajectory; smaller is better. Frag is the number of fragmentations, i.e. the number of times a trajectory changes from the "tracked" state to the "not tracked" state.
For a multi-target tracker, the identity-related indices are also important; three of them matter in particular: IDP, IDR and IDF1. IDP denotes the Identification Precision, i.e. the ID identification accuracy of each target box, given by formula (11):
IDP = IDTP / (IDTP + IDFP)    (11)
where IDTP and IDFP are the numbers of true positives and false positives of the ID predictions, respectively. IDR denotes the Identification Recall, i.e. the ID identification recall of each target box, given by formula (12):
IDR = IDTP / (IDTP + IDFN)    (12)
where IDFN is the number of false negatives of the ID predictions. IDF1 denotes the Identification F-Score of the ID predictions, i.e. the F value of the ID identification for each target box; larger values are better, and it is computed by formula (13):
IDF1 = 2 · IDTP / (2 · IDTP + IDFP + IDFN)    (13)
IDF1 is the primary default index used to evaluate tracker performance, and any two of the three indices determine the third.
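The three identity metrics of formulas (11)-(13) reduce to simple ratios of ID-level true and false positives and negatives:

```python
def id_metrics(idtp: int, idfp: int, idfn: int):
    """Formulas (11)-(13): IDP, IDR and IDF1 from ID-level counts."""
    idp = idtp / (idtp + idfp) if idtp + idfp else 0.0
    idr = idtp / (idtp + idfn) if idtp + idfn else 0.0
    denom = 2 * idtp + idfp + idfn
    idf1 = 2 * idtp / denom if denom else 0.0
    return idp, idr, idf1
```

IDF1 is the harmonic mean of IDP and IDR, which is why any two of the three indices determine the third.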
First, the multi-target tracking module is compared quantitatively with mainstream baseline models of recent years. On the VisDrone_mot data set, the proposed tracking method is 3.2 and 1.8 higher than the second-best model on the MOTA and MOTP indices respectively, and obtains better results on the other indices; because the model has a lower false detection rate, the ML and MT indices fluctuate within a normal range. Compared with TBD (tracking-by-detection) models, the JDT (joint detection and tracking) model can be optimized end to end during training because the detection and tracking tasks promote each other, and therefore achieves better results on the tracking task.
Second, the model is analysed qualitatively on the data set. As shown in FIG. 7, two test examples are given, and four pictures of each example are displayed, namely the 0th, 5th, 10th and 15th frames along the time dimension. The figure shows that the model tracks multiple targets in the traffic scene stably and, in particular, has excellent detection and tracking capability for small targets.
Claims (9)
1. A multi-target detection and tracking method under a complex urban road environment is characterized by specifically comprising the following steps:
Step 1: selecting a public data set, applying data enhancement, and constructing a training set and a test set;
Step 2: adding a feature fusion module layer by layer to the existing DLA34 backbone network to fuse deep and shallow network features of an input image, obtaining two-dimensional feature maps after three feature fusions;
Step 3: extracting, with a Transformer encoding module, the long-range feature dependencies in the feature-fused two-dimensional feature map, obtaining a feature map with the dependencies extracted;
Step 4: generating a heat map and target bounding boxes through further feature fusion and logistic regression processing;
Step 5: performing target association and tracking with a multi-target tracking module, obtaining a tracking feature map with target detection boxes;
Step 6: training the multi-target detection and tracking model formed by steps 2, 3, 4 and 5 with the training set of step 1 and testing it with the test set, finally obtaining the trained multi-target detection and tracking model;
Step 7: inputting the video data to be detected into the trained multi-target detection and tracking model to obtain a tracking feature map with target detection boxes.
2. The multi-target detection and tracking method in a complex urban road environment according to claim 1, wherein in step 1, VisDrone_mot from the mainstream traffic target detection data set VisDrone is selected as the data set of the present invention.
3. The multi-target detection and tracking method in a complex urban road environment according to claim 1, wherein said step 2 comprises the following substeps:
step 21, inputting the images in the training set into a DLA34 network, performing convolution operation with convolution kernel of 3 × 3 twice on the original image through a BatchNorm layer and a ReLU layer to obtain two feature maps, inputting the two feature maps after convolution into an aggregation node for feature fusion, and obtaining a feature map with resolution of 1/4 of the original input feature map;
step 22, carrying out 2-time down-sampling on the feature map with the size of 1/4 obtained in the step 21 to obtain a new feature map, repeating the convolution operation and the aggregation operation in the step 21 on the feature map twice to obtain two feature maps, and carrying out the aggregation operation again by taking the aggregation node obtained in the step 21 as a common input to obtain the feature map with the resolution of 1/8 of the original input feature map;
step 23, obtaining a feature map with the size of 1/16 from the feature map with the size of 1/8 according to the same manner of obtaining the feature map with the size of 1/8 from the feature map with the size of 1/4 in the step 22, and obtaining a feature map with the size of 1/32 from the feature map with the size of 1/16;
and 24, as shown in fig. 2, sequentially adopting a feature fusion module to perform adjacent feature fusion on the obtained feature map with the size of 1/4, the feature map with the size of 1/8, the feature map with the size of 1/16 and the feature map with the size of 1/32 to respectively obtain new feature maps with the size of 1/4, the size of 1/8 and the size of 1/16.
4. The multi-target detection and tracking method in a complex urban road environment according to claim 3, wherein in step 24, the feature fusion module is configured to implement the following operations:
step 241, performing deformable convolution processing with a convolution kernel of 3 × 3 on the feature map F1, and obtaining a mapped feature map by passing a result obtained by the processing through a BatchNorm layer and a ReLU layer;
step 242, replacing the transposed convolution in the DLA34 backbone network with a direct interpolation upsampling and convolution processing mode, and performing 2-fold upsampling on the feature map obtained after mapping in step 241 to obtain a feature map F1';
step 243, adding the channel values corresponding to the characteristic diagram F1' and the characteristic diagram F2 obtained in step 242 to obtain a combined characteristic diagram;
step 244, performing 3 × 3 deformable convolution processing on the merged feature map obtained in step 243, and then sequentially passing through the BatchNorm layer and the ReLU layer to obtain a two-dimensional feature map F2';
when the characteristic diagram F1 and the characteristic diagram F2 are respectively a characteristic diagram with the size of 1/4 and a characteristic diagram with the size of 1/8, the obtained two-dimensional characteristic diagram F2' is a characteristic diagram with the size of 1/4;
when the characteristic diagram F1 and the characteristic diagram F2 are respectively a characteristic diagram with the size of 1/8 and a characteristic diagram with the size of 1/16, the obtained two-dimensional characteristic diagram F2' is a characteristic diagram with the size of 1/8;
when the feature maps F1 and F2 are feature maps of 1/16 size and 1/32 size, respectively, the two-dimensional feature map F2' obtained is a feature map of 1/16 size.
5. The multi-target detection and tracking method in a complex urban road environment according to claim 1, wherein said step 3 comprises the following substeps:
step 31, collapsing the two-dimensional characteristic diagram with the size of 1/16 obtained finally in the step 2 into a one-dimensional sequence, and convoluting to form a K, V and Q characteristic diagram;
step 32, respectively adding the position codes and the feature maps K and Q obtained in the step 31 pixel by pixel to obtain two feature maps with position information, inputting the two feature maps and the feature map V into a multi-head attention module as common input, and processing to obtain a new feature map;
step 33, performing fusion operation and LayerNorm operation of adding corresponding values among feature maps on the new feature map obtained in step 32 and the V, K and Q feature maps obtained in step 31;
and step 34, processing the result obtained in the step 33 in a feed-forward neural network, and connecting and outputting through residual errors to obtain a new characteristic diagram.
6. The multi-target detection and tracking method in a complex urban road environment according to claim 5, wherein the position code in step 32 is obtained by the following formula:
PE_{(pos,2i)} = sin(pos / 10000^{2i/d})
PE_{(pos,2i+1)} = cos(pos / 10000^{2i/d})
where PE(·) is the position-encoding matrix, with the same resolution as the input feature map, pos is the position of the vector in the sequence, i is the channel index, and d is the number of channels of the input feature map.
7. The multi-target detection and tracking method in the complex urban road environment according to claim 1, wherein the step 4 specifically comprises the following substeps:
step 41, upsampling the feature map finally obtained in step 3 by a factor of 2 to obtain a new feature map;
step 42, performing feature fusion on the 1/4-size and 1/8-size feature maps obtained in step 24, using the same feature fusion module as in step 24, to obtain a new 1/4-size feature map;
step 43, performing feature fusion on the 1/8-size and 1/16-size feature maps obtained in step 24 using the feature fusion module, and adding the result pixel by pixel to the feature map obtained in step 41 to obtain a new 1/8-size feature map;
step 44, performing feature fusion on the 1/4-size feature map obtained in step 42 and the 1/8-size feature map obtained in step 43 using the feature fusion module to generate a heat map whose resolution is 1/4 of the original image;
step 45, performing logistic regression on the heat map obtained in step 44 and the heat-map labels containing the target center points in the data set obtained in step 1 to obtain the center points of the predicted targets;
step 46, obtaining the coordinates of the top-left and bottom-right points of the box corresponding to each target through formula (3), and generating the target bounding box.
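Formula (3) is not reproduced above, so the following is only a sketch of a common center-point decoding, assuming the network also regresses a per-location box width/height map `wh` alongside the heat map; the tensor names and the score threshold are illustrative, not the patent's formula.

```python
import torch


def decode_boxes(heatmap: torch.Tensor, wh: torch.Tensor, score_thr: float = 0.3):
    """heatmap: (1, H, W) center heat map after sigmoid;
    wh: (2, H, W) predicted box width/height at each location (heat-map pixels).
    Returns a list of (x1, y1, x2, y2, score) boxes in heat-map coordinates."""
    # Keep local maxima only (3x3 max-pool trick used by CenterNet-style detectors).
    pooled = torch.nn.functional.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1)[0]
    peaks = (heatmap == pooled) & (heatmap > score_thr)
    boxes = []
    for y, x in torch.nonzero(peaks[0], as_tuple=False):
        w, h = wh[0, y, x], wh[1, y, x]
        # top-left and bottom-right corners around the predicted center (x, y)
        boxes.append((float(x - w / 2), float(y - h / 2),
                      float(x + w / 2), float(y + h / 2),
                      float(heatmap[0, y, x])))
    return boxes
```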
8. The multi-target detection and tracking method in the complex urban road environment according to claim 1, wherein the step 5 specifically comprises the following substeps:
step 51, using the same image as was input in step 2 as the (T-1)-th frame image, selecting the next frame image, namely the T-th frame image, taking the T-th frame and (T-1)-th frame images as input, and generating the feature maps f_T and f_{T-1}, respectively, through the CenterTrack backbone network;
step 52, sending the feature maps f_T and f_{T-1} to the cost space module shown in FIG. 5 for target correlation processing to obtain an output feature map f'_T;
step 53, taking the Hadamard product of the heat map obtained in step 4 and the feature map f_{T-1} obtained in step 51 to generate a feature map, and applying a deformable convolution to this feature map together with the feature map f'_T obtained in step 52 to generate a further feature map;
step 54, generating the (T-1)-th frame feature map from the feature map obtained in step 53 by sequentially applying three 1 × 1 convolution operations and a downsampling operation, and generating the T-th frame feature map by processing the feature map f_T obtained in step 51 with three 1 × 1 convolutions;
step 55, inputting the T-th frame feature map and the (T-1)-th frame feature map obtained in step 54 together into the attention propagation module for feature propagation, to obtain a tracking feature map V'_T carrying the target detection boxes.
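The data flow of steps 51–55 can be summarised in PyTorch-style pseudocode. Every module below (backbone, cost_space, deform_stage, the 1 × 1 convolution and downsampling stages, attention_propagation) is a placeholder standing in for the patent's corresponding component, so this only shows how the tensors are routed, not how the modules are built.

```python
import torch
import torch.nn as nn


def track_step(backbone: nn.Module, cost_space: nn.Module, deform_stage: nn.Module,
               conv1x1_prev: nn.Module, conv1x1_cur: nn.Module, downsample: nn.Module,
               attention_propagation: nn.Module,
               frame_t: torch.Tensor, frame_t_minus_1: torch.Tensor,
               heatmap: torch.Tensor) -> torch.Tensor:
    # step 51: shared backbone features for the two consecutive frames
    f_t = backbone(frame_t)
    f_t_1 = backbone(frame_t_minus_1)
    # step 52: target correlation in the cost space module
    f_t_assoc = cost_space(f_t, f_t_1)
    # step 53: Hadamard product of the detection heat map with f_{T-1},
    # then a deformable convolution together with the associated features
    gated = heatmap * f_t_1
    fused = deform_stage(gated, f_t_assoc)
    # step 54: 1x1 convolutions (+ downsampling on the previous-frame branch)
    prev_frame_feat = downsample(conv1x1_prev(fused))
    cur_frame_feat = conv1x1_cur(f_t)
    # step 55: attention propagation produces the tracking feature map V'_T
    return attention_propagation(cur_frame_feat, prev_frame_feat)
```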
9. The multi-target detection and tracking method in a complex urban road environment according to claim 8, wherein said step 52 specifically comprises the following operations:
step 521, sending the feature maps f_T and f_{T-1} respectively into the weight-shared three-layer convolution structure of the cost space module to generate the feature maps e_T and e_{T-1}, i.e. the appearance coding vectors of the targets;
step 522, performing a max pooling operation on the feature maps e_T and e_{T-1} to obtain e'_T and e'_{T-1} so as to reduce model complexity; computing the product of e'_T and the transpose of e'_{T-1} to obtain a cost space matrix C; the position of a target in the current frame on the cost space matrix C is (i, j), and a two-dimensional cost matrix C_{i,j} containing the position information of this current-frame target in the previous frame image is extracted from C; taking the maximum of C_{i,j} along the horizontal and vertical directions, respectively, to obtain a feature map for each direction;
step 523, defining offset templates G and M for the horizontal and vertical directions according to formulas (4) and (5):

$G_{i,j,l} = (l - j) \times s, \quad 1 \le l \le W_C \qquad (4)$

$M_{i,j,k} = (k - i) \times s, \quad 1 \le k \le H_C \qquad (5)$

wherein s is the downsampling multiple of the feature map relative to the original image, $W_C$ and $H_C$ are the width and height of the feature map, $G_{i,j,l}$ is the offset of a target at (i, j) in the T-th frame image appearing at horizontal position l in the (T-1)-th frame image, and $M_{i,j,k}$ is the offset of a T-th frame target at (i, j) appearing at vertical position k in the (T-1)-th frame image;
step 524, multiplying the directional feature maps obtained in step 522 by the offset templates G and M defined in step 523, respectively, and stacking the results along the channel dimension to obtain a feature map O_T representing the offset templates of the target in the horizontal and vertical directions; upsampling O_T by a factor of 2 to restore it to the size H_F × W_F; meanwhile, stacking the horizontal and vertical channels of O_T along the channel dimension with the feature maps f_T and f_{T-1} obtained in step 51, respectively, applying convolutions to form 2 feature maps, one for the horizontal and one for the vertical direction, each with unchanged spatial size and 9 channels, and stacking these 2 feature maps along the channel dimension to obtain the output feature map f'_T.
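A simplified sketch of the cost-space computation in steps 521–524 is given below. It covers the cost matrix, the directional maxima and the offset templates of formulas (4) and (5); the weight-shared embedding network of step 521, the 2× upsampling and the 9-channel convolutions at the end of step 524 are omitted, and multiplying the directional similarity maps by the templates and stacking them as channels is this sketch's reading of the claim, not a verbatim implementation.

```python
import torch


def cost_space_offsets(e_t: torch.Tensor, e_t_1: torch.Tensor, stride: int) -> torch.Tensor:
    """e_t, e_t_1: (C, Hc, Wc) max-pooled appearance embeddings of frames T and T-1.
    Returns an offset feature map of shape (Wc + Hc, Hc, Wc)."""
    _, hc, wc = e_t.shape
    # step 522: cost space matrix from the product of e'_T and the transpose of e'_{T-1}
    cost = e_t.flatten(1).t() @ e_t_1.flatten(1)       # (Hc*Wc, Hc*Wc)
    cost = cost.view(hc, wc, hc, wc)                   # cost[i, j] is the 2D matrix C_{i,j}
    sim_h = cost.max(dim=2).values                     # (Hc, Wc, Wc): max over previous rows
    sim_v = cost.max(dim=3).values                     # (Hc, Wc, Hc): max over previous columns
    # step 523: offset templates G (horizontal) and M (vertical), formulas (4) and (5)
    j = torch.arange(wc, dtype=torch.float32).view(1, wc, 1)
    l = torch.arange(wc, dtype=torch.float32).view(1, 1, wc)
    i = torch.arange(hc, dtype=torch.float32).view(hc, 1, 1)
    k = torch.arange(hc, dtype=torch.float32).view(1, 1, hc)
    G = (l - j) * stride                               # G_{i,j,l} = (l - j) * s
    M = (k - i) * stride                               # M_{i,j,k} = (k - i) * s
    # step 524: weight the templates by the directional similarities, stack as channels
    off_h = sim_h * G                                  # (Hc, Wc, Wc)
    off_v = sim_v * M                                  # (Hc, Wc, Hc)
    return torch.cat([off_h.permute(2, 0, 1), off_v.permute(2, 0, 1)], dim=0)
```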
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210862496.8A CN115410162A (en) | 2022-07-21 | 2022-07-21 | Multi-target detection and tracking algorithm under complex urban road environment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210862496.8A CN115410162A (en) | 2022-07-21 | 2022-07-21 | Multi-target detection and tracking algorithm under complex urban road environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115410162A true CN115410162A (en) | 2022-11-29 |
Family
ID=84157278
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210862496.8A Pending CN115410162A (en) | 2022-07-21 | 2022-07-21 | Multi-target detection and tracking algorithm under complex urban road environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115410162A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117557977A (en) * | 2023-12-28 | 2024-02-13 | 安徽蔚来智驾科技有限公司 | Environment perception information acquisition method, readable storage medium and intelligent device |
CN117557977B (en) * | 2023-12-28 | 2024-04-30 | 安徽蔚来智驾科技有限公司 | Environment perception information acquisition method, readable storage medium and intelligent device |
CN117690165A (en) * | 2024-02-02 | 2024-03-12 | 四川泓宝润业工程技术有限公司 | Method and device for detecting personnel passing between drill rod and hydraulic pliers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||