CN115410162A - Multi-target detection and tracking algorithm under complex urban road environment


Info

Publication number
CN115410162A
Authority
CN
China
Prior art keywords
feature map
feature
size
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210862496.8A
Other languages
Chinese (zh)
Inventor
刘占文
员惠莹
赵彬岩
李超
樊星
王洋
杨楠
齐明远
李宇航
孙士杰
蒋渊德
韩毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changan University
Original Assignee
Changan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changan University filed Critical Changan University
Priority to CN202210862496.8A
Publication of CN115410162A
Pending legal-status Critical Current

Classifications

    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING (Section G: PHYSICS; Class G06: COMPUTING; CALCULATING OR COUNTING)
    • G06V 20/54 Surveillance or monitoring of activities, e.g. for recognising suspicious objects, of traffic, e.g. cars on the road, trains or boats
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 2201/07 Indexing scheme relating to image or video recognition or understanding; Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target detection and tracking method in a complex urban road environment, which comprises the following steps: step 1: constructing a training set and a test set; step 2: adding a feature fusion module layer by layer to the existing DLA34 backbone network to fuse the deep and shallow network features of the input image; step 3: extracting long-distance feature dependencies in the feature map with a Transformer encoding module; step 4: performing further feature fusion and logistic regression; step 5: performing target association and tracking with a multi-target tracking module to obtain a tracking feature map with target detection boxes; step 6: obtaining a trained multi-target detection and tracking model; step 7: inputting the video data to be detected into the trained multi-target detection and tracking model to obtain a tracking feature map with target detection boxes. The invention can accurately detect and track multiple targets in a complex urban road environment and can stably identify targets whose apparent scale changes significantly.

Description

Multi-target detection and tracking algorithm under complex urban road environment
Technical Field
The invention belongs to the technical field of automatic driving, and relates to a traffic target detection and tracking method.
Background
Intelligent transportation has become an important direction for future traffic development, and automatic driving is its typical representative. Automatic driving is a comprehensive, multi-disciplinary technology: its development requires not only the autonomous driving capability of the participating vehicles, but also accurate perception of the complex traffic environment, high-precision maps, vehicle navigation and positioning, vehicle dynamics control and other technologies in order to build a complete vehicle-road cooperative traffic system. In recent years, schemes combining 5G networks with cloud computing have allowed advanced artificial intelligence technology to endow traditional infrastructure with road perception capability, further improving the environment perception of a single vehicle through the Internet of Things and cloud computing. Whether for single-vehicle intelligence or vehicle-road cooperation, sensors are needed to collect information about the external environment. Commonly used sensors include lidar, millimeter-wave radar and cameras; compared with the other sensors, the camera, with its unique cost-effectiveness, has become the preferred visual sensor for environment perception, and camera-based artificial intelligence has become an indispensable key technology for intelligent traffic development. Multi-target detection and tracking is therefore of great significance for perceiving complex traffic environments.
First, traffic-scene images are mostly captured by cameras fixed at a high position, so targets far from the camera are generally small and carry little feature information, while the number of targets in a single frame is large and their sizes differ greatly. The convolutional neural networks commonly adopted in current research down-sample and encode the picture during forward propagation, so the model easily loses small-area targets, which increases the difficulty of capturing them. Second, with the development of deep learning a large number of research results on multi-target tracking have been obtained, but because of target appearance and scale changes, occlusion, motion blur and other factors during tracking, existing tracking algorithms still fall short of an ideal state. For multi-target detection and tracking in traffic scenes, the industry commonly combines a target detection algorithm with a two-stage tracking network based on Kalman filtering and the Hungarian algorithm. Such models have several problems: the detection and tracking modules are independent of each other and cannot be trained simultaneously; tracking performance is determined by detection accuracy, so network training and optimization face a bottleneck; and targets with large inter-frame displacement cannot be tracked stably.
Disclosure of Invention
The invention aims to provide a multi-target detection and tracking algorithm for a complex urban road environment, so as to solve the problems in the prior art that target detection accuracy is not high and targets with large inter-frame displacement cannot be tracked stably.
In order to achieve the purpose, the invention provides the following technical scheme:
a multi-target detection and tracking method under a complex urban road environment specifically comprises the following steps:
step 1: selecting a public data set and performing data enhancement to obtain a data set, and constructing a training set and a test set;
step 2: adding a feature fusion module layer by layer to the existing DLA34 backbone network to fuse the deep and shallow network features of an input image, obtaining two-dimensional feature maps after three rounds of feature fusion;
step 3: according to the feature-fused two-dimensional feature maps, extracting the long-distance feature dependencies in the feature map with a Transformer encoding module to obtain a feature map after dependency extraction;
step 4: generating a heat map and target bounding boxes through further feature fusion and logistic regression;
step 5: performing target association and tracking with a multi-target tracking module to obtain a tracking feature map with target detection boxes;
step 6: training the multi-target detection and tracking model formed by steps 2, 3, 4 and 5 with the training set of step 1 and testing it with the test set, finally obtaining the trained multi-target detection and tracking model;
step 7: inputting the video data to be detected into the trained multi-target detection and tracking model to obtain a tracking feature map with target detection boxes.
Further, in step 1, VisDrone_mot in the mainstream traffic target detection data set VisDrone is selected as the data set of the present invention.
Further, step 2 specifically comprises the following sub-steps:
step 21: inputting the images in the training set into the DLA34 network, performing two convolution operations with 3 × 3 kernels on the original image, each followed by a BatchNorm layer and a ReLU layer, to obtain two feature maps, and inputting the two convolved feature maps into an aggregation node for feature fusion, obtaining a feature map whose resolution is 1/4 that of the original input image;
step 22: down-sampling the 1/4-size feature map obtained in step 21 by a factor of 2 to obtain a new feature map, repeating the convolution and aggregation operations of step 21 on this feature map twice to obtain two feature maps, and performing the aggregation operation again with the aggregation node of step 21 as a common input, obtaining a feature map whose resolution is 1/8 that of the original input image;
step 23: obtaining a 1/16-size feature map from the 1/8-size feature map in the same manner as the 1/8-size feature map was obtained from the 1/4-size feature map in step 22, and obtaining a 1/32-size feature map from the 1/16-size feature map;
step 24: as shown in fig. 2, successively applying a feature fusion module to adjacent pairs of the obtained 1/4-, 1/8-, 1/16- and 1/32-size feature maps to obtain new feature maps of size 1/4, 1/8 and 1/16 respectively.
Further, in step 24, the feature fusion module is configured to perform the following operations:
step 241: performing a deformable convolution with a 3 × 3 kernel on feature map F1, and passing the result through a BatchNorm layer and a ReLU layer to obtain a mapped feature map;
step 242: replacing the transposed convolution of the DLA34 backbone network with direct interpolation up-sampling followed by convolution, and up-sampling the mapped feature map obtained in step 241 by a factor of 2 to obtain feature map F1';
step 243: adding the corresponding channel values of feature map F1' obtained in step 242 and feature map F2 to obtain a merged feature map;
step 244: performing a 3 × 3 deformable convolution on the merged feature map obtained in step 243 and then passing it through the BatchNorm layer and the ReLU layer in turn to obtain a two-dimensional feature map F2';
when feature map F1 and feature map F2 are the 1/4-size and 1/8-size feature maps respectively, the obtained two-dimensional feature map F2' is a 1/4-size feature map;
when feature map F1 and feature map F2 are the 1/8-size and 1/16-size feature maps respectively, the obtained two-dimensional feature map F2' is a 1/8-size feature map;
when feature map F1 and feature map F2 are the 1/16-size and 1/32-size feature maps respectively, the obtained two-dimensional feature map F2' is a 1/16-size feature map.
Further, step 3 specifically comprises the following sub-steps:
step 31: collapsing the 1/16-size two-dimensional feature map finally obtained in step 2 into a one-dimensional sequence and convolving it to form K, V and Q feature maps;
step 32: adding the position encoding pixel by pixel to the feature maps K and Q obtained in step 31 to obtain two feature maps with position information, and feeding them, together with feature map V, into the multi-head attention module as a common input for processing to obtain a new feature map;
step 33: performing a fusion operation, in which corresponding values of the feature maps are added, and a LayerNorm operation on the new feature map obtained in step 32 and the V, K and Q feature maps obtained in step 31;
step 34: processing the result of step 33 in a feed-forward neural network and outputting it through a residual connection to obtain a new feature map.
Further, the position encoding in step 32 is obtained by the following formulas:

$$PE_{(pos,2i)} = \sin\!\left(pos/10000^{2i/d}\right)$$
$$PE_{(pos,2i+1)} = \cos\!\left(pos/10000^{2i/d}\right)$$

where $PE_{(\cdot)}$ is the position-encoding matrix, which has the same resolution as the input feature map, $pos$ denotes the position of the vector in the sequence, $i$ is the channel index, and $d$ denotes the number of channels of the input feature map.
Further, step 4 specifically comprises the following sub-steps:
step 41: up-sampling the feature map finally obtained in step 3 by a factor of 2 to obtain a new feature map;
step 42: performing feature fusion on the 1/4-size and 1/8-size feature maps obtained in step 24 with the same feature fusion module as in step 24 to obtain a new 1/4-size feature map;
step 43: performing feature fusion on the 1/8-size and 1/16-size feature maps obtained in step 24 with the feature fusion module, and adding the result pixel by pixel to the feature map obtained in step 41 to obtain a new 1/8-size feature map;
step 44: performing feature fusion on the 1/4-size feature map obtained in step 42 and the 1/8-size feature map obtained in step 43 with the feature fusion module to generate a heat map whose resolution is 1/4 that of the original image;
step 45: performing logistic regression on the heat map obtained in step 44 and the heat-map labels containing the target center points in the data set obtained in step 1 to obtain the center point $(\hat{x}, \hat{y})$ of each predicted target;
step 46: obtaining the coordinates of the upper-left and lower-right points of the box corresponding to each target through formula (3), and generating the target bounding box:

$$\left(\hat{x}+\delta\hat{x}-\tfrac{\hat{w}}{2},\;\hat{y}+\delta\hat{y}-\tfrac{\hat{h}}{2},\;\hat{x}+\delta\hat{x}+\tfrac{\hat{w}}{2},\;\hat{y}+\delta\hat{y}+\tfrac{\hat{h}}{2}\right) \tag{3}$$

where $(\hat{x}, \hat{y})$ is the center point of the predicted target obtained in step 45, $(\delta\hat{x}, \delta\hat{y})$ denotes the offset of the center point from the target center point, and $(\hat{w}, \hat{h})$ denotes the size of the bounding box corresponding to the target.
Further, step 5 specifically comprises the following sub-steps:
step 51: taking the image input in step 2 as the (T-1)-th frame image, selecting the next frame image, i.e. the T-th frame image, taking the T-th and (T-1)-th frame images as input, and generating feature maps $f_T$ and $f_{T-1}$ respectively through the CenterTrack backbone network;
step 52: feeding the feature maps $f_T$ and $f_{T-1}$ into the cost space module shown in fig. 5 for target association processing to obtain an output feature map $f'_T$;
step 53: performing a Hadamard product of the heat map obtained in step 4 and the feature map $f_{T-1}$ obtained in step 51 to generate a feature map, and performing a deformable convolution on this feature map together with the feature map $f'_T$ obtained in step 52 to generate a fused previous-frame feature map;
step 54: applying three 1 × 1 convolution operations and a down-sampling operation in turn to the fused previous-frame feature map to generate the (T-1)-th frame feature maps; applying three 1 × 1 convolutions to the feature map $f_T$ obtained in step 51 to generate the T-th frame feature maps;
step 55: feeding the T-th frame feature maps obtained in step 54 and the (T-1)-th frame feature maps together into the attention propagation module for feature propagation to obtain a tracking feature map $V'_T$ with target detection boxes.
Further, step 52 specifically comprises the following operations:
step 521: feeding the feature maps $f_T$ and $f_{T-1}$ respectively into the weight-shared three-layer convolution structure of the cost space module to generate feature maps $e_T$ and $e_{T-1}$, i.e. the appearance encoding vectors of the targets;
step 522: performing a max-pooling operation on the feature maps $e_T$ and $e_{T-1}$ to obtain $e'_T$ and $e'_{T-1}$ so as to reduce model complexity; obtaining a cost space matrix C from the product of $e'_T$ and the transpose of $e'_{T-1}$; with (i, j) denoting the position on the cost space matrix C of a target in the current frame, extracting from C a two-dimensional cost matrix $C_{i,j}$ containing the position information of the current-frame target in the previous frame image, and taking the maximum of $C_{i,j}$ along the horizontal and vertical directions respectively to obtain feature maps in the corresponding directions;
step 523: defining two offset templates G and M by formulas (4) and (5):

$$G_{i,j,l} = (l-j)\times s,\quad 1 \le l \le W_C \tag{4}$$
$$M_{i,j,k} = (k-i)\times s,\quad 1 \le k \le H_C \tag{5}$$

where s is the down-sampling factor of the feature map relative to the original image, $W_C$ and $H_C$ are the width and height of the feature map, $G_{i,j,l}$ is the offset with which the target (i, j) of the T-th frame image appears at horizontal position l in the (T-1)-th frame image, and $M_{i,j,k}$ is the offset with which the T-th frame target (i, j) appears at vertical position k in the (T-1)-th frame image;
step 524: multiplying the directional feature maps obtained in step 522 by the offset templates G and M defined in step 523 and stacking the results along the channel dimension to obtain a feature map $O_T$ that represents the offset templates of the target in the horizontal and vertical directions; then restoring $O_T$ to size $H_F \times W_F$ by 2× up-sampling; at the same time, stacking the horizontal and vertical channels of $O_T$ respectively with $f_T$ and $f_{T-1}$ obtained in step 51 along the channel dimension and convolving them to form 2 feature maps of equal size with 9 channels each in the horizontal and vertical directions, and stacking these 2 feature maps along the channel dimension to obtain the output feature map $f'_T$.
Compared with the prior art, the invention has the following beneficial effects:
(1) the resolution of the input pictures of the adopted data set is appropriately increased, ensuring that the size of the final feature map retains more detailed information;
(2) in the multi-target detection module, deep feature maps containing more semantic information and shallow feature maps containing more detailed information are fused by the feature fusion module, improving the model's ability to detect small targets;
(3) a Transformer encoding module with a self-attention mechanism is introduced into the multi-target detection module to capture long-distance dependencies and explore latent relations in the feature map, so that targets with large apparent scale changes can be identified stably;
(4) a multi-target tracking algorithm based on cost space and inter-frame information fusion is proposed; the cost space matrix is used to predict the position of the current-frame target in the previous frame, so that targets between two frames can be associated and tracking is realized;
(5) in the multi-target tracking module, an attention propagation module is introduced to fuse the features of targets across multiple frames, which alleviates the spatial misalignment caused by inter-frame target motion, so that the model can still track accurately when a target is occluded.
Drawings
FIG. 1 is a schematic diagram of a multi-target detection module of the present invention;
FIG. 2 is a schematic diagram of a feature fusion module in the multi-target detection module;
FIG. 3 is a schematic diagram of a Transformer encoding module in a multi-target detection module;
FIG. 4 is a schematic diagram of the multi-target tracking module of the present invention;
FIG. 5 is a schematic diagram of a cost space module in the multi-target tracking module;
FIG. 6 is a schematic diagram of the experimental results of the multi-target detection module of the present invention; the result diagrams show the target center points and target bounding boxes obtained by the module when detecting small targets and large targets.
FIG. 7 is a schematic diagram of the experimental results of the multi-target tracking module of the present invention; the four pictures of each of the two test cases are the 0th, 5th, 10th and 15th frames respectively.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
The multi-target detection and tracking model is divided into two parts. First, the multi-target detection module, whose framework is shown in fig. 1, uses an improved DLA34 as the backbone network with feature fusion modules added to obtain deep-shallow fused feature maps, and introduces a Transformer encoding module to perform self-attention encoding on the fused feature map, alleviating the limitation on the network's ability to extract large-target semantics caused by excessive differences in target feature scales; finally, a target heat map is generated and the bounding box of the corresponding target is obtained by regression, realizing traffic target detection. Second, the target tracking module, whose framework is shown in fig. 4, generates feature maps through the CenterTrack backbone network and uses the cost space matrix to associate and track targets between two frames; through the attention propagation module, the target information of the previous and current frames is fused and complemented, realizing accurate tracking when a target is blurred or occluded.
The invention relates to a multi-target detection and tracking method under a complex urban road environment, which specifically comprises the following steps:
Step 1: a public data set is selected and data enhancement is performed to obtain the data set, and a training set and a test set are constructed.
Specifically, VisDrone_mot in VisDrone is selected as the data set of the invention. The VisDrone_mot data set collects overhead street views of multiple Chinese cities from an unmanned aerial vehicle and provides 96 video sequences, including 56 training sequences with 24201 frames, 7 validation sequences with 2819 frames and 33 test sequences with 12968 frames, with the bounding box of every identified object manually annotated in each video frame. The resolution of the input pictures in the VisDrone_mot data set is increased to 1024 × 1024, ensuring that the final feature map output by the multi-target detection module is 256 × 256 and retains more detailed information; at the same time, a data enhancement scheme combining random flipping, random scaling by a factor of 0.6 to 1.3, random cropping and color jitter is used to extend the training samples.
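As a concrete illustration of this preprocessing, the sketch below augments a single frame. It assumes a PyTorch/torchvision pipeline; the only parameters taken from the text are the 1024 × 1024 input size, random flipping, 0.6 to 1.3 times random scaling, random cropping and color jitter, while the jitter strengths, the helper name augment_frame and the omitted box handling are illustrative assumptions rather than the patent's implementation.

```python
import random

import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF
from PIL import Image


def augment_frame(img: Image.Image) -> torch.Tensor:
    """Augment one VisDrone_mot frame (illustrative sketch; in a real tracking
    pipeline the annotated boxes must be transformed consistently)."""
    # random horizontal flip
    if random.random() < 0.5:
        img = TF.hflip(img)
    # random scaling in the 0.6-1.3x range stated in the text
    scale = random.uniform(0.6, 1.3)
    w, h = img.size
    img = TF.resize(img, [int(round(h * scale)), int(round(w * scale))])
    # random crop back to the 1024 x 1024 network input, padding if needed
    img = T.RandomCrop(1024, pad_if_needed=True)(img)
    # color jitter (strengths are assumptions, not taken from the patent)
    img = T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4)(img)
    return TF.to_tensor(img)  # C x 1024 x 1024 tensor in [0, 1]
```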
Step 2: a feature fusion module is added layer by layer to the existing DLA34 backbone network to fuse the deep and shallow network features of the input image and obtain two-dimensional feature maps after three rounds of feature fusion. As shown in fig. 1, this specifically comprises the following sub-steps:
Step 21: the images in the training set are input into the DLA34 network; two convolution operations with 3 × 3 kernels, each followed by a BatchNorm layer and a ReLU layer, are performed on the original image to obtain two feature maps; the two convolved feature maps are input into an aggregation node for feature fusion, yielding a feature map whose resolution is 1/4 that of the original input image. The feature fusion of the aggregation node is shown in formula (1):
$$N(X_1,\ldots,X_n) = \sigma\!\left(BN(\textstyle\sum w_i x_i + b),\ldots,BN(\textstyle\sum w_i x_i + b)\right) \tag{1}$$

where $N(\cdot)$ denotes the aggregation node, $\sigma(\cdot)$ denotes feature aggregation, $w_i x_i + b$ denotes a convolution operation, $BN$ denotes the BatchNorm operation, and $X_{i=1,\ldots,N}$ are the outputs of the convolution modules.
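The aggregation node of formula (1) can be sketched as follows. This is a minimal interpretation in which each input branch receives its own convolution and BatchNorm and the branches are summed before the activation; the class name AggregationNode, the channel sizes and the choice of element-wise summation as the aggregation are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn


class AggregationNode(nn.Module):
    """Sketch of the aggregation node of formula (1): per-branch convolution
    (w_i x_i + b) and BatchNorm, aggregation of the branches, then ReLU."""

    def __init__(self, in_channels, out_channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c, out_channels, kernel_size=3, padding=1),  # w_i x_i + b
                nn.BatchNorm2d(out_channels),                          # BN(...)
            )
            for c in in_channels
        ])
        self.act = nn.ReLU(inplace=True)

    def forward(self, *xs: torch.Tensor) -> torch.Tensor:
        # aggregate the per-branch outputs by element-wise summation
        out = sum(branch(x) for branch, x in zip(self.branches, xs))
        return self.act(out)


# usage: fuse the two convolved feature maps of step 21 (channel sizes assumed)
node = AggregationNode([64, 64], 64)
fused = node(torch.randn(1, 64, 256, 256), torch.randn(1, 64, 256, 256))
```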
Step 22: the 1/4-size feature map obtained in step 21 is down-sampled by a factor of 2 to obtain a new feature map; the convolution and aggregation operations of step 21 are repeated twice on this feature map to obtain two feature maps; the aggregation operation is then performed again with the aggregation node of step 21 as a common input, yielding a feature map whose resolution is 1/8 that of the original input image. This step is intended to pass shallow network feature information to the deep layers of the network.
Step 23, obtaining a feature map with the size of 1/16 from the feature map with the size of 1/8 according to the same manner of obtaining the feature map with the size of 1/8 from the feature map with the size of 1/4 in the step 22, and obtaining a feature map with the size of 1/32 from the feature map with the size of 1/16;
step 24, as shown in fig. 2, sequentially adopting a feature fusion module to perform feature fusion on the adjacent feature maps of the obtained feature map with the size of 1/4, the feature map with the size of 1/8, the feature map with the size of 1/16 and the feature map with the size of 1/32 to respectively obtain new feature maps with the size of 1/4, the size of 1/8 and the size of 1/16;
the feature fusion module is used for realizing the following operations:
Step 241: a deformable convolution with a 3 × 3 kernel is performed on feature map F1, and the result is passed through a BatchNorm layer and a ReLU layer to obtain a mapped feature map;
step 242: the transposed convolution of the DLA34 backbone network is replaced by direct interpolation up-sampling followed by convolution, and the mapped feature map obtained in step 241 is up-sampled by a factor of 2 to obtain feature map F1', so as to obtain more target position information and reduce the model parameters;
step 243: the corresponding channel values of feature map F1' obtained in step 242 and feature map F2 are added to obtain a merged feature map;
step 244: a 3 × 3 deformable convolution is performed on the merged feature map obtained in step 243, which then passes through the BatchNorm layer and the ReLU layer in turn to obtain a two-dimensional feature map F2';
when feature map F1 and feature map F2 are the 1/4-size and 1/8-size feature maps respectively, the obtained two-dimensional feature map F2' is a 1/4-size feature map;
when feature map F1 and feature map F2 are the 1/8-size and 1/16-size feature maps respectively, the obtained two-dimensional feature map F2' is a 1/8-size feature map;
when feature map F1 and feature map F2 are the 1/16-size and 1/32-size feature maps respectively, the obtained two-dimensional feature map F2' is a 1/16-size feature map.
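A minimal sketch of the feature fusion module of steps 241-244 is given below. It assumes that the lower-resolution (deeper) input is the one that is deformable-convolved and up-sampled so that the spatial sizes match before the channel-wise addition, uses torchvision's DeformConv2d with a plain convolution predicting the offsets (a detail the patent does not specify), and the class names and channel widths are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d


class DeformBlock(nn.Module):
    """3x3 deformable convolution + BatchNorm + ReLU (steps 241 and 244); the
    offsets are predicted by an ordinary 3x3 convolution."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.dcn = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.bn(self.dcn(x, self.offset(x))), inplace=True)


class FeatureFusion(nn.Module):
    """Sketch of the feature fusion module: deformable convolution on the deeper
    map, 2x bilinear up-sampling plus convolution instead of a transposed
    convolution (step 242), channel-wise addition with the shallower map
    (step 243), and a second deformable convolution (step 244)."""

    def __init__(self, deep_ch: int, shallow_ch: int):
        super().__init__()
        self.map_deep = DeformBlock(deep_ch, shallow_ch)                 # step 241
        self.up_conv = nn.Conv2d(shallow_ch, shallow_ch, 3, padding=1)   # step 242
        self.fuse = DeformBlock(shallow_ch, shallow_ch)                  # step 244

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        x = self.map_deep(deep)
        x = self.up_conv(F.interpolate(x, scale_factor=2, mode="bilinear",
                                       align_corners=False))            # F1'
        return self.fuse(x + shallow)                                    # steps 243-244


# usage: fuse a 1/8-resolution map (256 ch) with a 1/4-resolution map (128 ch)
module = FeatureFusion(256, 128)
out = module(torch.randn(1, 256, 128, 128), torch.randn(1, 128, 256, 256))
```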
Step 3: according to the feature-fused feature map obtained in step 2, a Transformer encoding module is used to extract the long-distance feature dependencies in the feature map, obtaining a feature map after dependency extraction. As shown in fig. 3, this specifically comprises the following sub-steps:
Step 31: the 1/16-size two-dimensional feature map finally obtained in step 2 is collapsed into a one-dimensional sequence and convolved to form three feature maps K (Key), V (Value) and Q (Query);
step 32: the position encoding is added pixel by pixel to the feature maps K and Q obtained in step 31 to obtain two feature maps with position information, which, together with feature map V, enter the multi-head attention module as a common input and are processed to obtain a new feature map, so as to capture long-distance dependencies in the image. The position encoding is obtained by formulas (1) and (2):

$$PE_{(pos,2i)} = \sin\!\left(pos/10000^{2i/d}\right) \tag{1}$$
$$PE_{(pos,2i+1)} = \cos\!\left(pos/10000^{2i/d}\right) \tag{2}$$

where $PE_{(\cdot)}$ is the position-encoding matrix, which has the same resolution as the input feature map, $pos$ denotes the position of the vector in the sequence, $i$ is the channel index, and $d$ denotes the number of channels of the input feature map.
Step 33: a fusion operation, in which corresponding values of the feature maps are added, and a LayerNorm (LN) operation are performed on the new feature map obtained in step 32 and the V, K and Q feature maps obtained in step 31 to avoid information loss;
step 34: the result of step 33 is processed in a feed-forward neural network and output through a residual connection to obtain a new feature map.
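The Transformer encoding module of steps 31-34 can be sketched as follows. The sinusoidal position encoding follows formulas (1)/(2) and is added to Q and K only; the residual target of the addition-plus-LayerNorm step, the head count and the feed-forward width are assumptions.

```python
import torch
import torch.nn as nn


def sinusoidal_pe(length: int, d: int) -> torch.Tensor:
    """Position encoding of formulas (1)/(2): sin on even channels, cos on odd."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # pos
    i = torch.arange(0, d, 2, dtype=torch.float32)                 # 2i
    div = torch.pow(10000.0, i / d)                                # 10000^(2i/d)
    pe = torch.zeros(length, d)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe                                                      # length x d


class TransformerEncoderBlock(nn.Module):
    """Sketch of steps 31-34: flatten the 1/16 feature map into a sequence, form
    Q, K, V with 1x1 convolutions, add the position encoding to Q and K, apply
    multi-head attention, then residual addition + LayerNorm and a feed-forward
    network with a residual."""

    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.to_q = nn.Conv2d(d, d, 1)
        self.to_k = nn.Conv2d(d, d, 1)
        self.to_v = nn.Conv2d(d, d, 1)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(inplace=True),
                                 nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: B x d x H x W
        b, d, h, w = x.shape
        flat = lambda t: t.flatten(2).transpose(1, 2)              # B x HW x d
        q, k, v = flat(self.to_q(x)), flat(self.to_k(x)), flat(self.to_v(x))
        pe = sinusoidal_pe(h * w, d).to(x.device)
        out, _ = self.attn(q + pe, k + pe, v)                      # step 32
        out = self.norm1(out + v)                                  # step 33 (residual to V assumed)
        out = self.norm2(out + self.ffn(out))                      # step 34
        return out.transpose(1, 2).reshape(b, d, h, w)             # back to a 2-D feature map


# usage on a 256-channel 1/16-resolution map (d must be divisible by the head count)
y = TransformerEncoderBlock(256)(torch.randn(1, 256, 32, 32))
```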
Step 4: according to the feature maps obtained in steps 2 and 3, a heat map and target bounding boxes are generated through further feature fusion and logistic regression. This specifically comprises the following sub-steps:
and step 41, performing 2 times of upsampling on the feature map finally obtained in the step 3 to obtain a new feature map.
Step 42, performing feature fusion on the feature map with the size of 1/4 and the size of 1/8 obtained in the step 24 by using the same feature fusion module as that in the step 24 to obtain a new feature map with the size of 1/4;
step 43, performing feature fusion on the feature maps with the size of 1/8 and the size of 1/16 obtained in the step 24 by using a feature fusion module, and performing pixel-by-pixel addition on the feature maps obtained in the step 41 to obtain a new feature map with the size of 1/8;
step 44, performing feature fusion on the feature map with the size of 1/4 obtained in the step 42 and the feature map with the size of 1/8 obtained in the step 43 by using a feature fusion module to generate a heat map with the resolution of 1/4 of the original image;
step 45, performing logistic regression on the heat map obtained in the step 44 and the heat map labels containing the target center points in the data set obtained in the step 1 to obtain the center points of the predicted targets
$(\hat{x}, \hat{y})$.
Step 46: the coordinates of the upper-left and lower-right points of the box corresponding to each target are obtained through formula (3), and the target bounding box is generated:

$$\left(\hat{x}+\delta\hat{x}-\tfrac{\hat{w}}{2},\;\hat{y}+\delta\hat{y}-\tfrac{\hat{h}}{2},\;\hat{x}+\delta\hat{x}+\tfrac{\hat{w}}{2},\;\hat{y}+\delta\hat{y}+\tfrac{\hat{h}}{2}\right) \tag{3}$$

where $(\hat{x}, \hat{y})$ is the center point of the predicted target obtained in step 45, $(\delta\hat{x}, \delta\hat{y})$ denotes the offset of the center point from the target center point, and $(\hat{w}, \hat{h})$ denotes the size of the bounding box corresponding to the target.
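A sketch of the decoding in steps 45-46 is shown below. It assumes CenterNet-style heads, i.e. a class heat map plus two-channel offset and size maps at 1/4 resolution; the 3 × 3 max-pool peak extraction, the top-k selection and the function name are common conventions rather than details stated in the patent.

```python
import torch
import torch.nn.functional as F


def decode_boxes(heatmap: torch.Tensor, offset: torch.Tensor, size: torch.Tensor,
                 k: int = 100, stride: int = 4):
    """Turn the heat map peaks of step 45 into the bounding boxes of formula (3).
    heatmap: B x C x H x W class scores; offset, size: B x 2 x H x W regressions."""
    b, c, h, w = heatmap.shape
    # keep only local maxima of the heat map (3x3 max-pool non-maximum suppression)
    pooled = F.max_pool2d(heatmap, 3, stride=1, padding=1)
    heatmap = heatmap * (pooled == heatmap).float()
    scores, idx = heatmap.view(b, -1).topk(k)                      # B x k over C*H*W
    cls = idx // (h * w)
    rem = idx % (h * w)
    ys, xs = (rem // w).float(), (rem % w).float()                 # predicted centers
    sel = rem.unsqueeze(1).expand(-1, 2, -1)
    off = offset.view(b, 2, -1).gather(2, sel)                     # (delta x, delta y)
    wh = size.view(b, 2, -1).gather(2, sel)                        # (w, h)
    cx, cy = xs + off[:, 0], ys + off[:, 1]                        # center + offset
    boxes = torch.stack([cx - wh[:, 0] / 2, cy - wh[:, 1] / 2,     # upper-left corner
                         cx + wh[:, 0] / 2, cy + wh[:, 1] / 2],    # lower-right corner
                        dim=-1) * stride                           # back to input-image pixels
    return boxes, scores, cls                                      # B x k x 4, B x k, B x k
```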
Step 5: according to the image input in step 2 and the heat map obtained in step 4, a multi-target tracking module is used for target association and tracking to obtain a tracking feature map with target detection boxes. As shown in fig. 4, this specifically comprises the following sub-steps:
Step 51: the image input in step 2 is taken as the (T-1)-th frame image, and the next frame image, i.e. the T-th frame image, is selected; the T-th and (T-1)-th frame images are taken as input, and feature maps $f_T$ and $f_{T-1}$ are generated through the CenterTrack backbone network.
Step 52: the feature maps $f_T$ and $f_{T-1}$ are fed into the cost space module shown in fig. 5 for target association processing to obtain an output feature map $f'_T$. This specifically comprises the following operations:
Step 521: the feature maps $f_T$ and $f_{T-1}$ are respectively fed into the weight-shared three-layer convolution structure of the cost space module to generate feature maps $e_T$ and $e_{T-1}$, i.e. the appearance encoding vectors of the targets.
Step 522: a max-pooling operation is performed on the feature maps $e_T$ and $e_{T-1}$ to obtain $e'_T$ and $e'_{T-1}$ so as to reduce model complexity; a cost space matrix C is obtained from the product of $e'_T$ and the transpose of $e'_{T-1}$, storing the similarity of corresponding points between the two frame feature maps; with (i, j) denoting the position on the cost space matrix C of a target in the current frame, a two-dimensional cost matrix $C_{i,j}$ containing the position information of the current-frame target in the previous frame image is extracted from C, and the maximum of $C_{i,j}$ is taken along the horizontal and vertical directions respectively to obtain feature maps in the corresponding directions.
Step 523: two offset templates G and M are defined by formulas (4) and (5):

$$G_{i,j,l} = (l-j)\times s,\quad 1 \le l \le W_C \tag{4}$$
$$M_{i,j,k} = (k-i)\times s,\quad 1 \le k \le H_C \tag{5}$$

where s is the down-sampling factor of the feature map relative to the original image, $W_C$ and $H_C$ are the width and height of the feature map, $G_{i,j,l}$ is the offset with which the target (i, j) of the T-th frame image appears at horizontal position l in the (T-1)-th frame image, and $M_{i,j,k}$ is the offset with which the T-th frame target (i, j) appears at vertical position k in the (T-1)-th frame image.
Step 524: the directional feature maps obtained in step 522 are multiplied by the offset templates G and M defined in step 523, and the results are stacked along the channel dimension to obtain a feature map $O_T$ representing the offset templates of the target in the horizontal and vertical directions; $O_T$ is then restored to size $H_F \times W_F$ by 2× up-sampling; at the same time, the horizontal and vertical channels of $O_T$ are respectively stacked with $f_T$ and $f_{T-1}$ obtained in step 51 along the channel dimension and convolved to form 2 feature maps of equal size with 9 channels each in the horizontal and vertical directions, and these 2 feature maps are stacked along the channel dimension to obtain the output feature map $f'_T$.
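The core of the cost space module (steps 521-524) can be sketched as follows. The sketch correlates the pooled appearance embeddings into the cost space matrix C, reduces each cost slice by a maximum along the horizontal and vertical directions, and combines the results with the offset templates G and M of formulas (4)/(5). The softmax normalisation and the summation into an expected offset are one interpretation of multiplying by the templates and stacking on the channel; the function name and shapes are assumptions.

```python
import torch


def cost_space_offsets(e_t: torch.Tensor, e_t1: torch.Tensor, s: int = 4) -> torch.Tensor:
    """Sketch of steps 521-524: e_t, e_t1 are the pooled appearance embeddings
    of frames T and T-1 (B x C x Hc x Wc); returns per-position horizontal and
    vertical offsets towards the target's previous-frame location."""
    b, c, hc, wc = e_t.shape
    et = e_t.flatten(2)                                        # B x C x N, N = Hc*Wc
    et1 = e_t1.flatten(2)
    cost = torch.einsum("bcn,bcm->bnm", et, et1)               # cost space matrix C
    cost = cost.view(b, hc * wc, hc, wc)                       # slices C_{i,j}
    horiz = cost.max(dim=2).values.softmax(dim=-1)             # B x N x Wc
    vert = cost.max(dim=3).values.softmax(dim=-1)              # B x N x Hc
    # offset templates of formulas (4) and (5); j/i are the current positions
    j = torch.arange(wc).repeat(hc)                            # column index of each (i, j)
    i = torch.arange(hc).repeat_interleave(wc)                 # row index of each (i, j)
    G = (torch.arange(wc).view(1, wc) - j.view(-1, 1)) * s     # N x Wc, (l - j) * s
    M = (torch.arange(hc).view(1, hc) - i.view(-1, 1)) * s     # N x Hc, (k - i) * s
    off_x = (horiz * G.float()).sum(-1).view(b, 1, hc, wc)     # expected horizontal offset
    off_y = (vert * M.float()).sum(-1).view(b, 1, hc, wc)      # expected vertical offset
    return torch.cat([off_x, off_y], dim=1)                    # O_T before 2x up-sampling
```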
Step 53, the heat map obtained in step 4 and the feature map f obtained in step 51 are combined T-1 Performing a hadamard product to generate a feature map
Figure BDA0003757244400000094
Will be provided with
Figure BDA0003757244400000095
And the feature map f 'obtained in step 52' T Performing deformable convolution together to generate feature maps
Figure BDA0003757244400000096
Step 54, will
Figure BDA0003757244400000097
The T-1 th frame feature map (q) is generated by sequentially using 31 × 1 convolution operations and a down-sampling operation t-1 、 k t-1 、v t-1 ) (ii) a The characteristic diagram f obtained in the step 51 is processed T Operating with 31 × 1 convolutions, a Tth frame feature map (Q) is generated t 、K t 、V t );
Step 55, inputting the T-th frame feature map obtained in the step 54 and the T-1 th frame feature map into the attention propagation module together for feature propagation to obtain a tracking feature map V 'with a target detection frame' T . Wherein, the calculation process of the attention spreading module is shown as formula (6):
Figure BDA0003757244400000101
wherein,
Figure BDA0003757244400000102
is a 1 × 1 convolution, d k For the dimensions of feature maps Q and K, Q t 、k t-1 、v t-1 、V t The signature obtained in step 54.
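A sketch of the attention propagation module of formula (6) is given below. It implements scaled dot-product cross-attention from the frame-T query to the frame-(T-1) key/value, a 1 × 1 convolution on the propagated features, and a residual addition of the current-frame value map; the residual combination is an interpretation of how $V_t$ enters formula (6).

```python
import math

import torch
import torch.nn as nn


class AttentionPropagation(nn.Module):
    """Sketch of formula (6): the frame-T query attends to the frame-(T-1)
    key/value, the aggregated previous-frame features pass through a 1x1
    convolution, and the result is added to the current-frame value map."""

    def __init__(self, d: int):
        super().__init__()
        self.proj = nn.Conv2d(d, d, kernel_size=1)             # the 1x1 convolution in (6)

    def forward(self, q_t, k_t1, v_t1, v_t):                   # all B x d x H x W
        b, d, h, w = q_t.shape
        q = q_t.flatten(2).transpose(1, 2)                      # B x HW x d
        k = k_t1.flatten(2)                                     # B x d x HW
        v = v_t1.flatten(2).transpose(1, 2)                     # B x HW x d
        attn = torch.softmax(q @ k / math.sqrt(d), dim=-1)      # softmax(q_t k_{t-1}^T / sqrt(d_k))
        prop = (attn @ v).transpose(1, 2).reshape(b, d, h, w)   # features propagated from T-1
        return v_t + self.proj(prop)                            # V'_T
```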
Step 6: the multi-target detection and tracking model formed by steps 2, 3, 4 and 5 is trained with the training set of step 1 and tested with the test set, finally obtaining the trained multi-target detection and tracking model.
Step 7: the video data to be detected is input into the trained multi-target detection and tracking model to obtain a tracking feature map with target detection boxes.
To verify the feasibility and effectiveness of the invention, the following experiments were carried out:
First, the multi-target detection module of the model (i.e. steps 2-4) is evaluated with average precision and recall. The average precision is derived from the precision; the formulas for precision P and recall R are shown in (7) and (8):
$$P = \frac{TP}{TP + FP} \tag{7}$$
$$R = \frac{TP}{TP + FN} \tag{8}$$

where P is the proportion of targets that should be retrieved (TP) among all retrieved targets (TP + FP), and R is the proportion of targets that should be retrieved (TP) among all targets that should have been retrieved (TP + FN).
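Formulas (7) and (8), together with AP as the area under the PR curve, can be computed as in the short sketch below (the function names and the example counts are illustrative):

```python
import numpy as np


def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall of formulas (7) and (8)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


def average_precision(precisions: np.ndarray, recalls: np.ndarray) -> float:
    """AP as the area enclosed by the PR curve (precision averaged over recall)."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))


# e.g. 80 correctly retrieved targets, 20 false detections and 10 missed targets
print(precision_recall(80, 20, 10))   # (0.8, 0.888...)
```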
In the detection task, the precision reflects the model's ability to detect correctly, while the recall reflects its ability to find all targets. The two indexes constrain each other; a relative balance between precision and recall is found through the average precision (AP) under different confidence thresholds by plotting a two-dimensional PR curve with precision and recall as coordinates. The average precision (AP) is the area enclosed by the PR curve, which is equivalent to averaging the precision.
First, quantitative analysis is carried out on the multi-target detection module, which is compared with baseline models on the VisDrone_mot data set; the experiment also adds per-category performance comparisons between the proposed method and the various baseline methods. Compared with commonly used models with excellent performance, the method of the invention achieves the best recognition performance for large targets, with precision reaching 42.16 and 33.10, and shows good detection capability.
Meanwhile, to intuitively reflect the performance of the whole multi-target detection module and analyze it qualitatively, the results are shown in fig. 6. The model has good detection performance on targets of different scales; after the Transformer module is added, the model captures long-distance dependencies more stably, and while maintaining good recognition of small targets it remains relatively robust in recognizing large targets.
Next, the multi-target tracking module (i.e. step 5) is evaluated with indexes such as MOTA (↑), MOTP (↑), IDF1 (↑), MT (↑), ML (↓), FP (↓), FN (↓), Frag (↓) and IDSW (↓), where ↑ indicates that a larger value means better model performance and ↓ indicates that a smaller value means better model performance.
MOTA denotes the multi-target tracking accuracy; it measures the algorithm's ability to track targets continuously and counts the accumulation of errors during tracking, as shown in formula (9):

$$MOTA = 1 - \frac{\sum_t \left(m_t + fp_t + mme_t\right)}{\sum_t g_t} \tag{9}$$

where $m_t$ corresponds to FP and denotes the number of false positives in the prediction results, i.e. predicted positions in the t-th frame with no corresponding tracked target matching them; $fp_t$ corresponds to FN and denotes the number of false negatives (missed detections), i.e. targets in the t-th frame with no corresponding predicted position matching them; $mme_t$ corresponds to IDSW and denotes the number of mismatches, i.e. the number of ID switches of tracked targets in the t-th frame; and $g_t$ denotes the total number of real targets in the frame. MOTA comprehensively considers false detections, missed detections and ID switches along the target trajectories.
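Formula (9) can be computed per sequence as in the sketch below, where the per-frame error counts are assumed to come from a matching step that is not shown:

```python
def mota(per_frame) -> float:
    """MOTA of formula (9): per_frame is a list of (fp_t, fn_t, idsw_t, gt_t)
    tuples, i.e. false positives, missed detections, identity switches and
    ground-truth target count for each frame."""
    fp = sum(f[0] for f in per_frame)
    fn = sum(f[1] for f in per_frame)
    idsw = sum(f[2] for f in per_frame)
    gt = sum(f[3] for f in per_frame)
    return 1.0 - (fp + fn + idsw) / gt


# e.g. three frames with a handful of errors against 30 ground-truth targets
print(mota([(1, 2, 0, 10), (0, 1, 1, 10), (2, 0, 0, 10)]))   # 0.7666...
```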
MOTP also directly reflects the tracking effect of the model, reflecting the distance between the tracking results and the labelled trajectories, as expressed in formula (10):

$$MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t} \tag{10}$$

where $c_t$ denotes the number of matches in the t-th frame, and the trajectory error $d_{t,i}$ is computed for each matched pair and summed to obtain the final value; the larger this index, the better the model performance and the smaller the trajectory error.
MT is the mostly-tracked number (Mostly Tracked), i.e. the number of trajectories for which more than 80% of the labelled trajectory is hit; larger is better. ML is the mostly-lost number (Mostly Lost), i.e. the number of trajectories for which more than 80% of the labelled trajectory is lost; smaller is better. Frag is the number of fragmentations, i.e. the number of times a trajectory changes from the "tracked" state to the "untracked" state.
For a multi-target tracking detector, the ID-related indexes are also important; specifically there are three important indexes: IDP, IDR and IDF1. IDP denotes the Identification Precision, i.e. the ID identification precision of each target box, as expressed in formula (11):

$$IDP = \frac{IDTP}{IDTP + IDFP} \tag{11}$$

where IDTP and IDFP are the numbers of true positives and false positives of ID prediction respectively. IDR denotes the Identification Recall, i.e. the ID identification recall of each target box, as expressed in formula (12):

$$IDR = \frac{IDTP}{IDTP + IDFN} \tag{12}$$

where IDFN is the number of false negatives of ID prediction. IDF1 denotes the F value of ID prediction (Identification F-Score), i.e. the ID identification F value of each target box; the larger this index, the better. Its calculation is shown in formula (13):

$$IDF1 = \frac{2\,IDTP}{2\,IDTP + IDFP + IDFN} \tag{13}$$
IDF1 is the first default indicator used to evaluate the performance of the tracker, and any two of the three indicators can be used to infer the other.
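The three ID indexes of formulas (11) to (13) can be computed from the ID-level true positives, false positives and false negatives as in the sketch below (the helper name and the example counts are illustrative):

```python
def id_scores(idtp: int, idfp: int, idfn: int) -> dict:
    """IDP, IDR and IDF1 of formulas (11)-(13)."""
    idp = idtp / (idtp + idfp) if idtp + idfp else 0.0
    idr = idtp / (idtp + idfn) if idtp + idfn else 0.0
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn) if idtp else 0.0
    return {"IDP": idp, "IDR": idr, "IDF1": idf1}


print(id_scores(900, 100, 150))   # IDP 0.9, IDR ~0.857, IDF1 ~0.878
```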
First, the multi-target tracking module is compared with mainstream baseline models of recent years in a quantitative experiment. On the VisDrone_mot data set, compared with the second-best model, the proposed tracking method is 3.2 and 1.8 higher in the MOTA and MOTP indexes respectively and obtains better results on the other indexes; the model has a lower false-detection rate, so the ML and MT indexes fluctuate within a normal range. Compared with TBD-type models, the JDT model can be optimized end to end during training because the detection and tracking tasks promote each other, and can therefore achieve a better effect on the tracking task.
Second, the model is analyzed qualitatively on the data set. As shown in fig. 7, two test examples are given, and four pictures of each are shown, namely the 0th, 5th, 10th and 15th frames in the time dimension. It can be seen from the figure that the model can stably track multiple targets in the traffic scene, and in particular has excellent detection and tracking capability for small targets.

Claims (9)

1. A multi-target detection and tracking method in a complex urban road environment, characterized by specifically comprising the following steps:
step 1: selecting a public data set and performing data enhancement to obtain a data set, and constructing a training set and a test set;
step 2: adding a feature fusion module layer by layer to the existing DLA34 backbone network to fuse the deep and shallow network features of an input image, obtaining two-dimensional feature maps after three rounds of feature fusion;
step 3: according to the feature-fused two-dimensional feature maps, extracting the long-distance feature dependencies in the feature map with a Transformer encoding module to obtain a feature map after dependency extraction;
step 4: generating a heat map and target bounding boxes through further feature fusion and logistic regression;
step 5: performing target association and tracking with a multi-target tracking module to obtain a tracking feature map with target detection boxes;
step 6: training the multi-target detection and tracking model formed by steps 2, 3, 4 and 5 with the training set of step 1 and testing it with the test set, finally obtaining the trained multi-target detection and tracking model;
step 7: inputting the video data to be detected into the trained multi-target detection and tracking model to obtain a tracking feature map with target detection boxes.
2. The multi-target detection and tracking method in a complex urban road environment according to claim 1, characterized in that in step 1, VisDrone_mot in the mainstream traffic target detection data set VisDrone is selected as the data set of the invention.
3. The multi-target detection and tracking method in a complex urban road environment according to claim 1, characterized in that step 2 comprises the following sub-steps:
step 21: inputting the images in the training set into the DLA34 network, performing two convolution operations with 3 × 3 kernels on the original image, each followed by a BatchNorm layer and a ReLU layer, to obtain two feature maps, and inputting the two convolved feature maps into an aggregation node for feature fusion, obtaining a feature map whose resolution is 1/4 that of the original input image;
step 22: down-sampling the 1/4-size feature map obtained in step 21 by a factor of 2 to obtain a new feature map, repeating the convolution and aggregation operations of step 21 on this feature map twice to obtain two feature maps, and performing the aggregation operation again with the aggregation node of step 21 as a common input, obtaining a feature map whose resolution is 1/8 that of the original input image;
step 23: obtaining a 1/16-size feature map from the 1/8-size feature map in the same manner as the 1/8-size feature map was obtained from the 1/4-size feature map in step 22, and obtaining a 1/32-size feature map from the 1/16-size feature map;
step 24: as shown in fig. 2, successively applying a feature fusion module to adjacent pairs of the obtained 1/4-, 1/8-, 1/16- and 1/32-size feature maps to obtain new feature maps of size 1/4, 1/8 and 1/16 respectively.
4. The multi-target detection and tracking method in a complex urban road environment according to claim 3, characterized in that in step 24 the feature fusion module is configured to perform the following operations:
step 241: performing a deformable convolution with a 3 × 3 kernel on feature map F1, and passing the result through a BatchNorm layer and a ReLU layer to obtain a mapped feature map;
step 242: replacing the transposed convolution of the DLA34 backbone network with direct interpolation up-sampling followed by convolution, and up-sampling the mapped feature map obtained in step 241 by a factor of 2 to obtain feature map F1';
step 243: adding the corresponding channel values of feature map F1' obtained in step 242 and feature map F2 to obtain a merged feature map;
step 244: performing a 3 × 3 deformable convolution on the merged feature map obtained in step 243 and then passing it through the BatchNorm layer and the ReLU layer in turn to obtain a two-dimensional feature map F2';
when feature map F1 and feature map F2 are the 1/4-size and 1/8-size feature maps respectively, the obtained two-dimensional feature map F2' is a 1/4-size feature map;
when feature map F1 and feature map F2 are the 1/8-size and 1/16-size feature maps respectively, the obtained two-dimensional feature map F2' is a 1/8-size feature map;
when feature map F1 and feature map F2 are the 1/16-size and 1/32-size feature maps respectively, the obtained two-dimensional feature map F2' is a 1/16-size feature map.
5. The multi-target detection and tracking method in a complex urban road environment according to claim 1, characterized in that step 3 comprises the following sub-steps:
step 31: collapsing the 1/16-size two-dimensional feature map finally obtained in step 2 into a one-dimensional sequence and convolving it to form K, V and Q feature maps;
step 32: adding the position encoding pixel by pixel to the feature maps K and Q obtained in step 31 to obtain two feature maps with position information, and feeding them, together with feature map V, into a multi-head attention module as a common input for processing to obtain a new feature map;
step 33: performing a fusion operation, in which corresponding values of the feature maps are added, and a LayerNorm operation on the new feature map obtained in step 32 and the V, K and Q feature maps obtained in step 31;
step 34: processing the result of step 33 in a feed-forward neural network and outputting it through a residual connection to obtain a new feature map.
6. The multi-target detection and tracking method in a complex urban road environment according to claim 5, characterized in that the position encoding in step 32 is obtained by the following formulas:

$$PE_{(pos,2i)} = \sin\!\left(pos/10000^{2i/d}\right)$$
$$PE_{(pos,2i+1)} = \cos\!\left(pos/10000^{2i/d}\right)$$

where $PE_{(\cdot)}$ is the position-encoding matrix, which has the same resolution as the input feature map, $pos$ denotes the position of the vector in the sequence, $i$ is the channel index, and $d$ denotes the number of channels of the input feature map.
7. The multi-target detection and tracking method in the complex urban road environment according to claim 1, wherein the step 4 specifically comprises the following substeps:
and step 41, performing 2 times of upsampling on the feature map finally obtained in the step 3 to obtain a new feature map.
Step 42, performing feature fusion on the feature map with the size of 1/4 and the size of 1/8 obtained in the step 24 by using the same feature fusion module as that in the step 24 to obtain a new feature map with the size of 1/4;
step 43, performing feature fusion on the feature maps with the size of 1/8 and the size of 1/16 obtained in the step 24 by using a feature fusion module, and performing pixel-by-pixel addition on the feature maps obtained in the step 41 to obtain a new feature map with the size of 1/8;
step 44, performing feature fusion on the feature map with the size of 1/4 obtained in the step 42 and the feature map with the size of 1/8 obtained in the step 43 by using a feature fusion module to generate a heat map with the resolution being the size of 1/4 of the original image;
step 45, performing logistic regression on the heat map obtained in step 44 against the heat map labels containing the target center points in the data set obtained in step 1, to obtain the center point (\hat{x}, \hat{y}) of each predicted target;
step 46, obtaining the coordinates of the top-left and bottom-right points of the box corresponding to each target through formula (3), and generating the target bounding box:

(\hat{x} + \delta\hat{x} - \hat{w}/2,\ \hat{y} + \delta\hat{y} - \hat{h}/2,\ \hat{x} + \delta\hat{x} + \hat{w}/2,\ \hat{y} + \delta\hat{y} + \hat{h}/2)   (3)

wherein (\hat{x}, \hat{y}) is the center point of the predicted target obtained in step 45, (\delta\hat{x}, \delta\hat{y}) is the offset of the center point from the target center point, and (\hat{w}, \hat{h}) is the size of the bounding box corresponding to the target.
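The following sketch decodes bounding boxes in the CenterNet style described by the reconstruction of formula (3) above: the predicted center is corrected by the offset, and the size is split symmetrically to give the top-left and bottom-right corners. The tensor layout (N, 2) in (x, y) order and the function name are assumptions.

import torch


def decode_boxes(centers, offsets, sizes):
    # Corners follow formula (3) as reconstructed above:
    # (center + offset) -/+ size / 2 gives the top-left / bottom-right points.
    refined = centers + offsets                          # offset-corrected center
    top_left = refined - sizes / 2.0
    bottom_right = refined + sizes / 2.0
    return torch.cat([top_left, bottom_right], dim=1)    # (N, 4): x1, y1, x2, y2


centers = torch.tensor([[50.0, 60.0]])                   # predicted center (x, y)
offsets = torch.tensor([[0.3, -0.2]])                    # center-point offset
sizes = torch.tensor([[20.0, 10.0]])                     # box width and height
print(decode_boxes(centers, offsets, sizes))             # tensor([[40.3000, 54.8000, 60.3000, 64.8000]])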
8. The multi-target detection and tracking method in the complex urban road environment according to claim 1, wherein the step 5 specifically comprises the following substeps:
step 51, taking the same image input in step 2 as the (T-1)-th frame image, selecting the next frame image, namely the T-th frame image, using the T-th and (T-1)-th frame images as input, and generating the feature maps f_T and f_{T-1} respectively through the CenterTrack backbone network;
step 52, sending the feature maps f_T and f_{T-1} to the cost space module shown in FIG. 5 for target correlation processing to obtain the output feature map f'_T;
step 53, performing a Hadamard product between the heat map obtained in step 4 and the feature map f_{T-1} obtained in step 51 to generate an intermediate feature map; then passing this intermediate feature map, together with the feature map f'_T obtained in step 52, through a deformable convolution to generate a fused feature map;
step 54, processing the fused feature map obtained in step 53 with three 1 × 1 convolution operations followed by a downsampling operation to generate the (T-1)-th frame feature map; processing the feature map f_T obtained in step 51 with three 1 × 1 convolutions to generate the T-th frame feature map;
step 55, inputting the T-th frame feature map and the (T-1)-th frame feature map obtained in step 54 together into the attention propagation module for feature propagation, to obtain the tracking feature map V'_T carrying the target detection boxes.
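Purely as an illustration of the data flow in steps 53-54, and not the patented layer configuration, the sketch below gates the previous-frame features with the heat map via a Hadamard product, fuses the gated map with the cost-space output through a deformable convolution, and reduces each branch with three 1×1 convolutions; the offset-prediction convolution, the channel counts and the pooling choice are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d


class ToyPropagation(nn.Module):
    # Loose sketch of steps 53-54: the heat map gates the previous-frame
    # features (Hadamard product), the gated map and the cost-space output f'_T
    # pass through a deformable convolution, and each branch is then reduced by
    # three 1x1 convolutions (plus a downsampling for the previous-frame branch).
    def __init__(self, c=64):
        super().__init__()
        self.offset = nn.Conv2d(2 * c, 18, kernel_size=3, padding=1)  # 2*3*3 offsets
        self.deform = DeformConv2d(2 * c, c, kernel_size=3, padding=1)
        self.reduce_prev = nn.Sequential(*[nn.Conv2d(c, c, 1) for _ in range(3)])
        self.reduce_curr = nn.Sequential(*[nn.Conv2d(c, c, 1) for _ in range(3)])

    def forward(self, heatmap, f_prev, f_assoc, f_curr):
        gated = heatmap * f_prev                               # Hadamard product (step 53)
        x = torch.cat([gated, f_assoc], dim=1)
        fused = self.deform(x, self.offset(x))                 # deformable convolution
        prev_feat = F.avg_pool2d(self.reduce_prev(fused), 2)   # (T-1)-frame feature map
        curr_feat = self.reduce_curr(f_curr)                   # T-frame feature map
        return prev_feat, curr_feat


h = torch.rand(1, 1, 40, 40)                 # detection heat map
fp = torch.randn(1, 64, 40, 40)              # f_{T-1}
fa = torch.randn(1, 64, 40, 40)              # f'_T from the cost space module
fc = torch.randn(1, 64, 40, 40)              # f_T
prev_feat, curr_feat = ToyPropagation()(h, fp, fa, fc)
print(prev_feat.shape, curr_feat.shape)      # torch.Size([1, 64, 20, 20]) torch.Size([1, 64, 40, 40])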
9. The multi-target detection and tracking method in a complex urban road environment according to claim 8, wherein said step 52 specifically comprises the following operations:
step 521, sending the feature maps f_T and f_{T-1} respectively into the three-layer convolution structure of the cost space module to generate the feature maps e_T and e_{T-1}, i.e. the appearance coding vectors of the targets;
step 522, performing a max pooling operation on the feature maps e_T and e_{T-1} to obtain e'_T and e'_{T-1}, so as to reduce model complexity; computing the cost space matrix C as the product of e'_T and the transpose of e'_{T-1}; for a target whose position in the current frame is (i, j), extracting from C a two-dimensional cost matrix C_{i,j} containing the position information, in the previous frame image, of that current-frame target; and taking the maximum of C_{i,j} along the horizontal direction and the vertical direction respectively to obtain the feature maps in the corresponding directions;
step 523, defining two offset templates G and M by equations (4) and (5):

G_{i,j,l} = (l - j) \times s, \quad 1 \le l \le W_C   (4)

M_{i,j,k} = (k - i) \times s, \quad 1 \le k \le H_C   (5)

wherein s is the downsampling multiple of the feature map relative to the original image, W_C and H_C are the width and height of the feature map, G_{i,j,l} is the offset of a target at (i, j) in the T-th frame image appearing at horizontal position l in the (T-1)-th frame image, and M_{i,j,k} is the offset of a target at (i, j) in the T-th frame appearing at vertical position k in the (T-1)-th frame image;
step 524, multiplying the feature maps obtained in step 522 by the offset templates G and M defined in step 523 respectively, and then superimposing the results on the channel dimension to obtain the feature map O_T, which represents the offset templates of the target in the horizontal and vertical directions; upsampling O_T by a factor of 2 to restore it to the size H_F × W_F; meanwhile, superimposing the horizontal and vertical channels of O_T on the channel dimension with the feature maps f_T and f_{T-1} obtained in step 51 respectively, forming by convolution 2 feature maps, one for the horizontal and one for the vertical direction, with unchanged spatial size and 9 channels, and superimposing these 2 feature maps on the channel dimension to obtain the output feature map f'_T.
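A simplified sketch of the cost-space computation in steps 521-522, under assumed shapes: a weight-shared embedding encodes both frames, max pooling shrinks the maps, the cost matrix is the product of the current embeddings with the transposed previous embeddings, and per-location maxima along the two axes of the previous frame give the directional cost maps. The embedding network and all channel counts are assumptions, and the offset-template weighting of steps 523-524 is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F


def toy_cost_space(f_curr, f_prev, embed):
    # Steps 521-522 under simplified shapes: shared embedding, max pooling,
    # cost matrix C = e'_T x transpose(e'_{T-1}), then per-location maxima
    # along the vertical / horizontal axes of the previous frame.
    e_curr, e_prev = embed(f_curr), embed(f_prev)          # same embedding for both frames
    e_curr, e_prev = F.max_pool2d(e_curr, 2), F.max_pool2d(e_prev, 2)
    b, c, h, w = e_curr.shape
    ec = e_curr.flatten(2).transpose(1, 2)                 # (B, h*w, C)
    ep = e_prev.flatten(2)                                 # (B, C, h*w)
    cost = torch.bmm(ec, ep).view(b, h, w, h, w)           # C[i, j, :, :] per target
    cost_w = cost.max(dim=3).values                        # horizontal-direction map
    cost_h = cost.max(dim=4).values                        # vertical-direction map
    return cost_w, cost_h


embed = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 32, 1))                # assumed 3-layer structure
f_t = torch.randn(1, 64, 40, 40)
f_t_minus_1 = torch.randn(1, 64, 40, 40)
cost_w, cost_h = toy_cost_space(f_t, f_t_minus_1, embed)
print(cost_w.shape, cost_h.shape)   # torch.Size([1, 20, 20, 20]) torch.Size([1, 20, 20, 20])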
CN202210862496.8A 2022-07-21 2022-07-21 Multi-target detection and tracking algorithm under complex urban road environment Pending CN115410162A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210862496.8A CN115410162A (en) 2022-07-21 2022-07-21 Multi-target detection and tracking algorithm under complex urban road environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210862496.8A CN115410162A (en) 2022-07-21 2022-07-21 Multi-target detection and tracking algorithm under complex urban road environment

Publications (1)

Publication Number Publication Date
CN115410162A true CN115410162A (en) 2022-11-29

Family

ID=84157278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210862496.8A Pending CN115410162A (en) 2022-07-21 2022-07-21 Multi-target detection and tracking algorithm under complex urban road environment

Country Status (1)

Country Link
CN (1) CN115410162A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557977A (en) * 2023-12-28 2024-02-13 安徽蔚来智驾科技有限公司 Environment perception information acquisition method, readable storage medium and intelligent device
CN117557977B (en) * 2023-12-28 2024-04-30 安徽蔚来智驾科技有限公司 Environment perception information acquisition method, readable storage medium and intelligent device
CN117690165A (en) * 2024-02-02 2024-03-12 四川泓宝润业工程技术有限公司 Method and device for detecting personnel passing between drill rod and hydraulic pliers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination