CN115410162A

CN115410162A - Multi-target detection and tracking algorithm under complex urban road environment

Info

Publication number: CN115410162A
Application number: CN202210862496.8A
Authority: CN
Inventors: 刘占文; 员惠莹; 赵彬岩; 李超; 樊星; 王洋; 杨楠; 齐明远; 李宇航; 孙士杰; 蒋渊德; 韩毅
Original assignee: Changan University
Current assignee: Changan University
Priority date: 2022-07-21
Filing date: 2022-07-21
Publication date: 2022-11-29

Abstract

The invention discloses a multi-target detection and tracking method in a complex urban road environment: step 1: construct a training set and a test set; step 2: add feature fusion modules layer by layer on the basis of the existing DLA34 backbone network to realize input images Deep and shallow network feature fusion; Step 3: Use the Transformer coding module to extract long-distance feature dependencies in the feature map; Step 4: Through further feature fusion and logistic regression processing; Step 5: Use the multi-target tracking module for target association processing and tracking , to obtain the tracking feature map with the target detection frame; step 6, to obtain the trained multi-target detection and tracking model; step 7, input the video data to be detected into the trained multi-target detection and tracking model, and obtain the target Tracking feature maps for detection boxes. The invention can accurately detect and track multiple targets in a complex urban road environment, and can stably identify targets with large changes in appearance scale.

Description

A Multi-target Detection and Tracking Algorithm in Complex Urban Road Environment

技术领域technical field

本发明属于自动驾驶技术领域，涉及一种交通目标检测与跟踪的方法。The invention belongs to the technical field of automatic driving, and relates to a method for detecting and tracking a traffic target.

背景技术Background technique

智慧交通已成为未来交通发展的重要方向，其典型代表自动驾驶是一门多学科、多领域交织的综合技术，自动驾驶技术的发展不仅需要交通参与车辆具备自动驾驶能力，还需要掌握复杂交通环境的精确感知技术，高精度地图及车辆的导航定位，车辆动力学控制等技术，构建完善的车路协同交通体系。近年来，5G网络叠加云计算的方案能够以先进的人工智能技术赋予传统的基础设施感知道路的能力，进而通过物联网、云计算提高单车的环境感知能力。不论是单车智能还是车路协同，都需要传感器对外界环境信息进行采集。目前常用的传感器包括激光雷达、毫米波雷达和摄像机，相比于其他传感器，摄像机以其独有的性价比成为了环境感知的首选视觉传感器，基于摄像机的人工智能技术成为了智慧交通发展不可或缺的关键技术，因此，多目标检测及跟踪对复杂交通环境的感知环境具有重要意义。Smart transportation has become an important direction of future transportation development. Its typical representative automatic driving is a multi-disciplinary and multi-field intertwined comprehensive technology. The development of automatic driving technology requires not only the ability of traffic participating vehicles to have automatic driving capabilities, but also the need to master complex traffic environments. Advanced precise perception technology, high-precision maps, vehicle navigation and positioning, vehicle dynamics control and other technologies to build a complete vehicle-road collaborative traffic system. In recent years, the 5G network superimposed cloud computing solution can use advanced artificial intelligence technology to endow traditional infrastructure with the ability to perceive the road, and then improve the environmental awareness of bicycles through the Internet of Things and cloud computing. Whether it is single-vehicle intelligence or vehicle-road coordination, sensors are required to collect external environmental information. Currently commonly used sensors include lidar, millimeter-wave radar and cameras. Compared with other sensors, cameras have become the preferred visual sensor for environmental perception due to their unique cost performance. Camera-based artificial intelligence technology has become indispensable for the development of smart transportation. Therefore, multi-target detection and tracking is of great significance to the perception of the complex traffic environment.

首先，交通场景目标多是由固定在高处的摄像机拍摄，存在画面较远处目标普遍较小、特征信息较少、交通场景下同画面目标数量较多且尺寸相差较大等问题。当前研究普遍采用的卷积神经网络在前向传播过程中会对图片进行下采样编码，导致模型易丢失面积较小的目标，加大模型捕捉目标的难度。其次，随着深度学习的发展在多目标跟踪上已经取得了大量的研究成果，但是由于跟踪过程中目标外观尺寸变化、遮挡、快速移动导致的模糊等因素的影响，使得现有跟踪算法无法达到理想状态。针对交通场景下多目标检测与跟踪问题，现今工业界普遍使用目标检测算法和基于卡尔曼滤波、匈牙利算法的两阶段跟踪网络，这样的模型具有一些问题：目标检测和跟踪模块相互独立，无法同时训练，同时，目标检测的精准度决定目标跟踪的性能，导致网络的训练和优化存在瓶颈，且针对帧间位移大的目标无法做到稳定的跟踪。First of all, the objects in traffic scenes are mostly captured by cameras fixed at high places. There are problems such as small objects farther away in the picture, less feature information, and a large number of objects in the same picture in traffic scenes with large differences in size. The convolutional neural network commonly used in current research will down-sample and encode the image during the forward propagation process, which makes the model easy to lose the target with a small area and makes it more difficult for the model to capture the target. Secondly, with the development of deep learning, a lot of research results have been achieved in multi-target tracking. However, due to the influence of factors such as changes in the appearance and size of targets, occlusion, and blurring caused by rapid movement during the tracking process, existing tracking algorithms cannot achieve this goal. Ideal state. Aiming at the problem of multi-target detection and tracking in traffic scenarios, target detection algorithms and two-stage tracking networks based on Kalman filter and Hungarian algorithm are commonly used in the industry today. Such a model has some problems: the target detection and tracking modules are independent of each other and cannot be simultaneously At the same time, the accuracy of target detection determines the performance of target tracking, resulting in bottlenecks in network training and optimization, and stable tracking of targets with large inter-frame displacements cannot be achieved.

发明内容Contents of the invention

本发明的目的是提供一种复杂城市道路环境下的多目标检测与跟踪算法，以解决现有技术中存在的目标检测的精准度不高以及对帧间位移大的目标无法做到稳定的跟踪的问题。The purpose of the present invention is to provide a multi-target detection and tracking algorithm in a complex urban road environment to solve the problem of low target detection accuracy and the inability to achieve stable tracking of targets with large inter-frame displacements in the prior art The problem.

为实现上述目的，本发明提供如下技术方案：To achieve the above object, the present invention provides the following technical solutions:

一种复杂城市道路环境下的多目标检测与跟踪方法，具体包括以下步骤：A multi-target detection and tracking method in a complex urban road environment, specifically comprising the following steps:

步骤1：选取公共数据集进行数据增强，得到数据集，构建训练集和测试集；Step 1: Select a public dataset for data augmentation, obtain the dataset, and build a training set and a test set;

步骤2：在现有的DLA34骨干网络基础上逐层增加特征融合模块实现输入图像的深浅层网络特征融合，得到三个特征融合后的二维特征图；Step 2: On the basis of the existing DLA34 backbone network, add the feature fusion module layer by layer to realize the deep and shallow network feature fusion of the input image, and obtain the two-dimensional feature map after three feature fusion;

步骤3：根据特征融合后的二维特征图，采用Transformer编码模块提取特征图中长距离特征依赖关系，得到提取依赖关系后的特征图；Step 3: According to the two-dimensional feature map after feature fusion, the Transformer coding module is used to extract the long-distance feature dependency relationship in the feature map, and the feature map after the dependency relationship is extracted;

步骤4：通过进一步特征融合及逻辑回归处理，生成热图以及目标边界框；Step 4: Through further feature fusion and logistic regression processing, generate a heat map and target bounding box;

步骤5：利用多目标跟踪模块进行目标关联处理与跟踪，得到带有目标检测框的跟踪特征图；Step 5: Use the multi-target tracking module to perform target association processing and tracking, and obtain a tracking feature map with a target detection frame;

步骤6，采用步骤1的训练集对由步骤2、3、4、5组成的多目标检测与跟踪模型进行训练，并采用测试集进行测试，最终得到训练好的多目标检测与跟踪模型；Step 6, using the training set of step 1 to train the multi-target detection and tracking model composed of steps 2, 3, 4, and 5, and using the test set to test, and finally obtain the trained multi-target detection and tracking model;

步骤7，将待检测的视频数据输入训练好的多目标检测与跟踪模型，得到带有目标检测框的跟踪特征图。Step 7: Input the video data to be detected into the trained multi-target detection and tracking model to obtain the tracking feature map with the target detection frame.

进一步的，所述步骤1中，选择主流交通目标检测数据集VisDrone中的VisDrone_mot 作为本发明的数据集。Further, in the step 1, VisDrone_mot in the mainstream traffic object detection data set VisDrone is selected as the data set of the present invention.

进一步的，所述步骤2具体包括如下子步骤：Further, the step 2 specifically includes the following sub-steps:

步骤21，将训练集中的图像输入至DLA34网络，经过BatchNorm层和ReLU层对原始图像进行两次卷积核为3×3大小的卷积操作得到两张特征图，将卷积后的两张特征图输入聚合结点进行特征融合，得到分辨率为原输入特征图1/4大小的特征图；Step 21, input the images in the training set to the DLA34 network, perform two convolution operations on the original image with a convolution kernel of 3×3 size through the BatchNorm layer and the ReLU layer to obtain two feature maps, and the two convoluted two The feature map is input to the aggregation node for feature fusion, and a feature map with a resolution of 1/4 the size of the original input feature map is obtained;

步骤22，将步骤21中得到的1/4大小的特征图进行2倍下采样得到新的特征图，将该特征图重复两次步骤21中的卷积操作和聚合操作得到两张特征图，并与步骤21中得到的聚合结点作为共同输入再次进行聚合操作，得到分辨率为原输入特征图1/8大小的特征图；Step 22, perform 2 times downsampling on the 1/4 size feature map obtained in step 21 to obtain a new feature map, repeat the convolution operation and aggregation operation in step 21 twice on the feature map to obtain two feature maps, And perform the aggregation operation again with the aggregation node obtained in step 21 as a common input, and obtain a feature map with a resolution of 1/8 of the size of the original input feature map;

步骤23，按照步骤22中由1/4大小的特征图得到原输入特征图1/8大小的特征图的相同的方式，由1/8大小的特征图得到1/16大小的特征图，再由1/16大小的特征图得到1/32大小的特征图；Step 23: Obtain a feature map of size 1/16 from the feature map of size 1/8 in the same way as in step 22 to obtain a feature map of size 1/8 of the original input feature map from a feature map of size 1/4, and then A feature map of size 1/32 is obtained from a feature map of size 1/16;

步骤24，如图2所示，将得到的1/4大小的特征图、1/8大小的特征图、1/16大小的特征图、1/32大小的特征图依次采用特征融合模块进行相邻特征图特征融合，分别得到1/4大小、 1/8大小和1/16大小的新的特征图。Step 24, as shown in Figure 2, the obtained 1/4 size feature map, 1/8 size feature map, 1/16 size feature map, and 1/32 size feature map are sequentially used in the feature fusion module for correlation The adjacent feature map features are fused to obtain new feature maps of 1/4 size, 1/8 size and 1/16 size respectively.

进一步的，所述步骤24中，所述的特征融合模块用于实现以下操作：Further, in the step 24, the feature fusion module is used to realize the following operations:

步骤241，对特征图F1进行卷积核为3×3大小的可变形卷积处理，将处理得到的结果通过BatchNorm层和ReLU层得到经过映射的特征图；Step 241, perform deformable convolution processing with a convolution kernel of 3×3 size on the feature map F1, and pass the processed result through the BatchNorm layer and the ReLU layer to obtain a mapped feature map;

步骤242，使用直接插值上采样加卷积处理的形式代替DLA34骨干网络中的转置卷积，对步骤241中得到映射后的特征图进行2倍上采样，得到特征图F1＇；Step 242, using direct interpolation upsampling plus convolution processing to replace the transposed convolution in the DLA34 backbone network, and perform 2 times upsampling on the mapped feature map obtained in step 241 to obtain the feature map F1';

步骤243，将步骤242中得到的特征图F1＇与特征图F2对应通道值相加，得到合并特征图；Step 243, adding the feature map F1' obtained in step 242 to the corresponding channel value of the feature map F2 to obtain a merged feature map;

步骤244，将步骤243得到的合并特征图经3×3大小的可变形卷积处理后，依次通过 BatchNorm层和ReLU层，得到二维特征图F2＇；Step 244, after the merged feature map obtained in step 243 is processed by a deformable convolution of 3×3 size, it passes through the BatchNorm layer and the ReLU layer in turn to obtain a two-dimensional feature map F2';

当特征图F1、特征图F2分别为1/4大小的特征图、1/8大小的特征图时，得到的二维特征图F2＇为1/4大小的特征图；When the feature map F1 and feature map F2 are 1/4 size feature map and 1/8 size feature map respectively, the obtained two-dimensional feature map F2' is a 1/4 size feature map;

当特征图F1、特征图F2分别为1/8大小的特征图、1/16大小的特征图时，得到的二维特征图F2＇为1/8大小的特征图；When the feature map F1 and feature map F2 are 1/8 size feature map and 1/16 size feature map respectively, the obtained two-dimensional feature map F2' is a 1/8 size feature map;

当特征图F1、特征图F2分别为1/16大小的特征图、1/32大小的特征图时，得到的二维特征图F2＇为1/16大小的特征图。When the feature map F1 and the feature map F2 are feature maps with a size of 1/16 and feature maps with a size of 1/32 respectively, the obtained two-dimensional feature map F2' is a feature map with a size of 1/16.

进一步的，所述步骤3具体包括如下子步骤：Further, the step 3 specifically includes the following sub-steps:

步骤31，将步骤2最终得到的1/16大小的二维特征图坍缩成一维序列，并卷积形成K、 V、Q特征图；Step 31, collapse the 1/16-size two-dimensional feature map finally obtained in step 2 into a one-dimensional sequence, and convolute to form K, V, Q feature maps;

步骤32，将位置编码与步骤31得到的特征图K和特征图Q分别进行逐像素相加得到带有位置信息的两个特征图，该两个特征图与特征图V作为共同输入进入多头注意力模块，经处理得到新的特征图；Step 32, add the position code and the feature map K and feature map Q obtained in step 31 pixel by pixel to obtain two feature maps with position information, and the two feature maps and feature map V are used as common input into the multi-head attention The force module is processed to obtain a new feature map;

步骤33，对步骤32得到的新的特征图再与步骤31得到的V、K、Q特征图进行特征图间对应值相加的融合操作和LayerNorm操作；Step 33, performing a fusion operation and a LayerNorm operation of adding corresponding values between the feature maps to the new feature map obtained in step 32 and the V, K, and Q feature maps obtained in step 31;

步骤34，步骤33得到的结果进入前馈神经网络中进行处理，并通过残差连接输出，得到新的特征图。The results obtained in step 34 and step 33 enter the feedforward neural network for processing, and output through the residual connection to obtain a new feature map.

进一步的，所述步骤32中的位置编码由下式得到：Further, the position code in the step 32 is obtained by the following formula:

PE_(pos,2i)＝sin(pos/10000^2i/d)PE _(pos,2i) = sin(pos/10000 ^2i/d )

pE_(pos,2i+1)＝cos(pos/10000^2i/d)pE _(pos,2i+1) ＝cos(pos/10000 ^2i/d )

其中，PE_(·)为位置编码的矩阵，其分辨率大小与输入特征图分辨率大小一样，pos表示向量在序列中的位置，而i是通道的索引，d表示输入特征图的通道数。Among them, PE _{( )} is the matrix of position encoding, and its resolution is the same as the resolution of the input feature map, pos indicates the position of the vector in the sequence, and i is the index of the channel, and d indicates the channel number of the input feature map.

进一步的，所述步骤4具体包括如下子步骤：Further, the step 4 specifically includes the following sub-steps:

步骤41，将步骤3最终得到的特征图进行2倍上采样得到一个新的特征图。Step 41, perform double upsampling on the feature map finally obtained in step 3 to obtain a new feature map.

步骤42，将步骤24得到的1/4大小与1/8大小的特征图采用与步骤24中相同的特征融合模块进行特征融合，得到一个1/4大小的新特征图；Step 42, use the same feature fusion module as in step 24 to perform feature fusion on the 1/4 size and 1/8 size feature map obtained in step 24, and obtain a 1/4 size new feature map;

步骤43，将步骤24得到的1/8大小、1/16大小的特征图采用特征融合模块进行特征融合，并与步骤41得到的特征图进行逐像素相加得到一个1/8大小的新特征图；Step 43, use the feature fusion module to perform feature fusion on the 1/8 size and 1/16 size feature map obtained in step 24, and add pixel by pixel to the feature map obtained in step 41 to obtain a new feature of 1/8 size picture;

步骤44，将步骤42得到的1/4大小的特征图和步骤43得到的1/8大小的特征图仍采用特征融合模块进行特征融合生成分辨率为原图1/4大小的热图；Step 44, using the feature fusion module to perform feature fusion on the 1/4 size feature map obtained in step 42 and the 1/8 size feature map obtained in step 43 to generate a heat map with a resolution of 1/4 the size of the original image;

步骤45，对步骤44中得到的热图与步骤1得到的数据集中包含目标中心点的热图标签进行逻辑回归，得到预测目标的中心点

Step 45, perform logistic regression on the heat map obtained in step 44 and the heat map label containing the center point of the target in the data set obtained in step 1, to obtain the center point of the predicted target

步骤46，通过式(3)得到每个目标对应的边框左上点与右下点坐标，生成目标边界框：Step 46, obtain the coordinates of the upper left point and the lower right point of the frame corresponding to each target through formula (3), and generate the target bounding box:

其中，

即步骤45得到预测目标的中心点，

表示中心点与目标中心点的偏移量，

表示目标对应的边框的尺寸。in,

That is, step 45 obtains the center point of the predicted target,

Indicates the offset between the center point and the target center point,

Indicates the size of the bounding box corresponding to the target.

进一步的，所述步骤5具体包括如下子步骤：Further, the step 5 specifically includes the following sub-steps:

步骤51，将输入步骤2的同一个图像作为第T-1帧图像，并选取其下一帧图像即第T帧图像，把第T帧和第T-1帧图像作为输入，经过CenterTrack骨干网络处理分别生成特征图f_T和 f_T-1；Step 51, use the same image input in step 2 as the T-1th frame image, and select the next frame image, that is, the T-th frame image, and use the T-th frame and the T-1th frame image as input, and pass through the CenterTrack backbone network Processing generates feature maps f _T and f _T-1 respectively;

步骤52，将特征图f_T和f_T-1分别送入如图5所示的代价空间模块进行目标关联处理，得到输出特征图f′_T；Step 52, send the feature maps f _T and f _T-1 to the cost space module as shown in Figure 5 for target association processing, and obtain the output feature map f'_T;

步骤53，将步骤4中得到的热图与步骤51得到的特征图f_T-1进行哈德玛乘积生成特征图

将

与步骤52得到的特征图f′_T一起进行可变形卷积生成特征图

Step 53, perform the Hadamard product of the heat map obtained in step 4 and the feature map f _T-1 obtained in step 51 to generate a feature map

Will

Perform deformable convolution together with the feature map f' _T obtained in step 52 to generate a feature map

步骤54，将

依次使用3个1×1卷积操作、下采样操作，生成第T-1帧特征图；将步骤51中得到的特征图f_T使用3个1×1卷积进行操作，生成第T帧特征图；Step 54, will

Use three 1×1 convolution operations and downsampling operations in turn to generate the feature map of the T-1th frame; use the feature map f _T obtained in step 51 to operate with three 1×1 convolutions to generate the feature map of the T-th frame picture;

步骤55，将步骤54中的得到的第T帧特征图与第T-1帧特征图共同输入注意力传播模块进行特征传播得到带有目标检测框的跟踪特征图V′_T。Step 55, input the T-th frame feature map obtained in step 54 together with the T-1-th frame feature map into the attention propagation module for feature propagation to obtain a tracking feature map V′ _T with a target detection frame.

进一步的，所述步骤52具体包括如下操作：Further, the step 52 specifically includes the following operations:

步骤521，将特征图f_T和f_T-1分别送入代价空间模块中的三层权值共享的卷积结构生成特征图e_T和e_T-1，即目标的外观编码向量；Step 521, send the feature maps f _T and f _T-1 to the three-layer weight-shared convolution structure in the cost space module to generate feature maps e _T and e _T-1 , which are the appearance encoding vectors of the target;

步骤522，对特征图e_T和e_T-1进行最大池化操作得到e′_T和e′_T-1，以降低模型复杂度，使用e′_T和e′_T-1乘积的转置计算得到代价空间矩阵C，代价空间矩阵C上的目标在当前帧的位置为(i,j)，从代价空间矩阵C中提取含有当前帧中的目标在前一帧图像中的位置信息的二维代价矩阵C_i,j，对C_i,j的水平方向和竖直方向分别取最大值得到对应方向的特征图

Step 522, perform the maximum pooling operation on the feature maps e _T and e _T-1 to obtain e′ _T and e′ _T-1 to reduce the complexity of the model, and use the transposition calculation of the product of e′ _T and e′ _T-1 Get the cost space matrix C, the position of the target on the cost space matrix C in the current frame is (i, j), and extract the two-dimensional image containing the position information of the target in the current frame in the previous frame image from the cost space matrix C The cost matrix C _i,j takes the maximum value for the horizontal and vertical directions of C _i,j respectively to obtain the feature map of the corresponding direction

步骤523，通过式(4)和(5)定义两个偏移模板

Step 523, define two offset templates by formula (4) and (5)

G_i,j,l＝(l-j)×s1≤l≤W_C (4)G _i,j,l =(lj)×s1≤l≤W _C (4)

M_i,j,k＝(k-i)×s1≤k≤H_C (5)M _i,j,k ＝(ki)× _s1≤k≤HC (5)

其中，s为特征图相对于原图的下采样倍数，W_C、H_C为特征图的宽高尺寸大小，G_i,j,l为T帧图像中的目标(i,j)在T-1帧图像中出现在水平位置l的偏移量，M_i,j,k为T帧目标(i,j)在T-1 帧图像中出现在竖直位置k的偏移量；Among them, s is the downsampling multiple of the feature map relative to the original image, W _C , H _C are the width and height dimensions of the feature map, G _i,j,l is the target (i,j) in the T frame image at T- The offset that appears at horizontal position l in 1 frame image, M _{i, j, k} is the offset that T frame target (i, j) appears in vertical position k in T-1 frame image;

步骤524，将步骤522得到的

与步骤523中定义的偏移模板G和M相乘之后进行通道上的叠加，得到特征图O_T，代表目标在水平和竖直两个方向上的偏移模板；之后将O_T进行2倍上采样恢复为H_F×W_F大小，同时，将O_T特征图的水平与竖直两个通道分别与步骤51 得到的f_T、f_T-1进行通道上的叠加，再经过卷积形成水平和竖直方向上特征图大小不变、通道数为9的2个特征图，将这2个特征图进行通道上的叠加得到输出特征图f′_T。Step 524, the step 522 obtained

After multiplying the offset templates G and M defined in step 523, superimpose on the channel to obtain the feature map O _T , which represents the offset template of the target in both horizontal and vertical directions; then O _T is doubled The upsampling is restored to the size of H _F ×W _F. At the same time, the horizontal and vertical channels of the O _T feature map are respectively superimposed on the channels with f _T and f _T-1 obtained in step 51, and then formed by convolution The size of the feature map in the horizontal and vertical directions is constant, and the number of channels is 9. The two feature maps are superimposed on the channels to obtain the output feature map f′ _T .

与现有技术相比，本发明的有益效果为：Compared with prior art, the beneficial effect of the present invention is:

①在本发明中，适当增加所采用的数据集的输入图片分辨率，保证最终特征图的大小以保留更多的细节信息；①In the present invention, the input picture resolution of the data set used is appropriately increased to ensure the size of the final feature map to retain more detailed information;

②本发明在多目标检测模块中，将含有更多语义信息的深层特征图与含有较多细节信息的浅层特征图通过特征融合模块进行融合，提高模型对小目标的检测能力；②In the multi-target detection module of the present invention, the deep feature map containing more semantic information and the shallow feature map containing more detailed information are fused through the feature fusion module to improve the detection ability of the model for small targets;

③本发明在多目标检测模块中，通过引入Transformer编码模块自注意力机制，捕捉长距离上的依赖关系，发掘特征图中的特征潜在联系，可以稳定识别外观尺度变化较大的目标；③In the multi-target detection module, the present invention introduces the self-attention mechanism of the Transformer coding module to capture long-distance dependencies and explore the potential relationship of features in the feature map, so that it can stably identify targets with large changes in appearance scale;

④提出一种基于代价空间和帧间信息融合的多目标跟踪算法，使用代价空间矩阵预测当前帧目标在上一帧中的位置，可以将两帧之间的目标进行关联，实现跟踪的效果；④Propose a multi-target tracking algorithm based on cost space and inter-frame information fusion, use the cost space matrix to predict the position of the target in the current frame in the previous frame, and associate the targets between the two frames to achieve the tracking effect;

⑤在多目标跟踪模块中，引入注意力传播模块，将多帧目标的特征进行融合，弥补帧间目标运动造成的目标空间错位问题，使得模型在目标被遮挡的情况下依然能够准确的实现跟踪。⑤ In the multi-target tracking module, the attention propagation module is introduced to fuse the features of multi-frame targets to make up for the target space misalignment problem caused by the target movement between frames, so that the model can still accurately track the target when the target is occluded .

附图说明Description of drawings

图1是本发明的多目标检测模块示意图；Fig. 1 is a schematic diagram of a multi-target detection module of the present invention;

图2是多目标检测模块中的特征融合模块示意图；Fig. 2 is a schematic diagram of the feature fusion module in the multi-target detection module;

图3是多目标检测模块中的Transformer编码模块示意图；Fig. 3 is a schematic diagram of the Transformer encoding module in the multi-target detection module;

图4是本发明的多目标跟踪模块示意图；Fig. 4 is a schematic diagram of a multi-target tracking module of the present invention;

图5是多目标跟踪模块中的代价空间模块示意图；Fig. 5 is a schematic diagram of the cost space module in the multi-target tracking module;

图6是本发明的多目标检测模块实验结果示意图；它们分别为模块对小目标及大目标检测得到的目标中心点和目标边界框结果示意图。6 is a schematic diagram of the experimental results of the multi-target detection module of the present invention; they are respectively schematic diagrams of the target center point and target bounding box results obtained by the module for detecting small targets and large targets.

图7是本发明的多目标跟踪模块实验结果示意图。分别为两段测试用例的各四张图片，每段测试用例的四张图片分别为第0帧、5帧、10帧和15帧。Fig. 7 is a schematic diagram of the experimental results of the multi-target tracking module of the present invention. They are four pictures of two test cases respectively, and the four pictures of each test case are frame 0, frame 5, frame 10 and frame 15 respectively.

具体实施方式Detailed ways

下面结合附图和具体实施方式对本发明进行详细说明。The present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments.

本发明的多目标检测与跟踪模型分为两个部分，首先如图1所示为多目标检测模块框架，主要基于改进的DLA34作为骨干网络，通过添加特征融合模块得到深浅层网络融合特征图，引入Transformer编码模块对融合后的特征图进行自注意力编码，解决因目标特征尺度差异过大导致的网络对大目标语义提取能力的限制问题；最终生成目标热图并回归，得到对应目标的边界框实现以交通目标检测。如图4所示为目标跟踪模块框架，通过CenterTrack骨干网络生成特征图，利用代价空间矩阵实现两帧之间的目标关联与跟踪；通过使用注意力传播模块，将前后两帧目标信息进行融合互补，实现目标模糊或被遮挡情况下的准确跟踪。The multi-target detection and tracking model of the present invention is divided into two parts. First, it is a multi-target detection module framework as shown in FIG. Introduce the Transformer encoding module to perform self-attention encoding on the fused feature map, so as to solve the limitation of the network's ability to extract the semantics of large targets caused by the large difference in target feature scale; finally generate a target heat map and return it to obtain the boundary of the corresponding target The frame implements traffic object detection. Figure 4 shows the framework of the target tracking module. The feature map is generated through the CenterTrack backbone network, and the target association and tracking between the two frames are realized by using the cost space matrix; the target information of the two frames before and after is fused and complemented by using the attention propagation module. , to achieve accurate tracking when the target is blurred or occluded.

本发明的复杂城市道路环境下的多目标检测与跟踪方法，具体包括以下步骤：The multi-target detection and tracking method under the complex urban road environment of the present invention specifically comprises the following steps:

步骤1：选取公共数据集进行数据增强，得到数据集，构建训练集和测试集。Step 1: Select a public dataset for data augmentation, obtain a dataset, and construct a training set and a test set.

具体是：选择主流交通目标检测数据集VisDrone中的VisDrone_mot作为本发明的数据集。VisDrone_mot数据集由无人机采集了中国多个城市的街道空中俯视景象，提供了96个视频序列，其中包括56个训练视频序列包含24201帧图像、7个验证视频序列包含2819帧图像、33个测试序列包含12968帧图像，并在每个视频帧中手动标注了识别对象的边界框。对 VisDrone_mot数据集中的输入图片增加分辨率至1024×1024，保证多目标检测模块输出的最终特征图的大小为256×256，并保留更多的细节信息，同时，使用随机翻转、分辨率大小为 0.6至1.3倍间的随机缩放、随机裁剪和颜色抖动相结合的数据增强方式，作为扩展训练样本。Specifically: select VisDrone_mot in the mainstream traffic object detection data set VisDrone as the data set of the present invention. The VisDrone_mot data set collects aerial aerial views of streets in multiple cities in China by drones, and provides 96 video sequences, including 56 training video sequences containing 24201 frame images, 7 verification video sequences containing 2819 frame images, 33 The test sequence contains 12968 frames of images, and the bounding boxes of the recognized objects are manually annotated in each video frame. Increase the resolution of the input pictures in the VisDrone_mot dataset to 1024×1024, ensure that the final feature map output by the multi-target detection module has a size of 256×256, and retain more detailed information. At the same time, use random flipping, and the resolution size is The data enhancement method combining random scaling, random cropping and color dithering between 0.6 and 1.3 times is used as an extended training sample.

步骤2：在现有的DLA34骨干网络基础上逐层增加特征融合模块实现输入图像的深浅层网络特征融合，得到三个特征融合后的二维特征图。如图1所示，具体包括如下子步骤：Step 2: On the basis of the existing DLA34 backbone network, add a feature fusion module layer by layer to realize the deep and shallow network feature fusion of the input image, and obtain a two-dimensional feature map after three feature fusions. As shown in Figure 1, it specifically includes the following sub-steps:

步骤21，将训练集中的图像输入至DLA34网络，经过BatchNorm层和ReLU层对原始图像进行两次卷积核为3×3大小的卷积操作得到两张特征图，将卷积后的两张特征图输入聚合结点进行特征融合，得到分辨率为原输入特征图1/4大小的特征图。其中聚合结点的特征融合如式(1)：Step 21, input the images in the training set to the DLA34 network, perform two convolution operations on the original image with a convolution kernel of 3×3 size through the BatchNorm layer and the ReLU layer to obtain two feature maps, and the two convoluted two The feature map is input to the aggregation node for feature fusion, and a feature map with a resolution of 1/4 the size of the original input feature map is obtained. The feature fusion of the aggregation node is as formula (1):

N(X₁，...，X_n)＝σ(BN(∑w_ix_i+b)，...，BN(∑w_ix_i+b)) (1)N(X ₁ ,...,X _n )=σ(BN(∑w _i x _i +b),...,BN(∑w _i x _i +b)) (1)

其中，N(·)表示聚合结点，σ(·)表示特征聚合，w_ix_i+b表示卷积操作，BN表示BatchNorm 操作，X_i＝1...N对应卷积模块的输出。Among them, N(·) represents aggregation node, σ(·) represents feature aggregation, w _i x _i +b represents convolution operation, BN represents BatchNorm operation, Xi _=1...N corresponds to the output of the convolution module.

步骤22，将步骤21中得到的1/4大小的特征图进行2倍下采样得到新的特征图，将该特征图重复两次步骤21中的卷积操作和聚合操作得到两张特征图，并与步骤21中得到的聚合结点作为共同输入再次进行聚合操作，得到分辨率为原输入特征图1/8大小的特征图。该步骤目的是将网络浅层特征信息传递到网络深层。Step 22, perform 2 times downsampling on the 1/4 size feature map obtained in step 21 to obtain a new feature map, repeat the convolution operation and aggregation operation in step 21 twice on the feature map to obtain two feature maps, And perform the aggregation operation again with the aggregation node obtained in step 21 as a common input, and obtain a feature map with a resolution of 1/8 of the size of the original input feature map. The purpose of this step is to transfer the feature information of the shallow layer of the network to the deep layer of the network.

步骤24，如图2所示，将得到的1/4大小的特征图、1/8大小的特征图、1/16大小的特征图、1/32大小的特征图依次采用特征融合模块进行相邻特征图特征融合，分别得到1/4大小、 1/8大小和1/16大小的新的特征图；Step 24, as shown in Figure 2, the obtained 1/4 size feature map, 1/8 size feature map, 1/16 size feature map, and 1/32 size feature map are sequentially used in the feature fusion module for correlation Neighboring feature map features are fused to obtain new feature maps of 1/4 size, 1/8 size and 1/16 size respectively;

所述的特征融合模块用于实现以下操作：The feature fusion module is used to realize the following operations:

步骤242，使用直接插值上采样加卷积处理的形式代替DLA34骨干网络中的转置卷积，对步骤241中得到映射后的特征图进行2倍上采样，得到特征图F1′，以获得更多目标位置信息并减小模型参数量；Step 242, use the form of direct interpolation upsampling plus convolution to replace the transposed convolution in the DLA34 backbone network, perform 2 times upsampling on the mapped feature map obtained in step 241, and obtain the feature map F1′ to obtain a more Multi-target location information and reduce the amount of model parameters;

步骤243，将步骤242中得到的特征图F1′与特征图F2对应通道值相加，得到合并特征图；Step 243, adding the feature map F1' obtained in step 242 to the corresponding channel value of the feature map F2 to obtain a merged feature map;

步骤244，将步骤243得到的合并特征图经3×3大小的可变形卷积处理后，依次通过 BatchNorm层和ReLU层，得到二维特征图F2′；Step 244, after the merged feature map obtained in step 243 is processed by a deformable convolution with a size of 3 × 3, the two-dimensional feature map F2' is obtained through the BatchNorm layer and the ReLU layer in turn;

当特征图F1、特征图F2分别为1/4大小的特征图、1/8大小的特征图时，得到的二维特征图F2′为1/4大小的特征图；When the feature map F1 and feature map F2 are 1/4 size feature map and 1/8 size feature map respectively, the obtained two-dimensional feature map F2′ is a 1/4 size feature map;

当特征图F1、特征图F2分别为1/8大小的特征图、1/16大小的特征图时，得到的二维特征图F2′为1/8大小的特征图；When the feature map F1 and feature map F2 are respectively 1/8 size feature map and 1/16 size feature map, the obtained two-dimensional feature map F2′ is a 1/8 size feature map;

当特征图F1、特征图F2分别为1/16大小的特征图、1/32大小的特征图时，得到的二维特征图F2′为1/16大小的特征图。When the feature map F1 and feature map F2 are feature maps of 1/16 size and 1/32 size respectively, the obtained two-dimensional feature map F2′ is a feature map of 1/16 size.

步骤3：根据步骤2得到的特征融合后的特征图，采用Transformer编码模块提取特征图中长距离特征依赖关系，得到提取依赖关系后的特征图。如图3所示，具体包括如下子步骤：Step 3: According to the feature map after feature fusion obtained in step 2, the Transformer encoding module is used to extract the long-distance feature dependencies in the feature map, and the feature map after the dependency relationship is extracted. As shown in Figure 3, it specifically includes the following sub-steps:

步骤31，将步骤2最终得到的1/16大小的二维特征图坍缩成一维序列，并卷积形成三份 K(Key)、V(Value)、Q(Query)特征图；Step 31, collapse the 1/16 size two-dimensional feature map finally obtained in step 2 into a one-dimensional sequence, and convolute to form three K (Key), V (Value), Q (Query) feature maps;

步骤32，将位置编码与步骤31得到的特征图K和特征图Q分别进行逐像素相加得到带有位置信息的两个特征图，该两个特征图与特征图V作为共同输入进入多头注意力模块经处理得到新的特征图，以捕捉图像中的长距离依赖。其中位置编码由式(1)(2)得到：Step 32, add the position code and the feature map K and feature map Q obtained in step 31 pixel by pixel to obtain two feature maps with position information, and the two feature maps and feature map V are used as common input into the multi-head attention The force module is processed to obtain new feature maps to capture long-distance dependencies in images. The position code is obtained by formula (1) (2):

PE_(pos，2i)＝sin(pos/10000^2i/d) (1)PE _{(pos, 2i)} = sin(pos/10000 ^2i/d ) (1)

PE_(pos，2i+1)＝cos(pos/10000^2i/d) (2)PE _{(pos, 2i+1)} = cos(pos/10000 ^2i/d ) (2)

步骤33，对步骤32得到的新的特征图再与步骤31得到的V、K、Q特征图进行特征图间对应值相加的融合操作和LayerNorm(LN)操作，以避免信息损失；Step 33, the fusion operation and the LayerNorm (LN) operation of adding corresponding values between the feature maps to the new feature map obtained in step 32 and the V, K, and Q feature maps obtained in step 31 to avoid information loss;

步骤4：根据步骤2、步骤3得到的特征图，通过进一步特征融合及逻辑回归处理，生成热图以及目标边界框。具体包括如下子步骤：Step 4: According to the feature maps obtained in steps 2 and 3, through further feature fusion and logistic regression processing, generate heat maps and target bounding boxes. Specifically include the following sub-steps:

其中，

即步骤45得到预测目标的中心点，

表示中心点与目标中心点的偏移量，

表示目标对应的边框的尺寸。in,

That is, step 45 obtains the center point of the predicted target,

Indicates the offset between the center point and the target center point,

Indicates the size of the bounding box corresponding to the target.

步骤5：根据步骤2的输入图像以及步骤4得到的热图，利用多目标跟踪模块进行目标关联处理与跟踪，得到带有目标检测框的跟踪特征图。如图4所示，具体包括如下子步骤：Step 5: According to the input image in step 2 and the heat map obtained in step 4, use the multi-target tracking module to perform target association processing and tracking, and obtain a tracking feature map with a target detection frame. As shown in Figure 4, it specifically includes the following sub-steps:

步骤52，将特征图f_T和f_T-1分别送入如图5所示的代价空间模块进行目标关联处理，得到输出特征图f′_T。具体包括如下操作：In step 52, the feature maps f _T and f _T-1 are respectively sent to the cost space module shown in Fig. 5 for target association processing, and the output feature map f' _T is obtained. Specifically include the following operations:

步骤522，对特征图e_T和e_T-1进行最大池化操作得到e′_T和e′_T-1，以降低模型复杂度，使用e′_T和e′_T-1乘积的转置计算得到代价空间矩阵C，以保存两帧特征图之间对应点的相似度，代价空间矩阵C上的目标在当前帧的位置为(i，j)，从代价空间矩阵C中提取含有当前帧中的目标在前一帧图像中的位置信息的二维代价矩阵C_i，j，对C_i，j的水平方向和竖直方向分别取最大值得到对应方向的特征图

Step 522, perform the maximum pooling operation on the feature maps e _T and e _T-1 to obtain e′ _T and e′ _T-1 to reduce the complexity of the model, and use the transposition calculation of the product of e′ _T and e′ _T-1 Get the cost space matrix C to save the similarity of corresponding points between the feature maps of the two frames. The position of the target on the cost space matrix C in the current frame is (i, j), and extract from the cost space matrix C that contains The two-dimensional cost matrix C _{i, j} of the position information of the target in the previous frame image, take the maximum value for the horizontal direction and vertical direction of C _{i, j} respectively to obtain the feature map of the corresponding direction

步骤523，通过式(4)和(5)定义两个偏移模板

Step 523, define two offset templates by formula (4) and (5)

G_i，j，l＝(l-j)×s1≤l≤W_C (4)G _i,j,l =(lj)×s1≤l≤W _C (4)

M_i，j，k＝(k-i)×s1≤k≤H_C (5)M _i,j,k =(ki)× _s1≤k≤HC (5)

其中，s为特征图相对于原图的下采样倍数，W_C、H_C为特征图的宽高尺寸大小，G_i，j，l为T帧图像中的目标(i，j)在T-1帧图像中出现在水平位置l的偏移量，M_i，j，k为T帧目标(i，j)在T-1 帧图像中出现在竖直位置k的偏移量。Among them, s is the downsampling multiple of the feature map relative to the original image, W _C , H _C are the width and height dimensions of the feature map, G _{i, j, l} is the target (i, j) in the T frame image at T- The offset that appears at horizontal position l in 1 frame of image, M _{i, j, k} is the offset of T-frame target (i, j) that appears at vertical position k in T-1 frame of image.

步骤524，将步骤522得到的

与步骤523中定义的偏移模板G和M相乘之后进行通道上的叠加，得到特征图O_T，代表目标在水平和竖直两个方向上的偏移模板；之后将O_T进行2倍上采样恢复为H_F×W_F大小，同时，将O_T特征图的水平与竖直两个通道分别与步骤51得到的f_T、f_T-1进行通道上的叠加，再经过卷积形成水平和竖直方向上特征图大小不变、通道数为9的2个特征图，将这2个特征图进行通道上的叠加得到输出特征图f′_T。Step 524, the step 522 obtained

将

与步骤52得到的特征图f′_T一起进行可变形卷积生成特征图

Will

步骤54，将

依次使用3个1×1卷积操作、下采样操作，生成第T-1帧特征图(q_t-1、 k_t-1、v_t-1)；将步骤51中得到的特征图f_T使用3个1×1卷积进行操作，生成第T帧特征图(Q_t、K_t、V_t)；Step 54, will

Use three 1×1 convolution operations and downsampling operations sequentially to generate the feature map (q _t-1 , k _t-1 , v _t-1 ) of frame T-1; the feature map f _T obtained in step 51 Use three 1×1 convolutions to generate the T-th frame feature map (Q _t , K _t , V _t );

步骤55，将步骤54中的得到的第T帧特征图与第T-1帧特征图共同输入注意力传播模块进行特征传播得到带有目标检测框的跟踪特征图V′_T。其中，注意力传播模块计算过程如式 (6)所示：Step 55, input the T-th frame feature map obtained in step 54 together with the T-1-th frame feature map into the attention propagation module for feature propagation to obtain a tracking feature map V′ _T with a target detection frame. Among them, the calculation process of the attention propagation module is shown in formula (6):

其中，

为1×1卷积，d_k为特征图Q和K的维度，Q_t、k_t-1、v_t-1、V_t为步骤54中得到的特征图。in,

is a 1×1 convolution, d _k is the dimension of the feature maps Q and K, and Q _t , k _t-1 , v _t-1 , V _t are the feature maps obtained in step 54.

步骤6，采用步骤1的训练集对由步骤2、3、4、5组成的多目标检测与跟踪模型进行训练，并采用测试集进行测试，最终得到训练好的多目标检测与跟踪模型。Step 6: Use the training set in step 1 to train the multi-target detection and tracking model composed of steps 2, 3, 4, and 5, and use the test set to test, and finally obtain the trained multi-target detection and tracking model.

为验证本发明的可行性和有效性，本发明进行了如下实验：For verifying feasibility and effectiveness of the present invention, the present invention has carried out following experiment:

首先，针对多目标检测模块(即步骤2至步骤4)，使用平均精确率和召回率对模型进行评价。平均精确率由精确率求得，精确率P和召回率R的公式具体如式(7)、(8)所示。First, for the multi-object detection module (ie step 2 to step 4), the model is evaluated using the average precision and recall. The average precision rate is obtained from the precision rate, and the formulas of the precision rate P and the recall rate R are specifically shown in formulas (7) and (8).

其中，P为应该被检索的目标(TP)占所有被检索到的目标(TP+FP)的百分比。R为应该被检索的目标(TP)占应该被检索到的所有目标(TP+FN)的百分比。Among them, P is the percentage of the target that should be retrieved (TP) to all retrieved targets (TP+FP). R is the percentage of the target (TP) that should be retrieved to all the targets (TP+FN) that should be retrieved.

在检测任务中，精确率体现了模型查准的能力，而召回率体现了模型查全的能力。两个指标相互制约，通过在不同置信度阈值下的平均精确率(AP)寻找查准率与查全率间的相对平衡，做出以精确率和召回率作为横纵坐标的二维PR曲线。平均精确率(AP)则为PR曲线包围的面积，等于对精确率进行平均操作。In the detection task, the precision rate reflects the ability of the model to be accurate, and the recall rate reflects the ability of the model to be recalled. The two indicators restrict each other, find the relative balance between the precision rate and the recall rate through the average precision rate (AP) under different confidence thresholds, and make a two-dimensional PR curve with the precision rate and recall rate as the horizontal and vertical coordinates . The average precision rate (AP) is the area surrounded by the PR curve, which is equal to the average operation of the precision rate.

本发明首先对多目标检测模块进行定量分析，通过与基线模型在VisDrone_mot数据集上进行对比，并在实验中增加了本发明方法与各种基线方法具体到每一个类别的性能对比，从结果可得，相比于其他模型本发明所提出的方法能够保证针对较小目标正确识别的情况下兼顾对一些大目标的识别性能。相比于性能优越的常用模型，本发明的方法针对大目标的识别性能最佳，精度达到了42.16和33.10，具有良好的检测能力。The present invention first quantitatively analyzes the multi-target detection module, compares it with the baseline model on the VisDrone_mot data set, and increases the performance comparison between the method of the present invention and various baseline methods specific to each category in the experiment, from the results. Therefore, compared with other models, the method proposed by the present invention can ensure the recognition performance of some large targets while ensuring the correct recognition of smaller targets. Compared with the commonly used models with superior performance, the method of the present invention has the best recognition performance for large targets, with the accuracy reaching 42.16 and 33.10, and has good detection ability.

同时，为了直观地反应整体多目标检测模块性能，对模块进行定性分析，结果如图6所示，能够看出本发明模型对不同尺度的目标有很好的检测性能，添加了Transformer模块之后，模型对远距离依赖关系的捕捉更加稳定，在对小目标拥有良好识别能力的同时对大目标的识别效果依然比较鲁棒。At the same time, in order to intuitively reflect the performance of the overall multi-target detection module, a qualitative analysis was performed on the module. The results are shown in Figure 6. It can be seen that the model of the present invention has good detection performance for targets of different scales. After adding the Transformer module, The model captures the long-distance dependencies more stably, and has a good recognition ability for small targets while the recognition effect for large targets is still relatively robust.

其次，针对多目标跟踪模块(即步骤5)，使用MOTA(↑)、MOTP(↑)、IDF1(↑)、MT(↑)、ML(↓)、FP(↓)、FN(↓)、Frag(↓)和IDSW(↓)等指标进行评价。↑表示该指标数值越大模型性能越好，↓表示该指标数值越小模型性能越好。Secondly, for the multi-target tracking module (ie step 5), use MOTA(↑), MOTP(↑), IDF1(↑), MT(↑), ML(↓), FP(↓), FN(↓), Frag (↓) and IDSW (↓) and other indicators for evaluation. ↑ indicates that the larger the index value is, the better the model performance is, and ↓ indicates that the smaller the index value is, the better the model performance is.

其中，MOTA代表多目标跟踪准确度，衡量算法连续跟踪目标的能力，用于统计在跟踪中的误差累积情况，其式如(9)所示。Among them, MOTA stands for multi-target tracking accuracy, which measures the ability of the algorithm to continuously track targets, and is used to count the accumulation of errors in tracking, and its formula is shown in (9).

其中，m_t对应FP，代表预测结果中的假阳性(误检数)，即在第t帧中预测位置没有对应的跟踪目标与其匹配。fp_t对应FN，代表假阴性(漏检数)，即在第t帧中目标没有对应的预测位置与其匹配。mme_t对应IDSW，代表误配数，即在第t帧中跟踪目标发生ID切换的次数，g_t指帧中的真正目标数总和，MOTA综合考虑了目标轨迹中的误检、漏检和ID交换。Among them, m _t corresponds to FP, which represents the false positives (number of false detections) in the prediction results, that is, there is no corresponding tracking target matching the predicted position in the t-th frame. fp _t corresponds to FN, which represents false negatives (number of missed detections), that is, the target does not have a corresponding predicted position to match it in the tth frame. mme _t corresponds to IDSW, which represents the number of mismatches, that is, the number of ID switching times of the tracking target in the tth frame, g _t refers to the total number of real targets in the frame, MOTA comprehensively considers the false detection, missed detection and ID in the target trajectory exchange.

MOTP表示也直接反应了模型跟踪的效果，反映了跟踪结果与标签轨迹距离上的差距，公式表示为(10)所示。The MOTP expression also directly reflects the effect of model tracking, reflecting the distance between the tracking result and the label trajectory, and the formula is expressed as (10).

其中，c_t表示第t帧的匹配个数，对每对匹配计算轨迹误差

再进行求和得到最终的数值，该指标越大表示模型性能越好，轨迹误差越小。Among them, c _t represents the number of matches in the tth frame, and the trajectory error is calculated for each pair of matches

Then sum to get the final value. The larger the index, the better the model performance and the smaller the trajectory error.

MT为多数跟踪数(Mostly tracked)，指命中的轨迹大于80％标签轨迹的轨迹数，该数值越大越好。ML为多数丢失数(Mostly lost)，指丢失的轨迹大于80％标签轨迹的轨迹数，该数值越小越好。Frag为跳变数，指跟踪轨迹从“跟踪”状态到“不跟踪”状态的变化数。MT is the number of most tracked (Mostly tracked), which refers to the number of tracks whose hit track is greater than 80% of the tag track, and the larger the value, the better. ML is the most lost number (Mostly lost), which refers to the number of lost tracks that are greater than 80% of the label tracks. The smaller the value, the better. Frag is the number of jumps, which refers to the number of changes in the tracking track from the "tracking" state to the "not tracking" state.

对于一个多目标跟踪检测器，ID相关的指标同样重要，具体有以下三个重要的指标： IDP、IDR、IDF1。IDP表示识别精确度(Identification Precision)，指每个目标框的ID识别准确率，其公式为(11)所示。For a multi-target tracking detector, ID-related indicators are equally important, specifically the following three important indicators: IDP, IDR, IDF1. IDP stands for Identification Precision, which refers to the ID recognition accuracy rate of each target frame, and its formula is shown in (11).

其中IDTP和IDFP分别是ID预测的真阳例数和假阳例数。IDR表示识别召回率(Identification Recall)，指每个目标框的ID识别召回率，其公式为(12)所示。where IDTP and IDFP are the number of true positive cases and false positive cases predicted by ID, respectively. IDR stands for Identification Recall, which refers to the ID recognition recall rate of each target box, and its formula is shown in (12).

其中IDFN为ID预测的假阴例。IDF1表示ID预测的F值(Identification F-Score)，指每个目标框的ID识别F值，该指标值越大越好，其计算公式为(13)所示。where IDFN is the false negative of ID prediction. IDF1 represents the ID prediction F-score (Identification F-Score), which refers to the ID identification F-score of each target frame. The larger the index value, the better, and its calculation formula is shown in (13).

IDF1是用来评价跟踪器好坏的第一默认指标，上述三个指标可以使用任意的两个推断另外一个。IDF1 is the first default indicator used to evaluate the quality of the tracker. Any two of the above three indicators can be used to infer the other.

首先，将多目标跟踪模块与近些年的主流基线模型的定量实验对比，在VisDrone_mot数据集上，相比于第二好的模型，本发明提出的跟踪方法在MOTA和MOTP指标上分别高出了 3.2和1.8，并且在其他指标上均取得了比较好的结果，而由于本发明模型的误检率较少，导致了ML、MT指标的在正常范围的扰动。相比于TBD模型，JDT模型由于检测和跟踪任务相互促进，在训练过程中能够进行端对端的优化，并且能够在跟踪任务上取得更好的效果。First, comparing the quantitative experiments of the multi-target tracking module with the mainstream baseline model in recent years, on the VisDrone_mot data set, compared with the second best model, the tracking method proposed by the present invention is higher than the MOTA and MOTP indicators respectively. 3.2 and 1.8, and achieved relatively good results on other indicators, but because the false detection rate of the model of the present invention is less, it leads to disturbances in the normal range of ML and MT indicators. Compared with the TBD model, the JDT model can perform end-to-end optimization during the training process because the detection and tracking tasks promote each other, and can achieve better results on the tracking task.

其次，在上述数据集中对模型进行定性分析，如图7所示，其中展示了两段测试样例，每一段选取了其中的四张图片进行展示，分别是时间维度上的第0帧、5帧、10帧和15帧图像。从图中可以看出，模型能够对交通场景下的多目标进行稳定的跟踪，尤其对交通场景下的小目标拥有优秀的检测跟踪能力。Secondly, qualitatively analyze the model in the above data set, as shown in Figure 7, which shows two sections of test samples, and each section selects four of the pictures for display, which are frame 0 and frame 5 in the time dimension. frame, 10-frame and 15-frame images. It can be seen from the figure that the model can stably track multiple targets in traffic scenes, especially has excellent detection and tracking capabilities for small targets in traffic scenes.

Claims

1. A multi-target detection and tracking method under a complex urban road environment is characterized by specifically comprising the following steps:

step 1: selecting a public data set for data enhancement to obtain a data set, and constructing a training set and a test set;

step 2: adding a feature fusion module layer by layer on the basis of the existing DLA34 backbone network to realize the deep and shallow network feature fusion of an input image and obtain a two-dimensional feature map after three feature fusion;

and 3, step 3: extracting the long-distance feature dependency relationship in the feature graph by adopting a Transformer coding module according to the two-dimensional feature graph after feature fusion to obtain the feature graph after the dependency relationship is extracted;

and 4, step 4: generating a heat map and a target boundary frame through further feature fusion and logistic regression processing;

and 5: performing target association processing and tracking by using a multi-target tracking module to obtain a tracking characteristic diagram with a target detection frame;

step 6, training the multi-target detection and tracking model formed by the steps 2, 3, 4 and 5 by adopting the training set of the step 1, and testing by adopting the test set to finally obtain the trained multi-target detection and tracking model;

and 7, inputting the video data to be detected into the trained multi-target detection and tracking model to obtain a tracking characteristic diagram with a target detection frame.

2. The multi-target detection and tracking method in a complex urban road environment according to claim 1, wherein in step 1, visDrone _ mot in the main-flow traffic target detection data set VisDrone is selected as the data set of the present invention.

3. The multi-target detection and tracking method in a complex urban road environment according to claim 1, wherein said step 2 comprises the following substeps:

step 21, inputting the images in the training set into a DLA34 network, performing convolution operation with convolution kernel of 3 × 3 twice on the original image through a BatchNorm layer and a ReLU layer to obtain two feature maps, inputting the two feature maps after convolution into an aggregation node for feature fusion, and obtaining a feature map with resolution of 1/4 of the original input feature map;

step 22, carrying out 2-time down-sampling on the feature map with the size of 1/4 obtained in the step 21 to obtain a new feature map, repeating the convolution operation and the aggregation operation in the step 21 on the feature map twice to obtain two feature maps, and carrying out the aggregation operation again by taking the aggregation node obtained in the step 21 as a common input to obtain the feature map with the resolution of 1/8 of the original input feature map;

step 23, obtaining a feature map with the size of 1/16 from the feature map with the size of 1/8 according to the same manner of obtaining the feature map with the size of 1/8 from the feature map with the size of 1/4 in the step 22, and obtaining a feature map with the size of 1/32 from the feature map with the size of 1/16;

and 24, as shown in fig. 2, sequentially adopting a feature fusion module to perform adjacent feature fusion on the obtained feature map with the size of 1/4, the feature map with the size of 1/8, the feature map with the size of 1/16 and the feature map with the size of 1/32 to respectively obtain new feature maps with the size of 1/4, the size of 1/8 and the size of 1/16.

4. The multi-target detection and tracking method in a complex urban road environment according to claim 3, wherein in step 24, the feature fusion module is configured to implement the following operations:

step 241, performing deformable convolution processing with a convolution kernel of 3 × 3 on the feature map F1, and obtaining a mapped feature map by passing a result obtained by the processing through a BatchNorm layer and a ReLU layer;

step 242, replacing the transposed convolution in the DLA34 backbone network with a direct interpolation upsampling and convolution processing mode, and performing 2-fold upsampling on the feature map obtained after mapping in step 241 to obtain a feature map F1';

step 243, adding the channel values corresponding to the characteristic diagram F1' and the characteristic diagram F2 obtained in step 242 to obtain a combined characteristic diagram;

step 244, performing 3 × 3 deformable convolution processing on the merged feature map obtained in step 243, and then sequentially passing through the BatchNorm layer and the ReLU layer to obtain a two-dimensional feature map F2';

when the characteristic diagram F1 and the characteristic diagram F2 are respectively a characteristic diagram with the size of 1/4 and a characteristic diagram with the size of 1/8, the obtained two-dimensional characteristic diagram F2' is a characteristic diagram with the size of 1/4;

when the characteristic diagram F1 and the characteristic diagram F2 are respectively a characteristic diagram with the size of 1/8 and a characteristic diagram with the size of 1/16, the obtained two-dimensional characteristic diagram F2' is a characteristic diagram with the size of 1/8;

when the feature maps F1 and F2 are feature maps of 1/16 size and 1/32 size, respectively, the two-dimensional feature map F2' obtained is a feature map of 1/16 size.

5. The multi-target detection and tracking method in a complex urban road environment according to claim 1, wherein said step 3 comprises the following substeps:

step 31, collapsing the two-dimensional characteristic diagram with the size of 1/16 obtained finally in the step 2 into a one-dimensional sequence, and convoluting to form a K, V and Q characteristic diagram;

step 32, respectively adding the position codes and the feature maps K and Q obtained in the step 31 pixel by pixel to obtain two feature maps with position information, inputting the two feature maps and the feature map V into a multi-head attention module as common input, and processing to obtain a new feature map;

step 33, performing fusion operation and LayerNorm operation of adding corresponding values among feature maps on the new feature map obtained in step 32 and the V, K and Q feature maps obtained in step 31;

and step 34, processing the result obtained in the step 33 in a feed-forward neural network, and connecting and outputting through residual errors to obtain a new characteristic diagram.

6. The multi-target detection and tracking method in a complex urban road environment according to claim 5, wherein the position code in step 32 is obtained by the following formula:

PE _(pos，2i) ＝sin(pos/10000 ^2i/d )

PE _(pos，2i+1) ＝cos(pos/10000 ^2i/d )

wherein, PE _(·) The matrix for position coding has the same resolution size as the input feature map, pos represents the position of the vector in the sequence, i is the index of the channel, and d represents the number of channels of the input feature map.

7. The multi-target detection and tracking method in the complex urban road environment according to claim 1, wherein the step 4 specifically comprises the following substeps:

and step 41, performing 2 times of upsampling on the feature map finally obtained in the step 3 to obtain a new feature map.

Step 42, performing feature fusion on the feature map with the size of 1/4 and the size of 1/8 obtained in the step 24 by using the same feature fusion module as that in the step 24 to obtain a new feature map with the size of 1/4;

step 43, performing feature fusion on the feature maps with the size of 1/8 and the size of 1/16 obtained in the step 24 by using a feature fusion module, and performing pixel-by-pixel addition on the feature maps obtained in the step 41 to obtain a new feature map with the size of 1/8;

step 44, performing feature fusion on the feature map with the size of 1/4 obtained in the step 42 and the feature map with the size of 1/8 obtained in the step 43 by using a feature fusion module to generate a heat map with the resolution being the size of 1/4 of the original image;

step 45, performing logistic regression on the heat map obtained in the step 44 and the heat map labels containing the target center points in the data set obtained in the step 1 to obtain the center points of the predicted targets

Step 46, obtaining coordinates of a left upper point and a right lower point of a frame corresponding to each target through the formula (3), and generating a target boundary frame:

wherein,

i.e. step 45 obtains the center point of the predicted target,

representing the offset of the center point from the target center point,

indicating the size of the corresponding bounding box of the object.

8. The multi-target detection and tracking method in the complex urban road environment according to claim 1, wherein the step 5 specifically comprises the following substeps:

step 51, using the same image input in step 2 as the T-1 frame image, selecting the next frame image, namely the T-frame image, using the T-frame and the T-1 frame image as input, and respectively generating the feature map f through the CenterTrack backbone network processing _T And f _T-1 ；

Step 52, the feature map f _T And f _T-1 Respectively sending the data to a cost space module shown in FIG. 5 for target correlation processing to obtain an output feature map f' _T ；

Step 53, comparing the heat map obtained in step 4 with the feature map f obtained in step 51 _T-1 Performing a hadamard product to generate a feature map

Will be provided with

And the feature map f 'obtained in step 52' _T Together withPerforming a deformable convolution to generate a feature map

Step 54, will

Generating a T-1 frame feature map by sequentially using 31 × 1 convolution operations and a downsampling operation; the characteristic diagram f obtained in the step 51 is processed _T Performing operation by using 31 × 1 convolutions to generate a Tth frame feature map;

step 55, inputting the T-th frame feature map obtained in the step 54 and the T-1 th frame feature map into the attention propagation module together for feature propagation to obtain a tracking feature map V 'with a target detection frame' _T 。

9. The multi-target detection and tracking method in a complex urban road environment according to claim 8, wherein said step 52 specifically comprises the following operations:

step 521, the feature map f is processed _T And f _T-1 Three layers of convolution structure generation characteristic graphs e shared by weights respectively sent to cost space modules _T And e _T-1 I.e. the appearance coding vector of the target;

step 522, for the feature map e _T And e _T-1 Performing a max pooling operation to obtain e' _T And e' _T-1 To reduce model complexity, e 'is used' _T And e' _T-1 The transposition calculation of the product obtains a cost space matrix C, the position of the target on the cost space matrix C in the current frame is (i, j), and a two-dimensional cost matrix C containing the position information of the target in the current frame in the previous frame image is extracted from the cost space matrix C _i,j To C, to _i,j Respectively taking the maximum value in the horizontal direction and the vertical direction to obtain a characteristic diagram in the corresponding direction

Step 523, define two biases by equations (4) and (5)Movable template

G _i,j,l ＝(l-j)×s1≤l≤W _C (4)

M _i,j,k ＝(k-i)×s1≤k≤H _C (5)

Wherein s is the downsampling multiple of the feature map relative to the original image, W _C 、H _C Is the size of the width and height dimensions of the feature map, G _i,j,l M is an offset of an object (i, j) in the T frame image appearing in the horizontal position l in the T-1 frame image _i,j,k An offset for a T frame object (i, j) appearing at vertical position k in the T-1 frame image;

step 524, the result of step 522 is processed

Multiplying the offset templates G and M defined in step 523, and then superimposing the offset templates G and M on the channel to obtain a feature map O _T Representing offset templates of the target in both horizontal and vertical directions; then O is introduced _T 2 times up sampling recovery to H _F ×W _F Size, simultaneously, adding O _T The horizontal and vertical channels of the characteristic map are respectively compared with f obtained in step 51 _T 、f _T-1 Superposing on channels, forming 2 feature maps with unchanged feature map sizes in the horizontal direction and the vertical direction and 9 channels by convolution, and superposing the 2 feature maps on the channels to obtain an output feature map f' _T 。