CN114387265A - Anchor-frame-free detection and tracking unified method based on attention module addition - Google Patents

Anchor-frame-free detection and tracking unified method based on attention module addition

Info

Publication number
CN114387265A
Authority
CN
China
Prior art keywords
tracking
feature extraction
network model
pedestrian
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210057161.9A
Other languages
Chinese (zh)
Inventor
张红颖
贺鹏艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation University of China filed Critical Civil Aviation University of China
Priority to CN202210057161.9A priority Critical patent/CN114387265A/en
Publication of CN114387265A publication Critical patent/CN114387265A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras

Abstract

An anchor-frame-free detection and tracking unified method based on added attention modules. It comprises obtaining preprocessed images; obtaining an initial feature extraction network model; obtaining a trained feature extraction network model; and continuously detecting and tracking pedestrian targets with the trained feature extraction network model. The invention has the following effects: a multi-task learning strategy greatly reduces the training time of the network; the trained network model has high accuracy and robustness; multi-scale information interaction is fully utilized, so that pedestrian target features with stronger expressive power are extracted and fused in depth and pedestrian targets are tracked accurately even in scenes where pedestrians occlude one another; second-generation residual blocks form the backbone of the network model and are combined with a more efficient attention module for information interaction, so the method achieves higher detection precision and stronger re-identification performance and is suitable for detecting and tracking pedestrian targets in scenes where passengers occlude one another severely, such as a terminal building.

Description

Anchor-frame-free detection and tracking unified method based on attention module addition
Technical Field
The invention belongs to the technical field of civil aviation, and particularly relates to an anchor-frame-free detection and tracking unified method based on added attention modules.
Background
With the wide application of intelligent video monitoring in public areas such as transportation hubs and commercial districts, and its strong performance in security, passenger-flow monitoring and similar tasks, the computer vision technology it relies on is also developing rapidly. Pedestrian tracking, a hot topic in computer vision, obtains the identity, position and motion trajectory of pedestrian targets by analyzing the captured video data, and compared with other localization methods it is more proactive, real-time and practical. In places such as airport terminal buildings it can therefore guide zone planning, provide personalized services for passengers, remind passengers of boarding information, and help maintain order and safety, which gives it clear application value.
With the wide application of deep learning in the field of computer vision, multi-target tracking algorithms based on deep learning have gradually come to dominate pedestrian tracking. Mainstream pedestrian tracking algorithms such as FairMOT and JDE share the same main idea: the multi-target tracking problem is divided into a detection part and a tracking part, the positions of pedestrians are obtained by a detection network, and tracking is then performed by matching consecutive frames with a data-association technique, so the performance of the target detection network largely determines the tracking performance.
Most existing pedestrian detection and tracking methods still follow a detect-then-track strategy, and the detection performance of the detection network on multiple targets is limited. As a result, problems such as low detection precision, target loss in occluded scenes, complex network parameters and poor real-time performance remain and urgently need to be solved.
Disclosure of Invention
In order to solve the above problems, the present invention aims to provide an anchor-frame-free detection and tracking unified method based on added attention modules.
In order to achieve the above purpose, the anchor-frame-free detection and tracking unified method based on added attention modules provided by the invention comprises the following steps, performed in sequence:
1) acquiring images of a passenger flow dense area in a terminal building and preprocessing the images to acquire preprocessed images, wherein each preprocessed image is provided with a label, and the label comprises position information of all pedestrian targets in a current frame image;
2) constructing an original feature extraction network model, inputting the preprocessed image into the original feature extraction network model for feature extraction to obtain an initial feature extraction network model;
3) respectively setting corresponding loss functions aiming at the target center point positioning, the boundary size, the offset error and the re-identification task of the detection task; then, training parameters of the initial feature extraction network model by using a large amount of existing data to obtain a trained feature extraction network model;
4) and continuously detecting and tracking the pedestrian target by using the trained feature extraction network model.
In step 1), the images of the passenger-flow-dense area in the terminal building are acquired and preprocessed as follows: a monitoring camera located in a passenger-flow-dense area of the airport terminal captures images of passengers walking and occluding one another at fixed time intervals during periods of heavy passenger flow, and the images are preprocessed by deblurring, noise reduction and resolution enhancement to obtain the preprocessed images.
In step 2), the method for constructing the original feature extraction network model and then inputting the preprocessed image into the original feature extraction network model for feature extraction to obtain the initial feature extraction network model comprises the following steps:
the original feature extraction network model is divided into five parts: stem and stage1 to stage4, where stem is the backbone (stem) network and stage1 to stage4 are the four subsequent stages;
firstly, the stem reduces the height and width of the preprocessed image to one quarter of the original size through two 3×3 convolution layers with stride 2, then performs feature extraction with 4 second-generation residual blocks (Bottle2neck) and feeds the output feature map into stage1; stage1 to stage3 perform feature extraction and fusion operations, namely each stage generates a new lower-resolution branch on the basis of the previous stage, then applies 4 basic residual blocks with two attention modules added (2eca-basic blocks) to every low-resolution branch for feature extraction, and finally performs repeated multi-scale fusion on the obtained feature maps before passing them to stage4; stage4 is the head network, in which the feature maps output by the three parallel lower-resolution branches are first upsampled to the size of the high-resolution branch by bilinear interpolation, and the final output feature map used for detection and re-identification is then obtained through a concatenation operation and a fully connected layer, yielding the initial feature extraction network model.
In the step 3), corresponding loss functions are respectively set for the target center point positioning, the boundary size, the offset error and the re-identification task of the detection task; then, using a large amount of existing data to train the parameters of the initial feature extraction network model, the method for obtaining the trained feature extraction network model comprises the following steps:
the loss function for target center point positioning uses a modified focal loss to compute the loss between the predicted heat map and the ground-truth heat map, which effectively handles the imbalance between the target center point and its surrounding points; the formula is shown in formula (1):

L_heat = -(1/N) · Σ_xy { (1 - M̂_xy)^α · log(M̂_xy) if M_xy = 1; (1 - M_xy)^β · (M̂_xy)^α · log(1 - M̂_xy) otherwise }      (1)

in formula (1), M̂_xy is the predicted heat map response value, M_xy is the ground-truth heat map response value, and α and β are the focusing parameters of the focal loss; let the two corner point coordinates of a pedestrian target region be (x1, y1) and (x2, y2), then the center point coordinates of the pedestrian target after size reduction are (c_i^x, c_i^y) = ((x1 + x2)/8, (y1 + y2)/8), and the ground-truth heat map response of a point (x, y) with respect to the center point coordinates is shown in formula (2):

M_xy = Σ_{i=1..N} exp(-((x - c_i^x)² + (y - c_i^y)²) / (2σ_c²))      (2)

where N represents the number of pedestrian targets in the image, i indexes the i-th pedestrian target, and σ_c represents the standard deviation;
boundary size and offset errors use the L1 loss as their loss function; with the corner coordinates given for each pedestrian target, the loss function is shown in formula (3):

L_box = Σ_{i=1..N} (|s_i - ŝ_i| + |o_i - ô_i|)      (3)

where s_i represents the true size of the pedestrian target, o_i represents the true offset of the pedestrian target, ŝ_i and ô_i are the predicted values of the size and the offset respectively, and L_box represents the localization loss obtained by adding the losses of the two branches;
the re-identification task is in essence a classification task, so the softmax loss is selected as its loss function: an identity feature vector is extracted at the center point of each pedestrian target on the obtained heat map for learning and is mapped into a class distribution vector p(k); the one-hot code of each pedestrian target is denoted L_i(k) and the number of identity classes is denoted K, so the loss function of the re-identification task is shown in formula (4):

L_id = -Σ_{i=1..N} Σ_{k=1..K} L_i(k) · log(p(k))      (4)
after all the loss functions are set, the training set images of the CUHK-SYSU, PRW and MOT16 data sets are selected as the training set, the training set images of the 2DMOT15 data set are selected as the validation set, and the parameters of the initial feature extraction network model are trained; the number of training iterations is set to 36 epochs, the learning rate of the first 31 epochs is set to 1e-4, the learning rate of the following 4 epochs is set to 1e-5, and the last epoch uses a learning rate of 1e-6 so that training converges; the input image size during training is (1088, 608), the batch size is set to 6, an Adam optimizer is used for model optimization, ReLU is used as the activation function, the regularization coefficient is set to 0.001, and the trained feature extraction network model is finally obtained after training is completed.
In step 4), the specific steps of continuously detecting and tracking the pedestrian target by using the trained feature extraction network model are as follows:
4.1. firstly, taking a first frame image as an input image, initializing a distance matrix according to label information of the input image and packaging to obtain appearance information and motion information of a pedestrian target for subsequent data matching;
4.2. each pedestrian target is used as a category, each category is instantiated through a boundary frame to be used as a tracking object, and the position information of the pedestrian target in the next frame of image is predicted by using a Kalman filtering method according to the current frame detection result;
4.3. matching the predicted position information with appearance information and motion information by using Mahalanobis distance measurement to judge whether the pedestrian target tracking state is an initial default state, a confirmed state or a deleted state; the initial default state refers to a state of detecting a newly generated motion track of a certain pedestrian target for the first time, and is marked as the state because whether a detection result is correct or not cannot be confirmed; if the matching is successful in the next three continuous frames of images, changing the tracking state of the pedestrian target from an initial default state to a confirmed state, and determining that the motion track is the tracking track of the specific pedestrian target; if the matching is not successful in the next three frames of images, the detection is regarded as false detection, the motion track is determined to be a false tracking track, the initial default state is changed into a deleting state, and the motion track is deleted;
4.4. if the pedestrian target tracking state is the initial default state or the confirmed state, cascade matching is carried out, followed by overlap (IOU) matching between the prediction frame and the real frame, which can produce three results: successful matches, unmatched tracks and unmatched detections; if the matching is successful, the predicted value and the detected observation are updated by the Kalman filtering method, the appearance feature of the pedestrian target is updated, the tracking track is updated, and the above steps are repeated; if the result is an unmatched track, the tracking track is considered interrupted and is deleted; if the result is an unmatched detection, it may be a new pedestrian target, so it is initialized as a new tracking track and a new tracker is allocated;
4.5. and after the input image is updated to be the next frame image, repeating the steps 4.1, 4.2, 4.3 and 4.4, and finally obtaining the tracking result of the pedestrian target in each frame image after the tracking is finished, so that the continuous pedestrian tracking track is determined, and finally, a visualization result is output.
The anchor-frame-free detection and tracking unified method based on added attention modules provided by the invention has the following beneficial effects:
(1) by adopting a multi-task learning strategy, the training time of the network is greatly reduced;
(2) the trained network model has higher accuracy and robustness;
(3) multi-scale information interaction is fully utilized, so that pedestrian target features with stronger expressive power are extracted and fused in depth and pedestrian targets are tracked accurately even in scenes where pedestrians occlude one another;
(4) second-generation residual blocks form the backbone of the network model and are combined with a more efficient attention module for information interaction, so the method achieves higher detection precision and stronger re-identification performance and is suitable for detecting and tracking pedestrian targets in scenes where passengers occlude one another severely, such as a terminal building.
Drawings
FIG. 1 is a flow chart of a unified method for detecting and tracking without an anchor frame based on an attention module according to the present invention.
Fig. 2 is a schematic diagram of the structure of the basic residual block with two attention modules added (2eca-basic block).
Fig. 3 is a diagram comparing the structure of the second generation residual block and the first generation residual block.
Fig. 4 is a schematic structural diagram of a feature extraction network model constructed in the method.
FIG. 5 is a flow chart of a pedestrian target tracking strategy.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the anchor-frame-free detection and tracking unified method based on added attention modules provided by the invention comprises the following steps, performed in sequence:
1) acquiring images of a passenger flow dense area in a terminal building and preprocessing the images to acquire preprocessed images, wherein each preprocessed image is provided with a label, and the label comprises position information of all pedestrian targets in a current frame image;
a monitoring camera located in a passenger-flow-dense area of the airport terminal is used to capture images of passengers walking and occluding one another at fixed time intervals during periods of heavy passenger flow, and the images are preprocessed by deblurring, noise reduction and resolution enhancement to obtain the preprocessed images, each of which carries a label; the label contains the position information of all pedestrian targets in the current frame image.
2) Constructing an original feature extraction network model, inputting the preprocessed image into the original feature extraction network model for feature extraction to obtain an initial feature extraction network model;
the structure of the original feature extraction network model is shown in fig. 4, and is divided into five stages: stem, stage1, stage2, stage3, stage 4; wherein stem is a backbone network; stage1 to stage4 are stage1 to stage 4; the up-arrow represents the up-sampling operation and the down-arrow represents the step-wise convolution for down-sampling; conv represents a convolutional layer, bn represents a batch normalization layer, eca represents an attention module, and a bottle2neck and a 2eca-basic block represent a second generation residual block and a reference residual block added with two layers of attention modules respectively;
firstly, the stem reduces the height and width of the preprocessed image to one quarter of the original size through two 3×3 convolution layers with stride 2, then performs feature extraction with 4 second-generation residual blocks (Bottle2neck) and feeds the output feature map into stage1; stage1 to stage3 perform feature extraction and fusion operations, namely each stage generates a new lower-resolution branch on the basis of the previous stage, then applies 4 basic residual blocks with two attention modules added (2eca-basic blocks) to every low-resolution branch for feature extraction, and finally performs repeated multi-scale fusion on the obtained feature maps before passing them to stage4; stage4 is the head network, in which the feature maps output by the three parallel lower-resolution branches are first upsampled to the size of the high-resolution branch by bilinear interpolation, and the final output feature map used for detection and re-identification is then obtained through a concatenation operation and a fully connected layer, yielding the initial feature extraction network model;
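As an illustrative sketch only (not the claimed implementation), the stem described above can be pictured in PyTorch as two stride-2 convolutions that bring the input to one quarter of its height and width; the channel width of 64 is an assumption, and the four Bottle2neck blocks of the stem are omitted here (a separate sketch of that block follows further below).

```python
import torch
import torch.nn as nn

# Minimal stem sketch: two 3x3, stride-2 convolutions reduce H and W to 1/4.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),  # -> H/2, W/2
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, bias=False),  # -> H/4, W/4
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 608, 1088)   # (batch, channels, height, width) as in training
print(stem(x).shape)               # torch.Size([1, 64, 152, 272])
```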
the structure of the basic residual block with two attention modules added (2eca-basic block) is shown in fig. 2, where each cube represents a feature map, H, W and C represent the height, width and channel dimensions of the feature map respectively, GAP represents the global average pooling operation, 1 × 1 × C denotes a one-dimensional convolution, and k = 5 denotes the convolution kernel size. The attention module adopts a local cross-channel interaction strategy without dimensionality reduction and adaptively selects the size of the one-dimensional convolution kernel, i.e. the coverage of the local cross-channel interaction, according to the proportional relation between the kernel size and the channel dimension. The attention module fuses the preliminarily extracted feature map with the local cross-channel information obtained after global average pooling and the convolution operation, thereby enhancing feature expression. At the same time, the attention module uses a cross-domain connection scheme, so the extra parameters it introduces are almost negligible and it can be widely applied to various convolutional networks. Structurally, the attention module mainly consists of three layers: an average pooling layer, a convolution layer and an activation layer. The convolution layer is a one-dimensional convolution; different settings of its kernel size lead to different receptive fields of the finally extracted features and therefore affect experimental performance. Because network structures of different depths have different sensitivity to the kernel size, the optimal kernel size needs to be found experimentally to improve the performance of the attention module as much as possible, while manually tuning the kernel size by cross-validation greatly increases the amount of computation and wastes computing power. For this reason, a grouped convolution (group convolution) idea is used in the attention module, and a mechanism for adaptively selecting the convolution kernel size is provided by defining a proportional relation between the kernel size and the channel dimension.
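For reference, the following is a sketch of the standard ECA (efficient channel attention) module in PyTorch. The adaptive kernel-size rule with gamma = 2 and b = 1 follows the published ECA-Net formulation and is an assumption here; the fixed k = 5 shown in fig. 2 and the grouped-convolution variant described in the patent may differ in detail.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: GAP -> 1-D convolution over channels -> sigmoid gate.
    The kernel size is chosen adaptively from the channel count (local cross-channel
    interaction without dimensionality reduction)."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                  # force an odd kernel size
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, C, H, W) -> channel descriptor of shape (N, C, 1, 1)
        y = self.avg_pool(x)
        # treat the channels as a 1-D sequence for the local cross-channel interaction
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        return x * self.sigmoid(y)                 # re-weight the input feature map

print(ECA(64)(torch.randn(2, 64, 38, 68)).shape)   # shape is preserved
```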
Fig. 3 compares the structures of the second-generation residual block and the first-generation residual block. The left diagram in fig. 3 is the network structure of the first-generation residual block (Bottleneck), which consists of three convolution layers of 1 × 1, 3 × 3 and 1 × 1 with a skip connection between input and output; the right diagram is the network structure of the second-generation residual block (Bottle2neck), whose main modification is to split the single 3 × 3 convolution, along the channel dimension, into four branches ranging from no convolution to three cascaded 3 × 3 convolutions. Compared with the first-generation residual block, the second-generation residual block (Bottle2neck) has a larger receptive field, stronger feature extraction capability and better generalization.
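The following is a simplified sketch of such a second-generation residual block in the spirit of Res2Net; the equal channel counts, the absence of batch normalization and the expansion-free layout are simplifications made for brevity, not the exact block of the patent.

```python
import torch
import torch.nn as nn

class Bottle2neck(nn.Module):
    """Simplified second-generation residual block (scale = 4): the middle 3x3
    convolution is split into four channel groups, the first group is passed through
    unchanged, and the remaining three pass through cascaded 3x3 convolutions."""
    def __init__(self, channels, scale=4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, bias=False) for _ in range(scale - 1)
        )
        self.conv3 = nn.Conv2d(channels, channels, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        splits = torch.chunk(out, self.scale, dim=1)
        ys = [splits[0]]                       # first branch: no convolution
        prev = None
        for i, conv in enumerate(self.convs):
            inp = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = self.relu(conv(inp))        # hierarchical (cascaded) connection
            ys.append(prev)
        out = self.conv3(torch.cat(ys, dim=1))
        return self.relu(out + x)              # identity skip connection

print(Bottle2neck(64)(torch.randn(1, 64, 38, 68)).shape)   # shape is preserved
```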
3) Respectively setting corresponding loss functions aiming at the target center point positioning, the boundary size, the offset error and the re-identification task of the detection task; then, training parameters of the initial feature extraction network model by using a large amount of existing data to obtain a trained feature extraction network model;
the loss function for target center point positioning uses a modified focal loss to compute the loss between the predicted heat map and the ground-truth heat map, which effectively handles the imbalance between the target center point and its surrounding points; the formula is shown in formula (1):

L_heat = -(1/N) · Σ_xy { (1 - M̂_xy)^α · log(M̂_xy) if M_xy = 1; (1 - M_xy)^β · (M̂_xy)^α · log(1 - M̂_xy) otherwise }      (1)

In formula (1), M̂_xy is the predicted heat map response value, M_xy is the ground-truth heat map response value, and α and β are the focusing parameters of the focal loss. Let the two corner point coordinates of a pedestrian target region be (x1, y1) and (x2, y2); the center point coordinates of the pedestrian target after size reduction are (c_i^x, c_i^y) = ((x1 + x2)/8, (y1 + y2)/8), and the ground-truth heat map response of a point (x, y) with respect to the center point coordinates is shown in formula (2):

M_xy = Σ_{i=1..N} exp(-((x - c_i^x)² + (y - c_i^y)²) / (2σ_c²))      (2)

where N represents the number of pedestrian targets in the image, i indexes the i-th pedestrian target, and σ_c represents the standard deviation.
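As a concrete sketch of this term, the code below implements a penalty-reduced focal loss over the heat map together with the Gaussian response of formula (2); the focusing parameters alpha = 2 and beta = 4 are assumed values, not taken from the patent.

```python
import math
import torch

def gaussian_response(x, y, cx, cy, sigma_c):
    """Ground-truth heat-map response of a point (x, y) with respect to one centre
    (cx, cy), following the Gaussian form of formula (2) for a single target."""
    return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma_c ** 2))

def heatmap_focal_loss(pred, gt, alpha=2, beta=4):
    """Penalty-reduced focal loss between the predicted heat map `pred` and the
    ground-truth heat map `gt` (same shape, values in (0, 1)); alpha and beta are
    assumed focusing parameters."""
    pred = pred.clamp(1e-6, 1 - 1e-6)
    pos = gt.eq(1).float()                        # exact centre points
    neg = 1.0 - pos
    pos_term = pos * (1 - pred) ** alpha * torch.log(pred)
    neg_term = neg * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred)
    num_pos = pos.sum().clamp(min=1)              # N: number of pedestrian targets
    return -(pos_term.sum() + neg_term.sum()) / num_pos
```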
Boundary size and offset errors use the L1 loss as their loss function; with the corner coordinates given for each pedestrian target, the loss function is shown in formula (3):

L_box = Σ_{i=1..N} (|s_i - ŝ_i| + |o_i - ô_i|)      (3)

where s_i represents the true size of the pedestrian target, o_i represents the true offset of the pedestrian target, ŝ_i and ô_i are the predicted values of the size and the offset respectively, and L_box represents the localization loss obtained by adding the losses of the two branches.
The re-identification task is in essence a classification task, so the invention selects the softmax loss as its loss function: an identity feature vector is extracted at the center point of each pedestrian target on the obtained heat map for learning and is mapped into a class distribution vector p(k); the one-hot code of each pedestrian target is denoted L_i(k) and the number of identity classes is denoted K, so the loss function of the re-identification task is shown in formula (4):

L_id = -Σ_{i=1..N} Σ_{k=1..K} L_i(k) · log(p(k))      (4)
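The two remaining terms could be written with standard PyTorch primitives as in the sketch below; the reduction mode and the example shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def box_loss(size_pred, size_gt, offset_pred, offset_gt):
    """L1 localization loss in the sense of formula (3): the size-branch and
    offset-branch losses are added to give L_box."""
    return (F.l1_loss(size_pred, size_gt, reduction="sum")
            + F.l1_loss(offset_pred, offset_gt, reduction="sum"))

def reid_loss(identity_logits, identity_labels):
    """Softmax (cross-entropy) loss in the sense of formula (4); `identity_logits`
    has shape (num_targets, K) and `identity_labels` holds each target's class index."""
    return F.cross_entropy(identity_logits, identity_labels)

# Example with fabricated shapes: 5 targets, 300 identity classes.
s_hat, s = torch.rand(5, 2), torch.rand(5, 2)
o_hat, o = torch.rand(5, 2), torch.rand(5, 2)
logits, labels = torch.randn(5, 300), torch.randint(0, 300, (5,))
print(box_loss(s_hat, s, o_hat, o).item(), reid_loss(logits, labels).item())
```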
after all the loss functions are set, the training set images of the CUHK-SYSU, PRW and MOT16 data sets are selected as the training set, the training set images of the 2DMOT15 data set are selected as the validation set, and the parameters of the initial feature extraction network model are trained. The number of training iterations is set to 36 epochs: the learning rate of the first 31 epochs is set to 1e-4, the learning rate of the following 4 epochs is set to 1e-5, and the last epoch uses a learning rate of 1e-6 so that training converges. The input image size during training is (1088, 608), the batch size is set to 6, an Adam optimizer is used for model optimization, ReLU is used as the activation function, the regularization coefficient is set to 0.001, and the trained feature extraction network model is finally obtained after training is completed.
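The learning-rate schedule and optimizer settings described above could be wired up as in the minimal sketch below; the model is only a stand-in, and mapping the regularization coefficient of 0.001 to Adam's weight_decay is an assumption.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)   # stand-in for the feature extraction network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.001)

def lr_for_epoch(epoch):
    """36 epochs in total: 1e-4 for epochs 0-30, 1e-5 for epochs 31-34, 1e-6 for the last."""
    if epoch < 31:
        return 1e-4
    if epoch < 35:
        return 1e-5
    return 1e-6

for epoch in range(36):
    for group in optimizer.param_groups:
        group["lr"] = lr_for_epoch(epoch)
    # one pass over the (1088, 608) training images with batch size 6 would go here
```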
4) And continuously detecting and tracking the pedestrian target by using the trained feature extraction network model.
The method comprises the following specific steps:
4.1. firstly, taking a first frame image as an input image, initializing a distance matrix according to label information of the input image and packaging to obtain appearance information and motion information of a pedestrian target for subsequent data matching;
4.2. each pedestrian target is used as a category, each category is instantiated through a boundary frame to be used as a tracking object, and the position information of the pedestrian target in the next frame of image is predicted by using a Kalman filtering method according to the current frame detection result;
4.3. matching the predicted position information with appearance information and motion information by using Mahalanobis distance measurement to judge whether the pedestrian target tracking state is an initial default state, a confirmed state or a deleted state; the initial default state refers to a state of detecting a newly generated motion track of a certain pedestrian target for the first time, and is marked as the state because whether a detection result is correct or not cannot be confirmed; if the matching is successful in the next three continuous frames of images, changing the tracking state of the pedestrian target from an initial default state to a confirmed state, and determining that the motion track is the tracking track of the specific pedestrian target; if the matching is not successful in the next three frames of images, the detection is regarded as false detection, the motion track is determined to be a false tracking track, the initial default state is changed into a deleting state, and the motion track is deleted;
4.4. if the pedestrian target tracking state is the initial default state or the confirmed state, cascade matching is carried out, followed by overlap (IOU) matching between the prediction frame and the real frame, which can produce three results: successful matches, unmatched tracks and unmatched detections; if the matching is successful, the predicted value and the detected observation are updated by the Kalman filtering method, the appearance feature of the pedestrian target is updated, the tracking track is updated, and the above steps are repeated; if the result is an unmatched track, the tracking track is considered interrupted and is deleted; if the result is an unmatched detection, it may be a new pedestrian target, so it is initialized as a new tracking track and a new tracker is allocated;
4.5. and after the input image is updated to be the next frame image, repeating the steps 4.1, 4.2, 4.3 and 4.4, and finally obtaining the tracking result of the pedestrian target in each frame image after the tracking is finished, so that the continuous pedestrian tracking track is determined, and finally, a visualization result is output.
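To illustrate the track life-cycle of steps 4.1 to 4.5, the following is a minimal sketch of the state handling and the IOU computation; the Kalman filter, the Mahalanobis gating and the cascade matching are only indicated by comments, the three-frame confirmation rule comes from the text, and the 30-frame deletion age is an assumption.

```python
from enum import Enum

class TrackState(Enum):
    TENTATIVE = 1   # initial default state of a newly created track
    CONFIRMED = 2
    DELETED = 3

class Track:
    """Minimal track life-cycle sketch for steps 4.3-4.4: a new track becomes confirmed
    after being matched in three consecutive frames; an unmatched tentative track is
    treated as a false detection and deleted."""
    def __init__(self, bbox):
        self.bbox = bbox
        self.state = TrackState.TENTATIVE
        self.consecutive_matches = 0
        self.frames_since_match = 0

    def mark_matched(self, bbox):
        self.bbox = bbox                    # the full method performs a Kalman update here
        self.consecutive_matches += 1
        self.frames_since_match = 0
        if self.state is TrackState.TENTATIVE and self.consecutive_matches >= 3:
            self.state = TrackState.CONFIRMED

    def mark_missed(self, max_age=30):      # max_age is an assumed parameter
        self.consecutive_matches = 0
        self.frames_since_match += 1
        if self.state is TrackState.TENTATIVE or self.frames_since_match > max_age:
            self.state = TrackState.DELETED

def iou(a, b):
    """Overlap (IOU) of two boxes given as (x1, y1, x2, y2), used in step 4.4."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```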

Claims (5)

1. An anchor-frame-free detection and tracking unified method based on an added attention module, characterized in that: the anchor-frame-free detection and tracking unified method based on the added attention module comprises the following steps in sequence:
1) acquiring images of a passenger flow dense area in a terminal building and preprocessing the images to acquire preprocessed images, wherein each preprocessed image is provided with a label, and the label comprises position information of all pedestrian targets in a current frame image;
2) constructing an original feature extraction network model, inputting the preprocessed image into the original feature extraction network model for feature extraction to obtain an initial feature extraction network model;
3) respectively setting corresponding loss functions aiming at the target center point positioning, the boundary size, the offset error and the re-identification task of the detection task; then, training parameters of the initial feature extraction network model by using a large amount of existing data to obtain a trained feature extraction network model;
4) and continuously detecting and tracking the pedestrian target by using the trained feature extraction network model.
2. The anchor-frame-free detection and tracking unified method based on attention-added module as claimed in claim 1, wherein: in step 1), the method for acquiring and preprocessing the image of the passenger flow dense area in the terminal building to obtain the preprocessed image comprises the following steps: the method comprises the steps of utilizing a monitoring camera located in a passenger flow dense area in an airport terminal to shoot images of passengers in the walking and shielding processes at fixed time intervals in a time period with larger passenger flow, and carrying out preprocessing including deblurring, noise reduction and resolution improvement on the images to obtain preprocessed images.
3. The anchor-frame-free detection and tracking unified method based on attention-added module as claimed in claim 1, wherein: in step 2), the method for constructing the original feature extraction network model and then inputting the preprocessed image into the original feature extraction network model for feature extraction to obtain the initial feature extraction network model comprises the following steps:
the original feature extraction network model is divided into five parts: stem and stage1 to stage4, where stem is the backbone network and stage1 to stage4 are the four subsequent stages;
firstly, the stem reduces the height and width of the preprocessed image to one quarter of the original size through two 3×3 convolution layers with stride 2, then performs feature extraction with 4 second-generation residual blocks (Bottle2neck) and feeds the output feature map into stage1; stage1 to stage3 perform feature extraction and fusion operations, namely each stage generates a new lower-resolution branch on the basis of the previous stage, then applies 4 basic residual blocks with two attention modules added (2eca-basic blocks) to every low-resolution branch for feature extraction, and finally performs repeated multi-scale fusion on the obtained feature maps before passing them to stage4; stage4 is the head network, in which the feature maps output by the three parallel lower-resolution branches are first upsampled to the size of the high-resolution branch by bilinear interpolation, the final output feature map used for detection and re-identification is then obtained through a concatenation operation and a fully connected layer, and an initial feature extraction network model is obtained.
4. The anchor-frame-free detection and tracking unified method based on attention-added module as claimed in claim 1, wherein: in the step 3), corresponding loss functions are respectively set for the target center point positioning, the boundary size, the offset error and the re-identification task of the detection task; then, using a large amount of existing data to train the parameters of the initial feature extraction network model, the method for obtaining the trained feature extraction network model comprises the following steps:
the loss function for target center point positioning uses a modified focal loss to compute the loss between the predicted heat map and the ground-truth heat map, which effectively handles the imbalance between the target center point and its surrounding points; the formula is shown in formula (1):

L_heat = -(1/N) · Σ_xy { (1 - M̂_xy)^α · log(M̂_xy) if M_xy = 1; (1 - M_xy)^β · (M̂_xy)^α · log(1 - M̂_xy) otherwise }      (1)

in formula (1), M̂_xy is the predicted heat map response value, M_xy is the ground-truth heat map response value, and α and β are the focusing parameters of the focal loss; let the two corner point coordinates of a pedestrian target region be (x1, y1) and (x2, y2), then the center point coordinates of the pedestrian target after size reduction are (c_i^x, c_i^y) = ((x1 + x2)/8, (y1 + y2)/8), and the ground-truth heat map response of a point (x, y) with respect to the center point coordinates is shown in formula (2):

M_xy = Σ_{i=1..N} exp(-((x - c_i^x)² + (y - c_i^y)²) / (2σ_c²))      (2)

where N represents the number of pedestrian targets in the image, i indexes the i-th pedestrian target, and σ_c represents the standard deviation;
boundary size and offset errors use the L1 loss as their loss function; with the corner coordinates given for each pedestrian target, the loss function is shown in formula (3):

L_box = Σ_{i=1..N} (|s_i - ŝ_i| + |o_i - ô_i|)      (3)

where s_i represents the true size of the pedestrian target, o_i represents the true offset of the pedestrian target, ŝ_i and ô_i are the predicted values of the size and the offset respectively, and L_box represents the localization loss obtained by adding the losses of the two branches;
the re-identification task is in essence a classification task, so the softmax loss is selected as its loss function: an identity feature vector is extracted at the center point of each pedestrian target on the obtained heat map for learning and is mapped into a class distribution vector p(k); the one-hot code of each pedestrian target is denoted L_i(k) and the number of identity classes is denoted K, so the loss function of the re-identification task is shown in formula (4):

L_id = -Σ_{i=1..N} Σ_{k=1..K} L_i(k) · log(p(k))      (4)
after all the loss functions are set, the training set images of the CUHK-SYSU, PRW and MOT16 data sets are selected as the training set, the training set images of the 2DMOT15 data set are selected as the validation set, and the parameters of the initial feature extraction network model are trained; the number of training iterations is set to 36 epochs, the learning rate of the first 31 epochs is set to 1e-4, the learning rate of the following 4 epochs is set to 1e-5, and the last epoch uses a learning rate of 1e-6 so that training converges; the input image size during training is (1088, 608), the batch size is set to 6, an Adam optimizer is used for model optimization, ReLU is used as the activation function, the regularization coefficient is set to 0.001, and the trained feature extraction network model is finally obtained after training is completed.
5. The anchor-frame-free detection and tracking unified method based on attention-added module as claimed in claim 1, wherein: in step 4), the specific steps of continuously detecting and tracking the pedestrian target by using the trained feature extraction network model are as follows:
4.1. firstly, taking a first frame image as an input image, initializing a distance matrix according to label information of the input image and packaging to obtain appearance information and motion information of a pedestrian target for subsequent data matching;
4.2. each pedestrian target is used as a category, each category is instantiated through a boundary frame to be used as a tracking object, and the position information of the pedestrian target in the next frame of image is predicted by using a Kalman filtering method according to the current frame detection result;
4.3. matching the predicted position information with appearance information and motion information by using Mahalanobis distance measurement to judge whether the pedestrian target tracking state is an initial default state, a confirmed state or a deleted state; the initial default state refers to a state of detecting a newly generated motion track of a certain pedestrian target for the first time, and is marked as the state because whether a detection result is correct or not cannot be confirmed; if the matching is successful in the next three continuous frames of images, changing the tracking state of the pedestrian target from an initial default state to a confirmed state, and determining that the motion track is the tracking track of the specific pedestrian target; if the matching is not successful in the next three frames of images, the detection is regarded as false detection, the motion track is determined to be a false tracking track, the initial default state is changed into a deleting state, and the motion track is deleted;
4.4. if the pedestrian target tracking state is the initial default state or the confirmed state, cascade matching is carried out, followed by overlap (IOU) matching between the prediction frame and the real frame, which can produce three results: successful matches, unmatched tracks and unmatched detections; if the matching is successful, the predicted value and the detected observation are updated by the Kalman filtering method, the appearance feature of the pedestrian target is updated, the tracking track is updated, and the above steps are repeated; if the result is an unmatched track, the tracking track is considered interrupted and is deleted; if the result is an unmatched detection, it may be a new pedestrian target, so it is initialized as a new tracking track and a new tracker is allocated;
4.5. and after the input image is updated to be the next frame image, repeating the steps 4.1, 4.2, 4.3 and 4.4, and finally obtaining the tracking result of the pedestrian target in each frame image after the tracking is finished, so that the continuous pedestrian tracking track is determined, and finally, a visualization result is output.
CN202210057161.9A 2022-01-19 2022-01-19 Anchor-frame-free detection and tracking unified method based on attention module addition Pending CN114387265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210057161.9A CN114387265A (en) 2022-01-19 2022-01-19 Anchor-frame-free detection and tracking unified method based on attention module addition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210057161.9A CN114387265A (en) 2022-01-19 2022-01-19 Anchor-frame-free detection and tracking unified method based on attention module addition

Publications (1)

Publication Number Publication Date
CN114387265A true CN114387265A (en) 2022-04-22

Family

ID=81203170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210057161.9A Pending CN114387265A (en) 2022-01-19 2022-01-19 Anchor-frame-free detection and tracking unified method based on attention module addition

Country Status (1)

Country Link
CN (1) CN114387265A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972805A (en) * 2022-05-07 2022-08-30 杭州像素元科技有限公司 Anchor-free joint detection and embedding-based multi-target tracking method
CN115082517A (en) * 2022-05-25 2022-09-20 华南理工大学 Horse racing scene multi-target tracking method based on data enhancement
CN115082517B (en) * 2022-05-25 2024-04-19 华南理工大学 Horse racing scene multi-target tracking method based on data enhancement
CN117455955A (en) * 2023-12-14 2024-01-26 武汉纺织大学 Pedestrian multi-target tracking method based on unmanned aerial vehicle visual angle
CN117455955B (en) * 2023-12-14 2024-03-08 武汉纺织大学 Pedestrian multi-target tracking method based on unmanned aerial vehicle visual angle
CN117576489A (en) * 2024-01-17 2024-02-20 华侨大学 Robust real-time target sensing method, device, equipment and medium for intelligent robot
CN117576489B (en) * 2024-01-17 2024-04-09 华侨大学 Robust real-time target sensing method, device, equipment and medium for intelligent robot
CN117670938A (en) * 2024-01-30 2024-03-08 江西方兴科技股份有限公司 Multi-target space-time tracking method based on super-treatment robot
CN117670938B (en) * 2024-01-30 2024-05-10 江西方兴科技股份有限公司 Multi-target space-time tracking method based on super-treatment robot
CN117952287A (en) * 2024-03-27 2024-04-30 飞友科技有限公司 Prediction method and system for number of passengers in terminal building waiting area

Similar Documents

Publication Publication Date Title
CN114387265A (en) Anchor-frame-free detection and tracking unified method based on attention module addition
Ko et al. Key points estimation and point instance segmentation approach for lane detection
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN108830171B (en) Intelligent logistics warehouse guide line visual detection method based on deep learning
CN111797716A (en) Single target tracking method based on Siamese network
CN111626128A (en) Improved YOLOv 3-based pedestrian detection method in orchard environment
CN110659664B (en) SSD-based high-precision small object identification method
CN116258608B (en) Water conservancy real-time monitoring information management system integrating GIS and BIM three-dimensional technology
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
CN117274749B (en) Fused 3D target detection method based on 4D millimeter wave radar and image
CN113092807B (en) Urban overhead road vehicle speed measuring method based on multi-target tracking algorithm
Li et al. Enhancing 3-D LiDAR point clouds with event-based camera
CN117523514A (en) Cross-attention-based radar vision fusion data target detection method and system
CN114820765A (en) Image recognition method and device, electronic equipment and computer readable storage medium
CN113724293A (en) Vision-based intelligent internet public transport scene target tracking method and system
CN113536997A (en) Intelligent security system and method based on image recognition and behavior analysis
AU2019100967A4 (en) An environment perception system for unmanned driving vehicles based on deep learning
CN114820931B (en) Virtual reality-based CIM (common information model) visual real-time imaging method for smart city
CN116597419A (en) Vehicle height limiting scene identification method based on parameterized mutual neighbors
CN116468950A (en) Three-dimensional target detection method for neighborhood search radius of class guide center point
CN114820723A (en) Online multi-target tracking method based on joint detection and association
CN115187959A (en) Method and system for landing flying vehicle in mountainous region based on binocular vision
CN114897939A (en) Multi-target tracking method and system based on deep path aggregation network
CN113920733A (en) Traffic volume estimation method and system based on deep network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination