CN114387265A - Anchor-frame-free detection and tracking unified method based on attention module addition - Google Patents
Anchor-frame-free detection and tracking unified method based on attention module addition
- Publication number
- CN114387265A (application CN202210057161.9A)
- Authority
- CN
- China
- Prior art keywords
- tracking
- feature extraction
- network model
- pedestrian
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 46
- 238000000034 method Methods 0.000 title claims abstract description 40
- 238000000605 extraction Methods 0.000 claims abstract description 60
- 238000012549 training Methods 0.000 claims abstract description 23
- 230000006870 function Effects 0.000 claims description 29
- 230000004044 response Effects 0.000 claims description 9
- 238000010586 diagram Methods 0.000 claims description 8
- 230000004927 fusion Effects 0.000 claims description 6
- 238000007781 pre-processing Methods 0.000 claims description 5
- 230000004913 activation Effects 0.000 claims description 4
- 238000001914 filtration Methods 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 230000004807 localization Effects 0.000 claims description 3
- 239000011159 matrix material Substances 0.000 claims description 3
- 230000009467 reduction Effects 0.000 claims description 3
- 238000005549 size reduction Methods 0.000 claims description 3
- 238000012544 monitoring process Methods 0.000 claims description 2
- 238000005457 optimization Methods 0.000 claims description 2
- 238000012795 verification Methods 0.000 claims description 2
- 230000006872 improvement Effects 0.000 claims 1
- 238000005259 measurement Methods 0.000 claims 1
- 238000004806 packaging method and process Methods 0.000 claims 1
- 238000012800 visualization Methods 0.000 claims 1
- 230000003993 interaction Effects 0.000 abstract description 6
- 238000005516 engineering process Methods 0.000 description 3
- 239000000284 extract Substances 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 238000011176 pooling Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000000903 blocking effect Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000002790 cross-validation Methods 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000009191 jumping Effects 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000013439 planning Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000035945 sensitivity Effects 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
- 239000002699 waste material Substances 0.000 description 1
- 238000013316 zoning Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field
The invention belongs to the technical field of civil aviation, and in particular relates to a unified anchor-free detection and tracking method based on the addition of an attention module.
Background
With the wide deployment of intelligent video surveillance in public areas such as transportation hubs and commercial districts, and its strong performance in security and passenger-flow monitoring, the computer vision technology it relies on is also developing rapidly. Pedestrian tracking, a popular computer vision technique, obtains the identity, position and motion trajectory of pedestrian targets by analysing the captured video data. Compared with other positioning methods it is more proactive, real-time and practical, so in places such as airport terminals it can guide zone planning, provide personalised services to passengers, remind passengers of boarding information, and help maintain order and safety, which gives it clear application value.
With the wide application of deep learning in computer vision, deep-learning-based multi-object tracking algorithms have gradually come to dominate pedestrian tracking. The mainstream pedestrian tracking algorithms, such as FairMOT and JDE, split the multi-object tracking problem into a detection part and a tracking part: a detection network locates the pedestrians, and data association then matches consecutive frames to form tracks, so the quality of the detection network largely determines tracking performance.
Most current pedestrian detection and tracking methods still follow a detect-then-track strategy, and the detection network has limited performance on multiple targets. Problems such as low detection accuracy, loss of targets in occluded scenes, redundant network parameters and poor real-time performance therefore remain to be solved.
Summary of the Invention
To solve the above problems, the purpose of the present invention is to provide a unified anchor-free detection and tracking method based on the addition of an attention module.
To achieve the above object, the unified anchor-free detection and tracking method based on the addition of an attention module provided by the present invention comprises the following steps, performed in order:
1) Acquire images of passenger-dense areas in the terminal building and preprocess them to obtain preprocessed images, each frame of which carries a label containing the position information of all pedestrian targets in that frame;
2) Construct the original feature extraction network model, then feed the preprocessed images into it for feature extraction to obtain the initial feature extraction network model;
3) Set corresponding loss functions for the target centre-point localization, bounding-box size and offset error of the detection task, and for the re-identification task; then train the parameters of the initial feature extraction network model on a large amount of existing data to obtain the trained feature extraction network model;
4) Use the trained feature extraction network model to continuously detect and track pedestrian targets.
In step 1), the images of passenger-dense areas in the terminal building are acquired and preprocessed as follows: surveillance cameras located in the passenger-dense areas of the terminal capture, at fixed time intervals during periods of heavy passenger flow, images of passengers walking and occluding one another; the images are then preprocessed by deblurring, noise reduction and resolution enhancement to obtain the preprocessed images.
In step 2), the original feature extraction network model is constructed and the preprocessed images are fed into it for feature extraction to obtain the initial feature extraction network model, as follows:
The original feature extraction network model is divided into five stages: stem, stage1, stage2, stage3 and stage4, where stem is the backbone and stage1 to stage4 are stages 1 to 4.
First, the stem reduces the height and width of the preprocessed image to one quarter of the original through two 3×3 convolutional layers with stride 2, then performs feature extraction with four second-generation residual blocks (bottle2neck) and feeds the output feature maps into stage 1. Stages 1 to 3 perform feature extraction and fusion: each stage adds one new low-resolution branch on top of the previous stage, and each low-resolution branch then uses four basic residual blocks with two attention modules added (2eca-basicblock) for feature extraction; the resulting feature maps undergo repeated multi-scale fusion and are fed into stage 4. Stage 4 is the head network: the feature maps of the three parallel low-resolution branches are first upsampled to the size of the high-resolution branch by bilinear interpolation, then a concatenation operation and a fully connected layer produce the final output feature map used for detection and re-identification, giving the initial feature extraction network model.
In step 3), corresponding loss functions are set for the target centre-point localization, bounding-box size and offset error of the detection task, and for the re-identification task; the parameters of the initial feature extraction network model are then trained on a large amount of existing data to obtain the trained feature extraction network model, as follows:
The loss function for target centre-point localization is a variant of the focal loss, used to compute the loss between the predicted heatmap and the ground-truth heatmap; it effectively handles the sample imbalance between the target centre point and the surrounding points. Writing $\hat{M}_{xy}$ for the predicted heatmap response, $M_{xy}$ for the ground-truth response, and $\alpha$, $\beta$ for the focusing hyperparameters, the loss is given by formula (1):
$$L_{\text{heat}}=-\frac{1}{N}\sum_{x,y}\begin{cases}\bigl(1-\hat{M}_{xy}\bigr)^{\alpha}\log\hat{M}_{xy}, & M_{xy}=1\\ \bigl(1-M_{xy}\bigr)^{\beta}\,\hat{M}_{xy}^{\alpha}\log\bigl(1-\hat{M}_{xy}\bigr), & \text{otherwise}\end{cases}\tag{1}$$
Let the two corner coordinates of a pedestrian target region be $(x_1, y_1)$ and $(x_2, y_2)$; after the size reduction, the centre-point coordinates of the pedestrian target are $(c_i^x, c_i^y)=\bigl((x_1+x_2)/8,\;(y_1+y_2)/8\bigr)$, and the ground-truth heatmap response at a point $(x, y)$ with respect to the centre-point coordinates is given by formula (2):
$$M_{xy}=\sum_{i=1}^{N}\exp\!\left(-\frac{(x-c_i^x)^2+(y-c_i^y)^2}{2\sigma_c^2}\right)\tag{2}$$
where $N$ is the number of pedestrian targets in the image, $i$ indexes the pedestrian targets, and $\sigma_c$ is the standard deviation.
For the bounding-box size and offset error, two $\ell_1$ losses are chosen as loss functions. Given the corner coordinates of each pedestrian target, the loss function is formula (3):
$$L_{\text{box}}=\sum_{i=1}^{N}\Bigl(\bigl\lVert s_i-\hat{s}_i\bigr\rVert_1+\bigl\lVert o_i-\hat{o}_i\bigr\rVert_1\Bigr)\tag{3}$$
where $s_i$ is the true size of the pedestrian target, $o_i$ is the true offset of the pedestrian target size, $\hat{s}_i$ and $\hat{o}_i$ are the predicted size and offset, and $L_{\text{box}}$ is the localization loss obtained by adding the losses of the two branches.
The re-identification task is essentially a classification task, so the softmax loss is chosen as its loss function. An identity feature vector is extracted at the centre point of each pedestrian target on the obtained heatmap, learned, and mapped to a class distribution vector $p(k)$. Denoting the one-hot encoding of each pedestrian target by $L^{i}(k)$ and the number of classes by $K$, the loss function of the re-identification task is formula (4):
$$L_{\text{identity}}=-\sum_{i=1}^{N}\sum_{k=1}^{K}L^{i}(k)\log p(k)\tag{4}$$
After all the above loss functions are set, the training-set images of the CUHK-SYSU, PRW and MOT16 datasets are selected as the training set, the training-set images of the 2DMOT15 dataset are used as the validation set, and the parameters of the initial feature extraction network model are trained. The number of training epochs is set to 36: the learning rate is 1e-4 for the first 31 epochs, 1e-5 for the following 4 epochs, and 1e-6 for the last epoch to reach convergence. The input image size during training is (1088, 608), the batch size is 6, the Adam optimizer is used for model optimization, ReLU is used as the activation function, and the regularization coefficient is set to 0.001. After training, the trained feature extraction network model is obtained.
In step 4), the specific steps of using the trained feature extraction network model to continuously detect and track pedestrian targets are as follows:
4.1. First, take the first frame as the input image, initialize the distance matrix according to the label information of the input image and encapsulate it, obtaining the appearance information and motion information of the pedestrian targets for subsequent data matching;
4.2. Treat each pedestrian target as a class, instantiate each class as a tracked object through its bounding box, and, based on the detection results of the current frame, use the Kalman filter to predict the position information of the pedestrian target in the next frame;
4.3. Match the predicted position information against the appearance and motion information using the Mahalanobis distance metric to determine whether the tracking state of a pedestrian target is the initial default state, the confirmed state or the deleted state. The initial default state refers to a newly generated trajectory the first time a pedestrian target is detected; it is labelled with this state because the correctness of the detection cannot yet be confirmed. If matching succeeds in the following three consecutive frames, the tracking state changes from the initial default state to the confirmed state and the trajectory is confirmed as the tracking trajectory of a specific pedestrian target; if matching fails within the following three frames, the detection is regarded as a false detection, the trajectory is judged to be an erroneous track, its state changes from the initial default state to the deleted state, and the trajectory is deleted;
4.4. If the tracking state of a pedestrian target is the initial default state or the confirmed state, perform cascade matching and overlap (IOU) matching between the predicted boxes and the real (detected) boxes, which can yield three outcomes: a successful match, an unmatched track, or an unmatched detection. If the match is successful, use the Kalman filter to update the predicted value with the detected observation, update the appearance features of the pedestrian target, update the tracking trajectory, and repeat the above steps. If the outcome is an unmatched track, the trajectory is interrupted and is deleted. If the outcome is an unmatched detection, it may be a new pedestrian target, so it is initialized as a new trajectory and assigned a new tracker;
4.5. After updating the input image to the next frame, repeat steps 4.1 to 4.4. When tracking ends, the tracking result of every pedestrian target in each frame is obtained, continuous pedestrian tracking trajectories are determined, and the visualization results are finally output.
The unified anchor-free detection and tracking method based on the addition of an attention module provided by the present invention has the following beneficial effects:
(1) A multi-task learning strategy is adopted, which greatly reduces the training time of the network;
(2) The trained network model has high accuracy and robustness;
(3) Multi-scale information interaction is fully exploited to deeply extract and fuse more expressive pedestrian target features, so pedestrian targets can be tracked more accurately in scenes where pedestrians occlude one another;
(4) Second-generation residual blocks are used to build the backbone of the network model, combined with a more efficient attention module for information interaction, so the method achieves higher detection accuracy and stronger re-identification performance and is applicable to pedestrian detection and tracking in scenes with severe mutual occlusion among passengers, such as airport terminals.
Brief Description of the Drawings
Fig. 1 is a flowchart of the unified anchor-free detection and tracking method based on the addition of an attention module provided by the present invention.
Fig. 2 is a schematic diagram of the structure of the basic residual block with two attention modules added.
Fig. 3 is a structural comparison of the second-generation residual block and the first-generation residual block.
Fig. 4 is a schematic diagram of the structure of the feature extraction network model constructed in this method.
Fig. 5 is a flowchart of the pedestrian target tracking strategy.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the unified anchor-free detection and tracking method based on the addition of an attention module provided by the present invention comprises the following steps, performed in order:
1) Acquire images of passenger-dense areas in the terminal building and preprocess them to obtain preprocessed images, each frame of which carries a label containing the position information of all pedestrian targets in that frame;
Surveillance cameras located in the passenger-dense areas of the terminal capture, at fixed time intervals during periods of heavy passenger flow, images of passengers walking and occluding one another. The images are preprocessed by deblurring, noise reduction and resolution enhancement to obtain the preprocessed images, and each frame of the preprocessed images carries a label containing the position information of all pedestrian targets in that frame.
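As an illustration only, one possible form of this preprocessing is sketched below; the concrete OpenCV operators (a sharpening kernel as a simple deblurring proxy, non-local-means denoising, bicubic upscaling) are assumptions, since the text names only the three operations, not how they are implemented.

```python
import cv2
import numpy as np

def preprocess_frame(frame, scale=2.0):
    """Illustrative preprocessing: sharpen (stand-in for deblurring),
    denoise, and upscale a surveillance frame. The concrete operators are
    assumptions; the patent only names deblurring, noise reduction and
    resolution enhancement."""
    # Simple sharpening kernel used here as a proxy for deblurring.
    kernel = np.array([[0, -1, 0],
                       [-1, 5, -1],
                       [0, -1, 0]], dtype=np.float32)
    sharpened = cv2.filter2D(frame, -1, kernel)
    # Non-local-means denoising for colour images.
    denoised = cv2.fastNlMeansDenoisingColored(sharpened, None, 10, 10, 7, 21)
    # Bicubic upscaling to raise the resolution.
    h, w = denoised.shape[:2]
    return cv2.resize(denoised, (int(w * scale), int(h * scale)),
                      interpolation=cv2.INTER_CUBIC)
```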
2) Construct the original feature extraction network model, then feed the preprocessed images into it for feature extraction to obtain the initial feature extraction network model;
The structure of the original feature extraction network model is shown in Fig. 4. It is divided into five stages: stem, stage1, stage2, stage3 and stage4, where stem is the backbone and stage1 to stage4 are stages 1 to 4. In the figure, an upward-slanting arrow denotes an upsampling operation and a downward-slanting arrow denotes a strided convolution used for downsampling; conv denotes a convolutional layer, bn a batch normalization layer, eca the attention module, and bottle2neck and 2eca-basicblock denote the second-generation residual block and the basic residual block with two attention modules added, respectively.
First, the stem reduces the height and width of the preprocessed image to one quarter of the original through two 3×3 convolutional layers with stride 2, then performs feature extraction with four second-generation residual blocks (bottle2neck) and feeds the output feature maps into stage 1. Stages 1 to 3 perform feature extraction and fusion: each stage adds one new low-resolution branch on top of the previous stage, and each low-resolution branch then uses four basic residual blocks with two attention modules added (2eca-basicblock) for feature extraction; the resulting feature maps undergo repeated multi-scale fusion and are fed into stage 4. Stage 4 is the head network: the feature maps of the three parallel low-resolution branches are first upsampled to the size of the high-resolution branch by bilinear interpolation, then a concatenation operation and a fully connected layer produce the final output feature map used for detection and re-identification, giving the initial feature extraction network model.
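For illustration, a minimal sketch of the stage-4 fusion head described above is given below. PyTorch and the branch channel widths are assumptions not taken from the patent; the sketch upsamples the low-resolution branches by bilinear interpolation, concatenates all branches along the channel dimension, and uses a 1×1 convolution in place of the fully connected layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Sketch of the stage-4 head: bilinear upsampling + concatenation + 1x1 conv.
    Branch channel widths are assumptions, not values from the patent."""
    def __init__(self, in_channels=(32, 64, 128, 256), out_channels=256):
        super().__init__()
        self.proj = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, branches):
        # branches[0] is the high-resolution branch; the rest are lower-resolution.
        h, w = branches[0].shape[2:]
        ups = [branches[0]] + [
            F.interpolate(b, size=(h, w), mode="bilinear", align_corners=False)
            for b in branches[1:]
        ]
        return self.proj(torch.cat(ups, dim=1))
```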
The structure of the basic residual block with two attention modules added (2eca-basicblock) is shown in Fig. 2, where the cubes represent feature maps, H, W and C denote the height, width and channel dimension of the feature map, GAP denotes the global average pooling operation, 1×1×C denotes a one-dimensional convolution, and k=5 denotes the convolution kernel size. The attention module adopts a local cross-channel interaction strategy without dimensionality reduction and adaptively selects the size of the one-dimensional convolution kernel, i.e. the coverage of the local cross-channel interaction, according to the proportional relationship between kernel size and channel dimension. The attention module fuses the initially extracted feature map with the local cross-channel information obtained after global average pooling and the convolution operation, thereby enhancing the expressiveness of the features. Because the attention module uses a cross-domain connection scheme, the extra parameters it introduces are almost negligible, so it can be widely applied in all kinds of convolutional networks. Structurally, the attention module consists of three layers: an average pooling layer, a convolutional layer and an activation layer. The convolutional layer is a one-dimensional convolution; different kernel-size settings give the extracted features different receptive fields and thus affect experimental performance. Since network structures of different depths have different sensitivities to the kernel size, the optimal kernel size needs to be found experimentally to maximize the performance of the attention module. Manually tuning the kernel size by cross-validation would greatly increase the amount of computation and waste computing power, so the attention module draws on group convolution: by defining the proportional relationship of the convolution across different dimensions, it sets up a mechanism that adaptively selects the kernel size.
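The attention module described above corresponds to the efficient channel attention (ECA) design; a minimal sketch is given below, assuming the published ECA-Net kernel-size rule with γ = 2 and b = 1 (these values are not stated in the patent).

```python
import math
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    """Sketch of the ECA attention module: global average pooling, a 1-D
    convolution across channels (no dimensionality reduction), and a sigmoid
    gate. The adaptive kernel-size rule follows ECA-Net (gamma=2, b=1), which
    is assumed here."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1          # force an odd kernel size
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, C, H, W) -> channel descriptor (N, C, 1, 1)
        y = self.avg_pool(x)
        # 1-D convolution over the channel dimension: local cross-channel interaction
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        return x * self.sigmoid(y)         # reweight the input feature map
```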
Fig. 3 compares the structures of the second-generation residual block and the first-generation residual block. The left part of Fig. 3 shows the network structure of the first-generation residual block (bottleneck), which consists of three convolutional layers of 1×1, 3×3 and 1×1 with a skip connection between input and output. The right part shows the second-generation residual block (bottle2neck), whose main improvement is that, along the channel dimension, the single 3×3 convolution is replaced by four branches ranging from no convolution to three 3×3 convolutions. Compared with the first-generation bottleneck, the second-generation bottle2neck has a larger receptive field, stronger feature extraction capability and better generalization.
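A simplified sketch of such a bottle2neck block is shown below, assuming a Res2Net-style split into four channel groups (scale = 4) with equal input and output widths; the exact channel arrangement in the patent may differ.

```python
import torch
import torch.nn as nn

class Bottle2neck(nn.Module):
    """Simplified second-generation residual block: after the first 1x1
    convolution the channels are split into 4 groups; group 0 passes through
    unchanged, and each remaining group goes through a 3x3 convolution whose
    input also receives the previous group's output, giving the hierarchical
    multi-scale receptive field described in the text."""
    def __init__(self, channels, scale=4):
        super().__init__()
        assert channels % scale == 0
        width = channels // scale
        self.scale = scale
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1, bias=False),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True))
            for _ in range(scale - 1)
        ])
        self.conv3 = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                   nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv1(x)
        splits = torch.chunk(out, self.scale, dim=1)
        ys = [splits[0]]                       # first group: no convolution
        prev = None
        for i, conv in enumerate(self.convs):
            inp = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = conv(inp)
            ys.append(prev)
        out = self.conv3(torch.cat(ys, dim=1))
        return self.relu(out + x)              # residual skip connection
```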
3) Set corresponding loss functions for the target centre-point localization, bounding-box size and offset error of the detection task, and for the re-identification task; then train the parameters of the initial feature extraction network model on a large amount of existing data to obtain the trained feature extraction network model;
The loss function for target centre-point localization is a variant of the focal loss, used to compute the loss between the predicted heatmap and the ground-truth heatmap; it effectively handles the sample imbalance between the target centre point and the surrounding points. Writing $\hat{M}_{xy}$ for the predicted heatmap response, $M_{xy}$ for the ground-truth response, and $\alpha$, $\beta$ for the focusing hyperparameters, the loss is given by formula (1):
$$L_{\text{heat}}=-\frac{1}{N}\sum_{x,y}\begin{cases}\bigl(1-\hat{M}_{xy}\bigr)^{\alpha}\log\hat{M}_{xy}, & M_{xy}=1\\ \bigl(1-M_{xy}\bigr)^{\beta}\,\hat{M}_{xy}^{\alpha}\log\bigl(1-\hat{M}_{xy}\bigr), & \text{otherwise}\end{cases}\tag{1}$$
Let the two corner coordinates of a pedestrian target region be $(x_1, y_1)$ and $(x_2, y_2)$. After the size reduction, the centre-point coordinates of the pedestrian target are $(c_i^x, c_i^y)=\bigl((x_1+x_2)/8,\;(y_1+y_2)/8\bigr)$, and the ground-truth heatmap response at a point $(x, y)$ with respect to the centre-point coordinates is given by formula (2):
$$M_{xy}=\sum_{i=1}^{N}\exp\!\left(-\frac{(x-c_i^x)^2+(y-c_i^y)^2}{2\sigma_c^2}\right)\tag{2}$$
where $N$ is the number of pedestrian targets in the image, $i$ indexes the pedestrian targets, and $\sigma_c$ is the standard deviation.
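A compact sketch of formulas (1) and (2) is given below. PyTorch is assumed, α = 2 and β = 4 are the commonly used focusing values rather than values taken from the patent, and clamping the ground-truth map to 1 is an implementation guard, not part of the formula.

```python
import torch

def gaussian_heatmap(centers, height, width, sigma_c):
    """Formula (2): ground-truth response M_xy as a sum of Gaussians centred
    on the size-reduced pedestrian centre points (c_x, c_y)."""
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    heatmap = torch.zeros(height, width)
    for cx, cy in centers:  # centres already divided by the output stride
        heatmap += torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma_c ** 2))
    return heatmap.clamp(max=1.0)

def center_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Formula (1): focal-loss variant between predicted and ground-truth heatmaps."""
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    n = pos.sum().clamp(min=1.0)           # number of pedestrian targets N
    pos_loss = ((1 - pred) ** alpha * torch.log(pred) * pos).sum()
    neg_loss = ((1 - gt) ** beta * pred ** alpha * torch.log(1 - pred) * neg).sum()
    return -(pos_loss + neg_loss) / n
```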
For the bounding-box size and offset error, two $\ell_1$ losses are chosen as loss functions. Given the corner coordinates of each pedestrian target, the loss function is formula (3):
$$L_{\text{box}}=\sum_{i=1}^{N}\Bigl(\bigl\lVert s_i-\hat{s}_i\bigr\rVert_1+\bigl\lVert o_i-\hat{o}_i\bigr\rVert_1\Bigr)\tag{3}$$
where $s_i$ is the true size of the pedestrian target, $o_i$ is the true offset of the pedestrian target size, $\hat{s}_i$ and $\hat{o}_i$ are the predicted size and offset, and $L_{\text{box}}$ is the localization loss obtained by adding the losses of the two branches.
The re-identification task is essentially a classification task, so the present invention selects the softmax loss as its loss function. An identity feature vector is extracted at the centre point of each pedestrian target on the obtained heatmap, learned, and mapped to a class distribution vector $p(k)$. Denoting the one-hot encoding of each pedestrian target by $L^{i}(k)$ and the number of classes by $K$, the loss function of the re-identification task is formula (4):
$$L_{\text{identity}}=-\sum_{i=1}^{N}\sum_{k=1}^{K}L^{i}(k)\log p(k)\tag{4}$$
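Formulas (3) and (4) reduce to an ℓ1 penalty over sizes and offsets and a cross-entropy over the K identity classes; a minimal sketch under the same PyTorch assumption is:

```python
import torch
import torch.nn.functional as F

def box_loss(pred_size, gt_size, pred_offset, gt_offset):
    """Formula (3): L_box as the sum of two l1 losses over size and offset."""
    return (torch.abs(pred_size - gt_size).sum()
            + torch.abs(pred_offset - gt_offset).sum())

def reid_loss(id_logits, id_targets):
    """Formula (4): softmax (cross-entropy) loss over the K identity classes,
    evaluated on the identity vectors extracted at the target centre points.
    id_logits: (N, K) un-normalised scores; id_targets: (N,) class indices."""
    return F.cross_entropy(id_logits, id_targets, reduction="sum")
```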
After all the above loss functions are set, the training-set images of the CUHK-SYSU, PRW and MOT16 datasets are selected as the training set, the training-set images of the 2DMOT15 dataset are used as the validation set, and the parameters of the initial feature extraction network model are trained. The number of training epochs is set to 36: the learning rate is 1e-4 for the first 31 epochs, 1e-5 for the following 4 epochs, and 1e-6 for the last epoch to reach convergence. The input image size during training is (1088, 608), the batch size is 6, the Adam optimizer is used for model optimization, ReLU is used as the activation function, and the regularization coefficient is set to 0.001. After training, the trained feature extraction network model is finally obtained.
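For concreteness, the optimizer and learning-rate schedule described above could be set up as sketched below; the stand-in model and the interpretation of the regularization coefficient 0.001 as Adam weight decay are assumptions.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the feature-extraction model.
model = nn.Conv2d(3, 64, kernel_size=3, padding=1)

# Adam with the regularisation coefficient 0.001 applied as weight decay (assumption).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)

# 36 epochs: 1e-4 for the first 31, 1e-5 for the next 4, 1e-6 for the last.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[31, 35], gamma=0.1)

for epoch in range(36):
    # ... one pass over the CUHK-SYSU / PRW / MOT16 training images,
    # batch size 6, inputs resized to 1088x608, would go here ...
    scheduler.step()
```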
4) Use the trained feature extraction network model to continuously detect and track pedestrian targets.
The specific steps are as follows (a condensed code sketch of one per-frame update is given after the steps):
4.1. First, take the first frame as the input image, initialize the distance matrix according to the label information of the input image and encapsulate it, obtaining the appearance information and motion information of the pedestrian targets for subsequent data matching;
4.2. Treat each pedestrian target as a class, instantiate each class as a tracked object through its bounding box, and, based on the detection results of the current frame, use the Kalman filter to predict the position information of the pedestrian target in the next frame;
4.3. Match the predicted position information against the appearance and motion information using the Mahalanobis distance metric to determine whether the tracking state of a pedestrian target is the initial default state, the confirmed state or the deleted state. The initial default state refers to a newly generated trajectory the first time a pedestrian target is detected; it is labelled with this state because the correctness of the detection cannot yet be confirmed. If matching succeeds in the following three consecutive frames, the tracking state changes from the initial default state to the confirmed state and the trajectory is confirmed as the tracking trajectory of a specific pedestrian target; if matching fails within the following three frames, the detection is regarded as a false detection, the trajectory is judged to be an erroneous track, its state changes from the initial default state to the deleted state, and the trajectory is deleted;
4.4. If the tracking state of a pedestrian target is the initial default state or the confirmed state, perform cascade matching and overlap (IOU) matching between the predicted boxes and the real (detected) boxes, which can yield three outcomes: a successful match, an unmatched track, or an unmatched detection. If the match is successful, use the Kalman filter to update the predicted value with the detected observation, update the appearance features of the pedestrian target, update the tracking trajectory, and repeat the above steps. If the outcome is an unmatched track, the trajectory is interrupted and is deleted. If the outcome is an unmatched detection, it may be a new pedestrian target, so it is initialized as a new trajectory and assigned a new tracker;
4.5. After updating the input image to the next frame, repeat steps 4.1 to 4.4. When tracking ends, the tracking result of every pedestrian target in each frame is obtained, continuous pedestrian tracking trajectories are determined, and the visualization results are finally output.
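Steps 4.1 to 4.5 follow a DeepSORT-style association loop. The condensed sketch below is a simplification, not the patented implementation: a constant-velocity prediction stands in for the full Kalman filter, and the combined Mahalanobis/appearance/IOU cost is abstracted into a generic `cost_fn` supplied by the caller.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class Track:
    """Minimal track record: box, velocity, appearance feature and state bookkeeping."""
    def __init__(self, box, feature):
        self.box = np.asarray(box, dtype=float)    # e.g. (cx, cy, w, h)
        self.velocity = np.zeros_like(self.box)
        self.feature = feature
        self.hits = 1                              # the initial detection counts as one hit
        self.misses = 0
        self.state = "tentative"                   # -> "confirmed" or "deleted"

    def predict(self):
        # Constant-velocity prediction, standing in for the Kalman predict step.
        self.box = self.box + self.velocity
        return self.box

def update_tracks(tracks, detections, features, cost_fn, max_cost=0.7):
    """One frame of association: predict, build a cost matrix from motion and
    appearance distances (cost_fn), solve the assignment with the Hungarian
    algorithm, and update track states as in steps 4.3 and 4.4."""
    detections = [np.asarray(d, dtype=float) for d in detections]
    for t in tracks:
        t.predict()
    matched_t, matched_d = set(), set()
    if tracks and detections:
        cost = np.array([[cost_fn(t, d, f) for d, f in zip(detections, features)]
                         for t in tracks])
        for r, c in zip(*linear_sum_assignment(cost)):
            if cost[r, c] <= max_cost:             # gate implausible assignments
                trk = tracks[r]
                trk.velocity = detections[c] - trk.box
                trk.box, trk.feature = detections[c], features[c]
                trk.hits += 1
                trk.misses = 0
                if trk.hits >= 4:                  # initial detection + three consecutive matches
                    trk.state = "confirmed"
                matched_t.add(r)
                matched_d.add(c)
    for i, trk in enumerate(tracks):
        if i not in matched_t:
            trk.misses += 1
            # Tentative tracks get three frames to confirm; an unmatched confirmed
            # track is treated as an interrupted trajectory and removed (step 4.4).
            if (trk.state == "tentative" and trk.misses >= 3) or trk.state == "confirmed":
                trk.state = "deleted"
    tracks[:] = [t for t in tracks if t.state != "deleted"]
    for j, det in enumerate(detections):
        if j not in matched_d:
            tracks.append(Track(det, features[j]))  # unmatched detection: start a new track
    return tracks
```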
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210057161.9A CN114387265A (en) | 2022-01-19 | 2022-01-19 | Anchor-frame-free detection and tracking unified method based on attention module addition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210057161.9A CN114387265A (en) | 2022-01-19 | 2022-01-19 | Anchor-frame-free detection and tracking unified method based on attention module addition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114387265A (en) | 2022-04-22 |
Family
ID=81203170
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210057161.9A Pending CN114387265A (en) | 2022-01-19 | 2022-01-19 | Anchor-frame-free detection and tracking unified method based on attention module addition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114387265A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114820545A (en) * | 2022-05-09 | 2022-07-29 | 天津大学 | A method for primary screening and detection of onychomycosis based on improved residual network |
CN114972805A (en) * | 2022-05-07 | 2022-08-30 | 杭州像素元科技有限公司 | Anchor-free joint detection and embedding-based multi-target tracking method |
CN115082517A (en) * | 2022-05-25 | 2022-09-20 | 华南理工大学 | Horse racing scene multi-target tracking method based on data enhancement |
CN117455955A (en) * | 2023-12-14 | 2024-01-26 | 武汉纺织大学 | Pedestrian multi-target tracking method based on unmanned aerial vehicle visual angle |
CN117576489A (en) * | 2024-01-17 | 2024-02-20 | 华侨大学 | Robust real-time target sensing methods, devices, equipment and media for intelligent robots |
CN117670938A (en) * | 2024-01-30 | 2024-03-08 | 江西方兴科技股份有限公司 | Multi-target space-time tracking method based on super-treatment robot |
CN117952287A (en) * | 2024-03-27 | 2024-04-30 | 飞友科技有限公司 | Prediction method and system for number of passengers in terminal building waiting area |
- 2022-01-19: CN application CN202210057161.9A, patent CN114387265A (en), status active, Pending
Non-Patent Citations (1)
Title |
---|
ZHANG Hongying et al.: "Pedestrian tracking algorithm based on a convolutional attention module and an anchor-free detection network", HTTPS://KNS.CNKI.NET/KCMS/DETAIL/11.4494.TN.20211008.1500.008.HTML, 9 October 2021 (2021-10-09) * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114972805A (en) * | 2022-05-07 | 2022-08-30 | 杭州像素元科技有限公司 | Anchor-free joint detection and embedding-based multi-target tracking method |
CN114820545A (en) * | 2022-05-09 | 2022-07-29 | 天津大学 | A method for primary screening and detection of onychomycosis based on improved residual network |
CN115082517A (en) * | 2022-05-25 | 2022-09-20 | 华南理工大学 | Horse racing scene multi-target tracking method based on data enhancement |
CN115082517B (en) * | 2022-05-25 | 2024-04-19 | 华南理工大学 | Multi-target tracking method in horse racing scene based on data enhancement |
CN117455955A (en) * | 2023-12-14 | 2024-01-26 | 武汉纺织大学 | Pedestrian multi-target tracking method based on unmanned aerial vehicle visual angle |
CN117455955B (en) * | 2023-12-14 | 2024-03-08 | 武汉纺织大学 | Pedestrian multi-target tracking method based on unmanned aerial vehicle visual angle |
CN117576489A (en) * | 2024-01-17 | 2024-02-20 | 华侨大学 | Robust real-time target sensing methods, devices, equipment and media for intelligent robots |
CN117576489B (en) * | 2024-01-17 | 2024-04-09 | 华侨大学 | Robust real-time target sensing method, device, equipment and medium for intelligent robot |
CN117670938A (en) * | 2024-01-30 | 2024-03-08 | 江西方兴科技股份有限公司 | Multi-target space-time tracking method based on super-treatment robot |
CN117670938B (en) * | 2024-01-30 | 2024-05-10 | 江西方兴科技股份有限公司 | Multi-target space-time tracking method based on super-treatment robot |
CN117952287A (en) * | 2024-03-27 | 2024-04-30 | 飞友科技有限公司 | Prediction method and system for number of passengers in terminal building waiting area |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114387265A (en) | Anchor-frame-free detection and tracking unified method based on attention module addition | |
Liu et al. | Multiscale U-shaped CNN building instance extraction framework with edge constraint for high-spatial-resolution remote sensing imagery | |
CN107967451B (en) | A Method for Crowd Counting on Still Images | |
CN108830171B (en) | Intelligent logistics warehouse guide line visual detection method based on deep learning | |
CN110660082A (en) | A target tracking method based on graph convolution and trajectory convolution network learning | |
CN110163188B (en) | Video processing and method, device and equipment for embedding target object in video | |
CN111950404B (en) | A single image 3D reconstruction method based on deep learning video supervision | |
Manssor et al. | Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network | |
Xing et al. | Traffic sign recognition using guided image filtering | |
CN111178284A (en) | Pedestrian re-identification method and system based on spatio-temporal union model of map data | |
CN111862145A (en) | A target tracking method based on multi-scale pedestrian detection | |
Tang et al. | Multiple-kernel based vehicle tracking using 3D deformable model and camera self-calibration | |
US20240161315A1 (en) | Accurate and robust visual object tracking approach for quadrupedal robots based on siamese network | |
CN115861619A (en) | Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network | |
CN114387496A (en) | Target detection method and electronic equipment | |
CN115661611A (en) | Infrared small target detection method based on improved Yolov5 network | |
CN114972182A (en) | Object detection method and device | |
CN116630850A (en) | Siamese object tracking method based on multi-attention task fusion and bounding box encoding | |
JP2019220174A (en) | Image processing using artificial neural network | |
CN114842447B (en) | A fast parking space recognition method based on convolutional neural network | |
CN114419338B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN113536997B (en) | Intelligent security system and method based on image recognition and behavior analysis | |
Maddileti et al. | Pseudo Trained YOLO R_CNN Model for Weapon Detection with a Real-Time Kaggle Dataset | |
CN114048536A (en) | A road structure prediction and target detection method based on multi-task neural network | |
CN114677330A (en) | An image processing method, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |