CN114387265A - Anchor-frame-free detection and tracking unified method based on attention module addition - Google Patents

Anchor-frame-free detection and tracking unified method based on attention module addition

Info

Publication number
CN114387265A
Authority
CN
China
Prior art keywords
tracking
feature extraction
network model
pedestrian
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210057161.9A
Other languages
Chinese (zh)
Inventor
张红颖
贺鹏艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation University of China filed Critical Civil Aviation University of China
Priority to CN202210057161.9A priority Critical patent/CN114387265A/en
Publication of CN114387265A publication Critical patent/CN114387265A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras

Abstract

An anchor-frame-free detection and tracking unified method based on added attention modules. It comprises obtaining preprocessed images; obtaining an initial feature extraction network model; obtaining a trained feature extraction network model; and continuously detecting and tracking pedestrian targets with the trained feature extraction network model. The invention has the following effects: a multi-task learning strategy greatly reduces the training time of the network; the trained network model has high accuracy and robustness; multi-scale information interaction is fully utilized, so that pedestrian target features with stronger expressive power are extracted and fused in depth and pedestrian targets are tracked accurately even in scenes where pedestrians occlude one another; second-generation residual blocks form the backbone of the network model and are combined with a more efficient attention module for information interaction, so the method achieves higher detection precision and stronger re-identification performance and is suitable for detecting and tracking pedestrian targets in scenes where passengers occlude one another severely, such as a terminal building.

Description

Anchor-frame-free detection and tracking unified method based on attention module addition
Technical Field
The invention belongs to the technical field of civil aviation, and particularly relates to an anchor-frame-free detection and tracking unified method based on added attention modules.
Background
With the wide application of intelligent video monitoring in public areas such as transportation hubs and commercial districts, and its strong performance in security, passenger-flow monitoring and similar tasks, the computer vision technology it relies on is also developing rapidly. Pedestrian tracking, a hot topic in computer vision, obtains the identity, position and motion trajectory of pedestrian targets by analyzing the captured video data, and compared with other localization methods it is more proactive, real-time and practical. In places such as airport terminal buildings it can therefore guide zone planning, provide personalized services for passengers, remind passengers of boarding information, and help maintain order and safety, which gives it clear application value.
With the wide application of deep learning in the field of computer vision, multi-target tracking algorithms based on deep learning have gradually come to dominate pedestrian tracking. Mainstream pedestrian tracking algorithms such as FairMOT and JDE share the same main idea: the multi-target tracking problem is divided into a detection part and a tracking part, the positions of pedestrians are obtained by a detection network, and tracking is then performed by matching consecutive frames with a data-association technique, so the performance of the target detection network largely determines the tracking performance.
Most existing pedestrian detection and tracking methods still follow a detect-then-track strategy, and the detection performance of the detection network on multiple targets is limited. As a result, problems such as low detection precision, target loss in occluded scenes, complex network parameters and poor real-time performance remain and urgently need to be solved.
Disclosure of Invention
In order to solve the above problems, the present invention aims to provide an anchor-frame-free detection and tracking unified method based on added attention modules.
In order to achieve the above purpose, the anchor-frame-free detection and tracking unified method based on added attention modules provided by the invention comprises the following steps, performed in sequence:
1) acquiring images of a passenger flow dense area in a terminal building and preprocessing the images to acquire preprocessed images, wherein each preprocessed image is provided with a label, and the label comprises position information of all pedestrian targets in a current frame image;
2) constructing an original feature extraction network model, inputting the preprocessed image into the original feature extraction network model for feature extraction to obtain an initial feature extraction network model;
3) respectively setting corresponding loss functions aiming at the target center point positioning, the boundary size, the offset error and the re-identification task of the detection task; then, training parameters of the initial feature extraction network model by using a large amount of existing data to obtain a trained feature extraction network model;
4) and continuously detecting and tracking the pedestrian target by using the trained feature extraction network model.
In step 1), the images of the passenger-flow-dense area in the terminal building are acquired and preprocessed as follows: a monitoring camera located in a passenger-flow-dense area of the airport terminal captures images of passengers walking and occluding one another at fixed time intervals during periods of heavy passenger flow, and the images are preprocessed by deblurring, noise reduction and resolution enhancement to obtain the preprocessed images.
In step 2), the method for constructing the original feature extraction network model and then inputting the preprocessed image into the original feature extraction network model for feature extraction to obtain the initial feature extraction network model comprises the following steps:
the original feature extraction network model is divided into five parts: stem and stage1 to stage4, where stem is the backbone (stem) network and stage1 to stage4 are the four subsequent stages;
firstly, the stem reduces the height and width of the preprocessed image to one quarter of the original size through two 3×3 convolution layers with stride 2, then performs feature extraction with 4 second-generation residual blocks (Bottle2neck) and feeds the output feature map into stage1; stage1 to stage3 perform feature extraction and fusion operations, namely each stage generates a new lower-resolution branch on the basis of the previous stage, then applies 4 basic residual blocks with two attention modules added (2eca-basic blocks) to every low-resolution branch for feature extraction, and finally performs repeated multi-scale fusion on the obtained feature maps before passing them to stage4; stage4 is the head network, in which the feature maps output by the three parallel lower-resolution branches are first upsampled to the size of the high-resolution branch by bilinear interpolation, and the final output feature map used for detection and re-identification is then obtained through a concatenation operation and a fully connected layer, yielding the initial feature extraction network model.
In the step 3), corresponding loss functions are respectively set for the target center point positioning, the boundary size, the offset error and the re-identification task of the detection task; then, using a large amount of existing data to train the parameters of the initial feature extraction network model, the method for obtaining the trained feature extraction network model comprises the following steps:
the loss function for target center point positioning uses a modified focal loss to compute the loss between the predicted heat map and the ground-truth heat map, which effectively handles the imbalance between the target center point and its surrounding points; the formula is shown in formula (1):

L_heat = -(1/N) · Σ_xy { (1 - M̂_xy)^α · log(M̂_xy) if M_xy = 1; (1 - M_xy)^β · (M̂_xy)^α · log(1 - M̂_xy) otherwise }      (1)

in formula (1), M̂_xy is the predicted heat map response value, M_xy is the ground-truth heat map response value, and α and β are the focusing parameters of the focal loss; let the two corner point coordinates of a pedestrian target region be (x1, y1) and (x2, y2), then the center point coordinates of the pedestrian target after size reduction are (c_i^x, c_i^y) = ((x1 + x2)/8, (y1 + y2)/8), and the ground-truth heat map response of a point (x, y) with respect to the center point coordinates is shown in formula (2):

M_xy = Σ_{i=1..N} exp(-((x - c_i^x)² + (y - c_i^y)²) / (2σ_c²))      (2)

where N represents the number of pedestrian targets in the image, i indexes the i-th pedestrian target, and σ_c represents the standard deviation;
boundary size and offset errors use the L1 loss as their loss function; with the corner coordinates given for each pedestrian target, the loss function is shown in formula (3):

L_box = Σ_{i=1..N} (|s_i - ŝ_i| + |o_i - ô_i|)      (3)

where s_i represents the true size of the pedestrian target, o_i represents the true offset of the pedestrian target, ŝ_i and ô_i are the predicted values of the size and the offset respectively, and L_box represents the localization loss obtained by adding the losses of the two branches;
the re-identification task is in essence a classification task, so the softmax loss is selected as its loss function: an identity feature vector is extracted at the center point of each pedestrian target on the obtained heat map for learning and is mapped into a class distribution vector p(k); the one-hot code of each pedestrian target is denoted L_i(k) and the number of identity classes is denoted K, so the loss function of the re-identification task is shown in formula (4):

L_id = -Σ_{i=1..N} Σ_{k=1..K} L_i(k) · log(p(k))      (4)
after all the loss functions are set, the training set images of the CUHK-SYSU, PRW and MOT16 data sets are selected as the training set, the training set images of the 2DMOT15 data set are selected as the validation set, and the parameters of the initial feature extraction network model are trained; the number of training iterations is set to 36 epochs, the learning rate of the first 31 epochs is set to 1e-4, the learning rate of the following 4 epochs is set to 1e-5, and the last epoch uses a learning rate of 1e-6 so that training converges; the input image size during training is (1088, 608), the batch size is set to 6, an Adam optimizer is used for model optimization, ReLU is used as the activation function, the regularization coefficient is set to 0.001, and the trained feature extraction network model is finally obtained after training is completed.
In step 4), the specific steps of continuously detecting and tracking the pedestrian target by using the trained feature extraction network model are as follows:
4.1. firstly, taking a first frame image as an input image, initializing a distance matrix according to label information of the input image and packaging to obtain appearance information and motion information of a pedestrian target for subsequent data matching;
4.2. each pedestrian target is used as a category, each category is instantiated through a boundary frame to be used as a tracking object, and the position information of the pedestrian target in the next frame of image is predicted by using a Kalman filtering method according to the current frame detection result;
4.3. matching the predicted position information with appearance information and motion information by using Mahalanobis distance measurement to judge whether the pedestrian target tracking state is an initial default state, a confirmed state or a deleted state; the initial default state refers to a state of detecting a newly generated motion track of a certain pedestrian target for the first time, and is marked as the state because whether a detection result is correct or not cannot be confirmed; if the matching is successful in the next three continuous frames of images, changing the tracking state of the pedestrian target from an initial default state to a confirmed state, and determining that the motion track is the tracking track of the specific pedestrian target; if the matching is not successful in the next three frames of images, the detection is regarded as false detection, the motion track is determined to be a false tracking track, the initial default state is changed into a deleting state, and the motion track is deleted;
4.4. if the pedestrian target tracking state is the initial default state or the confirmed state, cascade matching is carried out, followed by overlap (IOU) matching between the prediction frame and the real frame, which can produce three results: successful matches, unmatched tracks and unmatched detections; if the matching is successful, the predicted value and the detected observation are updated by the Kalman filtering method, the appearance feature of the pedestrian target is updated, the tracking track is updated, and the above steps are repeated; if the result is an unmatched track, the tracking track is considered interrupted and is deleted; if the result is an unmatched detection, it may be a new pedestrian target, so it is initialized as a new tracking track and a new tracker is allocated;
4.5. and after the input image is updated to be the next frame image, repeating the steps 4.1, 4.2, 4.3 and 4.4, and finally obtaining the tracking result of the pedestrian target in each frame image after the tracking is finished, so that the continuous pedestrian tracking track is determined, and finally, a visualization result is output.
The anchor-frame-free detection and tracking unified method based on added attention modules provided by the invention has the following beneficial effects:
(1) by adopting a multi-task learning strategy, the training time of the network is greatly reduced;
(2) the trained network model has higher accuracy and robustness;
(3) multi-scale information interaction is fully utilized, so that pedestrian target features with stronger expressive power are extracted and fused in depth and pedestrian targets are tracked accurately even in scenes where pedestrians occlude one another;
(4) second-generation residual blocks form the backbone of the network model and are combined with a more efficient attention module for information interaction, so the method achieves higher detection precision and stronger re-identification performance and is suitable for detecting and tracking pedestrian targets in scenes where passengers occlude one another severely, such as a terminal building.
Drawings
FIG. 1 is a flow chart of a unified method for detecting and tracking without an anchor frame based on an attention module according to the present invention.
Fig. 2 is a schematic diagram of the structure of the basic residual block with two attention modules added (2eca-basic block).
Fig. 3 is a diagram comparing the structure of the second generation residual block and the first generation residual block.
Fig. 4 is a schematic structural diagram of a feature extraction network model constructed in the method.
FIG. 5 is a flow chart of a pedestrian target tracking strategy.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the anchor-frame-free detection and tracking unified method based on added attention modules provided by the invention comprises the following steps, performed in sequence:
1) acquiring images of a passenger flow dense area in a terminal building and preprocessing the images to acquire preprocessed images, wherein each preprocessed image is provided with a label, and the label comprises position information of all pedestrian targets in a current frame image;
a monitoring camera located in a passenger-flow-dense area of the airport terminal is used to capture images of passengers walking and occluding one another at fixed time intervals during periods of heavy passenger flow, and the images are preprocessed by deblurring, noise reduction and resolution enhancement to obtain the preprocessed images, each of which carries a label; the label contains the position information of all pedestrian targets in the current frame image.
2) Constructing an original feature extraction network model, inputting the preprocessed image into the original feature extraction network model for feature extraction to obtain an initial feature extraction network model;
the structure of the original feature extraction network model is shown in fig. 4, and is divided into five stages: stem, stage1, stage2, stage3, stage 4; wherein stem is a backbone network; stage1 to stage4 are stage1 to stage 4; the up-arrow represents the up-sampling operation and the down-arrow represents the step-wise convolution for down-sampling; conv represents a convolutional layer, bn represents a batch normalization layer, eca represents an attention module, and a bottle2neck and a 2eca-basic block represent a second generation residual block and a reference residual block added with two layers of attention modules respectively;
firstly, the stem reduces the height and width of the preprocessed image to one quarter of the original size through two 3×3 convolution layers with stride 2, then performs feature extraction with 4 second-generation residual blocks (Bottle2neck) and feeds the output feature map into stage1; stage1 to stage3 perform feature extraction and fusion operations, namely each stage generates a new lower-resolution branch on the basis of the previous stage, then applies 4 basic residual blocks with two attention modules added (2eca-basic blocks) to every low-resolution branch for feature extraction, and finally performs repeated multi-scale fusion on the obtained feature maps before passing them to stage4; stage4 is the head network, in which the feature maps output by the three parallel lower-resolution branches are first upsampled to the size of the high-resolution branch by bilinear interpolation, and the final output feature map used for detection and re-identification is then obtained through a concatenation operation and a fully connected layer, yielding the initial feature extraction network model;
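As an illustrative sketch only (not the claimed implementation), the stem described above can be pictured in PyTorch as two stride-2 convolutions that bring the input to one quarter of its height and width; the channel width of 64 is an assumption, and the four Bottle2neck blocks of the stem are omitted here (a separate sketch of that block follows further below).

```python
import torch
import torch.nn as nn

# Minimal stem sketch: two 3x3, stride-2 convolutions reduce H and W to 1/4.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),  # -> H/2, W/2
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, bias=False),  # -> H/4, W/4
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 608, 1088)   # (batch, channels, height, width) as in training
print(stem(x).shape)               # torch.Size([1, 64, 152, 272])
```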
the structure of the basic residual block with two attention modules added (2eca-basic block) is shown in fig. 2, where each cube represents a feature map, H, W and C represent the height, width and channel dimensions of the feature map respectively, GAP represents the global average pooling operation, 1 × 1 × C denotes a one-dimensional convolution, and k = 5 denotes the convolution kernel size. The attention module adopts a local cross-channel interaction strategy without dimensionality reduction and adaptively selects the size of the one-dimensional convolution kernel, i.e. the coverage of the local cross-channel interaction, according to the proportional relation between the kernel size and the channel dimension. The attention module fuses the preliminarily extracted feature map with the local cross-channel information obtained after global average pooling and the convolution operation, thereby enhancing feature expression. At the same time, the attention module uses a cross-domain connection scheme, so the extra parameters it introduces are almost negligible and it can be widely applied to various convolutional networks. Structurally, the attention module mainly consists of three layers: an average pooling layer, a convolution layer and an activation layer. The convolution layer is a one-dimensional convolution; different settings of its kernel size lead to different receptive fields of the finally extracted features and therefore affect experimental performance. Because network structures of different depths have different sensitivity to the kernel size, the optimal kernel size needs to be found experimentally to improve the performance of the attention module as much as possible, while manually tuning the kernel size by cross-validation greatly increases the amount of computation and wastes computing power. For this reason, a grouped convolution (group convolution) idea is used in the attention module, and a mechanism for adaptively selecting the convolution kernel size is provided by defining a proportional relation between the kernel size and the channel dimension.
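For reference, the following is a sketch of the standard ECA (efficient channel attention) module in PyTorch. The adaptive kernel-size rule with gamma = 2 and b = 1 follows the published ECA-Net formulation and is an assumption here; the fixed k = 5 shown in fig. 2 and the grouped-convolution variant described in the patent may differ in detail.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: GAP -> 1-D convolution over channels -> sigmoid gate.
    The kernel size is chosen adaptively from the channel count (local cross-channel
    interaction without dimensionality reduction)."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                  # force an odd kernel size
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, C, H, W) -> channel descriptor of shape (N, C, 1, 1)
        y = self.avg_pool(x)
        # treat the channels as a 1-D sequence for the local cross-channel interaction
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        return x * self.sigmoid(y)                 # re-weight the input feature map

print(ECA(64)(torch.randn(2, 64, 38, 68)).shape)   # shape is preserved
```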
Fig. 3 compares the structures of the second-generation residual block and the first-generation residual block. The left diagram in fig. 3 is the network structure of the first-generation residual block (Bottleneck), which consists of three convolution layers of 1 × 1, 3 × 3 and 1 × 1 with a skip connection between input and output; the right diagram is the network structure of the second-generation residual block (Bottle2neck), whose main modification is to split the single 3 × 3 convolution, along the channel dimension, into four branches ranging from no convolution to three cascaded 3 × 3 convolutions. Compared with the first-generation residual block, the second-generation residual block (Bottle2neck) has a larger receptive field, stronger feature extraction capability and better generalization.
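The following is a simplified sketch of such a second-generation residual block in the spirit of Res2Net; the equal channel counts, the absence of batch normalization and the expansion-free layout are simplifications made for brevity, not the exact block of the patent.

```python
import torch
import torch.nn as nn

class Bottle2neck(nn.Module):
    """Simplified second-generation residual block (scale = 4): the middle 3x3
    convolution is split into four channel groups, the first group is passed through
    unchanged, and the remaining three pass through cascaded 3x3 convolutions."""
    def __init__(self, channels, scale=4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, bias=False) for _ in range(scale - 1)
        )
        self.conv3 = nn.Conv2d(channels, channels, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        splits = torch.chunk(out, self.scale, dim=1)
        ys = [splits[0]]                       # first branch: no convolution
        prev = None
        for i, conv in enumerate(self.convs):
            inp = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = self.relu(conv(inp))        # hierarchical (cascaded) connection
            ys.append(prev)
        out = self.conv3(torch.cat(ys, dim=1))
        return self.relu(out + x)              # identity skip connection

print(Bottle2neck(64)(torch.randn(1, 64, 38, 68)).shape)   # shape is preserved
```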
3) Respectively setting corresponding loss functions aiming at the target center point positioning, the boundary size, the offset error and the re-identification task of the detection task; then, training parameters of the initial feature extraction network model by using a large amount of existing data to obtain a trained feature extraction network model;
the loss function for target center point positioning uses a modified focal loss to compute the loss between the predicted heat map and the ground-truth heat map, which effectively handles the imbalance between the target center point and its surrounding points; the formula is shown in formula (1):

L_heat = -(1/N) · Σ_xy { (1 - M̂_xy)^α · log(M̂_xy) if M_xy = 1; (1 - M_xy)^β · (M̂_xy)^α · log(1 - M̂_xy) otherwise }      (1)

In formula (1), M̂_xy is the predicted heat map response value, M_xy is the ground-truth heat map response value, and α and β are the focusing parameters of the focal loss. Let the two corner point coordinates of a pedestrian target region be (x1, y1) and (x2, y2); the center point coordinates of the pedestrian target after size reduction are (c_i^x, c_i^y) = ((x1 + x2)/8, (y1 + y2)/8), and the ground-truth heat map response of a point (x, y) with respect to the center point coordinates is shown in formula (2):

M_xy = Σ_{i=1..N} exp(-((x - c_i^x)² + (y - c_i^y)²) / (2σ_c²))      (2)

where N represents the number of pedestrian targets in the image, i indexes the i-th pedestrian target, and σ_c represents the standard deviation.
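As a concrete sketch of this term, the code below implements a penalty-reduced focal loss over the heat map together with the Gaussian response of formula (2); the focusing parameters alpha = 2 and beta = 4 are assumed values, not taken from the patent.

```python
import math
import torch

def gaussian_response(x, y, cx, cy, sigma_c):
    """Ground-truth heat-map response of a point (x, y) with respect to one centre
    (cx, cy), following the Gaussian form of formula (2) for a single target."""
    return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma_c ** 2))

def heatmap_focal_loss(pred, gt, alpha=2, beta=4):
    """Penalty-reduced focal loss between the predicted heat map `pred` and the
    ground-truth heat map `gt` (same shape, values in (0, 1)); alpha and beta are
    assumed focusing parameters."""
    pred = pred.clamp(1e-6, 1 - 1e-6)
    pos = gt.eq(1).float()                        # exact centre points
    neg = 1.0 - pos
    pos_term = pos * (1 - pred) ** alpha * torch.log(pred)
    neg_term = neg * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred)
    num_pos = pos.sum().clamp(min=1)              # N: number of pedestrian targets
    return -(pos_term.sum() + neg_term.sum()) / num_pos
```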
Boundary size and offset errors use the L1 loss as their loss function; with the corner coordinates given for each pedestrian target, the loss function is shown in formula (3):

L_box = Σ_{i=1..N} (|s_i - ŝ_i| + |o_i - ô_i|)      (3)

where s_i represents the true size of the pedestrian target, o_i represents the true offset of the pedestrian target, ŝ_i and ô_i are the predicted values of the size and the offset respectively, and L_box represents the localization loss obtained by adding the losses of the two branches.
The re-identification task is in essence a classification task, so the invention selects the softmax loss as its loss function: an identity feature vector is extracted at the center point of each pedestrian target on the obtained heat map for learning and is mapped into a class distribution vector p(k); the one-hot code of each pedestrian target is denoted L_i(k) and the number of identity classes is denoted K, so the loss function of the re-identification task is shown in formula (4):

L_id = -Σ_{i=1..N} Σ_{k=1..K} L_i(k) · log(p(k))      (4)
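The two remaining terms could be written with standard PyTorch primitives as in the sketch below; the reduction mode and the example shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def box_loss(size_pred, size_gt, offset_pred, offset_gt):
    """L1 localization loss in the sense of formula (3): the size-branch and
    offset-branch losses are added to give L_box."""
    return (F.l1_loss(size_pred, size_gt, reduction="sum")
            + F.l1_loss(offset_pred, offset_gt, reduction="sum"))

def reid_loss(identity_logits, identity_labels):
    """Softmax (cross-entropy) loss in the sense of formula (4); `identity_logits`
    has shape (num_targets, K) and `identity_labels` holds each target's class index."""
    return F.cross_entropy(identity_logits, identity_labels)

# Example with fabricated shapes: 5 targets, 300 identity classes.
s_hat, s = torch.rand(5, 2), torch.rand(5, 2)
o_hat, o = torch.rand(5, 2), torch.rand(5, 2)
logits, labels = torch.randn(5, 300), torch.randint(0, 300, (5,))
print(box_loss(s_hat, s, o_hat, o).item(), reid_loss(logits, labels).item())
```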
after all the loss functions are set, the training set images of the CUHK-SYSU, PRW and MOT16 data sets are selected as the training set, the training set images of the 2DMOT15 data set are selected as the validation set, and the parameters of the initial feature extraction network model are trained. The number of training iterations is set to 36 epochs: the learning rate of the first 31 epochs is set to 1e-4, the learning rate of the following 4 epochs is set to 1e-5, and the last epoch uses a learning rate of 1e-6 so that training converges. The input image size during training is (1088, 608), the batch size is set to 6, an Adam optimizer is used for model optimization, ReLU is used as the activation function, the regularization coefficient is set to 0.001, and the trained feature extraction network model is finally obtained after training is completed.
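The learning-rate schedule and optimizer settings described above could be wired up as in the minimal sketch below; the model is only a stand-in, and mapping the regularization coefficient of 0.001 to Adam's weight_decay is an assumption.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)   # stand-in for the feature extraction network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.001)

def lr_for_epoch(epoch):
    """36 epochs in total: 1e-4 for epochs 0-30, 1e-5 for epochs 31-34, 1e-6 for the last."""
    if epoch < 31:
        return 1e-4
    if epoch < 35:
        return 1e-5
    return 1e-6

for epoch in range(36):
    for group in optimizer.param_groups:
        group["lr"] = lr_for_epoch(epoch)
    # one pass over the (1088, 608) training images with batch size 6 would go here
```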
4) And continuously detecting and tracking the pedestrian target by using the trained feature extraction network model.
The method comprises the following specific steps:
4.1. firstly, taking a first frame image as an input image, initializing a distance matrix according to label information of the input image and packaging to obtain appearance information and motion information of a pedestrian target for subsequent data matching;
4.2. each pedestrian target is used as a category, each category is instantiated through a boundary frame to be used as a tracking object, and the position information of the pedestrian target in the next frame of image is predicted by using a Kalman filtering method according to the current frame detection result;
4.3. matching the predicted position information with appearance information and motion information by using Mahalanobis distance measurement to judge whether the pedestrian target tracking state is an initial default state, a confirmed state or a deleted state; the initial default state refers to a state of detecting a newly generated motion track of a certain pedestrian target for the first time, and is marked as the state because whether a detection result is correct or not cannot be confirmed; if the matching is successful in the next three continuous frames of images, changing the tracking state of the pedestrian target from an initial default state to a confirmed state, and determining that the motion track is the tracking track of the specific pedestrian target; if the matching is not successful in the next three frames of images, the detection is regarded as false detection, the motion track is determined to be a false tracking track, the initial default state is changed into a deleting state, and the motion track is deleted;
4.4. if the pedestrian target tracking state is the initial default state or the confirmed state, cascade matching is carried out, followed by overlap (IOU) matching between the prediction frame and the real frame, which can produce three results: successful matches, unmatched tracks and unmatched detections; if the matching is successful, the predicted value and the detected observation are updated by the Kalman filtering method, the appearance feature of the pedestrian target is updated, the tracking track is updated, and the above steps are repeated; if the result is an unmatched track, the tracking track is considered interrupted and is deleted; if the result is an unmatched detection, it may be a new pedestrian target, so it is initialized as a new tracking track and a new tracker is allocated;
4.5. and after the input image is updated to be the next frame image, repeating the steps 4.1, 4.2, 4.3 and 4.4, and finally obtaining the tracking result of the pedestrian target in each frame image after the tracking is finished, so that the continuous pedestrian tracking track is determined, and finally, a visualization result is output.
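To illustrate the track life-cycle of steps 4.1 to 4.5, the following is a minimal sketch of the state handling and the IOU computation; the Kalman filter, the Mahalanobis gating and the cascade matching are only indicated by comments, the three-frame confirmation rule comes from the text, and the 30-frame deletion age is an assumption.

```python
from enum import Enum

class TrackState(Enum):
    TENTATIVE = 1   # initial default state of a newly created track
    CONFIRMED = 2
    DELETED = 3

class Track:
    """Minimal track life-cycle sketch for steps 4.3-4.4: a new track becomes confirmed
    after being matched in three consecutive frames; an unmatched tentative track is
    treated as a false detection and deleted."""
    def __init__(self, bbox):
        self.bbox = bbox
        self.state = TrackState.TENTATIVE
        self.consecutive_matches = 0
        self.frames_since_match = 0

    def mark_matched(self, bbox):
        self.bbox = bbox                    # the full method performs a Kalman update here
        self.consecutive_matches += 1
        self.frames_since_match = 0
        if self.state is TrackState.TENTATIVE and self.consecutive_matches >= 3:
            self.state = TrackState.CONFIRMED

    def mark_missed(self, max_age=30):      # max_age is an assumed parameter
        self.consecutive_matches = 0
        self.frames_since_match += 1
        if self.state is TrackState.TENTATIVE or self.frames_since_match > max_age:
            self.state = TrackState.DELETED

def iou(a, b):
    """Overlap (IOU) of two boxes given as (x1, y1, x2, y2), used in step 4.4."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```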

Claims (5)

1. An anchor-frame-free detection and tracking unified method based on an added attention module, characterized in that: the anchor-frame-free detection and tracking unified method based on the added attention module comprises the following steps in sequence:
1) acquiring images of a passenger flow dense area in a terminal building and preprocessing the images to acquire preprocessed images, wherein each preprocessed image is provided with a label, and the label comprises position information of all pedestrian targets in a current frame image;
2) constructing an original feature extraction network model, inputting the preprocessed image into the original feature extraction network model for feature extraction to obtain an initial feature extraction network model;
3) respectively setting corresponding loss functions aiming at the target center point positioning, the boundary size, the offset error and the re-identification task of the detection task; then, training parameters of the initial feature extraction network model by using a large amount of existing data to obtain a trained feature extraction network model;
4) and continuously detecting and tracking the pedestrian target by using the trained feature extraction network model.
2. The anchor-frame-free detection and tracking unified method based on attention-added module as claimed in claim 1, wherein: in step 1), the method for acquiring and preprocessing the image of the passenger flow dense area in the terminal building to obtain the preprocessed image comprises the following steps: the method comprises the steps of utilizing a monitoring camera located in a passenger flow dense area in an airport terminal to shoot images of passengers in the walking and shielding processes at fixed time intervals in a time period with larger passenger flow, and carrying out preprocessing including deblurring, noise reduction and resolution improvement on the images to obtain preprocessed images.
3. The anchor-frame-free detection and tracking unified method based on attention-added module as claimed in claim 1, wherein: in step 2), the method for constructing the original feature extraction network model and then inputting the preprocessed image into the original feature extraction network model for feature extraction to obtain the initial feature extraction network model comprises the following steps:
the original feature extraction network model is divided into five parts: stem and stage1 to stage4, where stem is the backbone network and stage1 to stage4 are the four subsequent stages;
firstly, the stem reduces the height and width of the preprocessed image to one quarter of the original size through two 3×3 convolution layers with stride 2, then performs feature extraction with 4 second-generation residual blocks (Bottle2neck) and feeds the output feature map into stage1; stage1 to stage3 perform feature extraction and fusion operations, namely each stage generates a new lower-resolution branch on the basis of the previous stage, then applies 4 basic residual blocks with two attention modules added (2eca-basic blocks) to every low-resolution branch for feature extraction, and finally performs repeated multi-scale fusion on the obtained feature maps before passing them to stage4; stage4 is the head network, in which the feature maps output by the three parallel lower-resolution branches are first upsampled to the size of the high-resolution branch by bilinear interpolation, the final output feature map used for detection and re-identification is then obtained through a concatenation operation and a fully connected layer, and an initial feature extraction network model is obtained.
4. The anchor-frame-free detection and tracking unified method based on attention-added module as claimed in claim 1, wherein: in the step 3), corresponding loss functions are respectively set for the target center point positioning, the boundary size, the offset error and the re-identification task of the detection task; then, using a large amount of existing data to train the parameters of the initial feature extraction network model, the method for obtaining the trained feature extraction network model comprises the following steps:
the loss function for target center point positioning uses a modified focal loss to compute the loss between the predicted heat map and the ground-truth heat map, which effectively handles the imbalance between the target center point and its surrounding points; the formula is shown in formula (1):

L_heat = -(1/N) · Σ_xy { (1 - M̂_xy)^α · log(M̂_xy) if M_xy = 1; (1 - M_xy)^β · (M̂_xy)^α · log(1 - M̂_xy) otherwise }      (1)

in formula (1), M̂_xy is the predicted heat map response value, M_xy is the ground-truth heat map response value, and α and β are the focusing parameters of the focal loss; let the two corner point coordinates of a pedestrian target region be (x1, y1) and (x2, y2), then the center point coordinates of the pedestrian target after size reduction are (c_i^x, c_i^y) = ((x1 + x2)/8, (y1 + y2)/8), and the ground-truth heat map response of a point (x, y) with respect to the center point coordinates is shown in formula (2):

M_xy = Σ_{i=1..N} exp(-((x - c_i^x)² + (y - c_i^y)²) / (2σ_c²))      (2)

where N represents the number of pedestrian targets in the image, i indexes the i-th pedestrian target, and σ_c represents the standard deviation;
boundary size and offset errors use the L1 loss as their loss function; with the corner coordinates given for each pedestrian target, the loss function is shown in formula (3):

L_box = Σ_{i=1..N} (|s_i - ŝ_i| + |o_i - ô_i|)      (3)

where s_i represents the true size of the pedestrian target, o_i represents the true offset of the pedestrian target, ŝ_i and ô_i are the predicted values of the size and the offset respectively, and L_box represents the localization loss obtained by adding the losses of the two branches;
the re-identification task is in essence a classification task, so the softmax loss is selected as its loss function: an identity feature vector is extracted at the center point of each pedestrian target on the obtained heat map for learning and is mapped into a class distribution vector p(k); the one-hot code of each pedestrian target is denoted L_i(k) and the number of identity classes is denoted K, so the loss function of the re-identification task is shown in formula (4):

L_id = -Σ_{i=1..N} Σ_{k=1..K} L_i(k) · log(p(k))      (4)
after all the loss functions are set, the training set images of the CUHK-SYSU, PRW and MOT16 data sets are selected as the training set, the training set images of the 2DMOT15 data set are selected as the validation set, and the parameters of the initial feature extraction network model are trained; the number of training iterations is set to 36 epochs, the learning rate of the first 31 epochs is set to 1e-4, the learning rate of the following 4 epochs is set to 1e-5, and the last epoch uses a learning rate of 1e-6 so that training converges; the input image size during training is (1088, 608), the batch size is set to 6, an Adam optimizer is used for model optimization, ReLU is used as the activation function, the regularization coefficient is set to 0.001, and the trained feature extraction network model is finally obtained after training is completed.
5. The anchor-frame-free detection and tracking unified method based on attention-added module as claimed in claim 1, wherein: in step 4), the specific steps of continuously detecting and tracking the pedestrian target by using the trained feature extraction network model are as follows:
4.1. firstly, taking a first frame image as an input image, initializing a distance matrix according to label information of the input image and packaging to obtain appearance information and motion information of a pedestrian target for subsequent data matching;
4.2. each pedestrian target is used as a category, each category is instantiated through a boundary frame to be used as a tracking object, and the position information of the pedestrian target in the next frame of image is predicted by using a Kalman filtering method according to the current frame detection result;
4.3. matching the predicted position information with appearance information and motion information by using Mahalanobis distance measurement to judge whether the pedestrian target tracking state is an initial default state, a confirmed state or a deleted state; the initial default state refers to a state of detecting a newly generated motion track of a certain pedestrian target for the first time, and is marked as the state because whether a detection result is correct or not cannot be confirmed; if the matching is successful in the next three continuous frames of images, changing the tracking state of the pedestrian target from an initial default state to a confirmed state, and determining that the motion track is the tracking track of the specific pedestrian target; if the matching is not successful in the next three frames of images, the detection is regarded as false detection, the motion track is determined to be a false tracking track, the initial default state is changed into a deleting state, and the motion track is deleted;
4.4. if the pedestrian target tracking state is the initial default state or the confirmed state, cascade matching is carried out, followed by overlap (IOU) matching between the prediction frame and the real frame, which can produce three results: successful matches, unmatched tracks and unmatched detections; if the matching is successful, the predicted value and the detected observation are updated by the Kalman filtering method, the appearance feature of the pedestrian target is updated, the tracking track is updated, and the above steps are repeated; if the result is an unmatched track, the tracking track is considered interrupted and is deleted; if the result is an unmatched detection, it may be a new pedestrian target, so it is initialized as a new tracking track and a new tracker is allocated;
4.5. and after the input image is updated to be the next frame image, repeating the steps 4.1, 4.2, 4.3 and 4.4, and finally obtaining the tracking result of the pedestrian target in each frame image after the tracking is finished, so that the continuous pedestrian tracking track is determined, and finally, a visualization result is output.
CN202210057161.9A 2022-01-19 2022-01-19 Anchor-frame-free detection and tracking unified method based on attention module addition Pending CN114387265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210057161.9A CN114387265A (en) 2022-01-19 2022-01-19 Anchor-frame-free detection and tracking unified method based on attention module addition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210057161.9A CN114387265A (en) 2022-01-19 2022-01-19 Anchor-frame-free detection and tracking unified method based on attention module addition

Publications (1)

Publication Number Publication Date
CN114387265A true CN114387265A (en) 2022-04-22

Family

ID=81203170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210057161.9A Pending CN114387265A (en) 2022-01-19 2022-01-19 Anchor-frame-free detection and tracking unified method based on attention module addition

Country Status (1)

Country Link
CN (1) CN114387265A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972805A (en) * 2022-05-07 2022-08-30 杭州像素元科技有限公司 Anchor-free joint detection and embedding-based multi-target tracking method
CN115082517A (en) * 2022-05-25 2022-09-20 华南理工大学 Horse racing scene multi-target tracking method based on data enhancement
CN115082517B (en) * 2022-05-25 2024-04-19 华南理工大学 Horse racing scene multi-target tracking method based on data enhancement
CN117455955A (en) * 2023-12-14 2024-01-26 武汉纺织大学 Pedestrian multi-target tracking method based on unmanned aerial vehicle visual angle
CN117455955B (en) * 2023-12-14 2024-03-08 武汉纺织大学 Pedestrian multi-target tracking method based on unmanned aerial vehicle visual angle
CN117576489A (en) * 2024-01-17 2024-02-20 华侨大学 Robust real-time target sensing method, device, equipment and medium for intelligent robot
CN117576489B (en) * 2024-01-17 2024-04-09 华侨大学 Robust real-time target sensing method, device, equipment and medium for intelligent robot
CN117670938A (en) * 2024-01-30 2024-03-08 江西方兴科技股份有限公司 Multi-target space-time tracking method based on super-treatment robot
CN117670938B (en) * 2024-01-30 2024-05-10 江西方兴科技股份有限公司 Multi-target space-time tracking method based on super-treatment robot
CN117952287A (en) * 2024-03-27 2024-04-30 飞友科技有限公司 Prediction method and system for number of passengers in terminal building waiting area

Similar Documents

Publication Publication Date Title
CN114387265A (en) Anchor-frame-free detection and tracking unified method based on attention module addition
Ko et al. Key points estimation and point instance segmentation approach for lane detection
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN108830171B (en) Intelligent logistics warehouse guide line visual detection method based on deep learning
CN111797716A (en) Single target tracking method based on Siamese network
CN111626128A (en) Improved YOLOv 3-based pedestrian detection method in orchard environment
CN110659664B (en) SSD-based high-precision small object identification method
CN116258608B (en) Water conservancy real-time monitoring information management system integrating GIS and BIM three-dimensional technology
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
CN117274749B (en) Fused 3D target detection method based on 4D millimeter wave radar and image
CN113092807B (en) Urban overhead road vehicle speed measuring method based on multi-target tracking algorithm
Li et al. Enhancing 3-D LiDAR point clouds with event-based camera
CN117523514A (en) Cross-attention-based radar vision fusion data target detection method and system
CN114820765A (en) Image recognition method and device, electronic equipment and computer readable storage medium
CN113724293A (en) Vision-based intelligent internet public transport scene target tracking method and system
CN113536997A (en) Intelligent security system and method based on image recognition and behavior analysis
AU2019100967A4 (en) An environment perception system for unmanned driving vehicles based on deep learning
CN114820931B (en) Virtual reality-based CIM (common information model) visual real-time imaging method for smart city
CN116597419A (en) Vehicle height limiting scene identification method based on parameterized mutual neighbors
CN116468950A (en) Three-dimensional target detection method for neighborhood search radius of class guide center point
CN114820723A (en) Online multi-target tracking method based on joint detection and association
CN115187959A (en) Method and system for landing flying vehicle in mountainous region based on binocular vision
CN114897939A (en) Multi-target tracking method and system based on deep path aggregation network
CN113920733A (en) Traffic volume estimation method and system based on deep network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination