CN116342645A - Multi-target tracking method for natatorium scene - Google Patents

Multi-target tracking method for natatorium scene

Info

Publication number
CN116342645A
Authority
CN
China
Prior art keywords
frame
tracking
human body
human
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310128609.6A
Other languages
Chinese (zh)
Inventor
王晓航
朱鹏飞
郭东岩
张剑华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202310128609.6A priority Critical patent/CN116342645A/en
Publication of CN116342645A publication Critical patent/CN116342645A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30204 Marker
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/80 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in fisheries management
    • Y02A40/81 Aquaculture, e.g. of fish

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target tracking method for a natatorium scene. A target detection network model comprising a basic target detection module and a matching module is constructed; the trained model performs target detection and matching on the video image sequence captured by cameras in the natatorium, producing paired human head frames and human body frames as the target detection result. A multi-target tracking algorithm is then applied to these paired head-frame and body-frame objects to obtain target trajectories. The model framework is simple, can effectively classify and track both people on the poolside and swimmers in the natatorium, and can detect and track swimmers in real time to acquire their position and behavior information.

Description

Multi-target tracking method for natatorium scene
Technical Field
The application belongs to the technical field of computer vision, and particularly relates to a multi-target tracking method for a natatorium scene.
Background
Target tracking is one of the research hot spots in the field of computer vision and is divided into single-target tracking and multi-target tracking. The former tracks a single target in a video, while the latter tracks multiple targets in the video simultaneously to obtain their motion trajectories. Vision-based multi-target tracking has become an important research focus in computer vision in recent years, mainly because of its important applications in intelligent surveillance, action and behavior analysis, autonomous driving, virtual reality, entertainment interaction, and other fields.
The multi-target tracking problem can be divided into online tracking and offline (batch) tracking according to whether frames after the current frame are used. In industrial applications, online tracking gradually generates new trajectories in temporal order from the current detection results and the historical trajectories. Since online multi-target tracking can process the input video sequence in real time, it can be applied in real scenes. Current online multi-target tracking follows two main paradigms: tracking-by-detection (TBD) and joint detection and tracking (JDT). TBD treats detection and tracking as two independent tasks: an existing object detector detects objects in each frame of the video sequence, and the objects are cropped according to their bounding boxes, yielding all objects in the image. Tracking is then cast as a target-association problem between consecutive frames: a similarity matrix is constructed from IoU, appearance, and other cues, and solved with the Hungarian algorithm, a greedy algorithm, or similar methods. JDT is a multi-target tracking paradigm with an end-to-end trainable detection framework that jointly learns detection and appearance features and can perform detection and tracking within a single neural network. Although joint detection and tracking can be integrated with the target detection network from the structural perspective of deep neural networks, detection-based online multi-target tracking remains the mainstream approach for industrial deployment.
In detection-based online multi-target tracking algorithms, equipping the detection stage with a higher-performance deep-learning detector, such as R-CNN, SSD, or YOLO, can greatly improve overall tracking performance. The classic SORT algorithm builds on the traditional Hungarian matching algorithm and replaces the original aggregate-channel-features detector with a Faster R-CNN detection network, greatly improving both tracking accuracy and speed. Subsequent studies have also shown a strong positive correlation between detection accuracy and online multi-target tracking performance. Algorithms that take good detection results as input and separately improve the target state prediction module, the appearance feature extraction module, and the data association module all belong to detection-based multi-target tracking. Appearance features of targets are widely used in multi-target tracking association because they are more stable and can re-connect long-occluded targets more robustly than baseline trackers that use only simple motion modeling. For example, building on SORT, DeepSORT uses a ResNet network pre-trained on a ReID dataset to extract target appearance features and fuses the appearance similarity measure into the association cost, greatly reducing target identity switches during tracking.
At present, online multi-target tracking algorithms are applied in many fields; tracking all targets in a natatorium scene belongs to the field of intelligent surveillance. In swimming pools, beginners may lose balance in the water due to lack of training, drown, and die; some swimmers may drown due to injury, cramps, sudden illness, and so on. Swimming pool safety currently relies mostly on traditional manual supervision, but a lifeguard cannot monitor all areas of the pool around the clock. Existing drowning-prevention devices include pressure sensors, heartbeat monitoring sensors, motion sensors, cameras, etc. These devices have extremely high maintenance costs, and it is difficult for them to monitor all personnel. Vision-based tracking of targets in the natatorium is significant in the following ways: 1. the passenger flow of the natatorium can be counted, preventing accident risks such as overcrowding and physical conflicts; 2. all swimmers in the video can be detected and tracked in real time, position and behavior information of tracked persons can be obtained frame by frame from the image sequence, and behavior recognition and analysis can determine whether drowning occurs, preventing drowning accidents; 3. whether the lifeguards in the natatorium are on duty can be checked in real time, preventing a delayed response to a sudden drowning incident because a lifeguard has left their post.
Current vision-based methods for tracking targets in a natatorium include the following. Poseidon, developed by the French company Vision IQ, monitors swimmers' activities in real time using cameras above the pool and an underwater network system; the device determines swimmers' trajectories through image processing, and the system takes a human body submerged at the bottom of the pool as the basis for judging drowning. DEWS, a drowning early-warning system researched by a group at Nanyang Technological University in Singapore, judges whether a swimmer is drowning by analyzing the characteristics of and differences between human drowning behavior and normal swimming behavior. The University of Science and Technology Beijing in China collects video with underwater cameras to detect swimmers. The Taiwan University of Science and Technology uses Haar features and the Adaboost algorithm to detect swimmers and Kalman filtering to track the detected swimmers, but the features used for detection are limited, so swimmers cannot be well separated from the pool background, and only a single swimmer is tracked.
From the above research methods, the main technical difficulties of multi-target tracking in a natatorium scene are the following:
(1) How to distinguish swimmers from people on the poolside within the natatorium. Cameras at different angles in the natatorium monitor personnel across the entire facility. Since drowning detection applies only to swimmers, swimmers must be distinguished from people on shore in preparation for subsequent behavior recognition.
(2) How to accurately pair the detected human head frame and human body frame. The target detection network only detects heads and bodies in the natatorium as separate classes and lacks an operation associating the head frame and body frame of the same target. With the association result, a subsequent multi-target tracking algorithm can track the head frame and body frame of the same target as a whole.
(3) Using only human body feature information when detecting and tracking swimmers lacks important head information. Under background interference, or when drowning behavior is not obvious, a behavior recognition network has difficulty judging human behavior characteristics. In that case, the judgment can be made by additionally using feature information about whether the head is above or under water.
(4) Water splashes and water surface reflections. Splashes arise when swimmers move through the water, and reflections appear under the influence of illumination. A robust detection network is required, together with a dataset annotated for swimmers under different illumination conditions.
Disclosure of Invention
The purpose of the application is to provide a multi-target tracking method for a natatorium scene. Following the detection-based online multi-target tracking paradigm, the detector and tracker are improved for the pool scene. During detection and tracking, the head and body information of all targets in the natatorium can be acquired in real time, and heads are paired with bodies, realizing an effective multi-target tracking method for the natatorium scene.
To achieve the above purpose, the technical solution of the application is as follows:
A method for multi-target tracking in a natatorium scene, comprising:
acquiring images captured by cameras in the natatorium and annotating targets in the images, marking head frames and body frames together with their categories, the categories being above water or underwater, wherein a head frame is contained within its body frame and the head frame and body frame belonging to the same person share the same identification number, to form a training dataset;
training the constructed target detection network model with the training dataset, the model comprising a basic target detection module and a matching module, and using the trained target detection network model to perform target detection and matching on the video image sequence captured by cameras in the natatorium to obtain paired human head frames and human body frames as the target detection result;
applying a multi-target tracking algorithm to the target detection result of the target detection network model to perform multi-target tracking on the paired head-frame and body-frame objects and obtain target trajectories.
Further, the target detection network model performing target detection and matching on video images captured by cameras in the natatorium comprises:
performing target detection with the basic target detection module to obtain the feature map output by its backbone network, together with the predicted detection frames and predicted classification results it finally outputs;
inputting the feature map, predicted detection frames, and predicted classification results into the matching module, which performs the following operations:
in the matching module, mapping the predicted detection frames into the feature map through a region-of-interest pooling layer, and extracting the human head features and human body features in the feature map according to the category information provided by the predicted classification results;
computing the cosine distance between the extracted human head features and human body features;
computing the intersection-over-union (IoU) distance between the human head frame and the human body frame;
weighting and summing the cosine distance and the IoU distance to obtain a cost matrix as the association measure between the human head frame and the human body frame;
taking the association measure between the human head frame and the human body frame as the matching weight, and resolving the matching relation between head frames and body frames with the Hungarian matching algorithm.
Further, applying a multi-target tracking algorithm to the target detection result of the target detection network model to perform multi-target tracking on the paired head-frame and body-frame objects and obtain target trajectories comprises:
dividing the target detection result of the t-th frame image into high-confidence human body frames and low-confidence human body frames according to the confidence of each body frame;
for all trajectories in the tracking trajectory set T corresponding to the (t-1)-th frame image, predicting the tracking frames in the t-th frame image with a Kalman filter, matching the high-confidence body frames of the t-th frame image with the tracking frames, putting unmatched high-confidence body frames into the remaining body-frame set according to the matching result, and putting unmatched tracking frames into the first remaining trajectory set;
matching the low-confidence body frames of the t-th frame image with the tracking frames in the first remaining trajectory set, putting tracking frames that fail to match into the second remaining trajectory set, and deleting low-confidence body frames that fail to match;
performing tracking trajectory management and outputting the tracking trajectories corresponding to the t-th frame image.
Further, matching the high-confidence human body frames of the t-th frame image with the tracking frames comprises:
computing the IoU between the high-confidence body frames of the t-th frame image and the tracking frames, extracting the corresponding feature information from the body frames and tracking frames, and computing the cosine similarity between their feature information;
fusing the IoU with the cosine similarity to obtain a similarity matrix between body frames and tracking frames, taking the similarity matrix as the matching weight, computing the association metric between body frames and tracking frames with the Hungarian matching algorithm, and obtaining the matched body frames and tracking frames from the association metric.
Further, the multi-target tracking method for the natatorium scene further comprises:
for trajectories in the second remaining trajectory set, deleting a trajectory from the trajectory set T as an inactive trajectory if it has persisted beyond a preset time, and otherwise keeping it stored in the trajectory set T.
Further, the multi-target tracking method for the natatorium scene further comprises:
for body frames in the remaining body-frame set, initializing a new trajectory if the body-frame confidence is higher than θ and the frame survives for more than two frames.
The multi-target tracking method for the natatorium scene has the following beneficial effects:
(1) The framework is simple and performs strongly. Head and body targets are detected separately, and head-body pairing for the same person is completed with high accuracy.
(2) The detection and tracking technique for the natatorium scene can effectively classify and track both people on the poolside and swimmers, and judges by visual means whether a swimmer's head is under water.
(3) With this detection and tracking technique, natatorium passenger flow can be counted in real time; swimmers can be detected and tracked in real time to acquire position and behavior information; and whether lifeguards are on duty can be supervised in real time.
Drawings
Fig. 1 is a flowchart of a multi-target tracking method in a natatorium scenario according to the present application.
Fig. 2 is a schematic diagram of a target detection network model according to the present application.
Fig. 3 is a schematic structural diagram of a basic object detection module according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a matching module according to an embodiment of the present application.
Fig. 5 is a schematic diagram of multi-target tracking according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a multi-target tracking method for a natatorium scene is provided, which includes:
s1, acquiring an image acquired by a camera in a natatorium, marking targets in the image, marking a head frame and a body frame, and respectively classifying the head frame and the body frame, wherein the classification is respectively on water or under water, the head frame is contained in the body frame, and the head frame and the body frame belonging to the same person have the same identification number, so as to form a training data set.
In this embodiment, images acquired using cameras at different angles within the natatorium are used as the training dataset. The data set is in a VOC format, 1000 frames are extracted from videos shot in the morning, in the middle and in the evening for a single camera, and the video frame extraction time interval is 2 minutes; total 1000 x 6 cameras = 6000 images. Labeling of the target in the image is started next: for a labeling object, only one person head or human body is visible or approximately visible, and labeling of a person head frame or a human body frame is required; the classification of the human body frame is marked as water or underwater, the judgment standard is that the human body is marked as "underwater" under water, and the human body is not marked as "water" in the swimming pool; the classification of the head frame is marked as water or underwater, and the judgment standard is that 90% of the head is marked as "underwater" under water, and the other is marked as "water". Namely, the head frame is respectively provided with two kinds of labels of head water and head water; the human body frame is provided with two kinds of labels respectively on the water and the water.
Note that: the human body frame and the human head frame are marked as horizontal frames; the head frames are contained in the human body frame, and the head frames and the human body frame belonging to the same person have the same identification number (ID) as the associated information. The ID information is used as an association metric in the subsequent target detection network and a loss value is calculated.
To reduce the interference of background information with target detection, this embodiment visualizes the annotation boxes in the training dataset and computes their aspect ratios, obtaining the minimum, maximum, and average sizes of annotation boxes in the dataset. The multi-scale anchor parameters are adjusted according to the counted box sizes to improve the accuracy of the proposals (candidate boxes) generated by the RPN (region proposal network) layer.
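For illustration, the annotation-box statistics can be gathered with a short script over the VOC annotation files. The following is a minimal sketch, assuming one XML file per image in a single directory; the directory layout and function name are illustrative, not part of the patent:

```python
import xml.etree.ElementTree as ET
from pathlib import Path
import numpy as np

def annotation_box_stats(voc_annotation_dir):
    """Collect the width/height of every labeled box in a VOC-format dataset."""
    sizes = []
    for xml_file in Path(voc_annotation_dir).glob("*.xml"):
        root = ET.parse(xml_file).getroot()
        for obj in root.iter("object"):
            box = obj.find("bndbox")
            w = float(box.find("xmax").text) - float(box.find("xmin").text)
            h = float(box.find("ymax").text) - float(box.find("ymin").text)
            sizes.append((w, h))
    sizes = np.array(sizes)
    # minimum, maximum, and average (width, height) over all annotation boxes
    return sizes.min(axis=0), sizes.max(axis=0), sizes.mean(axis=0)
```

The returned statistics can then guide the choice of anchor scales and aspect ratios for the RPN.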
S2, training the constructed target detection network model with the training dataset, the model comprising a basic target detection module and a matching module, and using the trained model to perform target detection and matching on the video image sequence captured by cameras in the natatorium to obtain paired human head frames and human body frames as the target detection result.
The target detection network model of this embodiment is shown in fig. 2 and comprises a basic target detection module and a matching module. The basic target detection module can be any commonly used target detection model, such as Faster R-CNN, whose structure is shown in fig. 3. The matching module then completes the pairing of the head frame and body frame belonging to one target.
Taking Faster R-CNN as the basic target detection module: when training the target detection network model, the original image captured by the camera is scaled to a fixed size of 800×600, and a high-dimensional feature map of size 50×38×256 is extracted by the Faster R-CNN backbone network.
Next, the feature map passes through the RPN layer to compute accurate candidate boxes proposal(x1, y1, x2, y2). In the RPN layer, the feature map is reshaped to 50×38×36×1 by a reshape layer; the extra singleton dimension exists solely so that the softmax (normalized exponential function) layer can perform classification. Softmax classification yields foreground information (i.e., detection targets) and background information. The foreground information is then restored to its original size by another reshape layer and input to the proposal layer to obtain correct anchors. Meanwhile, the feature map computes regression box offsets according to the preset multi-scale anchor parameters. The computation is as follows: given an anchor A = (A_x, A_y, A_w, A_h) and a ground-truth box GT = (G_x, G_y, G_w, G_h), to make the predicted anchor closer to the real box, a transform is applied to the anchor:
G_x = A_w · d_x(A) + A_x
G_y = A_h · d_y(A) + A_y
G_w = A_w · exp(d_w(A))
G_h = A_h · exp(d_h(A))
The regression box offsets [d_x(A), d_y(A), d_w(A), d_h(A)] are obtained by linear regression.
In the proposal layer, the regression offsets [d_x(A), d_y(A), d_w(A), d_h(A)] are applied to the foreground anchors for prediction-box regression; the resulting predicted regression boxes are corrected and filtered according to the original-image scaling information stored in the image metadata, and the accurate proposals (x1, y1, x2, y2) are computed.
The region-of-interest pooling layer then computes the proposal feature map from the proposals and the feature map and feeds it into the subsequent network. The proposal feature map passes through fully connected layers and a softmax layer to compute the category of each proposal and output the predicted classification result, and through fully connected layers to obtain the predicted detection frame of each proposal.
Faster R-CNN is a relatively mature technique in the art and is not described in further detail here.
As shown in fig. 4, the target detection network model constructed in this application adds a matching module on top of Faster R-CNN, inputs the feature map, predicted detection frames, and predicted classification results into the matching module, and performs the following operations:
in the matching module, the predicted detection frames are mapped into the feature map through the region-of-interest pooling layer, and the human head features and human body features in the feature map are extracted according to the category information provided by the predicted classification results, for use in computing cosine distances.
Suppose the predicted detection frame coordinates (x1, y1, x2, y2) correspond to the coordinates (x'1, y'1, x'2, y'2) in the feature map. Since the 800×600 input is mapped to a 50×38 feature map, the backbone stride is 16, and the coordinate conversion is
x'_i = ⌊x_i / 16⌋, y'_i = ⌊y_i / 16⌋, i = 1, 2.
The feature information within the corresponding detection-frame region is extracted according to these coordinates, yielding the head features and body features in the feature map. The cosine distance between head features and body features is then computed from the extracted features:
let the head feature vector be h_i and the body feature vector be b_i; the cosine similarity cos(h_i, b_i) and cosine distance cosine_matrix(h_i, b_i) are computed as
cos(h_i, b_i) = (h_i · b_i) / (‖h_i‖ ‖b_i‖)
cosine_matrix(h_i, b_i) = 1 - cos(h_i, b_i).
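A minimal numpy sketch of the pairwise cosine-distance computation follows; it assumes the head and body features have already been pooled from the feature map into fixed-length vectors (the function name is illustrative):

```python
import numpy as np

def cosine_distance_matrix(head_feats, body_feats):
    """Pairwise cosine distance 1 - cos(h_i, b_j).

    head_feats: (M, D) head feature vectors; body_feats: (N, D) body features.
    Returns an (M, N) distance matrix, 0 meaning identical direction.
    """
    h = head_feats / np.linalg.norm(head_feats, axis=1, keepdims=True)
    b = body_feats / np.linalg.norm(body_feats, axis=1, keepdims=True)
    return 1.0 - h @ b.T
```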
In the matching module, the IoU distance between the human head frame and the human body frame is computed at the same time.
Suppose the head frame coordinates are (x1^h, y1^h, x2^h, y2^h) and the body frame coordinates are (x1^b, y1^b, x2^b, y2^b). The intersection coordinates (x1, y1, x2, y2) are computed as
x1 = max(x1^h, x1^b), y1 = max(y1^h, y1^b), x2 = min(x2^h, x2^b), y2 = min(y2^h, y2^b).
The intersection area is then
intersection = max(x2 - x1 + 1.0, 0) · max(y2 - y1 + 1.0, 0),
and the areas of the head frame and body frame are
area_h = (x2^h - x1^h + 1.0) · (y2^h - y1^h + 1.0)
area_b = (x2^b - x1^b + 1.0) · (y2^b - y1^b + 1.0).
The IoU distance is therefore
iou_matrix(h_i, b_i) = 1 - intersection / (area_h + area_b - intersection).
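The same formulas translate directly into a vectorized pairwise computation; the sketch below assumes boxes in corner form (x1, y1, x2, y2):

```python
import numpy as np

def iou_distance_matrix(head_boxes, body_boxes):
    """Pairwise IoU distance between (M, 4) head boxes and (N, 4) body boxes."""
    hb = head_boxes[:, None, :]   # (M, 1, 4)
    bb = body_boxes[None, :, :]   # (1, N, 4)
    # intersection rectangle, per the max/min formulas above
    x1 = np.maximum(hb[..., 0], bb[..., 0])
    y1 = np.maximum(hb[..., 1], bb[..., 1])
    x2 = np.minimum(hb[..., 2], bb[..., 2])
    y2 = np.minimum(hb[..., 3], bb[..., 3])
    inter = np.maximum(x2 - x1 + 1.0, 0) * np.maximum(y2 - y1 + 1.0, 0)
    area_h = ((head_boxes[:, 2] - head_boxes[:, 0] + 1.0) *
              (head_boxes[:, 3] - head_boxes[:, 1] + 1.0))[:, None]
    area_b = ((body_boxes[:, 2] - body_boxes[:, 0] + 1.0) *
              (body_boxes[:, 3] - body_boxes[:, 1] + 1.0))[None, :]
    return 1.0 - inter / (area_h + area_b - inter)
```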
Finally, the cosine distance cosine_matrix and the IoU distance iou_matrix are weighted and summed to obtain the cost matrix cost_matrix:
cost_matrix(h_i, b_i) = λ·cosine_matrix(h_i, b_i) + (1-λ)·iou_matrix(h_i, b_i)
where the parameter λ is a preset weight.
In the matching module, the cost matrix between head frames and body frames is used as the matching weight, the association metric between head frames and body frames is computed with the Hungarian matching algorithm, and the matching relation between head frames and body frames is then determined from the association metric.
Specifically, when the association metric is greater than or equal to 0.6, the head frame is considered to be successfully matched with the body frame and the two are bound; an association metric below 0.6 is considered a failed match.
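A sketch of the fused cost and Hungarian assignment using scipy. Interpreting the "association metric" as 1 minus the assignment cost is one reading of the text, and λ = 0.5 is a placeholder, since the patent does not specify the weight:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_heads_to_bodies(cos_dist, iou_dist, lam=0.5, score_thresh=0.6):
    """Match head frames to body frames via the weighted cost matrix.

    cos_dist, iou_dist: (M, N) matrices from the formulas above.
    """
    cost = lam * cos_dist + (1.0 - lam) * iou_dist
    rows, cols = linear_sum_assignment(cost)   # Hungarian: minimise total cost
    pairs = []
    for r, c in zip(rows, cols):
        score = 1.0 - cost[r, c]               # assumed association metric
        if score >= score_thresh:              # >= 0.6 counts as a valid pair
            pairs.append((r, c))
    return pairs
```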
This embodiment uses the Hungarian matching algorithm and finally obtains the successful matching results, where each match comprises the head frame, head category, body frame, body category, and body-frame confidence:
Detection:{head_bbox,head_class,body_bbox,body_class,confidence}。
This operation is performed on every frame of the video image sequence to obtain the paired head frame and body frame in each frame image.
During training, the loss is computed from the predicted matching relation and the ground-truth matching relation between head frames and body frames; the computed loss value measures the pairing quality. A smaller loss value means higher matching quality between head frames and body frames, and vice versa. By continuously adjusting the model parameters until the loss reaches its minimum, accurate matching between head frames and body frames is achieved and training of the network model is completed.
It should be noted that the matching module in this embodiment computes the association between predicted head frames and body frames, which is in essence a minimum-weight matching problem on a bipartite graph, so the matching relation can be computed with the Hungarian algorithm. The Hungarian matching algorithm is a relatively mature technique in the art and is not described in detail here. When training the target detection network model, the loss value is computed from the association metric obtained by matching and the association metric of the ground-truth boxes; back-propagation is then performed and the network parameters are updated to train the network model, which is likewise not elaborated here.
In the training dataset, the head annotation box is contained within the body box, and head and body boxes belonging to the same person share the same ID. In the target detection results, however, a head frame may not be completely contained in a body frame, and heads and bodies may be mismatched when targets occlude one another. Common IoU-based matching operations, such as computing the box-to-box IoU distance, cannot solve the problem of mutual occlusion. This embodiment uses the cosine distance to measure feature similarity, which effectively resolves head-body mismatches under occlusion.
S3, applying a multi-target tracking algorithm to the target detection result of the target detection network model to perform multi-target tracking on the paired head-frame and body-frame objects and obtain target trajectories.
During tracking, a target data structure containing the head frame and body frame is constructed for the target detection result of each frame image, a Kalman filter predicts the position of each target detection frame, and a target identification number is assigned. This ensures that, as far as possible, the body tracking frame in the output tracking result contains the paired head-frame information.
As shown in fig. 5, this embodiment performs multi-target tracking with a multi-target tracking algorithm on the detection results of the target detection network model. Three thresholds are set in the multi-target tracking algorithm: τ_low, τ_high, and θ. The first two are confidence thresholds for body-frame detection, and the last is the confidence threshold for spawning a new trajectory from a body frame. Because in the target detection result each head frame is paired with a body frame and lies inside it, the following steps are described in terms of body frames.
In this embodiment, the steps of the multi-target tracking algorithm are as follows:
and F1, dividing the target detection result of the t frame image into a high-confidence human body frame and a low-confidence human body frame according to the confidence level of the human body frame.
For example, the target detection result of the t-th frame image is divided into two parts according to the confidence of the human frame: confidence of human body frame is higher than threshold tau high Human body frame classified as high confidence human body frame D high Confidence of human body frame is within tau high And τ low Human frames in between are classified as low confidence human frames D low
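A minimal sketch of this split; the numeric thresholds are placeholders, since the patent only names τ_high and τ_low:

```python
def split_by_confidence(detections, tau_high=0.6, tau_low=0.1):
    """Split detections into D_high and D_low by body-frame confidence."""
    d_high = [d for d in detections if d["confidence"] >= tau_high]
    d_low = [d for d in detections
             if tau_low <= d["confidence"] < tau_high]
    return d_high, d_low
```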
F2, for all trajectories in the tracking trajectory set T corresponding to the (t-1)-th frame image, predicting the tracking frames in the t-th frame image with a Kalman filter, matching the high-confidence body frames of the t-th frame image with the tracking frames, putting unmatched high-confidence body frames into the remaining body-frame set according to the matching result, and putting unmatched tracking frames into the first remaining trajectory set.
In this step, the Kalman filter predicts, for every trajectory in the tracking trajectory set T of the (t-1)-th frame image, its tracking frame in the t-th frame image. The high-confidence body frames D_high are then associated with the tracking trajectory set T for the first time, i.e., the high-confidence body frames of the t-th frame image are matched with the tracking frames.
In this embodiment, matching the high-confidence body frames of the t-th frame image with the tracking frames comprises:
computing the IoU between the high-confidence body frames of the t-th frame image and the tracking frames, extracting the corresponding feature information from the body frames and tracking frames, and computing the cosine similarity between their feature information;
fusing the IoU with the cosine similarity to obtain a similarity matrix of body frames and tracking frames, taking the similarity matrix as the matching weight, computing the association metric between body frames and tracking frames with the Hungarian matching algorithm, and obtaining the matched body frames and tracking frames from the association metric.
Matched body frames and tracking frames are obtained via the association metric; unmatched high-confidence body frames are put into the remaining body-frame set, and unmatched tracking frames are put into the first remaining trajectory set.
Specifically, the IoU distance between the two boxes is computed from the body frame output by the detection network and the tracking frame predicted by the Kalman filter. The Kalman filter predicts the tracking frame by taking the body frame of the previous frame as input and assuming it changes at constant velocity, giving the body frame of the current frame; the states of the previous body frame and the predicted body frame are then linearly weighted, and the accurate position of the current frame's body frame (i.e., the tracking frame) is finally output.
ResNeSt50 from the target re-identification open-source library FastReID is used as the deep appearance feature extractor to extract feature information within body frames, and the cosine distance is used to compute the cosine similarity between body features in the previous frame's tracking frame and features in the current frame's body frame.
The IoU distance and cosine similarity computed in the previous two steps are weighted and added to obtain a similarity matrix carrying both distance information and appearance-feature information.
The information in the similarity matrix is then matched with the Hungarian algorithm, finally yielding the association metric between body frames and tracking frames. An association metric above 0.2 is considered a successful match and passed to the tracking trajectory management module; matches with an association metric below 0.2 are rejected. Unmatched body frames are stored in D_remain, and tracking frames that fail to match are stored in T_remain.
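A sketch of this first association, assuming the IoU distance and appearance cosine similarity are fused by simple linear weighting (the fusion weight α is an illustrative assumption; the patent does not give it):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def first_association(iou_dist, app_cos_sim, alpha=0.5, metric_thresh=0.2):
    """Associate KF-predicted tracking boxes (rows) with D_high boxes (cols).

    iou_dist:    (T, D) IoU distances; app_cos_sim: (T, D) ReID similarities.
    """
    similarity = alpha * (1.0 - iou_dist) + (1.0 - alpha) * app_cos_sim
    rows, cols = linear_sum_assignment(-similarity)  # maximise similarity
    matched = []
    unmatched_tracks = set(range(iou_dist.shape[0]))
    unmatched_dets = set(range(iou_dist.shape[1]))
    for t, d in zip(rows, cols):
        if similarity[t, d] > metric_thresh:   # > 0.2 accepted per the text
            matched.append((t, d))
            unmatched_tracks.discard(t)        # matched tracks leave T_remain
            unmatched_dets.discard(d)          # matched dets leave D_remain
    return matched, sorted(unmatched_tracks), sorted(unmatched_dets)
```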
F3, matching the low-confidence body frames of the t-th frame image with the tracking frames in the first remaining trajectory set, putting tracking frames that fail to match into the second remaining trajectory set, and deleting low-confidence body frames that fail to match.
For the low-confidence body frames D_low and the unmatched trajectories T_remain, a second matching is performed, in which similarity is computed using only the IoU distance information, i.e., the association metric is computed directly from IoU with the Hungarian matching algorithm. Trajectories that fail the second match are stored in T_re-remain, and unmatched low-score body frames are deleted directly.
F4, performing tracking trajectory management and outputting the tracking trajectories corresponding to the t-th frame image.
The tracking frames matched successfully in the first and second rounds are managed: the body features in the previous frame's tracking frame are replaced and updated with the feature information in the current frame's body frame, and the tracking frames from both successful matching rounds are merged, updated with the Kalman filter, and added to T.
In a specific embodiment, the multi-target tracking method for the natatorium scene further includes:
and for the tracks in the second residual track set, deleting the tracks as inactive tracks from the track set T if the preset time exists, otherwise, continuing to store the tracks in the track set T.
Specifically T re-remain The track in (a) is considered to be a temporary loss of the target, and is put into T lost If T lost If the track exists for more than a certain time (30 frames), deleting the track from T as an inactive track, otherwise, keeping the track in T as a T-1 frame.
In a specific embodiment, the multi-target tracking method for the natatorium scene further includes:
For body frames in the remaining body-frame set, a new trajectory is initialized if the body-frame confidence is higher than θ and the frame survives for more than two frames.
That is, for body frames in D_remain, a new trajectory is initialized if the body-frame confidence is higher than θ and the frame survives for more than two frames.
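A minimal sketch of the track lifecycle rules from the preceding paragraphs (pruning lost trajectories after 30 frames and spawning new ones). The bookkeeping fields and θ = 0.7 are illustrative assumptions:

```python
MAX_LOST_FRAMES = 30   # a trajectory lost longer than this becomes inactive

def manage_tracks(t_lost, d_remain, theta=0.7):
    """Prune stale trajectories and spawn new ones from leftover detections."""
    # keep only trajectories lost for at most MAX_LOST_FRAMES frames
    t_lost = [trk for trk in t_lost if trk["frames_lost"] <= MAX_LOST_FRAMES]
    # initialise a new trajectory for confident detections surviving > 2 frames
    new_tracks = [det for det in d_remain
                  if det["confidence"] > theta and det["frames_survived"] > 2]
    return t_lost, new_tracks
```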
The final output of the multi-target tracking algorithm is the video trajectories T, where each trajectory comprises the target's head detection frame, head-frame category (above water / underwater), body detection frame, body-frame category (above water / underwater), and identity id:
Track: {id, head_bbox, head_class, body_bbox, body_class}.
When predicting target boxes with a Kalman filter, a discrete Kalman filter with a constant-velocity model is typically used. During experiments, however, it was found that the width and height of the body frame detected by the target detection network change frequently while a person is swimming. When predicted with the plain discrete Kalman filter, the output tracking frame cannot accurately fit the swimmer's body region. This application therefore estimates the width and height of the target box directly with the Kalman filter, adding them to the state vector and measurement vector of the discrete Kalman filter.
First, the state vector x_k of the KF is defined as an 8-tuple, together with the measurement vector z_k:
x_k = [x_c(k), y_c(k), w(k), h(k), ẋ_c(k), ẏ_c(k), ẇ(k), ḣ(k)]^T
z_k = [z_x(k), z_y(k), z_w(k), z_h(k)]^T
Likewise, when computing covariances with the Kalman filter, the process noise covariance Q_k and the measurement noise covariance R_k are computed simultaneously (following the BoT-SORT formulation referenced below):
Q_k = diag((σ_p ŵ_{k-1})², (σ_p ĥ_{k-1})², (σ_p ŵ_{k-1})², (σ_p ĥ_{k-1})²,
           (σ_v ŵ_{k-1})², (σ_v ĥ_{k-1})², (σ_v ŵ_{k-1})², (σ_v ĥ_{k-1})²)
R_k = diag((σ_m z_w(k))², (σ_m z_h(k))², (σ_m z_w(k))², (σ_m z_h(k))²)
where the noise factors are σ_p = 0.05, σ_v = 0.00625, and σ_m = 0.05.
Finally, the process noise covariance Q_k and measurement noise covariance R_k are used to compute the width and height of the current frame's body frame and obtain the predicted tracking frame. According to the experimental results, improving the state vector and measurement vector of the Kalman filter in this way allows the width and height of a swimmer's body frame to be estimated accurately and improves how well the tracking frame fits the body region.
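A self-contained numpy sketch of the 8-state constant-velocity Kalman filter with width and height in the state, using the noise factors given above. This is an illustrative reconstruction, not the patent's exact implementation:

```python
import numpy as np

class WidthHeightKalmanFilter:
    """Constant-velocity KF over state [xc, yc, w, h, vx, vy, vw, vh]."""

    SIGMA_P, SIGMA_V, SIGMA_M = 0.05, 0.00625, 0.05

    def __init__(self, box, dt=1.0):
        self.x = np.array([*box, 0.0, 0.0, 0.0, 0.0])  # box = (xc, yc, w, h)
        self.P = np.eye(8)                             # state covariance
        self.F = np.eye(8)
        self.F[:4, 4:] = dt * np.eye(4)                # position += dt * velocity
        self.H = np.hstack([np.eye(4), np.zeros((4, 4))])

    def predict(self):
        w, h = self.x[2], self.x[3]
        q = np.array([self.SIGMA_P * w, self.SIGMA_P * h,
                      self.SIGMA_P * w, self.SIGMA_P * h,
                      self.SIGMA_V * w, self.SIGMA_V * h,
                      self.SIGMA_V * w, self.SIGMA_V * h]) ** 2
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + np.diag(q)  # Q_k scaled by w, h
        return self.x[:4]                                 # predicted box

    def update(self, z):
        zw, zh = z[2], z[3]
        R = np.diag(np.array([self.SIGMA_M * zw, self.SIGMA_M * zh,
                              self.SIGMA_M * zw, self.SIGMA_M * zh]) ** 2)
        y = z - self.H @ self.x                           # innovation
        S = self.H @ self.P @ self.H.T + R
        K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P
```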
This application defines a data structure containing both body-frame and head-frame information as a trajectory, including the target trajectory id; the head frame head_bbox and body frame body_bbox belonging to the same person; the head category head_class and body category body_class; the detection score (confidence); the mean and covariance predicted by the Kalman filter; and so on. When predicting the tracking frame, the algorithm preferentially uses the mean and covariance to predict the body frame. Because body features are more distinctive than head features, while body motion deforms far more than head motion, the BoT-SORT improvement is adopted: the Kalman filter is modified to compute the width and height of the tracking frame during prediction, improving the fit of the tracking frame to the body region.
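For illustration, the trajectory data structure could be expressed as a simple dataclass; the field names mirror the Track structure above, and the optional Kalman-filter fields are assumptions:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Track:
    id: int
    head_bbox: np.ndarray          # (4,) head detection frame
    head_class: str                # "above water" / "underwater"
    body_bbox: np.ndarray          # (4,) body detection frame
    body_class: str
    confidence: float
    mean: Optional[np.ndarray] = None        # KF state mean
    covariance: Optional[np.ndarray] = None  # KF state covariance
```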
The above examples merely represent several embodiments of the present application; their descriptions are specific and detailed but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (6)

1. A multi-target tracking method for a natatorium scene, characterized by comprising the following steps:
acquiring images captured by cameras in the natatorium and annotating targets in the images, marking head frames and body frames together with their categories, the categories being above water or underwater, wherein a head frame is contained within its body frame and the head frame and body frame belonging to the same person share the same identification number, to form a training dataset;
training the constructed target detection network model with the training dataset, the model comprising a basic target detection module and a matching module, and using the trained target detection network model to perform target detection and matching on the video image sequence captured by cameras in the natatorium to obtain paired human head frames and human body frames as the target detection result;
applying a multi-target tracking algorithm to the target detection result of the target detection network model to perform multi-target tracking on the paired head-frame and body-frame objects and obtain target trajectories.
2. The multi-target tracking method for a natatorium scene according to claim 1, wherein the target detection network model performing target detection and matching on video images captured by cameras in the natatorium comprises:
performing target detection with the basic target detection module to obtain the feature map output by its backbone network, together with the predicted detection frames and predicted classification results it finally outputs;
inputting the feature map, predicted detection frames, and predicted classification results into the matching module, which performs the following operations:
in the matching module, mapping the predicted detection frames into the feature map through a region-of-interest pooling layer, and extracting the human head features and human body features in the feature map according to the category information provided by the predicted classification results;
computing the cosine distance between the extracted human head features and human body features;
computing the intersection-over-union (IoU) distance between the human head frame and the human body frame;
weighting and summing the cosine distance and the IoU distance to obtain a cost matrix as the association measure between the human head frame and the human body frame;
taking the association measure between the human head frame and the human body frame as the matching weight, and resolving the matching relation between head frames and body frames with the Hungarian matching algorithm.
3. The multi-target tracking method for a natatorium scene according to claim 1, wherein applying a multi-target tracking algorithm to the target detection result of the target detection network model to perform multi-target tracking on the paired head-frame and body-frame objects and obtain target trajectories comprises:
dividing the target detection result of the t-th frame image into high-confidence human body frames and low-confidence human body frames according to the confidence of each body frame;
for all trajectories in the tracking trajectory set T corresponding to the (t-1)-th frame image, predicting the tracking frames in the t-th frame image with a Kalman filter, matching the high-confidence body frames of the t-th frame image with the tracking frames, putting unmatched high-confidence body frames into the remaining body-frame set according to the matching result, and putting unmatched tracking frames into the first remaining trajectory set;
matching the low-confidence body frames of the t-th frame image with the tracking frames in the first remaining trajectory set, putting tracking frames that fail to match into the second remaining trajectory set, and deleting low-confidence body frames that fail to match;
performing tracking trajectory management and outputting the tracking trajectories corresponding to the t-th frame image.
4. The multi-target tracking method for a natatorium scene according to claim 3, wherein matching the high-confidence human body frames of the t-th frame image with the tracking frames comprises:
computing the IoU between the high-confidence body frames of the t-th frame image and the tracking frames, extracting the corresponding feature information from the body frames and tracking frames, and computing the cosine similarity between their feature information;
fusing the IoU with the cosine similarity to obtain a similarity matrix between body frames and tracking frames, taking the similarity matrix as the matching weight, computing the association metric between body frames and tracking frames with the Hungarian matching algorithm, and obtaining the matched body frames and tracking frames from the association metric.
5. The multi-target tracking method for a natatorium scene according to claim 3, further comprising:
for trajectories in the second remaining trajectory set, deleting a trajectory from the trajectory set T as an inactive trajectory if it has persisted beyond a preset time, and otherwise keeping it stored in the trajectory set T.
6. The multi-target tracking method for a natatorium scene according to claim 3, further comprising:
for body frames in the remaining body-frame set, initializing a new trajectory if the body-frame confidence is higher than θ and the frame survives for more than two frames.
CN202310128609.6A 2023-02-16 2023-02-16 Multi-target tracking method for natatorium scene Pending CN116342645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310128609.6A CN116342645A (en) 2023-02-16 2023-02-16 Multi-target tracking method for natatorium scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310128609.6A CN116342645A (en) 2023-02-16 2023-02-16 Multi-target tracking method for natatorium scene

Publications (1)

Publication Number Publication Date
CN116342645A true CN116342645A (en) 2023-06-27

Family

ID=86892093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310128609.6A Pending CN116342645A (en) 2023-02-16 2023-02-16 Multi-target tracking method for natatorium scene

Country Status (1)

Country Link
CN (1) CN116342645A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152689A (en) * 2023-10-31 2023-12-01 易启科技(吉林省)有限公司 River channel target detection method and system based on vision
CN117152689B (en) * 2023-10-31 2024-01-19 易启科技(吉林省)有限公司 River channel target detection method and system based on vision

Similar Documents

Publication Publication Date Title
CN109819208B (en) Intensive population security monitoring management method based on artificial intelligence dynamic monitoring
CN107818571B (en) Ship automatic tracking method and system based on deep learning network and average drifting
WO2020042419A1 (en) Gait-based identity recognition method and apparatus, and electronic device
CN108921107B (en) Pedestrian re-identification method based on sequencing loss and Simese network
CN103971386B (en) A kind of foreground detection method under dynamic background scene
WO2017185688A1 (en) Method and apparatus for tracking on-line target
Foedisch et al. Adaptive real-time road detection using neural networks
CN114022910B (en) Swimming pool drowning prevention supervision method and device, computer equipment and storage medium
CN113011367A (en) Abnormal behavior analysis method based on target track
CN109657592A (en) A kind of face identification system and method for intelligent excavator
CN114972418A (en) Maneuvering multi-target tracking method based on combination of nuclear adaptive filtering and YOLOX detection
Salehi et al. An automatic video-based drowning detection system for swimming pools using active contours
Huang et al. Fish tracking and segmentation from stereo videos on the wild sea surface for electronic monitoring of rail fishing
CN108776974A (en) A kind of real-time modeling method method suitable for public transport scene
CN116343330A (en) Abnormal behavior identification method for infrared-visible light image fusion
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
CN112989889A (en) Gait recognition method based on posture guidance
CN114926859A (en) Pedestrian multi-target tracking method in dense scene combined with head tracking
CN104778699A (en) Adaptive object feature tracking method
CN116342645A (en) Multi-target tracking method for natatorium scene
Batool et al. Telemonitoring of daily activities based on multi-sensors data fusion
CN116152928A (en) Drowning prevention early warning method and system based on lightweight human body posture estimation model
CN106056078A (en) Crowd density estimation method based on multi-feature regression ensemble learning
CN114627339A (en) Intelligent recognition and tracking method for border crossing personnel in dense jungle area and storage medium
CN109887004A (en) A kind of unmanned boat sea area method for tracking target based on TLD algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination