CN113256680A - High-precision target tracking system based on unsupervised learning
- Publication number: CN113256680A
- Application number: CN202110523935.8A
- Authority: CN (China)
- Prior art keywords: tracker, target, tracking, frame, selection module
- Prior art date: 2021-05-13
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G — PHYSICS
  - G06 — COMPUTING; CALCULATING OR COUNTING
    - G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
      - G06T7/00 — Image analysis
        - G06T7/20 — Analysis of motion
          - G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
      - G06T2207/00 — Indexing scheme for image analysis or image enhancement
        - G06T2207/10 — Image acquisition modality
          - G06T2207/10016 — Video; Image sequence
Abstract
The invention relates to a high-precision target tracking system based on unsupervised learning, which comprises an image acquisition module, a tracking module and a selection module. The image acquisition module is used for acquiring video images; the tracking module comprises tracker 1 and tracker 2 and is used for obtaining image features and the target bounding box; the selection module comprises two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the feature maps and results of the candidate trackers are taken as input, and the selector outputs the best tracking result. The invention obtains results from two different trackers, outputs the optimal result as judged by the selector, and continues tracking with it in subsequent frames, so as to adapt to target tracking in different scenes.
Description
Technical Field
The invention relates to the technical field of computer vision tracking, in particular to a high-precision target tracking system based on unsupervised learning.
Background
Target tracking is a basic task in computer vision whose purpose is to locate a target object in a video given a bounding box annotation in the first frame. Target tracking currently has a wide range of applications, such as intelligent transportation systems, the medical field, human-computer interaction, and athlete match analysis.
Current advanced deep tracking methods typically use pre-trained CNN models for feature extraction. These models are trained in a supervised fashion and require a large number of annotated labels, which are expensive and time-consuming to produce manually, whereas unlabeled videos are readily available on the internet. It is therefore worth exploring how to perform visual tracking using unlabeled video sequences.
Data are now extremely easy to obtain on the internet, and the development of unsupervised techniques alleviates the problem of manual labeling, which plays a great role in deep-learning-based target tracking. Unsupervised learning on video has produced a great deal of research. To learn visual features from unlabeled data, unsupervised methods explore intrinsic information inside images or videos from different perspectives as supervision signals, and train by designing loss functions and pretext (proxy) tasks. In the prior art, for example, the patent "Unsupervised correlation-filtering target tracking method and system based on a jigsaw task" uses a prediction task that indexes image-patch positions to train the network's feature-extraction capability; such methods mainly train a single tracker to adapt to different scenes, but a single tracker usually has inherent defects that make it difficult to adapt to all scenes. Therefore, the prior art has certain defects and shortcomings, and there is room for further improvement.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a high-precision target tracking system based on unsupervised learning, which adapts to target tracking in different scenes by using two different trackers; a selection module compares their tracking results and selects the optimal one for continued tracking in subsequent frames. The system has a simple structure and improves both precision and robustness.
The technical scheme adopted by the invention is as follows:
the invention provides a high-precision target tracking system based on unsupervised learning, which comprises the following modules:
the image acquisition module is used for acquiring a video image;
the tracking module comprises tracker 1 and tracker 2 and is used for obtaining image features and the target bounding box;
the selection module comprises two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the feature maps and results of the candidate trackers are taken as input, and the selector outputs the best tracking result.
Further, tracker 1 and tracker 2 in the tracking module are two different trackers, suited to tracking in different scenes, and are trained with two different loss functions, L_C of formula (1) and L_m of formula (2):

L_m = Σ_i z_i    (2)

wherein L_C is the loss function of tracker 1, R_T is the label of the template patch cropped from the initial frame, which is a Gaussian response centered at the center of the initial bounding box, Z_T is the response map generated from the second search frame during back-tracking, and training uses cycle consistency; L_m is the Huber loss function of tracker 2, Î_t is the reconstructed frame, I_t is the real frame, and training uses the consistency of the pixel recombination.
Further, the specific contents of the selection module are as follows:
the selection module aims to select the most appropriate target result according to the tracking results; in the selection module, the results of the two trackers are processed simultaneously, so that the better of tracker 1 and tracker 2 can be selected;
(1) obtaining the overlap value IOU between the candidate box and the pseudo label: after the two trackers track forward to obtain a predicted target position, the predicted position is used as a pseudo label for backward tracking, skipping an interval of n frames; a new predicted position is obtained in the initial frame, and the overlap IOU is computed between the newly estimated target box and the annotation box in the initial frame; the label P required for selector training is obtained from the IOU values according to formula (4);
(2) the selection module consists of two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the feature map obtained by each candidate tracker is input into the selection module, which outputs a probability value estimating the precision of each of the two target boxes; the selector is trained with the cross-entropy loss function of formula (5):

L_s = -Σ [p·ln(a) + (1-p)·ln(1-a)]    (5)

wherein a is the probability value of the target-box precision estimate produced by the selector from the features, and p is the label required for selector training;
(3) in the tracking stage, the trackers locate the target, the feature maps and positioning results obtained by the trackers are used as input to the selection module, and the selection module directly judges and outputs the result of the optimal tracker.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, a combined model is designed to comprise an image acquisition module, a tracking module and a selection module, results obtained by two different trackers are judged by a selector to obtain the optimal result output, and tracking is continued in subsequent frames; the target tracking algorithm faces huge challenges such as motion blur, shielding and the like, and the method has the advantages that the two trackers have different applied target motion scenes, the result of the proper tracker is selected for tracking, the structure is simple, and the precision and the robustness can be effectively improved.
Drawings
FIG. 1 is a block flow diagram of the system of the present invention;
FIG. 2 is a schematic diagram of a training method of the tracker 1;
FIG. 3 is a schematic diagram of a training method of the tracker 2.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
As shown in FIG. 1, the high-precision target tracking system based on unsupervised learning proposed by the present invention comprises the following modules:
the image acquisition module is used for acquiring a video image;
the tracking module comprises tracker 1 and tracker 2 and is used for obtaining image features and the target bounding box;
the selection module comprises two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the feature maps and results of the candidate trackers are taken as input, and the selector outputs the best tracking result.
The tracking module comprises two branches, tracker 1 and tracker 2. Tracker 1 can track a target in scenes without sudden change with high precision, good robustness and high speed; tracker 2 has a memory storage bank, so it can track moving targets under occlusion or target loss with higher precision. The two trackers complement each other and adapt automatically to different scenes to ensure higher target tracking precision. The training steps for the two trackers are as follows:
Tracker 1 is trained using the idea of cycle consistency, as shown in FIG. 2. The specific steps are as follows: three patches are randomly selected within 10 consecutive frames of a video, where one patch is set as the first-frame template and the remaining patches are set as search patches. Given the target object annotated on the template frame, forward tracking is performed twice in sequence over the two subsequent frames; the position predicted in the last frame is then used as the initial target annotation for tracking backward to the first frame. In principle, the initial annotation of the first frame should be consistent with the target position predicted by backward tracking in the first frame, so the error between them is used for training. The training function is formula (1),
wherein L_C is the loss function of tracker 1, R_T is the label of the template patch cropped from the initial frame, which is a Gaussian response centered at the center of the initial bounding box, and Z_T is the response map generated from the second search frame during back-tracking; training uses cycle consistency.
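Formula (1) itself is not reproduced above. As an illustration only, assuming the cycle-consistency loss takes the usual L2 form between the back-tracked response map Z_T and the Gaussian label R_T (as in the cited "Unsupervised deep tracking" work), a minimal PyTorch-style sketch of one training step could look as follows; the tracker interface (a track method returning a response map) and the patch sizes are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def gaussian_label(size, sigma=2.0):
    # Gaussian response R_T centered on the initial bounding box center.
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32), indexing="ij")
    c = (size - 1) / 2.0
    return torch.exp(-((xs - c) ** 2 + (ys - c) ** 2) / (2 * sigma ** 2))

def cycle_consistency_loss(tracker, template, search1, search2):
    """Forward-track template -> search1 -> search2, then back-track to the
    template frame; penalize the gap between the back-tracked response Z_T
    and the label R_T (assumed L2 form of formula (1))."""
    r1 = tracker.track(template, search1)            # first forward step
    r2 = tracker.track(search1, search2, init=r1)    # second forward step
    z_t = tracker.track(search2, template, init=r2)  # back-track to the first frame
    r_t = gaussian_label(z_t.shape[-1]).to(z_t)
    return F.mse_loss(z_t, r_t)                      # L_C ~ ||Z_T - R_T||^2
```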
Tracker 2 uses fine-grained pixel matching, as shown in FIG. 3, and uses a memory storage bank to store information from multiple frames, which has the advantage of exploiting more feature information. The tracker is designed with long-term and short-term storage: in a video sequence the target may change over time, so if only the most recent frames were used when the target changes during tracking, earlier good features would not be exploited, and errors could be amplified or the target even lost in subsequent tracking. The memory bank is configured to store 5 frames of information; frame 0 and frame 5 are fixed as long-term memory, which ensures that early feature information is always retained over long video sequences, and frames I_{t-5}, I_{t-3}, I_{t-1} are used as short-term memory to provide the latest feature information. Tracker 2 reconstructs the target frame by a linear combination of pixels from the reference frame (I_{t-1}) and is trained on this reconstruction. Specifically, for each input frame I_t there is a triplet (Q_t, K_t, V_t), i.e. Query, Key, Value. Taking the current frame and multiple past frames in the memory bank as input, a trained feature encoder computes an affinity matrix between Q in the target frame and K in the reference frames, and the pixels of frame t are reconstructed as in formula (2),
wherein <·,·> is the dot product of two vectors, Q and K are the feature representations obtained after the twin (Siamese) network — K corresponds to the features of the multiple reference frames and Q to the features of the current target frame I_t — A_t is the pixel-level affinity matrix, and V is the original reference frame.
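Formula (2) is likewise not reproduced in the text. Assuming the reconstruction follows the standard attention form used by memory-augmented self-supervised trackers such as the cited MAST — a softmax over the dot products <Q, K> giving the affinity A_t, which then linearly combines the reference pixels V — a sketch might look as follows; the tensor shapes are assumptions.

```python
import torch

def reconstruct_frame(q_t, keys, values):
    """Assumed form of formula (2): I_hat_t = A_t @ V, with
    A_t = softmax(<Q_t, K>) taken over all memory pixels.

    q_t:    (N, C) query features of the current target frame, one row per pixel
    keys:   (M, C) key features gathered from the reference frames in the memory bank
    values: (M, D) pixels (or features) of those reference frames
    Returns (N, D) reconstructed pixels of frame t."""
    affinity = torch.softmax(q_t @ keys.t(), dim=-1)  # A_t: (N, M)
    return affinity @ values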
The tracker is trained with the loss between the reconstructed frame Î_t and the original frame I_t; the loss function is formula (4):

L_m = Σ_i z_i    (4)

wherein L_m is the Huber loss function of tracker 2, trained with the consistency of the pixel recombination.
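The per-pixel term z_i of formula (4) is not spelled out in the text; a sketch assuming the conventional Huber (smooth-L1) definition with threshold δ applied to the pixel-wise reconstruction error between Î_t and I_t:

```latex
z_i =
\begin{cases}
\dfrac{1}{2}\left(\hat{I}_t^{(i)} - I_t^{(i)}\right)^2, & \left|\hat{I}_t^{(i)} - I_t^{(i)}\right| \le \delta,\\[6pt]
\delta\left(\left|\hat{I}_t^{(i)} - I_t^{(i)}\right| - \dfrac{\delta}{2}\right), & \text{otherwise},
\end{cases}
\qquad L_m = \sum_i z_i .
```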
The purpose of the selection module is to select the most suitable target result according to the tracking results; in the selection module, the results of the two trackers need to be processed simultaneously so that the better of tracker T1 and tracker T2 can be selected. The specific contents are as follows:
(1) Obtaining the overlap value IOU between the candidate box and the pseudo label. Because unsupervised learning has no manually annotated ground-truth label, after the two trackers track forward to obtain predicted target positions, each predicted position is used as a pseudo label for backward tracking, skipping an interval of n frames; a new predicted position is obtained in the initial frame, and the overlap ratio IOU is computed between the newly predicted target box and the annotation box in the initial frame. The label P required for selector training is obtained from the IOU values according to formula (6).
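Formula (6) for the label P is not reproduced above; a minimal sketch, assuming P simply indicates which tracker's forward-backward box overlaps the initial annotation more (the comparison rule and the box format are hypothetical):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def selector_label(back_box_1, back_box_2, init_annotation):
    """Pseudo label P for selector training: 1 if tracker 1's back-tracked box
    overlaps the initial annotation at least as much as tracker 2's, else 0
    (assumed thresholding rule for formula (6))."""
    return 1 if iou(back_box_1, init_annotation) >= iou(back_box_2, init_annotation) else 0
```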
(2) The selection module consists of two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function. The feature map obtained by each candidate tracker is input into the selection module, which outputs a probability value estimating the precision of each of the two target boxes. The selector is trained with a cross-entropy loss function; the loss function is formula (7):

L_s = -Σ [p·ln(a) + (1-p)·ln(1-a)]    (7)

wherein a is the probability value of the target-box precision estimate produced by the selector from the features, and p is the label required for selector training.
(3) In the tracking stage, the trackers locate the target, the feature maps and positioning results obtained by the trackers are used as input to the selection module, and the selection module directly judges and outputs the result of the optimal tracker.
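At inference time the selection therefore reduces to running both trackers on each frame and keeping the box with the higher selector score; the tracker interface (a track method returning a feature map and a box, and an update method) and the way the two scores are compared are hypothetical placeholders.

```python
def track_frame(frame, tracker1, tracker2, selector):
    """Run both trackers on the incoming frame and keep the result the selector prefers."""
    feat1, box1 = tracker1.track(frame)
    feat2, box2 = tracker2.track(frame)
    score1 = float(selector(feat1)[..., 0])  # estimated precision of tracker 1's box
    score2 = float(selector(feat2)[..., 0])  # estimated precision of tracker 2's box
    best = box1 if score1 >= score2 else box2
    tracker1.update(best)                    # both trackers continue from the selected result
    tracker2.update(best)
    return best
```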
The above embodiments merely illustrate the preferred embodiments of the present invention and do not limit its scope; various modifications and improvements made to the technical solution of the present invention by those skilled in the art, without departing from the spirit of the present invention, shall fall within the protection scope defined by the claims of the present invention.
Claims (3)
1. A high-precision target tracking system based on unsupervised learning is characterized in that: the system comprises the following modules:
the image acquisition module is used for acquiring a video image;
the tracking module comprises tracker 1 and tracker 2 and is used for obtaining image features and the target bounding box;
the selection module comprises two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the feature maps and results of the candidate trackers are taken as input, and the selector outputs the best tracking result.
2. The high-precision target tracking system based on unsupervised learning of claim 1, wherein: tracker 1 and tracker 2 in the tracking module are two different trackers, suited to tracking in different scenes, and are trained with two different loss functions, L_C of formula (1) and L_m of formula (2):

L_m = Σ_i z_i    (2)

wherein L_C is the loss function of tracker 1, R_T is the label of the template patch cropped from the initial frame, which is a Gaussian response centered at the center of the initial bounding box, Z_T is the response map generated from the second search frame during back-tracking, and training uses cycle consistency; L_m is the Huber loss function of tracker 2, Î_t is the reconstructed frame, I_t is the real frame, and training uses the consistency of the pixel recombination.
3. The high-precision target tracking system based on unsupervised learning of claim 1, wherein the specific contents of the selection module are as follows:
the selection module aims to select the most appropriate target result according to the tracking results; in the selection module, the results of the two trackers are processed simultaneously, so that the better of tracker 1 and tracker 2 can be selected;
(1) obtaining the overlap value IOU between the candidate box and the pseudo label: after the two trackers track forward to obtain a predicted target position, the predicted position is used as a pseudo label for backward tracking, skipping an interval of n frames; a new predicted position is obtained in the initial frame, and the overlap IOU is computed between the newly estimated target box and the annotation box in the initial frame; the label P required for selector training is obtained from the IOU values according to formula (4);
(2) the selection module consists of two fully connected layers and a softmax layer, wherein each fully connected layer comprises a linear fully connected layer and a ReLU activation function; the feature map obtained by each candidate tracker is input into the selection module, which outputs a probability value estimating the precision of each of the two target boxes; the selector is trained with the cross-entropy loss function of formula (5):

L_s = -Σ [p·ln(a) + (1-p)·ln(1-a)]    (5)

wherein a is the probability value of the target-box precision estimate produced by the selector from the features, and p is the label required for selector training;
(3) in the tracking stage, the trackers locate the target, the feature maps and positioning results obtained by the trackers are used as input to the selection module, and the selection module directly judges and outputs the result of the optimal tracker.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110523935.8A CN113256680A (en) | 2021-05-13 | 2021-05-13 | High-precision target tracking system based on unsupervised learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113256680A true CN113256680A (en) | 2021-08-13 |
Family
ID=77181801
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110523935.8A Pending CN113256680A (en) | 2021-05-13 | 2021-05-13 | High-precision target tracking system based on unsupervised learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113256680A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106682696A (en) * | 2016-12-29 | 2017-05-17 | 华中科技大学 | Multi-example detection network based on refining of online example classifier and training method thereof |
US20190228266A1 (en) * | 2018-01-22 | 2019-07-25 | Qualcomm Incorporated | Failure detection for a neural network object tracker |
US20190347806A1 (en) * | 2018-05-09 | 2019-11-14 | Figure Eight Technologies, Inc. | Video object tracking |
CN109978045A (en) * | 2019-03-20 | 2019-07-05 | 深圳市道通智能航空技术有限公司 | A kind of method for tracking target, device and unmanned plane |
CN110569793A (en) * | 2019-09-09 | 2019-12-13 | 西南交通大学 | Target tracking method for unsupervised similarity discrimination learning |
CN111161558A (en) * | 2019-12-16 | 2020-05-15 | 华东师范大学 | Method for judging forklift driving position in real time based on deep learning |
CN111950367A (en) * | 2020-07-08 | 2020-11-17 | 中国科学院大学 | Unsupervised vehicle re-identification method for aerial images |
Non-Patent Citations (3)
Title |
---|
Bineng Zhong et al., "Visual tracking via weakly supervised learning from multiple imperfect oracles", 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
Ning Wang et al., "Unsupervised deep tracking", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
Zihang Lai et al., "MAST: A Memory-Augmented Self-Supervised Tracker", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |