CN110009661B - Video target tracking method - Google Patents
- Publication number
- CN110009661B
- Authority
- CN
- China
- Prior art keywords
- classifier network
- target
- network
- classifier
- particles
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the technical field of image and video target tracking and provides a video target tracking method that can continuously track a specific single target in a video, drawing on related knowledge of image processing. First, a fast target tracker is trained using deep mutual learning and knowledge distillation. Second, when each frame arrives, many particles are scattered randomly around the target position of the previous frame. A large image region containing all the particles is then selected. The image region and the relative positions of the particles are sent to the target tracker to obtain a score for each particle, and the highest-scoring result is selected. After bounding-box regression, this becomes the final result. Finally, the tracker is updated online whenever tracking fails or after a fixed interval. The advantage of the invention is that it changes the traditional sampling method: sampling at the image level is replaced by sampling at the feature level, which greatly improves speed while maintaining accuracy.
Description
Technical Field
The invention belongs to the technical field of image and video target tracking; it can continuously track a specific single target in a video and relates to the related knowledge of image processing.
Background
With the continuous development of image processing technology, video target tracking plays an important role in daily life owing to its practicality.
Video target tracking methods mainly fall into two categories: particle filter methods and correlation filter methods. A correlation filter method uses target features to perform correlation matching around the target of the previous frame; the location of the highest response is the target position in the current frame. Within this family, the method "High-Speed Tracking with Kernelized Correlation Filters", published by J. F. Henriques et al. in the PAMI journal in 2014, converts the computation into the Fourier domain, which greatly accelerates calculation and achieves real-time speed. Danelljan M. et al. then published "Learning Spatially Regularized Correlation Filters for Visual Tracking" at the ICCV conference in 2015, which suppresses the edge information of the filter so that it locates the target position more accurately, further improving precision. At the ECCV conference in 2016, Danelljan M. et al. proposed "Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking", which interpolates feature maps of different resolutions into a continuous spatial domain through cubic interpolation and then applies a Hessian matrix to obtain the target position with sub-pixel precision. In 2017, Danelljan M. et al. further proposed ECO (Efficient Convolution Operators for Tracking) at the CVPR conference, which adopts factorized convolution operations and simplifies feature extraction, finally obtaining a faster and stronger tracker. A particle filter method, by contrast, scatters a large number of particles around the target position of the previous frame and judges whether the image block inside each particle is the target in order to determine the current position. Because enough particles are scattered, the obtained position information is accurate, and the precision of particle filter methods is generally high.
However, because a large number of particles must be scattered, the amount of computation is large and the speed of such methods is generally slow, so their scalability is poor. A typical representative is "Learning Multi-Domain Convolutional Neural Networks for Visual Tracking", published by Nam H. et al. at the CVPR conference in 2016.
Although current particle filter algorithms achieve good results, several problems remain to be solved. First, the accuracy of particle filter algorithms no longer holds a clear advantage: with the vigorous development of correlation filter algorithms, the two families have become comparable in accuracy. Second, a conventional particle filter must make a judgment for every image block, and to maintain accuracy the number of particles must be large, so particle filter algorithms are slow.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: for any given video, the target can be continuously tracked knowing only the target position given in the first frame, with no other information, and tracking is maintained throughout the subsequent video sequence. Moreover, the method can handle complex scenes, illumination changes, similar objects, occlusion, and other difficult situations in the video; even when the target is in a complex scene or occluded by a similar object, it can still be tracked.
The technical scheme of the invention is based on an observed conclusion: in video target tracking, the position offset of the target between two consecutive frames is small, and its shape does not change abruptly. Therefore, the position of the target in the current frame can be found by scattering particles around the target of the previous frame and judging the image blocks inside the particles, so that the target is continuously tracked. Moreover, because the computational complexity of the conventional particle filter is very high and its tracking speed can be as low as 1 FPS, the traditional sampling method is changed: sampling at the image level is replaced by sampling at the feature level, which greatly improves speed while maintaining accuracy. The specific steps are as follows:
A video target tracking method comprises the following steps:
firstly, an off-line training stage:
Step 1: train a classifier network using a public database. The input of the classifier network is an image block and the output is the score of that image block: the output is 1 if the image block is target foreground and 0 if it is background. The classifier network distinguishes whether an input image block is target foreground or background and scores it;
Step 2: adopt the deep mutual learning method: two identical classifier networks are trained simultaneously, a connection is established at the network layer before the result, and KL divergence is used for mutual supervision, so that both classifier networks obtain stronger classification ability;
Step 3: adopt a knowledge distillation method, using the trained classifier network as a teacher to guide the training of a new classifier network. The input of the new classifier network is the image block together with the coordinates and shapes of the particles, and the output is the score of each particle box. During training, corresponding layers of the two classifier networks are connected and mutually supervised with an MSE loss, so that the new classifier network learns the capability of the original classifier network;
Step 4: similarly to step 2, perform deep mutual learning training on the basis of the new classifier network learned in step 3, establishing connections at the network layer before the result for mutual supervision, so that the new classifier network retains excellent speed while achieving higher accuracy; this yields the final classifier network;
secondly, an online tracking stage:
Step 5: for a given first frame image, take a number of particles around the target ground truth and fine-tune the classifier network trained in step 4 with the obtained ground truth, so that the classifier better adapts to the video; also train, from the ground truth, a bounding-box regressor capable of fine-tuning the final result: its input is the features of the particles and its output is the adjusted target position;
Step 6: scatter a large number of particles, differing in size and shape, around the target in the first frame of the given video; because the target position does not change abruptly between two adjacent frames, some of the scattered particles will surround the target object well;
Step 7: input a large image block containing all particles, together with the coordinates of the particles, into the classifier network to obtain the score of each particle, as in formula (8), where $s_i^t$ denotes the score of the i-th particle in the t-th frame of the video sequence; select the five particles with the highest scores and take their average; send the average into the bounding-box regressor and take its output as the final tracking result;
Step 8: store the classifier network features of each output result; when the classifier score falls below 0.5, fine-tune the classifier network with the stored features and expand resampling. A low classifier score means either that the network is no longer suited to the current frame and must be retrained to better adapt to the target, or that the target has moved significantly and resampling is needed to recover its position;
Step 9: fine-tune the network with the stored features every 20 frames; because the target's shape changes significantly after too long a time, it can no longer be tracked with the initially trained network, and a new classifier network must be trained to better adapt to the target.
The invention has the following beneficial effects: the method can accurately and quickly track a single target and performs well even under poor external environmental conditions. Compared with common particle filtering, the speed is greatly improved while the precision is almost equal, and real-time performance is guaranteed.
Drawings
FIG. 1 is a block diagram of offline training. Fig. 1(a) shows two target trackers performing deep mutual learning training; the upper and lower networks are identical. Fig. 1(b) shows a trained target tracker (upper) used as a teacher, via knowledge distillation, to guide the training of a fast target tracker (lower). Fig. 1(c) shows two fast target trackers performing deep mutual learning training; the upper and lower networks are identical.
Fig. 2 shows the results of the target tracker on several videos. The first picture in each row is the first frame of the video; the green bounding box is the ground truth and the red bounding box is the tracking result of our invention.
Detailed Description
The following further describes the specific embodiments of the present invention in combination with the technical solutions.
A video target tracking method comprises the following steps:
firstly, an off-line training stage:
Step 1: train a classifier network using a public database. The input of the classifier network is an image block and the output is the score of that image block: the output is 1 if the image block is target foreground and 0 if it is background. The classifier network distinguishes whether an input image block is target foreground or background and scores it. As in formula (1), $p_1^m(x_i^t) = \exp(z_1^m(x_i^t)) / \sum_{k=1}^{2} \exp(z_1^k(x_i^t))$ (1), where $x_i^t$ denotes the i-th image block taken at the t-th frame of the video sequence, $z_1^m(x_i^t)$ denotes the class-m feature of the image block in classifier network 1 (m takes the values 1 and 2), and $p_1^m(x_i^t)$ denotes the class-m output of the softmax layer of classifier network 1. Formula (2), $L_{C_1} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{m=1}^{2} I(y_i^t, m)\,\log p_1^m(x_i^t)$ (2), represents the supervision loss of a classifier network, where $y_i^t$ denotes the ground-truth label of the i-th image block taken at the t-th frame, the indicator $I(y_i^t, m)$ is 1 when the ground-truth label equals the output class and 0 otherwise, and $L_{C_1}$ represents the loss of classifier network 1 over N image blocks taken from the image sequence;
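Purely as an illustration (not part of the claimed method), the softmax output of formula (1) and the cross-entropy supervision loss of formula (2) can be sketched in NumPy; the function names and the two-class layout are assumptions:

```python
import numpy as np

def softmax(z):
    """Formula (1): p_m = exp(z_m) / sum_k exp(z_k), over the class axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def classification_loss(logits, labels):
    """Formula (2): cross-entropy supervision loss averaged over N image blocks."""
    p = softmax(logits)                    # (N, 2) class probabilities
    n = np.arange(len(labels))
    return -np.mean(np.log(p[n, labels]))  # -1/N * sum over blocks of log p_y
```

For an image block scored confidently as foreground, the loss is close to zero, matching the 1/0 target outputs described above.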
Step 2: adopt the deep mutual learning method: two identical classifier networks are trained simultaneously, a connection is established at the network layer before the result, and KL divergence is used for mutual supervision, so that both classifier networks obtain stronger classification ability. In formula (3), $D_{KL}(p_2 \| p_1) = \frac{1}{N}\sum_{i=1}^{N}\sum_{m=1}^{2} p_2^m(x_i^t) \log \frac{p_2^m(x_i^t)}{p_1^m(x_i^t)}$ (3) represents the mutual supervision by KL divergence, where $p_1^m(x_i^t)$ and $p_2^m(x_i^t)$ denote the class-m outputs of the softmax layers of classifier networks 1 and 2 respectively, and the KL divergence is computed over N image blocks taken from the image sequence. Formula (4), $L_{\Theta_1} = L_{C_1} + \lambda_1 D_{KL}(p_2 \| p_1)$ and $L_{\Theta_2} = L_{C_2} + \lambda_2 D_{KL}(p_1 \| p_2)$ (4), gives the final losses of classifier networks 1 and 2, where $L_{C_1}$ and $L_{C_2}$ denote the losses of the two classifier networks over the N image blocks, and $\lambda_1$, $\lambda_2$ are hyper-parameters adjusting the relationship between the losses;
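The KL-divergence mutual supervision of formulas (3) and (4) can likewise be sketched; `lam1`/`lam2` stand in for the hyper-parameters λ1 and λ2, and the per-network cross-entropy terms are passed in as precomputed scalars (an assumption made for brevity):

```python
import numpy as np

def kl_divergence(p2, p1, eps=1e-12):
    """Formula (3): D_KL(p2 || p1) averaged over the N sampled image blocks."""
    p1 = np.clip(p1, eps, 1.0)
    p2 = np.clip(p2, eps, 1.0)
    return np.mean(np.sum(p2 * np.log(p2 / p1), axis=-1))

def mutual_losses(ce1, ce2, p1, p2, lam1=1.0, lam2=1.0):
    """Formula (4): each peer network adds a weighted KL term pulling its
    softmax output toward the other network's prediction."""
    l1 = ce1 + lam1 * kl_divergence(p2, p1)  # network 1 mimics network 2
    l2 = ce2 + lam2 * kl_divergence(p1, p2)  # network 2 mimics network 1
    return l1, l2
```

When the two networks agree exactly, the KL terms vanish and each loss reduces to its own classification loss, which is the intended behavior of the mutual-supervision term.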
Step 3: adopt a knowledge distillation method, using the trained classifier network as a teacher to guide the training of a new classifier network. The input of the new classifier network is the image block together with the coordinates and shapes of the particles, and the output is the score of each particle box. During training, corresponding layers of the two classifier networks are connected and mutually supervised with an MSE loss, so that the new classifier network learns the capability of the original classifier network. As in formula (5), $L_{MSE}^l = \frac{1}{WH}\sum_{k=1}^{WH} \left(u_l^k(x_i^t) - v_l^k(x_i^t)\right)^2$ (5), where $u_l^k(x_i^t)$ and $v_l^k(x_i^t)$ denote the outputs at coordinate k of the l-th layer of classifier networks $\Theta_1$ and $\Theta_2$ respectively for image block $x_i^t$, W and H denote the width and height of the l-th layer output, and $L_{MSE}^l$ represents the MSE loss of the l-th layer of the classifier network. Formula (6), $L_{MSE}^{\Theta_1} = \alpha \sum_{l=1}^{L} L_{MSE}^l$ and $L_{MSE}^{\Theta_2} = \beta \sum_{l=1}^{L} L_{MSE}^l$ (6), gives the network MSE losses of classifier networks $\Theta_1$ and $\Theta_2$ obtained by superposing the losses of L layers, where α and β are hyper-parameters used to adjust the loss ratio. The final supervision loss is the superposition of the classification loss and the MSE loss, as in formula (7): $L = L_C + L_{MSE}$ (7);
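A minimal sketch of the layer-wise distillation loss of formulas (5)-(7), assuming teacher and student feature maps are supplied as equally shaped arrays, one per connected layer (the single weight `alpha` shown here covers one network's side of formula (6)):

```python
import numpy as np

def layer_mse(f_teacher, f_student):
    """Formula (5): MSE between the two networks' feature maps of one layer,
    averaged over all W*H spatial coordinates."""
    return np.mean((f_teacher - f_student) ** 2)

def distillation_loss(teacher_feats, student_feats, cls_loss, alpha=1.0):
    """Formulas (6)-(7): total supervision = classification loss plus the
    weighted sum of per-layer MSE losses over the L connected layers."""
    mse = sum(layer_mse(t, s) for t, s in zip(teacher_feats, student_feats))
    return cls_loss + alpha * mse
```

With identical features the MSE term is zero and only the classification loss remains, so the student is penalized exactly for deviating from the teacher's intermediate representations.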
Step 4: similarly to step 2, perform deep mutual learning training on the basis of the new classifier network learned in step 3, establishing connections at the network layer before the result for mutual supervision, so that the new classifier network retains excellent speed while achieving higher accuracy; this yields the final classifier network;
secondly, an online tracking stage:
Step 5: for a given first frame image, take a number of particles around the target ground truth and fine-tune the classifier network trained in step 4 with the obtained ground truth, so that the classifier better adapts to the video; also train, from the ground truth, a bounding-box regressor capable of fine-tuning the final result: its input is the features of the particles and its output is the adjusted target position;
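The bounding-box regressor is not specified in detail here. One common realization, shown purely as a hedged sketch, is a ridge regression from particle features to box offsets; the log-scale width/height parameterization and the regularization constant are assumptions, not values fixed by the invention:

```python
import numpy as np

def train_bbox_regressor(features, deltas, reg=1e-6):
    """Fit a ridge regressor W mapping particle features to box adjustments
    (dx, dy, dw, dh), trained once on the first-frame ground truth."""
    X = np.hstack([features, np.ones((len(features), 1))])  # append bias column
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ deltas)
    return W

def refine_box(W, feature, box):
    """Apply the regressor to shift and rescale the tracked box (cx, cy, w, h)."""
    x = np.append(feature, 1.0)
    dx, dy, dw, dh = x @ W
    cx, cy, w, h = box
    return (cx + dx * w, cy + dy * h, w * np.exp(dw), h * np.exp(dh))
```

With zero-offset training targets the learned weights are zero and the box passes through unchanged, which is a quick sanity check on the parameterization.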
Step 6: scatter a large number of particles, differing in size and shape, around the target in the first frame of the given video; because the target position does not change abruptly between two adjacent frames, some of the scattered particles will surround the target object well;
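Particle scattering as described above can be sketched as Gaussian sampling around the previous box; the spread parameters `pos_sigma` and `scale_sigma` are illustrative choices, not values fixed by the invention:

```python
import numpy as np

def scatter_particles(box, n=256, pos_sigma=0.3, scale_sigma=0.1, rng=None):
    """Draw n candidate boxes (cx, cy, w, h) around the previous target box.
    Positions are perturbed by a Gaussian proportional to the target size;
    width and height are perturbed in log-scale so particles vary in size
    and aspect ratio, as required by step 6."""
    rng = np.random.default_rng(rng)
    cx, cy, w, h = box
    xs = cx + rng.normal(0.0, pos_sigma * w, n)
    ys = cy + rng.normal(0.0, pos_sigma * h, n)
    ws = w * np.exp(rng.normal(0.0, scale_sigma, n))
    hs = h * np.exp(rng.normal(0.0, scale_sigma, n))
    return np.stack([xs, ys, ws, hs], axis=1)
```

The log-scale perturbation keeps all sampled widths and heights strictly positive, which a plain additive Gaussian would not guarantee.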
Step 7: input a large image block containing all particles, together with the coordinates of the particles, into the classifier network to obtain the score of each particle, as in formula (8), $\bar{b}^t = \frac{1}{5}\sum_{i \in \mathrm{top}_5(s^t)} b_i^t$ (8), where $s_i^t$ denotes the score of the i-th particle in the t-th frame of the video sequence and $b_i^t$ its particle box; select the five particles with the highest scores and take their average; send the average into the bounding-box regressor and take its output as the final tracking result;
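The top-five selection and averaging of formula (8) reduces to a few lines; this sketch averages the raw box parameters of the best-scoring particles, which is one plausible reading of the step:

```python
import numpy as np

def top_k_average(particles, scores, k=5):
    """Formula (8): select the k highest-scoring particle boxes and average
    them; the mean box is then handed to the bounding-box regressor."""
    idx = np.argsort(scores)[-k:]      # indices of the k best particles
    return particles[idx].mean(axis=0)
```

Averaging several high-scoring boxes, rather than taking only the single best, smooths out scoring noise between near-identical candidates.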
Step 8: store the classifier network features of each output result; when the classifier score falls below 0.5, fine-tune the classifier network with the stored features and expand resampling. A low classifier score means either that the network is no longer suited to the current frame and must be retrained to better adapt to the target, or that the target has moved significantly and resampling is needed to recover its position;
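Steps 8 and 9 together define the online update policy: fine-tune immediately on failure and routinely at a fixed interval. A minimal sketch using the 0.5 threshold and 20-frame interval stated in the text (the function name is an assumption):

```python
def should_update(score, frame_idx, threshold=0.5, interval=20):
    """Decide when to fine-tune the tracker online: immediately when the best
    particle score drops below the threshold (tracking failure, step 8), and
    periodically every `interval` frames to absorb appearance change (step 9)."""
    failed = score < threshold
    periodic = frame_idx > 0 and frame_idx % interval == 0
    return failed or periodic
```

A confident score on a non-interval frame triggers no update, so the expensive fine-tuning runs only when the text's two conditions demand it.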
Step 9: fine-tune the network with the stored features every 20 frames; because the target's shape changes significantly after too long a time, it can no longer be tracked with the initially trained network, and a new classifier network must be trained to better adapt to the target.
Claims (1)
1. A video target tracking method, characterized by comprising the following steps:
firstly, an off-line training stage:
Step 1: train a classifier network using a public database. The input of the classifier network is an image block and the output is the score of that image block: the output is 1 if the image block is target foreground and 0 if it is background. The classifier network distinguishes whether an input image block is target foreground or background and scores it. As in formula (1), $p_1^m(x_i^t) = \exp(z_1^m(x_i^t)) / \sum_{k=1}^{2} \exp(z_1^k(x_i^t))$ (1), where $x_i^t$ denotes the i-th image block taken at the t-th frame of the video sequence, $z_1^m(x_i^t)$ denotes the class-m feature of the image block in classifier network 1 (m takes the values 1 and 2), and $p_1^m(x_i^t)$ denotes the class-m output of the softmax layer of classifier network 1. Formula (2), $L_{C_1} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{m=1}^{2} I(y_i^t, m)\,\log p_1^m(x_i^t)$ (2), represents the supervision loss of a classifier network, where $y_i^t$ denotes the ground-truth label of the i-th image block taken at the t-th frame, the indicator $I(y_i^t, m)$ is 1 when the ground-truth label equals the output class and 0 otherwise, and $L_{C_1}$ represents the loss of classifier network 1 over N image blocks taken from the image sequence;
Step 2: adopt the deep mutual learning method: two identical classifier networks are trained simultaneously, a connection is established at the network layer before the result, and KL divergence is used for mutual supervision, so that both classifier networks obtain stronger classification ability. In formula (3), $D_{KL}(p_2 \| p_1) = \frac{1}{N}\sum_{i=1}^{N}\sum_{m=1}^{2} p_2^m(x_i^t) \log \frac{p_2^m(x_i^t)}{p_1^m(x_i^t)}$ (3) represents the mutual supervision by KL divergence, where $p_1^m(x_i^t)$ and $p_2^m(x_i^t)$ denote the class-m outputs of the softmax layers of classifier networks 1 and 2 respectively, and the KL divergence is computed over N image blocks taken from the image sequence. Formula (4), $L_{\Theta_1} = L_{C_1} + \lambda_1 D_{KL}(p_2 \| p_1)$ and $L_{\Theta_2} = L_{C_2} + \lambda_2 D_{KL}(p_1 \| p_2)$ (4), gives the final supervision losses of classifier networks 1 and 2, where $L_{C_1}$ and $L_{C_2}$ denote the losses of the two classifier networks over the N image blocks, and $\lambda_1$, $\lambda_2$ are hyper-parameters adjusting the relationship between the losses;
Step 3: adopt a knowledge distillation method, using the trained classifier network as a teacher to guide the training of a new classifier network. The input of the new classifier network is the image block together with the coordinates and shapes of the particles, and the output is the score of each particle box. During training, corresponding layers of the two classifier networks are connected and mutually supervised with an MSE loss, so that the new classifier network learns the capability of the original classifier network. As in formula (5), $L_{MSE}^l = \frac{1}{WH}\sum_{k=1}^{WH} \left(u_l^k(x_i^t) - v_l^k(x_i^t)\right)^2$ (5), where $u_l^k(x_i^t)$ and $v_l^k(x_i^t)$ denote the outputs at coordinate k of the l-th layer of classifier networks $\Theta_1$ and $\Theta_2$ respectively for image block $x_i^t$, W and H denote the width and height of the l-th layer output, and $L_{MSE}^l$ represents the MSE loss of the l-th layer of the classifier network. Formula (6), $L_{MSE}^{\Theta_1} = \alpha \sum_{l=1}^{L} L_{MSE}^l$ and $L_{MSE}^{\Theta_2} = \beta \sum_{l=1}^{L} L_{MSE}^l$ (6), gives the network MSE losses of classifier networks $\Theta_1$ and $\Theta_2$ obtained by superposing the losses of L layers, where α and β are hyper-parameters used to adjust the loss ratio. The final supervision loss is the superposition of the classification loss and the MSE loss, as in formula (7): $L = L_C + L_{MSE}$ (7);
Step 4: similarly to step 2, perform deep mutual learning training on the basis of the new classifier network learned in step 3, establishing connections at the network layer before the result for mutual supervision, so that the new classifier network retains excellent speed while achieving higher accuracy; this yields the final classifier network;
secondly, an online tracking stage:
Step 5: for a given first frame image, take a number of particles around the target ground truth and fine-tune the classifier network trained in step 4 with the obtained ground truth, so that the classifier better adapts to the video; also train, from the ground truth, a bounding-box regressor capable of fine-tuning the final result: its input is the features of the particles and its output is the adjusted target position;
Step 6: scatter a large number of particles, differing in size and shape, around the target in the first frame of the given video; because the target position does not change abruptly between two adjacent frames, some of the scattered particles will surround the target object well;
Step 7: input a large image block containing all particles, together with the coordinates of all particles, into the classifier network to obtain the score of each particle, as in formula (8), $\bar{b}^t = \frac{1}{5}\sum_{i \in \mathrm{top}_5(s^t)} b_i^t$ (8), where $s_i^t$ denotes the score of the i-th particle in the t-th frame of the video sequence and $b_i^t$ its particle box; select the five particles with the highest scores and take their average; send the average into the bounding-box regressor and take its output as the final tracking result;
Step 8: store the classifier network features of each output result; when the classifier score falls below 0.5, fine-tune the classifier network with the stored features and expand resampling. A low classifier score means either that the network is no longer suited to the current frame and must be retrained to better adapt to the target, or that the target has moved significantly and resampling is needed to recover its position;
Step 9: fine-tune the network with the stored features every 20 frames; because the target's shape changes significantly after too long a time, it can no longer be tracked with the initially trained network, and a new classifier network must be trained to better adapt to the target.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910249323.7A CN110009661B (en) | 2019-03-29 | 2019-03-29 | Video target tracking method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910249323.7A CN110009661B (en) | 2019-03-29 | 2019-03-29 | Video target tracking method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110009661A CN110009661A (en) | 2019-07-12 |
CN110009661B true CN110009661B (en) | 2022-03-29 |
Family
ID=67168857
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910249323.7A Active CN110009661B (en) | 2019-03-29 | 2019-03-29 | Video target tracking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110009661B (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102750540B (en) * | 2012-06-12 | 2015-03-11 | 大连理工大学 | Morphological filtering enhancement-based maximally stable extremal region (MSER) video text detection method |
WO2015016787A2 (en) * | 2013-07-29 | 2015-02-05 | Galbavy Vladimir | Board game for teaching body transformation principles |
US10402701B2 (en) * | 2017-03-17 | 2019-09-03 | Nec Corporation | Face recognition system for face recognition in unlabeled videos with domain adversarial learning and knowledge distillation |
CN107452025A (en) * | 2017-08-18 | 2017-12-08 | 成都通甲优博科技有限责任公司 | Method for tracking target, device and electronic equipment |
CN109389621B (en) * | 2018-09-11 | 2021-04-06 | 淮阴工学院 | RGB-D target tracking method based on multi-mode depth feature fusion |
- 2019-03-29: application CN201910249323.7A filed; granted as patent CN110009661B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN110009661A (en) | 2019-07-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110443827B (en) | Unmanned aerial vehicle video single-target long-term tracking method based on improved twin network | |
CN109410242B (en) | Target tracking method, system, equipment and medium based on double-current convolutional neural network | |
CN108648161B (en) | Binocular vision obstacle detection system and method of asymmetric kernel convolution neural network | |
CN112435282B (en) | Real-time binocular stereo matching method based on self-adaptive candidate parallax prediction network | |
CN110782477A (en) | Moving target rapid detection method based on sequence image and computer vision system | |
CN111161317A (en) | Single-target tracking method based on multiple networks | |
CN108573221A (en) | A kind of robot target part conspicuousness detection method of view-based access control model | |
Li et al. | ADTrack: Target-aware dual filter learning for real-time anti-dark UAV tracking | |
CN110399840B (en) | Rapid lawn semantic segmentation and boundary detection method | |
CN110321937B (en) | Motion human body tracking method combining fast-RCNN with Kalman filtering | |
CN113706581B (en) | Target tracking method based on residual channel attention and multi-level classification regression | |
CN111583279A (en) | Super-pixel image segmentation method based on PCBA | |
CN113744311A (en) | Twin neural network moving target tracking method based on full-connection attention module | |
CN113011329A (en) | Pyramid network based on multi-scale features and dense crowd counting method | |
CN114037938B (en) | NFL-Net-based low-illumination target detection method | |
CN111199245A (en) | Rape pest identification method | |
Yuan et al. | Self-supervised object tracking with cycle-consistent siamese networks | |
CN110544267B (en) | Correlation filtering tracking method for self-adaptive selection characteristics | |
Ruiz et al. | IDA: Improved data augmentation applied to salient object detection | |
CN108280845B (en) | Scale self-adaptive target tracking method for complex background | |
CN112767440B (en) | Target tracking method based on SIAM-FC network | |
Casagrande et al. | Abnormal motion analysis for tracking-based approaches using region-based method with mobile grid | |
CN110322479B (en) | Dual-core KCF target tracking method based on space-time significance | |
CN110009661B (en) | Video target tracking method | |
CN111091583B (en) | Long-term target tracking method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||