CN110706253B - Target tracking method, system and device based on apparent feature and depth feature


Info

Publication number
CN110706253B
Authority
CN
China
Prior art keywords
target
depth
region
feature
features
Prior art date
Legal status
Active
Application number
CN201910884524.4A
Other languages
Chinese (zh)
Other versions
CN110706253A (en)
Inventor
胡卫明
李晶
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201910884524.4A
Publication of CN110706253A
Application granted
Publication of CN110706253B

Classifications

    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06V2201/07 Target detection


Abstract

The invention belongs to the technical field of computer vision tracking, and particularly relates to a target tracking method, system and device based on apparent features and depth features, aiming at solving the problem of low tracking accuracy caused by neglecting the depth information of the target scene in existing target tracking methods. The method comprises: obtaining the target region and the search region of the target to be tracked in the t-th frame image according to the target position of the (t-1)-th frame and a preset target size; extracting the apparent features and the depth features of the target region and the search region through an apparent feature extraction network and a depth feature extraction network, respectively; performing a weighted average of the apparent features and the depth features of the target region and the search region based on preset weights to obtain their respective fused features; obtaining a response map of the target through a correlation filter according to the fused features of the target region and the search region; and taking the position corresponding to the peak of the response map as the target position of the t-th frame. The invention extracts the depth information of the target scene and improves target tracking accuracy.

Description

Target tracking method, system and device based on apparent feature and depth feature
Technical Field
The invention belongs to the technical field of computer vision tracking, and particularly relates to a target tracking method, system and device based on apparent characteristics and depth characteristics.
Background
Object tracking is one of the most fundamental problems in the field of computer vision; its task is to estimate the motion trajectory of an object or image region in a video sequence. Object tracking has a very wide range of applications in real scenes, often serving as a component in larger computer vision systems. For example, both autonomous driving and vision-based active safety systems rely on tracking the positions of vehicles, cyclists and pedestrians. In robotic systems, tracking an object of interest is a very important aspect of visual perception, extracting high-level information from camera sensors for decision-making and navigation. Beyond robotics, target tracking is also often used for automatic video analysis, where information is extracted by first detecting and tracking the players and objects involved in a game. Other applications include augmented reality and dynamic-structure techniques, which often require tracking different local image regions. This diversity of applications shows that the target tracking problem itself is very diverse.
In recent decades the field of target tracking has made breakthrough progress and produced numerous classical research results. However, many theoretical and technical problems remain, in particular the complex situations encountered in open environments during tracking, such as background interference, illumination changes, scale changes and occlusion. How to track a target adaptively, in real time and robustly in complex scenes therefore remains an open problem for researchers, with large research value and room for improvement.
For single-target tracking, the quality of the features directly determines the tracking performance. Early discriminative models based on hand-crafted features can only extract shallow features of the target object and cannot describe its essence well; the more recent convolutional neural networks can learn feature representations of the target object at different levels through their hierarchical structure, but they ignore the global depth information of the target scene. Depth information can serve as an auxiliary feature that provides global information about the target and alleviates problems such as occlusion, thereby improving the robustness of the model in complex scenes. Therefore, the invention provides a target tracking method based on apparent features and depth features.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, to solve the problem that the tracking accuracy is low due to the fact that the depth information of the target scene is ignored in the existing target tracking method, in a first aspect of the present invention, a method for tracking a target based on an apparent feature and a depth feature is provided, the method including:
step S10, acquiring the area of the target to be tracked in the t frame image according to the target position of the t-1 frame and the preset target size, and taking the area as the target area; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
step S20, respectively extracting the apparent features and the depth features corresponding to the target area and the search area through an apparent feature extraction network and a depth feature extraction network;
step S30, respectively carrying out weighted average on the apparent features and the depth features corresponding to the target region and the search region based on preset weights to obtain fusion features of the target region and the search region;
step S40, obtaining a response map of the target to be tracked through a correlation filter according to the fused features of the target region and the fused features of the search region; and taking the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
In some preferred embodiments, in step S10, "acquiring a region of the target to be tracked in the image of the t-th frame according to the target position of the t-1 frame and the preset target size, and taking the region as the target region", the method includes: if t is equal to 1, acquiring a target area of a target to be tracked according to a preset target position and a preset target size; and if t is larger than 1, acquiring a target area of the target to be tracked according to the target position of the t-1 frame and the preset target size.
In some preferred embodiments, the apparent feature extraction network has the following structure: it comprises two convolutional layers and a correlation filter layer, where each convolutional layer is followed by a max pooling layer and a ReLU activation function; the network is trained with a back-propagation algorithm.
In some preferred embodiments, the deep feature extraction network has a structure of: 5 convolutional layers, 5 deconvolution layers; the depth feature extraction network is trained through mutual reconstruction of binocular images in the training process.
In some preferred embodiments, in the process of extracting the depth features, if t is equal to 1, the depth feature extraction proceeds as follows:
acquiring the depth feature of the first frame image based on a depth feature extraction network;
and acquiring the depth features of the target area and the search area based on the depth feature of the first frame image and the preset target position.
In some preferred embodiments, when filtering the target region, the correlation filter obtains different scales by a scale transformation method, enlarges or reduces the target region according to these scales, and then performs filtering. The scale transformation method is:

$\big\{ a^{s} \mid s \in S \big\}$

wherein a is the scale coefficient, S is the scale pool whose size is the preset number of scales, and a^s is a scale factor.
In some preferred embodiments, after step S40, the method further comprises updating the state value of the correlation filter:
acquiring the state value of the correlation filter in the (t-1)-th frame;
and updating the correlation filter's state value for the t-th frame based on that state value, the target position of the t-th frame, and a preset learning rate.
In a second aspect of the present invention, a system for tracking a target based on an appearance feature and a depth feature is provided, the system comprising an acquisition region module, a feature extraction module, a feature fusion module, and an output position module;
the acquisition region module is configured to acquire a region of a target to be tracked in a t-th frame image according to the target position of the t-1 frame and a preset target size, and the region is used as a target region; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
the feature extraction module is configured to extract the apparent features and the depth features corresponding to the target area and the search area respectively through an apparent feature extraction network and a depth feature extraction network;
the feature fusion module is configured to perform a weighted average of the apparent features and the depth features corresponding to the target region and the search region, respectively, based on preset weights, to obtain a fused feature of the target region and a fused feature of the search region;
the output position module is configured to obtain a response map of the target to be tracked through a correlation filter according to the fused feature of the target region and the fused feature of the search region, and to take the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned target tracking method based on apparent features and depth features.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the apparent feature and depth feature based object tracking method described above.
The invention has the beneficial effects that:
the invention extracts the depth information of the target scene and improves the target tracking precision. According to the method, the learned convolution characteristics can be closely coupled with the correlation filtering by integrating the correlation filtering into the convolution neural network, so that the method is more suitable for a target tracking task. Because the related filtering is derived in the frequency domain, higher efficiency is kept, and the tracking effect can be greatly improved on the premise of ensuring real-time tracking of the algorithm.
In addition, the depth feature and the apparent feature are fused, so that a complementary feature is provided for the property that a single feature cannot well express the target, the depth feature is extracted from the whole frame of the target scene, the depth feature has global information and contains depth information without the apparent feature, the problems that the target is partially shielded and deformed and the like are solved, and the tracking algorithm has better robustness.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a target tracking method based on appearance features and depth features according to an embodiment of the present invention;
FIG. 2 is a block diagram of a target tracking method based on appearance features and depth features according to an embodiment of the present invention;
FIG. 3 is a frame diagram of the training process of the target tracking method based on the appearance feature and the depth feature according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram of a practical application of the target tracking method based on the appearance feature and the depth feature according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention discloses a target tracking method based on an apparent characteristic and a depth characteristic, which comprises the following steps:
step S10, acquiring the area of the target to be tracked in the t frame image according to the target position of the t-1 frame and the preset target size, and taking the area as the target area; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
step S20, respectively extracting the apparent features and the depth features corresponding to the target area and the search area through an apparent feature extraction network and a depth feature extraction network;
step S30, respectively carrying out weighted average on the apparent features and the depth features corresponding to the target region and the search region based on preset weights to obtain fusion features of the target region and the search region;
step S40, obtaining a response map of the target to be tracked through a correlation filter according to the fused features of the target region and the fused features of the search region; taking the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
In order to more clearly describe the object tracking method based on the appearance feature and the depth feature of the present invention, the following will describe each step in an embodiment of the method of the present invention in detail with reference to fig. 1.
In the invention, a computer with a 2.8 GHz central processing unit and 1 GB of memory is used; the network training is implemented under the PyTorch framework, the training and testing of the whole network are processed in parallel on multiple NVIDIA TITAN Xp GPUs, and the working program of the whole target tracking technique is written in the Python language.
In the following preferred embodiment, an apparent feature extraction network, a depth feature extraction network, and a correlation filter are first detailed, and then a target tracking method based on an apparent feature and a depth feature, in which the position of a target to be tracked is acquired by using the apparent feature extraction network, the depth feature extraction network, and the correlation filter, is detailed.
1. Training of the apparent feature extraction network, the depth feature extraction network, and the correlation filter
Step A1, constructing a training data set
In the invention, the data of the training set is derived from an OTB100 data set, wherein the data comprises 100 videos labeled frame by frame, 11 target apparent change attributes and 2 evaluation indexes. The 11 attributes are respectively: illumination changes, scale changes, occlusion, non-rigid deformation, motion blur, fast motion, horizontal rotation, vertical rotation, out-of-view, background clutter, and low resolution.
The two evaluation indexes are the center location error (CLE) and the rectangular-box overlap rate (Overlap Score, OS). The first index, based on the center-point position error, yields the precision plot and is defined as the average Euclidean distance between the center of the tracked target and the center of the manually annotated rectangular box; its mathematical expression is given in formula (1):

$\delta_{gp} = \sqrt{(x_p - x_g)^2 + (y_p - y_g)^2} \qquad (1)$

wherein (x_g, y_g) is the position of the manually annotated rectangular box (ground truth) and (x_p, y_p) is the predicted target position in the current frame.

If δ_gp is below a given threshold, the result for that frame is deemed successful; in the precision plot the threshold on δ_gp is set to 20 pixels. Because the center location error quantifies the difference in pixels, the precision plot gives no indication of the estimated target size and shape, so the more robust success-rate plot is often used to evaluate an algorithm. For the second evaluation criterion, based on the overlap rate and yielding the success-rate plot: suppose the predicted rectangular box is r_t and the manually annotated rectangular box is r_a; the Overlap Score (OS) is computed as in formula (2):

S = |r_t ∩ r_a| / |r_t ∪ r_a|   (2)

wherein ∪ and ∩ respectively denote the union and intersection of the two regions, and |·| denotes the number of pixels in a region. The OS is used to determine whether the tracking algorithm has successfully tracked the target in the current video frame; frames whose OS exceeds a threshold are counted as successfully tracked. In the success-rate plot the threshold varies between 0 and 1, so the result is a curve. The area under the curve of the precision plot and of the success-rate plot is used to represent the performance of the algorithm.
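For illustration only, the following minimal Python sketch computes the two evaluation indexes for axis-aligned boxes given in (x, y, w, h) format; the box format and the 20-pixel / 0.5 thresholds are assumptions used only in this sketch:

```python
import numpy as np

def center_location_error(box_pred, box_gt):
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    cx_p, cy_p = box_pred[0] + box_pred[2] / 2, box_pred[1] + box_pred[3] / 2
    cx_g, cy_g = box_gt[0] + box_gt[2] / 2, box_gt[1] + box_gt[3] / 2
    return np.hypot(cx_p - cx_g, cy_p - cy_g)

def overlap_score(box_pred, box_gt):
    """Overlap score (IoU) between two axis-aligned boxes, as in formula (2)."""
    x1 = max(box_pred[0], box_gt[0])
    y1 = max(box_pred[1], box_gt[1])
    x2 = min(box_pred[0] + box_pred[2], box_gt[0] + box_gt[2])
    y2 = min(box_pred[1] + box_pred[3], box_gt[1] + box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_pred[2] * box_pred[3] + box_gt[2] * box_gt[3] - inter
    return inter / union if union > 0 else 0.0

def precision_and_success(preds, gts, cle_thresh=20.0, os_thresh=0.5):
    """Fraction of frames under the CLE threshold and over the OS threshold."""
    cle = np.array([center_location_error(p, g) for p, g in zip(preds, gts)])
    ios = np.array([overlap_score(p, g) for p, g in zip(preds, gts)])
    return float((cle <= cle_thresh).mean()), float((ios >= os_thresh).mean())
```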
Step A2, off-line training apparent feature extraction network
First, the training data is prepared: the VID dataset in ImageNet, containing 3000 video segments. Second, the network structure is designed. Because real-time performance is also an important criterion when evaluating a tracking algorithm, the invention designs a lightweight network that uses a Siamese network as its basic structure. It contains two convolutional layers in total, the input data size is 125 x 125, and each convolutional layer is followed by a max pooling layer and a ReLU activation function. On this basis, a correlation filter layer is added and the back propagation of the network is derived. The whole process can be described as follows: given the features φ(z) of the search region, a desired response g(z) is obtained that reaches its highest value at the true target location. The objective function and its solution are shown in formulas (3), (4) and (5):

$L(\theta) = \sum_{i=1}^{D} \big\| g(z_{i}) - y_{i} \big\|^{2} + \gamma \|\theta\|^{2} \qquad (3)$

$g(z) = \mathcal{F}^{-1}\Big( \sum_{l} \hat{w}^{l*} \odot \hat{\varphi}^{l}(z) \Big) \qquad (4)$

$\hat{w}^{l} = \frac{\hat{y}^{*} \odot \hat{\varphi}^{l}(x)}{\sum_{k} \hat{\varphi}^{k}(x) \odot \big(\hat{\varphi}^{k}(x)\big)^{*} + \lambda} \qquad (5)$

wherein θ denotes the network parameters, y is the standard Gaussian response, γ is a regularization coefficient, L(θ) is the target loss function, D is the total number of frames of the video, l indexes the filter channel, ŵ^l is the learned correlation filter, ⊙ denotes the element-wise (Hadamard) product of matrices, φ̂(z) is the extracted search-region feature, z is the search region, w is the correlation filter, F^{-1} is the inverse Fourier transform, ŷ* is the complex conjugate of the discrete Fourier transform of the standard Gaussian response, k indexes the current filter channel, φ̂^k(x) is the target-region feature, (φ̂^k(x))* is the complex conjugate of the target-region feature, λ is a regularization coefficient, and ·* denotes taking the complex conjugate.
The objective function should contain explicit regularization, otherwise the objective may fail to converge. This regularization is applied implicitly through weight decay, as in conventional parameter optimization. In addition, to limit the magnitude of the feature-map values and increase the stability of training, a Local Response Normalization (LRN) layer is added at the end of the convolutional layers. The detection branch and the learning branch are back-propagated using the deep-learning framework PyTorch; once the error has been propagated back to the real-valued feature maps, the rest of the back propagation proceeds as in conventional CNN optimization. Since all back-propagation operations in the correlation filter layer are still Hadamard operations in the Fourier domain, the efficiency of the DCF (discriminative correlation filter) is maintained and offline training can be applied to large-scale datasets. After offline training is complete, a dedicated feature extractor is obtained for the online discriminative correlation filtering tracking algorithm.
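The following PyTorch sketch illustrates a network of this kind: two convolutional layers each followed by max pooling and ReLU, an LRN layer, and a correlation filter layer implementing formulas (4) and (5) in the Fourier domain. The channel widths, kernel sizes and regularization value are illustrative assumptions; this is a simplified sketch under those assumptions, not the exact network of the embodiment:

```python
import torch
import torch.nn as nn
import torch.fft

class AppearanceNet(nn.Module):
    """Lightweight two-conv-layer feature extractor (channel widths are assumptions)."""
    def __init__(self, out_channels=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.MaxPool2d(2), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_channels, kernel_size=3, padding=1),
            nn.MaxPool2d(2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),  # limits feature magnitudes, as described in the text
        )

    def forward(self, x):
        return self.features(x)

def dcf_layer(feat_x, feat_z, y, lam=1e-4):
    """Correlation filter layer: learn the filter on the target features feat_x
    (formula (5)) and evaluate the response on the search features feat_z
    (formula (4)). Tensors are (B, C, H, W); y is the Gaussian label (B, 1, H, W);
    feat_x and feat_z are assumed to share the same spatial size."""
    X = torch.fft.fft2(feat_x)
    Z = torch.fft.fft2(feat_z)
    Y = torch.fft.fft2(y)
    # w_hat^l = conj(Y) * X^l / (sum_k X^k * conj(X^k) + lambda)
    denom = (X * X.conj()).sum(dim=1, keepdim=True) + lam
    W = (Y.conj() * X) / denom
    # g(z) = F^-1( sum_l conj(w_hat^l) * Z^l )
    response = torch.fft.ifft2((W.conj() * Z).sum(dim=1, keepdim=True)).real
    return response
```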
Step A3, training a deep feature network
Given a single test image I, the goal is to learn a function f that can predict the scene depth for each pixel, as shown in equation (6):

$\hat{d} = f(I) \qquad (6)$

wherein $\hat{d}$ is the depth information.
Most existing learning-based approaches treat this as a supervised learning problem, where a color input image and its corresponding target depth values are available at training time. As an alternative, depth estimation can be treated as an image reconstruction problem during training. Specifically, two pictures acquired at the same time by a standard binocular camera are input: a left color image I^l and a right color image I^r, where l and r denote left and right. Instead of trying to predict the depth information directly, the network tries to find a correspondence field d^r that, when applied to the left image, reconstructs the right image; the reconstructed image I^l(d^r) is denoted Ĩ^r. Likewise, the left image can be estimated from the given right image, Ĩ^l = I^r(d^l). Assuming the images are rectified, d corresponds to the image disparity, i.e. a per-pixel scalar value that the model learns to predict. Only the single left image is required as input to the network, while the right image is used only during training. Enforcing consistency between the two disparity maps with this novel left-right consistency loss yields more accurate results. The structure of the depth feature network consists of an encoder and a decoder built on a ResNet, comprising 5 convolutional layers and 5 deconvolution layers in total. The decoder uses skip connections from the encoder activation blocks, enabling it to resolve higher-resolution details. Regarding the training loss, a loss function C_f is defined at each scale f, and the total loss is the arithmetic sum over all scales:

$C = \sum_{f} C_{f}$
the loss module is again a combination of three main loss functions, as shown in equation (7):
Figure BDA0002206894790000112
wherein, CapIs an apparent matching loss function that represents how similar the reconstructed image is to the corresponding training input, CdsIs a differential smoothness loss function, ClrIs a function of the loss of consistency of the left and right differences,
Figure BDA0002206894790000113
representing the apparent matching loss function of the left and right images,
Figure BDA0002206894790000114
a difference smoothness loss function representing the left and right images,
Figure BDA0002206894790000115
left-right difference consistency loss function, alpha, representing left and right imagesap、αds、αlrAre the weighting coefficients of the three loss functions. Each of the main loss functions contains left and right image variants, but only the left image is input to the convolutional layer.
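As an illustrative sketch, the per-scale loss of formula (7) can be assembled as below; the individual loss terms are simplified (an L1 photometric term instead of the full appearance-matching loss, and the disparity warping is assumed to be computed elsewhere), and the weight values are assumptions:

```python
import torch

def appearance_matching_loss(img, img_reconstructed):
    """Simplified photometric loss (L1); the full method also uses a structural term."""
    return (img - img_reconstructed).abs().mean()

def disparity_smoothness_loss(disp, img):
    """Edge-aware smoothness: penalise disparity gradients, downweighted at image edges."""
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def lr_consistency_loss(disp_l, disp_r_warped_to_l):
    """Left-right disparity consistency (the warping itself is omitted here)."""
    return (disp_l - disp_r_warped_to_l).abs().mean()

def scale_loss(left, right, recon_l, recon_r, disp_l, disp_r,
               disp_r_to_l, disp_l_to_r, a_ap=1.0, a_ds=0.1, a_lr=1.0):
    """One-scale loss C_f of formula (7); the weights are illustrative assumptions."""
    c_ap = appearance_matching_loss(left, recon_l) + appearance_matching_loss(right, recon_r)
    c_ds = disparity_smoothness_loss(disp_l, left) + disparity_smoothness_loss(disp_r, right)
    c_lr = lr_consistency_loss(disp_l, disp_r_to_l) + lr_consistency_loss(disp_r, disp_l_to_r)
    return a_ap * c_ap + a_ds * c_ds + a_lr * c_lr

# Total loss: arithmetic sum of the per-scale losses over all output scales, C = sum_f C_f.
```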
Step A4, train the correlation filter
As shown in fig. 3, based on the training set constructed in step a1, the apparent features and the depth features corresponding to the data set are obtained through the apparent feature extraction network trained in step a2 and the depth feature extraction network trained in step A3.
In order to increase the robustness of the tracking performance and prevent the target from being disturbed by the background, in FIG. 3 a target template, i.e. the target region, is obtained according to the target position of the (t-1)-th frame and the preset target size, a region 2 times the size of the target region around it is selected as the search region, and the search region is input into the apparent feature extraction network. This adds more background information, prevents the target from drifting, and increases the discriminability of the model. Similarly, the depth features of the target template and the search region are extracted by the depth feature extraction network. To obtain richer depth features, in the first frame of a video the whole image is first input into the depth feature extraction network to extract the depth information of the entire image, which is then cropped at the target region to give the required depth features.
Because the extracted apparent features and depth features have mismatched dimensions (the apparent feature has 32 channels while the depth feature has 1), a weight coefficient α is introduced to fuse the features so that the large difference in dimensionality does not weaken the influence of the depth feature. The fusion process is shown in formula (8):

$\psi(x_{k}) = \alpha\,\varphi(x_{k}) + (1-\alpha)\,\varphi_{d}(x_{k}) \qquad (8)$

wherein φ(x_k) is the apparent feature extracted from the k-th frame, φ_d(x_k) is the depth feature extracted from the k-th frame, and ψ(x_k) is the fused feature of the k-th frame.
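A minimal sketch of this fusion step follows; broadcasting the single-channel depth feature across the appearance channels and the value of α are assumptions made only for illustration:

```python
import torch

def fuse_features(feat_appearance, feat_depth, alpha=0.9):
    """Weighted fusion psi = alpha * appearance + (1 - alpha) * depth (formula (8)).
    The 1-channel depth map is broadcast across the 32 appearance channels; both
    inputs are assumed to share the same spatial size."""
    if feat_depth.shape[1] == 1 and feat_appearance.shape[1] > 1:
        feat_depth = feat_depth.expand_as(feat_appearance)
    return alpha * feat_appearance + (1.0 - alpha) * feat_depth
```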
The fused features are circularly shifted to generate positive and negative samples, and the correlation filter template is trained with these samples through the formula

$\min_{W} \sum_{i} \big( f(x_{i}) - y_{i} \big)^{2} + \lambda \|W\|^{2}$

wherein f(x_i) is the estimated target response for sample x_i, y_i is the standard Gaussian response, and W is the correlation filter.
To train the correlation filter, the difference between the predicted response map and the ideal response map is optimized, as shown in equation (9):

$\min_{w}\; \Big\| y - \mathcal{F}^{-1}\Big( \sum_{l} \hat{w}^{l*} \odot \hat{\psi}^{l}(z) \Big) \Big\|^{2} + \lambda \|w\|^{2} \qquad (9)$

whose closed-form solution has the same form as formula (5), computed on the fused features; wherein ψ̂(x) is the fused feature of the target region, ψ̂(z) is the fused feature of the search region, and (ψ̂(x))* is the complex conjugate of the fused target-region feature.
When a new frame arrives, in order to detect the target, the apparent feature and the depth feature are first extracted from the region estimated in the previous frame and then combined into a unified feature. With the help of the correlation filter layer, the response map of the template over the search area can be calculated by equation (10):

$g(z) = \mathcal{F}^{-1}\Big( \sum_{l} \hat{w}^{l*} \odot \hat{\psi}^{l}(z) \Big) \qquad (10)$

wherein ŵ* is the complex conjugate of the correlation filter.
The position of the object in the current video frame is then obtained from the maximum value of the response map. To ensure the robustness of the proposed model, the filter h is updated with a predefined learning rate η, as shown in equation (11):

$\hat{h}_{k+1} = (1-\eta)\,\hat{h}_{k} + \eta\,\hat{h} \qquad (11)$

wherein ĥ_{k+1} is the correlation filter for the (k+1)-th frame, ĥ_k is the correlation filter for the k-th frame, and ĥ is the filter computed on the current frame.
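The peak search and the learning-rate update of formula (11) can be sketched as follows (the learning-rate value is an illustrative assumption):

```python
import torch

def locate_peak(response):
    """Return the (row, col) of the maximum of a (H, W) response map."""
    idx = torch.argmax(response)
    return int(idx // response.shape[1]), int(idx % response.shape[1])

def update_filter(w_prev, w_new, eta=0.01):
    """Linear-interpolation update of the correlation filter, formula (11)."""
    return (1.0 - eta) * w_prev + eta * w_new
```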
As for the scale change of the target, pyramid image blocks with scale factors are used for scale filtering:

$\big\{ a^{s} \mid s \in S \big\}$

wherein a is the scale coefficient, S is the scale pool, and the size of S is the preset number of scales. For example, if the algorithm uses 3 scales, the scale pool is S = (-1, 0, 1), and the powers a^s give the scale factors to be used, e.g. (0.97, 1, 1.03). Each scale factor is applied to the target region: 0.97 means the region is shrunk to 0.97 times its size (the target appears smaller) and 1.03 means the region is enlarged. Filtering is then performed at each scale, and the scale with the maximum response value is selected.
The value at each point of the obtained target response map corresponds to the probability that the point is the target; therefore the point with the maximum value in the response map is selected as the estimated target position, and once the position of the target in the new frame is obtained, the motion model is updated. During online tracking, only the filter is updated over time. The optimization problem for the target can be represented in an incremental form, as shown in equation (12):

$\varepsilon = \sum_{p=1}^{t} \beta_{p} \Big\| \mathcal{F}^{-1}\Big( \sum_{l} \hat{w}^{l*}_{p} \odot \hat{\psi}^{l}(x_{p}) \Big) - y_{p} \Big\|^{2} + \lambda \sum_{l} \big\| \hat{w}^{l}_{p} \big\|^{2} \qquad (12)$

wherein ε is the output loss, p indexes the p-th sample, t is the current sample, β_t is the impact factor of the current sample, ŵ_p is the correlation filter of the p-th sample, and ψ̂(x_p) is the feature of the p-th sample.
The parameter β_t > 0 is the weight of sample x_t, and the closed-form solution of the equation can also be extended to the time series, as shown in equation (13):

$\hat{w}^{l}_{t} = \frac{\sum_{p=1}^{t} \beta_{p}\, \hat{y}^{*}_{p} \odot \hat{\psi}^{l}(x_{p})}{\sum_{p=1}^{t} \beta_{p} \Big( \sum_{k} \hat{\psi}^{k}(x_{p}) \odot \big(\hat{\psi}^{k}(x_{p})\big)^{*} + \lambda \Big)} \qquad (13)$

wherein ŵ_p is the correlation filter of the p-th sample, ψ̂(x_t) is the feature of sample x_t, ψ̂^k(x_t) is the feature of sample x_t obtained from the k-th filter channel, and (ψ̂^k(x_t))* is its complex conjugate.
The advantage of such incremental updating is that we do not need to save a large sample set and therefore take up little space.
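A sketch of this incremental update, accumulating the numerator and denominator of formula (13) frame by frame so that no sample set needs to be stored; the constant per-frame weight β is an assumption:

```python
import torch
import torch.fft

class IncrementalDCF:
    """Incremental filter update in the spirit of formulas (12)-(13): numerator and
    denominator are accumulated over frames with per-sample weights."""
    def __init__(self, lam=1e-4):
        self.lam = lam
        self.num = None  # sum_p beta_p * conj(Y_p) * Psi_p
        self.den = None  # sum_p beta_p * (sum_k Psi_p^k * conj(Psi_p^k) + lambda)

    def update(self, feat, y, beta=0.01):
        Psi = torch.fft.fft2(feat)  # fused feature, shape (C, H, W)
        Y = torch.fft.fft2(y)       # Gaussian label, shape (1, H, W)
        num_t = beta * (Y.conj() * Psi)
        den_t = beta * ((Psi * Psi.conj()).sum(dim=0, keepdim=True) + self.lam)
        self.num = num_t if self.num is None else self.num + num_t
        self.den = den_t if self.den is None else self.den + den_t

    def filter(self):
        return self.num / self.den  # w_hat per channel
```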
2. Target tracking method based on apparent features and depth features
The embodiment of the invention provides a target tracking method based on an apparent characteristic and a depth characteristic, which comprises the following steps:
step S10, acquiring the area of the target to be tracked in the t frame image according to the target position of the t-1 frame and the preset target size, and taking the area as the target area; and acquiring a region with the size N times of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region.
In this embodiment, the target position and target size are given by a rectangular box in the first frame of the video. For subsequent frames, the region of the target in the current frame image is obtained based on the target position of the previous frame and the preset target size and used as the target region, and a region 2 times the size of the target region, centred on the previous frame's target region, is used as the search region. This increases the robustness of the tracking performance and prevents the target from being disturbed by the background.
The target size in the present embodiment is set according to the actual application.
And step S20, respectively extracting the apparent features and the depth features corresponding to the target area and the search area through an apparent feature extraction network and a depth feature extraction network.
In this embodiment, the target region and the search region acquired in step S10 are input into the apparent feature extraction network and the depth feature extraction network, respectively, to obtain an apparent feature and a depth feature corresponding to each other.
When extracting the depth features of the first frame of the video, the whole image is first input into the depth feature extraction network to extract the depth information of the entire image, which is then cropped at the target position to give the required depth features.
Wherein the apparent features collectively comprise 32 dimensions, the size is 125 x 32, and the size of the depth features is limited to 125 x 1.
And step S30, respectively carrying out weighted average on the apparent features and the depth features corresponding to the target region and the search region based on preset weights to obtain the fusion features of the target region and the search region.
In this embodiment, because the dimensions of the extracted apparent feature and depth feature are not matched (the apparent feature is 32-dimensional and the depth feature is 1-dimensional), a weight coefficient is introduced to fuse the features so that the large dimensional difference does not weaken the influence of the depth feature, and the fused features of the target region and the search region are obtained respectively.
Step S40, obtaining a response map of the target to be tracked through a correlation filter according to the fused features of the target region and the fused features of the search region; and taking the position corresponding to the peak value of the response map as the target position of the t-th frame.
In this embodiment, a response map is obtained through a trained correlation filter based on the fusion characteristics of the target region and the search region, and the position of the target to be tracked in the current video frame is obtained through the maximum value on the response map.
When the target position of the current frame has been obtained, the correlation filter is updated based on the filter state of the previous frame and a preset learning rate.
For steps S10-S40, reference may be made to FIG. 4: the apparent feature and the depth feature are obtained through the apparent feature extraction network (ANET) and the depth feature extraction network (DNET), respectively, from the target region image (Current image) and the search region image (Search image); the fused features of the two images are obtained by weighted averaging with the preset weight α; the response map (Response map) is obtained by the correlation filter (DCF); and the position of the target is given by the maximum value of the response map.
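The following sketch condenses steps S10-S40 into a single per-frame tracking step on already-fused features; the Gaussian-label parameters, learning rate and regularization value are assumptions, and cropping and scale search are omitted for brevity:

```python
import torch
import torch.fft

def gaussian_label(h, w, sigma=2.0):
    """Centred 2-D Gaussian used as the ideal response, shape (1, H, W)."""
    ys = torch.arange(h).view(-1, 1) - (h - 1) / 2
    xs = torch.arange(w).view(1, -1) - (w - 1) / 2
    return torch.exp(-(ys ** 2 + xs ** 2) / (2 * sigma ** 2)).unsqueeze(0)

def track_step(w_prev, fused_target, fused_search, y, lam=1e-4, eta=0.01):
    """One tracking step on fused features of shape (C, H, W): fit a filter on the
    current target region, blend it with the previous filter (formula (11)), and
    locate the target peak in the search-region response (formula (10))."""
    X = torch.fft.fft2(fused_target)
    Z = torch.fft.fft2(fused_search)
    Y = torch.fft.fft2(y)
    w_new = (Y.conj() * X) / ((X * X.conj()).sum(0, keepdim=True) + lam)
    w = w_new if w_prev is None else (1 - eta) * w_prev + eta * w_new
    response = torch.fft.ifft2((w.conj() * Z).sum(0, keepdim=True)).real.squeeze(0)
    peak = torch.argmax(response)
    row, col = int(peak // response.shape[1]), int(peak % response.shape[1])
    # The displacement of (row, col) from the map centre gives the target motion
    # relative to the previous frame's position.
    return (row, col), w
```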
A second embodiment of the present invention is a target tracking system based on appearance features and depth features, as shown in fig. 2, including: the system comprises an acquisition region module 100, an extraction feature module 200, a feature fusion module 300 and an output position module 400;
an obtaining region module 100, configured to obtain a region of a target to be tracked in a t-th frame image according to a target position of a t-1 frame and a preset target size, and take the region as a target region; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
an extraction feature module 200 configured to extract an apparent feature and a depth feature corresponding to the target region and the search region respectively through an apparent feature extraction network and a depth feature extraction network;
the feature fusion module 300 is configured to perform weighted average on the apparent features and the depth features corresponding to the target region and the search region respectively based on preset weights to obtain fusion features of the target region and fusion features of the search region;
an output position module 400, configured to obtain a response map of the target to be tracked through a correlation filter according to the fusion feature of the target region and the fusion feature of the search region; taking the position corresponding to the peak value of the response image as the target position of the t frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that, the target tracking system based on the apparent feature and the depth feature provided in the foregoing embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores therein a plurality of programs adapted to be loaded by a processor and to implement the above-described object tracking method based on apparent features and depth features.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the apparent feature and depth feature based object tracking method described above.
It can be clearly understood by those skilled in the art that, for convenience and brevity, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the programs corresponding to the software modules and method steps may be located in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A method for object tracking based on appearance features and depth features, the method comprising:
step S10, acquiring the area of the target to be tracked in the t frame image according to the target position of the t-1 frame and the preset target size, and taking the area as the target area; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
step S20, respectively extracting the apparent features and the depth features corresponding to the target area and the search area through an apparent feature extraction network and a depth feature extraction network;
step S30, respectively carrying out weighted average on the apparent features and the depth features corresponding to the target region and the search region based on preset weights to obtain fusion features of the target region and the search region;
step S40, obtaining a response map of the target to be tracked through a correlation filter according to the fused features of the target region and the fused features of the search region; and taking the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
2. The method for tracking the target according to claim 1, wherein in step S10, "acquiring the region of the target to be tracked in the image of the t-th frame according to the target position of the t-1 frame and the preset target size, and using the region as the target region", the method comprises: if t is equal to 1, acquiring a target area of a target to be tracked according to a preset target position and a preset target size; and if t is larger than 1, acquiring a target area of the target to be tracked according to the target position of the t-1 frame and the preset target size.
3. The method for tracking an object based on the appearance features and the depth features according to claim 1, wherein the appearance feature extraction network has the following structure: it comprises two convolutional layers and a correlation filter layer, where each convolutional layer is followed by a max pooling layer and a ReLU activation function; the network is trained with a back-propagation algorithm.
4. The method for tracking the target based on the appearance feature and the depth feature of claim 1, wherein the depth feature extraction network has a structure that: 5 convolutional layers, 5 deconvolution layers; the depth feature extraction network is trained through mutual reconstruction of binocular images in the training process.
5. The target tracking method based on the appearance features and the depth features according to claim 2, wherein in the process of extracting the depth features, if t is equal to 1, the extraction method of the depth feature extraction network is as follows:
acquiring the depth feature of the first frame image based on a depth feature extraction network;
and acquiring the depth features of the target area and the search area based on the depth feature of the first frame image and the preset target position.
6. The object tracking method based on the appearance feature and the depth feature of claim 1, wherein when filtering the object region, the correlation filter obtains different scales by a scale transformation method, enlarges or reduces the object region according to the different scales, and then performs filtering, the scale transformation method being:

$\big\{ a^{s} \mid s \in S \big\}$

wherein a is a scale coefficient, S is the scale pool whose size is the preset number of scales, and a^s is a scale factor.
7. The method for tracking an object based on an apparent feature and a depth feature according to any one of claims 1 to 6, further comprising updating the state values of the correlation filters after step S40 by:
acquiring the state value of a correlation filter in a t-1 frame;
and updating the correlation filter's state value for the t-th frame based on that state value, the target position of the t-th frame, and a preset learning rate.
8. A target tracking system based on appearance features and depth features is characterized by comprising an acquisition region module, a feature extraction module, a feature fusion module and an output position module;
the acquisition region module is configured to acquire a region of a target to be tracked in a t-th frame image according to the target position of the t-1 frame and a preset target size, and the region is used as a target region; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
the feature extraction module is configured to extract the apparent features and the depth features corresponding to the target area and the search area respectively through an apparent feature extraction network and a depth feature extraction network;
the feature fusion module is configured to perform a weighted average of the apparent features and the depth features corresponding to the target region and the search region, respectively, based on preset weights, to obtain a fused feature of the target region and a fused feature of the search region;
the output position module is configured to obtain a response map of the target to be tracked through a correlation filter according to the fused feature of the target region and the fused feature of the search region, and to take the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
9. A storage device having a plurality of programs stored therein, wherein the programs are adapted to be loaded and executed by a processor to implement the target tracking method based on apparent features and depth features of any of claims 1-7.
10. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the apparent and depth feature based object tracking method of any of claims 1-7.
CN201910884524.4A 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature Active CN110706253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910884524.4A CN110706253B (en) 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910884524.4A CN110706253B (en) 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature

Publications (2)

Publication Number Publication Date
CN110706253A CN110706253A (en) 2020-01-17
CN110706253B true CN110706253B (en) 2022-03-08

Family

ID=69194485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910884524.4A Active CN110706253B (en) 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature

Country Status (1)

Country Link
CN (1) CN110706253B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767450A (en) * 2021-01-25 2021-05-07 开放智能机器(上海)有限公司 Multi-loss learning-based related filtering target tracking method and system
CN113592899A (en) * 2021-05-28 2021-11-02 北京理工大学重庆创新中心 Method for extracting correlated filtering target tracking depth features
CN113327273B (en) * 2021-06-15 2023-12-19 中国人民解放军火箭军工程大学 Infrared target tracking method based on variable window function correlation filtering

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794385A (en) * 2010-03-23 2010-08-04 上海交通大学 Multi-angle multi-target fast human face tracking method used in video sequence
CN105719292A (en) * 2016-01-20 2016-06-29 华东师范大学 Method of realizing video target tracking by adopting two-layer cascading Boosting classification algorithm
CN106780542A (en) * 2016-12-29 2017-05-31 北京理工大学 A kind of machine fish tracking of the Camshift based on embedded Kalman filter
CN107862680A (en) * 2017-10-31 2018-03-30 西安电子科技大学 A kind of target following optimization method based on correlation filter
CN108596951A (en) * 2018-03-30 2018-09-28 西安电子科技大学 A kind of method for tracking target of fusion feature
CN109308713A (en) * 2018-08-02 2019-02-05 哈尔滨工程大学 A kind of improvement core correlation filtering Method for Underwater Target Tracking based on Forward-looking Sonar
CN109344725A (en) * 2018-09-04 2019-02-15 上海交通大学 A kind of online tracking of multirow people based on space-time attention rate mechanism
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN109858326A (en) * 2018-12-11 2019-06-07 中国科学院自动化研究所 Based on classification semantic Weakly supervised online visual tracking method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5227639B2 (en) * 2008-04-04 2013-07-03 富士フイルム株式会社 Object detection method, object detection apparatus, and object detection program

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794385A (en) * 2010-03-23 2010-08-04 上海交通大学 Multi-angle multi-target fast human face tracking method used in video sequence
CN105719292A (en) * 2016-01-20 2016-06-29 华东师范大学 Method of realizing video target tracking by adopting two-layer cascading Boosting classification algorithm
CN106780542A (en) * 2016-12-29 2017-05-31 北京理工大学 A kind of machine fish tracking of the Camshift based on embedded Kalman filter
CN107862680A (en) * 2017-10-31 2018-03-30 西安电子科技大学 A kind of target following optimization method based on correlation filter
CN108596951A (en) * 2018-03-30 2018-09-28 西安电子科技大学 A kind of method for tracking target of fusion feature
CN109308713A (en) * 2018-08-02 2019-02-05 哈尔滨工程大学 A kind of improvement core correlation filtering Method for Underwater Target Tracking based on Forward-looking Sonar
CN109344725A (en) * 2018-09-04 2019-02-15 上海交通大学 A kind of online tracking of multirow people based on space-time attention rate mechanism
CN109858326A (en) * 2018-12-11 2019-06-07 中国科学院自动化研究所 Based on classification semantic Weakly supervised online visual tracking method and system
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges

Also Published As

Publication number Publication date
CN110706253A (en) 2020-01-17


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant