CN110706253B - Target tracking method, system and device based on apparent feature and depth feature


Info

Publication number
CN110706253B
Authority
CN
China
Prior art keywords
target
depth
region
feature
features
Prior art date
Legal status
Active
Application number
CN201910884524.4A
Other languages
Chinese (zh)
Other versions
CN110706253A (en)
Inventor
胡卫明
李晶
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201910884524.4A
Publication of CN110706253A
Application granted
Publication of CN110706253B

Classifications

    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06V2201/07 Target detection


Abstract

The invention belongs to the technical field of computer vision tracking, and particularly relates to a target tracking method, system and device based on apparent features and depth features, aiming at solving the problem of low tracking accuracy caused by neglecting the depth information of the target scene in existing target tracking methods. The method comprises: obtaining the target region and the search region of the target to be tracked in the t-th frame image according to the target position of the (t-1)-th frame and a preset target size; extracting the apparent features and the depth features of the target region and the search region through an apparent feature extraction network and a depth feature extraction network, respectively; performing a weighted average of the apparent features and the depth features of the target region and the search region based on preset weights to obtain their respective fused features; obtaining a response map of the target through a correlation filter according to the fused features of the target region and the search region; and taking the position corresponding to the peak of the response map as the target position of the t-th frame. The invention extracts the depth information of the target scene and improves target tracking accuracy.

Description

Target tracking method, system and device based on apparent feature and depth feature
Technical Field
The invention belongs to the technical field of computer vision tracking, and particularly relates to a target tracking method, system and device based on apparent characteristics and depth characteristics.
Background
Object tracking is one of the most fundamental problems in the field of computer vision; its task is to estimate the motion trajectory of an object or image region in a video sequence. Object tracking has a very wide range of applications in real scenes, often serving as a component in larger computer vision systems. For example, both autonomous driving and vision-based active safety systems rely on tracking the positions of vehicles, cyclists and pedestrians. In robotic systems, tracking an object of interest is a very important aspect of visual perception, extracting high-level information from camera sensors for decision-making and navigation. Beyond robotics, target tracking is also often used for automatic video analysis, where information is extracted by first detecting and tracking the players and objects involved in a game. Other applications include augmented reality and dynamic-structure techniques, which often require tracking different local image regions. This diversity of applications shows that the target tracking problem itself is very diverse.
In recent decades the field of target tracking has made breakthrough progress and produced numerous classical research results. However, many theoretical and technical problems remain, in particular the complex situations encountered in open environments during tracking, such as background interference, illumination changes, scale changes and occlusion. How to track a target adaptively, in real time and robustly in complex scenes therefore remains an open problem for researchers, with large research value and room for improvement.
For single-target tracking, the quality of the features directly determines the tracking performance. Early discriminative models based on hand-crafted features can only extract shallow features of the target object and cannot describe its essence well; the more recent convolutional neural networks can learn feature representations of the target object at different levels through their hierarchical structure, but they ignore the global depth information of the target scene. Depth information can serve as an auxiliary feature that provides global information about the target and alleviates problems such as occlusion, thereby improving the robustness of the model in complex scenes. Therefore, the invention provides a target tracking method based on apparent features and depth features.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, to solve the problem that the tracking accuracy is low due to the fact that the depth information of the target scene is ignored in the existing target tracking method, in a first aspect of the present invention, a method for tracking a target based on an apparent feature and a depth feature is provided, the method including:
step S10, acquiring the area of the target to be tracked in the t frame image according to the target position of the t-1 frame and the preset target size, and taking the area as the target area; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
step S20, respectively extracting the apparent features and the depth features corresponding to the target area and the search area through an apparent feature extraction network and a depth feature extraction network;
step S30, respectively carrying out weighted average on the apparent features and the depth features corresponding to the target region and the search region based on preset weights to obtain fusion features of the target region and the search region;
step S40, obtaining a response map of the target to be tracked through a correlation filter according to the fused features of the target region and the fused features of the search region; and taking the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
In some preferred embodiments, in step S10, "acquiring a region of the target to be tracked in the image of the t-th frame according to the target position of the t-1 frame and the preset target size, and taking the region as the target region", the method includes: if t is equal to 1, acquiring a target area of a target to be tracked according to a preset target position and a preset target size; and if t is larger than 1, acquiring a target area of the target to be tracked according to the target position of the t-1 frame and the preset target size.
In some preferred embodiments, the apparent feature extraction network has the following structure: it comprises two convolutional layers and a correlation filter layer, where each convolutional layer is followed by a max pooling layer and a ReLU activation function; the network is trained with a back-propagation algorithm.
In some preferred embodiments, the deep feature extraction network has a structure of: 5 convolutional layers, 5 deconvolution layers; the depth feature extraction network is trained through mutual reconstruction of binocular images in the training process.
In some preferred embodiments, in the process of extracting the depth features, if t is equal to 1, the depth feature extraction proceeds as follows:
acquiring the depth feature of the first frame image based on a depth feature extraction network;
and acquiring the depth features of the target area and the search area based on the depth feature of the first frame image and the preset target position.
In some preferred embodiments, when filtering the target region, the correlation filter obtains different scales by a scale transformation method, enlarges or reduces the target region according to these scales, and then performs filtering. The scale transformation method is:

$\big\{ a^{s} \mid s \in S \big\}$

wherein a is the scale coefficient, S is the scale pool whose size is the preset number of scales, and a^s is a scale factor.
In some preferred embodiments, after step S40, the method further comprises updating the state value of the correlation filter:
acquiring the state value of the correlation filter in the (t-1)-th frame;
and updating the correlation filter's state value for the t-th frame based on that state value, the target position of the t-th frame, and a preset learning rate.
In a second aspect of the present invention, a system for tracking a target based on an appearance feature and a depth feature is provided, the system comprising an acquisition region module, a feature extraction module, a feature fusion module, and an output position module;
the acquisition region module is configured to acquire a region of a target to be tracked in a t-th frame image according to the target position of the t-1 frame and a preset target size, and the region is used as a target region; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
the feature extraction module is configured to extract the apparent features and the depth features corresponding to the target area and the search area respectively through an apparent feature extraction network and a depth feature extraction network;
the feature fusion module is configured to perform a weighted average of the apparent features and the depth features corresponding to the target region and the search region, respectively, based on preset weights, to obtain a fused feature of the target region and a fused feature of the search region;
the output position module is configured to obtain a response map of the target to be tracked through a correlation filter according to the fused feature of the target region and the fused feature of the search region, and to take the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned target tracking method based on apparent features and depth features.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the apparent feature and depth feature based object tracking method described above.
The invention has the beneficial effects that:
the invention extracts the depth information of the target scene and improves the target tracking precision. According to the method, the learned convolution characteristics can be closely coupled with the correlation filtering by integrating the correlation filtering into the convolution neural network, so that the method is more suitable for a target tracking task. Because the related filtering is derived in the frequency domain, higher efficiency is kept, and the tracking effect can be greatly improved on the premise of ensuring real-time tracking of the algorithm.
In addition, the depth feature and the apparent feature are fused, so that a complementary feature is provided for the property that a single feature cannot well express the target, the depth feature is extracted from the whole frame of the target scene, the depth feature has global information and contains depth information without the apparent feature, the problems that the target is partially shielded and deformed and the like are solved, and the tracking algorithm has better robustness.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of a target tracking method based on appearance features and depth features according to an embodiment of the present invention;
FIG. 2 is a block diagram of a target tracking method based on appearance features and depth features according to an embodiment of the present invention;
FIG. 3 is a frame diagram of the training process of the target tracking method based on the appearance feature and the depth feature according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram of a practical application of the target tracking method based on the appearance feature and the depth feature according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention discloses a target tracking method based on an apparent characteristic and a depth characteristic, which comprises the following steps:
step S10, acquiring the area of the target to be tracked in the t frame image according to the target position of the t-1 frame and the preset target size, and taking the area as the target area; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
step S20, respectively extracting the apparent features and the depth features corresponding to the target area and the search area through an apparent feature extraction network and a depth feature extraction network;
step S30, respectively carrying out weighted average on the apparent features and the depth features corresponding to the target region and the search region based on preset weights to obtain fusion features of the target region and the search region;
step S40, obtaining a response map of the target to be tracked through a correlation filter according to the fused features of the target region and the fused features of the search region; taking the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
In order to more clearly describe the object tracking method based on the appearance feature and the depth feature of the present invention, the following will describe each step in an embodiment of the method of the present invention in detail with reference to fig. 1.
In the invention, a computer with a 2.8 GHz central processing unit and 1 GB of memory is used; the network training is implemented under the PyTorch framework, the training and testing of the whole network are processed in parallel on multiple NVIDIA TITAN Xp GPUs, and the working program of the whole target tracking technique is written in the Python language.
In the following preferred embodiment, an apparent feature extraction network, a depth feature extraction network, and a correlation filter are first detailed, and then a target tracking method based on an apparent feature and a depth feature, in which the position of a target to be tracked is acquired by using the apparent feature extraction network, the depth feature extraction network, and the correlation filter, is detailed.
1. Training of the apparent feature extraction network, the depth feature extraction network, and the correlation filter
Step A1, constructing a training data set
In the invention, the data of the training set is derived from an OTB100 data set, wherein the data comprises 100 videos labeled frame by frame, 11 target apparent change attributes and 2 evaluation indexes. The 11 attributes are respectively: illumination changes, scale changes, occlusion, non-rigid deformation, motion blur, fast motion, horizontal rotation, vertical rotation, out-of-view, background clutter, and low resolution.
The two evaluation indexes are the center location error (CLE) and the rectangular-box overlap rate (Overlap Score, OS). The first index, based on the center-point position error, yields the precision plot and is defined as the average Euclidean distance between the center of the tracked target and the center of the manually annotated rectangular box; its mathematical expression is given in formula (1):

$\delta_{gp} = \sqrt{(x_p - x_g)^2 + (y_p - y_g)^2} \qquad (1)$

wherein (x_g, y_g) is the position of the manually annotated rectangular box (ground truth) and (x_p, y_p) is the predicted target position in the current frame.

If δ_gp is below a given threshold, the result for that frame is deemed successful; in the precision plot the threshold on δ_gp is set to 20 pixels. Because the center location error quantifies the difference in pixels, the precision plot gives no indication of the estimated target size and shape, so the more robust success-rate plot is often used to evaluate an algorithm. For the second evaluation criterion, based on the overlap rate and yielding the success-rate plot: suppose the predicted rectangular box is r_t and the manually annotated rectangular box is r_a; the Overlap Score (OS) is computed as in formula (2):

S = |r_t ∩ r_a| / |r_t ∪ r_a|   (2)

wherein ∪ and ∩ respectively denote the union and intersection of the two regions, and |·| denotes the number of pixels in a region. The OS is used to determine whether the tracking algorithm has successfully tracked the target in the current video frame; frames whose OS exceeds a threshold are counted as successfully tracked. In the success-rate plot the threshold varies between 0 and 1, so the result is a curve. The area under the curve of the precision plot and of the success-rate plot is used to represent the performance of the algorithm.
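For illustration only, the following minimal Python sketch computes the two evaluation indexes for axis-aligned boxes given in (x, y, w, h) format; the box format and the 20-pixel / 0.5 thresholds are assumptions used only in this sketch:

```python
import numpy as np

def center_location_error(box_pred, box_gt):
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    cx_p, cy_p = box_pred[0] + box_pred[2] / 2, box_pred[1] + box_pred[3] / 2
    cx_g, cy_g = box_gt[0] + box_gt[2] / 2, box_gt[1] + box_gt[3] / 2
    return np.hypot(cx_p - cx_g, cy_p - cy_g)

def overlap_score(box_pred, box_gt):
    """Overlap score (IoU) between two axis-aligned boxes, as in formula (2)."""
    x1 = max(box_pred[0], box_gt[0])
    y1 = max(box_pred[1], box_gt[1])
    x2 = min(box_pred[0] + box_pred[2], box_gt[0] + box_gt[2])
    y2 = min(box_pred[1] + box_pred[3], box_gt[1] + box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_pred[2] * box_pred[3] + box_gt[2] * box_gt[3] - inter
    return inter / union if union > 0 else 0.0

def precision_and_success(preds, gts, cle_thresh=20.0, os_thresh=0.5):
    """Fraction of frames under the CLE threshold and over the OS threshold."""
    cle = np.array([center_location_error(p, g) for p, g in zip(preds, gts)])
    ios = np.array([overlap_score(p, g) for p, g in zip(preds, gts)])
    return float((cle <= cle_thresh).mean()), float((ios >= os_thresh).mean())
```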
Step A2, off-line training apparent feature extraction network
First, the training data is prepared: the VID dataset in ImageNet, containing 3000 video segments. Second, the network structure is designed. Because real-time performance is also an important criterion when evaluating a tracking algorithm, the invention designs a lightweight network that uses a Siamese network as its basic structure. It contains two convolutional layers in total, the input data size is 125 x 125, and each convolutional layer is followed by a max pooling layer and a ReLU activation function. On this basis, a correlation filter layer is added and the back propagation of the network is derived. The whole process can be described as follows: given the features φ(z) of the search region, a desired response g(z) is obtained that reaches its highest value at the true target location. The objective function and its solution are shown in formulas (3), (4) and (5):

$L(\theta) = \sum_{i=1}^{D} \big\| g(z_{i}) - y_{i} \big\|^{2} + \gamma \|\theta\|^{2} \qquad (3)$

$g(z) = \mathcal{F}^{-1}\Big( \sum_{l} \hat{w}^{l*} \odot \hat{\varphi}^{l}(z) \Big) \qquad (4)$

$\hat{w}^{l} = \frac{\hat{y}^{*} \odot \hat{\varphi}^{l}(x)}{\sum_{k} \hat{\varphi}^{k}(x) \odot \big(\hat{\varphi}^{k}(x)\big)^{*} + \lambda} \qquad (5)$

wherein θ denotes the network parameters, y is the standard Gaussian response, γ is a regularization coefficient, L(θ) is the target loss function, D is the total number of frames of the video, l indexes the filter channel, ŵ^l is the learned correlation filter, ⊙ denotes the element-wise (Hadamard) product of matrices, φ̂(z) is the extracted search-region feature, z is the search region, w is the correlation filter, F^{-1} is the inverse Fourier transform, ŷ* is the complex conjugate of the discrete Fourier transform of the standard Gaussian response, k indexes the current filter channel, φ̂^k(x) is the target-region feature, (φ̂^k(x))* is the complex conjugate of the target-region feature, λ is a regularization coefficient, and ·* denotes taking the complex conjugate.
The objective function should contain explicit regularization, otherwise the objective may fail to converge. This regularization is applied implicitly through weight decay, as in conventional parameter optimization. In addition, to limit the magnitude of the feature-map values and increase the stability of training, a Local Response Normalization (LRN) layer is added at the end of the convolutional layers. The detection branch and the learning branch are back-propagated using the deep-learning framework PyTorch; once the error has been propagated back to the real-valued feature maps, the rest of the back propagation proceeds as in conventional CNN optimization. Since all back-propagation operations in the correlation filter layer are still Hadamard operations in the Fourier domain, the efficiency of the DCF (discriminative correlation filter) is maintained and offline training can be applied to large-scale datasets. After offline training is complete, a dedicated feature extractor is obtained for the online discriminative correlation filtering tracking algorithm.
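The following PyTorch sketch illustrates a network of this kind: two convolutional layers each followed by max pooling and ReLU, an LRN layer, and a correlation filter layer implementing formulas (4) and (5) in the Fourier domain. The channel widths, kernel sizes and regularization value are illustrative assumptions; this is a simplified sketch under those assumptions, not the exact network of the embodiment:

```python
import torch
import torch.nn as nn
import torch.fft

class AppearanceNet(nn.Module):
    """Lightweight two-conv-layer feature extractor (channel widths are assumptions)."""
    def __init__(self, out_channels=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.MaxPool2d(2), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_channels, kernel_size=3, padding=1),
            nn.MaxPool2d(2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),  # limits feature magnitudes, as described in the text
        )

    def forward(self, x):
        return self.features(x)

def dcf_layer(feat_x, feat_z, y, lam=1e-4):
    """Correlation filter layer: learn the filter on the target features feat_x
    (formula (5)) and evaluate the response on the search features feat_z
    (formula (4)). Tensors are (B, C, H, W); y is the Gaussian label (B, 1, H, W);
    feat_x and feat_z are assumed to share the same spatial size."""
    X = torch.fft.fft2(feat_x)
    Z = torch.fft.fft2(feat_z)
    Y = torch.fft.fft2(y)
    # w_hat^l = conj(Y) * X^l / (sum_k X^k * conj(X^k) + lambda)
    denom = (X * X.conj()).sum(dim=1, keepdim=True) + lam
    W = (Y.conj() * X) / denom
    # g(z) = F^-1( sum_l conj(w_hat^l) * Z^l )
    response = torch.fft.ifft2((W.conj() * Z).sum(dim=1, keepdim=True)).real
    return response
```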
Step A3, training a deep feature network
Given a single test image I, the goal is to learn a function f that can predict the scene depth for each pixel, as shown in equation (6):

$\hat{d} = f(I) \qquad (6)$

wherein $\hat{d}$ is the depth information.
Most existing learning-based approaches treat this as a supervised learning problem, where a color input image and its corresponding target depth values are available at training time. As an alternative, depth estimation can be treated as an image reconstruction problem during training. Specifically, two pictures acquired at the same time by a standard binocular camera are input: a left color image I^l and a right color image I^r, where l and r denote left and right. Instead of trying to predict the depth information directly, the network tries to find a correspondence field d^r that, when applied to the left image, reconstructs the right image; the reconstructed image I^l(d^r) is denoted Ĩ^r. Likewise, the left image can be estimated from the given right image, Ĩ^l = I^r(d^l). Assuming the images are rectified, d corresponds to the image disparity, i.e. a per-pixel scalar value that the model learns to predict. Only the single left image is required as input to the network, while the right image is used only during training. Enforcing consistency between the two disparity maps with this novel left-right consistency loss yields more accurate results. The structure of the depth feature network consists of an encoder and a decoder built on a ResNet, comprising 5 convolutional layers and 5 deconvolution layers in total. The decoder uses skip connections from the encoder activation blocks, enabling it to resolve higher-resolution details. Regarding the training loss, a loss function C_f is defined at each scale f, and the total loss is the arithmetic sum over all scales:

$C = \sum_{f} C_{f}$
the loss module is again a combination of three main loss functions, as shown in equation (7):
Figure BDA0002206894790000112
wherein, CapIs an apparent matching loss function that represents how similar the reconstructed image is to the corresponding training input, CdsIs a differential smoothness loss function, ClrIs a function of the loss of consistency of the left and right differences,
Figure BDA0002206894790000113
representing the apparent matching loss function of the left and right images,
Figure BDA0002206894790000114
a difference smoothness loss function representing the left and right images,
Figure BDA0002206894790000115
left-right difference consistency loss function, alpha, representing left and right imagesap、αds、αlrAre the weighting coefficients of the three loss functions. Each of the main loss functions contains left and right image variants, but only the left image is input to the convolutional layer.
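As an illustrative sketch, the per-scale loss of formula (7) can be assembled as below; the individual loss terms are simplified (an L1 photometric term instead of the full appearance-matching loss, and the disparity warping is assumed to be computed elsewhere), and the weight values are assumptions:

```python
import torch

def appearance_matching_loss(img, img_reconstructed):
    """Simplified photometric loss (L1); the full method also uses a structural term."""
    return (img - img_reconstructed).abs().mean()

def disparity_smoothness_loss(disp, img):
    """Edge-aware smoothness: penalise disparity gradients, downweighted at image edges."""
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def lr_consistency_loss(disp_l, disp_r_warped_to_l):
    """Left-right disparity consistency (the warping itself is omitted here)."""
    return (disp_l - disp_r_warped_to_l).abs().mean()

def scale_loss(left, right, recon_l, recon_r, disp_l, disp_r,
               disp_r_to_l, disp_l_to_r, a_ap=1.0, a_ds=0.1, a_lr=1.0):
    """One-scale loss C_f of formula (7); the weights are illustrative assumptions."""
    c_ap = appearance_matching_loss(left, recon_l) + appearance_matching_loss(right, recon_r)
    c_ds = disparity_smoothness_loss(disp_l, left) + disparity_smoothness_loss(disp_r, right)
    c_lr = lr_consistency_loss(disp_l, disp_r_to_l) + lr_consistency_loss(disp_r, disp_l_to_r)
    return a_ap * c_ap + a_ds * c_ds + a_lr * c_lr

# Total loss: arithmetic sum of the per-scale losses over all output scales, C = sum_f C_f.
```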
Step A4, train the correlation filter
As shown in fig. 3, based on the training set constructed in step a1, the apparent features and the depth features corresponding to the data set are obtained through the apparent feature extraction network trained in step a2 and the depth feature extraction network trained in step A3.
In order to increase the robustness of the tracking performance and prevent the target from being disturbed by the background, in FIG. 3 a target template, i.e. the target region, is obtained according to the target position of the (t-1)-th frame and the preset target size, a region 2 times the size of the target region around it is selected as the search region, and the search region is input into the apparent feature extraction network. This adds more background information, prevents the target from drifting, and increases the discriminability of the model. Similarly, the depth features of the target template and the search region are extracted by the depth feature extraction network. To obtain richer depth features, in the first frame of a video the whole image is first input into the depth feature extraction network to extract the depth information of the entire image, which is then cropped at the target region to give the required depth features.
Because the extracted apparent features and depth features have mismatched dimensions (the apparent feature has 32 channels while the depth feature has 1), a weight coefficient α is introduced to fuse the features so that the large difference in dimensionality does not weaken the influence of the depth feature. The fusion process is shown in formula (8):

$\psi(x_{k}) = \alpha\,\varphi(x_{k}) + (1-\alpha)\,\varphi_{d}(x_{k}) \qquad (8)$

wherein φ(x_k) is the apparent feature extracted from the k-th frame, φ_d(x_k) is the depth feature extracted from the k-th frame, and ψ(x_k) is the fused feature of the k-th frame.
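A minimal sketch of this fusion step follows; broadcasting the single-channel depth feature across the appearance channels and the value of α are assumptions made only for illustration:

```python
import torch

def fuse_features(feat_appearance, feat_depth, alpha=0.9):
    """Weighted fusion psi = alpha * appearance + (1 - alpha) * depth (formula (8)).
    The 1-channel depth map is broadcast across the 32 appearance channels; both
    inputs are assumed to share the same spatial size."""
    if feat_depth.shape[1] == 1 and feat_appearance.shape[1] > 1:
        feat_depth = feat_depth.expand_as(feat_appearance)
    return alpha * feat_appearance + (1.0 - alpha) * feat_depth
```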
The fused features are circularly shifted to generate positive and negative samples, and the correlation filter template is trained with these samples through the formula

$\min_{W} \sum_{i} \big( f(x_{i}) - y_{i} \big)^{2} + \lambda \|W\|^{2}$

wherein f(x_i) is the estimated target response for sample x_i, y_i is the standard Gaussian response, and W is the correlation filter.
To train the correlation filter, the difference between the predicted response map and the ideal response map is optimized, as shown in equation (9):

$\min_{w}\; \Big\| y - \mathcal{F}^{-1}\Big( \sum_{l} \hat{w}^{l*} \odot \hat{\psi}^{l}(z) \Big) \Big\|^{2} + \lambda \|w\|^{2} \qquad (9)$

whose closed-form solution has the same form as formula (5), computed on the fused features; wherein ψ̂(x) is the fused feature of the target region, ψ̂(z) is the fused feature of the search region, and (ψ̂(x))* is the complex conjugate of the fused target-region feature.
When a new frame arrives, in order to detect the target, the apparent feature and the depth feature are first extracted from the region estimated in the previous frame and then combined into a unified feature. With the help of the correlation filter layer, the response map of the template over the search area can be calculated by equation (10):

$g(z) = \mathcal{F}^{-1}\Big( \sum_{l} \hat{w}^{l*} \odot \hat{\psi}^{l}(z) \Big) \qquad (10)$

wherein ŵ* is the complex conjugate of the correlation filter.
The position of the object in the current video frame is then obtained from the maximum value of the response map. To ensure the robustness of the proposed model, the filter h is updated with a predefined learning rate η, as shown in equation (11):

$\hat{h}_{k+1} = (1-\eta)\,\hat{h}_{k} + \eta\,\hat{h} \qquad (11)$

wherein ĥ_{k+1} is the correlation filter for the (k+1)-th frame, ĥ_k is the correlation filter for the k-th frame, and ĥ is the filter computed on the current frame.
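The peak search and the learning-rate update of formula (11) can be sketched as follows (the learning-rate value is an illustrative assumption):

```python
import torch

def locate_peak(response):
    """Return the (row, col) of the maximum of a (H, W) response map."""
    idx = torch.argmax(response)
    return int(idx // response.shape[1]), int(idx % response.shape[1])

def update_filter(w_prev, w_new, eta=0.01):
    """Linear-interpolation update of the correlation filter, formula (11)."""
    return (1.0 - eta) * w_prev + eta * w_new
```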
As for the scale change of the target, pyramid image blocks with scale factors are used for scale filtering:

$\big\{ a^{s} \mid s \in S \big\}$

wherein a is the scale coefficient, S is the scale pool, and the size of S is the preset number of scales. For example, if the algorithm uses 3 scales, the scale pool is S = (-1, 0, 1), and the powers a^s give the scale factors to be used, e.g. (0.97, 1, 1.03). Each scale factor is applied to the target region: 0.97 means the region is shrunk to 0.97 times its size (the target appears smaller) and 1.03 means the region is enlarged. Filtering is then performed at each scale, and the scale with the maximum response value is selected.
The value at each point of the obtained target response map corresponds to the probability that the point is the target; therefore the point with the maximum value in the response map is selected as the estimated target position, and once the position of the target in the new frame is obtained, the motion model is updated. During online tracking, only the filter is updated over time. The optimization problem for the target can be represented in an incremental form, as shown in equation (12):

$\varepsilon = \sum_{p=1}^{t} \beta_{p} \Big\| \mathcal{F}^{-1}\Big( \sum_{l} \hat{w}^{l*}_{p} \odot \hat{\psi}^{l}(x_{p}) \Big) - y_{p} \Big\|^{2} + \lambda \sum_{l} \big\| \hat{w}^{l}_{p} \big\|^{2} \qquad (12)$

wherein ε is the output loss, p indexes the p-th sample, t is the current sample, β_t is the impact factor of the current sample, ŵ_p is the correlation filter of the p-th sample, and ψ̂(x_p) is the feature of the p-th sample.
The parameter β_t > 0 is the weight of sample x_t, and the closed-form solution of the equation can also be extended to the time series, as shown in equation (13):

$\hat{w}^{l}_{t} = \frac{\sum_{p=1}^{t} \beta_{p}\, \hat{y}^{*}_{p} \odot \hat{\psi}^{l}(x_{p})}{\sum_{p=1}^{t} \beta_{p} \Big( \sum_{k} \hat{\psi}^{k}(x_{p}) \odot \big(\hat{\psi}^{k}(x_{p})\big)^{*} + \lambda \Big)} \qquad (13)$

wherein ŵ_p is the correlation filter of the p-th sample, ψ̂(x_t) is the feature of sample x_t, ψ̂^k(x_t) is the feature of sample x_t obtained from the k-th filter channel, and (ψ̂^k(x_t))* is its complex conjugate.
The advantage of such incremental updating is that we do not need to save a large sample set and therefore take up little space.
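A sketch of this incremental update, accumulating the numerator and denominator of formula (13) frame by frame so that no sample set needs to be stored; the constant per-frame weight β is an assumption:

```python
import torch
import torch.fft

class IncrementalDCF:
    """Incremental filter update in the spirit of formulas (12)-(13): numerator and
    denominator are accumulated over frames with per-sample weights."""
    def __init__(self, lam=1e-4):
        self.lam = lam
        self.num = None  # sum_p beta_p * conj(Y_p) * Psi_p
        self.den = None  # sum_p beta_p * (sum_k Psi_p^k * conj(Psi_p^k) + lambda)

    def update(self, feat, y, beta=0.01):
        Psi = torch.fft.fft2(feat)  # fused feature, shape (C, H, W)
        Y = torch.fft.fft2(y)       # Gaussian label, shape (1, H, W)
        num_t = beta * (Y.conj() * Psi)
        den_t = beta * ((Psi * Psi.conj()).sum(dim=0, keepdim=True) + self.lam)
        self.num = num_t if self.num is None else self.num + num_t
        self.den = den_t if self.den is None else self.den + den_t

    def filter(self):
        return self.num / self.den  # w_hat per channel
```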
2. Target tracking method based on apparent features and depth features
The embodiment of the invention provides a target tracking method based on an apparent characteristic and a depth characteristic, which comprises the following steps:
step S10, acquiring the area of the target to be tracked in the t frame image according to the target position of the t-1 frame and the preset target size, and taking the area as the target area; and acquiring a region with the size N times of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region.
In this embodiment, the target position and target size are given by a rectangular box in the first frame of the video. For subsequent frames, the region of the target in the current frame image is obtained based on the target position of the previous frame and the preset target size and used as the target region, and a region 2 times the size of the target region, centred on the previous frame's target region, is used as the search region. This increases the robustness of the tracking performance and prevents the target from being disturbed by the background.
The target size in the present embodiment is set according to the actual application.
And step S20, respectively extracting the apparent features and the depth features corresponding to the target area and the search area through an apparent feature extraction network and a depth feature extraction network.
In this embodiment, the target region and the search region acquired in step S10 are input into the apparent feature extraction network and the depth feature extraction network, respectively, to obtain an apparent feature and a depth feature corresponding to each other.
When extracting the depth features of the first frame of the video, the whole image is first input into the depth feature extraction network to extract the depth information of the entire image, which is then cropped at the target position to give the required depth features.
Wherein the apparent features collectively comprise 32 dimensions, the size is 125 x 32, and the size of the depth features is limited to 125 x 1.
And step S30, respectively carrying out weighted average on the apparent features and the depth features corresponding to the target region and the search region based on preset weights to obtain the fusion features of the target region and the search region.
In this embodiment, because the dimensions of the extracted apparent feature and depth feature are not matched (the apparent feature is 32-dimensional and the depth feature is 1-dimensional), a weight coefficient is introduced to fuse the features so that the large dimensional difference does not weaken the influence of the depth feature, and the fused features of the target region and the search region are obtained respectively.
Step S40, obtaining a response map of the target to be tracked through a correlation filter according to the fused features of the target region and the fused features of the search region; and taking the position corresponding to the peak value of the response map as the target position of the t-th frame.
In this embodiment, a response map is obtained through a trained correlation filter based on the fusion characteristics of the target region and the search region, and the position of the target to be tracked in the current video frame is obtained through the maximum value on the response map.
When the target position of the current frame has been obtained, the correlation filter is updated based on the filter state of the previous frame and a preset learning rate.
For steps S10-S40, reference may be made to FIG. 4: the apparent feature and the depth feature are obtained through the apparent feature extraction network (ANET) and the depth feature extraction network (DNET), respectively, from the target region image (Current image) and the search region image (Search image); the fused features of the two images are obtained by weighted averaging with the preset weight α; the response map (Response map) is obtained by the correlation filter (DCF); and the position of the target is given by the maximum value of the response map.
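The following sketch condenses steps S10-S40 into a single per-frame tracking step on already-fused features; the Gaussian-label parameters, learning rate and regularization value are assumptions, and cropping and scale search are omitted for brevity:

```python
import torch
import torch.fft

def gaussian_label(h, w, sigma=2.0):
    """Centred 2-D Gaussian used as the ideal response, shape (1, H, W)."""
    ys = torch.arange(h).view(-1, 1) - (h - 1) / 2
    xs = torch.arange(w).view(1, -1) - (w - 1) / 2
    return torch.exp(-(ys ** 2 + xs ** 2) / (2 * sigma ** 2)).unsqueeze(0)

def track_step(w_prev, fused_target, fused_search, y, lam=1e-4, eta=0.01):
    """One tracking step on fused features of shape (C, H, W): fit a filter on the
    current target region, blend it with the previous filter (formula (11)), and
    locate the target peak in the search-region response (formula (10))."""
    X = torch.fft.fft2(fused_target)
    Z = torch.fft.fft2(fused_search)
    Y = torch.fft.fft2(y)
    w_new = (Y.conj() * X) / ((X * X.conj()).sum(0, keepdim=True) + lam)
    w = w_new if w_prev is None else (1 - eta) * w_prev + eta * w_new
    response = torch.fft.ifft2((w.conj() * Z).sum(0, keepdim=True)).real.squeeze(0)
    peak = torch.argmax(response)
    row, col = int(peak // response.shape[1]), int(peak % response.shape[1])
    # The displacement of (row, col) from the map centre gives the target motion
    # relative to the previous frame's position.
    return (row, col), w
```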
A second embodiment of the present invention is a target tracking system based on appearance features and depth features, as shown in fig. 2, including: the system comprises an acquisition region module 100, an extraction feature module 200, a feature fusion module 300 and an output position module 400;
an obtaining region module 100, configured to obtain a region of a target to be tracked in a t-th frame image according to a target position of a t-1 frame and a preset target size, and take the region as a target region; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
an extraction feature module 200 configured to extract an apparent feature and a depth feature corresponding to the target region and the search region respectively through an apparent feature extraction network and a depth feature extraction network;
the feature fusion module 300 is configured to perform weighted average on the apparent features and the depth features corresponding to the target region and the search region respectively based on preset weights to obtain fusion features of the target region and fusion features of the search region;
an output position module 400, configured to obtain a response map of the target to be tracked through a correlation filter according to the fusion feature of the target region and the fusion feature of the search region; taking the position corresponding to the peak value of the response image as the target position of the t frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that, the target tracking system based on the apparent feature and the depth feature provided in the foregoing embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores therein a plurality of programs adapted to be loaded by a processor and to implement the above-described object tracking method based on apparent features and depth features.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the apparent feature and depth feature based object tracking method described above.
It can be clearly understood by those skilled in the art that, for convenience and brevity, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the programs corresponding to the software modules and method steps may be located in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A method for object tracking based on appearance features and depth features, the method comprising:
step S10, acquiring the area of the target to be tracked in the t frame image according to the target position of the t-1 frame and the preset target size, and taking the area as the target area; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
step S20, respectively extracting the apparent features and the depth features corresponding to the target area and the search area through an apparent feature extraction network and a depth feature extraction network;
step S30, respectively carrying out weighted average on the apparent features and the depth features corresponding to the target region and the search region based on preset weights to obtain fusion features of the target region and the search region;
step S40, obtaining a response map of the target to be tracked through a correlation filter according to the fused features of the target region and the fused features of the search region; and taking the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
2. The method for tracking the target according to claim 1, wherein in step S10, "acquiring the region of the target to be tracked in the image of the t-th frame according to the target position of the t-1 frame and the preset target size, and using the region as the target region", the method comprises: if t is equal to 1, acquiring a target area of a target to be tracked according to a preset target position and a preset target size; and if t is larger than 1, acquiring a target area of the target to be tracked according to the target position of the t-1 frame and the preset target size.
3. The method for tracking an object based on the appearance features and the depth features according to claim 1, wherein the appearance feature extraction network has the following structure: it comprises two convolutional layers and a correlation filter layer, where each convolutional layer is followed by a max pooling layer and a ReLU activation function; the network is trained with a back-propagation algorithm.
4. The method for tracking the target based on the appearance feature and the depth feature of claim 1, wherein the depth feature extraction network has a structure that: 5 convolutional layers, 5 deconvolution layers; the depth feature extraction network is trained through mutual reconstruction of binocular images in the training process.
5. The target tracking method based on the appearance features and the depth features according to claim 2, wherein in the process of extracting the depth features, if t is equal to 1, the extraction method of the depth feature extraction network is as follows:
acquiring the depth feature of the first frame image based on a depth feature extraction network;
and acquiring the depth features of the target area and the search area based on the depth feature of the first frame image and the preset target position.
6. The object tracking method based on the appearance feature and the depth feature of claim 1, wherein when filtering the object region, the correlation filter obtains different scales by a scale transformation method, enlarges or reduces the object region according to the different scales, and then performs filtering, the scale transformation method being:

$\big\{ a^{s} \mid s \in S \big\}$

wherein a is a scale coefficient, S is the scale pool whose size is the preset number of scales, and a^s is a scale factor.
7. The method for tracking an object based on an apparent feature and a depth feature according to any one of claims 1 to 6, further comprising updating the state values of the correlation filters after step S40 by:
acquiring the state value of a correlation filter in a t-1 frame;
and updating the correlation filter's state value for the t-th frame based on that state value, the target position of the t-th frame, and a preset learning rate.
8. A target tracking system based on appearance features and depth features is characterized by comprising an acquisition region module, a feature extraction module, a feature fusion module and an output position module;
the acquisition region module is configured to acquire a region of a target to be tracked in a t-th frame image according to the target position of the t-1 frame and a preset target size, and the region is used as a target region; acquiring a region with the size N times that of the target region based on the central point of the target position of the t-1 frame, and taking the region as a search region;
the feature extraction module is configured to extract the apparent features and the depth features corresponding to the target area and the search area respectively through an apparent feature extraction network and a depth feature extraction network;
the feature fusion module is configured to perform a weighted average of the apparent features and the depth features corresponding to the target region and the search region, respectively, based on preset weights, to obtain a fused feature of the target region and a fused feature of the search region;
the output position module is configured to obtain a response map of the target to be tracked through a correlation filter according to the fused feature of the target region and the fused feature of the search region, and to take the position corresponding to the peak value of the response map as the target position of the t-th frame;
wherein:
the apparent feature extraction network is constructed based on a convolutional neural network and is used for acquiring corresponding apparent features according to an input image;
the depth feature extraction network is constructed based on a ResNet network and is used for acquiring the corresponding depth features according to the input image.
9. A storage device having a plurality of programs stored therein, wherein the programs are adapted to be loaded and executed by a processor to implement the target tracking method based on apparent features and depth features of any of claims 1-7.
10. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the apparent and depth feature based object tracking method of any of claims 1-7.
CN201910884524.4A 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature Active CN110706253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910884524.4A CN110706253B (en) 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910884524.4A CN110706253B (en) 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature

Publications (2)

Publication Number Publication Date
CN110706253A CN110706253A (en) 2020-01-17
CN110706253B true CN110706253B (en) 2022-03-08

Family

ID=69194485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910884524.4A Active CN110706253B (en) 2019-09-19 2019-09-19 Target tracking method, system and device based on apparent feature and depth feature

Country Status (1)

Country Link
CN (1) CN110706253B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767450A (en) * 2021-01-25 2021-05-07 开放智能机器(上海)有限公司 Multi-loss learning-based related filtering target tracking method and system
CN113592899A (en) * 2021-05-28 2021-11-02 北京理工大学重庆创新中心 Method for extracting correlated filtering target tracking depth features
CN113327273B (en) * 2021-06-15 2023-12-19 中国人民解放军火箭军工程大学 Infrared target tracking method based on variable window function correlation filtering

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794385A (en) * 2010-03-23 2010-08-04 上海交通大学 Multi-angle multi-target fast human face tracking method used in video sequence
CN105719292A (en) * 2016-01-20 2016-06-29 华东师范大学 Method of realizing video target tracking by adopting two-layer cascading Boosting classification algorithm
CN106780542A (en) * 2016-12-29 2017-05-31 北京理工大学 A kind of machine fish tracking of the Camshift based on embedded Kalman filter
CN107862680A (en) * 2017-10-31 2018-03-30 西安电子科技大学 A kind of target following optimization method based on correlation filter
CN108596951A (en) * 2018-03-30 2018-09-28 西安电子科技大学 A kind of method for tracking target of fusion feature
CN109308713A (en) * 2018-08-02 2019-02-05 哈尔滨工程大学 A kind of improvement core correlation filtering Method for Underwater Target Tracking based on Forward-looking Sonar
CN109344725A (en) * 2018-09-04 2019-02-15 上海交通大学 A kind of online tracking of multirow people based on space-time attention rate mechanism
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN109858326A (en) * 2018-12-11 2019-06-07 中国科学院自动化研究所 Based on classification semantic Weakly supervised online visual tracking method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5227639B2 (en) * 2008-04-04 2013-07-03 富士フイルム株式会社 Object detection method, object detection apparatus, and object detection program

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794385A (en) * 2010-03-23 2010-08-04 上海交通大学 Multi-angle multi-target fast human face tracking method used in video sequence
CN105719292A (en) * 2016-01-20 2016-06-29 华东师范大学 Method of realizing video target tracking by adopting two-layer cascading Boosting classification algorithm
CN106780542A (en) * 2016-12-29 2017-05-31 北京理工大学 A kind of machine fish tracking of the Camshift based on embedded Kalman filter
CN107862680A (en) * 2017-10-31 2018-03-30 西安电子科技大学 A kind of target following optimization method based on correlation filter
CN108596951A (en) * 2018-03-30 2018-09-28 西安电子科技大学 A kind of method for tracking target of fusion feature
CN109308713A (en) * 2018-08-02 2019-02-05 哈尔滨工程大学 A kind of improvement core correlation filtering Method for Underwater Target Tracking based on Forward-looking Sonar
CN109344725A (en) * 2018-09-04 2019-02-15 上海交通大学 A kind of online tracking of multirow people based on space-time attention rate mechanism
CN109858326A (en) * 2018-12-11 2019-06-07 中国科学院自动化研究所 Based on classification semantic Weakly supervised online visual tracking method and system
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges

Also Published As

Publication number Publication date
CN110706253A (en) 2020-01-17


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant