CN109785385B - Visual target tracking method and system


Info

Publication number
CN109785385B
Authority
CN
China
Prior art keywords
target
position information
tracked
network
historical
Prior art date
Legal status
Active
Application number
CN201910058977.1A
Other languages
Chinese (zh)
Other versions
CN109785385A (en)
Inventor
王金桥 (Wang Jinqiao)
赵飞 (Zhao Fei)
唐明 (Tang Ming)
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201910058977.1A priority Critical patent/CN109785385B/en
Publication of CN109785385A publication Critical patent/CN109785385A/en
Application granted granted Critical
Publication of CN109785385B publication Critical patent/CN109785385B/en

Abstract

The invention relates to a visual target tracking method and a system, wherein the visual target tracking method comprises the following steps: acquiring a plurality of historical target templates and historical position information of a current video sequence of a target to be tracked; determining a plurality of groups of target template images and search areas from the current video sequence of the target to be tracked according to each historical target template; predicting the predicted position information of the target template image in the search area according to the target positioning model, each group of target template images and the search area; based on an action network model, determining a target position prediction profit value of the target template image according to the predicted position information of the target template image and the historical position information; and comparing the target position prediction profit values of all the target template images, and determining the prediction position information of the target template image with the maximum target position prediction profit value, so that the prediction position information of the current frame image of the target to be tracked can be accurately determined.

Description

Visual target tracking method and system
Technical Field
The invention relates to the technical field of image processing, in particular to a visual target tracking method and system.
Background
Visual target tracking is an important research direction in computer vision. In each frame of a video sequence, the tracking algorithm needs to determine the position and scale information of the object to be tracked. Because the appearance of the object to be tracked is given only by the first frame and is affected during tracking by factors such as illumination change, scale change, occlusion and motion blur, the tracking algorithm must not only be highly robust to environmental changes, but must also build a strongly discriminative model of the appearance of the target to be tracked. Under such conditions, conventional target tracking algorithms based on hand-crafted features perform poorly.
In recent years, deep learning has enjoyed great success in many areas of computer vision. Deep convolutional neural networks automatically learn strongly discriminative features from large amounts of training data using error back-propagation. Meanwhile, reinforcement learning algorithms based on deep neural networks have shown great potential in complex environments. In other words, deep neural networks can both extract image features and fit complex decision functions.
The existing visual target tracking algorithms based on deep learning or reinforcement learning have the following defects. First, target tracking algorithms based on reinforcement learning can only adjust the tracking result through a set of discrete actions, which is not only inefficient but also limits tracking precision. Second, existing deep-learning-based tracking algorithms are trained only on pairs of discrete frame images rather than on continuous video sequences, so their accumulated error is large and long-term tracking is not possible. Finally, these tracking algorithms lack an effective target template updating strategy; as tracking time grows, the accumulated error gradually increases and eventually leads to tracking failure.
Disclosure of Invention
In order to solve the above problems in the prior art, i.e. to improve the target tracking accuracy, the invention provides a visual target tracking method and system.
In order to achieve the purpose, the invention provides the following scheme:
a visual target tracking method, the visual target tracking method comprising:
acquiring a plurality of historical target templates and historical position information of a current video sequence of a target to be tracked;
determining a plurality of groups of target template images and search areas from the current video sequence of the target to be tracked according to each historical target template;
predicting the predicted position information of the target template image in the search area according to the target positioning model, each group of target template images and the search area;
based on an action network model, determining a target position prediction profit value of the target template image according to the predicted position information of the target template image and the historical position information;
and comparing the target position prediction profit values of all the target template images, and determining the prediction position information of the target template image with the maximum target position prediction profit value for tracking the current frame image of the target to be tracked.
Optionally, the visual target tracking method further includes:
extracting a prediction target template from the current frame image according to the prediction position information;
and replacing the target template corresponding to the target template image with the minimum target position prediction profit value by the prediction target template so as to update the historical target template.
Optionally, the method for constructing the target location model includes:
carrying out target position labeling on a historical video sequence of a target to be tracked to obtain a target tracking data set of the target to be tracked;
determining training data for training the Siamese network according to the target tracking data set;
and performing Siamese network training on the training data to obtain a target positioning model.
Optionally, the target position labeling is performed on the historical video sequence of the target to be tracked, and a target tracking data set of the target to be tracked is obtained, which specifically includes:
extracting a plurality of frames of images from the historical video sequence;
determining a corresponding target rectangular frame from each frame of image, wherein each target rectangular frame is a minimum rectangular frame containing a target to be tracked;
obtaining a target tracking data set according to each target rectangular frame; the target tracking data set includes a plurality of pairs of coordinate pairs formed by upper left corner coordinate information and lower right corner coordinate information of a set rectangular box.
Optionally, the determining training data for training the Siamese network according to the target tracking data set specifically includes:
determining a target area and a search area of each target rectangular frame according to the target tracking data set;
generating a target response graph in a Gaussian shape in the target area by taking the search area as a reference for each target rectangular frame;
determining a truth sample according to the target response graph; the training data includes a plurality of true value samples.
Optionally, the performing Siamese network training on the training data to obtain a target positioning model specifically includes:
for each target rectangular frame, carrying out pixel scaling on a target image of a target area and a search image of a search area to obtain a pair of scaled target images and scaled search images;
converting each pair of the scaled target image and the scaled search image into RGB three-channel images to obtain a pair of color images;
performing mean subtraction on the color images respectively to obtain a pair of mean-subtracted color images;
feeding each pair of mean-subtracted color images into a deep network for training to obtain a deep network output value;
calculating the error between the deep network output value and the true value sample through a loss function;
according to the error, obtaining a target positioning model by adopting a back propagation algorithm; and the target positioning model determines the predicted position information of the target template image in the search area according to the input target template image and the search area.
Optionally, the target positioning model includes two first sub-networks with identical structure and shared parameters, and one deconvolution sub-network;
the output ends of the two first sub-networks are connected in parallel and then connected to the input end of the deconvolution sub-network;
the outputs of the two first sub-networks are combined to serve as the input of the deconvolution sub-network; the output of the deconvolution sub-network is predicted position information that characterizes the center position and scale information of the target.
Optionally, the method for constructing the action network model includes:
determining a plurality of short video sequences according to a target tracking data set of a target to be tracked; each short video sequence comprises a plurality of frames of position images, and each frame of position image comprises position information of an object to be tracked;
and performing network training according to each short video sequence and the target positioning model to determine an action network model.
Optionally, the action network model includes a plurality of weight-sharing second sub-networks;
the input of each second sub-network is the predicted position information and a plurality of pieces of historical position information, and the output is a target position prediction profit value.
In order to solve the technical problems, the invention also provides the following scheme:
a visual target tracking system, the visual target tracking system comprising:
an acquisition unit, configured to acquire a plurality of historical target templates and historical position information of a current video sequence of a target to be tracked;
the first determining unit is used for determining a plurality of groups of target template images and search areas from the current video sequence of the target to be tracked according to each historical target template;
the prediction unit is used for predicting the predicted position information of the target template image in the search area according to the target positioning model, each group of target template images and the search area;
a second determining unit, configured to determine, based on an action network model, a target position prediction profit value of the target template image according to the predicted position information of the target template image and the historical position information;
and the tracking unit is used for comparing the target position prediction profit values of all the target template images, determining the prediction position information of the target template image with the maximum target position prediction profit value, and tracking the current frame image of the target to be tracked.
According to the embodiment of the invention, the invention discloses the following technical effects:
the target positioning method and the target tracking system are based on the target positioning model and the action network model, and can obtain a plurality of pieces of predicted position information according to the historical target template and the historical position information, further determine corresponding target position predicted income values, and accurately determine the predicted position information of the current frame image of the target to be tracked by comparing the sizes of the target position predicted income values.
Drawings
FIG. 1 is a flow chart of a visual target tracking method of the present invention;
FIG. 2 is a schematic diagram of a visual target tracking method according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of a convolution module structure;
FIG. 3b is a schematic diagram of a deconvolution module structure;
FIG. 4 is a schematic diagram of a structure of an object localization model;
FIG. 5 is a block diagram of a visual target tracking system according to the present invention.
Description of the symbols:
an acquisition unit-1, a first determination unit-2, a prediction unit-3, a second determination unit-4, and a tracking unit-5.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The invention provides a visual target tracking method which, based on a target positioning model and an action network model, obtains a plurality of pieces of predicted position information from the historical target templates and historical position information, determines the corresponding target position prediction profit values, and accurately determines the predicted position information of the current frame image of the target to be tracked by comparing these profit values.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1 and 2, the visual target tracking method of the present invention includes:
step 100: and acquiring a plurality of historical target templates and historical position information of the current video sequence of the target to be tracked.
Step 200: and determining a plurality of groups of target template images and search areas from the current video sequence of the target to be tracked according to each historical target template.
Step 300: and predicting the predicted position information of the target template image in the search area according to the target positioning model, each group of target template images and the search area.
Step 400: and determining a target position prediction profit value of the target template image according to the predicted position information of the target template image and the historical position information based on an action network model.
Step 500: and comparing the target position prediction profit values of all the target template images, and determining the prediction position information of the target template image with the maximum target position prediction profit value for tracking the current frame image of the target to be tracked.
Preferably, the visual target tracking method of the present invention further comprises:
step 600: extracting a prediction target template from the current frame image according to the prediction position information;
step 700: and replacing the target template corresponding to the target template image with the minimum target position prediction profit value by the prediction target template so as to update the historical target template.
Further, in step 300, the method for constructing the object location model includes:
step 301: and marking the target position of the historical video sequence of the target to be tracked to obtain a target tracking data set of the target to be tracked.
Step 302: and determining training data for training the Siamese network according to the target tracking data set.
Step 303: and performing Siamese network training on the training data to obtain a target positioning model.
In step 301, the target position labeling is performed on the historical video sequence of the target to be tracked, and a target tracking data set of the target to be tracked is obtained, which specifically includes:
step 3011: extracting a plurality of frames of images from the historical video sequence.
Step 3012: and determining a corresponding target rectangular frame from each frame of image, wherein each target rectangular frame is the minimum rectangular frame containing the target to be tracked.
Step 3013: and obtaining a target tracking data set according to each target rectangular frame.
The target tracking data set includes a plurality of pairs of coordinate pairs formed by upper left corner coordinate information and lower right corner coordinate information of a set rectangular box.
The video sequences containing the target to be tracked can be obtained in many ways, for example collected from the Internet or captured directly; in this embodiment, the position of each target to be tracked is annotated manually.
In step 302, the determining training data for training the Siamese network according to the target tracking data set specifically includes:
step 3021: and determining a target area and a search area of each target rectangular frame according to the target tracking data set.
Step 3022: and generating a target response graph in a Gaussian shape in the target area by taking the search area as a reference for each target rectangular frame.
Step 3023: determining a truth sample according to the target response graph; the training data includes a plurality of true value samples.
In the present embodiment, the size of the target area is set to four times the target size, and the target object is fixed at the center of the target area. The length and width of the search area are set to 1.4 to 3.3 times those of the target object, which makes the target positioning model more robust to scale changes of the target object during tracking and also to changes in its aspect ratio. A Gaussian-shaped target response is generated in the target area with the search image of the search area as reference; the remaining area is background and its response is zero.
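For illustration, a Gaussian-shaped target response map of this kind could be generated as follows; the spread parameter `sigma_ratio` is an assumption, since the exact variance is not specified above.

```python
import numpy as np

def gaussian_response_map(map_size, center, target_wh, sigma_ratio=0.1):
    """Gaussian-shaped response centred on the target; all other positions stay near zero."""
    h, w = map_size
    cx, cy = center
    tw, th = target_wh
    ys, xs = np.mgrid[0:h, 0:w]
    sigma_x, sigma_y = sigma_ratio * tw, sigma_ratio * th        # spread follows the target size (assumed ratio)
    resp = np.exp(-(((xs - cx) ** 2) / (2 * sigma_x ** 2) +
                    ((ys - cy) ** 2) / (2 * sigma_y ** 2)))
    return resp.astype(np.float32)
```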
The target positioning model based on the Siamese network is trained with the generated training data, and the trained target positioning model is output. During training, the initial learning rate is set to 1e-4, the batch size is set to 50, and the maximum number of training iterations is set to 1M.
In step 303, the performing Siamese network training on the training data to obtain a target positioning model specifically includes:
step 3031: and for each target rectangular frame, carrying out pixel scaling on the target image of the target area and the search image of the search area to obtain a pair of scaled target image and scaled search image.
Step 3032: and converting each pair of the scaled target image and the scaled search image into RGB three-channel images to obtain a pair of color images.
Step 3033: and performing mean subtraction on the color images respectively to obtain a pair of mean-subtracted color images.
Step 3034: and feeding each pair of mean-subtracted color images into the deep network for training to obtain a deep network output value.
Step 3035: the error is calculated by the loss function on the deep network output value and the true value sample.
Wherein the loss function is a mean square error loss.
Step 3036: and obtaining a target positioning model by adopting a back propagation algorithm according to the error.
The target location model may determine predicted location information of the target template image in a search area according to an input target template image and the search area.
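A rough Python/PyTorch sketch of the preprocessing and training steps 3031-3036 is given below. The crop sizes (128/256), the per-channel mean values and the two-input model signature are illustrative assumptions only.

```python
import numpy as np
import torch
import torch.nn.functional as F

def preprocess(image, out_size, mean_rgb=(122.7, 115.9, 102.1)):
    """Steps 3031-3033: scale, ensure RGB three channels, subtract the (assumed) channel means."""
    img = np.asarray(image, dtype=np.float32)
    if img.ndim == 2:                                            # grayscale -> RGB three-channel
        img = np.stack([img] * 3, axis=-1)
    rows = np.linspace(0, img.shape[0] - 1, out_size).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, out_size).astype(int)
    img = img[rows][:, cols]                                     # nearest-neighbour pixel scaling
    img = img - np.array(mean_rgb, dtype=np.float32)             # mean subtraction
    return torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)   # 1 x C x H x W

def train_step(model, optimizer, target_img, search_img, gt_response):
    """Steps 3034-3036: forward pass, mean-squared-error loss, back-propagation."""
    optimizer.zero_grad()
    pred_response = model(preprocess(target_img, 128), preprocess(search_img, 256))
    loss = F.mse_loss(pred_response, gt_response)                # the loss function is MSE
    loss.backward()                                              # back propagation of the error
    optimizer.step()
    return loss.item()
```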
Through training, the target positioning model can accurately position the target in the search area according to the input target template image and the search area. Namely, the position of the target to be tracked in the search area is determined according to the position of the maximum response value in the response diagram, and meanwhile, the length and the width of the target are determined according to the shape and the size of the response diagram.
The target center position is the position of the maximum response point C in the response map. The target width is determined as the distance between the two farthest points along the x direction through element C whose response values are greater than the threshold 0.1. Similarly, the target height is the distance between the two farthest points along the y direction through element C whose response values are greater than the threshold 0.1.
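This decoding rule translates directly into code; the 0.1 threshold comes from the text above, the rest is bookkeeping (a sketch, not the patent's implementation):

```python
import numpy as np

def box_from_response(resp, thresh=0.1):
    """Recover the target centre, width and height from a response map."""
    cy, cx = np.unravel_index(np.argmax(resp), resp.shape)       # point C: maximum response
    row, col = resp[cy, :], resp[:, cx]
    xs = np.flatnonzero(row > thresh)                            # x direction through C, responses above 0.1
    ys = np.flatnonzero(col > thresh)                            # y direction through C, responses above 0.1
    width = float(xs[-1] - xs[0]) if xs.size else 0.0            # distance between the two farthest points
    height = float(ys[-1] - ys[0]) if ys.size else 0.0
    return (float(cx), float(cy)), width, height
```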
Specifically, the target positioning model comprises two first sub-networks with identical structure and shared parameters, and one deconvolution sub-network.
The outputs of the two first sub-networks are connected in parallel and then connected to the input of the deconvolution sub-network.
The outputs of the two first sub-networks are combined to serve as the input of the deconvolution sub-network; the output of the deconvolution sub-network is predicted position information that characterizes the center position and scale information of the target.
The two sub-networks with the same structure and shared parameters each comprise 8 convolution modules (as shown in fig. 4), containing 32, 64, 128, 256, 512 and 512 feature maps. The deconvolution sub-network comprises 8 deconvolution modules, containing 512, 256, 128, 64, 32, 16, 8 and 1 feature maps respectively.
As shown in fig. 3a, each convolution module contains 1 input layer, 3 convolution layers, 2 batch normalization layers, 2 nonlinear functions, 1 element-wise addition layer, and 1 output layer. As shown in fig. 3b, each deconvolution module contains 1 input layer, 1 deconvolution layer, 2 convolution layers, 2 batch normalization layers, 2 nonlinear functions, 1 element-wise addition layer, and 1 output layer.
The Batch Normalization (BN) layer consists of a batch norm step and a scale step. The batch norm step normalizes its input to zero mean and unit variance, while the scale step scales and translates the input. The mean and variance of the batch norm step come from the input, whereas the scale and translation parameters of the scale step are learned from the training data. By normalizing the network input, the batch normalization layer effectively reduces internal covariate shift, accelerates convergence, and acts as a regularization mechanism that helps prevent overfitting. The nonlinear function is the ReLU (Rectified Linear Unit) activation function, a commonly used and effective nonlinear activation function that is not described further here.
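One possible PyTorch realization of such a convolution module (3 convolution layers, 2 batch normalization layers, 2 ReLUs and an element-wise addition, per fig. 3a) is sketched below; the exact wiring of the third convolution as a shortcut branch is an assumption, since the figure is not reproduced here.

```python
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution module sketch: residual-style block matching the layer counts of fig. 3a."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride)   # third conv used as projection shortcut (assumed)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(y + self.shortcut(x))                       # element-wise addition layer
```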
In step 400, the method for constructing the action network model includes:
step 401: and determining a plurality of short video sequences according to a target tracking data set of the target to be tracked.
Each short video sequence comprises a plurality of frames of position images, and each frame of position image contains position information of an object to be tracked. In this embodiment, each short video sequence contains at least 50 frames and at most 100 frames of images.
Step 402: and performing network training according to each short video sequence and the target positioning model to determine an action network model.
Wherein the action network model comprises a plurality of weight-sharing second sub-networks.
The input of each second sub-network is the predicted position information and a plurality of pieces of historical position information, and the output is a target position prediction profit value.
Each second sub-network contains 2 fully connected layers, with 64 and 32 neurons respectively. The outputs of all the sub-networks are concatenated into a vector and fed into a fully connected layer with 64 neurons, and the output has n nodes. Here n is 8 and m is 12.
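A possible PyTorch sketch of this action network is given below; the per-template input dimension (derived from m = 12 stored boxes of 4 coordinates) and the final two-layer head are assumptions where the description is ambiguous.

```python
import torch
import torch.nn as nn

class ActionNetwork(nn.Module):
    """Weight-shared per-template MLP (64, 32 units), concatenated and mapped to n profit values."""
    def __init__(self, n_templates=8, in_dim=4 * 12):                 # n = 8; m = 12 boxes of (x, y, w, h) assumed
        super().__init__()
        self.sub = nn.Sequential(                                     # second sub-network, shared across templates
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * n_templates, 64), nn.ReLU(),
            nn.Linear(64, n_templates),                               # one prediction profit value per template
        )

    def forward(self, per_template_inputs):                           # list of n tensors, each (batch, in_dim)
        feats = torch.cat([self.sub(x) for x in per_template_inputs], dim=1)
        return self.head(feats)
```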
In the network training phase, the elements involved in reinforcement learning are first defined as follows: the agent is the target positioning model; the environment (env) is the current video frame together with all stored target templates; the state (s) is the stored historical coordinates and the predicted target coordinates corresponding to all target templates; the action (a) is selecting the best template from the target templates; and the reward (r) is the intersection-over-union (IoU) between the predicted target position (bbx_pre) and the ground-truth box (bbx_gt), i.e.
r = IoU(bbx_pre, bbx_gt) = area(bbx_pre ∩ bbx_gt) / area(bbx_pre ∪ bbx_gt) (1)
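The reward of equation (1) is the standard intersection-over-union; for boxes given as (x1, y1, x2, y2) corners (a coordinate convention assumed here) it can be computed as:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2); used as the reward r."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)
```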
The action value Qπ(s, a) represents the expected return of selecting action a in state s and thereafter following policy π, and Vπ(s) represents the expected return of following policy π from state s. During training, actions are selected according to the probability distribution output by the action network.
During training, the state s_t is first obtained from env, and the reward r is computed using equation (1). The env is then updated using the action network, the next state s_{t+1} is obtained from env, and the TD error is computed, i.e.
td = r + γ·Vπ(s_{t+1}; θ_c) − Vπ(s_t; θ_c) (2);
Where γ is equal to 0.9. The gradient of the action network is then calculated, i.e.
dθ_a = ∇_{θ_a} log π(a_t | s_t; θ_a) · td (3);
The gradient of the critic network is then calculated, i.e.
dθ_c = ∇_{θ_c} (td)² (4);
Finally, the action network and the critic network are updated using their respective gradients.
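A minimal sketch of one such TD-error update is shown below, under the assumptions that `critic(state)` returns V(s; θ_c) and that the caller supplies log π(a|s; θ_a) for the selected template; it mirrors equations (2)-(4) but is not the patent's actual training code.

```python
import torch

def actor_critic_step(critic, actor_opt, critic_opt,
                      state, next_state, action_logprob, reward, gamma=0.9):
    """One temporal-difference update of the action network and the critic network."""
    with torch.no_grad():
        v_next = critic(next_state)
    v_curr = critic(state)
    td = reward + gamma * v_next - v_curr                    # equation (2): TD error

    actor_loss = -(action_logprob * td.detach()).mean()      # equation (3): policy gradient weighted by td
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    critic_loss = td.pow(2).mean()                           # equation (4): squared TD error
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return td.detach().mean().item()
```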
For the single-target tracking problem, the invention designs a deep neural network based on a Siamese network and an actor-critic network structure based on a multilayer perceptron. Through offline supervised learning, the Siamese-network-based deep neural network can accurately locate the target within the search area and accurately predict the length and width of the target; by updating the parameters of the deconvolution part of the Siamese network online, the prediction accuracy of the network can be greatly improved. Through reinforcement learning based on the temporal-difference error, the action network learns to accurately predict the value of different target templates during tracking and thereby learns a robust target template updating strategy; meanwhile, the critic network evaluates the decisions made by the action network during training, so that the action network learns an even more robust template updating strategy. The method achieves high tracking performance in single-target visual tracking tasks.
In addition, the invention also provides a visual target tracking system to improve the target tracking precision.
As shown in fig. 5, the visual target tracking system of the present invention includes an acquisition unit 1, a first determination unit 2, a prediction unit 3, a second determination unit 4, and a tracking unit 5.
The acquiring unit 1 is configured to acquire a plurality of historical target templates and historical position information of a current video sequence of a target to be tracked.
The first determining unit 2 is configured to determine, according to each of the historical target templates, a plurality of sets of target template images and search regions from the current video sequence of the target to be tracked.
The prediction unit 3 is configured to predict the predicted position information of the target template image in the search area according to the target positioning model, and each set of target template image and search area.
The second determining unit 4 is configured to determine a target position prediction profit value of the target template image according to the predicted position information of the target template image and the historical position information based on an action network model.
The tracking unit 5 is configured to compare the target position prediction benefit values of the target template images, determine the prediction position information of the target template image with the maximum target position prediction benefit value, and track the current frame image of the target to be tracked.
Preferably, the visual target tracking system of the present invention further comprises an extraction unit and an updating unit. The extraction unit is used for extracting a prediction target template from the current frame image according to the prediction position information; the updating unit is used for replacing the target template corresponding to the target template image with the minimum target position prediction profit value with the prediction target template, so as to update the historical target template.
Compared with the prior art, the visual target tracking system has the same beneficial effects as the visual target tracking method, and is not repeated herein.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A visual target tracking method, comprising:
acquiring a plurality of historical target templates and historical position information of a current video sequence of a target to be tracked;
determining a plurality of groups of target template images and search areas from the current video sequence of the target to be tracked according to each historical target template;
predicting the predicted position information of the target template image in the search area according to the target positioning model, each group of target template images and the search area;
based on an action network model, determining a target position prediction profit value of the target template image according to the predicted position information of the target template image and the historical position information;
and comparing the target position prediction profit values of all the target template images, and determining the prediction position information of the target template image with the maximum target position prediction profit value for tracking the current frame image of the target to be tracked.
2. The visual target tracking method of claim 1, further comprising:
extracting a prediction target template from the current frame image according to the prediction position information of the target template image with the maximum target position prediction profit value;
and replacing the target template corresponding to the target template image with the minimum target position prediction profit value by the prediction target template so as to update the historical target template.
3. The visual target tracking method of claim 1, wherein the target location model is constructed by a method comprising:
carrying out target position labeling on a historical video sequence of a target to be tracked to obtain a target tracking data set of the target to be tracked;
determining training data for training the Siamese network according to the target tracking data set;
and performing Siamese network training on the training data to obtain a target positioning model.
4. The visual target tracking method according to claim 3, wherein the target position labeling is performed on the historical video sequence of the target to be tracked, and a target tracking data set of the target to be tracked is obtained, specifically comprising:
extracting a plurality of frames of images from the historical video sequence;
determining a corresponding target rectangular frame from each frame of image, wherein each target rectangular frame is a minimum rectangular frame containing a target to be tracked;
obtaining a target tracking data set according to each target rectangular frame; the target tracking data set includes a plurality of pairs of coordinate pairs formed by upper left corner coordinate information and lower right corner coordinate information of a set rectangular box.
5. The visual target tracking method according to claim 4, wherein the determining training data for training the Siamese network according to the target tracking data set specifically comprises:
determining a target area and a search area of each target rectangular frame according to the target tracking data set;
generating a target response graph in a Gaussian shape in the target area by taking the search area as a reference for each target rectangular frame;
determining a truth sample according to the target response graph; the training data includes a plurality of true value samples.
6. The visual target tracking method according to claim 5, wherein the performing Siamese network training on the training data to obtain a target positioning model specifically comprises:
for each target rectangular frame, carrying out pixel scaling on a target image of a target area and a search image of a search area to obtain a pair of scaled target images and scaled search images;
converting each pair of the scaled target image and the scaled search image into RGB three-channel images to obtain a pair of color images;
performing mean subtraction on the color images respectively to obtain a pair of mean-subtracted color images;
feeding each pair of mean-subtracted color images into a deep network for training to obtain a deep network output value;
calculating the error between the deep network output value and the true value sample through a loss function;
according to the error, obtaining a target positioning model by adopting a back propagation algorithm; and the target positioning model determines the predicted position information of the target template image in the search area according to the input target template image and the search area.
7. A visual target tracking method according to any one of claims 1-6, wherein the target positioning model comprises two first sub-networks with identical structure and shared parameters, and one deconvolution sub-network;
the output ends of the two first sub-networks are connected in parallel and then connected to the input end of the deconvolution sub-network;
the outputs of the two first sub-networks are combined to serve as the input of the deconvolution sub-network; the output of the deconvolution sub-network is predicted position information that characterizes the center position and scale information of the target.
8. The visual target tracking method of claim 3, wherein the method of constructing the action network model comprises:
determining a plurality of short video sequences according to a target tracking data set of a target to be tracked; each short video sequence comprises a plurality of frames of position images, and each frame of position image comprises position information of an object to be tracked;
and performing network training according to each short video sequence and the target positioning model to determine an action network model.
9. The visual target tracking method of claim 1, wherein the action network model comprises a plurality of weight-sharing second sub-networks;
the input of each second sub-network is the predicted position information and a plurality of pieces of historical position information, and the output is a target position prediction profit value.
10. A visual target tracking system, the visual target tracking system comprising:
an acquisition unit, configured to acquire a plurality of historical target templates and historical position information of a current video sequence of a target to be tracked;
the first determining unit is used for determining a plurality of groups of target template images and search areas from the current video sequence of the target to be tracked according to each historical target template;
the prediction unit is used for predicting the predicted position information of the target template image in the search area according to the target positioning model, each group of target template images and the search area;
a second determining unit, configured to determine, based on an action network model, a target position prediction profit value of the target template image according to the predicted position information of the target template image and the historical position information;
and the tracking unit is used for comparing the target position prediction profit values of all the target template images, determining the prediction position information of the target template image with the maximum target position prediction profit value, and tracking the current frame image of the target to be tracked.
CN201910058977.1A 2019-01-22 2019-01-22 Visual target tracking method and system Active CN109785385B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910058977.1A CN109785385B (en) 2019-01-22 2019-01-22 Visual target tracking method and system

Publications (2)

Publication Number Publication Date
CN109785385A CN109785385A (en) 2019-05-21
CN109785385B (en) 2021-01-29

Family

ID=66502068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910058977.1A Active CN109785385B (en) 2019-01-22 2019-01-22 Visual target tracking method and system

Country Status (1)

Country Link
CN (1) CN109785385B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738687A (en) * 2019-10-18 2020-01-31 上海眼控科技股份有限公司 Object tracking method, device, equipment and storage medium
CN110796686B (en) * 2019-10-29 2022-08-09 浙江大华技术股份有限公司 Target tracking method and device and storage device
CN111161311A (en) * 2019-12-09 2020-05-15 中车工业研究院有限公司 Visual multi-target tracking method and device based on deep learning
CN111161314B (en) * 2019-12-17 2024-03-12 中国科学院上海微系统与信息技术研究所 Target object position area determination method and device, electronic equipment and storage medium
CN111392619B (en) * 2020-03-25 2021-11-23 广东博智林机器人有限公司 Tower crane early warning method, device and system and storage medium
CN111563913B (en) * 2020-04-15 2021-12-10 上海摩象网络科技有限公司 Searching method and device based on tracking target and handheld camera thereof
CN112037254A (en) * 2020-08-11 2020-12-04 浙江大华技术股份有限公司 Target tracking method and related device
CN112037255A (en) * 2020-08-12 2020-12-04 深圳市道通智能航空技术有限公司 Target tracking method and device
CN112614111B (en) * 2020-12-24 2023-09-05 南开大学 Video tampering operation detection method and device based on reinforcement learning
CN113052874B (en) * 2021-03-18 2022-01-25 上海商汤智能科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN113421287A (en) * 2021-07-16 2021-09-21 上海微电机研究所(中国电子科技集团公司第二十一研究所) Robot based on vision active target tracking and control method and system thereof
CN113947616B (en) * 2021-09-23 2022-08-30 北京航空航天大学 Intelligent target tracking and loss rechecking method based on hierarchical perceptron

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424634A (en) * 2013-08-23 2015-03-18 株式会社理光 Object tracking method and device
CN105931269A (en) * 2016-04-22 2016-09-07 海信集团有限公司 Tracking method for target in video and tracking device thereof
US9552648B1 (en) * 2012-01-23 2017-01-24 Hrl Laboratories, Llc Object tracking with integrated motion-based object detection (MogS) and enhanced kalman-type filtering
CN106408592A (en) * 2016-09-09 2017-02-15 南京航空航天大学 Target tracking method based on target template updating
CN107038448A (en) * 2017-03-01 2017-08-11 中国科学院自动化研究所 Target detection model building method
CN108021856A (en) * 2016-10-31 2018-05-11 比亚迪股份有限公司 Light for vehicle recognition methods, device and vehicle
CN108460787A (en) * 2018-03-06 2018-08-28 北京市商汤科技开发有限公司 Method for tracking target and device, electronic equipment, program, storage medium
CN109145781A (en) * 2018-08-03 2019-01-04 北京字节跳动网络技术有限公司 Method and apparatus for handling image

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521840B (en) * 2011-11-18 2014-06-18 深圳市宝捷信科技有限公司 Moving target tracking method, system and terminal
CN102982559B (en) * 2012-11-28 2015-04-29 大唐移动通信设备有限公司 Vehicle tracking method and system
CN108010058A (en) * 2017-11-29 2018-05-08 广东技术师范学院 A kind of method and system that vision tracking is carried out to destination object in video flowing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"视频目标跟踪与定位方法研究";杨静;《中国优秀硕士学位论文全文数据库 信息科技辑》;20180315(第3期);正文第2-12页 *

Also Published As

Publication number Publication date
CN109785385A (en) 2019-05-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant