CN108549928A

CN108549928A - Visual tracking method and device based on continuous moving under deeply learning guide

Info

Publication number: CN108549928A
Application number: CN201810226092.3A
Authority: CN
Inventors: 鲁继文; 周杰; 任亮亮; 袁鑫
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2018-03-19
Filing date: 2018-03-19
Publication date: 2018-09-18
Anticipated expiration: 2038-03-19
Also published as: CN108549928B

Abstract

The invention discloses visual tracking methods and device based on continuous moving under a kind of deeply learning guide, wherein method includes：Pre-training predicts network；Multiple actions are generated according to prediction network and are rewarded accordingly；The Q values that each of multiple actions act are obtained, while updating the network of prediction and generation action.This method will can continuously and cumulatively adjust the target frame of object, while dynamically adjust the external appearance characteristic and model of target object, greatly improve robustness.

Description

Visual tracking method and device based on continuous moving under deeply learning guide

Technical field

The present invention relates to Visual Tracking field, more particularly to continuous moving is based under a kind of deeply learning guide Visual tracking method and device.

Background technology

Vision object tracking is a underlying issue in computer vision, in vision monitoring, robot control, man-machine friendship The fields such as mutual and advanced DAS (Driver Assistant System) are all widely used.Between past many decades, a large amount of vision tracking Method is suggested, but in unrestricted natural environment, deformation, unexpected movement, is blocked and illumination variation to regard Feel that tracking problem still has prodigious challenge.

The purpose of vision tracking problem is the object information only according to first frame, to determine object space in video.When The best visual tracking method of preceding effect is broadly divided into two classes：Method based on correlation filtering and the method based on deep learning. Wherein, the method design correlation filter based on correlation filtering generates the peak value of the correlation filtering of target object in each frame.This Kind method does not need to the multiple repairing weld of object appearance.Based on basic framework MOSSE (Minimum Output Sum of Squared Error filter, error least square and filter) method, a large amount of method such as CFTs and DSST (Discriminatiive Scale Space Tracker, DSST algorithm) is suggested to using color attribute and solves scale Problem.Method tracking based on deep learning uses depth convolutional neural networks as grader, from many candidate frames Select most possible position.The method MDNet, FCNT and STCT based on deep learning represented uses similar sliding Window and the repeatedly inefficient search technique such as sampling.In recent years, some vision track sides that decision is carried out by intensified learning Method is suggested.Such as ADNet using the method for policy gradient come to target object size and displacement carry out decision. But most of existing methods are by sampling come online updating depth model, it is easy to by big deformation and unexpected movement It influences, so that precision reduces, to have to be solved.

Invention content

The present invention is directed to solve at least some of the technical problems in related technologies.

For this purpose, an object of the present invention is to provide the visions based on continuous moving under a kind of deeply learning guide Tracking, this method can greatly improve robustness.

It is another object of the present invention to propose under a kind of deeply learning guide based on the vision of continuous moving with Track device.

In order to achieve the above objectives, one aspect of the present invention embodiment proposes under a kind of deeply learning guide based on continuous Mobile visual tracking method, includes the following steps：Pre-training predicts network；Multiple actions are generated according to the prediction network simultaneously It is rewarded accordingly；The Q values that each of multiple actions act are obtained, while updating the network of prediction and generation action.

Visual tracking method based on continuous moving under the deeply learning guide of the embodiment of the present invention, can pass through root Multiple actions are generated according to pre-training prediction network and each of are rewarded, and obtain multiple actions the Q values acted accordingly, from And update the network of prediction and generation action simultaneously, by vision tracking problem be modeled as one it is continuous and the movement accumulated Problem has stronger robustness for tracking target appearance variation caused by complex background and deformation, delays to a certain extent It has solved due to large deformation and target is drifted about caused by fast moving.

In addition, the vision track side based on continuous moving under deeply learning guide according to the above embodiment of the present invention Method can also have following additional technical characteristic：

Further, in one embodiment of the invention, the object function of the prediction network is：

Wherein, Δ x and Δ y is the conversion of the Scale invariant of target frame between two frames, and Δ w and Δ h are in logarithm sky Between on width and high variation, X_currCurrent frame position x coordinate, X_prevFor previous frame position x coordinate, W_prevIt is wide for present frame, H_prevFor current vertical frame dimension, W_currIt is wide for next frame, W_prevFor next vertical frame dimension.

Further, in one embodiment of the invention, described that multiple actions are generated according to the prediction network and are obtained To corresponding reward, further comprise：Continue action by deep neural network generation, stop simultaneously update action, stopping and neglect It slightly acts and restarts；The corresponding reward of each action is obtained according to tracking effect.

Further, in one embodiment of the invention, wherein

Continue to act for described, uses I_t ^kAs input, by f_t-1 ^*As hidden layer feature：

l_t,k=l_t,k-1+ δ,

Wherein, l_t,kFor the position after adjustment, l_t,k-1For initial position, t is the time, and k is iterative steps, and δ is offset；

Stop simultaneously update action for described, stop iteration and updates clarification of objective and predict the parameter of network：

Wherein, f_t ^*It is characterized, ρ is smoothing factor, f_tIt is characterized,For last moment feature, θ_tFor network parameter, θ_t-1 For network parameter, μ is pace of learning,It is expected, Q (s, a, δ) is Q functions, and s is turntable, and a is action, and δ is offset；

For the stopping and ignore action, starts to update next frame and using the target signature of last moment and prediction The parameter of network；

It restarts for described, resampling initial block, wherein by the stochastical sampling around current object, To select the frame with highest q value as new initial block：

Wherein, l_t-1,0For initial position, stop is to stop, and update is update；

Also, the reward function for continuing action is：

Wherein, r_t,kFor current prize value, Δ IoU is the variable quantity of IoU, and ε is threshold value；

The stopping and update action and the reward function for stopping and ignoring action are：

Wherein, r_t,K_tFor reward value, K_tFor final iterative steps, g is IoU functions,For actual position, l_tFor carry-out bit It sets；

The reward function restarted is：

Further, in one embodiment of the invention, the Q value calculation formula for continuing action are：

Wherein, γ is coefficient of balance.

It is described to stop and update action, the stopping and ignore and the Q value calculation formula restarted are：

Q(s,a|δ_t,k)=r_t,K_t+γr_t+1,k_t+1+…。

In order to achieve the above objectives, another aspect of the present invention embodiment proposes under a kind of deeply learning guide based on connecting Continue mobile vision tracks of device, including：Pre-training module predicts network for pre-training；Generation module, for according to Prediction network generates multiple actions and is rewarded accordingly；Acquisition module, the Q acted for obtaining each of multiple actions Value, while updating the network of prediction and generation action.

The vision tracks of device based on continuous moving, can pass through root under the deeply learning guide of the embodiment of the present invention Multiple actions are generated according to pre-training prediction network and each of are rewarded, and obtain multiple actions the Q values acted accordingly, from And update the network of prediction and generation action simultaneously, by vision tracking problem be modeled as one it is continuous and the movement accumulated Problem has stronger robustness for tracking target appearance variation caused by complex background and deformation, delays to a certain extent It has solved due to large deformation and target is drifted about caused by fast moving.

In addition, the vision tracking dress based on continuous moving under deeply learning guide according to the above embodiment of the present invention Following additional technical characteristic can also be had by setting：

Further, in one embodiment of the invention, the generation module, further comprises：Generation unit is used for Continue action, stopping and update action by deep neural network generation, stop and ignore action and restart；It obtains Unit, for obtaining the corresponding reward of each action according to tracking effect.

Further, in one embodiment of the invention, wherein

l_t,k=l_t,k-1+ δ,

Wherein, l_t-1,0For initial position, stop is to stop, and update is update；

Also, the reward function for continuing action is：

The reward function restarted is：

Further, in one embodiment of the invention,

It is described continue action Q value calculation formula be：

Wherein, γ is coefficient of balance.

Q(s,a|δ_t,k)=r_t,K_t+γr_t+1,k_t+1+…。

The additional aspect of the present invention and advantage will be set forth in part in the description, and will partly become from the following description Obviously, or practice through the invention is recognized.

Description of the drawings

Above-mentioned and/or additional aspect and advantage of the invention will become from the following description of the accompanying drawings of embodiments Obviously and it is readily appreciated that, wherein：

Fig. 1 is according to the visual tracking method based on continuous moving under the deeply learning guide of the embodiment of the present invention Flow chart；

Fig. 2 is according to the vision track side based on continuous moving under the deeply learning guide of one embodiment of the invention The flow chart of method；

Fig. 3 is the schematic diagram that network is acted according to the optimizing evaluation network of one embodiment of the invention and generation；

Fig. 4 is according to the vision tracks of device based on continuous moving under the deeply learning guide of the embodiment of the present invention Structural schematic diagram.

Specific implementation mode

The embodiment of the present invention is described below in detail, examples of the embodiments are shown in the accompanying drawings, wherein from beginning to end Same or similar label indicates same or similar element or element with the same or similar functions.Below with reference to attached The embodiment of figure description is exemplary, it is intended to for explaining the present invention, and is not considered as limiting the invention.

It describes to be based on continuous moving under the deeply learning guide proposed according to embodiments of the present invention with reference to the accompanying drawings Visual tracking method and device, the deeply learning guide proposed according to embodiments of the present invention is described with reference to the accompanying drawings first Under the visual tracking method based on continuous moving.

Fig. 1 be the embodiment of the present invention deeply learning guide under the visual tracking method based on continuous moving flow Figure.

As shown in Figure 1, the visual tracking method based on continuous moving includes the following steps under the deeply learning guide：

In step S101, pre-training predicts network.

Wherein, in one embodiment of the invention, predict that the object function of network is：

It is understood that in conjunction with shown in Fig. 1 and Fig. 2, the initial block of t frames is given, depth is extracted to the position first Feature and current feature and clarification of objective are combined.Then it uses a prediction network and action to generate network to give birth to The location and shape of target frame are adjusted at four actions (continuing, stop and update, stop and ignore and restart).It is right In action " continuation ", the position of target frame is persistently adjusted；Action " is stopped and updated ", iteration and more fresh target are stopped Feature and the model parameter for predicting network；Action " is stopped and ignored ", the step of skipping update is executed；For action " restarting ", target may need resampling initial block with losing.Finally, network is evaluated using depth to estimate to work as The parameter of the Q values and update prediction network and generation action network of preceding action.

Specifically, the embodiment of the present invention can predict network, mesh of the embodiment of the present invention to prediction network with pre-training first Scalar functions do following definition：

Wherein, the Δ x in formula and Δ y indicates the conversion of the Scale invariant of target frame between two frames, and Δ w and Δ h are represented Width on log space and high variation.As shown in Figure 1, the prediction network of the embodiment of the present invention is carried using three-layer coil lamination Target and the feature of candidate region are taken, then feature, which is series connected, is input to two layers of full articulamentum generation estimated location and scale The parameter of variation.Therefore, by above-mentioned object function, the embodiment of the present invention can train a depth nerve net end to end Network is directly predicted to obtain the variation of location and shape.

In step s 102, multiple actions are generated according to prediction network and is rewarded accordingly.

Further, in one embodiment of the invention, multiple actions are generated according to prediction network and obtained corresponding Reward, further comprises：By deep neural network generation continue action, stop and update action, stop and ignore action with It restarts；The corresponding reward of each action is obtained according to tracking effect.

Further, in one embodiment of the invention, wherein

l_t,k=l_t,k-1+ δ,

For stopping and ignoring action, start to update next frame and using the target signature of last moment and prediction network Parameter；

For restarting, resampling initial block, wherein by the stochastical sampling around current object, with choosing The frame with highest q value is selected as new initial block：

Wherein, l_t-1,0For initial position, stop is to stop, and update is update；

Also, the reward function for continuing action is：

The reward function restarted is：

It is understood that the embodiment of the present invention can generate a series of actions and obtain corresponding according to prediction network Reward.The embodiment of the present invention can generate four actions using deep neural network：Continue, stop and update, stop and neglects Slightly and restart.For action " continuation ", the embodiment of the present invention can use I_t ^kAs input, by f_t-1 ^*As hidden layer spy Sign：

l_t,k=l_t,k-1+ δ,

Action " is stopped and updated ", the embodiment of the present invention can stop iteration and update clarification of objective and prediction The parameter of network：

Action " is stopped and ignored ", starts to update next frame and using the target signature of last moment and pre- survey grid The parameter of network.Action " is restarted ", resampling initial block of the embodiment of the present invention, by random around current object Sampling selects that there is highest q value to obtain frame as new initial block：

The embodiment of the present invention is also according to tracking effect to the corresponding reward of each action definition.For action " continuation ", prize Function is encouraged to be defined as：

Wherein, Δ_IoUSuch as following formula：

For action " stop and update " and action " stop and ignore ", reward function can be defined as：

Action " is restarted ", reward function can be defined as：

In step s 103, the Q values that each of multiple actions act are obtained, while updating the net of prediction and generation action Network.

Wherein, in one embodiment of the invention, continue action Q value calculation formula be：

Wherein, γ is coefficient of balance.

Q(s,a|δ_t,k)=r_t,K_t+γr_t+1,k_t+1+…。

It is understood that as shown in figure 3, the embodiment of the present invention can calculate the Q values and optimizing evaluation net of each action Network φ^-With the network θ of generation action, the embodiment of the present invention can evaluate network to predict the Q values of current action simultaneously using depth And the model parameter of update prediction network.For action " continuation ", Q is worth calculating as follows：

For other three actions " continuation ", " stop and update ", " stop and ignore and restart ", Q value meters It calculates as follows：

Q(s,a|δ_t,k)=r_t,K_t+γr_t+1,k_t+1+ ...,

Therefore, the formula that the optimization problem for evaluating network can be expressed as by the embodiment of the present invention：

Wherein, φ^-Indicate target network, it has same structure with φ but is in slow newer state.

The embodiment of the present invention is carried out the parameter of more New Appraisement network by following formula：

The optimization problem that generation can be acted network by the embodiment of the present invention is expressed as formula：

Wherein,

Finally, carry out the parameter θ of more newly-generated action network as follows：

Visual tracking method based on continuous moving under the deeply learning guide proposed according to embodiments of the present invention, can With by predicting that network generates multiple actions and rewarded accordingly, and obtains each action of multiple actions according to pre-training Q values, to update the network of prediction and generation action simultaneously, by vision tracking problem be modeled as one it is continuous and accumulate Movement problem, for caused by complex background and deformation track target appearance variation have stronger robustness, certain It is alleviated in degree due to large deformation and target is drifted about caused by fast moving.

It is based on continuous moving under the deeply learning guide proposed according to embodiments of the present invention referring next to attached drawing description Vision tracks of device.

Fig. 4 be the embodiment of the present invention deeply learning guide under the vision tracks of device based on continuous moving structure Schematic diagram.

As shown in figure 4, the vision tracks of device 10 based on continuous moving includes under the deeply learning guide：Pre-training Module 100, generation module 200 and acquisition module 300.

Wherein, pre-training module 100 predicts network for pre-training.Generation module 200 is used to be generated according to prediction network Multiple actions are simultaneously rewarded accordingly.Acquisition module 300 updates simultaneously for obtaining the Q values that each of multiple actions act The network of prediction and generation action.The device 10 of the embodiment of the present invention will can continuously and cumulatively adjust the target of object Frame, while the external appearance characteristic and model of target object are dynamically adjusted, greatly improve robustness.

Further, in one embodiment of the invention, predict that the object function of network is：

Further, in one embodiment of the invention, generation module 200 further comprises：Wherein, generation unit and Acquiring unit.Generation unit is used to continue action, stopping and update action by deep neural network generation, stops and ignore dynamic Make and restarts.Acquiring unit is used to obtain the corresponding reward of each action according to tracking effect.

Further, in one embodiment of the invention, wherein

l_t,k=l_t,k-1+ δ,

Wherein, l_t-1,0For initial position, stop is to stop, and update is update；

Also, the reward function for continuing action is：

The reward function restarted is：

Further, in one embodiment of the invention,

It is described continue action Q value calculation formula be：

Wherein, γ is coefficient of balance.

Q(s,a|δ_t,k)=r_t,K_t+γr_t+1,k_t+1+…。

It should be noted that the visual tracking method embodiment based on continuous moving under the aforementioned learning guide to deeply Explanation be also applied for the vision tracks of device based on continuous moving under the deeply learning guide of the embodiment, herein It repeats no more.

Vision tracks of device based on continuous moving under the deeply learning guide proposed according to embodiments of the present invention, can With by predicting that network generates multiple actions and rewarded accordingly, and obtains each action of multiple actions according to pre-training Q values, to update the network of prediction and generation action simultaneously, by vision tracking problem be modeled as one it is continuous and accumulate Movement problem, for caused by complex background and deformation track target appearance variation have stronger robustness, certain It is alleviated in degree due to large deformation and target is drifted about caused by fast moving.

In the description of the present invention, it is to be understood that, term "center", " longitudinal direction ", " transverse direction ", " length ", " width ", " thickness ", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom" "inner", "outside", " up time The orientation or positional relationship of the instructions such as needle ", " counterclockwise ", " axial direction ", " radial direction ", " circumferential direction " be orientation based on ... shown in the drawings or Position relationship is merely for convenience of description of the present invention and simplification of the description, and does not indicate or imply the indicated device or element must There must be specific orientation, with specific azimuth configuration and operation, therefore be not considered as limiting the invention.

In addition, term " first ", " second " are used for description purposes only, it is not understood to indicate or imply relative importance Or implicitly indicate the quantity of indicated technical characteristic.Define " first " as a result, the feature of " second " can be expressed or Implicitly include at least one this feature.In the description of the present invention, the meaning of " plurality " is at least two, such as two, three It is a etc., unless otherwise specifically defined.

In the present invention unless specifically defined or limited otherwise, term " installation ", " connected ", " connection ", " fixation " etc. Term shall be understood in a broad sense, for example, it may be being fixedly connected, may be a detachable connection, or integral；Can be that machinery connects It connects, can also be electrical connection；It can be directly connected, can also can be indirectly connected through an intermediary in two elements The interaction relationship of the connection in portion or two elements, unless otherwise restricted clearly.For those of ordinary skill in the art For, the specific meanings of the above terms in the present invention can be understood according to specific conditions.

In the present invention unless specifically defined or limited otherwise, fisrt feature can be with "above" or "below" second feature It is that the first and second features are in direct contact or the first and second features pass through intermediary mediate contact.Moreover, fisrt feature exists Second feature " on ", " top " and " above " but fisrt feature be directly above or diagonally above the second feature, or be merely representative of Fisrt feature level height is higher than second feature.Fisrt feature second feature " under ", " lower section " and " below " can be One feature is directly under or diagonally below the second feature, or is merely representative of fisrt feature level height and is less than second feature.

In the description of this specification, reference term " one embodiment ", " some embodiments ", " example ", " specifically show The description of example " or " some examples " etc. means specific features, structure, material or spy described in conjunction with this embodiment or example Point is included at least one embodiment or example of the invention.In the present specification, schematic expression of the above terms are not It must be directed to identical embodiment or example.Moreover, particular features, structures, materials, or characteristics described can be in office It can be combined in any suitable manner in one or more embodiments or example.In addition, without conflicting with each other, the skill of this field Art personnel can tie the feature of different embodiments or examples described in this specification and different embodiments or examples It closes and combines.

Although the embodiments of the present invention has been shown and described above, it is to be understood that above-described embodiment is example Property, it is not considered as limiting the invention, those skilled in the art within the scope of the invention can be to above-mentioned Embodiment is changed, changes, replacing and modification.

Claims

1. the visual tracking method based on continuous moving under a kind of deeply learning guide, which is characterized in that including following step Suddenly：

Pre-training predicts network；

Multiple actions are generated according to the prediction network and are rewarded accordingly；And

The Q values that each of multiple actions act are obtained, while updating the network of prediction and generation action.

2. the visual tracking method based on continuous moving under deeply learning guide according to claim 1, feature It is, the object function of the prediction network is：

Wherein, Δ x and Δ y is the conversion of the Scale invariant of target frame between two frames, and Δ w and Δ h are on log space Width and high variation, X_currCurrent frame position x coordinate, X_prevFor previous frame position x coordinate, W_prevIt is wide for present frame, H_prevFor Current vertical frame dimension, W_currIt is wide for next frame, W_prevFor next vertical frame dimension.

3. the visual tracking method based on continuous moving under deeply learning guide according to claim 1, feature It is, it is described that multiple actions are generated according to the prediction network and are rewarded accordingly, further comprise：

Continue action, stopping and update action by deep neural network generation, stop and ignore action and restart；

The corresponding reward of each action is obtained according to tracking effect.

4. the visual tracking method based on continuous moving under deeply learning guide according to claim 3, feature It is, wherein

l_t,k=l_t,k-1+ δ,

Wherein, f_t ^*It is characterized, ρ is smoothing factor, f_tIt is characterized,For last moment feature, θ_tFor network parameter, θ_t-1For net Network parameter, μ are pace of learning,It is expected, Q (s, a, δ) is Q functions, and s is turntable, and a is action, and δ is offset；

For the stopping and ignore action, starts to update next frame and using the target signature of last moment and prediction network Parameter；

It restarts for described, resampling initial block, wherein by the stochastical sampling around current object, with choosing The frame with highest q value is selected as new initial block：

Wherein, l_t-1,0For initial position, stop is to stop, and update is update；

Also, the reward function for continuing action is：

Wherein, r_t,K_tFor reward value, K_tFor final iterative steps, g is IoU functions,For actual position, l_tFor output position；

The reward function restarted is：

5. the visual tracking method based on continuous moving under deeply learning guide according to claim 1, feature It is,

It is described continue action Q value calculation formula be：

Wherein, γ is coefficient of balance.

Q(s,a|δ_t,k)=r_t, K_t+γr_t+1, k_t+1+…。

6. the vision tracks of device based on continuous moving under a kind of deeply learning guide, which is characterized in that including：

Pre-training module predicts network for pre-training；

Generation module, for generating multiple actions according to the prediction network and being rewarded accordingly；And

Acquisition module, the Q values acted for obtaining each of multiple actions, while updating the network of prediction and generation action.

7. the vision tracks of device based on continuous moving, feature under deeply learning guide according to claim 6 It is, the object function of the prediction network is：

8. the vision tracks of device based on continuous moving, feature under deeply learning guide according to claim 6 It is, the generation module further comprises：

Generation unit, for by deep neural network generation continue action, stop and update action, stop and ignore action with It restarts；

Acquiring unit, for obtaining the corresponding reward of each action according to tracking effect.

9. the vision tracks of device based on continuous moving, feature under deeply learning guide according to claim 8 It is, wherein

l_t,k=l_t,k-1+ δ,

Wherein, l_t-1,0For initial position, stop is to stop, and update is update；

Also, the reward function for continuing action is：

Wherein, r_t,K_tFor reward value, K_tFor final iterative steps, g is IoU functions,Actual position, l_tFor output position；

The reward function restarted is：

10. the vision tracks of device based on continuous moving, feature under deeply learning guide according to claim 6 It is,

It is described continue action Q value calculation formula be：

Wherein, γ is coefficient of balance.

Q(s,a|δ_t,k)=r_t,K_t+γr_t+1,k_t+1+…。