CN108537825A - A target tracking method based on a transfer-learning regression network - Google Patents
A target tracking method based on a transfer-learning regression network
- Publication number: CN108537825A (application CN201810250785.6A)
- Authority: CN (China)
- Prior art keywords: target, image block, image, network, abscissa
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/248 — Analysis of motion using feature-based methods involving reference images or patches
- G06T2207/10016 — Video; image sequence
- G06T2207/20021 — Dividing image into blocks, subimages or windows
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
Abstract
The present invention provides a target tracking method based on a transfer-learning regression network, in the technical field of computer vision. The target object to be tracked is selected and determined from an initial image; a block-prediction-based target-position regression network is constructed; a tracking-oriented training dataset is generated and the network is trained. For image input under real-time conditions, video images captured by a camera and stored in a memory block are extracted as the input images for tracking. For target positioning, the acquired image is fed into the position regression network; after a forward pass, the output layer yields 8 × 8 × 8 relative-position data. For the network update, the 8 × 8 × 8 relative positions between the 8 × 8 image blocks of the whole image and the target are computed from the obtained target position and, together with the current input image, form one group of training data.
Description
Technical field
The present invention relates to the fields of computer vision, computer graphics and imaging, machine intelligence, and systems technology.
Background technology
Visual target tracking is an important research subject in computer vision. Its main task is to acquire the target's continuous position, appearance, and motion, providing a basis for higher-level semantic analysis such as action recognition and scene understanding. Target tracking is widely applied in intelligent surveillance, human-computer interaction, and automatic control systems, and has strong practical value. Current target tracking methods fall mainly into classical methods and deep-learning methods.
Classical tracking methods are broadly divided into generative methods and discriminative methods. Generative methods assume the target can be expressed by some generating process or model, such as principal component analysis (PCA) or sparse coding; tracking is then treated as finding the most probable candidate within a region of interest. These methods aim to design an image representation that favors robust tracking. Discriminative methods, in contrast, treat tracking as a classification problem or a kind of continuous object detection whose task is to separate the target from the image background. Because such methods exploit target and background information simultaneously, they are currently the mainstream research direction. A discriminative method generally has two main steps: first, visual features that can discriminate target from background are selected to train a classifier and its decision rule; second, during tracking, the classifier evaluates each position in the field of view and determines the most probable target location. The target frame is then moved to that position and the process repeated, realizing tracking; this basic framework underlies tracking algorithms of many forms. Overall, the main advantages of classical trackers are running speed and low dependence on auxiliary data, but they must trade off tracking accuracy against real-time performance.
Deep learning has been a hot topic in machine learning in recent years. Owing to its powerful feature representation ability and continually developing datasets and hardware support, it has achieved striking success in many areas, such as speech recognition, image recognition, object detection, and video classification. Deep-learning-based tracking is also developing rapidly, but the shortage of prior knowledge and the real-time requirement in target tracking make it difficult for deep learning techniques, which need large amounts of training data and parameter computation, to be fully exploited here, leaving large room for exploration. Judging from current research results, deep-learning trackers mainly apply autoencoder networks and convolutional neural networks, following two main lines of research: one transfers a learned network and then fine-tunes it online; the other modifies the structure of a deep network to suit the requirements of tracking. The autoencoder network (AE) is a typical unsupervised deep learning network that, owing to its feature learning ability and noise resistance, was the first to be applied to tracking; it is relatively intuitive and moderate in size, an excellent unsupervised deep learning model that achieved good early results in tracking. Unlike the autoencoder, the convolutional neural network (CNN) is a supervised feed-forward network comprising repeated, alternating convolution, nonlinear transformation, and down-sampling operations, and it shows very powerful performance in pattern recognition, especially computer vision tasks. All in all, compared with classical methods, deep learning offers stronger feature representation, but in tracking the choice of training set, the selection and structural improvement of the network, the real-time performance of the algorithm, and the application of regression networks still require further study.
Summary of the invention
The object of the present invention is to provide a target tracking method based on a transfer-learning regression network, addressing the deep neural network's training-data problem and inaccurate target positioning during tracking.
The purpose of the invention is achieved through the following technical solution. The method uses a VGG-19 network and comprises the following steps:
(1) Target selection
The target object to be tracked is selected and determined from the initial image; the selection is either extracted automatically by a moving-object detection method or specified manually through human-computer interaction.
(2) Construction of the block-prediction-based target-position regression network
The target-position regression network based on block prediction consists of four parts: an image input layer, a transferred network for feature representation, a network layer of 4096 × 1 nodes, and a position output layer of 8 × 8 × 8 nodes. In the overall network, the input image, after size normalization to 224 × 224 pixels, serves as the input of the VGG-19 network; layer 23 of VGG-19 is fully connected to the 4096 × 1 network layer, i.e. layer 23 of VGG-19 provides the feature representation of the input image, and the 4096 × 1 layer is in turn fully connected to the 8 × 8 × 8 position output layer.
The 224 × 224 input image is divided into 8 × 8 = 64 image blocks of 28 × 28 pixels each. The position of each image block corresponds to the first two dimensions of the position output layer, while the 8 nodes of the third dimension express the relative position of the target predicted by the corresponding block; thus every input image yields 8 × 8 × 8 relative-position values after passing through the regression network.
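The block geometry above can be sketched as follows. This is an illustrative NumPy reconstruction (function and variable names are my own, not from the patent) showing only how a 224 × 224 input maps onto the 8 × 8 grid of 28 × 28 blocks, and how block (i, j) pairs with output nodes [i, j, :]:

```python
import numpy as np

def split_into_blocks(image: np.ndarray, grid: int = 8) -> np.ndarray:
    """Divide a 224x224 (grayscale, for simplicity) image into an 8x8 grid of blocks."""
    h, w = image.shape
    bh, bw = h // grid, w // grid                # 28 x 28 for a 224 x 224 input
    return image.reshape(grid, bh, grid, bw).swapaxes(1, 2)  # (8, 8, 28, 28)

def block_centers(size: int = 224, grid: int = 8) -> np.ndarray:
    """Centre coordinates of every block; block (i, j) pairs with output nodes [i, j, :]."""
    step = size // grid
    cs = np.arange(grid) * step + step // 2      # 14, 42, ..., 210
    xs, ys = np.meshgrid(cs, cs, indexing="ij")
    return np.stack([xs, ys], axis=-1)           # (8, 8, 2)

image = np.zeros((224, 224))
blocks = split_into_blocks(image)
print(blocks.shape)           # (8, 8, 28, 28)
print(block_centers()[0, 0])  # [14 14]
```

Each of the 64 blocks contributes 8 output nodes, giving the 8 × 8 × 8 position output layer.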
(3) Generation of the tracking-oriented training dataset
To train the position regression network, training data are obtained in two ways. On the one hand, for the first input frame, the image block corresponding to the target to be tracked is extracted; by artificial synthesis, the target block is placed at an arbitrary position in the first frame to generate a new image, and the region where the target block originally lay is filled with the mean value of the target block. The placement position of the target block is recorded, and the 8 × 8 × 8 relative positions between the 8 × 8 image blocks of the whole image and the target are computed; these position coordinates serve as the desired network output and, together with the image, form one group of training data. On the other hand, the extracted target block is first transformed, including translation, rotation, distortion, and occlusion, and then placed into the image as before to synthesize further training images. All these training data then compose the training dataset used later for network training.
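The synthesis step above (paste the target patch at a random position and mean-fill the vacated region) can be sketched as follows. This is a minimal NumPy illustration under assumed conventions; the function and variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_sample(image: np.ndarray, target_box):
    """Paste the target patch at a random location, fill the vacated region
    with the patch mean, and return the new image plus the new target box."""
    x0, y0, x1, y1 = target_box                   # top-left / bottom-right corners
    patch = image[y0:y1, x0:x1].copy()
    h, w = patch.shape
    out = image.copy()
    out[y0:y1, x0:x1] = patch.mean()              # mean-fill the old region
    nx = int(rng.integers(0, image.shape[1] - w + 1))  # new top-left corner
    ny = int(rng.integers(0, image.shape[0] - h + 1))
    out[ny:ny + h, nx:nx + w] = patch             # paste at the new position
    return out, (nx, ny, nx + w, ny + h)

img = np.arange(224 * 224, dtype=float).reshape(224, 224)
new_img, new_box = synthesize_sample(img, (14, 14, 42, 42))
print(new_img.shape, new_box)
```

The recorded new box is what the 8 × 8 × 8 relative-position labels are computed from.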
(4) Network training
During training, the training images are input one by one. The parameters of the VGG-19 part remain unchanged; the connection parameters between layer 23 of VGG-19, the network layer of 4096 × 1 nodes, and the position output layer of 8 × 8 × 8 nodes are trained using classical stochastic gradient descent (SGD).
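Since only the head of the network is trained, the per-sample SGD update reduces to gradient steps on the final fully connected mapping. The following toy NumPy sketch uses a stand-in 4096 → 512 linear layer (512 = 8 × 8 × 8) with a squared-error loss; the names, dimensions of the toy data, and the loss choice are my assumptions, as the patent does not specify a loss function:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the trainable head: features (4096) -> positions (8*8*8 = 512).
W = rng.normal(0, 0.01, (512, 4096))
b = np.zeros(512)

def sgd_step(feat, target, lr=1e-4):
    """One stochastic-gradient step on 0.5*||pred - target||^2, one sample at a time."""
    global W, b
    pred = W @ feat + b
    err = pred - target              # dLoss/dpred
    W -= lr * np.outer(err, feat)    # dLoss/dW = err feat^T
    b -= lr * err                    # dLoss/db = err
    return 0.5 * float(err @ err)

feat = rng.normal(size=4096)
target = rng.normal(size=512)
losses = [sgd_step(feat, target) for _ in range(50)]
print(losses[0], losses[-1])         # loss shrinks over repeated steps
```

In the patent's setting, `feat` would be the layer-23 VGG-19 features and `target` the 8 × 8 × 8 relative-position labels.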
(5) Image input
Under real-time conditions, video images captured by the camera and stored in the memory block are extracted as the input images for tracking. In offline processing, the captured video file is decomposed into an image sequence of frames, which are extracted one by one in temporal order as input images. If the input image is empty, the whole procedure stops.
(6) Target positioning
The image acquired in (5) is input to the position regression network; after the forward pass, the output layer yields 8 × 8 × 8 relative-position data, and the target is located by processing these 8 × 8 × 8 node values. Let A(i,j) denote the (i, j)-th image block of the current input image, and let xc(i,j) and yc(i,j) denote the abscissa and ordinate of its centre point. The 8 node values of the output layer corresponding to block A(i,j) are, respectively: the difference between the top-left corner abscissa of the target frame and the block-centre abscissa; the difference between the top-left ordinate and the block-centre ordinate; the difference between the top-right abscissa and the block-centre abscissa; the difference between the top-right ordinate and the block-centre ordinate; the difference between the bottom-left abscissa and the block-centre abscissa; the difference between the bottom-left ordinate and the block-centre ordinate; the difference between the bottom-right abscissa and the block-centre abscissa; and the difference between the bottom-right ordinate and the block-centre ordinate.
Let the target position be expressed by the abscissas and ordinates of the top-left, top-right, bottom-left, and bottom-right corners of the target frame. Each block A(i,j) then predicts a target position by adding its centre coordinates to its 8 output values. Since every image block predicts its own target position, and the corner coordinates predicted by different blocks generally differ, the four corner coordinates predicted by all blocks must be analysed statistically; here a coordinate-accumulation method determines the final coordinate of each corner and thereby locates the whole target. Specifically, four accumulation matrices are maintained, for the top-left, top-right, bottom-left, and bottom-right corners respectively, each entry holding that matrix's value at coordinate (a, b), with 0 ≤ a, b ≤ 224 and every entry initialized to 0. For each image block A(i,j), an accumulation operation is performed on the four matrices at the block's predicted corner coordinates. Finally, the coordinate of the element with the maximum value in each matrix is taken as the corresponding corner of the target frame, and target positioning is complete.
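The coordinate-accumulation decoding described above can be sketched as follows. This is an illustrative NumPy reconstruction (the names, and rounding the votes to integer pixel coordinates, are my assumptions): every block votes for its predicted corner coordinates, and the argmax of each accumulator matrix gives the final corner:

```python
import numpy as np

def decode_position(preds: np.ndarray, size: int = 224, grid: int = 8):
    """Vote-based decoding: each block adds 1 at each predicted corner
    coordinate; the argmax of each accumulator gives that corner's position."""
    acc = np.zeros((4, size + 1, size + 1))       # TL, TR, BL, BR accumulators
    step = size // grid
    cs = np.arange(grid) * step + step // 2       # block-centre coordinates
    for i in range(grid):
        for j in range(grid):
            cx, cy = cs[i], cs[j]
            p = preds[i, j]                       # 8 offsets: (dx, dy) per corner
            for k in range(4):
                x = int(round(cx + p[2 * k]))
                y = int(round(cy + p[2 * k + 1]))
                if 0 <= x <= size and 0 <= y <= size:
                    acc[k, x, y] += 1
    corners = []
    for k in range(4):
        idx = int(np.argmax(acc[k]))
        corners.append((idx // (size + 1), idx % (size + 1)))
    return corners  # [(x_tl, y_tl), (x_tr, y_tr), (x_bl, y_bl), (x_br, y_br)]

# Consistent predictions: every block points at the box with corners (50, 60)-(120, 160).
box = np.array([50, 60, 120, 60, 50, 160, 120, 160], dtype=float)  # TL, TR, BL, BR
cs = np.arange(8) * 28 + 14
preds = np.empty((8, 8, 8))
for i in range(8):
    for j in range(8):
        preds[i, j] = box - np.array([cs[i], cs[j]] * 4)
print(decode_position(preds))  # [(50, 60), (120, 60), (50, 160), (120, 160)]
```

With noisy per-block predictions, the argmax acts as a robust vote over the 64 block hypotheses.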
(7) Network update
From the target position obtained in (6), the 8 × 8 × 8 relative positions between the 8 × 8 image blocks of the whole image and the target are computed and, together with the current input image, form one group of training data. One round of network training is then carried out, realizing a fine-tuning update of the network, and the procedure branches back to (5).
The relative-position values comprise: the difference between the target frame's top-left corner abscissa and the image-block centre abscissa; the difference between the top-left ordinate and the centre ordinate; the difference between the top-right abscissa and the centre abscissa; the difference between the top-right ordinate and the centre ordinate; the difference between the bottom-left abscissa and the centre abscissa; the difference between the bottom-left ordinate and the centre ordinate; the difference between the bottom-right abscissa and the centre abscissa; and the difference between the bottom-right ordinate and the centre ordinate.
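The relative-position labels defined above (four corners, each as an x-offset then a y-offset from the block centre) can be computed as follows; a small NumPy sketch with hypothetical names:

```python
import numpy as np

def relative_positions(box, size: int = 224, grid: int = 8) -> np.ndarray:
    """(8, 8, 8) labels: offsets of the four target-frame corners
    (TL, TR, BL, BR, each as x then y) from every block centre."""
    x0, y0, x1, y1 = box                                   # top-left / bottom-right
    corners = np.array([x0, y0, x1, y0, x0, y1, x1, y1], dtype=float)
    step = size // grid
    cs = np.arange(grid) * step + step // 2                # block-centre coordinates
    labels = np.empty((grid, grid, 8))
    for i in range(grid):
        for j in range(grid):
            labels[i, j] = corners - np.array([cs[i], cs[j]] * 4)
    return labels

lab = relative_positions((50, 60, 120, 160))
print(lab.shape)        # (8, 8, 8)
print(lab[0, 0, :2])    # [36. 46.]  i.e. (50 - 14, 60 - 14)
```

These are exactly the desired outputs used as training targets in steps (3) and (7).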
Advantages and positive effects: the method first builds a position regression network composed of an image input layer, a transferred network for feature representation, a network layer of 4096 × 1 nodes, and a position output layer of 8 × 8 × 8 nodes. The image input layer applies unified preprocessing, normalizing each image to 224 × 224 pixels; the transferred network is the pre-trained VGG-19, whose layer 23 serves as the feature representation layer and is fully connected to the 4096 × 1 network layer, which in turn is fully connected to the 8 × 8 × 8 position output layer. To train the regression network effectively, various transformations are applied to the target and the image to synthesize a corresponding training dataset, and the network is trained with classical stochastic gradient descent. After a forward pass through the regression network, each input image yields a prediction of the target position from every image block of the whole image, comprising 8 × 8 × 8 relative positions, i.e. the relative coordinates of the four corners of the target frame. The target can then be located by the coordinate-accumulation method, realizing tracking. In addition, after each positioning step the network is fine-tuned and updated according to the currently determined target position, giving it a certain synchronous adaptation ability. By exploiting the powerful feature representation of deep learning, the invention can handle complex tracking scenes and achieve accurate target tracking, while the regression-based approach avoids a large number of location searches, greatly increasing positioning speed and enabling real-time tracking. Moreover, the method is not limited to single-target tracking: with corresponding improvements to the network (e.g. at the output end), it can be extended to multi-target tracking.
Description of the drawings
Fig. 1 is the structure diagram of the invention.
Fig. 2 is the flow chart of the invention.
Specific implementation
The method can be used in various target tracking applications, such as intelligent video analysis, automatic human-computer interaction, traffic video surveillance, vehicle driving, biological-population analysis, wild-animal motion analysis, moving-object segmentation at crossings, and fluid-surface velocimetry.
Taking intelligent video analysis as an example: it includes many important automatic analysis tasks, such as behaviour analysis, abnormality alarms, and video compression, and the basis of this work is stable target tracking, which the proposed method can realize. Specifically, the transfer-learning-based position regression network is first established, as shown in Fig. 1; various transformations are then applied to the target and the image to synthesize the corresponding training dataset, and the network is trained with classical stochastic gradient descent, after which it has the ability to locate the target. During tracking, the regression network performs a forward pass on the input image and outputs the corresponding relative target positions; from this information the coordinate-accumulation method statistically analyses and determines the target position, completing positioning and realizing tracking. In addition, after each positioning step the network is fine-tuned and updated according to the currently determined target position, giving it a certain synchronous adaptation ability. By exploiting the powerful feature representation of deep learning, the invention can handle complex tracking scenes and achieve accurate target tracking, while the regression-based approach avoids a large number of location searches, greatly increasing positioning speed and enabling real-time tracking. Moreover, the method can be extended from single-target to multi-target tracking through corresponding improvements to the network (e.g. at the output end).
The method of the present invention can be programmed in any computer programming language (such as C), and tracking-system software based on this method can realize real-time target tracking on any PC or embedded system.
Claims (2)
1. a kind of method for tracking target based on transfer learning Recurrent networks, this method is including the use of VGG-19 networks, feature
It is:
(1) Object selection
Select and determine the target object to be tracked from initial pictures, Object selection process by moving target detecting method from
Dynamic extraction, or be manually specified by man-machine interaction method;
(2) the target location Recurrent networks structure based on block prediction
Target location Recurrent networks based on block prediction are used for the migration network of feature representation, one by image input layer, one
A includes that four parts of network layer and a position output layer comprising 8 × 8 × 8 nodes of 4096 × 1 nodes are constituted;
In whole network, input picture is after the dimension normalization of 224 × 224 pixel sizes as the input number of VGG-19 networks
According to the 23rd layer of the VGG-19 networks is connect entirely with the network layer of 4096 × 1 nodes, that is, uses the 23rd layer of VGG-19
To input picture carry out feature representation, and the network layer of 4096 × 1 nodes again with the position output layer of 8 × 8 × 8 nodes into
The full connection of row;
The input picture of 224 × 224 sizes is divided into 8 × 8=64 image block, each tile size is 28 × 28 pictures
The position of element, each image block is corresponding with the node location of preceding bidimensional of position output layer, and the 8 of the third dimension of position output layer
A node then indicates the relative position for the target that corresponding image block is predicted, thus every input picture is by returning net
8 × 8 × 8 relative position values will be obtained after network;
(3) training dataset towards tracking generates
It sets Recurrent networks in order to align and is trained, obtain training data by two aspects here:On the one hand for
First frame input picture extracts the image block corresponding to it, according to the target to be tracked then by artificial synthesized
Mode, target image block is positioned over any position of first frame image and generates new image, where former target image block
Region then filled up with the mean value of target image block, while record target image block placement position and calculate by whole picture figure
As 8 × 88 × 8 × 8 relative positions between image block and target marked off, these position coordinate datas are as network
Desired output, they collectively form one group of training data with image;On the other hand, then it is the target image block that will first extract
It is converted, including translation, rotation, the operations such as distorts and block, be then positioned in image according still further to method as before
And compound training image;All these training datas then composing training data set is used for network training later;
(4) network training
In network training process, the image for training is carried out by the way of inputting one by one, the parameter of VGG-19 network portions
It remains unchanged, the 23rd layer of VGG-19 networks, include the position output layers of 8 × 8 × 8 nodes, and includes 4096 × 1 nodes
Network layer between Connecting quantity, be trained using classical stochastic gradient descent method (SGD);
(5) image inputs
Under real-time disposition, extraction acquires by camera and is stored in the video image of memory block, as to carry out with
The input picture of track;In processed offline, the video file acquired is decomposed into the image sequence of multiple frame compositions, is pressed
According to time sequencing, frame image is extracted one by one as input picture;If input picture is sky, whole flow process stops;
(6) target positions
(5) acquisition image is input in the Recurrent networks of position, after the processing of network forward direction, network output layer will obtain 8 × 8 ×
8 station-keeping datas determine target by the calculation processing of these 8 × 8 × 8 node datas to output layer
Position;If Ai,jIndicate (i, j) a image block of current input image,Indicate image block Ai,jThe abscissa of central point,
Indicate image block Ai,jThe ordinate of central point, image block Ai,j8 nodal values in corresponding Recurrent networks output layer are respectively They are respectively the upper left corner abscissa and figure of target frame
As the difference of block central point abscissa, the difference of the upper left corner ordinate and image block central point ordinate of target frame, target frame
Upper right corner abscissa and image block central point abscissa difference, the upper right corner ordinate of target frame is vertical with image block central point
The difference of coordinate, the difference of the lower left corner abscissa and image block central point abscissa of target frame, the lower left corner of target frame is vertical to be sat
The difference of mark and image block central point ordinate, the difference of the lower right corner abscissa and image block central point abscissa of target frame,
The difference of the lower right corner ordinate and image block central point ordinate of target frame;
If the target position is expressed as (x1, y1, x2, y2, x3, y3, x4, y4), where (x1, y1), (x2, y2), (x3, y3) and (x4, y4) denote the abscissa and ordinate of the upper-left, upper-right, lower-left and lower-right corners of the target frame respectively, then each image block Ai,j yields its own predicted target position: adding its 8 output values to the corresponding block-center coordinates gives the predicted coordinates of the four corners. Since the corner coordinates predicted by different image blocks are typically different, the four corner predictions of all image blocks must be statistically analyzed; here a coordinate-accumulation method is used to determine the final coordinate of each corner and thereby locate the entire target. Specifically, let M1, M2, M3 and M4 denote the coordinate-accumulation matrices for the upper-left, upper-right, lower-left and lower-right corners respectively, where Mk(a, b) is the value of matrix Mk at position (a, b), with 0 ≤ a, b ≤ 224; every element of these matrices is initialized to 0. For each image block Ai,j, the element of each matrix at that block's predicted coordinate of the corresponding corner is incremented; thus every image block performs one accumulation operation on the four matrices.
Finally, the coordinates of the element with the maximum value in each matrix are taken as the coordinates of the corresponding corner of the target frame; that is, the upper-left, upper-right, lower-left and lower-right corner coordinates of the target frame are the abscissa and ordinate of the maximum-valued element of the respective accumulation matrix, and target localization is complete;
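The coordinate-accumulation voting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the vote-by-one increment, and the rounding of predicted coordinates to integer matrix indices are all assumptions; only the 0–224 coordinate range and the argmax rule come from the text.

```python
import numpy as np

IMG = 224  # accumulation matrices cover coordinates 0..224, per the claim


def vote_corners(predicted_corners):
    """predicted_corners: iterable of (x1, y1, ..., x4, y4) tuples, one per
    image block, ordered upper-left, upper-right, lower-left, lower-right.
    Returns the four voted corner coordinates of the target frame."""
    mats = [np.zeros((IMG + 1, IMG + 1)) for _ in range(4)]  # M1..M4
    for corners in predicted_corners:
        for k in range(4):
            x, y = corners[2 * k], corners[2 * k + 1]
            # Only votes that fall inside the coordinate range are counted.
            if 0 <= x <= IMG and 0 <= y <= IMG:
                mats[k][int(round(x)), int(round(y))] += 1
    result = []
    for m in mats:
        # Coordinate of the maximum-valued element becomes the final corner.
        a, b = np.unravel_index(np.argmax(m), m.shape)
        result.append((int(a), int(b)))
    return result
```

With 64 blocks each casting one vote per corner, outlier predictions from blocks far off the target are simply outvoted by the consistent majority.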
(7) Network update
According to the target position obtained in (6), the relative positions between the target and each of the 8 × 8 image blocks into which the entire image is partitioned are computed; together with the current input image these constitute one group of training data, on which one training iteration is performed to fine-tune the network; the process then returns to (5).
2. The target tracking method based on a transfer-learning regression network according to claim 1, characterized in that: the relative position values comprise the difference between the upper-left-corner abscissa of the target frame and the image-block-center abscissa, the difference between the upper-left-corner ordinate of the target frame and the image-block-center ordinate, the difference between the upper-right-corner abscissa of the target frame and the image-block-center abscissa, the difference between the upper-right-corner ordinate of the target frame and the image-block-center ordinate, the difference between the lower-left-corner abscissa of the target frame and the image-block-center abscissa, the difference between the lower-left-corner ordinate of the target frame and the image-block-center ordinate, the difference between the lower-right-corner abscissa of the target frame and the image-block-center abscissa, and the difference between the lower-right-corner ordinate of the target frame and the image-block-center ordinate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810250785.6A CN108537825B (en) | 2018-03-26 | 2018-03-26 | Target tracking method based on transfer learning regression network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810250785.6A CN108537825B (en) | 2018-03-26 | 2018-03-26 | Target tracking method based on transfer learning regression network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108537825A true CN108537825A (en) | 2018-09-14 |
CN108537825B CN108537825B (en) | 2021-08-17 |
Family
ID=63484603
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810250785.6A Expired - Fee Related CN108537825B (en) | 2018-03-26 | 2018-03-26 | Target tracking method based on transfer learning regression network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108537825B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109493370A (en) * | 2018-10-12 | 2019-03-19 | Southwest Jiaotong University | Target tracking method based on spatial offset learning
CN110162475A (en) * | 2019-05-27 | 2019-08-23 | Zhejiang University of Technology | Software defect prediction method based on deep transfer learning
CN111127510A (en) * | 2018-11-01 | 2020-05-08 | Hangzhou Hikvision Digital Technology Co., Ltd. | Target object position prediction method and device
CN113192062A (en) * | 2021-05-25 | 2021-07-30 | Hubei University of Technology | Self-supervised segmentation method for arterial plaque ultrasound images based on image inpainting
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103310466A (en) * | 2013-06-28 | 2013-09-18 | Anke Smart City Technology (China) Co., Ltd. | Single-target tracking method and implementation device thereof
US20160275339A1 (en) * | 2014-01-13 | 2016-09-22 | Carnegie Mellon University | System and Method for Detecting and Tracking Facial Features In Images
CN107146237A (en) * | 2017-04-24 | 2017-09-08 | Southwest Jiaotong University | Target tracking method based on online state learning and estimation
CN107452023A (en) * | 2017-07-21 | 2017-12-08 | Shanghai Jiao Tong University | Single-target tracking method and system based on online learning of convolutional neural networks
US20170372174A1 (en) * | 2016-06-28 | 2017-12-28 | Conduent Business Services, Llc | System and method for expanding and training convolutional neural networks for large size input images |
Non-Patent Citations (4)
Title |
---|
SHUNLI ZHANG et al.: "Object tracking with adaptive elastic net regression", 2017 IEEE International Conference on Image Processing (ICIP) *
LU HUCHUAN et al.: "A survey of object tracking algorithms", Pattern Recognition and Artificial Intelligence *
QUAN WEI: "Visual object tracking method with online learning and multiple detections", Acta Electronica Sinica *
LI YUBING: "Research on visual tracking algorithms based on deep networks", China Master's Theses Full-text Database, Information Science and Technology *
Also Published As
Publication number | Publication date |
---|---|
CN108537825B (en) | 2021-08-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110210551B (en) | Visual target tracking method based on adaptive subject sensitivity | |
CN110059558B (en) | Orchard obstacle real-time detection method based on improved SSD network | |
CN109711262B (en) | Intelligent excavator pedestrian detection method based on deep convolutional neural network | |
CN108171141B (en) | Attention model-based cascaded multi-mode fusion video target tracking method | |
CN108537825A (en) | Target tracking method based on transfer learning regression network | |
CN110660082A (en) | Target tracking method based on graph convolution and trajectory convolution network learning | |
CN107146237B (en) | Target tracking method based on online state learning and estimation | |
CN111626128A (en) | Improved YOLOv 3-based pedestrian detection method in orchard environment | |
CN109766873B (en) | Pedestrian re-identification method based on hybrid deformable convolution | |
CN110659664B (en) | SSD-based high-precision small object identification method | |
CN112818925B (en) | Urban building and crown identification method | |
CN113240691A (en) | Medical image segmentation method based on U-shaped network | |
Ren et al. | A novel squeeze YOLO-based real-time people counting approach | |
CN109993770A (en) | Target tracking method with adaptive spatio-temporal learning and state recognition | |
WO2023030182A1 (en) | Image generation method and apparatus | |
CN105243154A (en) | Remote sensing image retrieval method and system based on salient point features and sparse auto-encoding | |
CN110334584B (en) | Gesture recognition method based on regional full convolution network | |
CN115115859A (en) | Long linear engineering construction progress intelligent identification and analysis method based on unmanned aerial vehicle aerial photography | |
CN106599810A (en) | Head pose estimation method based on stacked auto-encoding | |
CN116402851A (en) | Infrared dim target tracking method under complex background | |
Chun-Lei et al. | Intelligent detection for tunnel shotcrete spray using deep learning and LiDAR | |
CN109493370A (en) | Target tracking method based on spatial offset learning | |
CN106530330A (en) | Low-rank sparse-based video target tracking method | |
CN109272036A (en) | Random fern target tracking method based on deep residual network | |
CN111368637B (en) | Transfer robot target identification method based on multi-mask convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20210817 |