CN109493370A - Target tracking method based on spatial offset learning - Google Patents
Target tracking method based on spatial offset learning
- Publication number
- CN109493370A (application CN201811186951.7A)
- Authority
- CN
- China
- Prior art keywords
- target
- spatial offset
- image block
- image
- particle
- Prior art date
- 2018-10-12
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/277—Analysis of motion involving stochastic approaches, e.g. using Kalman filters
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/20092—Interactive image processing based on input by user
- G06T2207/20104—Interactive definition of region of interest [ROI]
Abstract
The present invention provides a target tracking method based on spatial offset learning, relating to the technical field of computer vision. The target object to be tracked is selected and determined, the selection being either extracted automatically by a moving-target detection method or specified manually. The spatial offset learning network consists of four parts: image data extraction, a deep neural network, a multi-layer perceptron (MLP), and the spatial offset output. For real-time processing, video images captured by a camera and stored in a memory buffer are extracted as the input images to be tracked; for offline processing, the captured video file is decomposed into an image sequence of individual frames. Short-term tracking uses a particle filter, in which each particle represents a possible target image block; the corresponding target region of interest (ROI) serves as the online training set of the spatial offset learning network, which is trained online with stochastic gradient descent (SGD) to update the network parameters. Finally, the target is located and updated.
Description
Technical field
The present invention relates to the technical fields of computer vision, graphics and image processing, pattern recognition, and machine learning.
Background art
Visual target tracking is an important research topic in the field of computer vision. Its main task is to obtain information such as the continuous position, appearance, and motion of a target, which in turn provides the basis for further semantic-level analysis (such as behavior recognition and scene understanding). Target tracking research is widely applied in fields such as intelligent surveillance, human-computer interaction, and automatic control systems, and has strong practical value. At present, target tracking methods mainly comprise classical tracking methods and deep learning tracking methods.
Classical target tracking methods are mainly divided into two classes: generative methods and discriminative methods. Generative methods assume that the target can be expressed by some generating process or model, such as principal component analysis (PCA) or sparse coding; the tracking problem is then regarded as finding the most probable candidate within the region of interest. These methods aim to design an image representation that favors robust target tracking. For motion modeling, they are usually based on hypothesis and verification; typical methods include Kalman filtering, mean shift, and the particle filter (PF). Among these, particle filtering provides a highly effective means of solving nonlinear and non-Gaussian problems and is robust to partial occlusion and background interference. Different from generative methods, discriminative methods regard tracking as a classification or continuous object detection problem, whose task is to distinguish the target from the image background. Such methods exploit target and background information simultaneously and are currently the main line of research. Discriminative methods generally comprise two main steps: the first step is to obtain a classifier and its decision rule by selecting visual features capable of discriminating target from background; the second step is to use this classifier to evaluate each position in the field of view during tracking and determine the most probable target position. The target box is then moved to that position and the process is repeated, thereby realizing tracking; this framework is used to design tracking algorithms of various forms. Overall, the main advantages of classical tracking methods are their running speed and their limited dependence on auxiliary data, while they must also trade off accuracy against real-time performance.
Deep learning has been a hot topic in machine learning research in recent years. Owing to its powerful feature representation ability and the continuous development of datasets and hardware support, deep learning has achieved remarkable success in many areas, such as speech recognition, image recognition, object detection, and video classification. Research on deep learning for target tracking has also developed rapidly, but because of the lack of prior knowledge and the real-time requirements of target tracking, deep learning techniques, which require large amounts of training data and parameter computation, are difficult to put to full use in this respect, leaving large room for exploration. Judging from current research results, deep learning trackers mainly employ auto-encoder networks and convolutional neural networks, and there are two main lines of research: one performs transfer learning on a network and then fine-tunes it online, and the other transforms the structure of the deep network to meet the requirements of tracking. The auto-encoder (AE) network is a typical unsupervised deep learning network that was first applied to target tracking because of its feature learning ability and noise resistance. In general, auto-encoder networks are relatively intuitive and moderate in size; they are excellent unsupervised deep learning models and were the first to be applied in tracking with good results. Different from auto-encoder networks, convolutional neural networks (CNN) are supervised feed-forward neural networks comprising multiple alternating cycles of convolution, nonlinear transformation, and down-sampling, and they have shown very powerful performance in pattern recognition, especially in computer vision tasks. Overall, compared with classical methods, deep learning has more powerful feature representation ability, but the selection of training sets, the choice and structural improvement of networks, the real-time performance of algorithms, and the application of recurrent neural networks in tracking still require further research.
Therefore, in view of the high robustness of the particle filter (PF) and the powerful feature representation ability of deep neural networks, the present invention proposes a target tracking method based on spatial offset learning. The method uses the deep neural network GoogleNet (with its 152nd layer as the feature representation layer) to perform feature representation on the input image, takes the target region of interest (ROI) in a frame image and an image block selected within that region as the input data of the spatial offset learning network, and then learns, through a multi-layer perceptron (MLP), the spatial offset values between the selected image block and the target. During tracking, each particle of the particle filter is taken as an image block selected within the target ROI; together with the target ROI, the spatial offset value corresponding to each particle is obtained after forward processing of the network, its corresponding predicted position is then calculated, and finally the position predicted most often is taken as the new target position, thereby locating the target and realizing tracking. Because the method on the one hand uses a deep neural network for feature representation and on the other hand takes the image data of both target and background as network input and learns spatial offset values, the target localization process is more accurate; moreover, through effective combination with the particle filter during tracking, it makes full use of the high robustness of the particle filter and simplifies the learning process of the network, so that long-term, real-time, and stable target tracking can be realized. In addition, the method of the present invention can be used not only for single-target tracking but also extended to multi-target tracking by adding and adjusting sample labels.
Summary of the invention
The object of the present invention is to provide a target tracking method based on spatial offset learning that can effectively solve the technical problem of long-term, real-time, and stable tracking of general target objects in unconstrained environments.
The object of the present invention is achieved through the following technical solution: a target tracking method based on spatial offset learning, comprising the following steps:
Step 1: target selection:
The target object to be tracked is selected and determined from the initial image. The target selection process is either performed automatically by a moving-target detection method or specified manually through human-computer interaction.
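The patent does not prescribe a particular detector or interaction tool for this step. The following minimal Python sketch (OpenCV assumed) shows one way each option could be realized; the helper name select_target and its parameters are hypothetical.

```python
# Illustrative sketch only (OpenCV assumed); neither the detector nor the UI tool
# is mandated by the patent, and select_target is a hypothetical helper name.
import cv2

def select_target(frames, manual=True):
    """Return an initial target box (x, y, w, h) from the first frames of the video."""
    if manual:
        # Human-computer interaction: the operator draws the target box.
        return cv2.selectROI("select target", frames[0], showCrosshair=True)
    # Automatic alternative: a simple moving-target detector based on background
    # subtraction (one possible choice of moving-target detection method).
    subtractor = cv2.createBackgroundSubtractorMOG2()
    mask = None
    for frame in frames:              # the subtractor needs a few frames to settle
        mask = subtractor.apply(frame)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        raise RuntimeError("no moving target found")
    return cv2.boundingRect(max(contours, key=cv2.contourArea))
```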
Step 2: construction and initialization of the spatial offset learning network:
The spatial offset learning network consists of four parts: image data extraction, a deep neural network, a multi-layer perceptron (MLP), and the spatial offset output.
The image data extraction part extracts two kinds of image blocks from the original image: one is the target region of interest (ROI), i.e. an image block centered on the target and 9 times the target size; the other is an image block of the same size as the target, selected at random within the target ROI. For the deep neural network part, the publicly available pre-trained network GoogleNet is used to perform feature representation on the images. This network has 154 layers in total and is a deep neural network trained on the large-scale dataset ImageNet, which contains millions of training images; input images are scale-normalized to 224 × 224 pixels before being fed to the GoogleNet network. The 152nd layer of GoogleNet is used as the feature representation layer and outputs 1024 values. The target ROI and the image block selected within it are taken as the image inputs of GoogleNet; after forward processing of GoogleNet, two feature vectors of 1024 values each are output, and these two feature vectors are concatenated as the input of the MLP part. The MLP comprises three fully connected layers: the first layer has 2048 nodes, the second layer 1024 nodes, and the third layer 512 nodes; the last layer of the MLP is connected to the spatial offset output part. The spatial offset output part contains 4 values: the difference Dxl between the abscissa of the upper-left corner of the selected image block and that of the target, the difference Dyl between the ordinate of the upper-left corner of the selected image block and that of the target, the difference Dxr between the abscissa of the lower-right corner of the selected image block and that of the target, and the difference Dyr between the ordinate of the lower-right corner of the selected image block and that of the target.
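The structure described above can be sketched as follows. This is an illustrative PyTorch sketch, not the patent's implementation: torchvision's pretrained GoogLeNet stands in for the 154-layer GoogleNet, its 1024-dimensional pooled feature is used in place of the "152nd layer" output, and the MLP layer sizes (2048, 1024, 512) and the 4-value offset output follow the text; the class name SpatialOffsetNet is hypothetical.

```python
# Illustrative PyTorch sketch of the network described above (not the patent's code).
# torchvision's pretrained GoogLeNet is assumed as the feature extractor; its
# 1024-dimensional pooled feature stands in for the "152nd layer" output.
import torch
import torch.nn as nn
import torchvision.models as models

class SpatialOffsetNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()              # expose the 1024-value feature layer
        for p in backbone.parameters():          # the pre-trained network stays fixed;
            p.requires_grad_(False)              # only the MLP is learned
        self.backbone = backbone
        self.mlp = nn.Sequential(                # three fully connected layers
            nn.Linear(2048, 2048), nn.ReLU(),    # first layer: 2048 nodes
            nn.Linear(2048, 1024), nn.ReLU(),    # second layer: 1024 nodes
            nn.Linear(1024, 512), nn.ReLU(),     # third layer: 512 nodes
            nn.Linear(512, 4),                   # spatial offset output: Dxl, Dyl, Dxr, Dyr
        )

    def forward(self, roi, block):
        # roi, block: batches of 224 x 224 scale-normalized image blocks
        features = torch.cat([self.backbone(roi), self.backbone(block)], dim=1)
        return self.mlp(features)                # two 1024-value features, concatenated
```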
The target ROI and an image block selected within it are input to the spatial offset learning network as a pair; for the same frame image, the target ROI is fixed while multiple image blocks are selected within it. 1000 image blocks are selected at random within the target ROI, and the spatial offset values between each selected image block and the target, i.e. the coordinate differences Dxl, Dyl, Dxr, Dyr, are recorded. The target ROI obtained by centering on the target image block determined in step 1, together with the image blocks selected within it, generates the initial training set; the spatial offset learning network is trained offline with stochastic gradient descent (SGD), thereby determining the parameters of the multi-layer perceptron and completing the initialization of the spatial offset learning network.
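A sketch of the initial training-set generation and offline SGD training, continuing the SpatialOffsetNet sketch above. The box format (x1, y1, x2, y2), the crop helper, the regression loss, and the hyper-parameters (learning rate, epochs, batch size) are illustrative assumptions; only the MLP parameters are optimized, matching the statement that offline training determines the parameters of the multi-layer perceptron.

```python
# Sketch of training-set generation and offline SGD training (assumed helpers and
# hyper-parameters); boxes are (x1, y1, x2, y2) and boundary clamping is omitted.
import random
import torch
import torch.nn as nn
import torch.optim as optim

def crop_and_normalize(frame, box):
    """Crop box from an HxWx3 uint8 frame and scale-normalize to 3x224x224 float."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    patch = torch.from_numpy(frame[y1:y2, x1:x2, :]).permute(2, 0, 1).float() / 255.0
    return torch.nn.functional.interpolate(patch[None], size=(224, 224))[0]

def make_training_pairs(frame, target_box, n_samples=1000):
    """Sample n_samples same-size blocks in the 9x target ROI and record their offsets."""
    x1, y1, x2, y2 = target_box
    w, h = x2 - x1, y2 - y1
    roi = (x1 - w, y1 - h, x2 + w, y2 + h)       # 3w x 3h region, 9 times the target size
    roi_tensor = crop_and_normalize(frame, roi)
    pairs = []
    for _ in range(n_samples):
        bx1 = random.uniform(roi[0], roi[2] - w)
        by1 = random.uniform(roi[1], roi[3] - h)
        block = (bx1, by1, bx1 + w, by1 + h)
        offsets = torch.tensor([block[0] - x1, block[1] - y1,       # Dxl, Dyl
                                block[2] - x2, block[3] - y2])      # Dxr, Dyr
        pairs.append((roi_tensor, crop_and_normalize(frame, block), offsets))
    return pairs

def train_offline(net, pairs, epochs=10, lr=1e-3, batch_size=32):
    """Offline SGD training; only the MLP parameters are updated."""
    optimizer = optim.SGD(net.mlp.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.MSELoss()                       # regression loss on the 4 offsets (assumed)
    for _ in range(epochs):
        random.shuffle(pairs)
        for i in range(0, len(pairs), batch_size):
            rois, blocks, offsets = map(torch.stack, zip(*pairs[i:i + batch_size]))
            loss = loss_fn(net(rois, blocks), offsets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```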
Step 3: image input:
For real-time processing, the video images captured by the camera and stored in the memory buffer are extracted as the input images to be tracked; for offline processing, the captured video file is decomposed into an image sequence composed of individual frames, and the frame images are extracted one by one in chronological order as input images. If the input image is empty, the whole procedure stops.
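Both input modes reduce to reading frames until none remain. A minimal sketch with OpenCV, assuming a camera index for real-time processing and a file path for offline processing:

```python
# Minimal frame source for step 3 (OpenCV assumed); pass a camera index for
# real-time processing or a video file path for offline processing.
import cv2

def frames(source=0):
    """Yield frames until no further image can be read (empty input stops the flow)."""
    capture = cv2.VideoCapture(source)
    while True:
        ok, frame = capture.read()
        if not ok:                    # empty input image: the whole procedure stops
            break
        yield frame
    capture.release()
```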
Step 4: short-term tracking:
Short-term tracking uses the particle filter (PF) method. Each particle of the particle filter represents a possible target image block. The particle filter contains 1000 particles, each of which is obtained by random selection within the target ROI, where the target ROI is the region centered on the most recently determined target and 9 times the target size. The similarity between the predicted target image block output by the particle filter and the target image block is computed as the normalized cross-correlation (NCC) between the two image blocks. During tracking, if this value is greater than 0.9, the target output by short-term tracking is regarded as credible, the target localization is complete, the spatial offset values of each particle relative to the newly located target, i.e. the coordinate differences Dxl, Dyl, Dxr, Dyr between particle and target, are recorded, and the procedure jumps to step 5; otherwise the target is regarded as not credible and the procedure jumps to step 6.
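The credibility test of step 4 relies on the normalized cross-correlation between two equally sized image blocks. A minimal sketch, assuming numpy arrays and the 0.9 threshold stated above:

```python
# Sketch of the normalized cross-correlation (NCC) credibility test of step 4;
# the blocks are assumed to be equally sized numpy arrays.
import numpy as np

def ncc(block_a, block_b):
    """Normalized cross-correlation between two equally sized image blocks."""
    a = block_a.astype(np.float64).ravel() - block_a.mean()
    b = block_b.astype(np.float64).ravel() - block_b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def credible(predicted_block, target_block, threshold=0.9):
    """True when the short-term tracking output matches the stored target block."""
    return ncc(predicted_block, target_block) > threshold
```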
Step 5: online training of the network:
Taking the particles of the particle filter in step 4 and their corresponding target ROI as the online training set of the spatial offset learning network, online training is performed on the spatial offset learning network with stochastic gradient descent (SGD) and the network parameters are updated.
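A sketch of this online update, reusing crop_and_normalize and train_offline (and their imports) from the step 2 sketch: the particle boxes of the current frame and the target ROI form the online training set, and one or a few SGD passes refresh the network parameters. The learning rate and number of passes are assumptions.

```python
# Sketch of the online update of step 5, reusing crop_and_normalize and train_offline
# from the step 2 sketch; the learning rate and number of passes are assumptions.
def train_online(net, frame, target_box, particle_boxes, lr=1e-4, passes=1):
    x1, y1, x2, y2 = target_box
    w, h = x2 - x1, y2 - y1
    roi = (x1 - w, y1 - h, x2 + w, y2 + h)            # current target ROI (9x target size)
    roi_tensor = crop_and_normalize(frame, roi)
    pairs = []
    for (px1, py1, px2, py2) in particle_boxes:       # offsets of each particle vs. target
        offsets = torch.tensor([px1 - x1, py1 - y1, px2 - x2, py2 - y2],
                               dtype=torch.float32)
        pairs.append((roi_tensor, crop_and_normalize(frame, (px1, py1, px2, py2)), offsets))
    for _ in range(passes):
        train_offline(net, pairs, epochs=1, lr=lr)    # one SGD pass over the online set
```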
Step 6: target localization and update:
The current target ROI and each particle of the particle filter in step 4 are input to the spatial offset learning network; after forward processing of the network, the spatial offset value corresponding to each particle is output, and from the particle position and the corresponding spatial offset value the target position predicted by that particle is calculated. The target positions predicted by all particles are counted, and the position predicted most often is taken as the new target position, completing the target localization. The similarity between the newly located target image block and the target image block tracked by the particle filter in short-term tracking, i.e. their normalized cross-correlation value (NCC), is then calculated; if this value is greater than 0.9, the target image block used by the particle filter for short-term tracking is updated with the newly located target image block. Jump to step 3.
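A sketch of the voting-based localization of step 6: each particle's learned offsets are subtracted from its own coordinates to obtain a predicted target box, predictions are rounded to whole pixels, and the box predicted most often is returned. Rounding to whole pixels as the voting bin is an assumption not specified in the text.

```python
# Sketch of the voting-based localization of step 6; rounding predictions to whole
# pixels as the voting bin is an assumption.
from collections import Counter
import torch

def locate_target(net, frame, roi_box, particle_boxes):
    """Return the target box predicted by the largest number of particles."""
    roi_tensor = crop_and_normalize(frame, roi_box)
    votes = Counter()
    with torch.no_grad():
        for (px1, py1, px2, py2) in particle_boxes:
            block = crop_and_normalize(frame, (px1, py1, px2, py2))
            dxl, dyl, dxr, dyr = net(roi_tensor[None], block[None])[0].tolist()
            # offsets are particle coordinate minus target coordinate, so the
            # predicted target corner is the particle corner minus the offset
            predicted = (round(px1 - dxl), round(py1 - dyl),
                         round(px2 - dxr), round(py2 - dyr))
            votes[predicted] += 1
    return votes.most_common(1)[0][0]
```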
During tracking, when the target result output by short-term particle filter tracking is credible, short-term tracking performs real-time target tracking while the spatial offset learning network is trained online; when the output target result is not credible, the target is located by the spatial offset learning network, and at the same time the target image block used by the particle filter for short-term tracking is updated according to the target determined by that network. Because the method on the one hand uses a deep neural network for feature representation and on the other hand takes the image data of both target and background as network input and learns spatial offset values, the target localization process is more accurate; moreover, through effective combination with the particle filter during tracking, it makes full use of the high robustness of the particle filter and simplifies the learning process of the network, so that long-term, real-time, and stable target tracking can be realized.
Advantages and positive effects compared with the prior art: the present invention proposes a target tracking method based on spatial offset learning. The method uses the deep neural network GoogleNet (with its 152nd layer as the feature representation layer) to perform feature representation on the input image, takes the target region of interest (ROI) in a frame image and the image blocks selected within that region as the input of the spatial offset learning network, and then learns, through a multi-layer perceptron (MLP), the spatial offset values between the selected image block and the target. During tracking, each particle of the particle filter is taken as an image block selected within the target ROI; together with the target ROI, the spatial offset value corresponding to each particle can be obtained after forward processing of the network, its corresponding predicted position is then calculated, and finally the position predicted most often is taken as the new target position, so that the target can be located and tracking realized. Because the method on the one hand uses a deep neural network for feature representation and on the other hand takes the image data of both target and background as network input and learns spatial offset values, the target localization process is more accurate; moreover, through effective combination with the particle filter during tracking, it makes full use of the high robustness of the particle filter and simplifies the learning process of the network, so that long-term, real-time, and stable target tracking can be realized. In addition, the method of the present invention can be used not only for single-target tracking but also extended to multi-target tracking by adding and adjusting sample labels.
Detailed description of the invention
Fig. 1 is a schematic diagram of the structure of the spatial offset learning network of the present invention.
Fig. 2 is a flow chart of the target tracking method of the present invention.
Specific embodiment
Embodiment:
The method of the invention can be used in various target tracking applications, such as intelligent video analysis, automatic human-computer interaction, traffic video surveillance, vehicle driving, biological population analysis, and water surface flow velocity measurement.
Taking intelligent video analysis as an example: intelligent video analysis includes many important automatic analysis tasks, such as object behavior analysis and video compression, and the basis of such work is target tracking that is stable over long periods. This can be realized with the tracking method proposed by the present invention. Specifically, the spatial offset learning network is first constructed according to the image in which the target is selected and its initialization training is completed, as shown in the spatial offset learning network structure of Fig. 1. Then, during tracking, short-term tracking is performed with the particle filter method (PF). When the target determined by short-term particle filter tracking is credible, the target ROI is extracted according to the target position determined by short-term tracking and, together with the particles of the particle filter, constitutes the online training set used to train the spatial offset learning network; when the target determined by short-term particle filter tracking is not credible, the target is located by the spatial offset learning network, and the target image block of the particle filter is updated according to the target determined by that network. Because the method on the one hand uses a deep neural network for feature representation and on the other hand takes the image data of both target and background as network input and learns spatial offset values, the target localization process is more accurate; moreover, through effective combination with the particle filter during tracking, it makes full use of the high robustness of the particle filter and simplifies the learning process of the network, so that long-term, real-time, and stable target tracking can be realized.
The method of the present invention can be implemented in any computer programming language (such as the C language), and tracking system software based on this method can realize real-time target tracking applications on any PC or embedded system.
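For illustration only, the following Python sketch ties the steps above together into a single tracking loop using the helper sketches from the earlier sections; the particle filter is reduced to uniform random sampling of 1000 candidate boxes inside the 9x ROI, which is a simplification of a full particle filter and not the patent's implementation.

```python
# Illustrative end-to-end loop (not the patent's implementation) combining the
# sketches above: frames, select_target, SpatialOffsetNet, make_training_pairs,
# train_offline, crop_and_normalize, ncc, credible, train_online, locate_target.
# The particle filter is simplified to uniform random sampling of candidate boxes.
import random

def roi_of(box):
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (x1 - w, y1 - h, x2 + w, y2 + h)          # 9x the target size, target centered

def sample_particles(box, n=1000):
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    rx1, ry1, rx2, ry2 = roi_of(box)
    particles = []
    for _ in range(n):
        px1 = random.uniform(rx1, rx2 - w)
        py1 = random.uniform(ry1, ry2 - h)
        particles.append((px1, py1, px1 + w, py1 + h))
    return particles

def track(source=0):
    frame_iter = frames(source)
    first = next(frame_iter)
    x, y, w, h = select_target([first])              # step 1: target selection
    target_box = (x, y, x + w, y + h)
    net = SpatialOffsetNet()                         # step 2: build and initialize
    train_offline(net, make_training_pairs(first, target_box))
    template = crop_and_normalize(first, target_box).numpy()
    for frame in frame_iter:                         # step 3: image input
        particles = sample_particles(target_box)     # step 4: short-term tracking
        best = max(particles,
                   key=lambda p: ncc(crop_and_normalize(frame, p).numpy(), template))
        if credible(crop_and_normalize(frame, best).numpy(), template):
            target_box = best
            train_online(net, frame, target_box, particles)   # step 5: online training
        else:                                        # step 6: localization and update
            target_box = locate_target(net, frame, roi_of(target_box), particles)
            new_block = crop_and_normalize(frame, target_box).numpy()
            if credible(new_block, template):
                template = new_block
        yield target_box
```

In use, iterating over track("video.mp4") (the path is a placeholder) would yield one estimated target box per frame.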
Claims (1)
1. A target tracking method based on spatial offset learning, comprising the following steps:
Step 1: target selection:
the target object to be tracked is selected and determined from the initial image, the target selection process being performed automatically by a moving-target detection method or specified manually through human-computer interaction;
Step 2: construction and initialization of the spatial offset learning network:
the spatial offset learning network consists of four parts: image data extraction, a deep neural network, a multi-layer perceptron (MLP), and the spatial offset output;
the image data extraction part extracts two kinds of image blocks from the original image: one is the target region of interest (ROI), i.e. an image block centered on the target and 9 times the target size, and the other is an image block of the same size as the target selected at random within the target ROI; for the deep neural network part, the publicly available pre-trained network GoogleNet is used to perform feature representation on the images, this network having 154 layers in total and being a deep neural network trained on the large-scale dataset ImageNet containing millions of training images, the input images being scale-normalized to 224 × 224 pixels as the input data of the GoogleNet network; the 152nd layer of GoogleNet is used as the feature representation layer and has 1024 output values; the target ROI and the image block selected within it are taken as the image inputs of GoogleNet, and after forward processing of GoogleNet two feature vectors of 1024 values each are output, which are concatenated as the input of the MLP part; the MLP comprises three fully connected layers, the first layer having 2048 nodes, the second layer 1024 nodes, and the third layer 512 nodes, and the last layer of the MLP is connected to the spatial offset output part; the spatial offset output part contains 4 values, namely the difference Dxl between the abscissa of the upper-left corner of the selected image block and that of the target, the difference Dyl between the ordinate of the upper-left corner of the selected image block and that of the target, the difference Dxr between the abscissa of the lower-right corner of the selected image block and that of the target, and the difference Dyr between the ordinate of the lower-right corner of the selected image block and that of the target;
the target ROI and an image block selected within it are input to the spatial offset learning network as a pair; for the same frame image, the target ROI is fixed while multiple image blocks are selected within it; 1000 image blocks are selected at random within the target ROI and the spatial offset values between each selected image block and the target, i.e. the coordinate differences Dxl, Dyl, Dxr, Dyr, are recorded; the target ROI obtained by centering on the target image block determined in step 1, together with the image blocks selected within it, generates the initial training set, and the spatial offset learning network is trained offline with stochastic gradient descent (SGD), thereby determining the parameters of the multi-layer perceptron and completing the initialization of the spatial offset learning network;
Step 3: image input:
for real-time processing, the video images captured by the camera and stored in the memory buffer are extracted as the input images to be tracked; for offline processing, the captured video file is decomposed into an image sequence composed of individual frames, and the frame images are extracted one by one in chronological order as input images; if the input image is empty, the whole procedure stops;
Step 4: short-term tracking:
short-term tracking uses the particle filter method, each particle of the particle filter representing a possible target image block; the particle filter contains 1000 particles, each obtained by random selection within the target ROI, the target ROI being the region centered on the most recently determined target and 9 times the target size; the similarity between the predicted target image block output by the particle filter and the target image block is computed as the normalized cross-correlation (NCC) between the two image blocks; during tracking, if this value is greater than 0.9, the target output by short-term tracking is credible, the target localization is complete, the spatial offset values of each particle relative to the newly located target, i.e. the coordinate differences Dxl, Dyl, Dxr, Dyr between particle and target, are recorded, and the procedure jumps to step 5; otherwise the target is not credible and the procedure jumps to step 6;
Step 5: online training of the network:
taking the particles of the particle filter in step 4 and their corresponding target ROI as the online training set of the spatial offset learning network, online training is performed on the spatial offset learning network with stochastic gradient descent (SGD) and the network parameters are updated;
Step 6: target localization and update:
the current target ROI and each particle of the particle filter in step 4 are input to the spatial offset learning network; after forward processing of the network, the spatial offset value corresponding to each particle is output, and from the particle position and the corresponding spatial offset value the target position predicted by that particle is calculated; the target positions predicted by all particles are counted, and the position predicted most often is taken as the new target position, completing the target localization; the similarity between the newly located target image block and the target image block tracked by short-term particle filter tracking, i.e. their normalized cross-correlation value (NCC), is calculated, and if this value is greater than 0.9, the target image block used by the particle filter for short-term tracking is updated with the newly located target image block; jump to step 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811186951.7A CN109493370B (en) | 2018-10-12 | 2018-10-12 | Target tracking method based on space offset learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109493370A true CN109493370A (en) | 2019-03-19 |
CN109493370B CN109493370B (en) | 2021-07-02 |
Family
ID=65689745
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109493370B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160127641A1 (en) * | 2014-11-03 | 2016-05-05 | Robert John Gove | Autonomous media capturing |
CN106169188A (en) * | 2016-07-11 | 2016-11-30 | 西南交通大学 | A kind of method for tracing object based on the search of Monte Carlo tree |
CN106980843A (en) * | 2017-04-05 | 2017-07-25 | 南京航空航天大学 | The method and device of target following |
CN107146237A (en) * | 2017-04-24 | 2017-09-08 | 西南交通大学 | A kind of method for tracking target learnt based on presence with estimating |
CN107369169A (en) * | 2017-06-08 | 2017-11-21 | 温州大学 | The approximate most like image block matching method transmitted based on direction alignment with matching that a kind of GPU accelerates |
CN107169998A (en) * | 2017-06-09 | 2017-09-15 | 西南交通大学 | A kind of real-time tracking and quantitative analysis method based on hepatic ultrasound contrast enhancement image |
CN107316058A (en) * | 2017-06-15 | 2017-11-03 | 国家新闻出版广电总局广播科学研究院 | Improve the method for target detection performance by improving target classification and positional accuracy |
CN108154118A (en) * | 2017-12-25 | 2018-06-12 | 北京航空航天大学 | A kind of target detection system and method based on adaptive combined filter with multistage detection |
CN108537825A (en) * | 2018-03-26 | 2018-09-14 | 西南交通大学 | A kind of method for tracking target based on transfer learning Recurrent networks |
CN108470355A (en) * | 2018-04-04 | 2018-08-31 | 中山大学 | Merge the method for tracking target of convolutional network feature and discriminate correlation filter |
CN108573246A (en) * | 2018-05-08 | 2018-09-25 | 北京工业大学 | A kind of sequential action identification method based on deep learning |
Non-Patent Citations (5)
Title |
---|
NAOKI AKAI ET AL: "Simultaneous pose and reliability estimation using convolutional neural network and Rao–Blackwellized particle filter", 《ADVANCED ROBOTICS》 * |
X KONG ET AL: "Attentional convolutional neural networks for object tracking", 《2018 INTEGRATED COMMUNICATIONS, NAVIGATION, SURVEILLANCE CONFERENCE (ICNS)》 * |
习文星: "移动背景下视觉的行人检测、识别与跟踪技术研究", 《中国博士学位论文全文数据库 信息科技辑》 * |
常发亮等: "复杂环境下基于自适应粒子滤波器的目标跟踪", 《电子学报》 * |
张毅锋等: "基于基完备化理论和嵌入多层感知机的深度网络结构设计", 《东南大学学报(自然科学版)》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109993770A (en) * | 2019-04-09 | 2019-07-09 | 西南交通大学 | A kind of method for tracking target of adaptive space-time study and state recognition |
CN109993770B (en) * | 2019-04-09 | 2022-07-15 | 西南交通大学 | Target tracking method for adaptive space-time learning and state recognition |
CN110007675A (en) * | 2019-04-12 | 2019-07-12 | 北京航空航天大学 | A kind of Vehicular automatic driving decision system based on driving situation map and the training set preparation method based on unmanned plane |
CN110472588A (en) * | 2019-08-19 | 2019-11-19 | 上海眼控科技股份有限公司 | Anchor point frame determines method, apparatus, computer equipment and storage medium |
CN110472588B (en) * | 2019-08-19 | 2020-11-24 | 上海眼控科技股份有限公司 | Anchor point frame determining method and device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109493370B (en) | 2021-07-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |