CN109215057A

CN109215057A - A high-performance visual tracking method and device

Info

Publication number: CN109215057A
Application number: CN201810857145.1A
Authority: CN
Inventors: 葛仕明
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2018-07-31
Filing date: 2018-07-31
Publication date: 2019-01-15
Anticipated expiration: 2038-07-31
Also published as: CN109215057B

Abstract

The present invention provides a high-performance visual tracking method whose steps include: extracting an image block from the previous frame of a video according to the object position in that frame, and extracting the original multi-channel features of the image block; extracting an image block from the current frame according to the object position in the previous frame, and extracting the original multi-channel features of the image block; passing the original multi-channel features of the two frames through a channel distillation module to obtain compressed multi-channel features; applying a Fourier transform to the compressed multi-channel features, multiplying the result element-wise with the tracking model, and applying an inverse Fourier transform to obtain a response map; finding the position of the maximum response on the response map to obtain an object displacement vector, and adding this vector to the object position in the previous frame to obtain the object position in the current frame; and computing the compressed multi-channel features at the object position in the current frame and updating the tracking model. The present invention also provides a high-performance visual tracking device.

Description

A high-performance visual tracking method and device

Technical field

The invention belongs to the fields of computer vision and multimedia analysis, and in particular relates to a visual tracking method and device for resource-constrained conditions.

Background art

High-performance visual tracking has important applications in many areas of computer vision and multimedia analysis, such as video content analysis, video surveillance, autonomous navigation and human-computer interaction. Current visual tracking schemes generally represent the visual object with multi-channel features and embed them in an optimization framework, and they achieve good tracking results. In particular, with the development of deep learning, current state-of-the-art visual trackers often use deep features as the multi-channel features that represent the visual object, and they achieve the highest tracking accuracy to date. However, because deep learning models have a large number of parameters, model inference occupies a large amount of memory and has high computational complexity, which makes these trackers very difficult to deploy under resource-constrained conditions.

To improve the tracking efficiency of visual trackers, and with it their deployability, several visual tracking methods have been proposed in recent years. According to how they process features, these methods fall into three classes: learning-based tracking, weighting-based tracking and compression-based tracking. Learning-based tracking learns a new model directly from large-scale data to represent the visual object (see: L. Bertinetto, J. Valmadre, J. F. Henriques, et al. 2016. Fully-Convolutional Siamese Networks for Object Tracking. In: ECCV Workshop. 850-865). Such tracking usually requires additional large-scale training, and one of its most important elements is that knowledge learned from data of many different visual objects can be transferred to the tracking of a particular visual object; these methods therefore do not address how to adaptively transfer only the desired knowledge rather than all of the knowledge. Weighting-based tracking adaptively measures the influence of each channel and either weights the multi-channel features to improve the representation of the visual object or weights the channel responses to improve localization accuracy (see: A. Lukežič, T. Vojíř, L. Č. Zajc, et al. 2017. Discriminative Correlation Filter with Channel and Spatial Reliability. In IEEE CVPR. 6309-6318). Such tracking usually achieves good accuracy, but the number of feature channels is not reduced and remains large. Compression-based tracking improves tracking efficiency by reducing or compressing the feature dimension (see: M. Danelljan, G. Bhat, F. Khan, and M. Felsberg. 2017. ECO: Efficient Convolution Operators for Tracking. In IEEE CVPR. 6638-6646). These methods typically reduce the number of model parameters, but their memory complexity remains high.

In general, multi-channel features carry general knowledge and can describe a visual object from different perspectives. The key problems of visual tracking are (1) how to adaptively extract the right knowledge from this general knowledge, and (2) how to transfer that knowledge to the tracking of a specific visual object.

Summary of the invention

To overcome the deficiencies of the prior art, the present invention provides a high-performance visual tracking method and device. The method uses a channel distillation algorithm that adaptively identifies and selects feature channels according to the visual object to be tracked, which reduces the dimension of the multi-channel features and improves their expressive power. First, the current video frame, in which the position of the visual object is known, passes through a multi-channel feature extractor (the feature extraction module) to obtain multi-channel features; the feature extraction module may use a preset multi-channel feature extraction model or a pre-trained deep learning model. Next, the multi-channel features pass through a channel distillation module trained in advance to obtain compressed multi-channel features. The compressed multi-channel features are then Fourier transformed and correlated with the trained tracking model (the correlation filter) to obtain a response map. The displacement vector of the visual object is then obtained from the position of the maximum response on the response map and added to the object position in the previous frame to obtain the object position in the current frame. Finally, the tracking model is updated according to the object position in the current frame. The present invention also proposes a training method for the channel distillation module: by optimizing an energy function it selects the optimal feature channels, which compresses the features, adaptively improves their expressive power, increases processing speed and reduces memory use.

To achieve the above objects, the present invention adopts the following technical solutions:

A high-performance visual tracking method, comprising the steps of:

extracting an image block from the previous frame of a video according to the object position in that frame, and extracting the original multi-channel features of the image block;

extracting an image block from the current frame according to the object position in the previous frame of the video, and extracting the original multi-channel features of the image block;

passing the original multi-channel features of the two frames through a channel distillation module to obtain compressed multi-channel features;

applying a Fourier transform to the compressed multi-channel features, multiplying the result element-wise with the tracking model, and then applying an inverse Fourier transform to obtain a response map;

finding the position of the maximum response on the response map to obtain an object displacement vector, and adding the object displacement vector to the object position in the previous frame corresponding to the current frame to obtain the object position in the current frame;

computing the compressed multi-channel features at the object position in the current frame, and updating the tracking model.
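The Fourier-domain comparison step listed above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function name `response_map`, the array shapes, and the choice to conjugate the model rather than the features are assumptions made for illustration. The demo uses a template as its own matched filter, so a circular shift of the features moves the response peak by the same amount.

```python
import numpy as np

def response_map(features, model_hat):
    """Correlate compressed features with a frequency-domain tracking model.

    features  : real array (k, H, W), the k selected feature channels.
    model_hat : complex array (k, H, W), one frequency-domain template per channel.
    The conjugate placement is an assumed convention, not taken from the source.
    """
    feat_hat = np.fft.fft2(features, axes=(-2, -1))            # per-channel 2-D FFT
    resp_hat = np.sum(np.conj(model_hat) * feat_hat, axis=0)   # element-wise product, summed over channels
    return np.real(np.fft.ifft2(resp_hat))                     # inverse FFT gives the response map

# Demo: features that are a shifted copy of the template peak at the shift.
rng = np.random.default_rng(0)
template = rng.standard_normal((3, 16, 16))
model_hat = np.fft.fft2(template, axes=(-2, -1))
shifted = np.roll(template, shift=(2, 5), axis=(1, 2))
response = response_map(shifted, model_hat)
peak = np.unravel_index(np.argmax(response), response.shape)
```

Working in the frequency domain replaces a sliding-window correlation with one element-wise product per channel, which is the source of the speed-up the method relies on.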

Further, if the previous frame is the first frame, the object position in the first frame is known.

Further, compressed multi-channel features are obtained from the original multi-channel features of at least the first two frames of the video, and the channel distillation module and the tracking model are obtained by minimizing an energy function.

Further, the energy function is constructed from the difference between the correlation response map of the compressed multi-channel features and a desired response map, the desired response map being a Gaussian-shaped function whose response is large at the centre and close to 0 towards the edges.
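A desired response map of the kind just described can be generated as follows. This is a sketch under assumptions: the function name `gaussian_response`, the map size and the bandwidth `sigma` are illustrative values that the source does not specify.

```python
import numpy as np

def gaussian_response(height, width, sigma=2.0):
    """2-D Gaussian that is 1 at the centre pixel and decays towards the edges."""
    ys = np.arange(height) - height // 2
    xs = np.arange(width) - width // 2
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.exp(-(yy ** 2 + xx ** 2) / (2.0 * sigma ** 2))

desired = gaussian_response(31, 31, sigma=2.0)
centre = np.unravel_index(np.argmax(desired), desired.shape)
```

A smaller `sigma` gives a sharper target and penalizes localization errors more strongly; a larger one is more tolerant.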

Further, the energy function is:

subject to α_l ∈ {0,1}

where h is the tracking model and h^(l) is the l-th channel template of h; the d-dimensional binary vector a = (α₁, α₂, …, α_d) represents a channel selection, in which α_l = 1 indicates that the l-th feature channel is selected and α_l = 0 indicates that it is not selected; ‖a‖ is the number of selected channels; the constant λ balances the two energy terms; the transformed quantities are given by the discrete Fourier transform; ⊙ is element-wise multiplication; and * is complex conjugation.

Further, the energy function is minimized by an alternating optimization algorithm.

Further, the original multi-channel features are extracted by a feature extraction module, which is a preset multi-channel feature extraction model, a pre-trained deep learning model, or a combination of the two.

Further, the preset multi-channel feature extraction model can extract HOG (Histogram of Oriented Gradients) features or color attribute features.

Further, the compressed multi-channel features are converted to the frequency domain by Fourier transform and combined with the tracking model by weighted accumulation to update the tracking model.
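The weighted-accumulation update can be sketched as a running average in the frequency domain. The source gives neither the update formula nor the weight, so everything here is an assumption: `fit_filter` uses the single-frame closed form common in correlation-filter trackers, and `learning_rate` and `lam` are hypothetical parameters.

```python
import numpy as np

def fit_filter(feat_hat, y_hat, lam=1e-3):
    """Single-frame multi-channel filter (assumed closed form, not from the source).

    Minimises |sum_l conj(h_l) * x_l - y|^2 + lam * sum_l |h_l|^2 at each frequency.
    feat_hat: complex (k, H, W); y_hat: complex (H, W).
    """
    energy = np.sum(np.abs(feat_hat) ** 2, axis=0)   # total channel energy per frequency
    return feat_hat * np.conj(y_hat) / (energy + lam)

def update_model(model_hat, new_model_hat, learning_rate=0.02):
    """Blend the previous model with the estimate from the current frame."""
    return (1.0 - learning_rate) * model_hat + learning_rate * new_model_hat

# Demo: the fitted filter reproduces the desired response on its own features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 16, 16))
desired = np.zeros((16, 16))
desired[4, 7] = 1.0
feat_hat = np.fft.fft2(feats, axes=(-2, -1))
new_hat = fit_filter(feat_hat, np.fft.fft2(desired))
resp = np.real(np.fft.ifft2(np.sum(np.conj(new_hat) * feat_hat, axis=0)))
model_hat = update_model(np.zeros_like(new_hat), new_hat, learning_rate=0.25)
```

A small learning rate keeps the model stable under occlusion, while a larger one adapts faster to appearance change; the right value depends on the footage.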

A high-performance visual tracking device, comprising:

a feature extraction module, which extracts the original multi-channel features using a preset feature extractor, a pre-trained deep learning model, or a combination of the two;

a channel distillation module, for selecting the most informative feature channels from the original multi-channel features to obtain compressed multi-channel features;

a feature comparison module, for correlating the compressed multi-channel features with the tracking model to obtain a correlation response map;

a response prediction module, for finding the position of the maximum response on the correlation response map to obtain an object displacement vector, and from it calculating the current object position;

a model update module, for updating the tracking model with the information at the current object position;

wherein the tracking model is a multi-channel template used for feature comparison with the multi-channel features of the object.
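The calculation performed by the response prediction module can be sketched as follows. One detail is an assumption: the response map is taken to be the raw output of an inverse FFT, so zero displacement sits at index (0, 0) and peaks past the midpoint wrap around to negative shifts. If the map is centred instead, the offset from the centre pixel would be used. The function names are hypothetical.

```python
import numpy as np

def peak_displacement(response):
    """Return (dy, dx) of the maximum response, unwrapping circular shifts."""
    h, w = response.shape
    py, px = np.unravel_index(np.argmax(response), response.shape)
    dy = py - h if py > h // 2 else py   # indices past the midpoint mean negative shifts
    dx = px - w if px > w // 2 else px
    return int(dy), int(dx)

def update_position(prev_pos, response):
    """Add the displacement vector to the previous (row, col) object position."""
    dy, dx = peak_displacement(response)
    return prev_pos[0] + dy, prev_pos[1] + dx

# Demo: a peak at row 14 of 16 is a shift of -2 rows; column 3 is a shift of +3.
demo = np.zeros((16, 16))
demo[14, 3] = 1.0
new_pos = update_position((100, 50), demo)
```

In practice the displacement is also scaled by the ratio between feature-map resolution and image resolution, which this sketch omits.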

The beneficial effects of the present invention are as follows. For the visual object tracking problem, and in particular under resource-constrained conditions, the method and device of the invention adaptively compress and distill the features, and offer considerable advantages in tracking speed and memory usage while maintaining tracking accuracy. In addition, the invention can achieve high accuracy even when a small deep learning model is used for object representation, whereas object tracking methods of the same type need a very large deep learning model for object representation to reach high accuracy.

Brief description of the drawings

Fig. 1 is a structural diagram of the device on which the high-performance visual tracking method of an embodiment is based.

Fig. 2A shows the structure of a preset multi-channel feature model based on histograms of gradients.

Fig. 2B shows the structure of a preset multi-channel feature model based on color attributes.

Fig. 3A shows a pre-trained small deep learning model.

Fig. 3B shows a pre-trained large deep learning model.

Fig. 4 is a schematic diagram of the processing performed by the channel distillation module.

Fig. 5 is a processing flowchart of the feature comparison module.

Detailed description of the embodiments

To make the above scheme and the beneficial effects of the invention clearer, embodiments are described in detail below with reference to the accompanying drawings.

This embodiment provides a high-performance visual tracking method. The device on which the method is based comprises a feature extraction module, a channel distillation module, a feature comparison module, a response prediction module and a model update module, as shown in Fig. 1. The steps are as follows:

1) Receive a video frame, that is, a frame image containing the object to be tracked.

2) Extract the original multi-channel features of the image block of the frame with the feature extraction module.

The feature extraction module may be a preset multi-channel feature model (for example HOG features or color attribute features, as shown in Figs. 2A and 2B), a deep learning model pre-trained on ImageNet or another large-scale image classification database (as shown in Figs. 3A and 3B), or a combination of the two.

3) Obtain the compressed multi-channel features with the channel distillation module.

Referring to Fig. 4, the channel distillation module holds the optimal channel set obtained by training; it selects the most informative channels from the original multi-channel features to form the compressed multi-channel features.

4) Obtain the response map with the feature comparison module.

Referring to Fig. 5, the feature comparison module first applies a Fourier transform to the compressed multi-channel features, then multiplies them element-wise with the tracking model, and finally applies an inverse Fourier transform to obtain the response map. Here the correlation filtering operation is carried out in the frequency domain by means of the Fourier transform and its inverse, which improves processing speed.

5) Determine the current object position with the response prediction module. Specifically, the position of the maximum response on the response map is found to obtain the object displacement vector, which is added to the previous object position to obtain the object position in the current frame.

6) Update the tracking model with the model update module. The compressed multi-channel features are computed at the object position in the current frame, converted to the frequency domain by Fourier transform, and combined with the previous tracking model by weighted accumulation to obtain the updated tracking model.

The original multi-channel features produced by the feature extraction module generally contain a large number of channels, which occupy a large amount of memory and consume considerable processing time. The present invention compresses the multi-channel features by channel distillation, which speeds up tracking and reduces the memory required; in addition, channel distillation adaptively selects the informative feature channels by optimizing an energy function, which enables more accurate tracking.

The following describes in detail how the channel distillation is trained to obtain the optimal feature channels.

The purpose of channel distillation is to adaptively select, according to the appearance of the tracked object, the feature channels best suited to that object, extracting the most informative feature channels (referred to as the optimal channels) and pruning the noisy ones, so that the distilled channel features improve tracking performance (both accuracy and speed). The present invention formulates channel distillation as a joint optimization problem that jointly optimizes feature compression, the tracking model and response map generation. The goal is to learn the tracking model and the optimal channel selection from a training data set containing n samples; the problem can be expressed as the following energy function:
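The memory effect of channel compression can be shown with a small sketch of applying a trained binary selector to a feature tensor. The name `distill_channels`, the 32-channel feature size and the selected indices are illustrative assumptions; in the method the selector comes from training, not from a hand-picked list.

```python
import numpy as np

def distill_channels(features, alpha):
    """Keep only the channels whose selector entry is 1.

    features : array (d, H, W) of original multi-channel features.
    alpha    : length-d binary selection vector.
    """
    mask = np.asarray(alpha).astype(bool)
    if mask.shape[0] != features.shape[0]:
        raise ValueError("selector length must equal the number of channels")
    return features[mask]

# Demo: keeping 4 of 32 channels cuts feature memory by a factor of 8.
features = np.random.default_rng(0).standard_normal((32, 16, 16)).astype(np.float32)
alpha = np.zeros(32, dtype=int)
alpha[[1, 4, 9, 20]] = 1
compressed = distill_channels(features, alpha)
```

Every later stage (FFT, element-wise product, model update) scales with the number of channels, so the same factor applies to its cost.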

In formula (1), h is the tracking model, usually expressed as a multi-channel template; the d-dimensional binary vector a = (α₁, α₂, …, α_d) represents a channel selection, in which α_l = 1 indicates that the l-th feature channel is selected and α_l = 0 indicates that it is not selected; a multi-channel feature extractor produces the features; the feature comparison function φ measures the matching response between a single extracted feature channel and the tracking model, and accumulating the responses of the d channels gives the response map; the response prediction function measures the difference between the generated response map and the desired response map; the regularization function constrains the tracking model; and the constant λ balances the two energy terms.

When the channel distillation method is applied to the discriminative correlation filter (DCF) tracking framework, formula (1) becomes the following energy function:

subject to α_l ∈ {0,1}    (2)

Here the filtering operator denotes circular convolution, h^(l) is the l-th channel template (i.e. correlation filter) of the tracking model h, and ‖a‖ is the number of selected channels. The first term of formula (2) measures the filtering loss between the convolution output and the ideal correlation output, and the second term regularizes the correlation filter. By Parseval's theorem, formula (2) can be expressed in the frequency domain as:

subject to α_l ∈ {0,1}    (3)

where the transformed quantities are given by the discrete Fourier transform, ⊙ is element-wise multiplication, and * is complex conjugation.
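The equation bodies of formulas (2) and (3) do not survive in this text. As a reading aid only, one form consistent with the symbol definitions and the two-term description is sketched here. It is an assumption, not the patent's wording: the sample notation $x_i^{(l)}$ (the l-th feature channel of sample i), $y_i$ (the desired response), the hat for the discrete Fourier transform, the use of $\star$ for the circular filtering operator, and the exact placement of the $1/\|a\|$ normalization and of the conjugate are all hypothetical.

```latex
\begin{align}
E(a, h) &= \sum_{i=1}^{n} \Big\| \frac{1}{\|a\|} \sum_{l=1}^{d} \alpha_l \,\big( h^{(l)} \star x_i^{(l)} \big) - y_i \Big\|^2
          + \lambda \sum_{l=1}^{d} \alpha_l \,\big\| h^{(l)} \big\|^2,
          \qquad \alpha_l \in \{0, 1\} \\
\hat{E}(a, \hat{h}) &= \sum_{i=1}^{n} \Big\| \frac{1}{\|a\|} \sum_{l=1}^{d} \alpha_l \,\hat{h}^{(l)*} \odot \hat{x}_i^{(l)} - \hat{y}_i \Big\|^2
          + \lambda \sum_{l=1}^{d} \alpha_l \,\big\| \hat{h}^{(l)} \big\|^2,
          \qquad \alpha_l \in \{0, 1\}
\end{align}
```

Under this reading the two expressions differ only by the constant factor that Parseval's theorem introduces for an unnormalized transform, so they share the same minimizer. The selector appears both in the sum and in the normalizing count $\|a\|$, which matches the remark that the binary vector occurs in a numerator and a denominator.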

Since formula (2) or (3) is difficult to solve in one pass, the present invention optimizes formula (3) with an alternating optimization algorithm consisting of two main steps. In the first step, the tracking model h is held fixed and formula (3) is minimized by searching for the optimal channel selection a. Note that the binary vector a appears in both the numerator and the denominator of formula (3), so there is no analytic solution, while exhaustive search is very time-consuming; the present invention therefore finds the optimal channel selection by heuristic search. Specifically, the loss energy of each single feature channel is first computed with the first term of formula (3); the channels are sorted by energy in ascending order to build a sorted channel sequence, and the selected channel sequence is initialized to empty. Channels are then taken one at a time from the sorted channel sequence and added to the selected channel sequence, and the energy is computed with formula (3) at each step. Finally, the selected channel sequence with the lowest energy is taken as the optimal channel arrangement. In the second step, the current channel selection is held fixed and the correlation filter is computed with the standard correlation filter solution (see H. K. Galoogahi, T. Sim, and S. Lucey. Multi-Channel Correlation Filters. In IEEE ICCV. 2013, 4321-4328).
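The first step of the alternating scheme, the heuristic channel search with the filter held fixed, can be sketched as follows. The exact energy is not recoverable from this text, so the `energy` function is an assumed form (mean of the selected channel responses compared with the desired response, plus a regularizer on the selected filters), written for a single training sample. All names are hypothetical.

```python
import numpy as np

def energy(x_hat, h_hat, y_hat, alpha, lam=1e-2):
    """Assumed frequency-domain energy for one sample (not the patent's exact formula)."""
    alpha = np.asarray(alpha, dtype=float)
    k = alpha.sum()
    if k == 0:
        return np.inf
    w = alpha[:, None, None]
    resp_hat = np.sum(w * np.conj(h_hat) * x_hat, axis=0) / k   # averaged selected responses
    data_term = np.sum(np.abs(resp_hat - y_hat) ** 2)
    reg_term = lam * np.sum(w * np.abs(h_hat) ** 2)
    return float(data_term + reg_term)

def greedy_channel_selection(x_hat, h_hat, y_hat, lam=1e-2):
    """Rank channels by single-channel loss, grow the selection, keep the best prefix."""
    d = x_hat.shape[0]
    single = [np.sum(np.abs(np.conj(h_hat[l]) * x_hat[l] - y_hat) ** 2) for l in range(d)]
    order = np.argsort(single)                  # ascending single-channel loss
    alpha = np.zeros(d, dtype=int)
    best_alpha, best_energy = alpha.copy(), np.inf
    for l in order:
        alpha[l] = 1                            # add the next channel in the ranking
        e = energy(x_hat, h_hat, y_hat, alpha, lam)
        if e < best_energy:
            best_energy, best_alpha = e, alpha.copy()
    return best_alpha, best_energy

# Demo: channels 0 and 1 err in opposite directions and cancel; 2 and 3 are noise.
rng = np.random.default_rng(0)
d, H, W = 4, 8, 8
x_hat = np.fft.fft2(rng.standard_normal((d, H, W)), axes=(-2, -1))
y_hat = np.fft.fft2(rng.standard_normal((H, W)))
e_hat = 0.1 * np.fft.fft2(rng.standard_normal((H, W)))
h_hat = np.fft.fft2(rng.standard_normal((d, H, W)), axes=(-2, -1))
h_hat[0] = np.conj((y_hat + e_hat) / x_hat[0])
h_hat[1] = np.conj((y_hat - e_hat) / x_hat[1])
best_alpha, best_energy = greedy_channel_selection(x_hat, h_hat, y_hat, lam=0.0)
```

The search evaluates only d candidate selections instead of the 2^d an exhaustive search would need. The second step, re-solving the filter for the chosen channels, is not shown here.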

The above embodiments merely illustrate the technical solution of the invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention; the scope of protection of the invention is defined by the claims.

Claims

1. A high-performance visual tracking method, comprising the steps of:

extracting an image block from the previous frame of a video according to the object position in that frame, and extracting the original multi-channel features of the image block;

extracting an image block from the current frame according to the object position in the previous frame of the video, and extracting the original multi-channel features of the image block;

applying a Fourier transform to the compressed multi-channel features, multiplying the result element-wise with the tracking model, and then applying an inverse Fourier transform to obtain a response map;

finding the position of the maximum response on the response map to obtain an object displacement vector, and adding the object displacement vector to the object position in the previous frame corresponding to the current frame to obtain the object position in the current frame;

2. The method of claim 1, wherein, if the previous frame is the first frame, the object position in the first frame is known.

3. The method of claim 1, wherein compressed multi-channel features are obtained from the original multi-channel features of at least the first two frames of the video, and the channel distillation module and the tracking model are obtained by minimizing an energy function.

4. The method of claim 3, wherein the energy function is constructed from the difference between the correlation response map of the compressed multi-channel features and a desired response map, the desired response map being a Gaussian-shaped function whose response is large at the centre and close to 0 towards the edges.

5. The method of claim 3 or 4, wherein the energy function is:

where h is the tracking model and h^(l) is the l-th channel template of h; the d-dimensional binary vector a = (α₁, α₂, …, α_d) represents a channel selection, in which α_l = 1 indicates that the l-th feature channel is selected and α_l = 0 indicates that it is not selected; ‖a‖ is the number of selected channels; the constant λ balances the two energy terms; the transformed quantities are given by the discrete Fourier transform; ⊙ is element-wise multiplication; and * is complex conjugation.

6. The method of claim 5, wherein the energy function is minimized by an alternating optimization algorithm.

7. The method of claim 1, wherein the original multi-channel features are extracted by a feature extraction module, which is a preset multi-channel feature extraction model, a pre-trained deep learning model, or a combination of the two.

8. The method of claim 7, wherein the preset multi-channel feature extraction model can extract HOG features or color attribute features.

9. The method of claim 1, wherein the compressed multi-channel features are converted to the frequency domain by Fourier transform and combined with the tracking model by weighted accumulation to update the tracking model.

10. A high-performance visual tracking device, comprising:

a feature extraction module, which extracts the original multi-channel features using a preset feature extractor, a pre-trained deep learning model, or a combination of the two;

a channel distillation module, for selecting the most informative feature channels from the original multi-channel features to obtain compressed multi-channel features;

a feature comparison module, for correlating the compressed multi-channel features with the tracking model to obtain a correlation response map;

a response prediction module, for finding the position of the maximum response on the correlation response map to obtain an object displacement vector, and from it calculating the current object position;