CN110210551A - Visual target tracking method based on adaptive subject sensitivity - Google Patents

Visual target tracking method based on adaptive subject sensitivity

Info

Publication number
CN110210551A
CN110210551A
Authority
CN
China
Prior art keywords
tracking
network
gradient
target
adaptive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910452144.3A
Other languages
Chinese (zh)
Other versions
CN110210551B (en)
Inventor
张辉
齐天卉
卓力
李嘉锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910452144.3A
Publication of CN110210551A
Application granted
Publication of CN110210551B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/469 Contour-based spatial representations, e.g. vector-coding
    • G06V 10/473 Contour-based spatial representations, e.g. vector-coding using gradient analysis
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection


Abstract

A visual target tracking method based on adaptive subject sensitivity belongs to the technical field of computer vision and comprises an overall pipeline, an offline part, and an online part. Overall pipeline: the target tracking procedure is designed, the network structure is designed accordingly, and the feature maps of each stage of the network are adjusted to adaptive sizes, completing an end-to-end Siamese tracking process. The offline part comprises 6 steps: training sample database generation; forward tracking training; back-propagation gradient computation; gradient loss term computation; target template image mask generation; and network model training and model acquisition. The online part comprises 3 steps: model updating; online tracking; and target region localization. Model updating includes forward tracking, back-propagation gradient computation, gradient loss term computation, and target template image mask generation; online tracking includes forward tracking to obtain the similarity matrix, computing the confidence of the current tracking result, and regressing the target region. The method adapts better to robust tracking of targets under appearance variation.

Description

Visual target tracking method based on adaptive subject sensitivity
Technical field
The invention belongs to the field of computer vision and relates to a target tracking method, and more specifically to a visual target tracking method based on adaptive subject sensitivity.
Background art
Visual target tracking is one of the most fundamental tasks in computer vision and video processing, with important applications in fields such as video content analysis, intelligent transportation systems, human-computer interaction, autonomous driving, and visual navigation. A typical online tracking method, given the bounding box of the target in the first frame of a video, automatically completes the localization of the target in all subsequent frames. In real application scenarios, the variations of target appearance caused by factors such as imaging conditions and pose deformation are intricate; distinguishing the target from cluttered background and tracking it accurately is an extremely challenging problem.
Currently, visual target tracking has two mainstream families of methods: correlation filtering and deep learning. Building on the circulant-structure assumption on training data and on frequency-domain operations via the fast Fourier transform, correlation-filter trackers achieve high computational efficiency and tracking accuracy. The representative MOSSE algorithm, based on correlation-filter tracking, uses fast Fourier transform operations to reach tracking speeds of 600 to 700 frames per second. Owing to limits on model complexity and flexibility, however, traditional algorithms easily saturate in performance as the data volume keeps growing, whereas deep learning shows good adaptability to massive data. The representative FCNT method analyzes the features of different layers of a convolutional neural network (CNN) and constructs a feature-selection network and complementary response-map prediction networks, effectively suppressing distractors and reducing tracking drift while being more robust to deformation of the target itself. Deep-learning trackers follow the paradigm of offline pre-training plus online fine-tuning, achieving tracking through the powerful feature representation ability of deep learning.
A further line of thought treats visual target tracking as a kind of similarity matching: template matching is carried out point by point in the search region, capturing the sensitive features of the target subject, and the position of maximum similarity score is taken as the target position, a formulation that fits the tracking task closely. Deep Siamese tracking frameworks based on this idea have shown great development potential in recent years. The representative method SiamFC replaces correlation filtering with convolution, outputs a matching response map in fully convolutional form, and predicts the target position at the highest response. SiamRPN builds on SiamFC by migrating the region proposal network from object detection. Its ability to separate similar objects is weak, however; to capture the target subject more accurately, DaSiamRPN targets distractor objects by introducing a distractor-aware model, enhancing the intra-class discrimination of the network. But the above methods all use only the first frame as a fixed template, and the matching process in subsequent frames cannot adapt to appearance changes of the target and the scene. DSiam, building on SiamFC, dynamically adjusts the target template through an online-learned transformation matrix to accommodate target appearance variation, further improving tracking accuracy and robustness.
In summary, existing deep trackers mainly construct deep tracking networks by migrating CNN parameters from tasks such as object classification and detection. Although this transfer pattern has succeeded in many visual tasks, in visual tracking it has not yet shown a clear advantage over traditional tracking techniques. One key issue is that the target in a tracking application often has no fixed semantic class: the target of interest can be an arbitrary image patch, for example a semantic object such as a pedestrian or a vehicle, but also a visual unit such as a marked region on a pedestrian or a vehicle. Moreover, the target subject that deserves attention differs from case to case and is hard to describe precisely with a rectangular box. Deep features pre-trained elsewhere therefore model such arbitrary targets unsatisfactorily. In real, complex natural scenes, models struggle to adapt to different target types and fail to obtain feature representations that are sensitive to the target subject and effectively separate foreground from background; the performance of visual target tracking methods still needs further improvement.
Summary of the invention
The object of the invention is to overcome the defects of the prior art by proposing a deep-learning-based adaptive target foreground/background separation model, constructing a deep representation of target appearance under cluttered background, and realizing a real-time visual target tracking method based on target subject analysis.
The invention is realized by the following technical means:
A visual target tracking method based on adaptive subject sensitivity, mainly comprising an overall pipeline, an offline part, and an online part.
Overall pipeline: the target tracking procedure is designed first; the network structure is then designed according to this procedure; finally, the feature maps of each stage of the network are adjusted to adaptive sizes, completing the end-to-end Siamese tracking process.
Offline part: 6 main steps: training sample database generation; forward tracking training; back-propagation gradient computation; gradient loss term computation; target template image mask generation; and network model training and model acquisition. The network model training and acquisition stage includes the choice of the tracking loss function, the gradient loss term, and the gradient descent method.
Online part: 3 main steps: model updating; online tracking; and target region localization. Model updating includes forward tracking, back-propagation gradient computation, gradient loss term computation, and target template image mask generation; online tracking includes forward tracking to obtain the similarity matrix, computing the confidence of the current tracking result, and regressing the target region.
The overall pipeline proceeds as follows:
(1) Overall pipeline of the invention. As shown in Fig. 1, the online pipeline of subject-sensitive visual target tracking mainly comprises online updating and online tracking. After the network model has been adjusted by offline pre-training, the visual target tracking flow of the invention is as follows:
When processing the initial frame, the template image and the current image are input, adaptive subject-sensitive Siamese features are extracted from both, and the tracking regression response is generated to obtain the similarity matrix; the back-propagation gradient map and the back-propagation gradient loss term are then computed, and the optimal model is solved by optimization. When processing subsequent frames, the template image and the current image are input and features are extracted by the adaptive subject-sensitive Siamese network, generating the tracking regression response; the confidence of the tracking regression response is then computed. If the confidence is greater than or equal to 0.7, the tracking result of the current image is output; if the confidence is below 0.7, the initial-frame procedure is repeated to update the network model online, as in the sketch below.
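The confidence-gated loop just described can be summarized in a short Python sketch. This is an illustrative outline only: `net` stands for the adaptive subject-sensitive Siamese branch and `update_fn` for the initial-frame mask/gradient update procedure; both are assumed placeholders, not the patent's reference implementation.

import torch
import torch.nn.functional as F

CONF_THRESHOLD = 0.7  # confidence threshold given in the description

def correlate(z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Cross-correlate template features z over search features x (SiamFC-style)."""
    return F.conv2d(x, z)

def track_sequence(frames, template, net, update_fn):
    """Confidence-gated online tracking loop (illustrative sketch)."""
    update_fn(net, template, frames[0])               # initial-frame model update
    responses = []
    for frame in frames[1:]:
        response = correlate(net(template), net(frame))   # 1x1x17x17 map
        if response.max().item() < CONF_THRESHOLD:        # result unreliable
            update_fn(net, template, frame)               # online model update
            response = correlate(net(template), net(frame))
        responses.append(response)                # later upsampled to 255x255
    return responses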
(2) Each layer of the network designed by the invention corresponds to a stage of the subject-sensitive visual target tracking pipeline and has its own physical meaning. As shown in Fig. 3, the network structure of the invention comprises a Siamese tracking structure, two adaptive attention modules, and a target template image mask fusion structure. In the forward propagation step, the forward tracking submodule is composed of the Siamese tracking structure and the two adaptive attention modules. In the back-propagation step, the target template image mask fusion structure serves as the back-propagation submodule. The Siamese tracking structure comprises two forward branches; each branch has 5 convolutional layers, 5 normalization layers, and 5 max-pooling layers, and the two branches are trained jointly with shared weights. An adaptive attention module is added after each of the first four downsampling layers of the template branch of the Siamese tracking structure. The adaptive attention module is a combination of two structures that exploit the channel and spatial dimensions of the features respectively: a channel-adaptive attention structure, composed of 1 average-pooling layer, 1 max-pooling layer, and 2 convolutional layers, and a spatial-adaptive attention structure, composed of 1 convolutional layer. The target template image mask fusion structure is a combination of normalization and elementwise mathematical operations; a sketch of the attention blocks follows.
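The layer budget quoted above (average pool plus max pool plus 2 convolutions for the channel branch, 1 convolution for the spatial branch) matches CBAM-style attention. The sketch below is one plausible realization under that assumption; the reduction ratio and the 7 x 7 spatial kernel are choices of this sketch, not values given in the patent.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention from 1 avg-pool, 1 max-pool and 2 conv layers
    (a CBAM-like sketch; the exact arrangement is an assumption)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                  # the 2 convolutional layers
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # avg pooling
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # max pooling
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention built from a single convolutional layer."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(stats))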
In the image processing of a convolutional neural network, convolutional layers are connected by convolution filters. A filter bank is specified as D × C × W × H, where C is the number of channels of the filtered image, W and H are the width and height of the filter window, and D is the number of filter types. For example, 20 × 3 × 5 × 5 denotes 20 filters, each 5 × 5 pixels, applied to a 3-channel input image.
(3) During the subject-sensitive visual target tracking of the invention, the input and output feature maps of each convolutional layer change as follows:
In the forward tracking process of the invention, the template-frame input is a 3 × 127 × 127 image. In the first convolutional layer it passes 96 kernels of 11 × 11, then a normalization layer with 96 output channels, then max pooling, giving a 96 × 29 × 29 feature map. The second convolutional layer takes the 96 × 29 × 29 feature map, applies 256 kernels of 5 × 5 to produce a 256 × 25 × 25 map, then a normalization layer with 256 output channels, then max pooling, giving 256 × 12 × 12. The third convolutional layer takes 256 × 12 × 12, applies 384 kernels of 3 × 3, normalization with 384 output channels, then max pooling, giving 384 × 10 × 10. The fourth convolutional layer takes 384 × 10 × 10, applies 384 kernels of 3 × 3, normalization with 384 output channels, then max pooling, giving 384 × 8 × 8. The fifth convolutional layer takes 384 × 8 × 8, applies 256 kernels of 3 × 3 and normalization with 256 output channels, then max pooling, giving 256 × 6 × 6. The current-frame input is a 3 × 255 × 255 image: the first convolutional layer applies 96 kernels of 11 × 11, normalization with 96 output channels, and max pooling, giving 96 × 61 × 61; the second convolutional layer takes 96 × 61 × 61, applies 256 kernels of 5 × 5, normalization with 256 output channels, and max pooling, giving 256 × 28 × 28; the third convolutional layer takes 256 × 28 × 28, applies 384 kernels of 3 × 3, normalization with 384 output channels, and max pooling, giving 384 × 26 × 26; the fourth convolutional layer takes 384 × 26 × 26, applies 384 kernels of 3 × 3, normalization with 384 output channels, and max pooling, giving 384 × 24 × 24; the fifth convolutional layer takes 384 × 24 × 24, applies 256 kernels of 3 × 3 and normalization with 256 output channels, and max pooling, giving 256 × 22 × 22. The template-frame output is then used as a convolution kernel and convolved with the current-frame output features to obtain the 1 × 17 × 17 tracking regression response (verified in the sketch below).
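A PyTorch sketch of one branch reproducing the feature-map sizes listed above. The description mentions a pooling layer per stage, but the quoted sizes only shrink after the first two stages, so pooling is applied there; the ReLU activations and BatchNorm are assumptions of this sketch standing in for the unspecified normalization layers.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseBranch(nn.Module):
    """One shared branch; comments give the sizes for the 127x127 template."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=2), nn.BatchNorm2d(96),
            nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),  # 96 x 29 x 29
            nn.Conv2d(96, 256, 5), nn.BatchNorm2d(256),
            nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),  # 256 x 12 x 12
            nn.Conv2d(256, 384, 3), nn.BatchNorm2d(384),
            nn.ReLU(inplace=True),                             # 384 x 10 x 10
            nn.Conv2d(384, 384, 3), nn.BatchNorm2d(384),
            nn.ReLU(inplace=True),                             # 384 x 8 x 8
            nn.Conv2d(384, 256, 3), nn.BatchNorm2d(256),       # 256 x 6 x 6
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)

branch = SiameseBranch()                    # weights shared by both branches
z = branch(torch.randn(1, 3, 127, 127))     # template: 1 x 256 x 6 x 6
x = branch(torch.randn(1, 3, 255, 255))     # search:   1 x 256 x 22 x 22
response = F.conv2d(x, z)                   # 1 x 1 x 17 x 17 regression response
print(z.shape, x.shape, response.shape)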
In the back-propagation process of the invention, starting from the 1 × 17 × 17 tracking regression response, the gradient of the response with respect to the first convolutional layer of the template image is computed by the chain rule, giving an adaptive target mask of size 96 × 29 × 29; after the gradient response is normalized, it is multiplied elementwise with the output response of the first convolutional layer.
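A minimal autograd sketch of this mask construction, assuming the `SiameseBranch` above; the scalar objective `response.sum()` stands in for the tracking loss, and the min-max normalization follows Eq. (10) of the embodiment. It is illustrative, not the patent's exact procedure.

import torch
import torch.nn.functional as F

def adaptive_template_mask(branch, template, search):
    """Gradient-derived 96x29x29 template mask fused with the conv1 output."""
    stage1, rest = branch.features[:4], branch.features[4:]
    feat1 = stage1(template)               # 1 x 96 x 29 x 29 conv1-stage output
    feat1.retain_grad()                    # keep the gradient of this tensor
    z = rest(feat1)                        # 1 x 256 x 6 x 6 template features
    x = branch(search)                     # 1 x 256 x 22 x 22 search features
    response = F.conv2d(x, z)              # 1 x 1 x 17 x 17 tracking response
    response.sum().backward()              # chain rule back to the conv1 output
    g = feat1.grad
    g = (g - g.min()) / (g.max() - g.min() + 1e-8)  # normalize the gradient map
    return feat1.detach() * g              # elementwise (dot-product) fusion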
The offline part proceeds as follows:
(1) Training sample database generation: first, the training samples are processed according to the network structure; using the manually annotated tracking boxes provided with the dataset, the target position is randomly offset by several pixels, and the images are cropped and compressed. All generated images are re-encoded so that the training samples can be regrouped according to the demands of different stages;
(2) Forward tracking training. The template image and a training image are input, adaptive subject-sensitive Siamese features are extracted from both, and the tracking regression response is generated to obtain the similarity matrix;
(3) Back-propagation computation. From the matching result and regression loss of the current input image pair, the gradient map of the pair is computed by the chain rule;
(4) Gradient loss term computation. For target and background respectively, the gradient loss term is computed from the variance of the gradient and the pixel intensity values;
(5) Target template image mask generation. The adaptive attention mask of the template frame with respect to the output of the first convolutional layer of the network is established; the generated target template image adaptive attention mask is normalized and multiplied elementwise with the layer output, improving the expressive power of foreground features;
(6) Network model training and model acquisition. According to the currently obtained regression loss and gradient loss, the network parameters are updated by gradient descent. Loss supervision is established separately for the forward tracking network and the target template image mask, and the problem of minimizing the loss function is solved by the usual gradient descent method. The forward tracking network uses the tracking regression loss as the error criterion and adjusts the optimal network parameters by stochastic gradient descent, with a base learning rate of 0.0001 halved every epoch; weight decay is set to 0.0005 and momentum to 0.9. The target template image mask training stage uses stochastic gradient descent on the gradient loss term and the tracking regression loss to adjust the network parameters, with a base learning rate of 0.001 halved every epoch; weight decay is 0.0005 and momentum 0.9. Finally, through repeated iteration, training stops when the preset maximum number of iterations (50 epochs) is reached, giving the network model.
The online part proceeds as follows:
(1) The template image and the current image are input for forward feature extraction through the network; if this is the first-frame input, go to step (2); if not, go to step (3);
(2) Model updating. First, forward tracking is carried out: the template image and the current image are passed forward through the network for feature extraction, giving the matching result and tracking regression loss of the current input image pair. Next, back-propagation is carried out: from the matching result and tracking regression loss, the gradient map of the current input image pair is computed by the chain rule. The gradient loss term is then computed, and the extracted gradient map is normalized. Finally, the target template image mask is generated: the target template image adaptive attention mask of the template frame with respect to the first convolutional layer output is established, normalized, and multiplied elementwise with the layer output, giving an adaptively enhanced template representation. The update is iterated until the current total input loss falls below a specific threshold λ (set to 80% of the initial-frame total loss);
(3) Online tracking. The template image and the current image are input and the similarity matrix is obtained, the template-image branch being reinforced by the adaptive attention mask; the confidence of the current tracking result is computed from the obtained similarity matrix. If it is below the threshold α (set to 0.7), the current tracking result is deemed unreliable and step (2) is performed to update the model parameters;
(4) Target region localization. Bicubic interpolation turns the 17 × 17 matrix into a 255 × 255 matrix, from which the tracking result is determined;
(5) Steps (3) to (4) are repeated until the last frame of the image sequence.
Features of the invention:
The invention proposes a visual target tracking method based on adaptive subject sensitivity that adapts better to robust tracking of targets under appearance variation. First, the invention designs a gradient loss term that increases the algorithm's discrimination between the foreground and background regions of an image frame. Second, the invention uses back-propagation gradients to characterize the importance of the underlying convolution filters and, by selecting highly activated features, builds a mechanism that automatically generates target template image masks, yielding a deep network model that adaptively attends to the target subject and improves the expressive power of foreground features. Finally, channel- and spatial-dimension sensitive modules strengthen the capture of foreground features within the structure of the deep network. The network uses a slender two-stream input architecture; following the idea of transfer learning, the deep network is trained by fine-tuning a pre-trained model, alleviating the problems of vanishing and exploding gradients.
Brief description of the drawings:
Fig. 1 is a flow diagram of the visual target tracking method based on adaptive subject sensitivity according to the invention;
Fig. 2 is a flow chart of the offline part of the invention;
Fig. 3 is a flow chart of the online part of the invention.
Detailed description of embodiments:
Embodiments of the invention are described in detail below with reference to the accompanying drawings:
A visual target tracking method based on adaptive subject sensitivity; the overall flow is shown in Fig. 1. The algorithm is divided into an offline part and an online part, whose flow charts are shown in Fig. 2 and Fig. 3 respectively. In the offline part, image pairs are first generated from the training sample set, serving as template image and current image respectively; the two inputs are fed into the constructed adaptive subject-sensitive Siamese network for feature extraction, the tracking regression response is generated, and the tracking regression loss function is computed; the back-propagation gradient is then computed by the chain rule, and the target template image mask superimposed on the network model output is generated; finally, the tracking regression loss and the back-propagation gradient loss term are computed, and the optimal model is solved by back-propagation, with a maximum of 50 training epochs. In the online part, when processing the initial frame, the template image and the current image are input, adaptive subject-sensitive Siamese features are extracted from both, and the tracking regression response is generated; the back-propagation gradient is then computed, and the target template image mask superimposed on the network model output is generated; finally, the tracking regression response and the back-propagation gradient loss term are iterated to solve for the optimal model. When processing subsequent frames, the template image and the current image are passed through the fine-tuned adaptive subject-sensitive Siamese network for feature extraction, generating the tracking regression response; the confidence of the tracking regression response is computed, the tracking result of the current image is output if the confidence is high, and the initial-frame procedure is repeated to update the network model online if the confidence is low.
The offline part proceeds in detail as follows:
(1) Training sample database generation: first, the training samples are processed according to the network structure. Using the manually annotated tracking boxes provided with the dataset and centering on the search region, the target position is randomly offset by several pixels; the image is cropped and compressed to 3 × 255 × 255. The scaling and padding of the image satisfy:
s(w + 2p) × s(h + 2p) = A (1)
where s is the scale factor, w the width, h the height, p the padding margin, and A the final image area. Regions beyond the image boundary are filled with the pixel average, keeping the target aspect ratio unchanged. From the 4500 videos of ILSVRC15 (Large Scale Visual Recognition Challenge, 2015), 4417 videos are selected, providing more than 2 million annotated tracking boxes as the training set. All generated images are re-encoded so that the training samples can be regrouped according to the demands of different stages; training runs for 50 epochs, each with 50,000 sample pairs.
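A sketch of the crop-and-scale rule of Eq. (1), assuming the SiamFC-style context margin p = (w + h)/4 (the patent does not fix p) and OpenCV for resizing; out-of-image regions are filled with the pixel mean as described.

import numpy as np
import cv2  # OpenCV, used here only for resizing

def crop_like_eq1(img, cx, cy, w, h, out_size=255):
    """Crop a target-centered patch satisfying s(w+2p) x s(h+2p) = A (Eq. 1)."""
    p = (w + h) / 4.0                                 # assumed padding choice
    side = int(round(np.sqrt((w + 2 * p) * (h + 2 * p))))  # square context
    mean = img.mean(axis=(0, 1))                      # per-channel pixel mean
    x0, y0 = int(cx - side // 2), int(cy - side // 2)
    patch = np.full((side, side, 3), mean, dtype=img.dtype)
    # intersection of the crop window with the image; the rest stays mean-filled
    ix0, iy0 = max(x0, 0), max(y0, 0)
    ix1, iy1 = min(x0 + side, img.shape[1]), min(y0 + side, img.shape[0])
    patch[iy0 - y0:iy1 - y0, ix0 - x0:ix1 - x0] = img[iy0:iy1, ix0:ix1]
    return cv2.resize(patch, (out_size, out_size))    # scale so the area is A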
(2) Forward tracking training. The template image and a training image are input, adaptive subject-sensitive Siamese features are extracted from both, and the tracking regression response is generated to obtain the similarity matrix:
v = φ(z) ∗ φ(x) + b1 (2)
where z is the input template image, x the training image, φ(·) the Siamese feature mapping, ∗ cross-correlation, and b1 an offset; positive and negative sample pairs are trained discriminatively. The tracking regression loss is defined as:
Ltrack = l(y, v) = log(1 + exp(−y·v)) (3)
where y ∈ {+1, −1} denotes the ground-truth label and v the actual similarity score of the template image and the training image. Averaged over the score map,
L(y, v) = (1/|D|) Σu∈D l(y[u], v[u]) (4)
where D denotes the finally obtained similarity score map and u ranges over all positions in the similarity score map.
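A sketch of the loss of Eqs. (3)-(4) on the 17 × 17 score map; the positive-label radius used in the usage lines is an assumption.

import torch
import torch.nn.functional as F

def tracking_loss(response: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Logistic tracking loss of Eqs. (3)-(4): mean of log(1 + exp(-y*v)).

    `response` is the 1x1x17x17 score map v; `labels` is a map of +1/-1.
    """
    return F.softplus(-labels * response).mean()  # log(1+exp(-y*v)), averaged

# usage sketch: center of the 17x17 map positive, remainder negative
labels = -torch.ones(1, 1, 17, 17)
labels[..., 6:11, 6:11] = 1.0   # positive region size is an assumption
loss = tracking_loss(torch.randn(1, 1, 17, 17), labels)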
(3) Back-propagation computation. From the matching result and regression loss of the current input image pair, the gradient map of the pair is computed by the chain rule:
∂Ltrack/∂Xin = (∂Ltrack/∂Xo) · (∂Xo/∂Xin) (5)
where Xo is the output prediction, Xin the input feature, and Ltrack the tracking regression loss. Using the gradient of the regression loss, the filters related to the target and its context can be found.
(4) Back-propagation gradient loss term computation. The extracted gradient maps are normalized with the following standardization:
Âp = (Ap − μ)/σ (6)
Ân = (An − μ)/σ (7)
where Ap and An are the gradient maps of the positive and negative samples respectively, and σ and μ are the standard deviation and mean respectively. The loss term and total loss are:
Lback = y·R(y=1) + (1 − y)·R(y=0) (8)
Ltotal = Ltrack + β·Lback (9)
where Lback denotes the back-propagation gradient loss term, R(y=1) and R(y=0) denote the regularization terms of the positive and negative samples, Ltotal denotes the total loss function, and β is the loss fusion parameter, β = 2.
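A sketch of Eqs. (6)-(9). The concrete form of the regularization terms R(y=1) and R(y=0) is not spelled out in the text; here the mean squared standardized gradients of each class stand in for them, which is an assumption of this sketch.

import torch

def gradient_loss(grad_map: torch.Tensor, labels: torch.Tensor,
                  beta: float = 2.0, track_loss: torch.Tensor = None):
    """Back-propagation gradient loss term (illustrative sketch)."""
    pos, neg = grad_map[labels > 0], grad_map[labels < 0]
    a_p = (pos - pos.mean()) / (pos.std() + 1e-8)   # Eq. (6)
    a_n = (neg - neg.mean()) / (neg.std() + 1e-8)   # Eq. (7)
    l_back = a_p.pow(2).mean() + a_n.pow(2).mean()  # Eq. (8), assumed R form
    if track_loss is None:
        return l_back
    return track_loss + beta * l_back               # Eq. (9), beta = 2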
(5) Target template image mask generation. The target template image adaptive attention mask of the template frame with respect to the output of the first convolutional layer of the network is established; the generated adaptive attention mask is normalized and multiplied elementwise with the layer output, giving an adaptively enhanced template representation. The normalization formula is:
X′ = (x − Xmin)/(Xmax − Xmin) (10)
where Xmin is the minimum response value and Xmax the maximum response value.
(6) Network model training and model acquisition. According to the currently obtained regression loss and gradient loss, the network parameters are updated by gradient descent. Loss supervision is established separately for the forward tracking network and the target template image mask, and the problem of minimizing the loss function is solved by the usual gradient descent method. The forward tracking network uses the tracking regression loss as the error criterion and adjusts the optimal network parameters by stochastic gradient descent, with a base learning rate of 0.0001 halved every epoch; weight decay is set to 0.0005 and momentum to 0.9. The target template image mask training stage uses stochastic gradient descent on the gradient loss term and the tracking regression loss to adjust the network parameters, with a base learning rate of 0.001 halved every epoch; weight decay is 0.0005 and momentum 0.9. Finally, through repeated iteration, training stops when the preset maximum number of iterations (50 epochs) is reached, giving the network model. These settings map directly onto the optimizer sketch below.
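A sketch of the stochastic gradient descent configuration quoted above; `net` and `mask_params` are assumed handles to the forward tracking network and the mask training stage.

import torch

def make_optimizers(net: torch.nn.Module, mask_params):
    """SGD setups and per-epoch halving schedules from the description."""
    opt_track = torch.optim.SGD(net.parameters(), lr=1e-4,
                                momentum=0.9, weight_decay=5e-4)
    opt_mask = torch.optim.SGD(mask_params, lr=1e-3,
                               momentum=0.9, weight_decay=5e-4)
    # base learning rate halves after every training epoch
    sched_track = torch.optim.lr_scheduler.StepLR(opt_track, step_size=1, gamma=0.5)
    sched_mask = torch.optim.lr_scheduler.StepLR(opt_mask, step_size=1, gamma=0.5)
    return opt_track, opt_mask, sched_track, sched_mask
    # training stops after the preset maximum of 50 epochs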
The online part proceeds in detail as follows:
(1) The template image and the current image are input for forward feature extraction through the network; if this is the first-frame input, go to step (2); if not, go to step (3);
(2) Model updating. First, forward tracking is carried out: the template image and the current image are passed forward through the network for feature extraction, giving the matching result and regression loss of the current input image pair. Next, back-propagation is carried out: from the matching result and regression loss, the gradient map of the current input image pair is computed by the chain rule. The gradient loss term is then computed and the extracted gradient map is normalized, with the same formulas as in the offline part. Finally, the target template image mask is generated: the target template image adaptive attention mask of the template frame with respect to the first convolutional layer output is established, normalized, and multiplied elementwise with the layer output, giving an adaptively enhanced template representation. The update is iterated until the current total input loss falls below a specific threshold λ (set to 80% of the initial-frame total loss);
(3) Online tracking. The template image and the current image are input and the similarity matrix is obtained, the template-image branch being reinforced by the adaptive attention mask; the confidence of the current tracking result is computed from the obtained similarity matrix. If it is below the threshold α (set to 0.7), the current tracking result is deemed unreliable and step (2) is performed to update the model parameters;
(4) Target region localization. Bicubic interpolation is used: the value of each pixel x to be computed is obtained by weighting its neighboring pixels, two on each side in each direction:
f(x) = Σi W(si)·f(xi) (11)
where si is the distance from x to the neighboring pixel xi, and the cubic interpolation basis function takes the following values for different s:
W(s) = (a + 2)|s|³ − (a + 3)|s|² + 1 for |s| ≤ 1; W(s) = a|s|³ − 5a|s|² + 8a|s| − 4a for 1 < |s| < 2; W(s) = 0 otherwise (12)
with a the kernel parameter (commonly set to −0.5). The 17 × 17 matrix is thereby enlarged to a 255 × 255 matrix, localizing the target region; a sketch of this step follows;
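A sketch of the localization step: bicubic upsampling of the 17 × 17 response to 255 × 255 followed by peak extraction. PyTorch's built-in bicubic mode implements the kernel of Eq. (12).

import torch
import torch.nn.functional as F

def locate_target(response: torch.Tensor, out_size: int = 255):
    """Upsample the 1x1x17x17 response with bicubic interpolation and
    return the peak position as the target location in the enlarged map."""
    up = F.interpolate(response, size=(out_size, out_size),
                       mode='bicubic', align_corners=False)
    peak = torch.nonzero(up == up.max())[0]   # indices (batch, ch, row, col)
    return int(peak[-2]), int(peak[-1]), up   # row, col in the 255x255 map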
(5) Steps (3) to (4) are repeated until the last frame of the image sequence.
The object of the invention is to overcome the defects of the prior art by proposing a deep-learning-based adaptive target foreground/background separation model, constructing a deep representation of target appearance under cluttered background, and realizing a real-time visual target tracking method based on target subject analysis. Because the invention designs a gradient loss term, the discrimination between foreground and background is increased. Target template image masks are generated automatically at the same time, improving the expressive power of foreground features. The invention uses features extracted by a Siamese network with adaptive attention modules added, so the features carry more semantic information, enriching their expressive power and in turn allowing higher tracking accuracy. Compared with visual target tracking methods that use high-level deep features, embedding the attention modules used here into the convolutional feature extraction network improves the feature extraction ability of the network model without significantly increasing computation or parameter count; it both saves computation and loses no semantic information. The invention therefore strikes a good trade-off between tracking accuracy and tracking speed, obtaining excellent tracking performance.
In summary, the invention balances tracking accuracy and tracking speed, is practical and widely applicable, and has high application and promotion value.
It will be evident to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from the spirit or essential characteristics of the invention. The embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description; all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced therein, and any reference signs in the claims shall not be construed as limiting the claims concerned. Furthermore, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity, and those skilled in the art should take the specification as a whole, the technical solutions in the various embodiments also being combinable as appropriate to form other embodiments understandable to those skilled in the art.

Claims (3)

1. A visual target tracking method based on adaptive subject sensitivity, characterized by comprising an overall pipeline, an offline part, and an online part;
Overall pipeline: the target tracking procedure is designed first; the network structure is then designed according to this procedure; finally, the feature maps of each stage of the network are adjusted to adaptive sizes, completing the end-to-end Siamese tracking process;
Offline part, comprising 6 steps: training sample database generation; forward tracking training; back-propagation gradient computation; gradient loss term computation; target template image mask generation; network model training and model acquisition; wherein the network model training and acquisition stage includes the choice of the tracking loss function, the gradient loss term, and the gradient descent method;
Online part, comprising 3 steps: model updating; online tracking; target region localization; wherein model updating includes forward tracking, back-propagation gradient computation, gradient loss term computation, and target template image mask generation; and online tracking includes forward tracking to obtain the similarity matrix, computing the confidence of the current tracking result, and regressing the target region;
The overall pipeline comprises the following specific steps:
(1) the online pipeline of subject-sensitive visual target tracking comprises online updating and online tracking; after the network model has been adjusted by offline pre-training, the visual target tracking flow is as follows:
when processing the initial frame, the template image and the current image are input, adaptive subject-sensitive Siamese features are extracted from both, and the tracking regression response is generated to obtain the similarity matrix; the back-propagation gradient map and the back-propagation gradient loss term are then computed, and the optimal model is solved by optimization; when processing subsequent frames, the template image and the current image are input and features are extracted by the adaptive subject-sensitive Siamese network, generating the tracking regression response; the confidence of the tracking regression response is then computed; if the confidence is greater than or equal to 0.7, the tracking result of the current image is output; if the confidence is below 0.7, the initial-frame procedure is repeated to update the network model online;
(2) the network structure comprises a Siamese tracking structure, two adaptive attention modules, and a target template image mask fusion structure; wherein, in the forward propagation step, the forward tracking submodule is composed of the Siamese tracking structure and the two adaptive attention modules; in the back-propagation step, the target template image mask fusion structure serves as the back-propagation submodule; the Siamese tracking structure comprises two forward branches, each with 5 convolutional layers, 5 normalization layers, and 5 max-pooling layers, the two branches being trained jointly with shared weights; an adaptive attention module is added after each of the first four downsampling layers of the template branch of the Siamese tracking structure;
the adaptive attention module is a combination of two structures that exploit the channel and spatial dimensions of the features respectively: a channel-adaptive attention structure, composed of 1 average-pooling layer, 1 max-pooling layer, and 2 convolutional layers, and a spatial-adaptive attention structure, composed of 1 convolutional layer; the target template image mask fusion structure is a combination of normalization and elementwise mathematical operations;
in the image processing of a convolutional neural network, convolutional layers are connected by convolution filters; a filter bank is specified as D × C × W × H, where C is the number of channels of the filtered image, W and H are the width and height of the filter window, and D is the number of filter types;
(3) during subject-sensitive visual target tracking, the input and output feature maps of each convolutional layer change as follows:
in the forward tracking process, the template-frame input is a 3 × 127 × 127 image; in the first convolutional layer it passes 96 kernels of 11 × 11, then a normalization layer with 96 output channels, then max pooling, giving a 96 × 29 × 29 feature map; the second convolutional layer takes the 96 × 29 × 29 feature map, applies 256 kernels of 5 × 5 to produce a 256 × 25 × 25 map, then a normalization layer with 256 output channels, then max pooling, giving 256 × 12 × 12; the third convolutional layer takes 256 × 12 × 12, applies 384 kernels of 3 × 3, normalization with 384 output channels, then max pooling, giving 384 × 10 × 10; the fourth convolutional layer takes 384 × 10 × 10, applies 384 kernels of 3 × 3, normalization with 384 output channels, then max pooling, giving 384 × 8 × 8; the fifth convolutional layer takes 384 × 8 × 8, applies 256 kernels of 3 × 3 and normalization with 256 output channels, then max pooling, giving 256 × 6 × 6; the current-frame input is a 3 × 255 × 255 image; in the first convolutional layer it passes 96 kernels of 11 × 11, normalization with 96 output channels, and max pooling, giving 96 × 61 × 61; the second convolutional layer takes 96 × 61 × 61, applies 256 kernels of 5 × 5, normalization with 256 output channels, and max pooling, giving 256 × 28 × 28; the third convolutional layer takes 256 × 28 × 28, applies 384 kernels of 3 × 3, normalization with 384 output channels, and max pooling, giving 384 × 26 × 26; the fourth convolutional layer takes 384 × 26 × 26, applies 384 kernels of 3 × 3, normalization with 384 output channels, and max pooling, giving 384 × 24 × 24; the fifth convolutional layer takes 384 × 24 × 24, applies 256 kernels of 3 × 3 and normalization with 256 output channels, and max pooling, giving 256 × 22 × 22; the template-frame output is used as a convolution kernel and convolved with the current-frame output features to obtain the 1 × 17 × 17 tracking regression response;
in the back-propagation process, starting from the 1 × 17 × 17 tracking regression response, the gradient of the response with respect to the first convolutional layer of the template image is computed by the chain rule, giving an adaptive target mask of size 96 × 29 × 29; after the gradient response is normalized, it is multiplied elementwise with the output response of the first convolutional layer.
2. The visual target tracking method based on adaptive subject sensitivity according to claim 1, characterized in that the offline part comprises the following specific steps:
(1) training sample database generation: first, the training samples are processed according to the network structure; using the manually annotated tracking boxes provided with the dataset, the target position is randomly offset by several pixels, and the images are cropped and compressed; all generated images are re-encoded;
(2) forward tracking training: the template image and a training image are input, adaptive subject-sensitive Siamese features are extracted from both, and the tracking regression response is generated, giving the similarity matrix;
(3) back-propagation computation: from the matching result and regression loss of the current input image pair, the gradient map of the pair is computed by the chain rule;
(4) gradient loss term computation: for target and background respectively, the gradient loss term is computed from the variance of the gradient and the pixel intensity values;
(5) target template image mask generation: the adaptive attention mask of the template frame with respect to the output of the first convolutional layer of the network is established; the generated target template image adaptive attention mask is normalized and multiplied elementwise with the layer output;
(6) network model training and model acquisition: according to the currently obtained regression loss and gradient loss, the network parameters are updated by gradient descent; loss supervision is established separately for the forward tracking network and the target template image mask, and the problem of minimizing the loss function is solved by the usual gradient descent method; the forward tracking network uses the tracking regression loss as the error criterion and adjusts the optimal network parameters by stochastic gradient descent, with a base learning rate of 0.0001 halved every epoch, weight decay 0.0005, and momentum 0.9; the target template image mask training stage uses stochastic gradient descent on the gradient loss term and the tracking regression loss to adjust the network parameters, with a base learning rate of 0.001 halved every epoch, weight decay 0.0005, and momentum 0.9; finally, through repeated iteration, training stops when the preset maximum number of iterations is reached, giving the network model.
3. The visual target tracking method based on adaptive subject sensitivity according to claim 1, characterized in that the online part comprises the following specific steps:
(1) the template image and the current image are input for forward feature extraction through the network; if this is the first-frame input, go to step (2); if not, go to step (3);
(2) model updating: first, forward tracking is carried out, the template image and the current image being passed forward through the network for feature extraction, giving the matching result and tracking regression loss of the current input image pair; next, back-propagation is carried out, the gradient map of the current input image pair being computed by the chain rule from the matching result and tracking regression loss; the gradient loss term is then computed, and the extracted gradient map is normalized; finally, the target template image mask is generated, the target template image adaptive attention mask of the template frame with respect to the first convolutional layer output being established, normalized, and multiplied elementwise with the layer output, giving an adaptively enhanced template representation; the update is iterated until the current total input loss falls below a specific threshold λ, λ being set to 80% of the initial-frame total loss;
(3) online tracking: the template image and the current image are input and the similarity matrix is obtained, the template-image branch being reinforced by the adaptive attention mask; the confidence of the current tracking result is computed from the obtained similarity matrix; if it is below the threshold α, the current tracking result is deemed unreliable, α being set to 0.7, and step (2) is performed to update the model parameters;
(4) target region localization: bicubic interpolation turns the 17 × 17 matrix into a 255 × 255 matrix, determining the tracking result;
(5) steps (3) to (4) are repeated until the last frame of the image sequence.
CN201910452144.3A 2019-05-28 2019-05-28 Visual target tracking method based on adaptive subject sensitivity Active CN110210551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910452144.3A CN110210551B (en) 2019-05-28 2019-05-28 Visual target tracking method based on adaptive subject sensitivity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910452144.3A CN110210551B (en) 2019-05-28 2019-05-28 Visual target tracking method based on adaptive subject sensitivity

Publications (2)

Publication Number Publication Date
CN110210551A (en) 2019-09-06
CN110210551B CN110210551B (en) 2021-07-30

Family

ID=67789101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910452144.3A Active CN110210551B (en) 2019-05-28 2019-05-28 Visual target tracking method based on adaptive subject sensitivity

Country Status (1)

Country Link
CN (1) CN110210551B (en)

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728694A (en) * 2019-10-10 2020-01-24 北京工业大学 Long-term visual target tracking method based on continuous learning
CN110782480A (en) * 2019-10-15 2020-02-11 哈尔滨工程大学 Infrared pedestrian tracking method based on online template prediction
CN110796679A (en) * 2019-10-30 2020-02-14 电子科技大学 Target tracking method for aerial image
CN110942471A (en) * 2019-10-30 2020-03-31 电子科技大学 Long-term target tracking method based on space-time constraint
CN110956131A (en) * 2019-11-27 2020-04-03 北京迈格威科技有限公司 Single-target tracking method, device and system
CN111062973A (en) * 2019-12-31 2020-04-24 西安电子科技大学 Vehicle tracking method based on target feature sensitivity and deep learning
CN111126515A (en) * 2020-03-30 2020-05-08 腾讯科技(深圳)有限公司 Model training method based on artificial intelligence and related device
CN111161311A (en) * 2019-12-09 2020-05-15 中车工业研究院有限公司 Visual multi-target tracking method and device based on deep learning
CN111192291A (en) * 2019-12-06 2020-05-22 东南大学 Target tracking method based on cascade regression and twin network
CN111191555A (en) * 2019-12-24 2020-05-22 重庆邮电大学 Target tracking method, medium and system combining high-low spatial frequency characteristics
CN111242974A (en) * 2020-01-07 2020-06-05 重庆邮电大学 Vehicle real-time tracking method based on twin network and back propagation
CN111415318A (en) * 2020-03-20 2020-07-14 山东大学 Unsupervised correlation filtering target tracking method and system based on jigsaw task
CN111429482A (en) * 2020-03-19 2020-07-17 上海眼控科技股份有限公司 Target tracking method and device, computer equipment and storage medium
CN111462173A (en) * 2020-02-28 2020-07-28 大连理工大学人工智能大连研究院 Visual tracking method based on twin network discriminant feature learning
CN111696136A (en) * 2020-06-09 2020-09-22 电子科技大学 Target tracking method based on coding and decoding structure
CN111723632A (en) * 2019-11-08 2020-09-29 珠海达伽马科技有限公司 Ship tracking method and system based on twin network
CN111950388A (en) * 2020-07-22 2020-11-17 上海市同仁医院 Vulnerable plaque tracking and identifying system and method
CN111968155A (en) * 2020-07-23 2020-11-20 天津大学 Target tracking method based on segmented target mask updating template
CN112037269A (en) * 2020-08-24 2020-12-04 大连理工大学 Visual moving target tracking method based on multi-domain collaborative feature expression
CN112084952A (en) * 2020-09-10 2020-12-15 湖南大学 Video point location tracking method based on self-supervision training
CN112232150A (en) * 2020-09-29 2021-01-15 天津大学 Target tracking method based on generation countermeasure
CN112465861A (en) * 2020-11-19 2021-03-09 西北工业大学 Relevant filtering visual target tracking method based on self-adaptive mask
CN112712546A (en) * 2020-12-21 2021-04-27 吉林大学 Target tracking method based on twin neural network
CN112766102A (en) * 2021-01-07 2021-05-07 武汉大学 Unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion
CN112801182A (en) * 2021-01-27 2021-05-14 安徽大学 RGBT target tracking method based on difficult sample perception
CN112802057A (en) * 2020-11-27 2021-05-14 杭州电子科技大学 Visual tracking method based on rotation self-adaptive convolution network
CN113096023A (en) * 2020-01-08 2021-07-09 字节跳动有限公司 Neural network training method, image processing method and device, and storage medium
CN113283407A (en) * 2021-07-22 2021-08-20 南昌工程学院 Twin network target tracking method based on channel and space attention mechanism
CN113344932A (en) * 2021-06-01 2021-09-03 电子科技大学 Semi-supervised single-target video segmentation method
CN113538522A (en) * 2021-08-12 2021-10-22 广东工业大学 Instrument vision tracking method for laparoscopic minimally invasive surgery
CN113538507A (en) * 2020-04-15 2021-10-22 南京大学 Single-target tracking method based on full convolution network online training
CN113870330A (en) * 2021-09-30 2021-12-31 四川大学 Twin visual tracking method based on specific label and loss function
CN114782488A (en) * 2022-04-01 2022-07-22 燕山大学 Underwater target tracking method based on channel perception
CN114817991A (en) * 2022-05-10 2022-07-29 上海计算机软件技术开发中心 Internet of vehicles image desensitization method and system
US20230115606A1 (en) * 2020-03-18 2023-04-13 Samsung Electronics Co., Ltd. Method and apparatus for tracking target
CN117911723A (en) * 2024-03-19 2024-04-19 苏州大学 Spherical permanent magnet track tracking method and system based on sub-pixel visual positioning
CN112016685B (en) * 2020-08-07 2024-06-07 广州小鹏自动驾驶科技有限公司 Data processing method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709875A (en) * 2016-12-30 2017-05-24 北京工业大学 Compressed low-resolution image restoration method based on combined deep network
US20190073564A1 (en) * 2017-09-05 2019-03-07 Sentient Technologies (Barbados) Limited Automated and unsupervised generation of real-world training data
CN108665485A (en) * 2018-04-16 2018-10-16 华中科技大学 A kind of target tracking method based on the fusion of correlation filtering and a twin convolutional network
CN108898620A (en) * 2018-06-14 2018-11-27 厦门大学 Target tracking method based on multiple twin neural networks and a region neural network
CN109191491A (en) * 2018-08-03 2019-01-11 华中科技大学 Target tracking method and system based on a fully convolutional twin network with multilayer feature fusion
CN109543559A (en) * 2018-10-31 2019-03-29 东南大学 Target tracking method and system based on a twin network and an action selection mechanism
CN109685831A (en) * 2018-12-20 2019-04-26 山东大学 Target tracking method and system based on residual hierarchical attention and correlation filters
CN109767456A (en) * 2019-01-09 2019-05-17 上海大学 A kind of target tracking method based on the SiameseFC framework and a PFP neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BIAO LENG et al.: "3D object understanding with 3D Convolutional Neural Networks", Information Sciences *
ZHOU Shijie et al.: "Target tracking algorithm based on dual Siamese networks and correlation filters", Proceedings of the 22nd Annual Conference on Computer Engineering and Technology and the 8th Microprocessor Technology Forum *

Cited By (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728694A (en) * 2019-10-10 2020-01-24 北京工业大学 Long-term visual target tracking method based on continuous learning
CN110728694B (en) * 2019-10-10 2023-11-24 北京工业大学 Long-time visual target tracking method based on continuous learning
CN110782480A (en) * 2019-10-15 2020-02-11 哈尔滨工程大学 Infrared pedestrian tracking method based on online template prediction
CN110782480B (en) * 2019-10-15 2023-08-04 哈尔滨工程大学 Infrared pedestrian tracking method based on online template prediction
CN110796679A (en) * 2019-10-30 2020-02-14 电子科技大学 Target tracking method for aerial image
CN110942471A (en) * 2019-10-30 2020-03-31 电子科技大学 Long-term target tracking method based on space-time constraint
CN110942471B (en) * 2019-10-30 2022-07-01 电子科技大学 Long-term target tracking method based on space-time constraint
CN110796679B (en) * 2019-10-30 2023-04-07 电子科技大学 Target tracking method for aerial image
CN111723632A (en) * 2019-11-08 2020-09-29 珠海达伽马科技有限公司 Ship tracking method and system based on twin network
CN111723632B (en) * 2019-11-08 2023-09-15 珠海达伽马科技有限公司 Ship tracking method and system based on twin network
CN110956131B (en) * 2019-11-27 2024-01-05 北京迈格威科技有限公司 Single-target tracking method, device and system
CN110956131A (en) * 2019-11-27 2020-04-03 北京迈格威科技有限公司 Single-target tracking method, device and system
CN111192291A (en) * 2019-12-06 2020-05-22 东南大学 Target tracking method based on cascade regression and twin network
CN111192291B (en) * 2019-12-06 2022-11-11 东南大学 Target tracking method based on cascade regression and twin network
CN111161311A (en) * 2019-12-09 2020-05-15 中车工业研究院有限公司 Visual multi-target tracking method and device based on deep learning
CN111191555B (en) * 2019-12-24 2022-05-03 重庆邮电大学 Target tracking method, medium and system combining high-low spatial frequency characteristics
CN111191555A (en) * 2019-12-24 2020-05-22 重庆邮电大学 Target tracking method, medium and system combining high-low spatial frequency characteristics
CN111062973A (en) * 2019-12-31 2020-04-24 西安电子科技大学 Vehicle tracking method based on target feature sensitivity and deep learning
CN111062973B (en) * 2019-12-31 2021-01-01 西安电子科技大学 Vehicle tracking method based on target feature sensitivity and deep learning
CN111242974A (en) * 2020-01-07 2020-06-05 重庆邮电大学 Vehicle real-time tracking method based on twin network and back propagation
CN111242974B (en) * 2020-01-07 2023-04-11 重庆邮电大学 Vehicle real-time tracking method based on twin network and back propagation
CN113096023B (en) * 2020-01-08 2023-10-27 字节跳动有限公司 Training method, image processing method and device for neural network and storage medium
CN113096023A (en) * 2020-01-08 2021-07-09 字节跳动有限公司 Neural network training method, image processing method and device, and storage medium
CN111462173B (en) * 2020-02-28 2023-11-17 大连理工大学人工智能大连研究院 Visual tracking method based on twin network discrimination feature learning
CN111462173A (en) * 2020-02-28 2020-07-28 大连理工大学人工智能大连研究院 Visual tracking method based on twin network discriminant feature learning
US20230115606A1 (en) * 2020-03-18 2023-04-13 Samsung Electronics Co., Ltd. Method and apparatus for tracking target
US11967088B2 (en) * 2020-03-18 2024-04-23 Samsung Electronics Co., Ltd. Method and apparatus for tracking target
CN111429482A (en) * 2020-03-19 2020-07-17 上海眼控科技股份有限公司 Target tracking method and device, computer equipment and storage medium
CN111415318A (en) * 2020-03-20 2020-07-14 山东大学 Unsupervised correlation filtering target tracking method and system based on jigsaw task
CN111415318B (en) * 2020-03-20 2023-06-13 山东大学 Unsupervised correlation filtering target tracking method and system based on jigsaw task
CN111126515A (en) * 2020-03-30 2020-05-08 腾讯科技(深圳)有限公司 Model training method based on artificial intelligence and related device
CN113538507B (en) * 2020-04-15 2023-11-17 南京大学 Single-target tracking method based on full convolution network online training
CN113538507A (en) * 2020-04-15 2021-10-22 南京大学 Single-target tracking method based on full convolution network online training
CN111696136B (en) * 2020-06-09 2023-06-16 电子科技大学 Target tracking method based on coding and decoding structure
CN111696136A (en) * 2020-06-09 2020-09-22 电子科技大学 Target tracking method based on coding and decoding structure
CN111950388B (en) * 2020-07-22 2024-04-05 上海市同仁医院 Vulnerable plaque tracking and identifying system and method
CN111950388A (en) * 2020-07-22 2020-11-17 上海市同仁医院 Vulnerable plaque tracking and identifying system and method
CN111968155B (en) * 2020-07-23 2022-05-17 天津大学 Target tracking method based on segmented target mask updating template
CN111968155A (en) * 2020-07-23 2020-11-20 天津大学 Target tracking method based on segmented target mask updating template
CN112016685B (en) * 2020-08-07 2024-06-07 广州小鹏自动驾驶科技有限公司 Data processing method and device
CN112037269A (en) * 2020-08-24 2020-12-04 大连理工大学 Visual moving target tracking method based on multi-domain collaborative feature expression
CN112084952A (en) * 2020-09-10 2020-12-15 湖南大学 Video point location tracking method based on self-supervision training
CN112084952B (en) * 2020-09-10 2023-08-15 湖南大学 Video point location tracking method based on self-supervision training
CN112232150A (en) * 2020-09-29 2021-01-15 天津大学 Target tracking method based on generation countermeasure
CN112465861B (en) * 2020-11-19 2024-05-10 西北工业大学 Correlation filtering visual target tracking method based on adaptive mask
CN112465861A (en) * 2020-11-19 2021-03-09 西北工业大学 Correlation filtering visual target tracking method based on adaptive mask
CN112802057A (en) * 2020-11-27 2021-05-14 杭州电子科技大学 Visual tracking method based on rotation self-adaptive convolution network
CN112802057B (en) * 2020-11-27 2024-02-09 杭州电子科技大学 Visual tracking method based on rotation self-adaptive convolution network
CN112712546A (en) * 2020-12-21 2021-04-27 吉林大学 Target tracking method based on twin neural network
CN112766102B (en) * 2021-01-07 2024-04-26 武汉大学 Unsupervised hyperspectral video target tracking method based on spatial spectrum feature fusion
CN112766102A (en) * 2021-01-07 2021-05-07 武汉大学 Unsupervised hyperspectral video target tracking method based on space-spectrum feature fusion
CN112801182B (en) * 2021-01-27 2022-11-04 安徽大学 RGBT target tracking method based on difficult sample perception
CN112801182A (en) * 2021-01-27 2021-05-14 安徽大学 RGBT target tracking method based on difficult sample perception
CN113344932B (en) * 2021-06-01 2022-05-03 电子科技大学 Semi-supervised single-target video segmentation method
CN113344932A (en) * 2021-06-01 2021-09-03 电子科技大学 Semi-supervised single-target video segmentation method
CN113283407A (en) * 2021-07-22 2021-08-20 南昌工程学院 Twin network target tracking method based on channel and space attention mechanism
CN113538522A (en) * 2021-08-12 2021-10-22 广东工业大学 Instrument vision tracking method for laparoscopic minimally invasive surgery
CN113538522B (en) * 2021-08-12 2022-08-12 广东工业大学 Instrument vision tracking method for laparoscopic minimally invasive surgery
CN113870330B (en) * 2021-09-30 2023-05-12 四川大学 Twin vision tracking method based on specific labels and loss function
CN113870330A (en) * 2021-09-30 2021-12-31 四川大学 Twin visual tracking method based on specific label and loss function
CN114782488A (en) * 2022-04-01 2022-07-22 燕山大学 Underwater target tracking method based on channel perception
CN114817991B (en) * 2022-05-10 2024-02-02 上海计算机软件技术开发中心 Internet of vehicles image desensitization method and system
CN114817991A (en) * 2022-05-10 2022-07-29 上海计算机软件技术开发中心 Internet of vehicles image desensitization method and system
CN117911723A (en) * 2024-03-19 2024-04-19 苏州大学 Spherical permanent magnet track tracking method and system based on sub-pixel visual positioning
CN117911723B (en) * 2024-03-19 2024-05-17 苏州大学 Spherical permanent magnet track tracking method and system based on sub-pixel visual positioning

Also Published As

Publication number Publication date
CN110210551B (en) 2021-07-30

Similar Documents

Publication Title
CN110210551A (en) A kind of visual target tracking method based on adaptive main body sensitivity
Jia et al. Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot
CN107967451B (en) Method for counting crowd of still image
Wang et al. Actionness estimation using hybrid fully convolutional networks
CN106446930B (en) Robot working scene recognition method based on deep convolutional neural networks
CN105678284B (en) A kind of fixed-position human behavior analysis method
CN114220035A (en) Rapid pest detection method based on improved YOLO V4
CN111507378A (en) Method and apparatus for training image processing model
US20210264144A1 (en) Human pose analysis system and method
CN108764308A (en) Pedestrian re-identification method based on convolutional recurrent network
CN109902798A (en) The training method and device of deep neural network
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN106599994A (en) Gaze estimation method based on deep regression network
CN112232355B (en) Image segmentation network processing method, image segmentation device and computer equipment
CN110781736A (en) Pedestrian re-identification method combining pose and attention based on a two-stream network
CN110956158A (en) Occluded pedestrian re-identification method based on a teacher-student learning framework
CN107506792A (en) A kind of semi-supervised salient object detection method
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN112396036A (en) Method for re-identifying occluded pedestrians by combining a spatial transformer network and multi-scale feature extraction
CN113763417A (en) Target tracking method based on twin network and residual error structure
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN108009512A (en) A kind of person re-identification method based on convolutional neural network feature learning
Tarimo et al. Real-time deep learning-based object detection framework
CN117576149A (en) Single-target tracking method based on attention mechanism
CN108765384B (en) Saliency detection method combining manifold ranking and improved convex hull

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant