CN110490906A - Real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network - Google Patents
Real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network
- Publication number: CN110490906A (application CN201910771090.7A)
- Authority: CN (China)
- Prior art keywords: network, LSTM, layer, sequence
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06N3/045 — Combinations of networks
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08 — Learning methods
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T2207/10016 — Video; image sequence
Abstract
A real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory (LSTM) network. For the video sequence to be tracked, each network input consists of two consecutive frames. The Siamese convolutional network extracts features from the two input frames: the convolution operations produce appearance and semantic features at different levels, and a fully connected layer concatenates the high- and low-level features into a deep feature. This deep feature is passed to a long short-term memory network containing two LSTM units for sequence modeling; the LSTM forget gate selectively activates target features at different positions in the sequence, and the output gate emits the state information of the current target. Finally, a fully connected layer receiving the LSTM output produces the predicted position coordinates of the target in the current frame and updates the search region for the next frame. Tracking speed is greatly increased while tracking stability and accuracy are maintained, so that the real-time performance of tracking is substantially improved.
Description
Technical field
The invention belongs to the field of computer vision and visual target tracking, and in particular relates to a real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network.
Background art
Visual target tracking is one of the most important problems in the field of computer vision and has a wide range of application scenarios, such as security surveillance, smart homes, and smart cities. Its main task is as follows: given the position and state of the tracked target in the first frame of a video sequence, find the position of the same target in every subsequent frame, from the second frame onward. Although target tracking has been widely studied, it remains limited by disturbing factors that may occur during tracking, such as target deformation, background occlusion, motion blur, and illumination changes; the stability, accuracy, and speed of visual target tracking are therefore still difficult to guarantee.
Current visual target tracking methods can be divided into two major classes by basic framework: traditional methods based on hand-crafted features and deep learning methods based on deep features. By modeling approach they can likewise be divided into two major classes: generative methods, represented by template matching, and discriminative methods, represented by frame-by-frame detection. The greatest advantage of traditional methods based on hand-crafted features is that the algorithms are compact and structurally simple, run efficiently, and are easy to debug and adapt. Their disadvantage is equally obvious: most methods using hand-crafted traditional features have low precision, the feature extraction is uncertain, and it is difficult to fully exploit the target's appearance for the tracking process. Therefore, after deep learning came to dominate the ImageNet image recognition and detection competitions, the target tracking field also began to adopt deep neural networks as its framework. In the VOT challenge held in 2015, Hyeonseob Nam et al. proposed MDNet, which performs discriminative tracking with convolutional features; it reached a surprising 94.8% precision on the test sequences, winning the challenge while substantially leading the other methods, and thereby opened up broad application of deep learning research in the target tracking field.
Although methods that use convolutional neural networks to extract target appearance features for tracking can reach relatively high accuracy, the appearance of the tracked target within a video sequence may change in ways that are hard to anticipate. Most existing deep-network-based methods therefore add online fine-tuning during tracking: while tracking a sequence, hundreds of different positive and negative samples are extracted around the target in each frame, and, on the basis of all samples extracted so far, the weight parameters of the fully connected layers responsible for the classification output of the convolutional network are continually updated. As is well known, the parameter scale of a deep neural network is enormous, and any small modification of the parameters forces the entire network to re-find its optimum under the current state. Because the computation involved in this process is far larger than in traditional methods, the time cost is very high; tracking methods based on deep features therefore generally suffer from slow tracking speed and have difficulty reaching real time.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the deficiencies of the prior art and provide a real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network. A long short-term memory network part containing two LSTM units replaces the multi-layer fully connected structure used in conventional tracking methods to implement the online fine-tuning (online finetune) strategy for appearance features, so that the sequential relationship between frames, which is characteristic of video data, is used effectively for sequence modeling. While tracking stability and accuracy are maintained, tracking speed is greatly increased and the real-time performance of tracking is improved.
The present invention provides a real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network, comprising the following steps:
Step S1: for the video sequence to be tracked, take two consecutive frames as the network input at each step;
Step S2: extract features from the two input frames with the Siamese convolutional network, obtaining appearance and semantic features at different levels through the convolution operations, and concatenating the high- and low-level features into a deep feature through a fully connected layer;
Step S3: pass the deep feature to a long short-term memory network containing two LSTM units for sequence modeling, in which the LSTM forget gate selectively activates target features at different positions in the sequence and the output gate emits the state information of the current target;
Step S4: through a fully connected layer receiving the LSTM output, output the predicted position coordinates of the target in the current frame, and update the search region of the next frame's target.
As a further technical solution of the present invention, in step S2 the Siamese convolutional network is composed of two convolutional branches, arranged in parallel, that are identical in number of layers, structure, convolution kernel size, pooling mode, and padding stride, and that share their weights. The layers comprise a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, and a third pooling layer. The kernel size and channel count of the first convolutional layer are 11*11*96, those of the second convolutional layer are 5*5*256, and those of the third, fourth, and fifth convolutional layers are 3*3*384; the filter size of the first, second, and third pooling layers is 3*3. The first convolutional layer and the first, second, and third pooling layers use valid padding; the second, third, fourth, and fifth convolutional layers use same padding. Input images are resized by the Siamese convolutional network to 227*227*3.
Further, the long short-term memory network part comprises two LSTM units: the first LSTM receives the convolutional feature input from the fully connected layer, and the second LSTM takes as input the concatenation of the first LSTM's output and the features of the Siamese convolutional network part. Combined with continuous, independent tracking video sequences, sequence data modeling is performed so that, for the same target in the same sequence under different conditions, the corresponding output of each sequence state is computed separately.
Further, the video sequence data sets include the ILSVRC2016 video object detection data set, the Amsterdam Library of Ordinary Videos, and non-natural video sequences. The ILSVRC2016 video object detection data set contains 3862 video sequences, 1122397 images, 1731913 labeled target bounding boxes, and 7911 target trajectories. The Amsterdam Library of Ordinary Videos contains 314 video sequences and 148319 images, each video sequence containing one specific target. The non-natural video sequences are constructed by artificial synthesis and data augmentation from 478807 static images selected from the ImageNet data set.
Further, the feed-forward mode of the two LSTM units is

z_t = h(W_z x_t + R_z y_{t-1} + b_z)
i_t = σ(W_i x_t + R_i y_{t-1} + P_i ⊙ c_{t-1} + b_i)
f_t = σ(W_f x_t + R_f y_{t-1} + P_f ⊙ c_{t-1} + b_f)
c_t = z_t ⊙ i_t + c_{t-1} ⊙ f_t
o_t = σ(W_o x_t + R_o y_{t-1} + P_o ⊙ c_t + b_o)
y_t = h(c_t) ⊙ o_t

where t is the frame index; x_t and y_{t-1} are the feature vectors of the current input frame and the previous output frame, respectively; W, R, and P are the weight coefficient matrices of the input, recurrent, and peephole connections, respectively; b is the bias vector; h is the hyperbolic tangent function; σ is the sigmoid function; and ⊙ denotes element-wise multiplication. z is the overall input of the LSTM unit, i is the input gate transmitted between the LSTM cells, o is the output gate of each LSTM cell, f is the forget gate, c is the cell state at each moment of the sequence, and y is the overall output of the LSTM. The forward pass produces the output vector y_t, which stores the target state of the current frame, and the cell state c_t of the LSTM while processing the current frame; both y_t and c_t are then passed to the cell as inputs when the next frame is processed, achieving forward propagation over the sequence data.
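As a concrete illustration of the feed-forward equations above, the following is a minimal NumPy sketch of a single peephole-LSTM forward step. It is not the patented implementation; the dimensions (8 inputs, 4 hidden units) and the random initialization are assumptions made purely for illustration:

```python
import numpy as np

def lstm_step(x, y_prev, c_prev, W, R, P, b):
    """One peephole-LSTM forward step following the feed-forward equations:
    z (block input), i/f/o gates, cell state c, output y.
    W: input weights, R: recurrent weights, P: peephole weights, b: biases;
    each is a dict keyed by 'z', 'i', 'f', 'o' ('z' has no peephole)."""
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = np.tanh(W['z'] @ x + R['z'] @ y_prev + b['z'])                   # block input
    i = sigma(W['i'] @ x + R['i'] @ y_prev + P['i'] * c_prev + b['i'])   # input gate
    f = sigma(W['f'] @ x + R['f'] @ y_prev + P['f'] * c_prev + b['f'])   # forget gate
    c = z * i + c_prev * f                                               # new cell state
    o = sigma(W['o'] @ x + R['o'] @ y_prev + P['o'] * c + b['o'])        # output gate
    y = np.tanh(c) * o                                                   # block output
    return y, c

# Toy dimensions (assumed): 8-dimensional input, 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_h = 8, 4
W = {k: rng.standard_normal((n_h, n_in)) * 0.1 for k in 'zifo'}
R = {k: rng.standard_normal((n_h, n_h)) * 0.1 for k in 'zifo'}
P = {k: rng.standard_normal(n_h) * 0.1 for k in 'ifo'}
b = {k: np.zeros(n_h) for k in 'zifo'}

y, c = np.zeros(n_h), np.zeros(n_h)
for t in range(3):        # y_t and c_t are fed back when the next frame is processed
    x = rng.standard_normal(n_in)
    y, c = lstm_step(x, y, c, W, R, P, b)
print(y.shape, c.shape)
```

Note how y and c are carried across loop iterations, which is exactly the forward propagation over sequence data described above.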
Further, when the video sequence to be tracked is input, the target position in the first frame of the sequence is given as a pair of top-left and bottom-right coordinates.
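For illustration, a minimal sketch of how such a corner-pair annotation might be handled: converting the given (top-left, bottom-right) pair to a center/size representation and expanding it into a search region for the next frame. The expansion factor of 2 is an assumed value, not one specified by the patent:

```python
def corners_to_center(x1, y1, x2, y2):
    """(top-left, bottom-right) corner pair -> (center_x, center_y, width, height)."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1)

def search_region(x1, y1, x2, y2, scale=2.0):
    """Expand the previous target box into the next frame's search region,
    keeping the same center (the scale factor is an assumed choice)."""
    cx, cy, w, h = corners_to_center(x1, y1, x2, y2)
    sw, sh = w * scale, h * scale
    return (cx - sw / 2, cy - sh / 2, cx + sw / 2, cy + sh / 2)

print(corners_to_center(10, 20, 50, 60))   # (30.0, 40.0, 40, 40)
print(search_region(10, 20, 50, 60))       # (-10.0, 0.0, 70.0, 80.0)
```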
The invention replaces the structure used in most other deep learning methods, in which multi-layer fully connected layers receive the appearance features extracted by the convolutional layers and an online fine-tuning (online finetune) strategy is used to improve robustness to changes in target appearance. Instead, a long short-term memory network part containing two LSTM units takes the place of the multi-layer fully connected structure and performs sequence modeling to handle the sequential relationships characteristic of video data. Against the defect noted in the background above — that tracking methods based on deep learning are generally slow and poorly suited to real time — tracking speed is greatly increased while a given level of tracking stability and accuracy is maintained, so that real-time tracking performance is substantially improved.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the method of the invention.
Specific embodiment
Referring to Fig. 1, this embodiment provides a real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network, comprising the following steps:
Step S1: for the video sequence to be tracked, take two consecutive frames as the network input at each step;
Step S2: extract features from the two input frames with the Siamese convolutional network, obtaining appearance and semantic features at different levels through the convolution operations, and concatenating the high- and low-level features into a deep feature through a fully connected layer;
Step S3: pass the deep feature to a long short-term memory network containing two LSTM units for sequence modeling, in which the LSTM forget gate selectively activates target features at different positions in the sequence and the output gate emits the state information of the current target;
Step S4: through a fully connected layer receiving the LSTM output, output the predicted position coordinates of the target in the current frame, and update the search region of the next frame's target.
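Steps S1–S4 can be sketched as the tracking loop below. Every function here is a hypothetical stub standing in for a trained network part (Siamese feature extraction, the two-unit LSTM, the output layer); this illustrates only the control flow, not the patented implementation:

```python
def siamese_features(frame_prev, frame_curr):
    # Stub: the Siamese branches would extract and concatenate
    # multi-level appearance/semantic features of both frames.
    return [float(a + b) for a, b in zip(frame_prev, frame_curr)]

def lstm_update(state, features):
    # Stub: the two stacked LSTM units would model the sequence here.
    return [s * 0.5 + f * 0.5 for s, f in zip(state, features)]

def predict_box(state):
    # Stub: a fully connected layer would map the LSTM output to
    # predicted (x1, y1, x2, y2) corner coordinates.
    return tuple(state[:4])

def track(frames, init_state):
    """S1: consume consecutive frame pairs; S2: Siamese features;
    S3: LSTM sequence modeling; S4: predict the box per frame."""
    state, boxes = init_state, []
    for prev, curr in zip(frames, frames[1:]):   # S1: consecutive pairs
        feats = siamese_features(prev, curr)     # S2
        state = lstm_update(state, feats)        # S3
        boxes.append(predict_box(state))         # S4 (search-region update omitted)
    return boxes

frames = [[1.0, 2.0, 3.0, 4.0]] * 4              # toy stand-ins for video frames
print(track(frames, [0.0, 0.0, 0.0, 0.0]))
```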
To perform real-time visual target tracking based on the Siamese convolutional network and long short-term memory network, a deep neural network for the tracking task is built first. The network mainly comprises two parts: a Siamese convolutional neural network part that resizes the input images and extracts convolutional features, and a long short-term memory network part that receives the target's appearance and semantic features at different levels and performs sequence modeling on the features from consecutive frames of the same video.
The Siamese convolutional network part is composed of two convolutional branches, arranged in parallel, that are identical in number of layers, structure, convolution kernel size, pooling mode, and padding stride, and that share their weights. The specific network structure and parameters are shown in Table 1.
Table 1: network structure and parameters
In the table, Layer denotes each network layer from the first convolutional layer Conv1, which receives the original image input, to the last pooling layer; Size denotes the convolution kernel or filter size and channel count of the current convolutional or pooling layer; Stride denotes the filter stride of the current layer; and Padding denotes the padding mode used by the current layer: Same mode or Valid mode.
When two consecutive frames from the same video sequence are input to the network, the network first resizes both images to 227*227*3. The feature dimensions and sizes output by each layer of the Siamese convolutional network part after convolution are shown in Table 2.
Table 2: feature dimensions and sizes of each network layer
In the table, Layer denotes the different layers of the Siamese convolutional network part, and Size denotes the feature dimensions and channel count of the input image after processing by the corresponding layer.
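Table 2 appears as a figure in the original and is not reproduced here, but the per-layer spatial sizes follow from the standard output-size formulas for valid and same padding. The sketch below recomputes them from the layer parameters given above; the strides are assumed AlexNet-style values (4 for Conv1, 2 for the pooling layers, 1 for the remaining convolutions), which the table image presumably specifies:

```python
import math

def out_size(n, k, s, padding):
    """Spatial output size of a conv/pool layer.
    valid: floor((n - k) / s) + 1;  same: ceil(n / s)."""
    if padding == "same":
        return math.ceil(n / s)
    return (n - k) // s + 1

# (name, kernel, stride, padding) per the layer specification above;
# the strides are assumed, not quoted from the patent.
layers = [("Conv1", 11, 4, "valid"), ("Pool1", 3, 2, "valid"),
          ("Conv2", 5, 1, "same"),   ("Pool2", 3, 2, "valid"),
          ("Conv3", 3, 1, "same"),   ("Conv4", 3, 1, "same"),
          ("Conv5", 3, 1, "same"),   ("Pool3", 3, 2, "valid")]

n = 227                       # input images are resized to 227*227*3
sizes = {}
for name, k, s, pad in layers:
    n = out_size(n, k, s, pad)
    sizes[name] = n
print(sizes)   # e.g. Conv1 -> 55, Pool1 -> 27, Pool2 -> 13, Pool3 -> 6
```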
The long short-term memory network part is composed of two LSTM units: the first LSTM receives the convolutional feature input from the fully connected layer, the second LSTM takes as input the concatenation of the first LSTM's output and the features of the Siamese convolutional network part, and, combined with continuous, independent tracking video sequences, sequence data modeling is performed.
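The cascade wiring of the two units — the second LSTM consuming the concatenation of the first LSTM's output and the Siamese features — can be sketched as follows. The dimensions and the simplified (gateless) recurrent updates standing in for the full LSTM units are assumptions for illustrating the data flow only:

```python
import numpy as np

rng = np.random.default_rng(1)
feat_dim, hid1, hid2 = 16, 8, 8      # assumed sizes

# Simplified recurrent updates standing in for the full LSTM units.
W1 = rng.standard_normal((hid1, feat_dim)) * 0.1
R1 = rng.standard_normal((hid1, hid1)) * 0.1
W2 = rng.standard_normal((hid2, hid1 + feat_dim)) * 0.1  # concatenated input
R2 = rng.standard_normal((hid2, hid2)) * 0.1

y1, y2 = np.zeros(hid1), np.zeros(hid2)
for t in range(5):
    feats = rng.standard_normal(feat_dim)        # Siamese features of frame t
    y1 = np.tanh(W1 @ feats + R1 @ y1)           # first LSTM (simplified)
    cascade = np.concatenate([y1, feats])        # first output + conv features
    y2 = np.tanh(W2 @ cascade + R2 @ y2)         # second LSTM (simplified)
print(y2.shape)
```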
After the deep neural network for the tracking task has been built, training of the network weight parameters begins. End-to-end training is used, and the video sequence data sets used for training are:
1) the ILSVRC2016 video object detection data set, containing 3862 video sequences, 1122397 images, 1731913 labeled target bounding boxes, and 7911 target trajectories;
2) the Amsterdam Library of Ordinary Videos (ALOV 300+), containing 314 video sequences and 148319 images, each video sequence containing one specific target;
3) non-natural video sequences: to increase the variety of target categories in the training data, artificial synthesis and data augmentation strategies are applied to different static images selected from the ImageNet data set, constructing non-natural video sequences comprising 478807 images.
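One simple way to realize the synthesis in 3) is to sweep a crop window across a static image so that the cropped content appears to translate from frame to frame. The patent only names "artificial synthesis and data augmentation" without detailing the procedure, so the sliding-crop scheme below is purely an assumed illustration:

```python
import numpy as np

def synthesize_sequence(image, crop, n_frames, step=(2, 3)):
    """Turn one static image into a pseudo video sequence by sliding a
    crop window across it (assumed scheme, not the patent's procedure)."""
    h, w = crop
    frames = []
    for t in range(n_frames):
        top, left = t * step[0], t * step[1]     # window drifts each "frame"
        frames.append(image[top:top + h, left:left + w].copy())
    return frames

img = np.arange(100 * 100, dtype=np.float32).reshape(100, 100)
seq = synthesize_sequence(img, crop=(50, 50), n_frames=5)
print(len(seq), seq[0].shape)
```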
Compared with a single-layer LSTM, a two-layer LSTM can handle more feature detail, capture more complex variations of the target features, and process longer sequence data. It propagates using the following feed-forward mode:

z_t = h(W_z x_t + R_z y_{t-1} + b_z)
i_t = σ(W_i x_t + R_i y_{t-1} + P_i ⊙ c_{t-1} + b_i)
f_t = σ(W_f x_t + R_f y_{t-1} + P_f ⊙ c_{t-1} + b_f)
c_t = z_t ⊙ i_t + c_{t-1} ⊙ f_t
o_t = σ(W_o x_t + R_o y_{t-1} + P_o ⊙ c_t + b_o)
y_t = h(c_t) ⊙ o_t

where t denotes the frame index; x_t and y_{t-1} are the feature vectors of the current input frame and the previous output frame, respectively; W, R, and P denote the weight coefficient matrices of the input, recurrent, and peephole connections, respectively; b denotes the bias vector; h is the hyperbolic tangent function; σ is the sigmoid function; and ⊙ denotes element-wise multiplication. z denotes the overall input of the LSTM unit, i denotes the input gate transmitted between the LSTM cells, o denotes the output gate of each LSTM cell, f represents the forget gate, c denotes the cell state at each moment of the sequence, and y denotes the overall output of the LSTM. The forward pass produces the output vector y_t, which stores the target state of the current frame, and the cell state c_t of the LSTM while processing the current frame; both y_t and c_t are passed to the cell as inputs when the next frame is processed, achieving forward propagation over the sequence data. Because the long short-term memory network, unlike other deep networks used for tracking, requires no online fine-tuning during tracking, real-time performance and tracking speed are significantly improved.
Experiments follow. From methods proposed in recent years, several trackers were chosen based on traditional features and deep features and of both generative and discriminative types; on the two target tracking test sets VOT2014 and VOT2016, the traditional discriminative correlation filter tracker DSST and the present invention were compared experimentally for tracking accuracy and speed. The results are shown in Table 3.
Table 3: test results
In the table, Methods denotes the tracking methods participating in the comparison. Accuracy denotes the intersection-over-union of the overlap between the predicted position and the true position of the target and reflects tracking accuracy: the larger the Accuracy value, the higher the accuracy. Speed denotes the average tracking rate per one-second unit of time over the entire test set during tracking: the larger the Speed value, the faster the tracker. Note that fields marked with "/" indicate that the method had not yet been published when the corresponding test set was released and therefore did not take part in the experiment on that test set.
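The Accuracy measure described above is an intersection-over-union of the predicted and ground-truth boxes; a minimal sketch, with boxes given as (x1, y1, x2, y2) corner pairs:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1/7 ≈ 0.1429
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))   # 1.0
```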
As can be seen from the table, the present invention outperforms most of the compared methods in accuracy, i.e. precision, and its tracking speed is substantially better than that of the other methods in the experiment; the effectiveness of the invention is thereby demonstrated.
The basic principles, main features, and advantages of the invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited by the above specific embodiments; the descriptions in the specific embodiments and the specification merely illustrate the principle of the invention, and various changes and improvements may be made without departing from the spirit and scope of the invention. All such changes and improvements fall within the scope of the claimed invention, which is defined by the claims and their equivalents.
Claims (6)
1. A real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network, characterized by comprising the following steps:
Step S1: for the video sequence to be tracked, taking two consecutive frames as the network input at each step;
Step S2: extracting features from the two input frames with the Siamese convolutional network, obtaining appearance and semantic features at different levels through the convolution operations, and concatenating the high- and low-level features into a deep feature through a fully connected layer;
Step S3: passing the deep feature to a long short-term memory network containing two LSTM units for sequence modeling, in which the LSTM forget gate selectively activates target features at different positions in the sequence and the output gate emits the state information of the current target;
Step S4: through a fully connected layer receiving the LSTM output, outputting the predicted position coordinates of the target in the current frame, and updating the search region of the next frame's target.
2. The real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network according to claim 1, characterized in that in step S2 the Siamese convolutional network is composed of two convolutional branches, arranged in parallel, that are identical in number of layers, structure, convolution kernel size, pooling mode, and padding stride, and that share their weights; the layers comprise a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, and a third pooling layer; the kernel size and channel count of the first convolutional layer are 11*11*96, those of the second convolutional layer are 5*5*256, and those of the third, fourth, and fifth convolutional layers are 3*3*384; the filter size of the first, second, and third pooling layers is 3*3; the first convolutional layer and the first, second, and third pooling layers use valid padding, and the second, third, fourth, and fifth convolutional layers use same padding; and input images are resized by the Siamese convolutional network to 227*227*3.
3. The real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network according to claim 1, characterized in that the long short-term memory network part comprises two LSTM units, wherein the first LSTM receives the convolutional feature input from the fully connected layer, the second LSTM takes as input the concatenation of the first LSTM's output and the features of the Siamese convolutional network part, and, combined with continuous, independent tracking video sequences, sequence data modeling is performed so that, for the same target in the same sequence under different conditions, the corresponding output of each sequence state is computed separately.
4. The real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network according to claim 1, characterized in that the video sequence data sets include the ILSVRC2016 video object detection data set, the Amsterdam Library of Ordinary Videos, and non-natural video sequences; the ILSVRC2016 video object detection data set contains 3862 video sequences, 1122397 images, 1731913 labeled target bounding boxes, and 7911 target trajectories; the Amsterdam Library of Ordinary Videos contains 314 video sequences and 148319 images, each video sequence containing one specific target; and the non-natural video sequences are constructed by artificial synthesis from 478807 static images selected from the ImageNet data set.
5. The real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network according to claim 1, characterized in that the feed-forward mode of the two LSTM units is

z_t = h(W_z x_t + R_z y_{t-1} + b_z)
i_t = σ(W_i x_t + R_i y_{t-1} + P_i ⊙ c_{t-1} + b_i)
f_t = σ(W_f x_t + R_f y_{t-1} + P_f ⊙ c_{t-1} + b_f)
c_t = z_t ⊙ i_t + c_{t-1} ⊙ f_t
o_t = σ(W_o x_t + R_o y_{t-1} + P_o ⊙ c_t + b_o)
y_t = h(c_t) ⊙ o_t

where t is the frame index; x_t and y_{t-1} are the feature vectors of the current input frame and the previous output frame, respectively; W, R, and P are the weight coefficient matrices of the input, recurrent, and peephole connections, respectively; b is the bias vector; h is the hyperbolic tangent function; σ is the sigmoid function; and ⊙ denotes element-wise multiplication; z is the overall input of the LSTM unit, i is the input gate transmitted between the LSTM cells, o is the output gate of each LSTM cell, f is the forget gate, c is the cell state at each moment of the sequence, and y is the overall output of the LSTM; the forward pass produces the output vector y_t storing the target state of the current frame and the cell state c_t of the LSTM while processing the current frame, and both y_t and c_t are passed to the cell as inputs when the next frame is processed, achieving forward propagation over the sequence data.
6. The real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network according to claim 1, characterized in that when the video sequence to be tracked is input, the target position in the first frame of the sequence is given as a pair of top-left and bottom-right coordinates.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910771090.7A | 2019-08-20 | 2019-08-20 | Real-time visual target tracking method based on a Siamese convolutional network and a long short-term memory network
Publications (1)
Publication Number | Publication Date
---|---
CN110490906A | 2019-11-22
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110992378A (en) * | 2019-12-03 | 2020-04-10 | 湖南大学 | Dynamic update visual tracking aerial photography method and system based on rotor flying robot |
CN111882580A (en) * | 2020-07-17 | 2020-11-03 | 元神科技(杭州)有限公司 | Video multi-target tracking method and system |
CN111899283A (en) * | 2020-07-30 | 2020-11-06 | 北京科技大学 | Video target tracking method |
CN112037263A (en) * | 2020-09-14 | 2020-12-04 | 山东大学 | Operation tool tracking system based on convolutional neural network and long-short term memory network |
CN112149616A (en) * | 2020-10-13 | 2020-12-29 | 西安电子科技大学 | Figure interaction behavior recognition method based on dynamic information |
CN112336381A (en) * | 2020-11-07 | 2021-02-09 | 吉林大学 | Echocardiogram end systole/diastole frame automatic identification method based on deep learning |
CN112465028A (en) * | 2020-11-27 | 2021-03-09 | 南京邮电大学 | Perception vision security assessment method and system |
CN112530553A (en) * | 2020-12-03 | 2021-03-19 | 中国科学院深圳先进技术研究院 | Method and device for estimating interaction force between soft tissue and tool |
CN112971769A (en) * | 2021-02-04 | 2021-06-18 | 杭州慧光健康科技有限公司 | Home personnel tumble detection system and method based on biological radar |
CN113298142A (en) * | 2021-05-24 | 2021-08-24 | 南京邮电大学 | Target tracking method based on deep space-time twin network |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106326837A (en) * | 2016-08-09 | 2017-01-11 | 北京旷视科技有限公司 | Object tracking method and apparatus |
US20180129934A1 (en) * | 2016-11-07 | 2018-05-10 | Qualcomm Incorporated | Enhanced siamese trackers |
CN108229292A (en) * | 2017-07-28 | 2018-06-29 | 北京市商汤科技开发有限公司 | Target recognition method, apparatus, storage medium and electronic device |
CN108320297A (en) * | 2018-03-09 | 2018-07-24 | 湖北工业大学 | Real-time video object tracking method and system |
CN108520530A (en) * | 2018-04-12 | 2018-09-11 | 厦门大学 | Target tracking method based on long short-term memory network |
CN108665482A (en) * | 2018-04-18 | 2018-10-16 | 南京邮电大学 | Visual target tracking method based on VGG deep network |
CN109446889A (en) * | 2018-09-10 | 2019-03-08 | 北京飞搜科技有限公司 | Object tracking method and device based on Siamese matching network |
CN109711316A (en) * | 2018-12-21 | 2019-05-03 | 广东工业大学 | Pedestrian re-identification method, apparatus, device and storage medium |
2019-08-20: Application CN201910771090.7A filed; published as CN110490906A (legal status: Pending)
Non-Patent Citations (1)
Title |
---|
K. Greff et al.: "LSTM: A Search Space Odyssey", IEEE Transactions on Neural Networks and Learning Systems * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110992378B (en) * | 2019-12-03 | 2023-05-16 | 湖南大学 | Dynamically updated visual tracking aerial photography method and system based on rotor flying robot |
CN110992378A (en) * | 2019-12-03 | 2020-04-10 | 湖南大学 | Dynamically updated visual tracking aerial photography method and system based on rotor flying robot |
CN111882580A (en) * | 2020-07-17 | 2020-11-03 | 元神科技(杭州)有限公司 | Video multi-target tracking method and system |
CN111882580B (en) * | 2020-07-17 | 2023-10-24 | 元神科技(杭州)有限公司 | Video multi-target tracking method and system |
CN111899283A (en) * | 2020-07-30 | 2020-11-06 | 北京科技大学 | Video target tracking method |
CN111899283B (en) * | 2020-07-30 | 2023-10-17 | 北京科技大学 | Video target tracking method |
CN112037263A (en) * | 2020-09-14 | 2020-12-04 | 山东大学 | Surgical tool tracking system based on convolutional neural network and long short-term memory network |
CN112037263B (en) * | 2020-09-14 | 2024-03-19 | 山东大学 | Surgical tool tracking system based on convolutional neural network and long short-term memory network |
CN112149616A (en) * | 2020-10-13 | 2020-12-29 | 西安电子科技大学 | Human interaction behavior recognition method based on dynamic information |
CN112149616B (en) * | 2020-10-13 | 2023-10-20 | 西安电子科技大学 | Human interaction behavior recognition method based on dynamic information |
CN112336381B (en) * | 2020-11-07 | 2022-04-22 | 吉林大学 | Echocardiogram end systole/diastole frame automatic identification method based on deep learning |
CN112336381A (en) * | 2020-11-07 | 2021-02-09 | 吉林大学 | Echocardiogram end systole/diastole frame automatic identification method based on deep learning |
CN112465028A (en) * | 2020-11-27 | 2021-03-09 | 南京邮电大学 | Perceptual visual safety assessment method and system |
CN112465028B (en) * | 2020-11-27 | 2023-11-14 | 南京邮电大学 | Perceptual visual safety assessment method and system |
CN112530553A (en) * | 2020-12-03 | 2021-03-19 | 中国科学院深圳先进技术研究院 | Method and device for estimating interaction force between soft tissue and tool |
CN112971769A (en) * | 2021-02-04 | 2021-06-18 | 杭州慧光健康科技有限公司 | Home personnel fall detection system and method based on biological radar |
CN113298142A (en) * | 2021-05-24 | 2021-08-24 | 南京邮电大学 | Target tracking method based on deep spatio-temporal Siamese network |
CN113298142B (en) * | 2021-05-24 | 2023-11-17 | 南京邮电大学 | Target tracking method based on deep spatio-temporal Siamese network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110490906A (en) | Real-time visual target tracking method based on Siamese convolutional network and long short-term memory network | |
CN111553193B (en) | Visual SLAM closed-loop detection method based on lightweight deep neural network | |
CN109829436B (en) | Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network | |
CN108133188A (en) | Action recognition method based on motion history images and convolutional neural networks | |
CN104217214B (en) | RGB-D human action recognition method based on configurable convolutional neural networks | |
CN105701460B (en) | Video-based basketball goal detection method and apparatus | |
CN108520530A (en) | Target tracking method based on long short-term memory network | |
CN109146921A (en) | Pedestrian target tracking method based on deep learning | |
CN109816689A (en) | Moving target tracking method with adaptive fusion of multi-layer convolutional features | |
CN109410242A (en) | Target tracking method, system, device and medium based on two-stream convolutional neural networks | |
CN108171112A (en) | Vehicle identification and tracking based on convolutional neural networks | |
CN109871781A (en) | Dynamic gesture recognition method and system based on multi-modal 3D convolutional neural networks | |
CN106651830A (en) | Image quality evaluation method based on parallel convolutional neural network | |
CN108681774A (en) | Human target tracking method based on generative adversarial network negative sample augmentation | |
CN108932479A (en) | Human anomaly detection method | |
CN105512680A (en) | Multi-view SAR image target recognition method based on deep neural network | |
CN109522855A (en) | Low-resolution pedestrian detection method, system and storage medium combining ResNet and SENet | |
CN110991274B (en) | Pedestrian fall detection method based on Gaussian mixture model and neural network | |
CN111160294B (en) | Gait recognition method based on graph convolution network | |
CN113239801B (en) | Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment | |
CN108596327A (en) | Deep-learning-based artificial intelligence picking method for seismic velocity spectra | |
CN107729993A (en) | 3D convolutional neural network construction method using training samples and compromise measurement | |
Xiao et al. | Few-shot object detection with self-adaptive attention network for remote sensing images | |
CN107025420A (en) | Method and apparatus for human behavior recognition in video | |
CN104298974A (en) | Human body behavior recognition method based on depth video sequence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2019-11-22 |