CN108537825B - Target tracking method based on transfer learning regression network

Info

Publication number
CN108537825B
CN108537825B
Authority
CN
China
Prior art keywords
image
target
network
ordinate
abscissa
Legal status
Expired - Fee Related
Application number
CN201810250785.6A
Other languages
Chinese (zh)
Other versions
CN108537825A (en)
Inventor
权伟
李天瑞
江永全
何武
刘跃平
卢学民
王晔
贾成君
陈锦雄
Current Assignee
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date
Filing date
Publication date
Application filed by Southwest Jiaotong University
Priority to CN201810250785.6A
Publication of CN108537825A
Application granted
Publication of CN108537825B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The invention provides a target tracking method based on a transfer learning regression network, and relates to the technical field of computer vision. A target object to be tracked is selected and determined from the initial image; a target position regression network based on block prediction is constructed; a tracking-oriented training data set is generated and the network is trained. For image input under real-time processing, the video image collected by the camera and stored in the storage area is extracted as the input image to be tracked. For target localization, the obtained image is input into the position regression network, and after forward processing by the network the output layer yields 8 × 8 × 8 relative position data. For the network update, the 8 × 8 × 8 relative positions between the 8 × 8 image blocks into which the whole image is divided and the target are calculated from the obtained target position and, together with the current input image, form a group of training data.

Description

Target tracking method based on transfer learning regression network
Technical Field
The present invention relates to the technical fields of computer vision, computer graphics and images, and machine intelligence and systems.
Background
Visual target tracking is an important research subject in the field of computer vision. Its main task is to acquire information such as the continuous position, appearance and motion of a target, and thereby provide a basis for further semantic-level analysis (such as behavior recognition and scene understanding). Target tracking research is widely applied in fields such as intelligent monitoring, human-computer interaction and automatic control systems, and has strong practical value. At present, target tracking methods mainly comprise classical target tracking methods and deep learning target tracking methods.
Classical target tracking methods are mainly divided into generative methods and discriminative methods. Generative methods assume that the target can be expressed through some generation process or model, such as Principal Component Analysis (PCA) or Sparse Coding, and then treat the tracking problem as finding the most likely candidate in the region of interest. These methods aim at designing an image representation that facilitates robust target tracking. Unlike generative methods, discriminative methods treat tracking as a classification or continuous object detection problem, whose task is to distinguish the target from the image background. This type of method, which utilizes both target and background information, is currently the main direction of research. Discriminative methods typically involve two main steps: the first is training, in which a classifier and its decision rules are obtained by selecting visual features that discriminate between target and background; the second is tracking, in which the classifier is used to evaluate each location within the field of view and to determine the most likely target location. The target frame is then moved to that location and the process is repeated to effect tracking, and this framework has been used to design tracking algorithms of various forms. In general, the main advantages of classical tracking methods are their running speed and low dependence on auxiliary data, while they also require a trade-off between the accuracy and the real-time performance of tracking.
Deep Learning, a hot spot of machine learning research in recent years, has achieved surprising success in many areas such as speech recognition, image recognition, object detection and video classification, owing to its powerful feature expression capability and to evolving data sets and hardware support. Research on deep learning target tracking has also developed rapidly, but because prior knowledge is lacking in target tracking and real-time performance is required, deep learning techniques based on large amounts of training data and parameter computation are difficult to exploit fully in this respect, and a large space for exploration remains. Judging from current research results, deep learning tracking methods mainly apply auto-encoder networks and convolutional neural networks, and the research follows two main ideas: one is to perform transfer learning on the network and then carry out online fine tuning, and the other is to modify the structure of the deep network to adapt it to the tracking requirements. The auto-encoder network (AE) is a typical unsupervised deep learning network; owing to its feature learning capability and noise robustness it was the first to be applied to target tracking. Overall, the auto-encoder network is intuitive and moderate in size, is an excellent unsupervised deep learning model, and was applied to tracking first with fairly good results. In contrast to auto-encoder networks, Convolutional Neural Networks (CNNs) are supervised feedforward neural networks that involve repeatedly alternating convolution, nonlinear transformation and downsampling operations, and they exhibit very powerful performance in pattern recognition, especially in computer vision tasks. In general, deep learning has a stronger feature expression capability than classical methods, while further research is still needed on the selection of training sets, the choice and structure of the network, the real-time performance of algorithms, and the application of recurrent neural networks in tracking methods.
Disclosure of Invention
The invention aims to provide a target tracking method based on a transfer learning regression network, in which a deep neural network is used to solve the problems of inaccurate training data and imprecise target positioning during tracking.
The purpose of the invention is realized by the following technical scheme. The method, which makes use of a VGG-19 network, comprises the following steps:
(1) target selection
A target object to be tracked is selected and determined from the initial image; in this target selection process the target is either extracted automatically by a moving target detection method or specified manually by a human-computer interaction method;
(2) target position regression network construction based on block prediction
The target position regression network based on block prediction consists of four parts: an image input layer, a migration network for feature expression, a network layer containing 4096 × 1 nodes, and a position output layer containing 8 × 8 × 8 nodes. In the whole network, the input image, after scale normalization to a size of 224 × 224 pixels, serves as the input data of the VGG-19 network; the 23rd layer of the VGG-19 network is fully connected with the network layer of 4096 × 1 nodes, i.e. the 23rd layer of VGG-19 is used to perform feature expression on the input image, and the network layer of 4096 × 1 nodes is fully connected with the position output layer of 8 × 8 × 8 nodes;
The input image of size 224 × 224 is divided into 8 × 8 = 64 image blocks, each of size 28 × 28 pixels. The position of each image block corresponds to the position of a node in the first two dimensions of the position output layer, and the 8 nodes in the third dimension of the position output layer represent the relative position of the target predicted by the corresponding image block, so that each input image, after passing through the regression network, yields 8 × 8 × 8 relative position values;
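As an illustration only, a minimal sketch of such a block-prediction position regression network is given below in Python/PyTorch; the mapping of the "23rd layer of VGG-19" onto the first 23 modules of torchvision's pretrained vgg19 feature extractor, the layer names and the framework choice are assumptions, not part of the claimed method.

```python
# Illustrative sketch (assumptions noted above): frozen VGG-19 features,
# a 4096-node fully connected layer, and an 8x8x8 position output layer.
import torch
import torch.nn as nn
import torchvision.models as models

class PositionRegressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(pretrained=True)
        # Transfer-learning backbone used for feature expression (kept frozen).
        self.backbone = nn.Sequential(*list(vgg.features.children())[:23])
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.fc = nn.Linear(512 * 28 * 28, 4096)   # 4096 x 1 node layer
        self.out = nn.Linear(4096, 8 * 8 * 8)      # 8 x 8 x 8 position output

    def forward(self, x):                 # x: (N, 3, 224, 224), normalized
        f = self.backbone(x)              # (N, 512, 28, 28) under the slicing assumption
        f = torch.relu(self.fc(torch.flatten(f, 1)))
        y = self.out(f)                   # (N, 512) relative-position values
        return y.view(-1, 8, 8, 8)        # one 8-vector of corner offsets per 28x28 block
```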
(3) trace-oriented training data set generation
To be able to train the position regression network, the training data are acquired here in two ways. On the one hand, for the first frame input image, the corresponding image block is extracted according to the target to be tracked; the target image block is then placed at an arbitrary position of the first frame image by manual synthesis to generate a new image, and the area where the original target image block was located is filled with the mean value of the target image block; at the same time the position where the target image block is placed is recorded, the 8 × 8 × 8 relative positions between the 8 × 8 image blocks into which the whole image is divided and the target are calculated, and these position coordinate data serve as the expected output of the network and form a group of training data together with the image. On the other hand, the extracted target image block is first transformed, including operations such as translation, rotation, distortion and occlusion, and is then placed in the image by the same method to synthesize a training image. All the training data form a training data set which is then used for network training;
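A minimal sketch of synthesizing one training pair from the first frame under the scheme above is shown next; the function name, the corner ordering (upper left, upper right, lower left, lower right) and the block-center convention are illustrative assumptions.

```python
# Illustrative sketch: paste the target patch at a random position, fill its
# original area with the patch mean, and build the 8x8x8 relative-position label.
import numpy as np

def make_training_pair(frame, target_box, rng=np.random):
    """frame: 224x224x3 image; target_box: (x0, y0, w, h) of the target."""
    x0, y0, w, h = target_box
    patch = frame[y0:y0 + h, x0:x0 + w].copy()
    img = frame.copy()
    img[y0:y0 + h, x0:x0 + w] = patch.mean(axis=(0, 1), keepdims=True)
    nx, ny = rng.randint(0, 225 - w), rng.randint(0, 225 - h)   # new placement
    img[ny:ny + h, nx:nx + w] = patch
    # Corners of the synthesized target frame: ul, ur, ll, lr (assumed ordering).
    corners = [(nx, ny), (nx + w, ny), (nx, ny + h), (nx + w, ny + h)]
    label = np.zeros((8, 8, 8), dtype=np.float32)
    for i in range(8):
        for j in range(8):
            cx, cy = j * 28 + 14, i * 28 + 14        # center of block (i, j)
            label[i, j] = [v for (px, py) in corners for v in (px - cx, py - cy)]
    return img, label
```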
(4) network training
During network training, the images used for training are input one by one, the parameters of the VGG-19 network part are kept unchanged, and the connection parameters between the 23rd layer of the VGG-19 network, the network layer containing 4096 × 1 nodes and the position output layer containing 8 × 8 × 8 nodes are trained with the classical stochastic gradient descent (SGD) method;
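A training-loop sketch under the assumptions above follows; the mean-squared-error loss, the learning rate and the momentum value are illustrative, since only stochastic gradient descent and a fixed VGG-19 part are specified.

```python
# Illustrative sketch: only the new fully connected layers are updated by SGD.
import torch

def train(net, pairs, epochs=1, lr=1e-3):
    net.train()
    params = list(net.fc.parameters()) + list(net.out.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    loss_fn = torch.nn.MSELoss()                 # assumed regression loss
    for _ in range(epochs):
        for img, label in pairs:                 # img: (3,224,224), label: (8,8,8) tensors
            pred = net(img.unsqueeze(0))
            loss = loss_fn(pred, label.unsqueeze(0))
            opt.zero_grad()
            loss.backward()
            opt.step()
```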
(5) image input
Under real-time processing conditions, the video image collected by the camera and stored in the storage area is extracted as the input image to be tracked; under offline processing conditions, the acquired video file is decomposed into an image sequence consisting of a number of frames, and the frame images are extracted one by one in temporal order as input images. If the input image is empty, the whole process stops;
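A small sketch of the two input modes is given below; the use of OpenCV and the resizing step are assumptions made only for illustration.

```python
# Illustrative sketch: yield frames from a camera (source=0) or from a video file path.
import cv2

def frames(source=0):
    cap = cv2.VideoCapture(source)
    while True:
        ok, frame = cap.read()
        if not ok:               # empty input: stop the whole process
            break
        yield cv2.resize(frame, (224, 224))
    cap.release()
```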
(6) target localization
The image obtained in step (5) is input into the position regression network; after forward processing by the network, the network output layer yields 8 × 8 × 8 relative position data, and the target is located by processing these 8 × 8 × 8 output node values. Let $A_{i,j}$ denote the $(i,j)$-th image block of the current input image, $x^{c}_{i,j}$ the abscissa of its center point and $y^{c}_{i,j}$ the ordinate of its center point. The 8 node values of the regression network output layer corresponding to image block $A_{i,j}$ are

$$\left(\Delta x^{ul}_{i,j},\ \Delta y^{ul}_{i,j},\ \Delta x^{ur}_{i,j},\ \Delta y^{ur}_{i,j},\ \Delta x^{ll}_{i,j},\ \Delta y^{ll}_{i,j},\ \Delta x^{lr}_{i,j},\ \Delta y^{lr}_{i,j}\right),$$

namely the differences between the abscissa of the upper left corner of the target frame and the abscissa of the image block center point, between the ordinate of the upper left corner and the ordinate of the center point, between the abscissa of the upper right corner and the abscissa of the center point, between the ordinate of the upper right corner and the ordinate of the center point, between the abscissa of the lower left corner and the abscissa of the center point, between the ordinate of the lower left corner and the ordinate of the center point, between the abscissa of the lower right corner and the abscissa of the center point, and between the ordinate of the lower right corner and the ordinate of the center point.
The target position is expressed as

$$P = \left(x^{ul},\ y^{ul},\ x^{ur},\ y^{ur},\ x^{ll},\ y^{ll},\ x^{lr},\ y^{lr}\right),$$

where the pairs $(x^{ul}, y^{ul})$, $(x^{ur}, y^{ur})$, $(x^{ll}, y^{ll})$ and $(x^{lr}, y^{lr})$ respectively represent the abscissa and ordinate of the upper left, upper right, lower left and lower right corners of the target frame. The target position predicted by image block $A_{i,j}$ is

$$P_{i,j} = \left(x^{ul}_{i,j},\ y^{ul}_{i,j},\ x^{ur}_{i,j},\ y^{ur}_{i,j},\ x^{ll}_{i,j},\ y^{ll}_{i,j},\ x^{lr}_{i,j},\ y^{lr}_{i,j}\right),$$

i.e. for $A_{i,j}$ there is

$$x^{k}_{i,j} = x^{c}_{i,j} + \Delta x^{k}_{i,j}, \qquad y^{k}_{i,j} = y^{c}_{i,j} + \Delta y^{k}_{i,j}, \qquad k \in \{ul,\, ur,\, ll,\, lr\}.$$

Similarly, each image block has its own predicted target position, and the corner coordinates predicted by different blocks usually differ, so the four corner coordinates predicted by all image blocks need to be analyzed statistically, and a coordinate accumulation method is adopted to determine the final coordinate of each corner of the target frame and thereby locate the whole target. Concretely, let $M^{ul}$, $M^{ur}$, $M^{ll}$ and $M^{lr}$ denote the coordinate accumulation matrices of the upper left, upper right, lower left and lower right corners of the target frame respectively, where $M^{ul}(a,b)$, $M^{ur}(a,b)$, $M^{ll}(a,b)$ and $M^{lr}(a,b)$ are the values of the corresponding matrices at $(a,b)$, with $0 \le a, b \le 224$, and every element of these matrices is initially 0. For image block $A_{i,j}$ there is

$$M^{k}\!\left(x^{k}_{i,j},\ y^{k}_{i,j}\right) \leftarrow M^{k}\!\left(x^{k}_{i,j},\ y^{k}_{i,j}\right) + 1, \qquad k \in \{ul,\, ur,\, ll,\, lr\},$$

whereby the four matrices are accumulated over every image block.
Finally, the coordinates of the element with the maximum value in each matrix are taken as the coordinates of the corresponding corner of the target frame, i.e.

$$\left(x^{k},\ y^{k}\right) = \operatorname*{arg\,max}_{(a,\,b)} M^{k}(a,b), \qquad k \in \{ul,\, ur,\, ll,\, lr\},$$

where $(x^{ul}, y^{ul})$, $(x^{ur}, y^{ur})$, $(x^{ll}, y^{ll})$ and $(x^{lr}, y^{lr})$ are the horizontal and vertical coordinates of the element with the maximum value in the coordinate accumulation matrices of the upper left, upper right, lower left and lower right corners respectively, which completes target positioning;
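A sketch of this coordinate-accumulation localization follows; the vote increment of one per image block and the offset ordering are assumptions consistent with the description above.

```python
# Illustrative sketch: every image block votes for the four corner positions it
# predicts, and the coordinates with the most votes become the target frame corners.
import numpy as np

def locate(pred):
    """pred: (8, 8, 8) network output, offsets ordered ul_x, ul_y, ur_x, ur_y,
    ll_x, ll_y, lr_x, lr_y (assumed ordering)."""
    acc = np.zeros((4, 225, 225), dtype=np.int32)       # one matrix per corner
    for i in range(8):
        for j in range(8):
            cx, cy = j * 28 + 14, i * 28 + 14           # center of block (i, j)
            for k in range(4):
                x = int(round(cx + pred[i, j, 2 * k]))
                y = int(round(cy + pred[i, j, 2 * k + 1]))
                if 0 <= x <= 224 and 0 <= y <= 224:
                    acc[k, x, y] += 1
    corners = []
    for k in range(4):                                   # argmax per corner matrix
        x, y = np.unravel_index(np.argmax(acc[k]), acc[k].shape)
        corners.append((int(x), int(y)))
    return corners        # [(ul), (ur), (ll), (lr)]
```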
(7) network update
According to the target position obtained in step (6), the 8 × 8 × 8 relative positions between the 8 × 8 image blocks into which the whole image is divided and the target are calculated and, together with the current input image, form a group of training data; one round of network training is then performed to realize fine tuning and updating of the network, after which the process jumps to step (5).
The relative position values comprise the difference between the abscissa of the upper left corner of the target frame and the abscissa of the image block center point, the difference between the ordinate of the upper left corner and the ordinate of the center point, the difference between the abscissa of the upper right corner and the abscissa of the center point, the difference between the ordinate of the upper right corner and the ordinate of the center point, the difference between the abscissa of the lower left corner and the abscissa of the center point, the difference between the ordinate of the lower left corner and the ordinate of the center point, the difference between the abscissa of the lower right corner and the abscissa of the center point, and the difference between the ordinate of the lower right corner and the ordinate of the center point.
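A sketch of the update step of step (7), reusing the illustrative helpers above, is shown below; the helper names are assumptions.

```python
# Illustrative sketch: rebuild the 8x8x8 label from the located corners and
# run one fine-tuning pass on the current frame.
import numpy as np

def update(net, img, corners, train_step):
    """corners: [(ul), (ur), (ll), (lr)] from locate(); train_step: one SGD pass."""
    label = np.zeros((8, 8, 8), dtype=np.float32)
    for i in range(8):
        for j in range(8):
            cx, cy = j * 28 + 14, i * 28 + 14
            label[i, j] = [v for (px, py) in corners for v in (px - cx, py - cy)]
    train_step(net, img, label)      # one round of training to fine-tune the network
```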
The advantages and positive effects are as follows. The method first constructs a position regression network consisting of four parts: an image input layer, a migration network for feature expression, a network layer containing 4096 × 1 nodes, and a position output layer containing 8 × 8 × 8 nodes. The image input layer performs uniform preprocessing on input images, normalizing them to 224 × 224 pixels; the migration network is the pre-trained network VGG-19, whose 23rd layer serves as the feature expression layer and is fully connected with the network layer of 4096 × 1 nodes, which in turn is fully connected with the position output layer of 8 × 8 × 8 nodes. In order to train the position regression network effectively, the target and the image are subjected to a number of transformations, the corresponding training data set is synthesized, and network training is carried out with the classical stochastic gradient descent method. An input image processed forward by the regression network yields a prediction of the target position for each image block into which the whole image is divided, comprising 8 × 8 × 8 relative positions corresponding to the relative coordinate values of the four corners of the target frame. The target can therefore be located by a coordinate accumulation method, and tracking is thereby realized. In addition, each time target positioning is completed, the network is finely adjusted and updated according to the currently determined target position, so that the network has a certain capability of adapting in step with the target. By exploiting the strong feature expression ability of deep learning, the invention can handle complex tracking scenes and realize accurate target tracking; at the same time, the regression-based method avoids a large amount of position searching, greatly improves the target positioning speed, and can realize real-time target tracking. Moreover, the method can be used for single-target tracking and can also be extended to multi-target tracking by correspondingly improving the network (for example its output end).
Drawings
FIG. 1 is a block diagram of the present invention.
FIG. 2 is a flow chart of the present invention.
Detailed Description
The method can be used in various target tracking scenarios, such as intelligent video analysis, automatic human-computer interaction, traffic video monitoring, unmanned vehicle driving, biological population analysis, field animal motion analysis, moving object detection at crossings, fluid surface velocity measurement, and the like.
Take intelligent video analysis as an example. Intelligent video analysis comprises a number of important automatic analysis tasks such as behavior analysis, abnormality alarm and video compression, and the basis of these tasks is stable target tracking. They can be realized with the tracking method provided by the invention. Specifically, a position regression network based on transfer learning is first established, as shown in FIG. 1; the target and the image are then subjected to a number of transformations, the corresponding training data set is synthesized, and network training is carried out with the classical stochastic gradient descent method; after training, the network has acquired the capability of locating the target. During tracking, the regression network processes the input image forward and outputs the relative position information of the target corresponding to the image; from this information the target position can be statistically analyzed and located by the coordinate accumulation method, thereby realizing tracking. In addition, each time target positioning is completed, the network is finely adjusted and updated according to the currently determined target position, so that the network has a certain capability of adapting in step with the target. By exploiting the strong feature expression ability of deep learning, the invention can handle complex tracking scenes and realize accurate target tracking; at the same time, the regression-based method avoids a large amount of position searching, greatly improves the target positioning speed, and can realize real-time target tracking. Moreover, the method can be used for single-target tracking and can also be extended to multi-target tracking by correspondingly improving the network (for example its output end).
In summary, the method first establishes a position regression network based on transfer learning, then subjects the target and the image to various transformations to synthesize the corresponding training data set, and carries out network training with the classical stochastic gradient descent method; after training, the network has acquired the capability of locating the target. During tracking, the regression network processes the input image forward and outputs the relative position information of the target corresponding to the image; from this information the target position can be statistically analyzed and located by the coordinate accumulation method, thereby realizing tracking. In addition, each time target positioning is completed, the network is finely adjusted and updated according to the currently determined target position, so that the network has a certain capability of adapting in step with the target.
The method can be implemented by programming in any computer programming language (such as C), and tracking system software based on the method can realize real-time target tracking applications on any PC or embedded system.

Claims (2)

1. A target tracking method based on a transfer learning regression network, utilizing a VGG-19 network and characterized by comprising the following steps:
(1) target selection
A target object to be tracked is selected and determined from the initial image; in this target selection process the target is either extracted automatically by a moving target detection method or specified manually by a human-computer interaction method;
(2) target position regression network construction based on block prediction
The target position regression network based on block prediction consists of four parts: an image input layer, a migration network for feature expression, a network layer containing 4096 × 1 nodes, and a position output layer containing 8 × 8 × 8 nodes. In the whole network, the input image, after scale normalization to a size of 224 × 224 pixels, serves as the input data of the VGG-19 network; the 23rd layer of the VGG-19 network is fully connected with the network layer of 4096 × 1 nodes, i.e. the 23rd layer of VGG-19 is used to perform feature expression on the input image, and the network layer of 4096 × 1 nodes is fully connected with the position output layer of 8 × 8 × 8 nodes;
The input image of size 224 × 224 is divided into 8 × 8 = 64 image blocks, each of size 28 × 28 pixels. The position of each image block corresponds to the position of a node in the first two dimensions of the position output layer, and the 8 nodes in the third dimension of the position output layer represent the relative position of the target predicted by the corresponding image block, so that each input image, after passing through the regression network, yields 8 × 8 × 8 relative position values;
(3) trace-oriented training data set generation
To be able to train the position regression network, the training data are acquired here in two ways. On the one hand, for the first frame input image, the corresponding image block is extracted according to the target to be tracked; the target image block is then placed at an arbitrary position of the first frame image by manual synthesis to generate a new image, and the area where the original target image block was located is filled with the mean value of the target image block; at the same time the position where the target image block is placed is recorded, the 8 × 8 × 8 relative positions between the 8 × 8 image blocks into which the whole image is divided and the target are calculated, and these position coordinate data serve as the expected output of the network and form a group of training data together with the image. On the other hand, the extracted target image block is first transformed, including operations such as translation, rotation, distortion and occlusion, and is then placed in the image by the same method to synthesize a training image. All the training data form a training data set which is then used for network training;
(4) network training
During network training, the images used for training are input one by one, the parameters of the VGG-19 network part are kept unchanged, and the connection parameters between the 23rd layer of the VGG-19 network, the network layer containing 4096 × 1 nodes and the position output layer containing 8 × 8 × 8 nodes are trained with the classical stochastic gradient descent (SGD) method;
(5) image input
Under real-time processing conditions, the video image collected by the camera and stored in the storage area is extracted as the input image to be tracked; under offline processing conditions, the acquired video file is decomposed into an image sequence consisting of a number of frames, and the frame images are extracted one by one in temporal order as input images; if the input image is empty, the whole process stops;
(6) target localization
The image obtained in step (5) is input into the position regression network; after forward processing by the network, the network output layer yields 8 × 8 × 8 relative position data, and the target is located by processing these 8 × 8 × 8 output node values; let $A_{i,j}$ denote the $(i,j)$-th image block of the current input image, $x^{c}_{i,j}$ the abscissa of its center point and $y^{c}_{i,j}$ the ordinate of its center point; the 8 node values of the regression network output layer corresponding to image block $A_{i,j}$ are

$$\left(\Delta x^{ul}_{i,j},\ \Delta y^{ul}_{i,j},\ \Delta x^{ur}_{i,j},\ \Delta y^{ur}_{i,j},\ \Delta x^{ll}_{i,j},\ \Delta y^{ll}_{i,j},\ \Delta x^{lr}_{i,j},\ \Delta y^{lr}_{i,j}\right),$$

namely the differences between the abscissa of the upper left corner of the target frame and the abscissa of the image block center point, between the ordinate of the upper left corner and the ordinate of the center point, between the abscissa of the upper right corner and the abscissa of the center point, between the ordinate of the upper right corner and the ordinate of the center point, between the abscissa of the lower left corner and the abscissa of the center point, between the ordinate of the lower left corner and the ordinate of the center point, between the abscissa of the lower right corner and the abscissa of the center point, and between the ordinate of the lower right corner and the ordinate of the center point;
the target position is expressed as

$$P = \left(x^{ul},\ y^{ul},\ x^{ur},\ y^{ur},\ x^{ll},\ y^{ll},\ x^{lr},\ y^{lr}\right),$$

where the pairs $(x^{ul}, y^{ul})$, $(x^{ur}, y^{ur})$, $(x^{ll}, y^{ll})$ and $(x^{lr}, y^{lr})$ respectively represent the abscissa and ordinate of the upper left, upper right, lower left and lower right corners of the target frame; the target position predicted by image block $A_{i,j}$ is

$$P_{i,j} = \left(x^{ul}_{i,j},\ y^{ul}_{i,j},\ x^{ur}_{i,j},\ y^{ur}_{i,j},\ x^{ll}_{i,j},\ y^{ll}_{i,j},\ x^{lr}_{i,j},\ y^{lr}_{i,j}\right),$$

i.e. for $A_{i,j}$ there is

$$x^{k}_{i,j} = x^{c}_{i,j} + \Delta x^{k}_{i,j}, \qquad y^{k}_{i,j} = y^{c}_{i,j} + \Delta y^{k}_{i,j}, \qquad k \in \{ul,\, ur,\, ll,\, lr\};$$

each image block has its own predicted target position, and the corner coordinates predicted by different blocks usually differ, so the four corner coordinates predicted by all image blocks need to be analyzed statistically, and a coordinate accumulation method is adopted to determine the final coordinate of each corner of the target frame and thereby locate the whole target; concretely, let $M^{ul}$, $M^{ur}$, $M^{ll}$ and $M^{lr}$ denote the coordinate accumulation matrices of the upper left, upper right, lower left and lower right corners of the target frame respectively, where $M^{ul}(a,b)$, $M^{ur}(a,b)$, $M^{ll}(a,b)$ and $M^{lr}(a,b)$ are the values of the corresponding matrices at $(a,b)$, with $0 \le a, b \le 224$, and every element of these matrices is initially 0; for image block $A_{i,j}$ there is

$$M^{k}\!\left(x^{k}_{i,j},\ y^{k}_{i,j}\right) \leftarrow M^{k}\!\left(x^{k}_{i,j},\ y^{k}_{i,j}\right) + 1, \qquad k \in \{ul,\, ur,\, ll,\, lr\},$$

whereby the four matrices are accumulated over every image block;
finally, the coordinates of the element with the maximum value in each matrix are taken as the coordinates of the corresponding corner of the target frame, i.e.

$$\left(x^{k},\ y^{k}\right) = \operatorname*{arg\,max}_{(a,\,b)} M^{k}(a,b), \qquad k \in \{ul,\, ur,\, ll,\, lr\},$$

where $(x^{ul}, y^{ul})$, $(x^{ur}, y^{ur})$, $(x^{ll}, y^{ll})$ and $(x^{lr}, y^{lr})$ are the horizontal and vertical coordinates of the element with the maximum value in the coordinate accumulation matrices of the upper left, upper right, lower left and lower right corners respectively, which completes target positioning;
(7) network update
According to the target position obtained in step (6), the 8 × 8 × 8 relative positions between the 8 × 8 image blocks into which the whole image is divided and the target are calculated and, together with the current input image, form a group of training data; one round of network training is then performed to realize fine tuning and updating of the network, after which the process jumps to step (5).
2. The target tracking method based on the transfer learning regression network according to claim 1, characterized in that: the relative position values comprise the difference between the abscissa of the upper left corner of the target frame and the abscissa of the image block center point, the difference between the ordinate of the upper left corner and the ordinate of the center point, the difference between the abscissa of the upper right corner and the abscissa of the center point, the difference between the ordinate of the upper right corner and the ordinate of the center point, the difference between the abscissa of the lower left corner and the abscissa of the center point, the difference between the ordinate of the lower left corner and the ordinate of the center point, the difference between the abscissa of the lower right corner and the abscissa of the center point, and the difference between the ordinate of the lower right corner and the ordinate of the center point.
CN201810250785.6A 2018-03-26 2018-03-26 Target tracking method based on transfer learning regression network Expired - Fee Related CN108537825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810250785.6A CN108537825B (en) 2018-03-26 2018-03-26 Target tracking method based on transfer learning regression network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810250785.6A CN108537825B (en) 2018-03-26 2018-03-26 Target tracking method based on transfer learning regression network

Publications (2)

Publication Number Publication Date
CN108537825A CN108537825A (en) 2018-09-14
CN108537825B true CN108537825B (en) 2021-08-17

Family

ID=63484603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810250785.6A Expired - Fee Related CN108537825B (en) 2018-03-26 2018-03-26 Target tracking method based on transfer learning regression network

Country Status (1)

Country Link
CN (1) CN108537825B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493370B (en) * 2018-10-12 2021-07-02 西南交通大学 Target tracking method based on space offset learning
CN111127510B (en) * 2018-11-01 2023-10-27 杭州海康威视数字技术股份有限公司 Target object position prediction method and device
CN110162475B (en) * 2019-05-27 2023-04-18 浙江工业大学 Software defect prediction method based on deep migration
CN113192062A (en) * 2021-05-25 2021-07-30 湖北工业大学 Arterial plaque ultrasonic image self-supervision segmentation method based on image restoration

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310466B (en) * 2013-06-28 2016-02-17 安科智慧城市技术(中国)有限公司 A kind of monotrack method and implement device thereof
US9928405B2 (en) * 2014-01-13 2018-03-27 Carnegie Mellon University System and method for detecting and tracking facial features in images
US10303977B2 (en) * 2016-06-28 2019-05-28 Conduent Business Services, Llc System and method for expanding and training convolutional neural networks for large size input images
CN107146237B (en) * 2017-04-24 2020-02-18 西南交通大学 Target tracking method based on online state learning and estimation
CN107452023A (en) * 2017-07-21 2017-12-08 上海交通大学 A kind of monotrack method and system based on convolutional neural networks on-line study

Also Published As

Publication number Publication date
CN108537825A (en) 2018-09-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210817