CN110660082A - Target tracking method based on graph convolution and trajectory convolution network learning - Google Patents

Target tracking method based on graph convolution and trajectory convolution network learning

Info

Publication number: CN110660082A (application); CN110660082B (grant)
Authority: CN (China)
Prior art keywords: target, frame, convolution, track, network
Prior art date: 2019-09-25
Legal status: Granted
Application number: CN201910908419.XA
Other languages: Chinese (zh)
Other versions: CN110660082B (en)
Inventor
卢学民
权伟
刘跃平
张卫华
周宁
邹栋
郭少鹏
郑丹阳
侯思帧
郭永成
彭宇晨
陈锦雄
Current Assignee: Southwest Jiaotong University
Original Assignee: Southwest Jiaotong University
Priority date: 2019-09-25
Filing date: 2019-09-25
Publication date: 2020-01-07
Application filed by Southwest Jiaotong University
Priority to CN201910908419.XA
Publication of CN110660082A; application granted; publication of CN110660082B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a target tracking method based on graph convolution and trajectory convolution network learning, and relates to the technical fields of computer vision and target tracking. The network comprises a dual-stream feature extraction layer, a target candidate trajectory extraction layer, and a target localization layer. The network outputs a confidence for each target candidate trajectory, selects the candidate trajectory with the highest confidence as the target motion trajectory, and takes the target block of the last frame of that trajectory as the target image block; after training, the network has an initial target localization capability. During tracking, the spatial features and motion trajectory features of the target over 16 consecutive frames are extracted and concatenated into dual-stream features, target candidate trajectories that follow the target motion pattern are obtained through an LSTM structure, and the feature extraction can assign larger weights to the more discriminative parts of the target for tracking.

Description

Target tracking method based on graph convolution and trajectory convolution network learning
Technical Field
The invention relates to the technical fields of computer vision, machine learning, and target tracking.
Background
Visual target tracking is a very active research topic in computer vision. Given a video segment, a target object is identified automatically or specified manually in the sequence, and its position, appearance, motion, and other information are then predicted in subsequent frames. Target tracking is widely applied in military and civilian fields such as intelligent surveillance, human-computer interaction, and traffic monitoring, and has strong practical value. Although the topic has been studied for decades, it remains challenging: in real scenes the target object is susceptible to many factors, such as illumination changes, pose changes, and occlusion, so developing a consistently robust tracking system is a very difficult problem. Over the past two to three decades visual target tracking technology has advanced greatly; in recent years, in particular, tracking methods based on deep learning have achieved satisfying results, bringing breakthrough progress to the field.
Deep learning, a hot spot of machine learning research in recent years, has achieved surprising success in many areas, such as speech recognition, image recognition, object detection, and video classification, owing to its powerful feature representation capability together with large data sets and strong hardware and software support. Its development in target tracking is also rapid, but because tracking provides little prior knowledge and demands real-time performance, deep learning techniques that rely on large amounts of training data and parameter computation are difficult to exploit fully in this setting, and there remains considerable room for exploration. Compared with traditional hand-crafted feature extraction, deep learning offers deeper semantic features and stronger representation capability, making it more accurate and reliable for the target tracking problem.
At present, deep-learning-based target tracking algorithms fall into three main categories: tracking algorithms based on template matching, algorithms based on machine-learning regression, and algorithms based on machine-learning classification. Current deep learning trackers, however, still do not completely solve the problems that arise in practice, where the target may undergo various kinds of interference, such as deformation, occlusion, and illumination changes, which increase the uncertainty of the target motion. The spatial position relationships among the parts of a target and the target motion trajectory information, on the other hand, play an extremely important role in accurate and robust tracking. Recently, graph convolutional neural networks have made notable progress in visual target tracking. Zhen Cui et al. proposed a spectral filter tracking method that uses spectral filters to encode and extract features of the local image structure, and regresses the target position with the filter parameters and a feature projection function. Junyu Gao et al. proposed a graph convolutional tracking method that simultaneously performs spatio-temporal appearance modeling and context-aware adaptive learning of the target, achieving robust target localization. The target motion trajectory, as an important information feature of a continuously moving target, is widely used in target tracking and action recognition. Chenge Li et al. proposed a method for real-time target tracking in video that detects three-dimensional tracks from spatio-temporal convolutional features of the target. For the video action recognition task, Yue Zhao et al. proposed an end-to-end trajectory convolution network that extracts dynamic trajectory features of the target by introducing a trajectory convolution operation, thereby combining the target's appearance and motion information. Unlike temporal convolution, trajectory convolution takes the target's position offsets and motion patterns into account, aggregating appearance features along the motion path and thus expressing the target's continuous motion over time more accurately.
Disclosure of Invention
The invention aims to provide a target tracking method based on graph convolution and trajectory convolution network learning that can effectively solve the technical problem of tracking a target object in complex motion scenes accurately, robustly, and over long periods.
The purpose of the invention is realized by the following technical scheme: a target tracking method based on graph convolution and trajectory convolution network learning comprises the following steps:
step one, target selection
A target object to be tracked is selected and determined from the initial image sequence; the target object is either extracted automatically by a moving-target detection method or specified manually by a human-computer interaction method;
step two, generation of training data set
The generation of the training data set has two steps: first the data set is selected, and then the training samples are constructed. The large classification and recognition video data set ImageNet Video is selected, in which all images are annotated with the position coordinates of the corresponding target object, and the training data set is then built from these known labels. The data set contains 4500 videos in total, and training samples are drawn from each video according to two different selection rules: 16 consecutive frames (I_1, I_2, ..., I_16) are taken as one group of training data, and one frame out of every two, i.e. (I_1, I_3, ..., I_31), is taken as another group, where I denotes a frame image and each sampled group contains 16 frames. In total, 56250 groups of training data are generated, and the image frames are normalized to 224 x 224 pixels;
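The two sampling rules above can be made concrete with a short sketch. The following Python snippet is a minimal illustration only, assuming each video is available as an ordered list of frame arrays; the helper name make_training_sets and the use of OpenCV for resizing are assumptions, not part of the patent.

```python
import cv2  # assumption: OpenCV is used for frame resizing

def make_training_sets(frames, clip_len=16):
    """Build the two kinds of 16-frame training groups described in step two.

    frames: ordered list of frame arrays for one video.
    Returns a list of groups, each containing clip_len frames.
    """
    groups = []
    # Rule 1: 16 consecutive frames (I_1, I_2, ..., I_16).
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        groups.append(frames[start:start + clip_len])
    # Rule 2: every other frame, 16 frames spanning 31 (I_1, I_3, ..., I_31).
    span = 2 * clip_len - 1  # 31 source frames per group
    for start in range(0, len(frames) - span + 1, span):
        groups.append(frames[start:start + span:2])
    # Normalize every frame to 224 x 224 pixels, as required by the network.
    return [[cv2.resize(f, (224, 224)) for f in g] for g in groups]
```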
step three, constructing and training the graph convolution and trajectory convolution network
The network model is divided into three parts: a dual-stream feature extraction layer, a candidate trajectory extraction layer, and a target localization layer. The dual-stream feature extraction layer extracts features jointly with a graph convolution structure and a trajectory convolution structure. The specific operation of the graph convolution is as follows: the target object is first divided into graph nodes, also called parts; specifically, it is divided into M grids of equal size, each grid forming a graph node of identical structure, and an undirected weighted graph G(v, W) is constructed, consisting of the graph nodes v and the weights W of the edges connecting them. The edges between graph nodes of consecutive frames in the 16-frame sequence are weight-initialized with values in {0,1}, i.e. W_ij ∈ {0,1}, where i is a graph node of frame t and j is a graph node of frame t+1. Each graph node is connected only to its four directly adjacent graph nodes, with weight 1 on these edges and 0 elsewhere. The network structure adopts the first five layers of an AlexNet pre-trained on ImageNet, followed by two graph convolution layers; the output feature is computed as F = WX, where X is the feature of each graph node after the five AlexNet layers. Each frame yields h x w x 256 graph convolution features, and the 16 consecutive frames finally yield T x h x w x 256 graph convolution features, where T is the number of frames in the video image sequence (here T = 16), h is the feature height, w is the feature width, and 256 is the number of feature channels;
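As an illustration of the grid graph and the F = WX propagation described above, the following sketch builds the {0,1} four-neighbour adjacency for an m_h x m_w grid of graph nodes (M = m_h * m_w) and applies one propagation step; the function name and the toy feature matrix are assumptions for demonstration.

```python
import numpy as np

def grid_adjacency(m_h, m_w):
    """Undirected {0,1} weight matrix W for an m_h x m_w grid of graph nodes.

    W[i, j] = 1 only when nodes i and j are directly adjacent
    (up/down/left/right), matching the initialization in step three.
    """
    M = m_h * m_w
    W = np.zeros((M, M), dtype=np.float32)
    for r in range(m_h):
        for c in range(m_w):
            i = r * m_w + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < m_h and 0 <= cc < m_w:
                    W[i, rr * m_w + cc] = 1.0
    return W

# X: per-node features from the first five AlexNet layers (toy values here).
m_h, m_w = 3, 3                      # M = 9 graph nodes
X = np.random.randn(m_h * m_w, 256).astype(np.float32)
W = grid_adjacency(m_h, m_w)
F = W @ X                            # one propagation step, F = WX
print(F.shape)                       # (9, 256)
```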
the specific operation of the trajectory convolution is: knowing the target position of each frame of image in the 16 frames of images, wherein each target position is represented as x, y, w, h, the x, y, w, h respectively represent the central abscissa, the central ordinate, the width and the height of the target position, and connecting the target positions between the front frame and the rear frame of the continuous 16 frames of images to obtain a target motion track; the trajectory convolution is adopted inTop five layers of a pre-trained C3D network on ImageNet, given an input profile x at time tt(p) the output characteristic map is yt(p), convolution kernel parameters of trajectory convolution { Wτ:τ∈[0,Δt]And kernel parameter size Δ t-1, where Δ t is 16, output profile yt(p) is calculated as
Figure BDA0002213973050000021
The graph convolution features obtained from each frame are input into the trajectory convolution, finally producing the trajectory convolution features of the 16 consecutive frames, whose dimension is T x h x w x 256; the graph convolution features and trajectory convolution features are then concatenated to form a T x h x w x 512 dimensional feature, where T is the number of frames in the video image sequence (T = 16), h is the feature height, w is the feature width, and 512 is the number of feature channels;
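The trajectory convolution formula can be illustrated with a simplified single-channel sketch: a temporal convolution whose sampling positions follow the target motion trajectory instead of a fixed location. This is not the C3D-based implementation of the patent; traj (per-frame integer target centres) and the border handling via np.roll are simplifying assumptions.

```python
import numpy as np

def trajectory_convolution(feature_maps, traj, kernel):
    """Single-channel sketch of y_t(p) = sum_tau W_tau * x_{t-tau}(p-tilde).

    feature_maps: array (T, H, W), per-frame feature maps x_t.
    traj: int array (T, 2), per-frame (row, col) target centre positions.
    kernel: array (dt + 1,), temporal weights W_tau.
    Returns y: (T, H, W), features aggregated along the motion path.
    """
    T, H, W = feature_maps.shape
    dt = len(kernel) - 1
    y = np.zeros_like(feature_maps)
    for t in range(T):
        for tau in range(dt + 1):
            if t - tau < 0:
                continue  # zero padding before the start of the clip
            # Position p in frame t corresponds to p + d in frame t - tau,
            # where d is the trajectory displacement between the two frames.
            dr = traj[t - tau, 0] - traj[t, 0]
            dc = traj[t - tau, 1] - traj[t, 1]
            # np.roll wraps at the borders; a real implementation would use
            # bilinear sampling with proper boundary handling.
            shifted = np.roll(feature_maps[t - tau], (-dr, -dc), axis=(0, 1))
            y[t] += kernel[tau] * shifted
    return y
```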
taking the target position of the previous frame image as the center, forming a target attention area in the current input frame by taking the target position of the previous frame image as the center and taking 4 times of the target, obtaining target candidate blocks in the target attention area by adopting a sliding search window method, wherein the length-width ratios of the adopted search windows are respectively 1:1, 1:2 and 2:1, moving from the initial coordinate position of the target attention area until the target attention area is searched, taking the image blocks selected by the search window as the target candidate blocks, normalizing the scales of the image blocks into the size same as that of a target object, connecting each target candidate block with the target positions of the previous 16 frames of images to form a new target motion track, then passing the double-flow characteristics of the continuous 17 frames of images through an LSTM structure to obtain N target candidate tracks, wherein the dimensions are Nx 4, N is the number of the target candidate tracks, and 4 represents 4 position coordinates of the target position of each frame of image, in particular, setting the loss function of the target candidate trajectory network as
Figure BDA0002213973050000031
T is the image frame number, Delta theta is the deviation of the predicted value and the true value of the coordinate, and the position coordinate of the target candidate block of the current input frame is represented as x0,y0,w0,h0Wherein x is0,y0,w0,h0Respectively representing the center abscissa of the target candidate block,Center ordinate, width and height, while the offset value for coordinate prediction is Δ x0,Δy0,Δw0,Δh0Then the coordinates of each target candidate block are x0+Δx0,y0+Δy0,w0+Δw0,h0+Δh0Connecting target motion tracks of continuous 16-frame images with each target candidate block to form target candidate tracks, finally obtaining N target candidate tracks by learning a target motion rule, inputting double-current characteristics of the N target candidate tracks into a full-connection layer for classification, and setting a network classification loss function as cross entropy loss;
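The sliding-window generation of target candidate blocks inside the 4x attention region can be sketched as follows; the stride value and the interpretation of the 1:1, 1:2, 2:1 aspect ratios relative to the target size are assumptions for illustration.

```python
def candidate_blocks(prev_box, stride=8):
    """Enumerate target candidate blocks inside the 4x attention region.

    prev_box: (cx, cy, w, h), the target position in the previous frame.
    Search windows of aspect ratio 1:1, 1:2 and 2:1 (taken here relative to
    the previous target size) slide over the region with the given stride.
    Returns a list of candidate boxes as (cx, cy, w, h) tuples.
    """
    cx, cy, w, h = prev_box
    region_w, region_h = 4 * w, 4 * h              # attention region size
    x0, y0 = cx - region_w / 2, cy - region_h / 2  # region top-left corner
    boxes = []
    for ww, wh in ((w, h), (w, 2 * h), (2 * w, h)):  # 1:1, 1:2, 2:1
        y = y0 + wh / 2
        while y + wh / 2 <= y0 + region_h:
            x = x0 + ww / 2
            while x + ww / 2 <= x0 + region_w:
                # Each selected block is later rescaled to the target size.
                boxes.append((x, y, ww, wh))
                x += stride
            y += stride
    return boxes
```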
after the network is constructed, training the network by using the training data set generated in the second step, wherein the training method adopts a classical random gradient descent method, after the training is finished, the network outputs the confidence (namely, the similarity) of each target candidate track, then selects the target candidate track with the maximum confidence as a target motion track, and then takes the target position of the last frame image of the target motion track as a target image block, so as to obtain the initial capability of target positioning;
step four, inputting image sequence
After the graph convolution and trajectory convolution network has been trained, in the case of real-time processing, the video images captured by the camera and saved in the storage area are extracted as the input images to be tracked; in the case of offline processing, the acquired video file is decomposed into an image sequence of individual frames, from which 16 consecutive frames are extracted in temporal order as the input image sequence; if a full sequence of 16 input frames cannot be formed, the whole process stops;
step five, generating target candidate trajectories
The target object in the 16 consecutive frames is divided into M graph nodes according to the method of step three, and the target object positions between consecutive frames of the 16 images are connected to obtain the target motion trajectory; these are input into the dual-stream feature extraction layer, which extracts dual-stream features of dimension T x h x w x 512, and the candidate trajectory extraction layer of the graph convolution and trajectory convolution network then yields N target candidate trajectories of dimension N x 4, where N is the number of target candidate trajectories and 4 denotes the 4 position coordinates of the target in each frame;
step six, target positioning
The target candidate trajectories obtained in step five are classified by the fully connected layer; the network outputs the confidence of each candidate trajectory, the candidate trajectory with the highest confidence is selected as the target motion trajectory, and the target position in its last frame is taken as the target image block, at which point the target localization is completed;
step seven, network online updating
After the tracked target result is successfully determined, the target object and position coordinates of the current input frame obtained in step six are appended to the end of the 16-frame image sequence of the initial training set, while the first frame of the sequence is deleted, updating it to a new training set denoted (I_2, ..., I_17); the process then jumps to step four, a new training set of 16 consecutive frames is obtained, the target motion trajectory is dynamically adjusted in real time, and online learning fine-tunes and updates the network before a new round of target localization is performed.
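The step-seven sliding-window update of the 16-frame training sequence can be sketched with a deque; fine_tune below stands in for the network's online learning step and is an assumed helper, not defined in the patent.

```python
from collections import deque

# Sliding 16-frame window of (frame, target_box) pairs, e.g. (I_1, ..., I_16).
window = deque(maxlen=16)  # appending frame 17 drops frame 1 automatically

def online_update(window, new_frame, new_box, fine_tune):
    """Append the newly localized frame, drop the oldest, then fine-tune.

    After the update the window holds (I_2, ..., I_17), matching step seven;
    fine_tune performs the network's online learning on the updated set.
    """
    window.append((new_frame, new_box))
    if len(window) == window.maxlen:
        fine_tune(list(window))
    return window
```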
The advantages and positive effects of the invention are as follows. A target tracking method based on graph convolution and trajectory convolution network learning is provided. The method trains a graph convolution and trajectory convolution network model offline on a training data set; the network comprises a dual-stream feature extraction layer, a target candidate trajectory extraction layer, and a target localization layer. Graph convolution extracts the spatial features of the target in each frame, trajectory convolution extracts the trajectory features over consecutive video frames, a long short-term memory (LSTM) recurrent structure produces target candidate trajectories that follow the target motion pattern, and a fully connected layer classifies them. The network outputs the confidence of each candidate trajectory, selects the one with the highest confidence as the target motion trajectory, and takes the target block of its last frame as the target image block; after training, the network has an initial target localization capability. During tracking, the spatial features and motion trajectory features of the target over 16 consecutive frames are extracted and concatenated into dual-stream features, candidate trajectories following the target motion pattern are obtained through the LSTM structure, the network outputs the confidence of each candidate trajectory, the highest-confidence candidate is selected as the target motion trajectory, and the target block of its last frame becomes the target image block, completing target localization and thereby tracking the target object. During online learning, the network model is fine-tuned with the tracked target image blocks, so that it can dynamically adjust the target motion trajectory and better adapt to the current image sequence.
The network model fully extracts the features of a continuously moving target, including its spatial position features and motion trajectory features. In the feature extraction, larger weights can be assigned to the more discriminative parts of the target for tracking. Meanwhile, candidate trajectories are generated within the motion constraint range implied by the target motion trajectory, which reduces the probability of target drift or even target loss, greatly reduces the computation required for target localization, and improves the robustness and accuracy of tracking. The invention can handle complex tracking scenes, achieve accurate long-term real-time tracking, and cope with occlusion, drift, and similar problems during tracking. In addition, the method can be used for both single-target and multi-target tracking in complex scenes.
Drawings
FIG. 1 is a schematic diagram of the graph node connections of the present invention
FIG. 2 is a block diagram of the present invention
FIG. 3 is a flow chart of the present invention
Detailed Description
The method can be used in many visual target tracking applications, both military and civilian: military fields such as unmanned aircraft, precision guidance, and air early warning; civilian fields such as mobile robots, intelligent video surveillance of traction substations, intelligent transportation systems, and intelligent security. Take intelligent video surveillance of a traction substation as an example. Such surveillance involves several important automatic analysis tasks, including intrusion detection, behavior analysis, and abnormality alarms, all of which depend on real-time, stable target tracking. The tracking method of the invention can be adopted for this purpose. Specifically, a graph convolution and trajectory convolution network model is first constructed, comprising a dual-stream feature extraction layer, a target candidate trajectory extraction layer, and a target localization layer, as shown in FIG. 2. Targets in the substation surveillance video are then manually annotated to obtain a corresponding training data set, and the network is trained on this set with stochastic gradient descent; after training, the network initially has the corresponding target localization capability. During tracking, the spatial features and motion trajectory features of the target over 16 consecutive frames are extracted and concatenated into dual-stream features; candidate trajectories following the target motion pattern are obtained through the LSTM structure and classified by the fully connected layer; the network outputs the confidence of each candidate trajectory, selects the highest-confidence one as the target motion trajectory, and takes the target block of its last frame as the target image block, completing localization and thus tracking. During online learning, the model is fine-tuned with the tracked target image blocks so that it dynamically adjusts the target motion trajectory, better adapts to the actual surveillance image sequence in the substation, and effectively improves the robustness and accuracy of tracking. The invention can handle complex tracking scenes, achieve accurate long-term real-time tracking, and cope with occlusion and drift during tracking; it can also be used for single-target and multi-target tracking in complex scenes.
The method can be implemented by programming in any computer programming language (such as C), and tracking system software based on the method can realize real-time target tracking applications on any PC or embedded system.

Claims (1)

1. A target tracking method based on graph convolution and trajectory convolution network learning comprises the following steps:
step one, target selection
A target object to be tracked is selected and determined from the initial image sequence; the target object is either extracted automatically by a moving-target detection method or specified manually by a human-computer interaction method;
step two, generation of training data set
The generation of the training data set has two steps: first the data set is selected, and then the training samples are constructed. The large classification and recognition video data set ImageNet Video is selected, in which all images are annotated with the position coordinates of the corresponding target object, and the training data set is then built from these known labels. The data set contains 4500 videos in total, and training samples are drawn from each video according to two different selection rules: 16 consecutive frames (I_1, I_2, ..., I_16) are taken as one group of training data, and one frame out of every two, i.e. (I_1, I_3, ..., I_31), is taken as another group, where I denotes a frame image and each sampled group contains 16 frames. In total, 56250 groups of training data are generated, and the image frames are normalized to 224 x 224 pixels;
step three, constructing and training the graph convolution and trajectory convolution network
The network model is divided into three parts: a dual-stream feature extraction layer, a candidate trajectory extraction layer, and a target localization layer. The dual-stream feature extraction layer extracts features jointly with a graph convolution structure and a trajectory convolution structure. The specific operation of the graph convolution is as follows: the target object is first divided into graph nodes, also called parts; specifically, it is divided into M grids of equal size, each grid forming a graph node of identical structure, and an undirected weighted graph G(v, W) is constructed, consisting of the graph nodes v and the weights W of the edges connecting them. The edges between graph nodes of consecutive frames in the 16-frame sequence are weight-initialized with values in {0,1}, i.e. W_ij ∈ {0,1}, where i is a graph node of frame t and j is a graph node of frame t+1. Each graph node is connected only to its four directly adjacent graph nodes, with weight 1 on these edges and 0 elsewhere. The network structure adopts the first five layers of an AlexNet pre-trained on ImageNet, followed by two graph convolution layers; the output feature is computed as F = WX, where X is the feature of each graph node after the five AlexNet layers. Each frame yields h x w x 256 graph convolution features, and the 16 consecutive frames finally yield T x h x w x 256 graph convolution features, where T is the number of frames in the video image sequence (here T = 16), h is the feature height, w is the feature width, and 256 is the number of feature channels;
the specific operation of the trajectory convolution is: knowing the target position of each frame of image in the 16 frames of images, wherein each target position is represented as x, y, w, h, the x, y, w, h respectively represent the central abscissa, the central ordinate, the width and the height of the target position, and connecting the target positions between the front frame and the rear frame of the continuous 16 frames of images to obtain a target motion track; trace convolution takes the first five layers of a C3D network pre-trained on ImageNet, given an input feature map x at time tt(p) the output characteristic map is yt(p), convolution kernel parameters of trajectory convolution { Wτ:τ∈[0,Δt]And kernel parameter size Δ t-1, where Δ t is 16, output profile yt(p) is calculated as
Figure FDA0002213973040000011
The graph convolution features obtained from each frame are input into the trajectory convolution, finally producing the trajectory convolution features of the 16 consecutive frames, whose dimension is T x h x w x 256; the graph convolution features and trajectory convolution features are then concatenated to form a T x h x w x 512 dimensional feature, where T is the number of frames in the video image sequence (T = 16), h is the feature height, w is the feature width, and 512 is the number of feature channels;
taking the target position of the image of the previous frame as the center, forming a target attention area in the current input frame by taking the target position as 4 times of the target, acquiring target candidate blocks in the target attention area by adopting a sliding search window method, wherein the aspect ratios of the adopted search windows are respectively1:1, 1:2 and 2:1, moving from an initial coordinate position of a target attention area until the target attention area is searched, taking an image block selected by a search window as a target candidate block, normalizing the dimension of the image block to be the same as that of a target object, connecting each target candidate block with the target position of the previous 16 frames of images to form a new target motion track, then passing the double-flow characteristics of the continuous 17 frames of images through an LSTM (local Scale invariant feature) structure to obtain N target candidate tracks with dimensions of Nx 4, wherein N is the number of the target candidate tracks, 4 represents 4 position coordinates of the target position of each frame of images, and specifically, setting the loss function of a target candidate track network as the loss function
Figure FDA0002213973040000021
T is the image frame number, Delta theta is the deviation of the predicted value and the true value of the coordinate, and the position coordinate of the target candidate block of the current input frame is represented as x0,y0,w0,h0Wherein x is0,y0,w0,h0Respectively representing the center abscissa, center ordinate, width and height of the target candidate block, and the offset value of coordinate prediction is Δ x0,Δy0,Δw0,Δh0Then the coordinates of each target candidate block are x0+Δx0,y0+Δy0,w0+Δw0,h0+Δh0Connecting target motion tracks of continuous 16-frame images with each target candidate block to form target candidate tracks, finally obtaining N target candidate tracks by learning a target motion rule, inputting double-current characteristics of the N target candidate tracks into a full-connection layer for classification, and setting a network classification loss function as cross entropy loss;
after the network is constructed, training the network by using the training data set generated in the second step, wherein the training method adopts a classical random gradient descent method, after the training is finished, the network outputs the confidence coefficient of each target candidate track, then selects the target candidate track with the maximum confidence coefficient as a target motion track, and then takes the target position of the last frame image of the target motion track as a target image block, so as to obtain the initial capability of target positioning;
step four, inputting image sequence
After the graph convolution and trajectory convolution network has been trained, in the case of real-time processing, the video images captured by the camera and saved in the storage area are extracted as the input images to be tracked; in the case of offline processing, the acquired video file is decomposed into an image sequence of individual frames, from which 16 consecutive frames are extracted in temporal order as the input image sequence; if a full sequence of 16 input frames cannot be formed, the whole process stops;
step five, generating target candidate trajectories
The target object in the 16 consecutive frames is divided into M graph nodes according to the method of step three, and the target object positions between consecutive frames of the 16 images are connected to obtain the target motion trajectory; these are input into the dual-stream feature extraction layer, which extracts dual-stream features of dimension T x h x w x 512, and the candidate trajectory extraction layer of the graph convolution and trajectory convolution network then yields N target candidate trajectories of dimension N x 4, where N is the number of target candidate trajectories and 4 denotes the 4 position coordinates of the target in each frame;
step six, target positioning
The target candidate trajectories obtained in step five are classified by the fully connected layer; the network outputs the confidence of each candidate trajectory, the candidate trajectory with the highest confidence is selected as the target motion trajectory, and the target position in its last frame is taken as the target image block, at which point the target localization is completed;
step seven, network online updating
After the tracked target result is successfully determined, the target object and position coordinates of the current input frame obtained in step six are appended to the end of the 16-frame image sequence of the initial training set, while the first frame of the sequence is deleted, updating it to a new training set denoted (I_2, ..., I_17); the process then jumps to step four, a new training set of 16 consecutive frames is obtained, the target motion trajectory is dynamically adjusted in real time, and online learning fine-tunes and updates the network before a new round of target localization is performed.
Priority Applications (1)

Application Number: CN201910908419.XA; Priority Date: 2019-09-25; Filing Date: 2019-09-25; Title: Target tracking method based on graph convolution and trajectory convolution network learning; Status: Active; Granted as: CN110660082B

Publications (2)

CN110660082A, published 2020-01-07
CN110660082B, published 2022-03-08

Family ID: 69039024 (Family Applications: 1)

Country Status (1)

CN: CN110660082B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291631A (en) * 2020-01-17 2020-06-16 北京市商汤科技开发有限公司 Video analysis method and related model training method, device and apparatus
CN111339449A (en) * 2020-03-24 2020-06-26 青岛大学 User motion trajectory prediction method, device, equipment and storage medium
CN111382318A (en) * 2020-03-14 2020-07-07 平顶山学院 Dynamic community detection method based on information dynamics
CN111524164A (en) * 2020-04-21 2020-08-11 北京爱笔科技有限公司 Target tracking method and device and electronic equipment
CN111626121A (en) * 2020-04-24 2020-09-04 上海交通大学 Complex event identification method and system based on multi-level interactive reasoning in video
CN111862153A (en) * 2020-07-10 2020-10-30 电子科技大学 Long-time multi-target tracking method for pedestrians
CN111881840A (en) * 2020-07-30 2020-11-03 北京交通大学 Multi-target tracking method based on graph network
CN112465006A (en) * 2020-11-24 2021-03-09 中国人民解放军海军航空大学 Graph neural network target tracking method and device
CN112541449A (en) * 2020-12-18 2021-03-23 天津大学 Pedestrian trajectory prediction method based on unmanned aerial vehicle aerial photography view angle
CN112597796A (en) * 2020-11-18 2021-04-02 中国石油大学(华东) Robust point cloud representation learning method based on graph convolution
CN112598698A (en) * 2021-03-08 2021-04-02 南京爱奇艺智能科技有限公司 Long-time single-target tracking method and system
CN113221676A (en) * 2021-04-25 2021-08-06 中国科学院半导体研究所 Target tracking method and device based on multi-dimensional features
CN113253684A (en) * 2021-05-31 2021-08-13 杭州蓝芯科技有限公司 Multi-AGV (automatic guided vehicle) scheduling method and device based on graph convolution neural network and electronic equipment
CN113362368A (en) * 2021-07-26 2021-09-07 北京邮电大学 Crowd trajectory prediction method based on multi-level space-time diagram neural network
CN113435356A (en) * 2021-06-30 2021-09-24 吉林大学 Track prediction method for overcoming observation noise and perception uncertainty
CN113505812A (en) * 2021-06-11 2021-10-15 国网浙江省电力有限公司嘉兴供电公司 High-voltage circuit breaker track action identification method based on double-current convolutional network
CN113910224A (en) * 2021-09-30 2022-01-11 达闼科技(北京)有限公司 Robot following method and device and electronic equipment
CN114789440A (en) * 2022-04-22 2022-07-26 深圳市正浩创新科技股份有限公司 Target docking method, device, equipment and medium based on image recognition
CN114897941A (en) * 2022-07-13 2022-08-12 长沙超创电子科技有限公司 Target tracking method based on Transformer and CNN
CN117079196A (en) * 2023-10-16 2023-11-17 长沙北斗产业安全技术研究院股份有限公司 Unmanned aerial vehicle identification method based on deep learning and target motion trail


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663492A (en) * 2012-03-19 2012-09-12 南京理工大学常熟研究院有限公司 Maneuvering target tracking system based on nerve network data fusion
CN104794737A (en) * 2015-04-10 2015-07-22 电子科技大学 Depth-information-aided particle filter tracking method
CN104778699A (en) * 2015-04-15 2015-07-15 西南交通大学 Adaptive object feature tracking method
CN107818571A (en) * 2017-12-11 2018-03-20 珠海大横琴科技发展有限公司 Ship automatic tracking method and system based on deep learning network and average drifting
US10176405B1 (en) * 2018-06-18 2019-01-08 Inception Institute Of Artificial Intelligence Vehicle re-identification techniques using neural networks for image analysis, viewpoint-aware pattern recognition, and generation of multi- view vehicle representations
CN109800689A (en) * 2019-01-04 2019-05-24 西南交通大学 A kind of method for tracking target based on space-time characteristic fusion study

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party

Title
Limin Wang et al., "Action recognition with trajectory-pooled deep-convolutional descriptors", 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
闵召阳 et al., "Single-camera multi-target tracking algorithm based on convolutional neural network detection" (基于卷积神经网络检测的单镜头多目标跟踪算法), Ship Electronic Engineering (舰船电子工程) *


Also Published As

Publication Number: CN110660082B (en); Publication Date: 2022-03-08

Similar Documents

Publication Publication Date Title
CN110660082B (en) Target tracking method based on graph convolution and trajectory convolution network learning
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN110298404B (en) Target tracking method based on triple twin Hash network learning
CN107122736B (en) Human body orientation prediction method and device based on deep learning
CN110569793A (en) Target tracking method for unsupervised similarity discrimination learning
CN111161315B (en) Multi-target tracking method and system based on graph neural network
CN107146237B (en) Target tracking method based on online state learning and estimation
CN105528794A (en) Moving object detection method based on Gaussian mixture model and superpixel segmentation
CN106570490B (en) A kind of pedestrian's method for real time tracking based on quick clustering
CN107833239B (en) Optimization matching target tracking method based on weighting model constraint
CN109993770B (en) Target tracking method for adaptive space-time learning and state recognition
CN108537825B (en) Target tracking method based on transfer learning regression network
CN112052802A (en) Front vehicle behavior identification method based on machine vision
CN109493370B (en) Target tracking method based on space offset learning
CN109272036B (en) Random fern target tracking method based on depth residual error network
CN113096159B (en) Target detection and track tracking method, model and electronic equipment thereof
CN114820765A (en) Image recognition method and device, electronic equipment and computer readable storage medium
CN112507859B (en) Visual tracking method for mobile robot
Gong et al. Multi-target trajectory tracking in multi-frame video images of basketball sports based on deep learning
CN110197121A (en) Moving target detecting method, moving object detection module and monitoring system based on DirectShow
CN110111358B (en) Target tracking method based on multilayer time sequence filtering
CN113379795A (en) Multi-target tracking and segmenting method based on conditional convolution and optical flow characteristics
Casagrande et al. Abnormal motion analysis for tracking-based approaches using region-based method with mobile grid
Elbaşi Fuzzy logic-based scenario recognition from video sequences
CN110378938A (en) A kind of monotrack method based on residual error Recurrent networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant