CN107274433B - Target tracking method and device based on deep learning and storage medium - Google Patents

Target tracking method and device based on deep learning and storage medium

Info

Publication number
CN107274433B
CN201710474118.1A, CN107274433B
Authority
CN
China
Prior art keywords
target
frame
current frame
area
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710474118.1A
Other languages
Chinese (zh)
Other versions
CN107274433A (en)
Inventor
王欣
石祥文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN201710474118.1A priority Critical patent/CN107274433B/en
Publication of CN107274433A publication Critical patent/CN107274433A/en
Application granted granted Critical
Publication of CN107274433B publication Critical patent/CN107274433B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30221Sports video; Sports image
    • G06T2207/30224Ball; Puck
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30236Traffic on road, railway or crossing

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A target tracking method, device and storage medium based on deep learning. The method reads two consecutive frames of pictures; sets and crops a target area of the previous frame and a search area of the current frame, where the center point of the search area of the current frame is set by judging whether the object is moving rapidly or stably; inputs the target area and the search area into a convolutional neural network to calculate the target area of the current frame; calculates the inter-frame displacement of the current frame target relative to the previous frame; and judges whether the current frame is the last frame so as to decide whether iterative target tracking continues. By judging how rapidly the target object moves in the image, the invention predicts the center point position of the cropping area of the current frame; compared with existing algorithms it improves the target tracking accuracy and the target overlap while basically keeping the original high tracking speed, and has better tracking robustness.

Description

Target tracking method and device based on deep learning and storage medium
Technical Field
The present invention relates to the field of image processing, and in particular, to a target tracking method and apparatus based on deep learning in image processing, and a storage medium.
Background
Target tracking is a challenging research topic in the field of computer vision, and is a research hotspot because it is widely applied in many fields such as security, transportation, military affairs, virtual reality and medical imaging. The aim of target tracking is to determine the successive positions of a target object in an ordered image sequence so that further analysis and processing can be carried out, thereby analyzing and understanding the motion behavior of the target object. Since the beginning of the twenty-first century, information technology has developed rapidly, the computing performance of computers and the acquisition quality of image acquisition equipment such as cameras have gradually improved, and, as people attach increasing importance to personal and property safety, more and more experts and scholars have devoted themselves to research on target tracking technology.
Target tracking technology is one of the core research subjects in the field of computer vision and involves various technologies such as computer graphics, target recognition, artificial intelligence and automatic control. It originated in the 1950s; through more than 60 years of continuous development, various tracking algorithms have been proposed, such as the Mean Shift algorithm, the background difference method, background modeling, the optical flow method, the Kalman filter, the particle filter, and various improved algorithms based on them. However, these algorithms basically suffer from certain problems and defects, such as low tracking accuracy or poor real-time performance, and find it difficult to meet the various requirements of real-world applications.
Since the concept of Deep Learning was proposed in 2006, research on deep learning has become popular; more and more experts and scholars have devoted themselves to it, and deep learning has made breakthrough progress in many fields and is widely applied to computer vision, image processing, natural language processing, information classification, search, big data and other fields. Naturally, attempts have been made to solve the target tracking problem with deep learning methods. However, algorithms that address target tracking with deep learning are often slow due to the huge amount of computation and have poor real-time performance, so it is difficult for them to meet the requirements of practical applications.
Therefore, how to improve both tracking accuracy and tracking efficiency in target tracking is a technical problem that needs to be solved urgently in the prior art.
Disclosure of Invention
The invention aims to provide a target tracking method, device and storage medium based on deep learning, which process an input video frame by frame to achieve accurate tracking of a target object; through offline training with a large amount of labeled data, the neural network obtains stronger feature generalization capability and the tracking precision is improved; and through cropping, GPU acceleration and other means, the operation speed is accelerated and the tracking efficiency is improved.
In order to achieve the purpose, the invention adopts the following technical scheme:
a target tracking method based on deep learning comprises the following steps:
picture reading step S110: continuously reading two frames of pictures, including a previous frame of picture and a current frame of picture, wherein the previous frame of picture has a calculated target position, and the current frame of picture needs to calculate the target position;
area setting step S120: respectively setting and cutting a target area of a previous frame and a search area of a current frame;
the setting and cutting of the target area of the previous frame specifically comprises: the center point position c = (cx, cy) of the target in the previous frame is known; a rectangular frame centered at this point is used as a first bounding box to mark the target object, the height of the first bounding box being h and the width being w; the target area obtained after cutting has a height of k1·h and a width of k1·w, and the parameter k1 is used for controlling the size of the target area;
the setting and cutting of the search area of the current frame specifically comprises: judging whether the motion of the object in the image is stable; if the speed is stable, the center point position c' = (c'x, c'y) of the search area of the current frame is equal to the known target center point position c = (cx, cy) of the previous frame plus the inter-frame displacement S of the target between the previous two frames; if the speed changes drastically, for example decreases or increases rapidly, the center point position c' = (c'x, c'y) of the search area of the current frame is equal to the known target center point position c = (cx, cy) of the previous frame, i.e. the target center point position of the previous frame is used as the cutting center of the current frame; a rectangular frame is used as a second bounding box for marking, the height of the second bounding box being h and the width being w; the search area obtained after cutting has a height of k2·h and a width of k2·w, and the parameter k2 is used for controlling the size of the search area;
a feature extraction and comparison step S130: inputting the target area and the search area into a Convolutional Neural Network (CNN), performing feature extraction and feature comparison, and calculating to obtain the target area of the current frame;
interframe displacement calculation step S140: calculating to obtain the interframe displacement of the current frame relative to the target of the previous frame by using the target area of the current frame and the target area of the previous frame;
a judgment step S150: and judging whether the current frame is the last frame, if so, finishing the tracking, otherwise, entering a picture reading step S110, continuously reading two continuous frames of pictures, and continuously carrying out iterative target tracking.
Preferably, in the region setting step S120, the step of determining whether the target object moves smoothly in the image is: comparing the interframe displacement of the target of two adjacent frames in three continuous frames before the current frame, and if the interframe displacement difference of two adjacent frames in the three continuous frames is smaller, considering that the motion is stable; if the interframe displacement difference of two adjacent frames in the three continuous frames is large, the movement speed is considered to be changed violently.
Preferably, in the area setting step S120, it is determined whether the inter-frame displacement difference between two adjacent frames in the three consecutive frames is smaller than 1/3 of the inter-frame displacement between the two previous frames;
the parameters k2 and k1 for controlling the sizes of the regions both take the value 2.
Preferably, in the region setting step S120, in order to avoid the situation that the actual position of the current frame target exceeds the second bounding box due to too fast a change of the moving speed, the size of the second bounding box is increased when the speed changes drastically, i.e. the value of k2 is increased.
Preferably, the feature extraction and comparison step S130 first performs feature extraction on the target region and the search region in the convolutional layers, then inputs the extracted features into the fully connected layers, performs feature comparison of the target region and the search region in the fully connected layers, and finally obtains the target region of the current frame after calculation.
The invention further discloses a target tracking device based on deep learning, which comprises the following components:
a picture reading unit: continuously reading two frames of pictures, including a previous frame of picture and a current frame of picture, wherein the previous frame of picture has a calculated target position, and the current frame of picture needs to calculate the target position;
an area setting unit: respectively setting and cutting a target area of a previous frame and a search area of a current frame;
the setting and cutting of the target area of the previous frame specifically comprises: the center point position c = (cx, cy) of the target in the previous frame is known; a rectangular frame centered at this point is used as a first bounding box to mark the target object, the height of the first bounding box being h and the width being w; the target area obtained after cutting has a height of k1·h and a width of k1·w, and the parameter k1 is used for controlling the size of the target area;
the setting and cutting of the search area of the current frame specifically comprises: judging whether the motion of the object in the image is stable; if the speed is stable, the center point position c' = (c'x, c'y) of the search area of the current frame is equal to the known target center point position c = (cx, cy) of the previous frame plus the inter-frame displacement S of the target between the previous two frames; if the speed changes drastically, for example decreases or increases rapidly, the center point position c' = (c'x, c'y) of the search area of the current frame is equal to the known target center point position c = (cx, cy) of the previous frame, i.e. the target center point position of the previous frame is used as the cutting center of the current frame; a rectangular frame is used as a second bounding box for marking, the height of the second bounding box being h and the width being w; the search area obtained after cutting has a height of k2·h and a width of k2·w, and the parameter k2 is used for controlling the size of the search area;
a feature extraction and comparison unit: inputting the target area and the search area into a Convolutional Neural Network (CNN), performing feature extraction and feature comparison, and calculating to obtain the target area of the current frame;
an interframe displacement calculation unit: calculating to obtain the interframe displacement of the current frame relative to the target of the previous frame by using the target area of the current frame and the target area of the previous frame;
a judging unit: and judging whether the current frame is the last frame, if so, finishing the tracking, otherwise, continuously reading two continuous frames of pictures by the picture reading unit, and performing iterative target tracking.
Preferably, in the region setting unit, the determining whether the object moves smoothly in the image is: comparing the interframe displacement of the target of two adjacent frames in three continuous frames before the current frame, and if the interframe displacement difference of two adjacent frames in the three continuous frames is smaller, considering that the motion is stable; if the interframe displacement difference of two adjacent frames in the three continuous frames is large, the movement speed is considered to be changed violently.
Preferably, in the region setting unit, it is judged whether the inter-frame displacement difference between two adjacent frames in the three consecutive frames is smaller than 1/3 of the inter-frame displacement between the previous two frames;
the parameters k2 and k1 for controlling the sizes of the regions both take the value 2.
Preferably, in the region setting unit (220), in order to avoid the situation that the actual position of the current frame target exceeds the second bounding box due to too fast a change of the motion speed, the size of the second bounding box is increased when the speed changes drastically, i.e. the value of k2 is increased; and/or
the feature extraction and comparison unit (230) first performs feature extraction on the target area and the search area in the convolutional layers, then inputs the extracted features into the fully connected layers, performs feature comparison of the target area and the search area in the fully connected layers, and finally obtains the target area of the current frame after calculation.
A storage medium for storing computer-executable instructions,
the computer executable instructions, when executed by a processor, perform the object tracking method as described above.
By judging whether the object moves rapidly or stably in the image, the invention sets the center point position of the cutting area of the current frame from the target center point position of the previous frame; compared with existing algorithms it improves the target tracking accuracy and the target overlap, basically keeps the original high tracking speed, and has better algorithm robustness.
Drawings
FIG. 1 is a schematic diagram of a deep learning based target tracking method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram of a target tracking method based on deep learning according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a motion model of a deep learning based target tracking method according to an embodiment of the invention;
FIG. 4 is a comparative example of tracking robustness of a target tracking method according to a specific embodiment of the present invention;
FIG. 5 is another comparative example of tracking robustness of a target tracking method according to a specific embodiment of the present invention;
Fig. 6 is a block diagram of a target tracking apparatus based on deep learning according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Referring to fig. 1, a network architecture diagram of a deep learning based target tracking method according to the present invention is shown.
The invention is an iterative loop method. The position of the target in the previous frame, including the target center, is known; a rectangular frame centered at the target position is set as a first bounding box to mark the target object, and the target area is obtained by cutting after the box is expanded. The search position of the current frame is predicted from the target position of the previous frame; a rectangular frame centered at the search position is set as a second bounding box, and the search area is obtained by cutting after the box is expanded. The sizes of the target area and the search area may be the same or different. The two areas are input into a convolutional neural network (CNN) for calculation to obtain the target position of the current frame.
In the invention, the Caffe (Convolutional Architecture for Fast Feature Embedding) framework is preferably used for calculation. The convolutional layers of the network adopt the first 5 convolutional layers of CaffeNet; the rear 3 layers are fully connected layers, each with 4096 neural nodes; the final output layer has 4 neural nodes and outputs the two coordinate pairs of the upper-left and lower-right corners of the tracking target, from which the target position of the current frame is calculated.
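The patent itself specifies only the layer and node counts above (Caffe, five CaffeNet convolutional layers, three 4096-node fully connected layers, a 4-node output). Purely as an illustration of that shape, the following sketch re-expresses such a two-branch regression network in PyTorch; the use of torchvision's AlexNet as a stand-in for CaffeNet, the 227×227 input size and all names are assumptions, not the patent's implementation.

```python
# Minimal sketch (not the patent's Caffe prototxt) of the two-branch regression
# network: two image crops go through shared convolutional layers, their features
# are concatenated, and three fully connected layers of 4096 units regress 4
# values (top-left and bottom-right corners of the target in the search area).
import torch
import torch.nn as nn
import torchvision.models as models

class TrackerNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for "the first 5 convolutional layers of CaffeNet":
        # AlexNet's feature extractor is the closest torchvision equivalent.
        self.conv = models.alexnet(weights=None).features
        feat_dim = 256 * 6 * 6              # conv output size for a 227x227 input
        self.fc = nn.Sequential(
            nn.Linear(2 * feat_dim, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4),             # x1, y1, x2, y2 in search-area coordinates
        )

    def forward(self, target_crop, search_crop):
        # Both crops are (N, 3, 227, 227); the convolutional weights are shared.
        f_t = self.conv(target_crop).flatten(1)
        f_s = self.conv(search_crop).flatten(1)
        return self.fc(torch.cat([f_t, f_s], dim=1))
```

Both crops pass through the same convolutional weights, and only the concatenated features are compared in the fully connected layers, which keeps the per-frame cost low.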
With further reference to fig. 2, there is shown a flow chart of the target tracking method based on deep learning according to the present invention, comprising the following steps:
picture reading step S110: and continuously reading two frames of pictures, including a previous frame of picture and a current frame of picture, wherein the previous frame of picture has a calculated target position, and the current frame of picture needs to calculate the target position.
As mentioned above, the invention is a loop iteration algorithm. In step S110, one of the two consecutive frames read each time was already read in the previous iteration. For example: this time the (t-1)-th frame and the t-th frame are read, where the target position of the (t-1)-th frame is known and the target position of the t-th frame needs to be calculated; next time the t-th frame and the (t+1)-th frame are read, and the cutting center of the (t+1)-th frame is then calculated.
Area setting step S120: respectively setting and cutting a target area of a previous frame and a search area of a current frame;
the setting and cutting of the target area of the previous frame specifically comprises: the center point position c = (cx, cy) of the target in the previous frame is known; a rectangular frame centered at this point is used as a first bounding box to mark the target object, the height of the first bounding box being h and the width being w; the target area obtained after cutting has a height of k1·h and a width of k1·w, and the parameter k1 is used for controlling the size of the target area;
the setting and cutting of the search area of the current frame specifically comprises: judging whether the motion of the object in the image is stable; if the motion is stable, the center point position c' = (c'x, c'y) of the search area of the current frame is equal to the known target center point position c = (cx, cy) of the previous frame plus the inter-frame displacement S of the target between the previous two frames; if the speed changes drastically, for example decreases or increases rapidly, the center point position c' = (c'x, c'y) of the search area of the current frame is equal to the known target center point position c = (cx, cy) of the previous frame, i.e. the target center point position of the previous frame is used as the cutting center of the current frame; a rectangular frame is used as a second bounding box for marking, the height of the second bounding box being h and the width being w; the search area obtained after cutting has a height of k2·h and a width of k2·w, and the parameter k2 is used for controlling the size of the search area.
In one embodiment, k2 and k1 both take the value 2.
Further, judging whether the object moves stably in the image is done as follows: the inter-frame displacements of the target between adjacent frames in the three consecutive frames before the current frame are compared; if the inter-frame displacement difference between two adjacent frame pairs is small, for example smaller than 1/3 of the inter-frame displacement between the previous two frames, the motion speed is considered stable; if the difference is large, for example greater than 1/3 of the inter-frame displacement between the previous two frames, the speed is considered to change drastically. The inter-frame displacement refers to the change of the position of the target in the image between two consecutive frames.
Specifically, the picture of the previous frame (the (t-1)-th frame) is cropped first, so that the tracking target is located in the middle of the cropped image block. In the tracking process, the target object is marked with a rectangular frame serving as the first bounding box; the coordinates of the center point of the bounding box are c = (cx, cy), its height and width are h and w, and the height and width of the cropped picture are k1·h and k1·w respectively. The parameter k1 controls the size of the target area and determines the amount of background information in the cropped picture: the larger the value of k1, the larger the area of the cropped picture and the more background information it contains; likewise, the smaller the value of k1, the smaller the cropped picture and the less background information it contains. For objects whose motion speed changes strongly, k1 should be increased to enlarge the target region; in the experimental environment of the invention k1 takes the value 2.
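As a concrete illustration of the cropping just described, the sketch below cuts a k1·h by k1·w window around the known target center. It is a minimal NumPy version written for this description; zero-padding at the image border is one possible choice, since the patent does not specify the border handling.

```python
import numpy as np

def crop_region(image, center, box_w, box_h, k=2.0):
    """Crop a (k*box_h) x (k*box_w) window of `image` centered at `center` = (cx, cy)."""
    cx, cy = center
    crop_w, crop_h = int(round(k * box_w)), int(round(k * box_h))
    x1 = int(round(cx - crop_w / 2.0))
    y1 = int(round(cy - crop_h / 2.0))
    x2, y2 = x1 + crop_w, y1 + crop_h

    img_h, img_w = image.shape[:2]
    # Pad with zeros when the window extends past the image border.
    pad_l, pad_t = max(0, -x1), max(0, -y1)
    pad_r, pad_b = max(0, x2 - img_w), max(0, y2 - img_h)
    pad = [(pad_t, pad_b), (pad_l, pad_r)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad, mode="constant")
    return padded[y1 + pad_t : y2 + pad_t, x1 + pad_l : x2 + pad_l]
```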
For the current frame, different objects in a real scene generally have different movement speeds; some objects move very fast and their speed may also change drastically (decrease or increase rapidly). When a fast-moving target object is captured by a camera, recorded into a video and split into frames, there is a certain inter-frame difference in the position of the target object in the pictures (not its absolute position in the scene) between two consecutive frames: the lower the movement speed, the smaller the inter-frame difference, and the higher the movement speed, the larger the inter-frame difference.
Referring first to fig. 3, a schematic diagram of a motion model of a deep learning based target tracking method according to an embodiment of the present invention is shown.
Suppose the target of the current frame (the t-th frame) is located at position x_t, the target of the (t-1)-th frame at x_{t-1}, that of the (t-2)-th frame at x_{t-2}, that of the (t-3)-th frame at x_{t-3}, and that of the (t+1)-th frame at x_{t+1}. Let:

s_{t-2} = x_{t-2} - x_{t-3}    (1)

s_{t-1} = x_{t-1} - x_{t-2}    (2)

where s_{t-2} denotes the displacement between the (t-3)-th frame and the (t-2)-th frame, directed from x_{t-3} to x_{t-2}, and s_{t-1} denotes the displacement between the (t-2)-th frame and the (t-1)-th frame, directed from x_{t-2} to x_{t-1}.
The following will discuss the moving speed of the target object in two processes, namely deceleration and acceleration, respectively:
(1) When the movement of the target object is in a deceleration process, as shown by the motion trajectory of the segment from x_{t-3} to x_{t+1}.

In the segment from x_{t-3} to x_{t-1} the speed does not vary significantly, i.e. s_{t-2} and s_{t-1} do not differ much in size; in the segment from x_{t-1} to x_{t+1} the speed drops rapidly to 0. As the criterion for how sharply the target motion speed changes, the invention adopts, based on a number of experiments,

|s_{t-1} - s_{t-2}| ≤ (1/3)·|s_{t-1}|

When |s_{t-1} - s_{t-2}| ≤ (1/3)·|s_{t-1}|, the motion speed of the target object changes little, i.e. the displacement differences of the target over three consecutive frames are small, as in the segment from x_{t-3} to x_t. In this case the clipping center x_t' of the current frame is obtained as follows:

x_t' = x_{t-1} + s_{t-1}    (3)

As can be seen in FIG. 3, the distance between the clipping center x_t' and the actual position x_t of the current frame (the t-th frame) is much smaller than the distance between the actual position x_{t-1} of the previous frame (the (t-1)-th frame) and the actual position x_t of the current frame, which shows that the motion model proposed by the invention has a more obvious advantage when tracking fast-moving target objects.

When |s_{t-1} - s_{t-2}| > (1/3)·|s_{t-1}|, the displacements of two consecutive frame pairs differ greatly, indicating that the motion speed of the target object changes drastically, as in the segment from x_{t-1} to x_{t+1}. In this case the clipping center x_t' of the current frame is obtained by:

x_t' = x_{t-1}    (4)

That is, when the speed changes drastically, the target center of the previous frame (the (t-1)-th frame) is taken as the clipping center of the current frame (the t-th frame). In addition, the value range of t in the invention is t ≥ 4; formula (4) is also applied to the tracking of the 2nd and 3rd frames.
(2) When the movement of the target object is in an acceleration process, as shown by the motion trajectory of the segment from x_{t+1} to x_{t+5}. In the segment from x_{t+1} to x_{t+3} the speed increases rapidly from 0, and the clipping center is solved in the same way as for the segment from x_{t-1} to x_{t+1}; in the segment from x_{t+3} to x_{t+5} the speed does not change significantly, and the clipping center is solved in the same way as for the segment from x_{t-3} to x_{t-1}.
Assume that the center point coordinates of the target object in the current frame picture (the t-th frame) are c' = (c'x, c'y). The clipping center of the current frame is calculated according to formulas (3) and (4); a second bounding box with this position as center, height h and width w is set; the search area is then set with height k2·h and width k2·w, where k2, like k1, takes the value 2.
Therefore, in this step, whether the motion of the object is stable is first judged through the inter-frame displacements of the three adjacent frames. If the difference of the inter-frame displacements is small, i.e. the object moves stably, the clipping center of the current frame (the t-th frame) is obtained by adding the target position of the previous frame (the (t-1)-th frame) and the displacement S between the previous two frames (the (t-2)-th and (t-1)-th frames). When the speed changes drastically (decreases or increases rapidly), the inter-frame displacement changes greatly, and predicting the clipping center of the current frame by adding the target position and the displacement S between the previous two frames is no longer meaningful and may bring a larger error.
Further, in order to avoid the situation that the actual position of the current frame target exceeds the second bounding box due to too fast a change of the motion speed, the size of the second bounding box can be increased when the speed changes drastically, i.e. the value of k2 can be increased, thereby enlarging the search comparison area and avoiding the above situation.
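The following sketch puts formulas (1)-(4) together as they would be used in the area setting step. It assumes the 1/3 criterion is applied to the norm of the displacement difference, and the enlarged k2 value used when the speed changes drastically (k2_fast) is an illustrative choice, since the text only says k2 is increased.

```python
import numpy as np

def predict_search_center(x_tm3, x_tm2, x_tm1, k2=2.0, k2_fast=3.0):
    """Return (predicted crop center of frame t, k2 to use for the search area)."""
    s_tm2 = np.asarray(x_tm2, float) - np.asarray(x_tm3, float)   # formula (1)
    s_tm1 = np.asarray(x_tm1, float) - np.asarray(x_tm2, float)   # formula (2)

    stable = np.linalg.norm(s_tm1 - s_tm2) <= np.linalg.norm(s_tm1) / 3.0
    if stable:
        return np.asarray(x_tm1, float) + s_tm1, k2               # formula (3): extrapolate
    # Drastic speed change: reuse the previous target center (formula (4))
    # and enlarge the search area so the target stays inside it.
    return np.asarray(x_tm1, float), k2_fast
```

Only a handful of coordinate additions and one comparison are involved, which is why this motion model adds almost no computational cost to the tracker.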
A feature extraction and comparison step S130: and inputting the target area and the search area into a Convolutional Neural Network (CNN), performing feature extraction and feature comparison, and calculating to obtain the target area of the current frame.
Specifically, firstly, feature extraction is carried out on a target area and a search area in the convolutional layer, then the extracted features are input into a full connection layer, feature comparison is carried out on the target area and the search area in the full connection layer, and finally the target area of the current frame is obtained after calculation.
This step is to use a convolutional neural network for the acquisition of the current frame target region, and before using, the convolutional neural network should use video and/or pictures for deep learning, i.e., training.
Interframe displacement calculation step S140: and calculating to obtain the interframe displacement of the current frame relative to the target of the previous frame by using the target area of the current frame and the target area of the previous frame.
The result of this step is used in the iterative calculation: in the region setting step it is used to judge whether the moving speed of the object changes drastically and to calculate the center position of the search area.
A judgment step S150: and judging whether the current frame is the last frame, if so, finishing the tracking, otherwise, entering a picture reading step S110, continuously reading two continuous frames of pictures, and continuously carrying out iterative target tracking.
This step is for determining whether target tracking has ended or should continue.
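To show how steps S110-S150 chain together, the sketch below runs the whole loop over a list of frames, reusing the hypothetical helpers sketched earlier (crop_region, predict_search_center, TrackerNet). Resizing crops to 227×227 with OpenCV and interpreting the network output as corner coordinates normalized to the search crop are assumptions made for illustration only.

```python
import cv2
import numpy as np
import torch

def to_tensor(crop, size=227):
    crop = cv2.resize(crop, (size, size)).astype(np.float32) / 255.0
    return torch.from_numpy(crop).permute(2, 0, 1).unsqueeze(0)

def track(frames, init_center, box_w, box_h, net):
    """Return the predicted target center for every frame of `frames`."""
    centers = [np.asarray(init_center, dtype=float)]
    for t in range(1, len(frames)):                        # S110: frames t-1 and t
        prev, cur = frames[t - 1], frames[t]
        # S120: target area of the previous frame, search area of the current frame
        target = crop_region(prev, centers[-1], box_w, box_h, k=2.0)
        if len(centers) >= 3:
            search_center, k2 = predict_search_center(centers[-3], centers[-2], centers[-1])
        else:                                              # frames 2 and 3: formula (4)
            search_center, k2 = centers[-1], 2.0
        search = crop_region(cur, search_center, box_w, box_h, k=k2)
        # S130: the CNN regresses the target box; its output is assumed here to be
        # (x1, y1, x2, y2) normalized to [0, 1] within the search crop.
        with torch.no_grad():
            x1, y1, x2, y2 = net(to_tensor(target), to_tensor(search))[0].tolist()
        top_left = search_center - np.array([k2 * box_w, k2 * box_h]) / 2.0
        new_center = top_left + np.array([(x1 + x2) / 2.0 * k2 * box_w,
                                          (y1 + y2) / 2.0 * k2 * box_h])
        # S140: the inter-frame displacement is new_center - centers[-1]; keeping the
        # history of centers is what the next iteration's motion model needs.
        centers.append(new_center)
    # S150: the loop ends once the last frame has been processed.
    return centers
```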
The network training of the invention adopts the following method:
1. training set
The training set includes two parts, video from the ALOV300+ + dataset and pictures from the ImageNet2012 dataset.
The ALOV300++ dataset is a video dataset often used to test the performance of various target tracking algorithms (address: http://alov300pp.). There are 314 video segments in the ALOV300++ dataset, covering 14 types of video: Light, Surface Cover, Specularity, Transparency, Shape, Motion Smoothness, Motion Coherence, Clutter, Confusion, Low Contrast, Occlusion, Moving Camera, Zooming Camera and Long Duration. These types are organized around problems such as illumination change, occlusion, target deformation and camera movement, so the neural network can be trained effectively for these problems and handle them better. Except for the 14th type, Long Duration, which contains 10 long videos of 1-2 minutes, the videos are relatively short, with an average duration of 9.2 seconds per segment and a maximum of 35 seconds. These videos are split into frames and provided as pictures, about 150,000 frames in total, containing 314 different types of target objects; the positions of the target objects in all pictures are manually annotated with Ground Truth.
The invention divides the 314 video sequences into two parts by extracting 1 segment out of every 5. For example, of the 33 Light-type videos, the 7 segments numbered 1, 6, 11, 16, 21, 26 and 31 are extracted, and the other types of video are divided in the same way, as sketched below. After the division, the first part of 251 video sequences contains about 118,000 pictures and is used to train the network; the second part of 64 video sequences contains about 32,000 pictures and is used as a validation set for hyper-parameter tuning of the neural network.
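A small sketch of that 1-in-5 split, under the assumption that each type's videos are given as an ordered list and that the extracted segments (numbers 1, 6, 11, ...) form the validation part:

```python
def split_alov(video_names):
    """Split a list of same-type ALOV300++ videos into (train, validation)."""
    validation = video_names[::5]                      # segments 1, 6, 11, 16, ...
    train = [v for i, v in enumerate(video_names) if i % 5 != 0]
    return train, validation
```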
The ImageNet2012 dataset is a massive picture dataset containing 1.35 million pictures, of which 1.2 million are in the training set, 50,000 in the validation set and 100,000 in the test set. In view of the huge data volume of ImageNet2012, the full dataset cannot be used to train the network; instead, the 100,000 test-set pictures of ImageNet2012 are used as a training set of the invention. This picture training set is used to pre-train the neural network, making full use of the massive image information of the ImageNet2012 dataset, improving the classification and recognition ability of the neural network and letting the network learn the appearance model of the target object.
2. Test set
The test set uses the VOT2016 dataset, which is also a video dataset; it contains 60 video segments with about 21,000 pictures, and the positions of the target objects in all pictures are manually annotated with Ground Truth (website: http://www.votchallenge.net/vot2016/dataset). The VOT2016 dataset is a standard dataset for object tracking and can be used for comparison and quantitative evaluation against various state-of-the-art object tracking algorithms. It contains rich object types, and specific labels are provided for problems such as occlusion, illumination change, target deformation and camera movement, so this dataset is adopted for testing the neural network of the algorithm.
3. Training strategy
The neural network is first pre-trained with part of the pictures in the ImageNet2012 dataset, training its ability to accurately locate the position of a target object in image B when the features of that target object in image A are known, so that the network learns the appearance model of the target object. Then the 251 video sequences in the training set are used to train the neural network, so that it learns the continuous motion of different types of objects, acquires the ability to track moving objects in video sequences, and learns the motion model of the target object. Finally, the neural network is trained again with the 64 video sequences in the validation set, and its hyper-parameters are continuously adjusted (hyper-parameter tuning) so that it obtains excellent target recognition and tracking capability.
Example 1:
in the present embodiment, a comparative example of the method of the present invention with other target tracking methods is shown.
At present, most algorithms that address the target tracking problem with deep learning methods are slow; the fastest is GOTURN (Generic Object Tracking Using Regression Networks), a regression-network-based generic target tracking algorithm proposed in 2016. To evaluate the performance of the algorithm more accurately and objectively, the invention designs several groups of comparison experiments against the GOTURN algorithm and evaluates the performance of the target tracking algorithm in three aspects: accuracy, real-time performance and robustness. Tracking accuracy and overlap ratio quantify the tracking precision, tracking speed quantifies the real-time performance, and the robustness evaluation experiment is analyzed qualitatively.
The configuration of the PC used in the comparative experiments designed by the present invention is shown in Table 1:
TABLE 1 Experimental apparatus parameter configuration
(1) Difficulties and challenges of target tracking
The test set VOT2016 contains 60 video sequences; due to space limitations, the invention does not list all 60 of them but picks 8 challenging video segments for presentation. These 8 video sequences include various challenges and difficulties that are present in most target tracking problems, such as camera shake, illumination change, motion blur, occlusion and target scale change, as shown in Table 2:
TABLE 2 various challenges and difficulties in video sequences
(2) Tracking accuracy
The target tracking accuracy defined by the invention is calculated as follows: first the center point error S_error between the tracking result and the Ground Truth is calculated; then the number of frames F_t in which the center point error S_error is smaller than a set threshold t_0 (t_0 = 20 pixels in the invention) is counted; the ratio E of F_t to the total number of video frames F is called the target tracking accuracy:

E = F_t / F    (5)

The center point error S_error is calculated from the Euclidean distance between the tracking result and the Ground Truth:

S_error = sqrt((x - x_g)^2 + (y - y_g)^2)    (6)

In the above formula, x and y represent the coordinate values of the tracking result in the x and y directions, respectively, and x_g and y_g represent the coordinate values of the Ground Truth of the tracking target in the x and y directions.
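A sketch of how formulas (5) and (6) would be evaluated over a tracked sequence; treating S_error as a per-frame center error that is thresholded at t_0 = 20 pixels follows the description above and is otherwise an assumption:

```python
import numpy as np

def tracking_accuracy(pred_centers, gt_centers, t0=20.0):
    pred = np.asarray(pred_centers, dtype=float)       # shape (F, 2)
    gt = np.asarray(gt_centers, dtype=float)           # shape (F, 2)
    s_error = np.linalg.norm(pred - gt, axis=1)        # formula (6), per frame
    return float(np.mean(s_error < t0))                # formula (5): F_t / F
```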
The data of 8-group comparison experiments performed on the test set VOT2016 of the present invention are shown in Table 3:
TABLE 3 tracking accuracy (%)
Video sequence name GOTURN algorithm Algorithm of the invention
ball1 87.34 91.75
gymnastics2 89.91 95.06
gymnastics3 50.07 81.40
hand 37.85 77.28
leaves 24.60 63.18
motocross1 92.59 94.30
road 71.36 83.86
soccer2 45.77 71.56
Table 3 shows partial statistical results of the tracking accuracy comparison experiments between the GOTURN algorithm and the algorithm of the invention on the test set VOT2016. For the three sequences ball1, gymnastics2 and motocross1 the GOTURN algorithm already performs well, but the algorithm of the invention performs even better, improving the tracking accuracy by a few percentage points. For the remaining 5 videos the GOTURN algorithm performs poorly; in particular, tracking of the hand and leaves sequences shows a severe frame-loss phenomenon. Because the target object is small, the cropped search area is relatively small, and for a fast-moving object the inter-frame displacement can be so large that the target runs out of the search area, causing the tracking of the GOTURN algorithm to fail. After taking the influence of inter-frame displacement into account, the algorithm of the invention improves the tracking accuracy by a large margin; the tracking accuracy on the hand and leaves sequences is improved by nearly 40%.
(3) Tracking overlap ratio
The tracking overlap ratio defined by the invention refers to the overlap between the tracking frame of the target object and the marking frame of the Ground Truth, and is calculated as:

S = (R_t ∩ R_g) / (R_t ∪ R_g)    (7)

In the above formula, S represents the tracking overlap ratio, R_t represents the region covered by the tracking frame, and R_g represents the region covered by the Ground Truth marking frame. According to formula (7), the higher the tracking overlap ratio, the higher the tracking accuracy of the algorithm. Table 4 lists the tracking overlap ratios of the two algorithms on 8 different video sequences of the test set VOT2016.
TABLE 4 Tracking overlap ratio (%)
Table 4 shows partial statistical results of the tracking overlap comparison experiments between the GOTURN algorithm and the algorithm of the invention on the test set VOT2016. The tracking accuracy introduced in the previous section measures the distance between the tracking result and the Ground Truth, whereas the tracking overlap in this section measures the degree of overlap between the tracking frame of the target and the marking frame of the Ground Truth. Generally, the closer the distance, the higher the overlap, so the data in Table 4 are largely consistent with Table 3. For the three sequences ball1, gymnastics2 and motocross1 both algorithms perform relatively well, while for the hand and leaves sequences neither performs ideally; nevertheless the tracking overlap of the algorithm of the invention is higher than that of the GOTURN algorithm, which shows that it is superior to GOTURN.
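Formula (7) is the standard intersection-over-union between two axis-aligned boxes; a minimal sketch, assuming both boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def overlap_ratio(box_t, box_g):
    """Intersection-over-union of a tracking box and a Ground Truth box."""
    ix1, iy1 = max(box_t[0], box_g[0]), max(box_t[1], box_g[1])
    ix2, iy2 = min(box_t[2], box_g[2]), min(box_t[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_t + area_g - inter
    return inter / union if union > 0 else 0.0
```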
(4) Tracking speed
The tracking speed defined by the invention refers to the ratio of the total number of the tracked video frames to the tracking time, and the calculation formula is as follows:
V = N / T    (8)
in the above formula, V represents a tracking speed; n represents the total frame number of a certain section of tracked video; t denotes the duration of tracking the video. Table 5 lists the tracking speed of the different algorithms in the test set VOT 2016.
TABLE 5 tracking speed (Frames/sec)
The algorithm of the invention differs from the GOTURN algorithm in the motion model adopted: the motion model of GOTURN is too simple and performs poorly when tracking fast-moving targets, whereas the motion model constructed by the invention is designed mainly to solve the tracking problem of fast-moving targets. The designed motion model involves only some simple inter-frame coordinate operations, no complex image operations, and adds little algorithmic complexity, so the tracking accuracy is improved while the tracking speed is essentially unchanged and remains comparable to that of the GOTURN algorithm.
(5) Tracking robustness
For video sequences with slow motion or relatively large target objects, the algorithm of the invention performs comparably to the GOTURN algorithm. To illustrate the tracking effect of the algorithm of the invention while saving space, 2 sets of tracking results are selected for video sequences with faster motion or relatively smaller target objects, where the solid-line boxes represent the tracking results of the GOTURN algorithm and the dashed-line boxes represent the tracking results of the algorithm of the invention.
The two video sequences, a football and a moving motorcycle, contain most of the common difficulties in target tracking; at the same time the target objects are relatively small and move fast, which poses a great challenge to correct tracking. For the football sequence, the ball is relatively small and, after being kicked or hitting the ground, moves fast, so the target object "runs out" of the search area and the GOTURN algorithm fails to track it. For the moving motorcycle, the motion speed is relatively fast and the shooting distance relatively long, so the target area in the picture is relatively small; these problems are quite challenging for the GOTURN algorithm. The algorithm of the invention takes the influence of the inter-frame difference on target tracking into account and constructs a motion model based on the inter-frame difference; experiments show that it has better robustness than the GOTURN algorithm.
Referring to fig. 6, the invention further discloses a target tracking device based on deep learning, comprising the following components:
the picture reading unit 210: continuously reading two frames of pictures, including a previous frame of picture and a current frame of picture, wherein the previous frame of picture has a calculated target position, and the current frame of picture needs to calculate the target position;
area setting section 220: respectively setting and cutting a target area of a previous frame and a search area of a current frame;
the setting and cutting of the target area of the previous frame specifically comprises: the center point position c = (cx, cy) of the target in the previous frame is known; a rectangular frame centered at this point is used as a first bounding box to mark the target object, the height of the first bounding box being h and the width being w; the target area obtained after cutting has a height of k1·h and a width of k1·w, and the parameter k1 is used for controlling the size of the target area;
the setting and cutting of the search area of the current frame specifically comprises: judging whether the motion of the object in the image is stable; if the speed is stable, the center point position c' = (c'x, c'y) of the search area of the current frame is equal to the known target center point position c = (cx, cy) of the previous frame plus the inter-frame displacement S of the target between the previous two frames; if the speed changes drastically, for example decreases or increases rapidly, the center point position c' = (c'x, c'y) of the search area of the current frame is equal to the known target center point position c = (cx, cy) of the previous frame, i.e. the target center point position of the previous frame is used as the cutting center of the current frame; a rectangular frame is used as a second bounding box for marking, the height of the second bounding box being h and the width being w; the search area obtained after cutting has a height of k2·h and a width of k2·w, and the parameter k2 is used for controlling the size of the search area;
feature extraction and comparison unit 230: inputting the target area and the search area into a Convolutional Neural Network (CNN), performing feature extraction and feature comparison, and calculating to obtain the target area of the current frame;
the interframe displacement calculation unit 240: calculating to obtain the interframe displacement of the current frame relative to the target of the previous frame by using the target area of the current frame and the target area of the previous frame;
the judgment unit 250: and judging whether the current frame is the last frame, if so, finishing the tracking, otherwise, continuously reading two continuous frames of pictures by the picture reading unit, and performing iterative target tracking.
Further, in the region setting unit 220, determining whether the object moves smoothly in the image is: comparing the interframe displacement of the target of two adjacent frames in three continuous frames before the current frame, and if the interframe displacement difference of two adjacent frames in the three continuous frames is smaller, considering that the motion is stable; if the interframe displacement difference of two adjacent frames in the three continuous frames is large, the movement speed is considered to be changed violently.
Further, in the area setting unit 220, it is determined whether the inter-frame displacement difference between two adjacent frames in the three consecutive frames is smaller than 1/3 of the inter-frame displacement between the two previous frames;
the parameters k2 and k1 for controlling the sizes of the regions both take the value 2.
Further, in the region setting unit 220, in order to avoid the situation that the actual position of the current frame target exceeds the second bounding box due to too fast a change of the moving speed, the size of the second bounding box is increased when the speed changes drastically, i.e. the value of k2 is increased; and/or
The feature extraction and comparison unit 230 first performs feature extraction on the target region and the search region in the convolutional layer, then performs feature comparison on the target region and the search region in the fully connected layer, and finally obtains the target region of the current frame after calculation.
The present invention still further discloses a storage medium for storing computer-executable instructions,
the computer executable instructions, when executed by a processor, perform the method described above.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, various aspects of the present invention may take the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," module "or" system. Further, aspects of the invention may take the form of: a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer-readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to: electromagnetic, optical, or any suitable combination thereof. The computer readable signal medium may be any of the following computer readable media: is not a computer readable storage medium and may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including: object oriented programming languages such as Java, Smalltalk, C++, and the like; and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package; executing in part on a user computer and in part on a remote computer; or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A target tracking method based on deep learning comprises the following steps:
picture reading step S110: continuously reading two frames of pictures, including a previous frame of picture and a current frame of picture, wherein the previous frame of picture has a calculated target position, and the current frame of picture needs to calculate the target position;
area setting step S120: respectively setting and cutting a target area of a previous frame and a search area of a current frame;
the setting and cutting of the target area of the previous frame specifically comprises: the center point position c = (cx, cy) of the target in the previous frame is known; a rectangular frame centered at this point is used as a first bounding box to mark the target object, the height of the first bounding box being h and the width being w; the target area obtained after cutting has a height of k1·h and a width of k1·w, and the parameter k1 is used for controlling the size of the target area;
the setting and cutting of the search area of the current frame specifically comprises: judging whether the motion of the object in the image is stable; if it is stable, the center point position c' = (c'x, c'y) of the search area of the current frame is equal to the known target center point position c = (cx, cy) of the previous frame plus the inter-frame displacement S of the target between the previous two frames; if the speed changes greatly, the center point position c' = (c'x, c'y) of the search area of the current frame is equal to the known target center point position c = (cx, cy) of the previous frame, i.e. the target center point position of the previous frame is used as the cutting center of the current frame; a rectangular frame is used as a second bounding box for marking, the height of the second bounding box being h and the width being w; the search area obtained after cutting has a height of k2·h and a width of k2·w, and the parameter k2 is used for controlling the size of the search area;
a feature extraction and comparison step S130: inputting the target area and the search area into a Convolutional Neural Network (CNN), performing feature extraction and feature comparison, and calculating to obtain the target area of the current frame;
an inter-frame displacement calculation step S140: calculating the inter-frame displacement of the target in the current frame relative to the previous frame from the target area of the current frame and the target area of the previous frame;
a judgment step S150: judging whether the current frame is the last frame; if so, the tracking ends; otherwise, returning to the picture reading step S110 to read the next two consecutive frames of pictures and continue iterative target tracking;
in the region setting step S120, whether the target object moves smoothly in the image is determined as follows: the inter-frame displacements of the target between adjacent frames in the three consecutive frames before the current frame are compared; if the difference between the inter-frame displacements of adjacent frames in the three consecutive frames is small, the motion is considered stable; if the difference between the inter-frame displacements of adjacent frames in the three consecutive frames is large, the movement speed is considered to change drastically;
in the region setting step S120, it is determined whether the inter-frame displacement difference between two adjacent frames in the three consecutive frames is smaller than 1/3 of the inter-frame displacement between the previous two frames;
the region size control parameters k2 and k1 both take the value 2.
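
For readers outside the formal claim language, the following minimal Python sketch illustrates the geometry of the region setting step S120 of claim 1: cropping around the previous-frame target center, testing motion stability with the 1/3 displacement rule, and choosing the search-area center accordingly. The function names, the NumPy dependency, the border clipping, and the (cx, cy) coordinate convention are assumptions added for illustration only; the claim itself specifies only the geometric relations and k1 = k2 = 2.

```python
import numpy as np

# Illustrative sketch of the region setting step S120 (claim 1). All names,
# the NumPy dependency, and the border clipping are assumptions added for
# illustration; the claim only fixes the geometry and k1 = k2 = 2.

K1 = 2.0  # controls the size of the target area cropped from the previous frame
K2 = 2.0  # controls the size of the search area cropped from the current frame

def crop_region(image, center, box_h, box_w, k):
    """Crop a (k*h) x (k*w) window around center = (cx, cy), clipped to the image."""
    cx, cy = center
    half_h, half_w = k * box_h / 2.0, k * box_w / 2.0
    y0, y1 = int(max(cy - half_h, 0)), int(min(cy + half_h, image.shape[0]))
    x0, x1 = int(max(cx - half_w, 0)), int(min(cx + half_w, image.shape[1]))
    return image[y0:y1, x0:x1]

def motion_is_stable(displacements):
    """1/3 rule of claim 1: compare the two inter-frame displacements of the
    three consecutive frames before the current frame; the motion is stable
    when they differ by less than 1/3 of the earlier displacement."""
    d_prev, d_last = np.linalg.norm(displacements[-2]), np.linalg.norm(displacements[-1])
    return abs(d_last - d_prev) < d_prev / 3.0

def search_center(prev_center, displacements):
    """Search-area center of the current frame: previous target center plus the
    last inter-frame displacement S when motion is stable, otherwise the
    previous target center alone (claim 2 would additionally enlarge k2)."""
    if motion_is_stable(displacements):
        return (prev_center[0] + displacements[-1][0],
                prev_center[1] + displacements[-1][1])
    return prev_center
```

With these helpers, the target area of the previous frame would be crop_region(prev_frame, c, h, w, K1) and the search area of the current frame crop_region(cur_frame, search_center(c, displacements), h, w, K2).
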
2. The target tracking method of claim 1, wherein:
in the region setting step S120, in order to avoid the actual position of the target in the current frame exceeding the second bounding box when the movement speed changes too fast, the size of the second bounding box is increased when the speed changes drastically, that is, the value of k2 is increased.
3. The target tracking method of claim 1, wherein:
the feature extraction and comparison step S130 first extracts features of the target region and the search region in the convolutional layers, then inputs the extracted features into the fully connected layers, compares the features of the target region and the search region in the fully connected layers, and calculates the target region of the current frame.
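
The split in claim 3 between convolutional feature extraction and comparison in the fully connected layers can be pictured with the minimal PyTorch sketch below. The layer sizes, the shared backbone for both crops, and the four-value box output are assumptions for illustration; the claim does not prescribe a specific architecture.

```python
import torch
import torch.nn as nn

class TrackerNet(nn.Module):
    """Hypothetical network in the spirit of claim 3: convolutional layers
    extract features of the target crop and the search crop; fully connected
    layers compare the two feature vectors and output the target box of the
    current frame inside the search area."""

    def __init__(self, crop_size=128):
        super().__init__()
        # Convolutional layers (feature extraction), shared by both inputs.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 64 * (crop_size // 8) ** 2
        # Fully connected layers (feature comparison and box regression).
        self.compare = nn.Sequential(
            nn.Linear(2 * feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 4),  # (x1, y1, x2, y2) relative to the search crop
        )

    def forward(self, target_crop, search_crop):
        f_target = self.features(target_crop).flatten(1)
        f_search = self.features(search_crop).flatten(1)
        return self.compare(torch.cat([f_target, f_search], dim=1))
```

The inter-frame displacement of step S140 then follows directly as the difference between the center of the predicted box, mapped back to image coordinates, and the previous target center.
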
4. A target tracking device based on deep learning comprises the following components:
a picture reading unit (210): reading two consecutive frames of pictures, namely a previous frame and a current frame, wherein the target position of the previous frame has already been calculated and the target position of the current frame is yet to be calculated;
a region setting unit (220): respectively setting and cropping a target area of the previous frame and a search area of the current frame;
the setting and cropping of the target area of the previous frame specifically comprises: the center point position c = (cx, cy) of the target is known from the previous frame; taking this point as the center, the target object is marked with a rectangular frame serving as a first bounding box, the height of the first bounding box being h and its width being w; the height and width of the target area obtained after cropping are k1·h and k1·w respectively, where the parameter k1 controls the size of the target area;
the setting and cropping of the search area of the current frame specifically comprises: judging whether the motion of the object in the image is stable; if the motion is stable, the center point position c′ = (c′x, c′y) of the search area of the current frame equals the center point position c = (cx, cy) of the known target in the previous frame plus the inter-frame displacement S of the image target between the previous two frames; if the speed changes drastically, the center point position c′ = (c′x, c′y) of the search area of the current frame equals the center point position c = (cx, cy) of the known target in the previous frame, that is, the target center point of the previous frame is used as the cropping center of the current frame; the target is marked with a rectangular frame serving as a second bounding box, the height of the second bounding box being h and its width being w, and the height and width of the cropped search area are k2·h and k2·w respectively, where the parameter k2 controls the size of the search area;
a feature extraction and comparison unit (230): inputting the target area and the search area into a Convolutional Neural Network (CNN), performing feature extraction and feature comparison, and calculating the target area of the current frame;
an inter-frame displacement calculation unit (240): calculating the inter-frame displacement of the target in the current frame relative to the previous frame from the target area of the current frame and the target area of the previous frame;
a determination unit (250): judging whether the current frame is the last frame; if so, the tracking ends; otherwise, the picture reading unit reads the next two consecutive frames of pictures and iterative target tracking continues;
in the region setting unit (220), whether the object moves smoothly in the image is determined as follows: the inter-frame displacements of the target between adjacent frames in the three consecutive frames before the current frame are compared; if the difference between the inter-frame displacements of adjacent frames in the three consecutive frames is small, the motion is considered stable; if the difference between the inter-frame displacements of adjacent frames in the three consecutive frames is large, the movement speed is considered to change drastically;
in the region setting unit (220), it is determined whether the inter-frame displacement difference between two adjacent frames in the three consecutive frames is smaller than 1/3 of the inter-frame displacement between the previous two frames;
the region size control parameters k2 and k1 both take the value 2.
5. The object tracking device of claim 4, wherein:
in the region setting unit (220), in order to avoid the actual position of the target in the current frame exceeding the second bounding box when the movement speed changes too fast, the size of the second bounding box is increased when the speed changes drastically, that is, the value of k2 is increased; and/or
the feature extraction and comparison unit (230) first extracts features of the target area and the search area in the convolutional layers, then inputs the extracted features into the fully connected layers, compares the features of the target area and the search area in the fully connected layers, and finally calculates the target area of the current frame.
6. A storage medium for storing computer-executable instructions,
the computer executable instructions, when executed by a processor, perform the target tracking method of any one of claims 1-3.
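
Putting the pieces together, the loop below sketches the iteration of steps S110 through S150 as described in claims 1 and 6. It reuses the crop_region and search_center helpers from the first sketch; predict_box is a hypothetical callable that runs a model such as TrackerNet on the two crops and returns the new box in image coordinates. Frame I/O, preprocessing, and crop resizing are omitted, and all names are assumptions for illustration rather than the patented implementation.

```python
def track(frames, init_box, predict_box, k1=2.0, k2=2.0):
    """Iterate the tracking steps over a frame sequence.

    frames:      list of H x W x 3 images (frame 0 has a known target)
    init_box:    (cx, cy, w, h) of the target in frames[0]
    predict_box: hypothetical callable (target_crop, search_crop, search_origin)
                 -> (cx, cy, w, h) in image coordinates, e.g. a TrackerNet wrapper
    """
    cx, cy, w, h = init_box
    boxes = [init_box]
    displacements = [(0.0, 0.0), (0.0, 0.0)]  # padded history for the 1/3 rule
    for prev_frame, cur_frame in zip(frames[:-1], frames[1:]):    # S110: read two frames
        target = crop_region(prev_frame, (cx, cy), h, w, k1)      # S120: target area
        c_search = search_center((cx, cy), displacements)         # S120: search center
        search = crop_region(cur_frame, c_search, h, w, k2)       # S120: search area
        origin = (max(c_search[0] - k2 * w / 2.0, 0.0),           # top-left of the search crop
                  max(c_search[1] - k2 * h / 2.0, 0.0))
        new_cx, new_cy, w, h = predict_box(target, search, origin)  # S130: CNN comparison
        displacements.append((new_cx - cx, new_cy - cy))            # S140: displacement
        cx, cy = new_cx, new_cy
        boxes.append((cx, cy, w, h))                                # S150: loop to last frame
    return boxes
```
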
CN201710474118.1A 2017-06-21 2017-06-21 Target tracking method and device based on deep learning and storage medium Active CN107274433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710474118.1A CN107274433B (en) 2017-06-21 2017-06-21 Target tracking method and device based on deep learning and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710474118.1A CN107274433B (en) 2017-06-21 2017-06-21 Target tracking method and device based on deep learning and storage medium

Publications (2)

Publication Number Publication Date
CN107274433A CN107274433A (en) 2017-10-20
CN107274433B true CN107274433B (en) 2020-04-03

Family

ID=60068118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710474118.1A Active CN107274433B (en) 2017-06-21 2017-06-21 Target tracking method and device based on deep learning and storage medium

Country Status (1)

Country Link
CN (1) CN107274433B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766821B (en) * 2017-10-23 2020-08-04 江苏鸿信系统集成有限公司 Method and system for detecting and tracking full-time vehicle in video based on Kalman filtering and deep learning
CN109754412B (en) * 2017-11-07 2021-10-01 北京京东乾石科技有限公司 Target tracking method, target tracking apparatus, and computer-readable storage medium
CN108021883B (en) * 2017-12-04 2020-07-21 深圳市赢世体育科技有限公司 Method, device and storage medium for recognizing movement pattern of sphere
CN108171752A (en) * 2017-12-28 2018-06-15 成都阿普奇科技股份有限公司 A kind of sea ship video detection and tracking based on deep learning
CN108510523A (en) * 2018-03-16 2018-09-07 新智认知数据服务有限公司 It is a kind of to establish the model for obtaining object feature and object searching method and device
CN108805907B (en) * 2018-06-05 2022-03-29 中南大学 Pedestrian posture multi-feature intelligent identification method
CN110830846B (en) * 2018-08-07 2022-02-22 阿里巴巴(中国)有限公司 Video clipping method and server
CN109086725B (en) * 2018-08-10 2021-01-05 北京华捷艾米科技有限公司 Hand tracking method and machine-readable storage medium
CN109087510B (en) * 2018-09-29 2021-09-07 讯飞智元信息科技有限公司 Traffic monitoring method and device
CN109446978B (en) * 2018-10-25 2022-01-07 哈尔滨工程大学 Method for tracking moving target of airplane based on staring satellite complex scene
CN111127510B (en) * 2018-11-01 2023-10-27 杭州海康威视数字技术股份有限公司 Target object position prediction method and device
CN109726683B (en) * 2018-12-29 2021-06-22 北京市商汤科技开发有限公司 Target object detection method and device, electronic equipment and storage medium
CN109816014A (en) * 2019-01-22 2019-05-28 天津大学 Generate method of the deep learning target detection network training with labeled data collection
US10943132B2 (en) * 2019-04-10 2021-03-09 Black Sesame International Holding Limited Distant on-road object detection
CN110189364B (en) * 2019-06-04 2022-04-01 北京字节跳动网络技术有限公司 Method and device for generating information, and target tracking method and device
CN110378938A (en) * 2019-06-24 2019-10-25 杭州电子科技大学 A kind of monotrack method based on residual error Recurrent networks
CN110276739B (en) * 2019-07-24 2021-05-07 中国科学技术大学 Video jitter removal method based on deep learning
CN110533699B (en) * 2019-07-30 2024-05-24 平安科技(深圳)有限公司 Dynamic multi-frame velocity measurement method for pixel change based on optical flow method
CN110647836B (en) * 2019-09-18 2022-09-20 中国科学院光电技术研究所 Robust single-target tracking method based on deep learning
CN111274914B (en) * 2020-01-13 2023-04-18 目骉资讯有限公司 Horse speed calculation system and method based on deep learning
CN110956165B (en) * 2020-02-25 2020-07-21 恒大智慧科技有限公司 Intelligent community unbundling pet early warning method and system
CN111311643B (en) * 2020-03-30 2023-03-24 西安电子科技大学 Video target tracking method using dynamic search
CN111627046A (en) * 2020-05-15 2020-09-04 北京百度网讯科技有限公司 Target part tracking method and device, electronic equipment and readable storage medium
CN112037257B (en) * 2020-08-20 2023-09-29 浙江大华技术股份有限公司 Target tracking method, terminal and computer readable storage medium thereof
CN112184770A (en) * 2020-09-28 2021-01-05 中国电子科技集团公司第五十四研究所 Target tracking method based on YOLOv3 and improved KCF
CN112188212B (en) * 2020-10-12 2024-02-13 杭州电子科技大学 Intelligent transcoding method and device for high-definition monitoring video

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750522A (en) * 2012-06-18 2012-10-24 吉林大学 Method for tracking targets
CN105741316A (en) * 2016-01-20 2016-07-06 西北工业大学 Robust target tracking method based on deep learning and multi-scale correlation filtering
CN106875425A (en) * 2017-01-22 2017-06-20 北京飞搜科技有限公司 A kind of multi-target tracking system and implementation method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Learning to Track at 100 FPS with Deep Regression Networks; David Held et al.; European Conference on Computer Vision, ECCV 2016; 2016-08-16; pp. 749-765, Section 3, Figure 2 *
Moving Target Detection and Tracking System Based on a PTZ Active Camera; Zhang Yongxia; China Master's Theses Full-text Database, Information Science and Technology; 2014-01-15 (No. 01); p. I138-1875, pp. 30-31, Figures 3-8 *
Research on Detection and Tracking of Ground Moving Targets against a Rotating Background; Chu Linzhen; China Master's Theses Full-text Database, Information Science and Technology; 2015-01-15 (No. 01); p. I138-1475, pp. 56, 60-61, Figures 5.3 and 5.6 *

Also Published As

Publication number Publication date
CN107274433A (en) 2017-10-20

Similar Documents

Publication Publication Date Title
CN107274433B (en) Target tracking method and device based on deep learning and storage medium
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
Wen et al. Detection, tracking, and counting meets drones in crowds: A benchmark
CN107481270B (en) Table tennis target tracking and trajectory prediction method, device, storage medium and computer equipment
Zhao et al. Spatio-temporal autoencoder for video anomaly detection
US20220417590A1 (en) Electronic device, contents searching system and searching method thereof
Felsberg et al. The thermal infrared visual object tracking VOT-TIR2015 challenge results
JP7147078B2 (en) Video frame information labeling method, apparatus, apparatus and computer program
Lai et al. Semantic-driven generation of hyperlapse from 360 degree video
WO2017096949A1 (en) Method, control device, and system for tracking and photographing target
Wen et al. Visdrone-sot2018: The vision meets drone single-object tracking challenge results
TWI777185B (en) Robot image enhancement method, processor, electronic equipment, computer readable storage medium
Zhu et al. Multi-drone-based single object tracking with agent sharing network
WO2021027543A1 (en) Monocular image-based model training method and apparatus, and data processing device
Martin et al. Optimal choice of motion estimation methods for fine-grained action classification with 3d convolutional networks
WO2023109361A1 (en) Video processing method and system, device, medium and product
CN113160283A (en) Target tracking method based on SIFT under multi-camera scene
CN111833378A (en) Multi-unmanned aerial vehicle single-target tracking method and device based on proxy sharing network
Wu et al. Multi‐camera 3D ball tracking framework for sports video
Liu et al. MBA-VO: Motion blur aware visual odometry
Rozumnyi et al. Fmodetect: Robust detection of fast moving objects
Gao et al. A joint local–global search mechanism for long-term tracking with dynamic memory network
Xu et al. Fast and accurate object detection using image Cropping/Resizing in multi-view 4K sports videos
Abulwafa et al. A fog based ball tracking (FB2T) system using intelligent ball bees
Kart et al. Evaluation of Visual Object Trackers on Equirectangular Panorama.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant